Using Graph Theory to Build a Simple Recommendation Engine in JavaScript
(JavaScript)基于图论的简单推荐引擎

Leveraging User Behavior to Drive Recommendations
利用用户行为来进行推荐

Working at Storefront, we’re always excited about new ways we can keep our users engaged. Recommendations or suggestions are a fantastic way for a platform to encourage users to stick around and keep browsing. The problem is, recommendations can be tricky.

在Storefront工作,我们总是对我们能够一直使用户忙碌而感到兴奋。推荐或者建议对于一个鼓励用户逗留和持续不停的浏览的平台是一个很神奇的方式。问题在于,推荐是有技巧的。

How do we know what to recommend to our customers? Is it similar item descriptions? Colors? Locations? It could be anything, and a linear combination of parameters grows significantly in complexity with every additional variable. If we use SQL queries, they can quickly become unmanageable tangled messes of JOINs. These can that take minutes to hours to generate unique recommendations for each and every user across your entire user base.

我们如何知道推荐什么给客户。是简单的物品描述?颜色?还是地理位置?推荐可以时任何东西,参数的线性组合在附带每个附加变量的复杂情况下增长显著。如果我们使用sql查询,推荐将很快变成不可控的样子,sql语句会变成有一大推join存在的混乱样子。对于整个用户群中每个用户推荐将会花费几分钟到几小时不等的时间去产生一个个性化的推荐。

With a very small team and limited resources we wanted a robust recommendation engine that A) wouldn’t lead users astray (random recommendations were out of the question), B) was extremely performant and C) worked with our current stack (node.js). What was our best option to accomplish this?

仅拥有很小的团队和有限的资源,但是我们想要一个健壮的推荐系统引擎:

  1. 不会误导用户,用户不会迷路(随机推荐不在讨论中)
  2. 非常高效
  3. 能够兼容现在的工作环境(node.js)

能够满足这些要求的最好的选择是什么?

##1. Introducing the Graph
介绍

If you have a background in computer science or mathematics, you’re probably familiar with graph theory. If not, you stumbled upon the right article to acquaint yourself with it! Many big companies are using graph-theoretic data structures to keep you hooked. LinkedIn’s “how you’re connected.” Facebook’s “people you may know.” You don’t need a lot of resources to begin working with graphs and they can be tremendously useful.

如果你的知识背景是计算机科学或者数学,你可能对图论这个概念很熟悉。如果不是的话,你需要找些文章来熟悉图论。许多大公司正在使用图论的数据结构来使用户留下来。LinkedIn的“你是怎样被连结的“,Facebook的”你也许认识的人“。使用图不需要很多资源,而且使用图也很有用。

Graphs, quite simply, are a way to model complex relationships between many objects. A graph exists as a collection of nodes (also known as vertices) and edges. A node is merely an abstract data point — it could represent anything. A person, a computer, a building, an intersection… or whatever you’d like. An edge connects two nodes and can optionally be directional (information only flows one way).

图,很简单,是一种模拟物体之间复杂关系的方式。一个图以点和边的集合的方式存在。点是一种抽象的数据形式,可以代表一切事物,例如一个人、一个计算机、一栋建筑物、一个十字路口或者你喜欢的任何东西。一条边连结两个点,边可以是有方向的(信息的流动是单向的),也可以是无方向的。

举个栗子:
Node[张大可]:Edge[喜欢]–>Node[酸奶]

In this case, we have two nodes (Joe and Minecraft) and one directional edge (likes) connecting the two. What does this look like in code?

在这个例子中,存在两个点(张大可 和 酸奶)和一条有向边连结两个顶点。如何用代码来表示以上关系?

1
2
3
4
5
6
7
8
let joe = {type: 'node', properties: {name: 'joe'}, input: [], output: []};
let likes = {type: 'edge', properties: {name: 'likes'}, input: null, output: null};
let minecraft = {type: 'node', properties: {name: 'minecraft'}, input: [], output: []};

joe.output.push(likes);
likes.input = joe;
likes.output = minecraft;
minecraft.input.push(likes);

Now we can see that joe.output[0] references likes, likes.input and likes.output reference joe and minecraft respectively (directionally), and minecraft.input[0] references likes.

现在我们了解到joe.output[0]代表likes,likes.input和output分别代表joe和minecraft,minicraft.input[0]代表likes。

##2. Creating Our Data Structures
创建我们的数据结构

There’s a problem here, though, and it’s glaringly obvious: if we want our graph to be assembled effectively, we have to write at least four separate instructions to connect two nodes via an edge. Not only that, but our node and edge constructors (via object literals) are pretty heavy. We can simplify this with object orientation. Let’s introduce some prototypes via the ES6 Class semantics.

存在一个问题,很明显:如果想要将图有效的集合起来,我们不得不引入四个独立的部分,以便连结两个顶点和一条边。不仅如此,点和边的构造函数过于复杂,我们可以使用面向对象的思想简化它。让我们来看一看ES6语义的原型。

First, we’ll define a base prototype for shared behavior between Nodes and Edges. Let’s call it a Unit, because that’s easy. We want our Unit to be able to specify an entity (what classification of object is it?) and have properties and methods related to setting and retrieving these properties.

首先,我们定义一个基本点原型,方便共享节点和边的共享。我们叫它Unit。Unit可以详细说明一个实体,它有属性,和用于设置属性之和获取属性值。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
class Unit {

// Entity is used as node or edge type, for different classifications
// i.e. 'person', 'game', 'road', etc.
constructor(entity, properties) {

this.entity = entity + '';

this.load(properties || {});

}

// load properties (id, name, age, etc.) from an object
load(properties) {

let p = Object.create(null);

Object.keys(properties).forEach(function(v) {

p[v] = properties[v];

});

this.properties = p;

return this;

}

set(property, value) {

return this.properties[property] = value;

}

unset(property) {

return delete this.properties[property];

}

has(property) {

return Object.prototype.hasOwnProperty.call(this.properties, property);

}

get(property) {

return this.properties[property];

}

toString() {

return [this.constructor.name, ' (', this.entity, ' ', JSON.stringify(this.properties) ,')'].join('');

}

}

Great! That’s a good start. Now we need to define our Node where we keep track of all edges, input edges, and output edges (for directionality). We also specify an unlink method to remove all connected edges.

现在我们需要定义节点,以便使节点能够连接到所有的边,包括输入边和输出边。我们也要指定一个unlink方法用于移除连接关系的边。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
class Node extends Unit {

constructor(entity, properties) {

super(entity, properties);
this.edges = [];
this.inputEdges = [];
this.outputEdges = [];

}

unlink() {

let edges = this.edges;

for (let i = 0, len = edges.length; i < len; i++) {
edges[i].unlink();
}

return true;

}

}

Pretty lightweight, but there’s a bunch of functionality we’ll need to include in our Edge that takes care of connecting nodes to one another. In our edge we hold an inputNode, an outputNode and whether or not the edge is a duplex link (bi-directional).

轻量级的代码。有一些功能我们需要定义在Edge的定义中,以便能够连结一个点到另一个点。在Edge中我们要定义inputNdoe, outputNode, 一条边是否是双向边。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
class Edge extends Unit {

constructor(entity, properties) {

super(entity, properties);

this.inputNode = null;
this.outputNode = null;
this.duplex = false;

this.distance = 1;

}

// link a specific node in a certain direction
_linkTo(node, direction) {

if (direction <= 0) {
node.inputEdges.push(this);
}

if (direction >= 0) {
node.outputEdges.push(this);
}

node.edges.push(this);

return true;

}

// link two nodes, optionally make edge bidirectional (duplex)
link(inputNode, outputNode, duplex) {

this.unlink();

this.inputNode = inputNode;
this.outputNode = outputNode;
this.duplex = !!duplex;

if (duplex) {
this._linkTo(inputNode, 0);
this._linkTo(outputNode, 0);
return this;
}

this._linkTo(inputNode, 1);
this._linkTo(outputNode, -1);
return this;

}

// distance for traversal
setDistance(v) {
this.distance = Math.abs(parseFloat(v) || 0);
return this;
}

// weight is 1 / distance
setWeight(v) {
this.distance = 1 / Math.abs(parseFloat(v) || 0);
return this;
}

// find the opposite node given a starting node
oppositeNode(node) {

if (this.inputNode === node) {
return this.outputNode;
} else if (this.outputNode === node) {
return this.inputNode;
}

return;

}

// unlink edge, remove connections from nodes
unlink() {

let pos;
let inode = this.inputNode;
let onode = this.outputNode;

if (!(inode && onode)) {
return;
}

(pos = inode.edges.indexOf(this)) > -1 && inode.edges.splice(pos, 1);
(pos = onode.edges.indexOf(this)) > -1 && onode.edges.splice(pos, 1);
(pos = inode.outputEdges.indexOf(this)) > -1 && inode.outputEdges.splice(pos, 1);
(pos = onode.inputEdges.indexOf(this)) > -1 && onode.inputEdges.splice(pos, 1);

if (this.duplex) {

(pos = inode.inputEdges.indexOf(this)) > -1 && inode.inputEdges.splice(pos, 1);
(pos = onode.outputEdges.indexOf(this)) > -1 && onode.outputEdges.splice(pos, 1);

}

this.inputNode = null;
this.outputNode = null;

this.duplex = false;

return true;

}

}

##3. Re-implementing our Graph
重新实现我们的图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

// Create nodes...
let joe = new Node('person');
joe.set('name', 'Joe');

let minecraft = new Node('game');
minecraft.set('name', 'Minecraft');

// Create edge...
let likes = new Edge('likes');

// link 'em!
likes.link(joe, minecraft);

// add more nodes...
let notch = new Node('person', {name: 'Notch'});
let created = new Edge('created').link(notch, minecraft);

We added a bit, so the graph now looks something like:

N[Joe] : E[likes] → N[Minecraft] ← E[created] : N[Notch]

We can increase the complexity of our graph (so it isn’t just linear!) as much as we want.

我们增加了一些东西,所以图现在看起来是这样的。可以随心所欲的增加图的复杂性。
N[Joe] : E[likes] → N[Minecraft] ← E[created] : N[Notch]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// Add even more nodes
let mojang = new Node('company', {name: 'Mojang'});
let microsoft = new Node('company', {name: 'Microsoft'});
let jennifer = new Node('person', {name: 'Jennifer'});

new Edge('founded').link(notch, mojang);
new Edge('acquired').link(microsoft, mojang);
new Edge('purchased').link(jennifer, minecraft);
new Edge('prints_money_for').link(minecraft, microsoft);

/*
Our new graph...
Jennifer
| (purchased)
v
Joe --(likes)--> Minecraft <--(created)-- Notch
(prints_money_for) | | (founded)
v v
Microsoft --(acquired)--> Mojang
*/

Awesome. But how would we use a structure like this to drive recommendations for our users? Let’s switch gears to apply this to our business and our product.

很好,但是我们如何使用这样一个数据结构来为用户进行推荐呢?让我们转向应用这个数据结构到业务和产品中。

##4. Mapping User Behavior
映射用户行为

Now that we have the basic data structures of our graph set up, let’s map our users’ behavior on to a graph. At Storefront, we have at least three basic ways user can interact with a listing: they can A) view a listing, B) favorite a listing or C) request to book a listing. We’ll use these three actions as our edges, and we’ll define two node types, user and listing.

现在我们已经拥有了基础的数据结构,让我们映射用户的行为到图上。在Storefront,用户与一个listing交互至少有3种基本的行为:

  1. 访问一个列表
  2. 将一个列表加入喜爱
  3. 订阅一个列表

这些动作作为图的边,我们将定义两种节点,用户(user)和列表(listing)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
let users = getUsers();         // abstract function to get user data (i.e. SQL)
let listings = getListings(); // ... listings
let views = getViews(); // ... etc.
let favorites = getFavorites();
let requests = getRequests();

// quick and dirty O(n) function to get a node by id
function getNodeById(nodes, id) {
return nodes.filter(function(node) {
return node.get('id') === id;
})[0];
}

users = users.map(function(user) {
return new Node('user', user);
});

listings = listings.map(function(listing) {
return new Node('listing', listing);
});

views = views.map(function(view) {
return new Edge('view')
.link(getNodeById(users, view.user_id), getNodeById(listings, view.listing_id));
});

favorites = favorites.map(function(favorite) {
return new Edge('favorite')
.link(getNodeById(users, favorite.user_id), getNodeById(listings, favorite.listing_id));
});

requests = requests.map(function(request) {
return new Edge('request')
.link(getNodeById(users, request.user_id), getNodeById(listings, request.listing_id));
});

Voila! Keep in mind, this graph is represented entirely in memory (RAM). For very large user bases (and large numbers of actions) this can be quite a memory hog. You can decrease the memory space by grouping all actions of a certain type between a user and a listing into a single edge, and ignoring users who haven’t performed actions on your website.

记住,这个图是在内存中表示的。如果大量的用户(大量动作),将会占用非常多的内存。你可以通过以下方法降低内存,将一个用户和一个列表点所有动作分组到一条边。或者忽略那些没有对你的网站执行操作的用户。

##5. Weighting Interactions
为交互添加权重

Before we’re ready to start giving out recommendations, we need to assign weights (or distances) to each of our user behaviors. We want more important actions to be weighted more favorably. The reason why will become clear when we put everything together. “Weight” and “distance” are inverses of each other, with a higher weight equating to a lower distance.

在我们给出推荐之前,我们需要为每个用户的行为分配权重。我们希望重要的行为被分配更大的权重。权重和距离是反义词。一条边分配一个大的权重意味着分配小一些的距离。

We’ll deal with setting distances because they’re easier to reason about when traversing graphs. Since requests are the most important, we’ll set them to a distance of 1, favorites to 2, and views to 4.

我们会处理设定距离,访问请求是最重要的,我们将其设定为1,最爱设定为2,查看设定为4。

1
2
3
4
5
6
7
8
9
10
11
12

views.forEach(function(view) {
view.setDistance(4);
});

favorites.forEach(function(favorite) {
favorite.setDistance(2);
});

requests.forEach(function(request) {
request.setDistance(1);
});

最终,我们准备开始进行推荐!

##6. Putting it All Together
整合

In order to gain meaningful insights from our graph we’ll need to start exploring it. Why? How does this provide us with recommendations? Well, what we’re going to do is use the emergent behavior of our users to tell us what listings are related to one another. Compared to pulling a bunch of levers and adjusting SQL query parameters, we’re just going to make the assumption that two listings are probably similar to each other if users have interacted with both of them in a meaningful way. The more “meaningful” that interaction, the more likely the listings are similar.

为了从图中获得有意义的见解,我们需要探索它。为什么呢?图如何为们提供推荐建议?我们要做的就是使用户自发的告诉我们一个列表是如何联系到另一个列表的。相比使用sql,调整sql参数,我们只需要做一个假设,如果两个用户以一种有意义的方式访问了两个列表,我们就认为这两个列表是彼此相似的。越有意义的行为,则列表之间越相似。

This explains our distances (weights) of user interactions above. If User 1 requests Listing A, User 2 requests Listing A and Listing B, then User 1 is a total distance of 3 away from Listing B on our graph. Let’s visualize.

User 1 : Req (1) → Listing A ← Req (1) : User 2 : Req (1) → Listing B

The distance (3) is calculated as the total length of edges between the two nodes (request + request + request). What we’re going to want to do to generate recommendations is start traversing our graph outwards from our user, and find all of the closest listings in the order that they appear.

这解释了我们上面定义的距离(权重),如果用户1请求列表A,用户2请求列表A和列表B,然后用户1在图中距离列表B的距离是3。
User 1 : Req (1) → Listing A ← Req (1) : User 2 : Req (1) → Listing B
距离3是由两个节点的边的总距离得来的。我们产生建议的方法是从用户出发遍历图,找出所有临近的列表。

Lucky for us, graph traversal is a very well-studied and documented area of mathematics and computer science. We’re going to be traversing our graph outwards from the user we wish to generate recommendations for, breadth-first. Because our edges have weights and we don’t know enough about our data set to apply heuristics effectively, we’ll use a JavaScript implementation of a modified Dijkstra’s Algorithm that will scan nodes outwards from our target (a specific user).

所幸的是,图的遍历是一门已经经过充分研究和记录的数学和计算机科学领域。我们将要从用户开始向外遍历,使用深度遍历来产生推荐建议。因为边是有权重多,对于数据集我们了解的不够多,所以不能用启发式算法。我们将使用改进的Dijkstra算法的javascript实现来扫描表,从一个特定的用户开始扫描。

Using this algorithm we can keep track of every node we find along the way, ordered by distance. As we continue, nodes are progressively further away and (hypothetically) less relevant. For example, we’re interested in the listings another user has requested if they’ve requested the same space as you in the past. We’re not nearly as interested in the listings they have viewed (which is why the distance for views is 4). Because the listings they have viewed (as opposed to requested or favorited) are further out, they’ll be traversed far later than the listings they have requested and won’t have as high of a precedence in our list of recommended spaces.

使用这种算法我们能够跟踪访问路径上的每一个节点,并且按照距离排序。当我们继续访问时,节点之间距离逐渐变远,相关性也逐渐降低。例如,我们对一个其他用户已经request过的列表感兴趣,我们基本不会对他们view过的列表感兴趣(这就是为什么view的距离是4的原因)。因为他们view(同request, favorite相反)过的列表距离更远。这些列表将会比他们request过的列表距离更远的地方才能被访问到,这些节点也不会比在推荐区域中的清单有更高的优先级。

I’ll save you the hassle of mastering Dijkstra’s Algorithm in this article, but I encourage you to investigate more on your own if you’re not familiar with graph traversal or need a refresher.

在这篇文章中我会免去掌握Dijkstra算法的难处,如果你对图遍历不熟悉的话,我鼓励你更多的自己去研究。

##7. Final Implementation
最终实现

If you were worried you’ll have to figure this all out on your own, we have you covered. UnitGraph is a lightweight io.js (node.js) graph library that uses the data structures outlined here, so you don’t have to make it on your own. It provides a Graph wrapper that allows you to find the closest nodes on a graph or trace the shortest route between two nodes and it’s what we’re currently using in production.

如果你担心所有这一切都必须你自己亲自来实现,你不必担心,我们已经都实现了。Unitgraph是一个轻量级点io.js图库,它使用这里列出的数据结构,所以你不必自己来编写。它提供了一个图包装器,允许你去寻找一个图中最近的节点,或者找到两个节点间最近的路径。这就是我们现在所做的。

Generating our graph now looks like the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50

let ug = require('ug');

let graph = new ug.Graph();

// fetch data

let users = getUsers(); // abstract function to get user data (i.e. SQL)
let listings = getListings(); // ... listings
let views = getViews(); // ... etc.
let favorites = getFavorites();
let requests = getRequests();

// Add to graph

users.forEach(function(user) {
graph.createNode('user', user);
});

listings.forEach(function(listing) {
graph.createNode('listing', listing);
});

views.forEach(function(view) {
graph.createEdge('view').link(
graph.nodes('user').find(view.user_id),
graph.nodes('listing').find(view.listing_id)
).setDistance(4);
});

favorites.forEach(function(favorite) {
graph.createEdge('favorite').link(
graph.nodes('user').find(favorite.user_id),
graph.nodes('listing').find(favorite.listing_id)
).setDistance(2);
});

requests.forEach(function(request) {
graph.createEdge('request').link(
graph.nodes('user').find(request.user_id),
graph.nodes('listing').find(request.listing_id)
).setDistance(1);
});

// save graph
graph.save('/path_to_saved_graph.ugd', function() {

doAfterSave(); // do whatever you'd like.

});

And here’s an example of querying for the 100 closest listings to a user on the graph:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
let ug = require('ug');

let graph = new ug.Graph();

// load graph from flatfile into RAM
graph.load('/path_to_saved_graph.ugd', function() {

// get the closest 100 'listings' nodes, at a minimum depth (distance) of 3
let results = graph.closest(node, {
compare: function(node) { return node.entity === 'listing'; },
minDepth: 3,
count: 100
});

// results is now an array of Paths, which are each traces from your starting node to your result node...
let resultNodes = result.map(function(path) {
return path.end();
}));

doSomething(resultNodes); // render, whatever you'd like

});

You’ll notice minDepth in the graph.closest() call is set to 3. This is to prevent recommending listings to a user that the user has already favorited or requested (distances of 1 and 2, respectively). A distance of at least 3 guarantees we’ll only be recommending listings to users that they’ve at least already viewed.

你可能注意到了在graph.closest()调用中被设置为了3,这是为了防止向一个用户推荐列表时这个列表是用户曾经favorite过或者request过的(距离分别是1和2)。距离为3保证了我们只会推荐他们仅view过的列表。

后面文章讲述框架的一些优点和不足,可扩展性未来会有带加强的,但是现在还不是很强。

additional documentation for UnitGraph here
An ES6 API Server and Framework for iojs

原文地址