==Classification
First, non-personalized (community-based) recommendation, meaning the system does not use any personal information. It recommends the most popular items in the system regardless of the user's characteristics; hotel and restaurant recommendation sites such as Yelp are typical examples. You cannot filter the results by personal preference. For instance, a user in their 20s might prefer restaurants with a fancy atmosphere and not care much about the taste of the food, but the system cannot account for that.
Second, content-based recommendation. User ratings are correlated with item features: for each item, extract its features, then learn how each user weights those features. For a new item, we can then predict the rating from its features alone. Not all users like to provide explicit ratings; for such systems it is possible to infer user preferences from implicit signals, such as browsing, clicking, and navigation behavior, as in news feed, music, and video recommendation systems.
Third, collaborative-filtering recommendation. The fundamental assumption is that people with similar preferences will like similar items. For instance, if we both like horror-romance movies and you saw a movie that I have not seen before, it is highly likely that I would like that movie as well. The key input is the rating information, and the main problem is that the rating matrix is quite sparse in most cases. Ways to handle this issue include filling in missing values and selecting promising cells.
==Evaluation
First, accuracy. Precision and recall measure whether the predicted user preferences match what the user actually likes.
Second, usefulness of recommendations, such as diversity and non-obviousness.
Third, Computational performance.
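Precision and recall from the first point can be sketched in a few lines of Python (the item IDs here are made up for illustration):

```python
def precision_recall(recommended, liked):
    """Precision: fraction of recommended items the user liked.
    Recall: fraction of liked items that were recommended."""
    hits = len(set(recommended) & set(liked))
    return hits / len(recommended), hits / len(liked)

# Hypothetical example: 4 recommendations, of which 2 are among
# the 3 items the user actually likes.
print(precision_recall(["a", "b", "c", "d"], ["b", "d", "e"]))
# (0.5, 0.6666666666666666)
```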
==Details of Content-based Recommendation
The key is to build an attribute vector for each item. TF-IDF can be used to create the profile of a document or object; for instance, a movie can be described as a weighted vector of its tags. There are three main steps:
1. computing vectors to describe items
2. building users' preference
3. predicting user interest in items.
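The three steps above can be sketched in plain Python. This is a minimal illustration with hypothetical token lists; a real system would use a library such as scikit-learn:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Step 1: TF-IDF vector (term -> weight) for each token list."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def user_profile(vecs, ratings):
    """Step 2: rating-weighted sum of the item vectors the user rated."""
    profile = Counter()
    for vec, r in zip(vecs, ratings):
        for t, w in vec.items():
            profile[t] += r * w
    return profile

def score(profile, vec):
    """Step 3: predicted interest = dot product of profile and item."""
    return sum(profile[t] * w for t, w in vec.items())
```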
From item vectors to user profiles, there are several possible methods.
1. Simply add the item vectors together, weighted by the user's ratings. Predicting interest in an item is then a dot product of the user profile and the item vector.
For instance, the item vectors are as follows. The last column is the user's rating vector.
| doc | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family | user_vec |
| doc1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| doc2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | -1 |
| doc3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | |
| doc4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | |
| doc5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | |
| doc6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| doc7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | |
| doc8 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| doc9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
| doc10 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | |
| doc11 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | |
| doc12 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | |
| doc13 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | |
| doc14 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | |
| doc15 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | |
| doc16 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| doc17 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | |
| doc18 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | |
| doc19 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | -1 |
| doc20 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | |
A simple way to construct the user's profile is to sum the rated item vectors, weighted by the ratings. For the baseball feature, summing over the rated documents doc1, doc2, doc8, doc16, and doc19:
User_Profile[baseball] = (1)(1) + (−1)(0) + (1)(0) + (1)(1) + (−1)(0) = 2
Doing this for every feature gives:
User_Profile = (2, −2, 0, 0, 0, 2, 0, −1, 1, 1)
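The weighted sum can be checked numerically. A NumPy sketch, using only the five rated documents from the table above:

```python
import numpy as np

# Item vectors for the five rated documents from the table
# (features: baseball, economics, politics, Europe, Asia,
#  soccer, war, security, shopping, family).
items = np.array([
    [1, 0, 1, 0, 1, 1, 0, 0, 0, 1],  # doc1
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 0],  # doc2
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 1],  # doc8
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # doc16
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],  # doc19
])
ratings = np.array([1, -1, 1, 1, -1])  # user_vec column

# User profile = rating-weighted sum of the item vectors
profile = ratings @ items
print(profile)  # [ 2 -2  0  0  0  2  0 -1  1  1]
```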
2. Normalize the vectors, so that an item with more features does not automatically become more important.
3. Weight the features by how rare they are, e.g. 1/DF or log(N/DF), where DF is the number of items in which the feature occurs.
The most challenging part of content-based recommendation is figuring out the right weights and factors.
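Steps 2 and 3 above can be illustrated with NumPy; the 3-item feature matrix here is made up:

```python
import numpy as np

# Hypothetical binary feature matrix (rows = items, columns = features)
items = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])

# Step 2: L2-normalize each item so vectors with more features
# do not simply dominate the profile
norms = np.linalg.norm(items, axis=1, keepdims=True)
normalized = items / norms

# Step 3: down-weight common features by inverse document frequency
df = (items > 0).sum(axis=0)   # how many items each feature occurs in
idf = np.log(len(items) / df)  # log(N / DF); a feature present
weighted = normalized * idf    # in every item gets zero weight
```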
==Details of Collaborative-filtering Recommendation
The fundamental assumption is that our past agreement predicts our future agreement. The major steps of a CF algorithm:
1. Selecting neighbors, i.e. finding similar users. Normally you select the top 25-100 neighbors; the more neighbors, the higher the system's coverage. A common similarity measure is Pearson correlation, which already mean-centers each user's ratings, so for this method a separate normalization step can be skipped.
2. Scoring the items using the neighbors. Predict an item's score from similar users' ratings; methods include the simple average, the weighted average, and multiple linear regression. The weighted average is the most common.
3. Normalizing the data. Users rate on different scales: some consistently give higher numbers, some lower. Normalizing each user's ratings (e.g. mean-centering) reduces this difference.
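The steps above can be combined into a compact sketch. The rating matrix and the positive-correlation cutoff are illustrative choices, not the only option:

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation over the items both users rated (NaN = unrated)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    a, b = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom else 0.0

def predict(ratings, user, item, k=25):
    """Score one item for one user as a similarity-weighted average
    of the ratings of the top-k positively correlated neighbors."""
    sims = sorted(
        ((pearson(ratings[user], ratings[other]), other)
         for other in range(len(ratings))
         if other != user and not np.isnan(ratings[other, item])),
        reverse=True)
    top = [(s, o) for s, o in sims[:k] if s > 0]
    den = sum(s for s, _ in top)
    return sum(s * ratings[o, item] for s, o in top) / den if den else np.nan

# Hypothetical 3-user x 3-item rating matrix; NaN marks unrated cells.
ratings = np.array([
    [5.0, 4.0, np.nan],   # target user: item 2 is unrated
    [5.0, 4.0, 2.0],      # agrees with the target user
    [1.0, 2.0, 5.0],      # disagrees -> negative correlation, dropped
])
print(predict(ratings, user=0, item=2))  # 2.0
```

In practice the neighbors' ratings are usually normalized first (step 3), e.g. mean-centered before the weighted average and the target user's mean added back afterwards; the sketch above skips that for brevity.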