Exploring Computational Vocabulary for Collaborative Filtering

Digvijay Mali · Published in Analytics Vidhya · Mar 5, 2020 · 15 min read



Predicting the unknown from the known is the classical way machine learning works, and it is where the basic operation of a recommendation system begins. Most recommendation systems operate on one of three principles.

(i) Editorial or hand-curated: lists of favorites, staff picks, essential items. Problem: no input from the user.

(ii) Simple aggregation: top-10 best sellers, most recent, most popular. Problem: not tailored to an individual's choice.

(iii) Tailored to individual users: understanding a user's taste and then suggesting products that match it. Example: recommendations on Amazon.

Among the three, discovering the taste of an individual user is the most challenging task for any recommendation system, and it relies on the accessible background and history of that individual. If a person is new to the system, it is very difficult to discover his or her taste. Similarly, if a product is freshly introduced into the network, it is hard to find the consumers for whom it would be suitable. We can address this problem with methods that learn implicitly from the experiences of other users: if a user is new to the system and has many unknown attributes, we can use predictive models to fill in the blanks by finding similar users whose tastes match. Collaborative filtering is one such method, used by many recommendation systems to make automatic predictions (filtering) about a user's tastes by gathering (collaborating) preference data from many users. Here we would like to explore the mathematical vocabulary used in collaborative filtering and see why predicting unknowns makes sense.

[PART 1] UNDERSTANDING TERMINOLOGIES:

The first simple thing a non-professional would do is find the trend (relationship) between two variables in order to forecast one variable using the other.

(I) Slope of the Line:

The slope is a description of the steepness of a line, and it remains constant along the line. The slope also tells you the direction of the line on the coordinate plane. It can be determined either by looking at the line graph or by using the coordinates of any two points on the line: slope = (y2 − y1) / (x2 − x1), where (x1, y1) and (x2, y2) are two points on the line.

The trend can be positive, negative or neutral, which can be seen by viewing the data on a scatterplot where the X-axis is the data available and the Y-axis is the data you want to predict. We try to fit a straight line through the data, which roughly shows the trend. The line has different properties that may be significant or irrelevant depending on the problem you want to solve. For instance, a line with slope = 2 indicates that Y changes twice as fast as X, but the slope alone doesn't tell you how strong the trend is.

(II) Covariance:

Covariance is a measure of how much two random variables vary together: cov(X, Y) = SUM((xi − x̄)(yi − ȳ)) / (n − 1). It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary jointly.

Covariance is something that can tell you about the trend in the data: if the covariance is positive, the trend is positive; if it is negative, the trend is negative; and if it is 0, there is no linear trend.

PROBLEM 1: Covariance doesn't describe the line's slope (steep / not steep); it only tells you whether the relationship is positive or negative.

PROBLEM 2: If we scale the data by a factor of 4, the covariance also increases. The covariance value changes even though the relationship in the data doesn't.

PROBLEM 3: Covariance won't tell you how close the points lie to the line.

So we want something different from covariance: a measure of the strength of the relationship between X and Y that is independent of factors such as scaling. The alternative is to use correlation.
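As a quick check of PROBLEM 2, here is a minimal sketch in NumPy (the arrays are made-up illustration data, not from the article): scaling the data inflates the covariance but leaves the correlation untouched.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Covariance and correlation on the original data
print(np.cov(x, y)[0, 1])        # ~4.9
print(np.corrcoef(x, y)[0, 1])   # ~0.999

# Scale the data by 4: the covariance grows ~16x, the correlation is unchanged
print(np.cov(4 * x, 4 * y)[0, 1])
print(np.corrcoef(4 * x, 4 * y)[0, 1])
```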

(III) Correlation:

Correlation is a statistical measure of how two variables vary with each other. Correlation describes the relationship in the data and is not sensitive to the scaling of the data. Covariance and correlation are prerequisites for many machine learning problems. Pearson's correlation coefficient (r) is given by:

r = cov(X, Y) / (σ_X * σ_Y) = SUM((xi − x̄)(yi − ȳ)) / sqrt(SUM(xi − x̄)² * SUM(yi − ȳ)²)

From the correlation value we know that: (i) if the data points are close to the line, the value on the X-axis tells us more about Y and the relationship is stronger; (ii) if the data points are far from the line, the scatter of Y around the line is large, which implies a weak relationship between X and Y.

Q. How are covariance and correlation related?

We measure the covariance between X and Y, which forms the numerator, before calculating the correlation r. If the covariance is zero, there is no linear correlation between X and Y, so there is no need to compute the denominator. With correlation we are essentially doing a 'bivariate standardization' of the covariance.

Covariance ranges over (−∞, +∞), while correlation lies in [−1, 1]. Correlation represents the true strength of the linear relationship between two variables.

Q. How different is the correlation coefficient (r) for two similar lines with the same slope?

(a) SC1 = scatter of the data along the Y-axis (projections given by the red lines onto the Y-axis).

(b) SC2 = scatter of the data along the fitted line (projections given by the black lines onto the fitted line).

If SC1 >> SC2 the correlation is stronger, and if SC1 << SC2 the correlation is weaker.

Q. Why is correlation not causation?

Correlation doesn't always explain the real outcome. For example, person A studied for 30 days and got an "A" grade, while person B studied for 15 days and got an "A-" grade. When you plot this data you find a positive correlation between the number of days studied and the grade, and you might deduce an association between them. Yet a new data point, person C, who studied for 1 day and got an "A-" grade, does not follow the same correlation. "Lurking variables" exist that are concealed and have the potential to influence the actual results; hence correlation is not always causation.

Q. If relationship strength equals correlation, how much can we trust this relationship?

That depends on how much data you have. If the data set is very small we are less confident, and we become more confident in the relationship as the data grows. For correlation, the p-value tells you the probability that randomly drawn points would produce an equally strong relationship by chance. Hence, the smaller the p-value, the more confident we are in the relationship. Note that the p-value is separate from the value of the correlation coefficient itself.
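For instance, scipy.stats.pearsonr returns both the correlation coefficient and this p-value (a minimal sketch; the study-days data is invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

days = np.array([30, 15, 25, 10, 20, 5])     # days studied (illustrative)
score = np.array([95, 88, 92, 80, 90, 70])   # exam score (illustrative)

r, p_value = pearsonr(days, score)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
# A small p-value means random points would rarely show a relationship this strong.
```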

Q. So are a strong correlation and a good p-value enough to infer a relationship?

No :), we can't tell whether a model with correlation coefficient r = 0.92 is better for prediction than a model with correlation coefficient r = 0.82. R-squared helps with this, and we turn to it next.

(IV) R-Squared:

R-squared is a statistical measure that reflects the proportion of the variance of a dependent variable that is explained by the independent variable(s) in models such as regression, whereas correlation describes the strength of the association between an independent variable (X) and a dependent variable (Y). R-squared shows the degree to which one variable's variance explains the second variable's variance. If a model's R² is 0.8, then approximately 80 percent of the observed variance can be explained by the model's independent variable(s), and something else accounts for the remaining 20 percent.

Consider an independent variable X1 and a dependent variable Y. Suppose we want to use X1 to estimate Y. We plot X1 against Y as shown in fig. (A).

fig.(A)

Variation of Y around the mean = SUM(actual_y − mean_y)²

We square the differences because data points lie both above and below the mean line, and squaring prevents the negative deviations from cancelling out the positive ones.

Variation around the mean line alone cannot tell us everything. In the example below (fig. (B)), with independent variable X2 and dependent variable Y, the variation of Y around the mean is the same as before, but the correlation between X2 and Y is better than that between X1 and Y.

fig.(B)

It is better to fit a line to the data and measure the variation around the fitted line instead of the mean line, as shown in fig. (C).

fig.(C)

Variation of Y around the fitted line = SUM(actual_y − predicted_y)²

R-squared is the difference between Y's variation around the mean and around the fitted line, divided by Y's variation around the mean: R² = (Var(mean) − Var(fitted line)) / Var(mean).

So we can say that R-squared shows the degree to which one variable's variance explains the second variable's variance, and thus their true relationship, to some extent.
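A minimal sketch of that ratio in NumPy, assuming a straight-line fit via np.polyfit (the data points are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9])

# Fit a straight line y ≈ a*x + b
a, b = np.polyfit(x, y, deg=1)
y_pred = a * x + b

var_mean = np.sum((y - y.mean()) ** 2)   # variation of Y around the mean
var_fit = np.sum((y - y_pred) ** 2)      # variation of Y around the fitted line

r_squared = (var_mean - var_fit) / var_mean
print(r_squared)   # close to 1 when the line explains most of the variation
```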

[PART 2] DISTANCE MEASURES AND THEIR RELATIONS:

Most machine learning algorithms are based on measuring the distance between data points or vectors and the relationships among them in the data space: the smaller the distance, the higher the resemblance. There are two broad families of distance measures, depending on the data space.

(I) Euclidean Space:

Data is represented in flat geometry. Generally, a distance in this space satisfies the following criteria:

(1) distance(x, y) = 0 if and only if x = y; (2) symmetry: distance(x, y) = distance(y, x); (3) triangle inequality: distance(x, y) <= distance(x, z) + distance(z, y).

[1] Euclidean Distance (L2 Norm):

This is the well-known 'distance formula': the "natural" straight-line distance between two points in Euclidean space is called the Euclidean distance or Euclidean metric. With this distance, Euclidean space becomes a metric space, and the corresponding norm is called the Euclidean norm.

[2] Manhattan Distance (L1 norm)

Taxicab geometry is a geometry in which the usual Euclidean distance metric is replaced by a new metric: the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
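A small comparison of the two metrics using SciPy (the vectors are arbitrary examples):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(u, v))   # sqrt((1-4)^2 + (2-6)^2 + 0) = 5.0  (L2 norm)
print(distance.cityblock(u, v))   # |1-4| + |2-6| + 0 = 7.0            (L1 norm)
```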

(II) Non-Euclidean Space:

Data is represented in curved (hyperbolic, elliptical) geometry rather than flat geometry, and such spaces can have more than two dimensions (e.g., Hilbert space, Minkowski space).

[1] Jaccard Distance and Jaccard Similarity

The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.

Jaccard Distance = 1- Jaccard Similarity
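A minimal sketch with plain Python sets (the item sets are invented):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

liked_by_x = {"Dune", "The Matrix", "Arrival"}
liked_by_y = {"The Matrix", "Arrival", "Blade Runner", "Alien"}

sim = jaccard_similarity(liked_by_x, liked_by_y)
print(sim)       # 2 / 5 = 0.4
print(1 - sim)   # Jaccard distance = 0.6
```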

[2] Cosine Distance and Cosine Similarity:

(i) Cosine similarity is cos(Θ), where Θ is the angle between the two vectors.

(ii) Cosine distance is 1 − cos(Θ); hence, if the distance is low, the similarity is high.

Similarity and Θ have an inverse relation: the smaller the Θ, the higher the cosine similarity and, equivalently, the smaller the cosine distance.
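A minimal sketch of both quantities in NumPy (the vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([5.0, 5.0, 4.0, 0.0])
b = np.array([4.0, 0.0, 5.0, 1.0])

sim = cosine_similarity(a, b)
print(sim)        # cosine similarity
print(1 - sim)    # cosine distance
```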

[3] Edit Distance:

It is the number of insertions and deletions required to change one string into another.

[4] Hamming Distance:

It is the number of positions at which two bit vectors differ.
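A short sketch of both measures; the edit distance here counts inserts and deletes only, computed through the longest common subsequence, and the example strings and bit vectors are invented:

```python
def edit_distance(x: str, y: str) -> int:
    """Inserts/deletes needed to turn x into y: len(x) + len(y) - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(edit_distance("abcde", "acfdeg"))    # LCS = "acde" (4) -> 5 + 6 - 8 = 3
print(hamming_distance("10111", "10011"))  # differs in one position -> 1
```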

[5] Minkowski Distance:

The Minkowski distance is a metric on a normed vector space that can be viewed as a generalization of both the Euclidean and the Manhattan distances, and it is also used with multidimensional data. Its special cases are listed below (with a short sketch after the list):

p = 1, Manhattan Distance

p = 2, Euclidean Distance

p = ∞, Chebyshev Distance
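A quick check with SciPy (arbitrary vectors) of how the Minkowski distance reduces to the other metrics as p changes:

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(distance.minkowski(u, v, p=1))   # 7.0 -> Manhattan distance
print(distance.minkowski(u, v, p=2))   # 5.0 -> Euclidean distance
print(distance.chebyshev(u, v))        # 4.0 -> limit as p -> infinity
```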

[6] Mahalanobis Distance:

Q. Why use Mahalanobis Distance?

What happens if the features have different variances, or if they are correlated? Suppose there are three clusters of movies, C1, C2 and C3, based on movie_budget and imdb_score. Now we want to determine the cluster for a new movie (title, movie_budget, imdb_score) = (The Matrix, 63000000, 8.7). Suppose we use the Euclidean distance from the new movie to the centroid of each cluster C1, C2 and C3.

(i) PROBLEM 1: The data is not standardized, meaning the scales of movie_budget (big values) and imdb_score (small values) are very different.

To weigh each feature equally, we have to standardize the data, for example with the Z-score, where we subtract the mean of the feature from each value and divide by the standard deviation.
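For example, a minimal Z-score standardization in NumPy (the feature values are invented):

```python
import numpy as np

# Columns: movie_budget (big values) and imdb_score (small values)
X = np.array([
    [63_000_000, 8.7],
    [30_000_000, 7.1],
    [90_000_000, 8.9],
    [12_000_000, 6.4],
])

# Z-score: subtract each column's mean and divide by its standard deviation
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_standardized)   # both features now have mean 0 and unit variance
```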

(ii) PROBLEM 2: Even if we standardize the data, collinearity between variables undermines the statistical significance of an independent variable. For example, in the table below there is high collinearity between movie_budget and duration (a high correlation coefficient). Therefore, if we use both movie_budget and duration in the distance calculation, it is like counting the same feature twice.

The solution to this is ‘MAHALANOBIS DISTANCE’.

Q. What is Mahalanobis Distance?

Step 1: transform the correlated variables into uncorrelated variables.

Step 2: make their variances equal to 1.

Step 3: calculate the simple Euclidean distance in the transformed space.

D²: Mahalanobis Distance, D² = (x − μ)ᵀ · S⁻¹ · (x − μ), where S is the variance-covariance matrix of the features.

Here we are scaling the difference from the mean by the inverse of the variance-covariance matrix, hence we can say that we are doing a 'multivariate standardization' when there are more than two dimensions. For two features X and Y the variance-covariance matrix is represented by:

S = [[var(X), cov(X, Y)], [cov(X, Y), var(Y)]]

If there is a strong correlation between two features, say X and Y, the covariance terms in S will be large; the inverse of S then down-weights those directions, so the distance becomes smaller. In this way we can treat each feature equally in the evaluation.
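A minimal sketch in NumPy/SciPy (the feature matrix is fabricated purely for illustration; in practice it would hold your movie features):

```python
import numpy as np
from scipy.spatial import distance

# Illustrative features: rows = movies, columns = (budget in $M, imdb_score, duration in min)
X = np.array([
    [60.0, 8.7, 136],
    [63.0, 8.4, 140],
    [30.0, 7.1, 100],
    [25.0, 6.8, 95],
    [90.0, 8.9, 150],
])

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)    # variance-covariance matrix of the features
cov_inv = np.linalg.inv(cov)     # its inverse

new_movie = np.array([63.0, 8.7, 138])

# D^2 = (x - mu)^T * S^-1 * (x - mu)
diff = new_movie - mu
d_squared = diff @ cov_inv @ diff
print(np.sqrt(d_squared))

# Same result via SciPy
print(distance.mahalanobis(new_movie, mu, cov_inv))
```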

Q. How Euclidean distance and Cosine Distance are related?

Here we touch on the 'curse of dimensionality' problem. In clustering models, for example, we often have too many features; with too many dimensions, every observation in the dataset starts to appear roughly equidistant from every other, which makes observations harder to cluster. Since clustering uses a distance measure such as the Euclidean distance to determine the similarity between observations, this is a big problem: if all distances are approximately equal, all observations appear equally similar and no noticeable clusters emerge. Therefore, for multidimensional data with vectors that are not normalized, we often use distance metrics such as cosine distance.

If d1 and d2 are two vectors, Cos(d1,d2) = (d1.d2)/(||d1||*||d2||),

||d1|| = (dx1²+dy1²+dz1²)^(1/2), ||d2|| = (dx2²+dy2²+dz2²)^(1/2)

1 − cos(d1,d2) = cosine distance, a measure of dissimilarity that does not suffer from the curse of dimensionality the way the Euclidean distance does.

Important result: if the vectors are normalized to unit length, then ||d1 − d2||² = 2 · (1 − cos(d1,d2)), so switching the distance measure from Euclidean distance to cosine distance does not change which points are considered nearer or farther.
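A quick numerical check of that relation on random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# Normalize to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_distance = 1 - np.dot(a, b)
squared_euclidean = np.sum((a - b) ** 2)

print(squared_euclidean)    # equals 2 * (1 - cos)
print(2 * cosine_distance)  # same value
```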

The content we covered in Part 1 and Part 2 is useful for machine learning / data mining applications such as recommendation systems. We will now apply these concepts to collaborative filtering, which is one type of recommendation approach.

[PART 3] COLLABORATIVE FILTERING: (Recommendation Systems)

Collaborative filtering finds similar users based on taste and profile. If user x and the people in group y have the same taste or profile history, then collaborative filtering recommends to user x the items purchased by the people in group y. The trick is: how do we find similar users?

Consider a table where each row (Y-axis) is a user and each column (X-axis) holds the rating that user gave to the corresponding book.

Consider users x and y with rating vectors Rx and Ry. We need a similarity measure that is useful for finding similar users. Here we want to capture the intuition that Sim(C, D) > Sim(D, A), since the profile of C matches the profile of D for the book 'Catch-22'.

Case 1: Apply ‘Jaccard Similarity’, Sim(C,D) = |Rc ⋂ Rd| / |Rc ⋃ Rd|

Sim(C,D) = 1/5 and Sim(D,A) = 2/4; hence we get Sim(C,D) < Sim(D,A), which is not what we want.

PROBLEM: Jaccard similarity ignores the values of the ratings. It only considers that both C and D read the book 'Catch-22', hence 1/5. Thus the Jaccard similarity would be appropriate if the ratings were boolean.

Case 2: Apply ‘Cosine Similarity’, Cos(Rc,Rd)

Sim(C,D) = Cos(Rc,Rd) = 5*4 / (sqrt(5²+5²+4²) * sqrt(4²+5²+1²)) ≈ 0.38 and Cos(Rd,Ra) ≈ 0.33; hence we get Sim(C,D) > Sim(D,A), but the two values don't differ by much.

PROBLEM: Cosine similarity treats missing ratings as 'negative' ratings: a missing rating is taken to be 0, which in this case is the worst possible rating.

Case 3: Apply ‘Centred Cosine Similarity’

This is a way to standardize the data: subtract each user's mean rating from that user's ratings so that the ratings in each row sum to exactly 0, and then treat the blank values as 0.

Now if we calculate the cosine similarity on the centred ratings, we get Sim(C,D) = Cos(Rc,Rd) = 0.08 and Sim(D,A) = Cos(Rd,Ra) = -0.55, hence Sim(C,D) >> Sim(D,A), which captures the intuition accurately. Here missing ratings are effectively treated as average ratings instead of negative ones. This also handles 'tough raters' and 'easy raters' in a better way.

Now, if you recall the formula for the coefficient of correlation, you will find it is exactly the centred cosine similarity: in both cases we subtract the mean and then standardize.
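A minimal sketch of centred cosine similarity on a small, made-up ratings matrix (0 marks a missing rating; these are not the numbers from the article's table):

```python
import numpy as np

# Rows = users A, C, D; columns = books. 0 means "not rated".
ratings = np.array([
    [4.0, 0.0, 0.0, 5.0, 1.0],   # user A
    [5.0, 5.0, 4.0, 0.0, 0.0],   # user C
    [0.0, 3.0, 0.0, 4.0, 5.0],   # user D
])

def center_row(row: np.ndarray) -> np.ndarray:
    """Subtract the user's mean rating from the rated items; keep missing entries at 0."""
    rated = row != 0
    centered = np.zeros_like(row)
    centered[rated] = row[rated] - row[rated].mean()
    return centered

def centered_cosine(u: np.ndarray, v: np.ndarray) -> float:
    cu, cv = center_row(u), center_row(v)
    return float(np.dot(cu, cv) / (np.linalg.norm(cu) * np.linalg.norm(cv)))

print(centered_cosine(ratings[1], ratings[2]))  # Sim(C, D)
print(centered_cosine(ratings[2], ratings[0]))  # Sim(D, A)
```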

Q. How is Pearson's correlation coefficient useful for user-based collaborative filtering predictions?

Ref: http://www.mmds.org

Here we want to predict User 1's rating for Item 2. The first task is to find users whose profiles are similar to User 1's. We can calculate the correlation coefficient to find similar users, but we need to take care of two things.

(i) Calculate the correlation coefficient while pretending column ‘Item 2’ doesn’t exist in the table.

(ii) Consider only those users who actually provided a rating for Item 2.

We can consider all users, or we can select the top K users whose correlation with User 1 is high (close to 1); in that case, however, we only use the collaboration of users whose profiles are just like User 1's. A better way is to include all users, even those with negative correlations, so that there is more variety of collaboration while predicting.

“More Users” —> “More Collaboration”

We need to calculate Pearson's correlation of User 1 with every existing user (W(a,u) is the correlation coefficient, where u ∈ Users and a is the active user, User 1). We will not consider users whose correlation coefficient with User 1 is zero, since this indicates that the two user profiles do not match.

RATING PREDICTIONS:

Case 1: Simple Average

Take the average rating of all those users who are similar to User 1. For example, if N is the total number of users similar to User 1 and we consider only the nearest K of them, the rating for Item 2 would be:

Rating for Item 2 = (Sum of ratings given by all K users) / K

PROBLEM: Here we treat every user equally, as if each of the K users were at the same distance from User 1, but that is not the case: their correlation coefficients with User 1 differ.

Case 2: Weighted Average Rating by Similarity values

Here we do not treat every user equally: users who are nearer to User 1 collaborate more than users who are less similar to User 1.

Ref: http://www.mmds.org

We have to consider User 2, User 4 and User 5 when calculating Pearson's correlation coefficient with User 1, as those are the only users who rated Item 2.
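A minimal sketch of this weighted prediction, assuming the standard weighted-average form rating(a, i) = SUM_u W(a,u) * rating(u, i) / SUM_u |W(a,u)| over the users u who rated item i (the ratings matrix below is invented, not the table from mmds.org):

```python
import numpy as np

# Rows = users, columns = items; np.nan marks a missing rating.
R = np.array([
    [4.0, np.nan, 5.0, 1.0, 3.0],   # User 1 (active user) - Item 2 is unknown
    [5.0, 4.0, 5.0, 2.0, 3.0],
    [1.0, 2.0, 2.0, 5.0, 4.0],
    [4.0, 5.0, 4.0, 1.0, np.nan],
    [2.0, 3.0, np.nan, 4.0, 5.0],
])

def pearson(u: np.ndarray, v: np.ndarray) -> float:
    """Pearson correlation over the items both users have rated."""
    both = ~np.isnan(u) & ~np.isnan(v)
    if both.sum() < 2:
        return 0.0
    return float(np.corrcoef(u[both], v[both])[0, 1])

active, item = 0, 1   # predict User 1's rating for the second item
raters = [u for u in range(R.shape[0]) if u != active and not np.isnan(R[u, item])]
weights = np.array([pearson(R[active], R[u]) for u in raters])
ratings = np.array([R[u, item] for u in raters])

# Similarity-weighted average of the raters' scores for the item
prediction = np.sum(weights * ratings) / np.sum(np.abs(weights))
print(prediction)
```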

The inspiration for collaborative filtering comes from the idea that people often get the best recommendations from someone with tastes similar to their own. Collaborative filtering covers strategies for matching people with similar interests and making recommendations on that basis. The scope of this article was to discuss the statistical terms used in collaborative filtering and their significance.

“Coming together is beginning, staying together is progress and collaborating together is success”. ~Henry Ford

References and other useful resources:

  1. http://www.mmds.org/mmds/v2.1/ch09-recsys2.pdf
  2. https://en.wikipedia.org/wiki/Collaborative_filtering
  3. https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26
  4. https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22
