MLND Capstone Project

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
plt.rc('axes', titlesize=20)     # fontsize of the axes title
plt.rc('axes', labelsize=16)     # fontsize of the x and y labels
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)    # fontsize of the tick labels
plt.rc('axes', axisbelow=True)   # grids are drawn behind charts

Data analysis

Books

In [3]:
data_books = pd.read_csv('data/books.csv')
len(data_books)
Out[3]:
10000
In [4]:
data_books.head()
Out[4]:
book_id goodreads_book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
0 1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
1 2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...
2 3 41865 41865 3212258 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight ... 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s...
3 4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird ... 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s...
4 5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby ... 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

There is general info for each book, illustrated here with The Hitchhiker's Guide to the Galaxy.

Each book has a count of the ratings it received (`ratings_count`) and a count of the ratings all editions of the book received (`work_ratings_count`). Each book also has a breakdown of how many 1- to 5-star ratings make up `work_ratings_count`.

In [5]:
data_books[['ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5']].describe()
Out[5]:
ratings_1 ratings_2 ratings_3 ratings_4 ratings_5
count 10000.000000 10000.000000 10000.000000 1.000000e+04 1.000000e+04
mean 1345.040600 3110.885000 11475.893800 1.996570e+04 2.378981e+04
std 6635.626263 9717.123578 28546.449183 5.144736e+04 7.976889e+04
min 11.000000 30.000000 323.000000 7.500000e+02 7.540000e+02
25% 196.000000 656.000000 3112.000000 5.405750e+03 5.334000e+03
50% 391.000000 1163.000000 4894.000000 8.269500e+03 8.836000e+03
75% 885.000000 2353.250000 9287.000000 1.602350e+04 1.730450e+04
max 456191.000000 436802.000000 793319.000000 1.481305e+06 3.011543e+06
In [6]:
# Visualize ratings
number_of_ratings = data_books[['ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5']].sum()
star_rating = np.arange(1,6)

plt.figure(figsize=(10,5))
plt.bar(star_rating, number_of_ratings)
plt.title('Distribution of all ratings for all book editions')
plt.xlabel('Star rating')
plt.ylabel('Number of ratings')
plt.show()

The number of ratings increases with the star rating itself. There might be a few reasons for that:

  • This is a dataset of the most popular books on Goodreads. The most popular books are, by definition, the ones people love to read, so they rate them highly.
  • Many people are drawn towards books they might like, either through word of mouth or by reading various editorial recommendations. Those books are more likely to receive a positive rating than if a person had chosen randomly.
  • It might be that people are more eager to leave a good rating than to bother with a bad one.

We can use this rating information to calculate the average rating for a book.
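
If $n_k$ is the number of $k$-star ratings a book received, the average rating is the weighted mean

$$\bar{r} = \frac{\sum_{k=1}^{5} k \, n_k}{\sum_{k=1}^{5} n_k}$$

which is what the cell below computes with `np.dot`.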

In [7]:
number_of_ratings = data_books[data_books.book_id == 54][
    ['ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5']].values[0]

print("The average rating for The Hitchhiker's Guide to the Galaxy is {0:.2f}.".format(
    np.dot(number_of_ratings, star_rating) / np.sum(number_of_ratings)))
The average rating for The Hitchhiker's Guide to the Galaxy is 4.20.

Book ratings

In [8]:
data_ratings = pd.read_csv('data/ratings.csv')
len(data_ratings)
Out[8]:
5976479
In [9]:
data_ratings.duplicated().value_counts()
Out[9]:
False    5976479
dtype: int64

There are no duplicate entries.

In [10]:
data_ratings.head()
Out[10]:
user_id book_id rating
0 1 258 5
1 2 4081 4
2 2 260 5
3 2 9296 5
4 2 2318 3
In [11]:
number_of_users = len(data_ratings.user_id.unique())
number_of_books = len(data_ratings.book_id.unique())

print("Users: {0}".format(number_of_users))
print("Books: {0}".format(number_of_books))
Users: 53424
Books: 10000
In [12]:
data_ratings.groupby('book_id').count().describe()
Out[12]:
user_id rating
count 10000.000000 10000.000000
mean 597.647900 597.647900
std 1267.289788 1267.289788
min 8.000000 8.000000
25% 155.000000 155.000000
50% 248.000000 248.000000
75% 503.000000 503.000000
max 22806.000000 22806.000000
In [13]:
data_ratings.groupby('user_id').count().describe()
Out[13]:
book_id rating
count 53424.000000 53424.000000
mean 111.868804 111.868804
std 26.071224 26.071224
min 19.000000 19.000000
25% 96.000000 96.000000
50% 111.000000 111.000000
75% 128.000000 128.000000
max 200.000000 200.000000

The number of ratings per book ranges from 8 to 22806. The number of ratings a user has made ranges from 19 to 200. Both minimums look reasonable at the moment, so we won't make any modifications.

Important: Book ratings in `ratings.csv` are not the same as the summary book ratings in `books.csv`; they are only a subset.

In [14]:
for example_book_id in [54, 883, 7803]:
    print("The book ID {0} has {1} ratings in `books.csv`, but only {2} ratings in `ratings.csv`.".format(\
        example_book_id,\
        data_books[data_books['book_id'] == example_book_id].work_ratings_count.values[0],\
        len(data_ratings[data_ratings['book_id'] == example_book_id])))
The book ID 54 has 1006479 ratings in `books.csv`, but only 9960 ratings in `ratings.csv`.
The book ID 883 has 111284 ratings in `books.csv`, but only 1755 ratings in `ratings.csv`.
The book ID 7803 has 9964 ratings in `books.csv`, but only 8 ratings in `ratings.csv`.

The `ratings.csv` dataset is the one we must use because it contains user information, but we have to keep in mind that it might be lacking: it holds orders of magnitude fewer ratings than the Goodreads database does.

We can visually compare all ratings with those available to us. We notice the distribution of ratings is different: there are more 4-star than 5-star ratings in our smaller dataset, compared to the distribution chart at the top (where 5-star ratings dominate).

In [15]:
pivoted_ratings = pd.pivot_table(data_ratings, values=['user_id'], index=['book_id'], columns=['rating'], aggfunc=np.count_nonzero)
number_of_ratings = pivoted_ratings.sum()
star_rating = np.arange(1,6)

plt.figure(figsize=(10,5))
plt.bar(star_rating, number_of_ratings)
plt.title('Distribution of all ratings in `ratings.csv`')
plt.xlabel('Star rating')
plt.ylabel('Number of ratings')
plt.show()

Books to read

In [16]:
data_to_read = pd.read_csv('data/to_read.csv')
len(data_to_read)
Out[16]:
912705
In [17]:
data_to_read.duplicated().value_counts()
Out[17]:
False    912705
dtype: int64
In [18]:
data_to_read.head()
Out[18]:
user_id book_id
0 9 8
1 15 398
2 15 275
3 37 7173
4 34 380
In [19]:
# group books by users
grouped_to_read = data_to_read.groupby('user_id').count()

# calculate how many histogram bins are needed
bins = np.arange(0, grouped_to_read.book_id.max() + 10, 10)

plt.figure(figsize=(10,5))
plt.hist(grouped_to_read.book_id, bins=bins)
plt.title('How many books people have in their "to read" list')
plt.xlabel('Number of books in the list')
plt.ylabel('Number of users')
plt.xlim(0, bins.max())
plt.xticks(bins)
plt.show()
In [20]:
percentage_books = 100 * len(data_to_read.book_id.unique()) / number_of_books
print("{0:.2f}% of all books are in someone's to read list.".format(percentage_books))

percentage_users = 100 * len(data_to_read.user_id.unique()) / number_of_users
print("{0:.2f}% of all users have books in their to read list.".format(percentage_users))
99.86% of all books are in someone's to read list.
91.48% of all users have books in their to read list.

Tags

There are 34,252 unique user-assigned tags in this dataset.

In [21]:
data_tags = pd.read_csv('data/tags.csv')
data_tags.head()
Out[21]:
tag_id tag_name
0 0 -
1 1 --1-
2 2 --10-
3 3 --12-
4 4 --122-
In [22]:
data_tags.count()
Out[22]:
tag_id      34252
tag_name    34252
dtype: int64
In [23]:
data_tags.duplicated().value_counts()
Out[23]:
False    34252
dtype: int64

Even though there are tens of thousands of tags, many are similar. For example, 681 tags contain "to read" in their names; a few randomly selected ones are "want-to-read," "to-read-maybe-someday," "to-read-memoir," and "to-read-cookbook."

In [24]:
tag_names = pd.Series(data_tags['tag_name'])
tag_names_to_read = tag_names[tag_names.str.contains('to.?read', regex=True)]

print("There are {0} total tags that contain 'to read' in its name.\n".format(tag_names_to_read.count()))
print("Sample tag names that contain 'to read':\n{0}".format(tag_names_to_read.sample(10).tolist()))
There are 681 tags in total that contain 'to read' in their names.

Sample tag names that contain 'to read':
['to-read-biblioteca', 'to-read-health', 'to-read-theory-general', 'food-to-read', 'to-read-lovecraft', 'companion-books-to-read', 'to-reads-i-actually-own', 'to-read-biz', 'to-read-2014', 'to-read-contemporary']

Tags are also in different languages. Tags with the same meaning but different spelling should be considered the same. Below are some examples for German, French, and Arabic.

In [25]:
tag_names[tag_names.str.contains('zu.?lesen', regex=True)]
Out[25]:
21668    noch-zu-lesen
Name: tag_name, dtype: object
In [26]:
tag_names[tag_names.str.contains('lire')]
Out[26]:
280     101-à-lire-dans-sa-vie
4051                 bd-à-lire
Name: tag_name, dtype: object
In [27]:
tag_names[tag_names.str.contains('قرأ')]
Out[27]:
21379    never-read-لا-تقرأ-أبدا
33375               أجمل-ما-قرأت
33431               أروع-ما-قرأت
33432              أسوء-ما-أقرأت
33437               أفضل-ما-قرأت
33989            قرأ-قبل-البراءة
33990                      قرأته
34037         كتب-لم-اكمل-قرأتها
34081          لن-أقرأ-لهم-ثانية
34083                 ليتها-تقرأ
34160                   مما-قرأت
Name: tag_name, dtype: object
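
As a minimal sketch of the normalization this suggests, variant spellings could be collapsed into one canonical tag. The pattern list below is a hypothetical starting point covering only the English, German, and French examples above; a real mapping would need far more curation:

import re

# hypothetical patterns for "to read"-style tags; deliberately not exhaustive
TO_READ_PATTERNS = [r'to.?read', r'zu.?lesen', r'à.?lire']

def canonicalize_tag(tag_name):
    """Collapse known 'to read' variants into the canonical 'to-read' tag."""
    for pattern in TO_READ_PATTERNS:
        if re.search(pattern, tag_name):
            return 'to-read'
    return tag_name

print(canonicalize_tag('noch-zu-lesen'))  # prints: to-read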

`book_tags.csv` lists all tags for a book and how many times each tag has been assigned to it.

In [28]:
data_book_tags = pd.read_csv('data/book_tags.csv')
data_book_tags.head()
Out[28]:
goodreads_book_id tag_id count
0 1 30574 167697
1 1 11305 37174
2 1 11557 34173
3 1 8717 12986
4 1 33114 12716
In [29]:
len(data_book_tags['goodreads_book_id'].unique())
Out[29]:
10000
In [30]:
tags_per_book = data_book_tags.groupby('goodreads_book_id').goodreads_book_id.count()
tags_per_book.sample(10)
Out[30]:
goodreads_book_id
7981206     100
102461      100
10822174    100
18353714    100
3545387     100
16158519    100
296662      100
16280678    100
15505346    100
37190       100
Name: goodreads_book_id, dtype: int64

All 10000 books have tags attached to them. The dataset contains only 100 tags per book, and it's not clear (nor indicated anywhere in the dataset description) how those 100 were selected: random sampling, most common, or something else.

In [31]:
data_book_tags['count'].describe()
Out[31]:
count    999912.000000
mean        208.869633
std        3501.265173
min          -1.000000
25%           7.000000
50%          15.000000
75%          40.000000
max      596234.000000
Name: count, dtype: float64
In [32]:
data_book_tags[data_book_tags['count'] <= 0].count()
Out[32]:
goodreads_book_id    6
tag_id               6
count                6
dtype: int64

There are 6 book-tag entries with a count of zero or less. We'll remove them from the dataset because the dataset description doesn't indicate what a non-positive count means or how the tags were counted.

In [33]:
data_book_tags = data_book_tags[data_book_tags['count'] > 0]
In [34]:
data_book_tags['count'].describe()
Out[34]:
count    999906.000000
mean        208.870892
std        3501.275640
min           1.000000
25%           7.000000
50%          15.000000
75%          40.000000
max      596234.000000
Name: count, dtype: float64

To explore what the most common tags are, we'll join tag names onto the tag IDs.

In [35]:
# add tag names to book tags
data_book_tags = data_book_tags.merge(data_tags, on='tag_id')
data_book_tags.sample(10)
Out[35]:
goodreads_book_id tag_id count tag_name
851942 7415016 11425 16 fashion
696479 1689469 20957 9 mystery-crime
129284 107776 25765 9 reread
898339 13325079 29584 26 texas
156436 11450591 22034 13 novels
159452 5168 11497 17 favorite
93695 13878 24960 8 re-read
285849 6589074 10197 16 ebook
843726 14481 18418 4 little-kids
556442 33 20570 58 movies
In [36]:
most_popular_tags = data_book_tags.groupby('tag_name').tag_name.count().sort_values(ascending=False)
most_popular_tags.head(10)
Out[36]:
tag_name
to-read              9983
favorites            9881
owned                9858
books-i-own          9799
currently-reading    9776
library              9415
owned-books          9221
fiction              9097
to-buy               8692
kindle               8316
Name: tag_name, dtype: int64

Algorithms

Two benchmark models are created: random estimates and baseline estimates. For the more complex models, the collaborative filtering method is used. The approach uses past user behavior to predict future preferences; for our project, past book ratings could indicate future ratings. In collaborative filtering, two categories of methods are widely used: neighborhood-based methods and model-based methods.

Neighborhood-based methods are about finding similar items or users. If a user positively reviews books A, B, C, and D, and another user positively reviews books A, B, and C, it is more likely that the second user will score book D highly too.
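
A toy illustration of the item-based variant (hypothetical ratings, not taken from the dataset): predict a user's missing rating for book D from the similarity between D and the books the user has already rated.

import numpy as np

# toy user x book matrix; columns are books A, B, C, D and 0 means "unrated"
R = np.array([[5, 4, 5, 4],
              [4, 5, 4, 5],
              [5, 5, 4, 0]], dtype=float)  # the third user hasn't rated D

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# similarity between book D and books A, B, C over the two users who rated D
similarities = np.array([cosine(R[:2, j], R[:2, 3]) for j in range(3)])
# predict the third user's rating for D as a similarity-weighted mean of their ratings
prediction = similarities.dot(R[2, :3]) / similarities.sum()
print("Predicted rating for book D: {0:.2f}".format(prediction))  # ~4.67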

Instead of building direct user or item connections and neighborhoods, which is especially hard in sparse datasets, model-based methods try to reduce big datasets to latent (implicit) features. In books and reading, latent features could be writing style, genres, preference for certain types of protagonists, and similar.
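
A minimal sketch of the idea behind such models, in the spirit of the biased matrix factorization that Surprise's SVD implements (the toy ratings and hyperparameters below are made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]  # (user, item, rating)
n_users, n_items, n_factors = 2, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 50

mu = np.mean([r for _, _, r in ratings])       # global mean rating
bu, bi = np.zeros(n_users), np.zeros(n_items)  # user and item biases
P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors

for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + P[u].dot(Q[i]))
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

# predict a rating the model has never seen: user 0, item 2
print(mu + bu[0] + bi[2] + P[0].dot(Q[2]))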

For each model that accepts tuning parameters, we ran grid search over a 3-fold cross-validation procedure. We set ranges of parameters around recommended values in the Surprise library, and we optimized for RMSE.

In [ ]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import GridSearchCV
In [ ]:
def print_gs_results(gs_results):
    """Prints the most important grid search metrics.

    gs_results -- GridSearchCV object after fitting
    """
    print("Mean fit time: {0:.1f}s".format(gs_results.cv_results['mean_fit_time'].mean()))
    print("The best RMSE score of {0:.4f} is for these parameters:".format(gs_results.best_score['rmse']))
    print(gs_results.best_params['rmse'])
    best_score_index = gs_results.best_index['rmse']
    # use the passed-in object, not the global `gs`
    print("The associated standard deviation: {0:.4f}".format(gs_results.cv_results['std_test_rmse'][best_score_index]))
In [ ]:
# sampled data
sample_ratio = 0.3
number_of_samples = int(len(data_ratings) * sample_ratio)
data = Dataset.load_from_df(data_ratings.sample(number_of_samples), Reader(rating_scale=(1, 5)))

n_jobs = -1 # all CPUs
In [ ]:
# all data
data = Dataset.load_from_df(data_ratings, Reader(rating_scale=(1, 5)))

n_jobs = 1 # one CPU due to memory issues

Random ratings from a normal distribution

In [ ]:
from surprise import NormalPredictor

param_grid = {}
gs = GridSearchCV(NormalPredictor, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=n_jobs)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output all data

Mean fit time: 6.4s
The best RMSE score of 1.3237 is for these parameters:
{}
The associated standard deviation: 0.0008

Baseline ratings
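
A baseline estimate predicts a rating as the global mean adjusted by user and item biases, in Koren's notation:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

where $\mu$ is the overall mean rating and the biases $b_u$ and $b_i$ are fit with the regularization parameters that `reg_u` ($\lambda_3$) and `reg_i` ($\lambda_2$) control below.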

In [ ]:
%%capture
from surprise import BaselineOnly

# Parameters docs and value ranges:
# http://surprise.readthedocs.io/en/stable/prediction_algorithms.html#baseline-estimates-configuration
# http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf

param_grid = {'bsl_options': {'method': ['als'],
                              'reg_i': [8, 9, 10, 11, 12], # lambda 2
                              'reg_u': [3, 4, 5, 6, 7],    # lambda 3
                             }
             }
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output sample 100k

Mean fit time: 0.2s
The best RMSE score of 0.9453 is for these parameters:
{'bsl_options': {'method': 'als', 'reg_i': 11, 'reg_u': 4}}

Output sample 30%

Mean fit time: 1.4s
The best RMSE score of 0.8651 is for these parameters:
{'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}

Output all data

Mean fit time: 12.4s
The best RMSE score of 0.8529 is for these parameters:
{'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}
The associated standard deviation: 0.0005
In [ ]:
%%capture
param_grid = {'bsl_options': {'method': ['sgd'],
                              'learning_rate': [0.004, 0.006, 0.008, 0.010], # gamma
                              'reg': [0.015, 0.020, 0.025] # lambda 1 and 5
                             }
             }
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output sample 100k

Mean fit time: 0.3s
The best RMSE score of 0.9509 is for these parameters:
{'bsl_options': {'method': 'sgd', 'learning_rate': 0.008, 'reg': 0.02}}

Output sample 30%

Mean fit time: 4.9s
The best RMSE score of 0.8666 is for these parameters:
{'bsl_options': {'method': 'sgd', 'learning_rate': 0.006, 'reg': 0.02}}

Output all data

Mean fit time: 25.5s
The best RMSE score of 0.8539 is for these parameters:
{'bsl_options': {'method': 'sgd', 'learning_rate': 0.004, 'reg': 0.015}}
The associated standard deviation: 0.0006

Item-to-item

In [ ]:
from surprise import KNNBasic

param_grid = {'k': [20, 30, 40, 50],
              'sim_options': {'name': ['pearson'],
                              'user_based': [False]
                             }
             }

gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output all data

Mean fit time: 100.0s
The best RMSE score of 0.8781 is for these parameters:
{'k': 40, 'sim_options': {'name': 'pearson', 'user_based': False}}
The associated standard deviation: 0.0003

Baseline item-to-item

In [ ]:
from surprise import KNNBaseline

param_grid = {'k': [20, 30, 40, 50],
              'sim_options': {'name': ['pearson_baseline'],
                              'user_based': [False]
                             },
              'bsl_options': {'method': ['als'],
                              'reg_i': [8],
                              'reg_u': [3]
                             }
             }

gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output all data

Mean fit time: 57.2s
The best RMSE score of 0.8046 is for these parameters:
{'k': 30, 'sim_options': {'name': 'pearson_baseline', 'user_based': False}, 'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}
The associated standard deviation: 0.0008

Baseline user-to-user

In [ ]:
# extract subsets of users
sorted_users = data_ratings.groupby('user_id').count().sort_values(by='rating', ascending=False)
top_users = sorted_users.index.values[:13000]
bottom_users = sorted_users.index.values[-13000:]
sampled_users = sorted_users.index.values[::4]
In [ ]:
# top users
subset_data_ratings = data_ratings.loc[data_ratings['user_id'].isin(top_users)]
data = Dataset.load_from_df(subset_data_ratings, Reader(rating_scale=(1, 5)))
In [ ]:
from datetime import datetime
now = datetime.now()
In [ ]:
from surprise import KNNBaseline
param_grid = {'k': [20, 30, 40, 50],
              'sim_options': {'name': ['pearson_baseline'],
                              'user_based': [True]
                             },
              'bsl_options': {'method': ['als'],
                              'reg_i': [8],
                              'reg_u': [3]
                             }
             }

gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output top 13k users

Mean fit time: 173.3s
The best RMSE score of 0.8246 is for these parameters:
{'k': 50, 'sim_options': {'name': 'pearson_baseline', 'user_based': True}, 'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}
The associated standard deviation: 0.0003
In [ ]:
# bottom users
subset_data_ratings = data_ratings.loc[data_ratings['user_id'].isin(bottom_users)]
data = Dataset.load_from_df(subset_data_ratings, Reader(rating_scale=(1, 5)))
In [ ]:
param_grid = {'k': [20, 30, 40, 50],
              'sim_options': {'name': ['pearson_baseline'],
                              'user_based': [True]
                             },
              'bsl_options': {'method': ['als'],
                              'reg_i': [8],
                              'reg_u': [3]
                             }
             }

gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output last 13k users

Mean fit time: 60.0s
The best RMSE score of 0.8453 is for these parameters:
{'k': 50, 'sim_options': {'name': 'pearson_baseline', 'user_based': True}, 'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}
The associated standard deviation: 0.0012
In [ ]:
# sampled users
subset_data_ratings = data_ratings.loc[data_ratings['user_id'].isin(sampled_users)]
data = Dataset.load_from_df(subset_data_ratings, Reader(rating_scale=(1, 5)))
In [ ]:
param_grid = {'k': [20, 30, 40, 50],
              'sim_options': {'name': ['pearson_baseline'],
                              'user_based': [True]
                             },
              'bsl_options': {'method': ['als'],
                              'reg_i': [8],
                              'reg_u': [3]
                             }
             }

gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output sampled users

Mean fit time: 115.0s
The best RMSE score of 0.8319 is for these parameters:
{'k': 50, 'sim_options': {'name': 'pearson_baseline', 'user_based': True}, 'bsl_options': {'method': 'als', 'reg_i': 8, 'reg_u': 3}}
The associated standard deviation: 0.0001

Latent features - matrix factorization

In [ ]:
# load all data again (CF user to user used a subset due to memory issues)
data = Dataset.load_from_df(data_ratings, Reader(rating_scale=(1, 5)))
In [ ]:
from surprise import SVD

param_grid = {'n_epochs': [10, 20, 30],
              'n_factors': [30, 50, 70],
              'biased': [True],
              'lr_all': [0.001, 0.005, 0.010],
              'reg_all': [0.01, 0.02, 0.05],
             }

gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, return_train_measures=True, n_jobs=1)
gs.fit(data)
In [ ]:
print_gs_results(gs)

Output all data

Mean fit time: 164.6s
The best RMSE score of 0.8239 is for these parameters:
{'n_epochs': 30, 'n_factors': 70, 'biased': True, 'lr_all': 0.01, 'reg_all': 0.05}
The associated standard deviation: 0.0006

Algorithm comparison

RMSE

Root Mean Square Error (RMSE) is a common metric in recommender systems. It measures the difference between the predicted and the true rating of a book: the smaller the RMSE, the closer the predicted rating is to the true rating. Because it's simple and easy to calculate, it is often used to optimize an algorithm.
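
Formally, over a test set of $N$ ratings, where $\hat{r}_{ui}$ is the predicted and $r_{ui}$ the true rating of book $i$ by user $u$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,i)} \left( \hat{r}_{ui} - r_{ui} \right)^2}$$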

In [37]:
# from GridSearchCV outputs
algo_labels = ('Random', 'Baseline', 'CF I2I', 'CF I2I baseline', 'CF U2U top baseline',
               'CF U2U bottom baseline', 'CF U2U sample baseline', 'SVD')
algo_rmse = (1.3237, 0.8529, 0.8781, 0.8046, 0.8246, 0.8453, 0.8319, 0.8239)
algo_std = (0.0008, 0.0005, 0.0003, 0.0008, 0.0003, 0.0012, 0.0001, 0.0006)

# we calculate a 95% confidence interval using the t-distribution
n_folds = 3 # in cross-validation
# 4.303 is the two-tailed 95% critical value for n_folds - 1 = 2 degrees of freedom
algo_confidence_intervals = [(4.303 * x / np.sqrt(n_folds)) for x in algo_std]
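
Rather than hard-coding the critical value, it could be computed; a sketch assuming `scipy` is available:

from scipy import stats

# two-tailed 95% critical value of the t-distribution, n_folds - 1 = 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n_folds - 1)  # ~= 4.303
algo_confidence_intervals = [t_crit * s / np.sqrt(n_folds) for s in algo_std]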
In [53]:
def plot_algorithms(labels, rmse, std, show_top=False):    
    plt.figure(figsize=(16,8))
    plt.grid(True, which='major', axis='y')
    plt.bar(np.arange(0, len(rmse)), rmse, yerr=std)
    plt.xticks(np.arange(0, len(rmse)), labels, rotation=15)
    if show_top:
        plt.ylim(ymin=np.min(rmse)*0.9)
    plt.title('RMSE scores for different algorithms')
    plt.ylabel('RMSE')
    plt.show()
In [54]:
plot_algorithms(algo_labels, algo_rmse, algo_confidence_intervals)
In [55]:
plot_algorithms(algo_labels[1:], algo_rmse[1:], algo_confidence_intervals[1:], show_top=True)

Top-K evaluation

We also rank algorithms with a top-K recommender evaluation. The recommender system's goal is not to predict the rating of an item, but to recommend a small set of items that a user would like out of the vast number that are available. Selecting this small set is a ranking problem, and the ranking evaluation used here is borrowed from Koren, Y. 2010: for every five-star rating in the test set, we predict ratings for that book plus 100 randomly drawn books and record where the five-star book ranks among the sorted predictions.

In [41]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split

data = Dataset.load_from_df(data_ratings, Reader(rating_scale=(1, 5)))
trainset, testset = train_test_split(data, test_size=0.25, random_state=2)
In [42]:
algos = {}
In [43]:
test_total = 0
test_five = 0
for rating in testset:
    test_total += 1
    if rating[2] == 5.0:
        test_five += 1

print("The testset has {0} ratings of which {1} are five-star ratings.".format(test_total, test_five))
The testset has 1494120 ratings of which 496401 are five-star ratings.
In [44]:
from surprise import BaselineOnly
bsl_options = {'method': 'als', 'reg_i': 8, 'reg_u': 3}

algo = BaselineOnly(bsl_options=bsl_options)
algo.fit(trainset)
algos['base'] = algo
Estimating biases using als...
In [45]:
from surprise import KNNBaseline

sim_options = {'name': 'pearson_baseline', 'user_based': False}
bsl_options = {'method': 'als', 'reg_i': 8, 'reg_u': 3}
algo = KNNBaseline(k=30, sim_options=sim_options, bsl_options=bsl_options)
algo.fit(trainset)
algos['cf'] = algo
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
In [46]:
from surprise import SVD

algo = SVD(n_epochs=30, n_factors=70, biased=True, lr_all=0.01, reg_all=0.05)
algo.fit(trainset)
algos['svd'] = algo
In [47]:
algos
Out[47]:
{'base': <surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x7f63b736bda0>,
 'cf': <surprise.prediction_algorithms.knns.KNNBaseline at 0x7f63b736bcc0>,
 'svd': <surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f63b736b160>}
In [48]:
import random
import operator

def predict_and_sort(algo, user_id, book_ids):
    """Predict ratings for books and sort predictions from highest to lowest.
    
    Arguments:
    algo     -- the algorithm used for predictions
    user_id  -- the user for whom to make a prediction
    book_ids -- the list of book_id for which to make a prediction
    
    Returns a list of tuples (book ID, predicted rating).
    """
    predicted_ratings = []
    for book_id in book_ids:
        prediction = algo.predict(user_id, book_id)
        predicted_ratings.append((book_id, prediction.est))
        
    predicted_ratings.sort(key=operator.itemgetter(1), reverse=True)
    return predicted_ratings


ranks = {}

for algo_name, algo in algos.items():
    ranks[algo_name] = []
    # the format of a rating is (user_id, book_id, rating)
    for rating in testset:
        if rating[2] == 5.0: # find a book with a five star rating from a user
            random_books = random.sample(testset, 100) # get random books from the testset
            random_books.append(rating)
            # get sorted predictions for the user
            books_with_predictions = predict_and_sort(algo, rating[0], [x[1] for x in random_books])
            # find the rank (index) of the book in the list
            for index, predicted_rating in enumerate(books_with_predictions):
                # if book IDs match, store the rank of the book
                if rating[1] == predicted_rating[0]:
                    ranks[algo_name].append(index)
                    break
In [49]:
percentile = np.arange(0, 22, 2) # we're interested in the top 20% of recommendations

# calculate the chance that a five-star book will show up in the top X% of recommendations
chance = {}
for algo_name, ranks_one_algo in ranks.items():
    chance[algo_name] = []
    for p in percentile:
        chance[algo_name].append(sum([1 for i in ranks_one_algo if i <= p]) / len(ranks_one_algo) * 100)
In [50]:
plt.figure(figsize=(12,12))
plt.grid(True, which='major', axis='y')
plt.plot(percentile, chance['cf'], 'k-v', label='CF I2I')
plt.plot(percentile, chance['svd'], 'k--o', label='SVD')
plt.plot(percentile, chance['base'], 'k:+', label='Baseline')
plt.plot(percentile, percentile, 'k-.s', label='Random uniform')
plt.xticks(percentile)
plt.title('Top-K recommendations for different algorithms')
plt.xlabel('Rank (%)')
plt.ylabel('Cumulative distribution (%)')
plt.legend()
plt.show()

Making predictions

In [51]:
sorted_users = data_ratings.groupby('user_id').count().sort_values(by='rating', ascending=False)
top_user_id = sorted_users.index.values[0]

Calculate mean ratings for all books from `ratings.csv` and merge them with the book data.

In [52]:
mean_ratings = pd.pivot_table(data_ratings, values=['rating'], index=['book_id'], aggfunc=np.mean)
mean_ratings.reset_index(inplace=True)
mean_ratings.rename(columns={'rating': 'mean_rating'}, inplace=True)

data_books = pd.merge(data_books, mean_ratings, on='book_id')

Which books has this user rated with 5 stars? (sorted by mean book rating)

In [53]:
five_star_ratings = data_ratings[data_ratings['user_id'] == top_user_id]
five_star_ratings = five_star_ratings[five_star_ratings['rating'] == 5]
five_star_books = data_books[data_books['book_id'].isin(five_star_ratings['book_id'])].sort_values(by='mean_rating', ascending=False)
five_star_books[['book_id', 'original_title', 'authors', 'mean_rating']].head(30)
Out[53]:
book_id original_title authors mean_rating
3627 3628 The Complete Calvin and Hobbes Bill Watterson 4.829876
2208 2209 The Complete Works William Shakespeare 4.527121
1379 1380 The Complete Maus Art Spiegelman 4.521739
963 964 The Hobbit and The Lord of the Rings J.R.R. Tolkien 4.518571
3219 3220 Percy Jackson and the Olympians Boxed Set Rick Riordan 4.503012
160 161 The Return of the King J.R.R. Tolkien 4.424735
1341 1342 Night Watch Terry Pratchett 4.414097
188 189 The Lord of the Rings J.R.R. Tolkien 4.377728
506 507 The Hunger Games Box Set Suzanne Collins 4.371503
489 490 Maus: A Survivor's Tale : My Father Bleeds His... Art Spiegelman 4.369018
1 2 Harry Potter and the Philosopher's Stone J.K. Rowling, Mary GrandPré 4.351350
154 155 The Two Towers J.R.R. Tolkien 4.332047
3 4 To Kill a Mockingbird Harper Lee 4.329369
1336 1337 Going Postal Terry Pratchett 4.309631
174 175 The Last Olympian Rick Riordan 4.307137
336 337 The Ultimate Hitchhiker's Guide: Five Complete... Douglas Adams 4.305199
1222 1223 The Foundation Trilogy Isaac Asimov 4.304348
190 191 Watchmen Alan Moore, Dave Gibbons, John Higgins 4.283660
18 19 The Fellowship of the Ring J.R.R. Tolkien 4.271828
167 168 The Stand Stephen King, Bernie Wrightson 4.255985
156 157 Green Eggs and Ham Dr. Seuss, לאה נאור 4.246904
660 661 Alexander and the Terrible, Horrible, No Good,... Judith Viorst, Ray Cruz 4.242408
22 23 Harry Potter and the Chamber of Secrets J.K. Rowling, Mary GrandPré 4.229418
1367 1368 Small Gods Terry Pratchett 4.209006
158 159 The Battle of the Labyrinth Rick Riordan 4.190997
282 283 Good Omens: The Nice and Accurate Prophecies o... Terry Pratchett, Neil Gaiman 4.165629
936 937 His Dark Materials Philip Pullman 4.161990
805 806 Wizard and Glass Stephen King, Dave McKean 4.150933
6 7 The Hobbit or There and Back Again J.R.R. Tolkien 4.148477
598 599 The Restaurant at the End of the Universe Douglas Adams 4.135909
In [54]:
book_ids_rated_by_top_user = data_ratings[data_ratings['user_id'] == top_user_id].book_id.values
books_not_rated = data_books[~data_books['book_id'].isin(book_ids_rated_by_top_user)]
book_ids_not_rated_by_top_user = books_not_rated['book_id'].values

if (len(data_books) - len(book_ids_rated_by_top_user) - len(book_ids_not_rated_by_top_user)) == 0:
    print("Book counts match.")
else:
    print("Book counts don't match!")
Book counts match.
In [55]:
predictions = predict_and_sort(algos['svd'], top_user_id, book_ids_not_rated_by_top_user)

predictionsDF = pd.DataFrame({
    'book_id': [x[0] for x in predictions],
    'predicted_rating': [x[1] for x in predictions]
})
predictionsDF = pd.merge(data_books, predictionsDF, on='book_id')
predictionsDF.sort_values(by='predicted_rating', ascending=False)[['book_id', 'original_title', 'authors', 'predicted_rating', 'mean_rating']].head(30)
Out[55]:
book_id original_title authors predicted_rating mean_rating
8745 8946 دیوان‎‎ [Dīvān] Hafez 5.000000 4.720000
7746 7947 NaN Anonymous, Lane T. Dennis, Wayne A. Grudem 5.000000 4.818182
961 1121 Blood Meridian: Or the Evening Redness in the ... Cormac McCarthy 4.980688 4.010597
5551 5752 NaN Garth Ennis, Steve Dillon 4.979350 4.372727
8347 8548 This is Not My Hat Jon Klassen 4.947479 4.584416
8777 8978 The Revenge of the Baby-Sat: A Calvin and Hobb... Bill Watterson 4.897248 4.761364
6389 6590 The Authoritative Calvin and Hobbes Bill Watterson 4.895839 4.757202
619 757 Lonesome Dove Larry McMurtry 4.895771 4.426042
8679 8880 Transmetropolitan, Vol. 5: Lonely City Warren Ellis, Darick Robertson, Rodney Ramos, ... 4.894311 4.464286
7812 8013 The Book with No Pictures B.J. Novak 4.894129 4.450000
7886 8087 The Lions of Al-Rassan Guy Gavriel Kay 4.878977 4.282486
7074 7275 Far from the Tree: Parents, Children, and the ... Andrew Solomon 4.877785 4.389262
407 516 The Amazing Adventures of Kavalier & Clay Michael Chabon 4.866199 4.047354
2398 2590 Tiny Beautiful Things: Advice on Love and Life... Cheryl Strayed 4.860235 4.258065
6097 6298 The Poetry of Pablo Neruda Pablo Neruda, Ilan Stavans 4.856668 4.435233
5553 5754 Cuentos completos Jorge Luis Borges, Andrew Hurley 4.854646 4.567442
3453 3650 A Supposedly Fun Thing I'll Never Do Again: Es... David Foster Wallace 4.853623 4.186916
6234 6435 The Making of the Atomic Bomb Richard Rhodes 4.852985 4.337748
9191 9392 Lincoln in the Bardo George Saunders 4.851525 3.913669
6109 6310 The Universe Versus Alex Woods Gavin Extence 4.851467 4.136364
8875 9076 Preach My Gospel (A Guide to Missionary Service) The Church of Jesus Christ of Latter-day Saints 4.838883 4.598214
9315 9516 The Way the Crow Flies Ann-Marie MacDonald 4.834888 3.846154
1291 1462 The Orphan Master's Son Adam Johnson 4.830294 4.062088
6719 6920 The Indispensable Calvin and Hobbes: A Calvin ... Bill Watterson 4.821954 4.766355
222 291 Cutting for Stone Abraham Verghese 4.811221 4.236923
5534 5735 Chronicles of The Black Company (The Black Com... Glen Cook 4.810872 4.180000
7063 7264 Master of the Senate Robert A. Caro 4.810437 4.601770
3729 3928 When Nietzsche Wept: A Novel of Obsession Irvin D. Yalom 4.810125 4.023952
5006 5207 The Days Are Just Packed: A Calvin and Hobbes ... Bill Watterson 4.807286 4.722656
8823 9024 Angels in America: A Gay Fantasia on National... Tony Kushner 4.803000 4.428571