The first step in any data science pipeline is understanding and cleaning the raw data. I worked with four main datasets: movies.csv, ratings.csv, tags.csv, and links.csv. During initial data exploration, I encountered several integrity issues that needed to be resolved.
Upon checking for null values, I discovered that 34 movies carried the placeholder (no genres listed). Rather than dropping these rows and losing valuable data, I manually filled in the genres by cross-referencing the movies on TMDB/IMDb. For instance, Ben-Hur (2016) was updated to Action|Adventure|Drama, and Cosmos to Documentary. This ensures that our genre-based similarity calculations later on have complete information.
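A minimal sketch of this fill-in step, using a hypothetical three-row stand-in for movies.csv and a hand-built lookup table (the real pass covered all 34 affected titles):

```python
import pandas as pd

# Hypothetical mini-frame standing in for movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Ben-Hur (2016)", "Cosmos"],
    "genres": ["Adventure|Animation|Children", "(no genres listed)", "(no genres listed)"],
})

# Genres looked up manually on TMDB/IMDb, keyed by movieId.
manual_genres = {2: "Action|Adventure|Drama", 3: "Documentary"}

# Overwrite only the placeholder rows with the researched genres.
mask = movies["genres"] == "(no genres listed)"
movies.loc[mask, "genres"] = movies.loc[mask, "movieId"].map(manual_genres)

print(movies["genres"].tolist())
```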
Some movies appeared multiple times in the dataset under slightly different titles or with conflicting years (e.g., Emma (1996) or War of the Worlds (2005)).
To solve this, I merged each set of duplicates into a single record, taking the union of their genre sets (via list(set.union(*map(set, genres)))), and remapped the movieId of every dropped duplicate to the retained movieId across the ratings, tags, and links datasets. This guaranteed that all user interactions were accurately merged under a single movie entity.

I also noticed that some movies were missing release years in their titles (e.g., Babylon 5). Using the imdbId from the links.csv file, I manually imputed the correct years. After ensuring all titles had a year, I extracted the year into a separate year column using regex processing, leaving a thoroughly cleaned title_only column.
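The de-duplication, ID remapping, and year extraction could look like the sketch below, using toy stand-ins for movies.csv and ratings.csv. Column names follow MovieLens conventions; the exact grouping logic is my assumption about the implementation:

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [10, 11, 12],
    "title": ["Emma (1996)", "Emma (1996)", "Babylon 5 (1994)"],
    "genres": ["Comedy|Drama", "Romance", "Sci-Fi"],
})
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [11, 12], "rating": [4.0, 3.5]})

# 1. Merge duplicate titles: keep the first movieId, union the genre sets.
merged = (movies.groupby("title", as_index=False)
                .agg(movieId=("movieId", "first"),
                     genres=("genres", lambda g: "|".join(sorted(set("|".join(g).split("|")))))))

# 2. Remap every dropped movieId to the retained one in the interaction tables.
id_map = movies.merge(merged[["title", "movieId"]], on="title", suffixes=("_old", ""))
remap = dict(zip(id_map["movieId_old"], id_map["movieId"]))
ratings["movieId"] = ratings["movieId"].map(remap)

# 3. Pull the year out of the title with a regex, leaving a clean title_only column.
merged["year"] = merged["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
merged["title_only"] = merged["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(merged[["movieId", "title_only", "year", "genres"]])
print(ratings)
```

Rating, tag, and link rows for the dropped duplicate (movieId 11) now all point at the surviving record.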
With the data thoroughly cleaned, it was exported as refined CSVs to be used in the core recommendation engine.
The heart of an Item-Based Collaborative Filtering model lies in accurately measuring the similarity between items (movies). A standard approach uses only user ratings to compute cosine similarity, but I improved this by including genre and release-year patterns.
Users have different rating behaviors—some are overly generous (giving everything a 4 or 5), while others are highly critical. To remove this user bias, I applied Z-score standardization (user mean centering plus scaling): I subtracted each user's average rating from their individual ratings and then divided by that user's standard deviation.
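A sketch of this standardization on a toy ratings frame; the small epsilon guard is my addition, protecting against users who rate everything identically:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating": [5.0, 4.0, 4.5, 2.0, 1.0, 3.0],
})

# Per-user mean and standard deviation.
stats = ratings.groupby("userId")["rating"].agg(["mean", "std"])
ratings = ratings.join(stats, on="userId")

# Z-score: subtract the user's mean, divide by the user's std.
ratings["z"] = (ratings["rating"] - ratings["mean"]) / (ratings["std"] + 1e-9)

print(ratings[["userId", "movieId", "z"]])
```

After this step, a generous user's habitual 5s and a harsh user's habitual 3s land on the same scale.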
A major challenge in recommendation systems is the "long-tail" problem: a movie with a single 5.0 rating mathematically appears better than a popular movie with a hundred ratings averaging 4.5.
To address this, I applied a weighted rating formula:
Weighted Avg = (C / (C + T)) * R + (T / (C + T)) * M

where C is the movie's rating count, R its raw mean rating, M the global mean rating, and T a shrinkage threshold. By pulling the average of movies with very few ratings towards the global mean, the model avoids inaccurately boosting obscure titles over universally acclaimed ones. I also experimented with using the median instead of the mean to mitigate the impact of outliers.
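The shrinkage formula translates directly into code; the function name weighted_avg and the threshold of 50 below are illustrative choices, not the project's actual constants:

```python
def weighted_avg(count, movie_mean, global_mean, threshold=50):
    """Shrink a movie's mean toward the global mean when it has few ratings.

    count:       C, number of ratings the movie received
    movie_mean:  R, the movie's raw average rating
    global_mean: M, the average rating over all movies
    threshold:   T, a hypothetical shrinkage constant (tune on your data)
    """
    c, t = count, threshold
    return (c / (c + t)) * movie_mean + (t / (c + t)) * global_mean

# One 5.0 rating barely moves the estimate away from the global mean...
print(weighted_avg(1, 5.0, 3.5))    # stays close to 3.5
# ...while 100 ratings averaging 4.5 carry far more weight.
print(weighted_avg(100, 4.5, 3.5))
```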
To make the item-to-item similarity more robust, I generated two supplementary matrices:
Using matrix multiplication (a dot product) between the user components and the movie components (one-hot encoded genre/year matrices), I generated arrays of predicted preference scores.
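One way this projection might look, with a toy mean-centered user x movie rating matrix and a hypothetical one-hot movie x genre matrix (the year matrix would be handled identically):

```python
import numpy as np

# Toy mean-centered user x movie rating matrix (0 = unrated): 4 users, 5 movies.
R = np.array([
    [ 1.0,  0.5,  0.0,  0.0, -1.0],
    [ 0.0,  1.0,  1.0,  0.0,  0.0],
    [-0.5,  0.0,  0.0,  1.0,  0.0],
    [ 0.0,  0.0, -1.0,  0.0,  1.0],
])

# One-hot movie x genre matrix (3 hypothetical genres).
G = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
], dtype=float)

# Dot product: each user's affinity toward each genre,
# accumulated from the movies they rated.
user_genre = R @ G          # shape (4 users, 3 genres)

# Project genre affinities back onto movies -> predicted preference scores.
pref = user_genre @ G.T     # shape (4 users, 5 movies)
print(pref.shape)
```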
Instead of solely relying on rating patterns, the final similarity matrix is a weighted combination of three distinct cosine similarity calculations:
- mat1: Rating-based Cosine Similarity
- mat2: Genre-based Cosine Similarity
- mat3: Year-based Cosine Similarity

Total Similarity = (w1 * mat1 + w2 * mat2 + w3 * mat3) / sum(weights)

This guarantees that two movies are considered similar not just because the same subset of users liked them, but also because they share thematic elements (genres) and cultural context (release era).
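A sketch of the three-matrix ensemble, using random stand-ins for the real feature matrices and a small pure-NumPy cosine-similarity helper; the weights shown are illustrative, not the tuned values:

```python
import numpy as np

def cosine_sim(X):
    # Row-normalize, then a dot product gives pairwise cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1, norms)
    return Xn @ Xn.T

rng = np.random.default_rng(42)
n_movies = 6

# Random stand-ins for the real movie-level feature matrices:
rating_mat = rng.normal(size=(n_movies, 10))                       # movie x user (mean-centered)
genre_mat = rng.integers(0, 2, size=(n_movies, 4)).astype(float)   # one-hot genres
year_mat = rng.integers(0, 2, size=(n_movies, 8)).astype(float)    # one-hot release eras

mat1, mat2, mat3 = map(cosine_sim, (rating_mat, genre_mat, year_mat))

def blend(w1, w2, w3):
    """Weighted ensemble of the three item-item similarity matrices."""
    return (w1 * mat1 + w2 * mat2 + w3 * mat3) / (w1 + w2 + w3)

total = blend(0.6, 0.25, 0.15)
print(total.shape)
```

Because each component matrix is symmetric, the blended matrix is too, and dividing by the weight sum keeps the scores on the familiar cosine scale.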
With the prediction mechanism in place (an ib_cf function that calculates a weighted average of similar items), it was time to rigorously test the system.
To find the optimal combination of weights (w1, w2, w3) for my three similarity matrices, I set up a Grid Search iterating through combinations from 0.0 to 1.0 (with a step size of 0.05).
For each iteration, the model predicted the ratings of a hidden test set (x_test split via train_test_split) and calculated the Root Mean Square Error (RMSE).
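The search loop might be sketched as follows; grid_search and the toy fake_predict are illustrative names standing in for the real ib_cf predictor, and the demo shrinks the step to 0.25 so it runs quickly:

```python
import itertools
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def grid_search(predict, x_test, y_test, step=0.05):
    """Try every (w1, w2, w3) on a 0.0-1.0 grid; keep the lowest-RMSE combo.

    `predict(x_test, w1, w2, w3)` stands in for the ib_cf prediction step.
    """
    best_weights, best_err = None, np.inf
    grid = np.round(np.arange(0.0, 1.0 + step, step), 2)
    for w1, w2, w3 in itertools.product(grid, repeat=3):
        if w1 + w2 + w3 == 0:   # skip the degenerate all-zero combination
            continue
        err = rmse(y_test, predict(x_test, w1, w2, w3))
        if err < best_err:
            best_weights, best_err = (w1, w2, w3), err
    return best_weights, best_err

# Toy check with a hypothetical predictor that is only exact when w1 == 1:
y = np.array([4.0, 3.0, 5.0])
fake_predict = lambda x, w1, w2, w3: y + (1 - w1)
weights, err = grid_search(fake_predict, None, y, step=0.25)
print(weights, err)
```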
A notable improvement in my 3rd phase of updates was introducing the median as an alternative to the mean when aggregating user preferences and global movie baselines. Comparing the RMSE of the median-based calculations versus the traditional mean-based calculations allowed me to pin down the most mathematically stable model for my explicit dataset.
By meticulously pre-processing the raw dataset, integrating rich metadata like genre and release year, and applying robust normalization techniques to mitigate rating scarcity and user bias, I was able to build a highly optimized Item-Based CF Recommender.
The modular setup of the final ensemble similarity matrix allows for extensive flexibility: the importance of rating behavior can be scaled against pure content attributes simply by adjusting the weights. Building this pipeline reinforced the idea that a recommendation model is only ever as good as its underlying data and the mathematical nuances used to represent it.