The first step in any data science pipeline is understanding and cleaning the raw data. I worked with four main datasets: movies.csv, ratings.csv, tags.csv, and links.csv. During initial data exploration, I encountered several integrity issues that needed to be resolved.
Upon checking for null values, I discovered that 34 movies carried the placeholder (no genres listed). Rather than dropping these rows and losing valuable data, I manually filled in the genres by cross-referencing the movies on TMDB/IMDb. For instance, Ben-Hur (2016) was updated to Action|Adventure|Drama, and Cosmos to Documentary. This ensures that our genre-based similarity calculations later on have complete information.
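A minimal sketch of this fill-in step, using a hypothetical three-row stand-in for movies.csv and a hand-built lookup table (the real pass covered all 34 affected titles):

```python
import pandas as pd

# Hypothetical mini-frame standing in for movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Ben-Hur (2016)", "Cosmos"],
    "genres": ["Adventure|Animation|Children", "(no genres listed)", "(no genres listed)"],
})

# Genres looked up manually on TMDB/IMDb, keyed by movieId.
manual_genres = {2: "Action|Adventure|Drama", 3: "Documentary"}

# Overwrite only the placeholder rows with the researched genres.
mask = movies["genres"] == "(no genres listed)"
movies.loc[mask, "genres"] = movies.loc[mask, "movieId"].map(manual_genres)

print(movies["genres"].tolist())
```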
Some movies appeared multiple times in the dataset under slightly different titles or with conflicting years (e.g., Emma (1996) or War of the Worlds (2005)).
To solve this, I merged each set of duplicates into a single record, taking the union of their genre sets (via list(set.union(*map(set, genres)))), and remapped the movieId of every dropped duplicate to the retained movieId across the ratings, tags, and links datasets. This guaranteed that all user interactions were accurately merged under a single movie entity.

I also noticed that some movies were missing release years in their titles (e.g., Babylon 5). Using the imdbId from the links.csv file, I manually imputed the correct years. After ensuring all titles had a year, I extracted the year into a separate year column using regex processing, leaving a thoroughly cleaned title_only column.
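The de-duplication, ID remapping, and year extraction could look like the sketch below, using toy stand-ins for movies.csv and ratings.csv. Column names follow MovieLens conventions; the exact grouping logic is my assumption about the implementation:

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [10, 11, 12],
    "title": ["Emma (1996)", "Emma (1996)", "Babylon 5 (1994)"],
    "genres": ["Comedy|Drama", "Romance", "Sci-Fi"],
})
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [11, 12], "rating": [4.0, 3.5]})

# 1. Merge duplicate titles: keep the first movieId, union the genre sets.
merged = (movies.groupby("title", as_index=False)
                .agg(movieId=("movieId", "first"),
                     genres=("genres", lambda g: "|".join(sorted(set("|".join(g).split("|")))))))

# 2. Remap every dropped movieId to the retained one in the interaction tables.
id_map = movies.merge(merged[["title", "movieId"]], on="title", suffixes=("_old", ""))
remap = dict(zip(id_map["movieId_old"], id_map["movieId"]))
ratings["movieId"] = ratings["movieId"].map(remap)

# 3. Pull the year out of the title with a regex, leaving a clean title_only column.
merged["year"] = merged["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
merged["title_only"] = merged["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(merged[["movieId", "title_only", "year", "genres"]])
print(ratings)
```

Rating, tag, and link rows for the dropped duplicate (movieId 11) now all point at the surviving record.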
With the data thoroughly cleaned, it was exported as refined CSVs to be used in the core recommendation engine.
The heart of an Item-Based Collaborative Filtering model lies in accurately measuring the similarity between items (movies). A standard approach uses only user ratings to compute cosine similarity, but I improved this by including genre and release-year patterns.
Users have different rating behaviors—some are overly generous (giving everything a 4 or 5), while others are highly critical. To remove this user bias, I applied Z-score standardization (user mean centering plus scaling): I subtracted each user's average rating from their individual ratings and then divided by that user's standard deviation.
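A sketch of this standardization on a toy ratings frame; the small epsilon guard is my addition, protecting against users who rate everything identically:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating": [5.0, 4.0, 4.5, 2.0, 1.0, 3.0],
})

# Per-user mean and standard deviation.
stats = ratings.groupby("userId")["rating"].agg(["mean", "std"])
ratings = ratings.join(stats, on="userId")

# Z-score: subtract the user's mean, divide by the user's std.
ratings["z"] = (ratings["rating"] - ratings["mean"]) / (ratings["std"] + 1e-9)

print(ratings[["userId", "movieId", "z"]])
```

After this step, a generous user's habitual 5s and a harsh user's habitual 3s land on the same scale.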
A major challenge in recommendation systems is the "long-tail" problem: a movie with a single 5.0 rating mathematically appears better than a popular movie with a hundred ratings averaging 4.5.
To address this, I applied a weighted rating formula:
Weighted Avg = (C / (C + T)) * R + (T / (C + T)) * M

where C is the movie's rating count, R its raw mean rating, M the global mean rating, and T a shrinkage threshold. By pulling the average of movies with very few ratings towards the global mean, the model avoids inaccurately boosting obscure titles over universally acclaimed ones. I also experimented with using the median instead of the mean to mitigate the impact of outliers.
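The shrinkage formula translates directly into code; the function name weighted_avg and the threshold of 50 below are illustrative choices, not the project's actual constants:

```python
def weighted_avg(count, movie_mean, global_mean, threshold=50):
    """Shrink a movie's mean toward the global mean when it has few ratings.

    count:       C, number of ratings the movie received
    movie_mean:  R, the movie's raw average rating
    global_mean: M, the average rating over all movies
    threshold:   T, a hypothetical shrinkage constant (tune on your data)
    """
    c, t = count, threshold
    return (c / (c + t)) * movie_mean + (t / (c + t)) * global_mean

# One 5.0 rating barely moves the estimate away from the global mean...
print(weighted_avg(1, 5.0, 3.5))    # stays close to 3.5
# ...while 100 ratings averaging 4.5 carry far more weight.
print(weighted_avg(100, 4.5, 3.5))
```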
To make the item-to-item similarity more robust, I generated two supplementary matrices:
Using matrix multiplication (a dot product) between the user components and the movie components (one-hot encoded genre/year matrices), I generated arrays of predicted preference scores.
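One way this projection might look, with a toy mean-centered user x movie rating matrix and a hypothetical one-hot movie x genre matrix (the year matrix would be handled identically):

```python
import numpy as np

# Toy mean-centered user x movie rating matrix (0 = unrated): 4 users, 5 movies.
R = np.array([
    [ 1.0,  0.5,  0.0,  0.0, -1.0],
    [ 0.0,  1.0,  1.0,  0.0,  0.0],
    [-0.5,  0.0,  0.0,  1.0,  0.0],
    [ 0.0,  0.0, -1.0,  0.0,  1.0],
])

# One-hot movie x genre matrix (3 hypothetical genres).
G = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
], dtype=float)

# Dot product: each user's affinity toward each genre,
# accumulated from the movies they rated.
user_genre = R @ G          # shape (4 users, 3 genres)

# Project genre affinities back onto movies -> predicted preference scores.
pref = user_genre @ G.T     # shape (4 users, 5 movies)
print(pref.shape)
```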
Instead of solely relying on rating patterns, the final similarity matrix is a weighted combination of three distinct cosine similarity calculations:
- mat1: Rating-based Cosine Similarity
- mat2: Genre-based Cosine Similarity
- mat3: Year-based Cosine Similarity

Total Similarity = (w1 * mat1 + w2 * mat2 + w3 * mat3) / sum(weights)

This guarantees that two movies are considered similar not just because the same subset of users liked them, but also because they share thematic elements (genres) and cultural context (release era).
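A sketch of the three-matrix ensemble, using random stand-ins for the real feature matrices and a small pure-NumPy cosine-similarity helper; the weights shown are illustrative, not the tuned values:

```python
import numpy as np

def cosine_sim(X):
    # Row-normalize, then a dot product gives pairwise cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1, norms)
    return Xn @ Xn.T

rng = np.random.default_rng(42)
n_movies = 6

# Random stand-ins for the real movie-level feature matrices:
rating_mat = rng.normal(size=(n_movies, 10))                       # movie x user (mean-centered)
genre_mat = rng.integers(0, 2, size=(n_movies, 4)).astype(float)   # one-hot genres
year_mat = rng.integers(0, 2, size=(n_movies, 8)).astype(float)    # one-hot release eras

mat1, mat2, mat3 = map(cosine_sim, (rating_mat, genre_mat, year_mat))

def blend(w1, w2, w3):
    """Weighted ensemble of the three item-item similarity matrices."""
    return (w1 * mat1 + w2 * mat2 + w3 * mat3) / (w1 + w2 + w3)

total = blend(0.6, 0.25, 0.15)
print(total.shape)
```

Because each component matrix is symmetric, the blended matrix is too, and dividing by the weight sum keeps the scores on the familiar cosine scale.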
With the prediction mechanism in place (an ib_cf function that calculates a weighted average of similar items), it was time to rigorously test the system.
To find the optimal combination of weights (w1, w2, w3) for my three similarity matrices, I set up a Grid Search iterating through combinations from 0.0 to 1.0 (with a step size of 0.05).
For each iteration, the model predicted the ratings of a hidden test set (x_test split via train_test_split) and calculated the Root Mean Square Error (RMSE).
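The search loop might be sketched as follows; grid_search and the toy fake_predict are illustrative names standing in for the real ib_cf predictor, and the demo shrinks the step to 0.25 so it runs quickly:

```python
import itertools
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def grid_search(predict, x_test, y_test, step=0.05):
    """Try every (w1, w2, w3) on a 0.0-1.0 grid; keep the lowest-RMSE combo.

    `predict(x_test, w1, w2, w3)` stands in for the ib_cf prediction step.
    """
    best_weights, best_err = None, np.inf
    grid = np.round(np.arange(0.0, 1.0 + step, step), 2)
    for w1, w2, w3 in itertools.product(grid, repeat=3):
        if w1 + w2 + w3 == 0:   # skip the degenerate all-zero combination
            continue
        err = rmse(y_test, predict(x_test, w1, w2, w3))
        if err < best_err:
            best_weights, best_err = (w1, w2, w3), err
    return best_weights, best_err

# Toy check with a hypothetical predictor that is only exact when w1 == 1:
y = np.array([4.0, 3.0, 5.0])
fake_predict = lambda x, w1, w2, w3: y + (1 - w1)
weights, err = grid_search(fake_predict, None, y, step=0.25)
print(weights, err)
```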
A notable improvement in my 3rd phase of updates was introducing the median as an alternative to the mean when aggregating user preferences and global movie baselines. Comparing the RMSE of the median-based calculations versus the traditional mean-based calculations allowed me to pin down the most mathematically stable model for my explicit dataset.
By meticulously pre-processing the raw dataset, integrating rich metadata like genre and release year, and applying robust normalization techniques to mitigate rating scarcity and user bias, I was able to build a highly optimized Item-Based CF Recommender.
The modular setup of the final ensemble similarity matrix allows for extensive flexibility: the importance of rating behavior can be scaled against pure content attributes simply by adjusting the weights. Building this pipeline reinforced the idea that a recommendation model is only ever as good as its underlying data and the mathematical nuances used to represent it.