TL;DR
Compared to The Godfather, the following movie scripts have these similarity scores:
Similarity with godfather2.txt: 0.9467
Similarity with gremlins.txt: 0.6318
Similarity with godfather3.txt: 0.8355
Similarity with marypoppins.txt: 0.5978
Of the scripts in our sample, the two most similar to The Godfather are The Godfather Part 2 (1974) and The Godfather Part 3 (1990), while Mary Poppins (1964) is the least similar.
Introduction
In this article we will use scikit-learn to determine how similar several movies are to The Godfather (1972). We will use a technique called document similarity to compare the scripts of our sample movies to that of our source movie. This is probably not the best way to do it, since it doesn’t actually read what’s happening in the script but rather looks at the words used in writing it. More importantly, we do not provide a definition of what ‘similar’ means in this specific context. Concretely, we weight the words in each script with TF-IDF (Term Frequency-Inverse Document Frequency) and compare the resulting vectors.
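To make the limitation above concrete, here is a minimal sketch in plain Python (not the scikit-learn pipeline used later) that compares sentences by raw word counts. The helper names `word_counts` and `cosine` are hypothetical, chosen only for this illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Lowercase the text and count each word; word order is discarded
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two word-count dictionaries
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = word_counts("The dog bit the man")
second = word_counts("The man bit the dog")
unrelated = word_counts("A nanny arrives by umbrella")

same_words_score = cosine(first, second)
no_overlap_score = cosine(first, unrelated)
print(f"{same_words_score:.4f} {no_overlap_score:.4f}")
```

The first two sentences describe opposite events but use exactly the same words, so their score is 1 (up to floating-point rounding). The third sentence shares no words with the first and scores 0.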
Problem statement and methodology
We want to determine how similar each movie in a sample is to our source movie and express that similarity as a number. As input we use each movie’s script, and as the result we use the cosine similarity between the TF-IDF vectors of the two scripts.
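For reference, the score can be written out as follows, assuming scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed inverse document frequency, L2 normalization). The symbols are notation introduced here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$, and $\mathbf{v}_d$ is the vector of TF-IDF values for $d$.

```latex
\begin{align*}
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d)\cdot\mathrm{idf}(t) \\
\mathrm{sim}(d_1, d_2) &= \frac{\mathbf{v}_{d_1}\cdot\mathbf{v}_{d_2}}{\lVert\mathbf{v}_{d_1}\rVert\,\lVert\mathbf{v}_{d_2}\rVert}
\end{align*}
```

Because every TF-IDF value is non-negative, the similarity falls between 0 (no shared terms) and 1 (identical term proportions).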
Alternative methods
Another way of performing this analysis is to use the k-nearest neighbors algorithm.
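As a rough sketch of that alternative, the snippet below uses scikit-learn's `NearestNeighbors` with a cosine metric on TF-IDF vectors. The short strings are made-up stand-ins for full scripts, not excerpts from them, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up stand-ins for scripts; real input would be the full script text
query = "the godfather protects the family business"
candidates = [
    "the family gathers around the godfather",
    "a nanny flies over london with an umbrella",
]

# Fit TF-IDF on all texts so the query and candidates share one vocabulary
tfidf = TfidfVectorizer().fit_transform([query, *candidates])

# Index the candidates and look up the neighbors of the query by cosine distance
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(tfidf[1:])
distances, indices = knn.kneighbors(tfidf[0:1])

for dist, idx in zip(distances[0], indices[0]):
    # Cosine distance is 1 minus cosine similarity
    print(f"{candidates[idx]!r}: similarity {1 - dist:.4f}")
```

For a single source movie this produces the same ordering as sorting by cosine similarity. The neighbor search would mostly pay off with a much larger collection of scripts.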
Setting up the environment
Since we will use scikit-learn for our analysis, let’s first set up a Python virtual environment to install our packages into.
mkdir movie-analysis
cd movie-analysis
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn
Obtaining our source data
Since our source movie is The Godfather, we will use the movie’s script from the Internet Movie Script Database, saved as godfather.txt.
The scripts for our sample movies are available at:
The Godfather Part 2 (godfather2.txt)
The Godfather Part 3 (godfather3.txt)
Gremlins (gremlins.txt)
Mary Poppins (marypoppins.txt)
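Before running the analysis it can help to confirm that all five scripts are saved where the code expects them. The following is a small sketch. The helper name `missing_scripts` is hypothetical, and it assumes the files sit in the current working directory under the names listed above, plus godfather.txt for the source movie.

```python
from pathlib import Path

# File names used by the analysis script; godfather.txt is the source movie
EXPECTED = [
    "godfather.txt",
    "godfather2.txt",
    "godfather3.txt",
    "gremlins.txt",
    "marypoppins.txt",
]


def missing_scripts(folder=".", names=EXPECTED):
    # Return the expected file names that are not present in the folder
    return [name for name in names if not (Path(folder) / name).is_file()]


missing = missing_scripts()
if missing:
    print("Missing scripts:", ", ".join(missing))
else:
    print("All scripts found.")
```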
The Python script
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reads a file and returns its contents
def get_contents(fn):
    with open(fn) as file:
        return file.read()

# Load the source script (The Godfather) and the sample scripts to compare against it
research_focus = get_contents("godfather.txt")
fns = ["godfather2.txt", "gremlins.txt", "godfather3.txt", "marypoppins.txt"]
abstracts = [get_contents(fn) for fn in fns]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the text into vectors; row 0 is the source script
documents = [research_focus, *abstracts]
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate the cosine similarity between the source (row 0) and every other row
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

# Print one score per sample script, in the same order as fns
for idx, fn in enumerate(fns):
    print(f"Similarity with {fn}: {similarities[0][idx]:.4f}")
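One variation worth trying, since the comparison is driven by the words used, is to drop very common English words before vectorizing. `TfidfVectorizer` accepts `stop_words="english"` for this. The sketch below uses two made-up sentences rather than the real scripts, and the helper name `similarity` is hypothetical. It only illustrates how shared filler words can raise a score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences that share only common filler words
docs = [
    "the godfather is in the garden with the family",
    "the nanny is in the park with the children",
]


def similarity(texts, **vectorizer_options):
    # Cosine similarity between the first two texts under the given options
    matrix = TfidfVectorizer(**vectorizer_options).fit_transform(texts)
    return cosine_similarity(matrix[0:1], matrix[1:2])[0][0]


with_common_words = similarity(docs)
without_common_words = similarity(docs, stop_words="english")
print(f"default: {with_common_words:.4f}")
print(f"stop words removed: {without_common_words:.4f}")
```

Whether this would change the ranking for the real scripts is not something the numbers in this article can tell us. The analysis would have to be rerun with that option.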
Output
When the script is run, we get the following output:
Similarity with godfather2.txt: 0.9467
Similarity with gremlins.txt: 0.6318
Similarity with godfather3.txt: 0.8355
Similarity with marypoppins.txt: 0.5978
We can see that the most similar script is The Godfather Part 2, while Mary Poppins is the least similar script in our sample.
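To read the results as a ranking rather than in file order, the scores can be sorted. This sketch copies the four values printed above into a dictionary by hand. In the script itself the same idea would apply to `fns` and `similarities[0]`.

```python
# Scores copied from the output above
scores = {
    "godfather2.txt": 0.9467,
    "gremlins.txt": 0.6318,
    "godfather3.txt": 0.8355,
    "marypoppins.txt": 0.5978,
}

# Sort from most to least similar
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```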