TL;DR

Compared to The Godfather (1972), the movie scripts in our sample have the following similarity scores:

Similarity with godfather2.txt: 0.9467
Similarity with gremlins.txt: 0.6318
Similarity with godfather3.txt: 0.8355
Similarity with marypoppins.txt: 0.5978

From our selection of movie scripts we can see that the two scripts most similar to The Godfather are The Godfather Part 2 (1974) and The Godfather Part 3 (1990), while Mary Poppins (1964) is the least similar.

Introduction

In this article we will use scikit-learn to determine how similar several movies are to The Godfather (1972). We will use a technique called document similarity to compare the scripts of our sample movies to that of our source movie. This is probably not the best way to do it, since it doesn’t actually follow what’s happening in the script but only looks at the words used to write it. More importantly, we do not provide a definition of what ‘similar’ means in this context. Concretely, we weight the words in each script using TF-IDF (Term Frequency-Inverse Document Frequency) and compare the resulting vectors using cosine similarity.
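To get a feel for what TF-IDF weighting does, here is a minimal sketch on three made-up one-line "scripts" (the sentences are illustrative assumptions, not taken from the real scripts). A word that appears in every document, such as "the", should receive a lower weight than a word that appears in only one of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented purely for illustration
docs = [
    "the don runs a family business",
    "the nanny sings to the children",
    "the creature hates bright light",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Collect the non-zero TF-IDF weights of the first document
first_row = matrix[0].toarray()[0]
weights = {
    word: first_row[col]
    for word, col in vectorizer.vocabulary_.items()
    if first_row[col] > 0
}

# Print the words of the first document, heaviest first
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {weight:.3f}")
```

With the default settings, single-character tokens such as "a" are ignored, so they do not show up in the weights.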

Problem statement and methodology

We want to determine how similar each movie in our sample is to our source movie and express that similarity as a number. As input we use each movie’s script, and as the result we use the cosine similarity between the TF-IDF vectors of the two scripts.
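One common way to turn a comparison of two texts into a single number is cosine similarity. The sketch below is a simplified, standard-library-only illustration that uses raw word counts instead of TF-IDF weights; the function name `cosine` and the sample sentences are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts, based on raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that both texts share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has nothing to compare
    return dot / (norm_a * norm_b)


print(cosine("the family comes first", "the family comes first"))
print(cosine("the family comes first", "sugar and medicine"))
```

Identical texts score (approximately) 1, and texts with no words in common score 0. Every other pair lands somewhere in between.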

Alternative methods

Another way of performing this analysis is to use the k-nearest neighbors algorithm.
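As a rough illustration of the k-nearest neighbors alternative, the sketch below uses scikit-learn's NearestNeighbors with cosine distance on TF-IDF vectors. The short texts are invented stand-ins for real scripts, and this is only one possible way to set it up, not a method used elsewhere in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for the source script and the sample scripts
source = "the don protects the family business"
candidates = [
    "the family business needs a new don",
    "a spoonful of sugar helps the medicine go down",
    "never feed the creature after midnight",
]

# Vectorize everything together so all texts share one vocabulary
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([source, *candidates])

# Fit on the candidates only, then ask for the 2 closest to the source
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(matrix[1:])
distances, indices = knn.kneighbors(matrix[0:1])

nearest = [candidates[i] for i in indices[0]]
for text, distance in zip(nearest, distances[0]):
    # Cosine distance is 1 minus cosine similarity, so smaller means closer
    print(f"{distance:.4f}  {text}")
```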

Setting up the environment

Since we will use scikit-learn to do our analysis, let’s first set up a Python virtual environment to install our packages into.

mkdir movie-analysis
cd movie-analysis
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn
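To confirm the installation worked, a quick check along these lines can be run with the virtual environment active. It only imports the pieces the analysis needs and prints the installed version.

```python
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# If these imports succeed, the environment is ready for the analysis
print("scikit-learn version:", sklearn.__version__)
```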

Obtaining our source data

Since our source movie is The Godfather we will use the movie’s script at the Internet Movie Script Database.

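Our scripts are available at the following pages (the file name we save each one under is shown in parentheses):

The Godfather (godfather.txt)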

The Godfather Part 2 (godfather2.txt)

The Godfather Part 3 (godfather3.txt)

Gremlins (gremlins.txt)

Mary Poppins (marypoppins.txt)
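Before running the analysis it can help to check that all five script files are in place. The helper below is a hypothetical convenience (the name `missing_scripts` is an assumption, not part of any library) and assumes the files are saved in the current folder under the names listed above.

```python
from pathlib import Path

# File names taken from the list of scripts above
EXPECTED_SCRIPTS = (
    "godfather.txt",
    "godfather2.txt",
    "godfather3.txt",
    "gremlins.txt",
    "marypoppins.txt",
)


def missing_scripts(folder=".", names=EXPECTED_SCRIPTS):
    """Return the expected script files that are not present in folder."""
    return [name for name in names if not (Path(folder) / name).is_file()]


missing = missing_scripts()
if missing:
    print("Missing script files:", ", ".join(missing))
else:
    print("All script files found.")
```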

The Python script

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reads a file and returns its contents
def get_contents(fn):
        with open(fn) as file:
                return file.read()

# Define the research focus and sample movie names

research_focus = get_contents("godfather.txt")

fns = ["godfather2.txt","gremlins.txt","godfather3.txt","marypoppins.txt"]
abstracts = []

for i in fns:
        abstracts.append(get_contents(i))

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the text into vectors
documents = [research_focus, *abstracts]
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate similarity between the focus (row 0) and every sample script
# We compare row 0 against all remaining rows (1 onward)
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

for idx,i in enumerate(fns):
        print(f"Similarity with {i}: {similarities[0][idx]:.4f}")

Output

When the script is run we get the following output:

Similarity with godfather2.txt: 0.9467
Similarity with gremlins.txt: 0.6318
Similarity with godfather3.txt: 0.8355
Similarity with marypoppins.txt: 0.5978

We can see that the most similar script is The Godfather Part 2, while Mary Poppins is the least similar script in our sample.
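The output is printed in the order the files were listed, not in order of similarity. A small sketch like the following, which hard-codes the scores reported above, ranks them from most to least similar. The variable names are assumptions for this example.

```python
# Scores copied from the output above
scores = {
    "godfather2.txt": 0.9467,
    "gremlins.txt": 0.6318,
    "godfather3.txt": 0.8355,
    "marypoppins.txt": 0.5978,
}

# Sort by score, highest first
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for position, (name, score) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score:.4f}")
```

In the original script the same idea could be applied by pairing each file name with its score before sorting.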