Predicting Chess Matches with PySpark

Khyatee Desai
5 min read · Nov 23, 2020

With chess on my mind thanks to Netflix’s The Queen’s Gambit, this week I used PySpark to build a multi-label classifier to predict chess match outcomes from the Chess Game Dataset.


Configuring your local environment to use Apache Spark is arguably the most arduous part of this project. First you must install Docker, then download the PySpark container image specialized for Jupyter Notebooks. If you already use Docker and your installation is up to date, setup will likely be quick; unfortunately, my configuration process involved not only downloading Docker, but also undergoing not one but two macOS updates before Docker would even run on my laptop. Below is a general overview of PySpark configuration:

Install Docker and PySpark

Docker is a service used to deliver software in packages called containers, which contain all of the configurations and libraries needed to make the software run. PySpark for Jupyter Notebooks is available for download as a Docker container, so the first step of this process involves downloading Docker to your local machine.

After installing Docker to my Mac, I used terminal commands to pull down the PySpark Docker image from a remote repository, and run the container:

docker pull jupyter/pyspark-notebook
docker run -it jupyter/pyspark-notebook:latest /bin/bash

To verify successful installation of PySpark, I attempted to import it into a Jupyter Notebook, instantiate a Spark Context object, and run a simple command:

The expected output here is something along the lines of [729, 375, 601, 610, 695], which indicates that PySpark has been installed successfully!

So far, we have created a SparkContext, which is the primary entry point for Spark. A SparkContext represents the connection to a Spark cluster, and is used to perform operations on that cluster.

In order to utilize Spark’s machine learning capabilities, we will be using a construct called a Spark Session, which is a higher-level abstraction of a Spark Context, and effectively functions as a wrapper for the object. After it is instantiated, the Spark Session can then be used to construct a Spark DataFrame:

Data Inspection

We are finally ready to begin working with some data! Let’s begin by inspecting our chess dataset, which contains features for 20,000 chess games, including the white and black players’ rankings, the opening moves, start and end times of each game, the total number of moves made in the game, and more.

We can use df.head(), much as we would when previewing a Pandas DataFrame, to take a look at the first row of our table. The output, however, is not as neatly oriented:

The first row of our Spark DataFrame

We can inspect column headers and datatypes with df.dtypes:

DataFrame columns and datatypes

and we can view distinct values using'winner').distinct().collect():

We will use this dataset to build a classifier that determines the outcome of chess games, out of three possibilities: white, black, or draw.

Feature Engineering

We will begin the modeling process by creating a new feature from the start and stop times of each game to capture game duration. In Pandas, we could use a simple subtraction operation between two columns, which is automatically vectorized, and assign it to our new column name. With PySpark, we use the df.withColumn() operation to accomplish a similar process:

df_new = df.withColumn("duration", df['last_move_at'] - df['created_at'])

Dummy Variables

As can be seen above, we have a number of string variables (victory_status, increment_code, and opening_name) that must be converted to dummy variables before the modeling process. This is accomplished using a StringIndexer() object, imported from pyspark.ml.feature, which converts each string into a numeric index. The three string indexers (one for each feature that must be converted to dummies) are fit within a Pipeline, which streamlines the process of creating indexes for all three variables.

The indexed columns are then inputted into PySpark’s version of OneHotEncoder:

The Spark OneHotEncoderEstimator transforms each indexed feature into a SparseVector. For example, the feature opening_name_index is transformed by the OHE into opening_name_dummy, which is outputted as opening_name_dummy=SparseVector(1476, {234: 1.0})


To begin the modeling process, we first specify our features and target variable, then transform our trimmed DataFrame, with its engineered features, into a vectorized DataFrame to be used in our models. The VectorAssembler is a transformer that combines a list of columns into a single vector column, producing the single feature vector used to train ML models.

To create a train/test split, we can simply perform a randomSplit() on the vectorized data:

train_data, test_data = vector_df.randomSplit([.75, .25])

We instantiate a basic Random Forest Classifier model, specifying our feature columns, labelCol (the target variable), as well as optional hyperparameters, and fit the model to our training data:

We then use our trained model to make predictions on the testing data:

predictions = forest_model.transform(test_data).select('winner', 'prediction')
Actual labels alongside model predictions

Model Evaluation

PySpark has a MulticlassClassificationEvaluator class that can be used to evaluate your model by comparing its predictions to the expected labels. After instantiating my evaluator, I used it to derive an F1 score and accuracy score for my model:

Next, I wanted to inspect feature importance, to see which features in particular had the greatest influence over predicting a chess match outcome:

Feature importance from our classifier

Not surprisingly, a player’s rating carries a lot of weight in determining who is going to win a chess match!

Final Thoughts

To carry this project forward, I could incorporate more feature engineering, hyperparameter tuning with Spark ML’s equivalent of Grid Search, or even explore different types of models.

Through this exploration, I found that after climbing a short learning curve, the syntax and modeling process of Spark ML are quite intuitive for data scientists who are familiar with data manipulation in Pandas and modeling with Scikit-Learn. I would recommend all data scientists take the time to learn more about machine learning with Apache Spark.

Want to explore the full project? Check out the GitHub repo here!
