Predicting Chess Matches with PySpark


Configuring your local environment to use Apache Spark is arguably the most arduous part of this project. First you must install Docker, and then download the PySpark container image built for Jupyter Notebooks. If you already use Docker and your installation is up to date, setup will likely go quickly. Mine did not: beyond installing Docker, I had to sit through two macOS updates before Docker would run on my laptop. Below is a general overview of PySpark configuration:

Install Docker and PySpark

Docker is a platform for delivering software in packages called containers, which bundle all of the configuration and libraries the software needs to run. PySpark for Jupyter Notebooks is available as a Docker image, so the first step is to install Docker on your local machine. Once Docker is installed, pull the image and start a container:

docker pull jupyter/pyspark-notebook
docker run -it jupyter/pyspark-notebook:latest /bin/bash

Data Inspection

We are finally ready to begin working with some data! Let’s begin by inspecting our chess dataset. The chess dataset contains features about 20,000 chess games, including the white and black players’ rankings, the opening moves, start and end times of each game, total number of moves made in the game, and more.

The first row of our Spark DataFrame
DataFrame columns and datatypes

Feature Engineering

We will begin the modeling process by creating a new feature, the game duration, from the start and end times of each game. In Pandas, we can simply subtract one column from another, which is automatically vectorized, and assign the result to a new column name. With PySpark, we use the df.withColumn() operation to accomplish a similar process:

df_new = df.withColumn("duration", df['last_move_at'] - df['created_at'])

Dummy Variables

As can be seen above, we have a number of string variables (victory_status, increment_code, and opening_name) that must be converted to numeric form before the modeling process. This is accomplished using StringIndexer() objects, imported from pyspark.ml.feature, each of which maps the distinct strings in a column to numeric indexes. The three string indexers (one for each feature that must be converted) are fit within a Pipeline, which streamlines the process of creating indexes for all three variables.


To begin the modeling process, we first specify our features and target variable, and transform our trimmed dataframe with our engineered features into a vectorized DataFrame to be used in our models. The VectorAssembler is a transformer that combines a list of columns into a single vector column, so that all of your features end up in one feature vector, which is then used to train ML models.

train_data, test_data = vector_df.randomSplit([.75, .25])
predictions = forest_model.transform(test_data).select('winner', 'prediction')
Actual labels alongside model predictions

Model Evaluation

PySpark has a MulticlassClassificationEvaluator class that can be used to evaluate your model by comparing its predictions to the actual labels. After instantiating my evaluator, I used it to derive an F1 score and accuracy score for my model:

Feature importance from our classifier

Final Thoughts

To carry this project forward, I could begin to incorporate more feature engineering, hyperparameter tuning with Spark ML’s equivalent for Grid Search, or even explore different types of models.



Khyatee Desai



music lover, picture taker, aspiring data scientist based in nyc