Want to install Python, Spark and PySpark for some local data pipeline testing? Look no further… There is also a video on YouTube accompanying this blog. Multimedia has hit the hungovercoders!!!

Pre-Requisites

For testing you’ll need an IDE installed; I tend to use Visual Studio Code.

Python Install

Install Python by first downloading it from here. Choose the latest version and then the file for your operating system (in this case Windows 64-bit).

Python File Download

When you run the installer, remember to tick “Add Python XXX to PATH” to ensure this is set in your environment variables.

Python Install

Click install now to complete. Then check your environment variables to confirm Python has been added to your PATH.

Python Environment Variables
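If you want an extra sanity check beyond the environment variables screen, open a new command prompt and ask Python for its version. The exact numbers printed will depend on the release you downloaded.

```bash
python --version
pip --version
```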

Java Install

Go to the Java downloads page and choose download and install.

Java Download

Choose the manual download page. On this page choose Windows Offline (64-bit).

Java Manual Download

Install Java by double clicking the executable once downloaded. Once completed, confirm that Java has been installed by checking the installation location. Then finally add JAVA_HOME to your user environment variables, with its value set to the Java installation path. For me the path is on my C drive at C:\Program Files\Java\jre1.8.0_271.

Java Environment Variable
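Again, a quick check from a new command prompt confirms the install took. This assumes the Windows installer added Java to your PATH, which it normally does; note that Java prints its version details to stderr rather than stdout.

```bash
java -version
```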

Spark Install

Download Spark from here.

Spark Download

At the top of the download page, choose the suggested site for the tar file download. This will take a little while to download as it’s quite big.

Note the name of the download as it will tell you which version of Hadoop winutils you need to install later. In the instance below it is Hadoop 3.

Spark Suggested Download

Once downloaded, unzip the tar file twice (once for the gzip layer, then once for the tar inside)…

Copy all of the contents of the unzipped folder and paste them into a location on your C: drive called C:\Spark.

Spark Copied
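As an aside, if you’d rather skip the double unzip, recent versions of Windows 10 and 11 ship with a tar command that handles both layers in one go. The filename below is a placeholder; substitute the actual name of the file you downloaded, then copy the contents of the extracted folder into C:\Spark as above.

```bash
# extracts a spark-<version>-bin-hadoop3 folder into the current directory
tar -xf spark-<version>-bin-hadoop3.tgz
```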

Then add SPARK_HOME to your system environment variables with the value C:\Spark.

Spark System Variables

Finally, add a new entry to the Path system environment variable with the value %SPARK_HOME%\bin.

Spark Path Environment Variable
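To confirm the variable is visible, open a new command prompt (existing ones won’t pick up the change) and echo it back.

```bash
echo %SPARK_HOME%
# should print C:\Spark
```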

Hadoop WinUtils Install

Go to the Hadoop winutils GitHub repository.

Download the appropriate winutils.exe. You will know which is the correct one based on the version of the tar file you downloaded for Spark.

Hadoop Download

Create a C:\Hadoop\bin folder on your machine and add the winutils.exe to this.

Hadoop Copied

Then add HADOOP_HOME to your system environment variables with the value C:\Hadoop.

Hadoop System Variables

Finally, add a new entry to the Path system environment variable with the value %HADOOP_HOME%\bin.

Hadoop Path Environment Variable
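As with Spark, a fresh command prompt lets you confirm both the variable and that the Path entry resolves. Running winutils.exe with no arguments should simply print its usage text, which is enough to prove it was found.

```bash
echo %HADOOP_HOME%
winutils.exe
```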

Confirm Spark

Open a command prompt with admin privileges and run “spark-shell”. After a little while you should see the letters SPARK come up as per below.

Spark CMD
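If you just want to confirm the install without waiting for the full shell to spin up, the version flag prints the version banner and exits. This assumes the %SPARK_HOME%\bin Path entry above is working.

```bash
spark-submit --version
```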

While the shell is running, you can also navigate to the localhost Spark UI at http://localhost:4040/.

Spark UI

Pyspark Install

To install PySpark, open up Visual Studio Code. Open a terminal and run the following commands to create a virtual environment and pip install PySpark. Ensure the Python version in the first command matches the one you installed originally.

```bash
py -3.11 -m venv venv
venv\scripts\activate
pip install pyspark
pyspark
```

In the pyspark shell, just set up a simple dataframe and display it to confirm functionality.

```python
data = [('Tiny Rebel','Stay Puft'),('Crafty Devil','Mike Rayer')]
columns = ["brewery","beer"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
```
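If you want to see Spark actually do some work rather than just echo the data back, a quick filter on the same dataframe (the brewery name is just from the sample data above) proves transformations run locally too.

```python
# filter the dataframe down to one brewery and show the result
df.filter(df.brewery == 'Tiny Rebel').show()
```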

Then deactivate your environment.

```bash
venv\scripts\deactivate
```

You have now successfully set up a local environment where you can run Python, Spark and PySpark! Time for a drink…