Apache Spark on Windows 10! A Match Made in Heaven (after reading a million websites)

Uttasarga Singh
4 min read · Nov 11, 2020

I was working for Hartford Steam Boiler in the summer and fall of 2020 (remotely, due to COVID-19) as a Machine Learning Intern. I got the chance to work on HSB's large-scale machine learning projects with state-of-the-art big data technologies like PySpark, SparkR, Hadoop, and Scala, while developing a prediction model (a Bayesian hierarchical model) using a Gibbs sampler (R-JAGS) and writing around a thousand lines of R code. I was able to compute the lengthier Markov chain Monte Carlo (MCMC) chains and deploy the model and its predictions into production. I was amazed by the computational power Spark brings, and to be honest, I enjoyed using it for distributed computing. To anybody thinking about learning Spark for developing machine learning models: don't fear adding one more language to your already full bucket of skills; we have to be versatile, and Spark won't go away soon. It is designed to stick around, and the new libraries being developed for it are worth the wait. Enough of my rant; let's get to the exciting part ahead.

So I successfully installed Apache Spark on a WINDOWS machine! Yes, you read that right. I installed Apache Spark on my Windows 10 machine after first reading what felt like a million websites, each giving an overview of what needs to be done, but not a detailed answer as to why it needs to be done and what does not need to be done. Please read this patiently; I would request your 100% attention on the same.

First, I will ask you to install Anaconda3-4.2.0-Windows-x86_64, because it ships Python version 3.5.6.

(Screenshot: Python version compatible with Spark)

Now, after downloading the Anaconda setup file from https://repo.anaconda.com/archive/ and installing it on your computer, follow these steps in the order written.
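Before jumping in, here is a quick sanity check (my own addition, not one of the original steps): open the Anaconda Prompt, start python, and confirm the interpreter version, since Spark 2.4 needs Python 3.4 or newer.

# Sanity check: confirm the Anaconda interpreter version before installing Spark.
# Spark 2.4.x supports Python 3.4+, so the 3.5.x this Anaconda release ships is fine.
import sys

print(sys.version)  # expect something like "3.5.6 |Anaconda ..."
assert sys.version_info >= (3, 4), "Spark 2.4 needs Python 3.4 or newer"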

  1. JAVA SETUP:

i] Install Java 8 from the Oracle website, choosing the version suitable for your system (Windows x86 or x64). Run the setup file and Java will be installed on your system.

ii] Copy the jdk1.8.0_271 folder from the original Java installation path and place it directly under the C drive:

C:\jdk1.8.0_271

iii] This step is significant because the default installation path (C:\Program Files\Java\jdk1.8.0_271) contains a space, and Spark's Windows launch scripts do not handle spaces in paths well; a space-free path avoids that whole class of errors. A quick way to confirm the copy works is shown below.
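To confirm the copied JDK is intact, you can call java.exe directly by its full path (a quick check of my own; the PATH variable is only configured in a later step, so the bare java command may not work yet):

# Quick check that the JDK copied under C:\ works: invoke java.exe by full path.
import subprocess

subprocess.run([r"C:\jdk1.8.0_271\bin\java.exe", "-version"])
# "java -version" prints to stderr; expect: java version "1.8.0_271"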

2. INSTALL SPARK:

i] Download the Spark 2.4.7 tar.gz file (pre-built for Hadoop 2.7) from the Spark website and extract it into a C:\spark folder that you create in your C drive, so that the bin and conf folders sit directly under C:\spark.

3. WINUTILS.exe

i] Download the winutils.exe file (the build matching Hadoop 2.7) and place it in the C drive by creating a C:\winutils\bin folder.

ii] Create a c:\tmp\hive directory, open a Command Prompt, change your directory to c:\winutils\bin, and run winutils.exe chmod 777 c:\tmp\hive. Spark uses this directory as Hive's scratch space, so it must be writable; you can verify this with the small check below.
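If you want to confirm the chmod took effect, here is a small optional check of my own (not an official step): it simply tries to create a temporary file inside the scratch directory.

# Optional check that Spark's Hive scratch directory exists and is writable
# after running "winutils.exe chmod 777 c:\tmp\hive".
import os
import tempfile

hive_dir = r"c:\tmp\hive"
assert os.path.isdir(hive_dir), "create c:\\tmp\\hive first"
with tempfile.TemporaryFile(dir=hive_dir):
    pass  # no exception here means the directory accepts writes
print("ok:", hive_dir, "is writable")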

4. Open the c:\spark\conf folder, and make sure "File Name Extensions" is checked in the "View" tab of Windows Explorer.

5. Rename the log4j.properties.template file to log4j.properties.

Edit this file (using Notepad++ or any text editor) and change the logging level for log4j.rootCategory from INFO to ERROR (i.e., log4j.rootCategory=ERROR, console), so the console is not flooded with INFO messages every time you run Spark.

6. Open the "Edit the system environment variables" window by searching for "Environment Variables" in the Windows Start menu, then click Environment Variables.

(Screenshot: the Environment Variables dialog)

7. After clicking Environment Variables, add the following entries under USER VARIABLES:

i]. HADOOP_HOME = C:\winutils

ii] JAVA_HOME = C:\jdk1.8.0_271

iii] PYSPARK_DRIVER_PYTHON = jupyter

iv] PYSPARK_DRIVER_PYTHON_OPTS = notebook

v] PYSPARK_PYTHON = python3 (if PySpark later complains that python3 cannot be found, set this to python instead; Anaconda on Windows names its executable python.exe)

vi] PYSPARK_SUBMIT_ARGS = --packages ${PACKAGES} pyspark-shell (note the two dashes in front of packages; ${PACKAGES} is a placeholder for whatever Maven package coordinates you need)

vii] SPARK_HOME = C:\spark

viii] Under the Path variable in User variables, add the following entries:

  1. %SPARK_HOME%\bin
  2. %JAVA_HOME%\bin

8. Add the following entry under the Path variable in SYSTEM VARIABLES:

  1. C:\spark\spark-2.4.7-bin-hadoop2.7\bin (this applies if the archive extracted into a nested folder; with the flat layout from step 2, this is simply C:\spark\bin)

9. Add the JAVA_HOME variable under SYSTEM VARIABLES as well, using the same space-free path as before:

  1. JAVA_HOME = C:\jdk1.8.0_271

A quick way to verify that all the variables are in place is shown below.
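Once all the variables are set, close and reopen your terminal, because a running shell does not pick up changes made in the Environment Variables dialog. A minimal verification sketch, assuming the variable names above:

# Verify the environment variables from a *new* Anaconda Prompt; shells that
# were already open will not see values added through the dialog.
import os

for var in ("HADOOP_HOME", "JAVA_HOME", "SPARK_HOME",
            "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(var, "=", os.environ.get(var, "<NOT SET>"))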

10. After the steps above, open the ANACONDA PROMPT and change your directory to c:\spark.

11. Then, type pyspark to proceed with the Big Data magic.
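Because PYSPARK_DRIVER_PYTHON is set to jupyter, the pyspark command should open a Jupyter notebook in your browser. As a smoke test, you can paste something like the following into the first cell (my own sketch; any small job will do):

# Smoke test for the installation: get a SparkSession and run a tiny job.
# In the pyspark notebook a session already exists, so getOrCreate() reuses it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "windows")], ["id", "word"])
df.show()
print(spark.sparkContext.parallelize(range(100)).sum())  # expect 4950

If the DataFrame prints and the sum comes back as 4950, Spark is wired up correctly.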

Please follow the steps exactly as written, and you will have Spark tamed without any further issues. Please connect with me on other social media platforms, and check out some of my projects on GitHub as well.

  1. LinkedIn
  2. GitHub
  3. Twitter
  4. Reddit
  5. Quora

I look forward to helping anyone in need of getting their system configured for big-data computation while researching new machine learning algorithms to learn from.


Uttasarga Singh

Machine Learning Engineer / Software Developer with more than 3 years of experience developing and deploying machine learning models and web-based applications.