Install Spark and Hadoop on Windows 10
Install Spark to run locally
Versions
- Java
java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
- Scala
sbt
console
Welcome to Scala 2.12.13 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
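The Java version above can also be checked from a script. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, and it assumes `java` is on the PATH and prints a banner in the same format as shown above.

```python
import re
import subprocess


def parse_java_version(banner: str) -> str:
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', banner)
    if match is None:
        raise ValueError("no version string found in banner")
    return match.group(1)


def is_java_8(version: str) -> bool:
    # Java 8 reports itself with the legacy "1.8." prefix.
    return version.startswith("1.8.")


def installed_java_version() -> str:
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_version(result.stderr)
```

Calling `is_java_8(installed_java_version())` on the machine described here should return `True`, assuming the same JDK is the one found on the PATH.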
Install Spark
Download from https://spark.apache.org/downloads.html, choosing:
- Spark release: 3.2.0
- Package type: pre-built for Hadoop 2.7 (previous versions failed for me)
The downloaded folder was copied to C:\Spark\spark-3.2.0-bin-hadoop2.7.
Add system environment variable SPARK_HOME with value C:\Spark\spark-3.2.0-bin-hadoop2.7.
Add %SPARK_HOME%\bin to the system PATH.
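A quick way to confirm the two settings above took effect is a small script run from a newly opened terminal (environment variable changes only reach processes started afterwards). This is a sketch under stated assumptions: the helper name `spark_setup_problems` is hypothetical, and it only checks that SPARK_HOME is set, that its `bin` folder exists, and that this folder appears on PATH.

```python
import os
from typing import List, Mapping


def _normalize(path: str) -> str:
    # Compare case-insensitively on Windows and ignore trailing separators.
    return os.path.normcase(os.path.normpath(path))


def spark_setup_problems(env: Mapping[str, str]) -> List[str]:
    """Return the problems found; an empty list means both checks passed."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    bin_dir = os.path.join(spark_home, "bin")
    if not os.path.isdir(bin_dir):
        problems.append(f"{bin_dir} does not exist")

    # PATH is a list of folders separated by os.pathsep (";" on Windows).
    path_entries = {
        _normalize(p) for p in env.get("PATH", "").split(os.pathsep) if p
    }
    if _normalize(bin_dir) not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems
```

Passing `os.environ` as the argument checks the current process; an empty result suggests the variables are set as described, though it does not prove that Spark itself starts.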
Install Hadoop
Download from https://github.com/cdarlint/winutils.
The folder hadoop-2.7.7 was copied to C:\Hadoop\hadoop-2.7.7.
Add system environment variable HADOOP_HOME with value C:\Hadoop\hadoop-2.7.7.
Add %HADOOP_HOME%\bin to the system PATH.
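Spark on Windows looks for `winutils.exe` under `%HADOOP_HOME%\bin`, so it can be worth confirming that the file is where the steps above put it. The sketch below is illustrative only: `find_winutils` is a hypothetical helper, and it assumes the copied folder keeps the `bin\winutils.exe` layout of the winutils repository.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_winutils(env: Mapping[str, str]) -> Optional[Path]:
    """Return the path to winutils.exe under HADOOP_HOME, or None if missing."""
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # HADOOP_HOME is unset or empty, so there is nowhere to look.
        return None
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None
```

Calling `find_winutils(os.environ)` (after `import os`) from a new terminal returns the full path when the setup matches these notes, and `None` otherwise.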
Spark Structured Streaming with Kafka
https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
https://medium.com/expedia-group-tech/apache-spark-structured-streaming-checkpoints-and-triggers-4-of-6-b6f15d5cfd8d
Parquet viewer
https://github.com/mukunku/ParquetViewer
Access Azure Data Lake Gen 2
https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.html