Connecting Local Spark to an S3 Parquet Data Source (Windows 10)

How to create a local PySpark test environment using an AWS S3 data source

Rohan Kotwani
2 min read · Sep 18, 2021

To download data from an S3 bucket into local PySpark, you will need to either 1) set the AWS access environment variables or 2) create a session. To create a session, you will first need to create an IAM User Profile and a Role with S3 access permission.
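
As a rough sketch of that IAM setup, the user and role can be created with boto3. The names below and the AmazonS3ReadOnlyAccess managed policy are only illustrative; adjust them to your account.

import json
import boto3

iam = boto3.client("iam")

# illustrative names; replace with your own
user_name = "spark-local-user"
role_name = "spark-s3-read-role"

# create the IAM user and capture its ARN for the trust policy
user_arn = iam.create_user(UserName=user_name)["User"]["Arn"]

# trust policy allowing the user to assume the role (same shape as the JSON shown later in this post)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": user_arn}, "Action": "sts:AssumeRole"}
    ],
}
iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# give the role S3 read access via an AWS-managed policy
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)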

We will not need to install Hadoop on the local system, but we will need to make the hadoop-aws JAR (and its dependencies) available to the Python environment. Make sure to choose the version that corresponds to your Spark version. You can download these JARs and place them here: C:\Users\USER_NAME\Anaconda3\envs\ENV_NAME\Lib\site-packages\pyspark\jars
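
If you would rather not copy JARs by hand, Spark can also fetch them from Maven Central at session startup via spark.jars.packages. The hadoop-aws version below is only an example (the same coordinates appear in the commented-out config later in this post) and must match the Hadoop build your PySpark distribution was compiled against.

from pyspark.sql import SparkSession

# hadoop-aws and its transitive AWS SDK dependency are pulled automatically;
# 3.2.0 is an example version, pick the one matching your Spark/Hadoop build
spark = (SparkSession.builder
         .appName("app_name")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
         .getOrCreate())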

Finally, you can read your configuration file, set the environment variables, and download the data from S3.

from pyspark.sql import SparkSession
import os
import configparser
os.environ['PYSPARK_DRIVER_PYTHON']='jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']='notebook'
os.environ['PYSPARK_PYTHON']='python'
# this file is normally located at ~/.aws/credentials
config = configparser.ConfigParser()
config.read_file(open('../.aws/credentials'))
os.environ["AWS_ACCESS_KEY_ID"] = config['default']['aws_access_key_id']
os.environ["AWS_SECRET_ACCESS_KEY"] = config['default']['aws_secret_access_key']
spark = SparkSession.builder.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.appName("app_name") \
.getOrCreate()
df = spark.read.json("s3a://INSERT_BUCKET_NAME/FOLDER_NAME/*/*/*")

Writing to S3

Download the hadoop.dll file that matches your Hadoop version and place it in the C:\Windows\System32 directory.
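
With hadoop.dll in place, writing uses the same s3a paths as reading; the bucket and folder names below are placeholders.

# write the DataFrame back to S3 as Parquet (placeholder bucket/folder names)
df.write.mode("overwrite").parquet("s3a://INSERT_BUCKET_NAME/OUTPUT_FOLDER/")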

Using a Profile Instead

After creating the IAM Role, grant the IAM User Profile access to it: modify the Trust Relationship of the Role so that the User Profile is allowed to assume it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "INSERT_USER_PROFILE_ARN"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Finally, you can create a session with boto3 using the User Profile and assume the Role to obtain temporary credentials.

import boto3
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
import os
#you might need to set these env vars
os.environ['PYSPARK_DRIVER_PYTHON']='jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']='notebook'
os.environ['PYSPARK_PYTHON']='python'
# create a boto3 session from the named IAM user profile
session = boto3.session.Session(profile_name='INSERT_PROFILE_NAME')
sts_connection = session.client('sts')
# assume the role (its trust relationship must allow this user profile)
response = sts_connection.assume_role(RoleArn='INSERT_ROLE_ARN',
                                      RoleSessionName='INSERT_ROLE_NAME',
                                      DurationSeconds=3600)
credentials = response['Credentials']
# alternatively, pull the hadoop-aws package via the Spark config instead of copying JARs
# conf = pyspark.SparkConf()
# conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
# the Glue metastore factory config below is optional; it is only needed for the Glue Data Catalog
spark = SparkSession.builder.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.appName("app_name") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
spark._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
df = spark.read.parquet("s3a://INSERT_BUCKET_NAME/FILE.parquet/")
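
Equivalently, the same temporary credentials can be passed at session build time with the spark.hadoop.* prefix, which avoids reaching into spark._jsc; this is a sketch of the same configuration, not a change in behavior.

# same s3a settings expressed as builder configs instead of _jsc.hadoopConfiguration()
spark = (SparkSession.builder
         .appName("app_name")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
         .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId'])
         .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey'])
         .config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken'])
         .getOrCreate())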

Done
