Connecting Local Spark to an S3 Parquet Data Source (Windows 10)

How to create a local PySpark test environment using an AWS S3 data source

Rohan Kotwani
2 min read · Sep 18, 2021

In order to download data from an S3 bucket into a local PySpark session, you will need to either 1) set the AWS access environment variables or 2) create a session with temporary credentials. To create a session, you will first need to create an IAM User Profile and a Role with S3 access permissions.

We will not need to install Hadoop on the local system, but we will need to make hadoop-aws.jar (and its dependencies) available to the Python environment. Make sure that you choose the version that matches the Hadoop version bundled with your PySpark installation. You can download these jars and place them here: C:\Users\USER_NAME\Anaconda3\envs\ENV_NAME\Lib\site-packages\pyspark\jars
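
Alternatively, you can let Spark resolve the connector at session startup with spark.jars.packages instead of copying jars by hand. A minimal sketch, assuming hadoop-aws 3.2.0 (the version used in the commented-out config later in this post); match it to the Hadoop version bundled with your PySpark:

from pyspark.sql import SparkSession

# Resolve hadoop-aws (and its aws-java-sdk dependency) from Maven at startup.
# The version below is only an example; match it to your PySpark's bundled Hadoop version.
spark = SparkSession.builder \
    .appName("app_name") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
    .getOrCreate()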

Finally, you can read your configuration file, set the environment variables, and download the data from S3.

from pyspark.sql import SparkSession
import os
import configparser

# Run PySpark through a Jupyter notebook
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'notebook'
os.environ['PYSPARK_PYTHON'] = 'python'

# Read the AWS credentials file (normally this file lives in ~/.aws/credentials)
config = configparser.ConfigParser()
config.read_file(open('../.aws/credentials'))

# Expose the keys as environment variables so the S3A connector can pick them up
os.environ["AWS_ACCESS_KEY_ID"] = config['default']['aws_access_key_id']
os.environ["AWS_SECRET_ACCESS_KEY"] = config['default']['aws_secret_access_key']

# Build the session; the Glue Data Catalog metastore setting is optional
spark = SparkSession.builder \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .appName("app_name") \
    .getOrCreate()

df = spark.read.json("s3a://INSERT_BUCKET_NAME/FOLDER_NAME/*/*/*")
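
If your data is stored as Parquet rather than JSON (the case this post targets), the same session reads it directly; the bucket and folder below are placeholders:

df = spark.read.parquet("s3a://INSERT_BUCKET_NAME/FOLDER_NAME/")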

Writing to S3

Writing to S3 from Windows also requires the native hadoop.dll. Download the hadoop.dll file from here and place it in the C:\Windows\System32 directory.
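
With hadoop.dll in place, writing back to the bucket is a standard DataFrame write. A minimal sketch, assuming the df read earlier and a placeholder output prefix:

df.write.mode("overwrite").parquet("s3a://INSERT_BUCKET_NAME/OUTPUT_FOLDER/")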

Using a Profile Instead

After creating the IAM Role, allow the IAM User Profile to assume it: modify the Trust Relationship of the Role so that the User Profile's ARN is listed as a trusted principal.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "INSERT_USER_PROFILE_ARN"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Finally, you can create a Session with boto3 using the User Profile, assume the Role, and pass the resulting temporary credentials to Spark's S3A configuration.

import boto3
import os
import pyspark  # only needed for the optional SparkConf below
from pyspark.sql import SparkSession

# You might need to set these env vars to run PySpark in a Jupyter notebook
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'notebook'
os.environ['PYSPARK_PYTHON'] = 'python'

# Create a boto3 session from the IAM User Profile and assume the Role
# (the Role's trust relationship must allow this User Profile)
session = boto3.session.Session(profile_name='INSERT_PROFILE_NAME')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='INSERT_ROLE_ARN',
                                      RoleSessionName='INSERT_ROLE_NAME',
                                      DurationSeconds=3600)
credentials = response['Credentials']

# Optionally pull the hadoop-aws jar at startup instead of copying it into pyspark/jars
# conf = pyspark.SparkConf()
# conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')

# Build the session; the Glue Data Catalog metastore setting is optional
spark = SparkSession.builder \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .appName("app_name") \
    .getOrCreate()

# Pass the temporary credentials to the S3A connector
spark._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
spark._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])

df = spark.read.parquet("s3a://INSERT_BUCKET_NAME/FILE.parquet/")
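
As a quick sanity check that the temporary credentials and the S3A connector are wired up correctly, inspect the DataFrame you just read:

df.printSchema()
df.show(5)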

Done
