Connecting Local Spark to an S3 Parquet Data Source

How to create a local PySpark test environment using an AWS S3 data source

To read data from an S3 bucket into local PySpark, you will need to either 1) set the AWS access environment variables or 2) create a boto3 session. To create a session, you will first need an IAM User Profile and a Role with S3 access permissions.
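The environment-variable option can be sketched as follows; the key values are placeholders, not real credentials:

```python
import os

# Option 1: set the AWS access environment variables directly.
# Both values below are placeholders -- substitute your own credentials
# (and never commit real keys to source control).
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_ACCESS_KEY"
```

Spark's S3A connector picks these variables up automatically at session start, so they must be set before the SparkSession is created.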

We will not need to install Hadoop on the local system, but we will need to make the hadoop-aws.jar (and its dependencies) available to the Python environment. Make sure that you choose the jar version that matches the Hadoop version your Spark build was compiled against. You can download these jars and place them here: C:\Users\USER_NAME\Anaconda3\envs\ENV_NAME\Lib\site-packages\pyspark\jars
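The jars are published on Maven Central under a predictable path, so a small helper can build the download URL for any version. The version `3.3.4` below is only an example; match it to your Spark build's Hadoop version:

```python
def hadoop_aws_url(version):
    """Build the Maven Central download URL for a hadoop-aws jar.

    `version` must match the Hadoop version bundled with your Spark
    build -- 3.3.4 in the usage example is just an illustration.
    """
    base = "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws"
    return f"{base}/{version}/hadoop-aws-{version}.jar"

print(hadoop_aws_url("3.3.4"))
```

Download the jar this URL points to (plus its dependencies, which are listed on the same Maven Central page) into the pyspark jars directory mentioned above.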

Finally, you can read your configuration file, set the environment variables, and download the data from S3.
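A minimal sketch of that step, using Python's standard configparser. The config file, section name, keys, and bucket path are all placeholders; here the script writes its own throwaway config file so the example is self-contained, whereas in practice you would point configparser at your real credentials file:

```python
import configparser
import os
import tempfile

# For illustration only: create a throwaway config file. In practice,
# point configparser at your own credentials file instead.
cfg_text = """[aws]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""
cfg_path = os.path.join(tempfile.mkdtemp(), "config.ini")
with open(cfg_path, "w") as f:
    f.write(cfg_text)

# Read the configuration file and set the environment variables.
config = configparser.ConfigParser()
config.read(cfg_path)
os.environ["AWS_ACCESS_KEY_ID"] = config["aws"]["aws_access_key_id"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["aws"]["aws_secret_access_key"]

# With the environment set and hadoop-aws on the classpath, the read is:
# df = spark.read.parquet("s3a://your-bucket/path/to/data")  # placeholder bucket
```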

Writing to S3

Download the hadoop.dll file from here and place it in the C:\Windows\System32 directory.
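A quick sanity check for the DLL placement, with the write call shown as a comment (the bucket name and the helper function are illustrative, not part of any library):

```python
import os

def hadoop_dll_installed(system32_dir=r"C:\Windows\System32"):
    """Return True if hadoop.dll is present in the given directory.

    The default is the standard Windows System32 location; pass a
    different directory to check elsewhere.
    """
    return os.path.exists(os.path.join(system32_dir, "hadoop.dll"))

# With the DLL in place, writing a DataFrame back to S3 looks like
# (placeholder bucket name):
# df.write.mode("overwrite").parquet("s3a://your-bucket/output/")
```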

Using a Profile Instead

After creating the IAM Role, attach it to the IAM User Profile. Modify the Trust Relationship of the Role so that the User Profile is allowed to assume the Role.
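A trust relationship of that shape looks like the following policy document; the account ID and user name are placeholders for your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/your-user"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```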

Finally, you can create a Session with boto3 using the User Profile.
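A sketch of that last step, assuming a named profile already configured locally (the profile name, region, and helper function are placeholders, not a fixed API beyond boto3's own `Session`):

```python
def s3_client_from_profile(profile_name="my-profile", region="us-east-1"):
    """Create a boto3 Session from a named profile and return an S3 client.

    `my-profile` and the region are placeholders -- use the profile you
    configured locally (e.g. via `aws configure --profile my-profile`).
    """
    import boto3  # lazy import so the sketch can be read without boto3 installed
    session = boto3.Session(profile_name=profile_name, region_name=region)
    return session.client("s3")
```

The returned client can then list or download the Parquet objects that local Spark will read.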