Connecting Local Spark to an S3 Parquet Data Source

How to create a local PySpark test environment using an AWS S3 data source

To read data from an S3 bucket into local PySpark, you will need to either 1) set the AWS access environment variables or 2) create a session. To create a session, you will first need an IAM user profile and a role with S3 access permissions.
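A minimal sketch of the two options is shown below. The profile name, role ARN, and credential values are placeholders; substitute the ones tied to your own IAM user and role.

```python
import os
import boto3

# Option 1: expose credentials through the standard AWS environment variables.
# The values here are placeholders for the access key pair of your IAM user.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"

# Option 2: create a boto3 session from a named IAM user profile and assume
# the role that grants S3 access. Profile name and role ARN are placeholders.
session = boto3.Session(profile_name="my-iam-user-profile")
credentials = session.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-s3-access-role",
    RoleSessionName="local-pyspark-test",
)["Credentials"]

# credentials["AccessKeyId"], credentials["SecretAccessKey"], and
# credentials["SessionToken"] can then be passed to Spark's S3A configuration.
```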

We will not need to install Hadoop on the local system, but we will need to make the hadoop-aws.jar (and its dependencies) available to…
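One way to do this, sketched below, is to let Spark pull the connector from Maven at startup via `spark.jars.packages`. The hadoop-aws version shown is an assumption; it must match the Hadoop version your local Spark build was compiled against, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

# Pull hadoop-aws (and its transitive dependencies, such as the AWS SDK)
# from Maven when the session starts. Version 3.3.4 is an assumption:
# pick the release that matches your Spark distribution's Hadoop build.
spark = (
    SparkSession.builder
    .appName("local-s3-parquet-test")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)

# With the S3A connector on the classpath, Parquet data can be read
# directly from an s3a:// path (bucket and prefix are placeholders).
df = spark.read.parquet("s3a://my-example-bucket/path/to/data/")
df.printSchema()
```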