AWS: The 1st Cut is Getting the Data

Creating an S3 Data Lake with Kinesis Data Streams

There are many possible approaches to streaming data into an S3 bucket. Strictly speaking, I did not need Kinesis Data Streams to collect this data in an S3 bucket, because I could have saved files directly from the Docker container (running on ECS) that contains the web scraper.
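As a rough illustration of the streaming route, here is a minimal sketch of how the scraper might push one scraped item into a stream with boto3. The stream name `scraper-stream`, the `symbol` field used as the partition key, and the helper names are all hypothetical, and delivery from the stream into the S3 bucket is assumed to be configured separately.

```python
import json


def build_kinesis_record(stream_name, item, partition_key_field="symbol"):
    """Build the keyword arguments for a Kinesis PutRecord call.

    `partition_key_field` is a hypothetical field name; use whichever
    field spreads your records evenly across shards.
    """
    return {
        "StreamName": stream_name,
        # One JSON document per line keeps the files easy to parse later.
        "Data": (json.dumps(item) + "\n").encode("utf-8"),
        "PartitionKey": str(item[partition_key_field]),
    }


def send_to_stream(item, stream_name="scraper-stream"):
    """Send a single scraped item to the (hypothetical) stream."""
    # Imported here so the builder above can be exercised without AWS access.
    import boto3

    client = boto3.client("kinesis")
    return client.put_record(**build_kinesis_record(stream_name, item))
```

Keeping the record-building step separate from the network call makes it possible to check the payload locally before any AWS credentials are involved.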

Testing the Data Source with Glue and Athena

We can check that the data is processed correctly by creating a Glue crawler to extract the schemas from the data in the S3 bucket. The crawler creates the schemas automatically, but there is also an option to define your own. Structuring the data source as JSON made it easier for Glue to parse out the schema. After the tables were created, I used Athena's interactive query GUI to inspect the data.
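I did this through the console, but the same check could be scripted. The sketch below assumes boto3 and uses hypothetical names throughout (crawler name, role ARN, database, table, and bucket paths); it only shows the general shape of the Glue and Athena calls.

```python
def crawler_params(name, role_arn, database, s3_path):
    """Keyword arguments for glue.create_crawler (all values are placeholders)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def athena_query_params(database, table, output_location, limit=10):
    """Keyword arguments for athena.start_query_execution: a small sample query."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def create_and_start_crawler(params):
    """Create the crawler and kick off a run."""
    import boto3  # imported lazily so the builders can be tested offline

    glue = boto3.client("glue")
    glue.create_crawler(**params)
    glue.start_crawler(Name=params["Name"])


def start_sample_query(params):
    """Start the query and return its execution id.

    Run this only after the crawler has finished and the table exists.
    """
    import boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(**params)["QueryExecutionId"]
```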

Adding the Containerized Web Scraper to the Elastic Container Service

A web scraper was containerized with Docker and pushed to the Elastic Container Registry (ECR). A set of Docker commands to push the built container is available in the ECR GUI. I used starter code from another project to create the containerized web scraper: Starter Code for Containerized Web Scraper

Scheduling the ECS cluster with CloudWatch Rules

As mentioned above, I needed to set the Task Role and Task Execution Role overrides to ecsTaskExecutionRole. Without these settings, I was unable to get the container to access the environment file in the S3 bucket. In addition, these parameters cannot be set from CloudWatch Rules when configuring the ECS cluster and Task Definition. I ended up writing a Lambda function to create the Task Definition and then scheduled the Lambda with CloudWatch Rules. The Lambda function used to create a Task Definition is shown below. The Lambda function will need permission to access ECS.
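For readers who want a starting point, here is a minimal sketch of a handler in that spirit; it is not the original function. It assumes boto3, a Fargate launch type, and hypothetical environment variables (`ECS_CLUSTER`, `TASK_DEFINITION`, `TASK_ROLE_ARN`, `SUBNET_IDS`), and it applies both role overrides when starting the task with `run_task`.

```python
import os


def build_run_task_params(cluster, task_definition, role_arn, subnets):
    """Keyword arguments for ecs.run_task with both role overrides applied."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",  # assumption: the task runs on Fargate
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "ENABLED",
            }
        },
        # The same role (e.g. ecsTaskExecutionRole) is used for both overrides.
        "overrides": {"taskRoleArn": role_arn, "executionRoleArn": role_arn},
    }


def lambda_handler(event, context):
    """Entry point invoked on a schedule by the CloudWatch rule."""
    import boto3  # available by default in the Lambda Python runtime

    params = build_run_task_params(
        cluster=os.environ["ECS_CLUSTER"],
        task_definition=os.environ["TASK_DEFINITION"],
        role_arn=os.environ["TASK_ROLE_ARN"],
        subnets=os.environ["SUBNET_IDS"].split(","),
    )
    response = boto3.client("ecs").run_task(**params)
    # Return only the ARNs so the result is JSON-serializable.
    return {"started": [task["taskArn"] for task in response.get("tasks", [])]}
```

With a handler like this, the Lambda's own role would need `ecs:RunTask` and, because it hands a role to the task, `iam:PassRole` on that role.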

Next Steps

I hope this article was helpful to developers who are getting started with AWS and Big Data. I tried to focus on parts of the data pipeline that were not explained in other articles, such as how to use Lambda functions and which permissions to assign to the Task Role in ECS. The next step of this project will be to post-process the data and aggregate it in a data warehouse or with a Spark job. Once the aggregate tables are built, I plan to schedule stock prediction algorithms to run on top of them.


https://www.linkedin.com/in/rkotwani/
