AWS: The 1st Cut is Getting the Data
A brief introduction to building an automated, simple web scraper in AWS
There are many technical articles on this subject that provide comprehensive step-by-step instructions for building these workflows. To limit redundant information, I will focus on the specific tasks I needed to do to make my solution work, and I will link the sources/starter code I used to get started.
This project collects data from a limited set of sources; scaling it to hundreds of sources would require further development. This article may be interesting to those who are working toward a Big Data certification and to those who want to collect some data to get hands-on experience.
The first cut of any data project, Big Data or not, is to get the data.
I have been interested in stock data for a while, and I had already developed some web scrapers to collect data from Google News, Alpha Vantage (stock prices), and Twitter. However, the predictive algorithms I ran after the data collection tied up my computer’s resources and prevented me from running other compute-intensive projects. Moving the data collection and predictive algorithms to the cloud improved the consistency and the time cost of this data project. The project pipeline is shown below.
List of services used in this project: S3, Kinesis Data Streams, Kinesis Firehose, Lambda, Glue, Athena, ECR, ECS (Fargate), CloudWatch, and IAM.
Creating a S3 Data Lake with Kinesis Data Streams.
There are many possible approaches to streaming data into an S3 bucket. Strictly speaking, I did not need Kinesis Data Streams to collect this data into an S3 bucket, because I could have saved files directly from the Docker container (ECS) that contains the web scraper.
One data stream (a Kinesis Data Stream + Kinesis Firehose pipeline) was created for each data source (Google News, stock prices, Twitter). I found some Kinesis Python starter code (linked below) and used it to stream data from the web scraper to the appropriate data stream for each data source. This allowed me to create a separate folder in the S3 bucket for each data source. The Kinesis Firehose applications were configured, from the GUI, to trigger from the Data Streams and output to an S3 destination. This part was pretty straightforward. Starter Code for Kinesis Data Streams (Python 3.8 API)
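The producer side of this step can be sketched as follows. This is a minimal sketch, not the linked starter code: the stream name, the shape of the scraped record, and the use of a "source" field as the partition key are all illustrative assumptions.

```python
import json


def send_to_stream(kinesis_client, stream_name, record):
    """Serialize one scraped record as JSON and put it on a Kinesis Data Stream.

    The partition key spreads records across shards; here we use a
    hypothetical "source" field from the scraped payload.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record.get("source", "default"),
    )


# Example usage (assumes AWS credentials and an existing stream):
# import boto3
# client = boto3.client("kinesis", region_name="us-east-1")
# send_to_stream(client, "stock-price-stream",
#                {"source": "alpha_vantage", "symbol": "AAPL", "price": 182.5})
```

Each data source would call this with its own stream name, which is what keeps the three sources separated downstream.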
The scraped data was sent to Kinesis in JSON format. I found that Kinesis will dump all JSON records into one file with no delimiters between them. I needed to write a Lambda function to preprocess the records before Firehose delivers the data to S3. This transformation can also be configured from the Kinesis Firehose GUI. The Lambda function used to process the JSON records is shown below.
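Since the original gist is not reproduced here, the following is a sketch of what such a Firehose transformation Lambda typically looks like: it decodes each base64 record, appends a newline delimiter, and re-encodes it, so the objects landing in S3 are newline-delimited JSON.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler.

    Each incoming record carries base64-encoded data. We decode it,
    round-trip it through json to validate it, append a newline so
    records in the delivered S3 object are separated, and re-encode.
    """
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        row = json.dumps(json.loads(payload)) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(row.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Returning `"result": "Ok"` with the original `recordId` is what tells Firehose the record was transformed successfully.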
Testing the Data Source with Glue and Athena
We can make sure the data is processed correctly by creating a Glue crawler to extract the schemas from the S3 bucket. The crawler will infer the schemas automatically, but there is also an option to define your own. Structuring the data in JSON format made it easier for Glue to parse out the schema. After the tables were created, Athena was used to inspect the data from the interactive query GUI.
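The same sanity check can be run programmatically instead of through the query GUI. This is a hedged sketch; the database name, query, and results location are placeholder assumptions.

```python
def run_athena_query(athena_client, database, query, output_s3):
    """Start an Athena query against a Glue-cataloged table and
    return its execution id; results are written to the given S3 path."""
    resp = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]


# Example usage (assumes AWS credentials and a crawled table):
# import boto3
# athena = boto3.client("athena", region_name="us-east-1")
# qid = run_athena_query(athena, "scraper_db",
#                        "SELECT * FROM google_news LIMIT 10",
#                        "s3://my-bucket/athena-results/")
```

You would then poll `get_query_execution` with the returned id until the query succeeds before fetching results.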
Adding the Containerized Web Scraper to the Elastic Container Service
The web scraper was containerized with Docker and pushed to the Elastic Container Registry (ECR). A set of Docker commands for pushing the built container is available in the ECR GUI. I used starter code from another project to create the containerized web scraper. Starter Code for Containerized Web Scraper
The next step was to create a Task Definition that describes the container’s resource usage and permissions. The container’s task execution role, ecsTaskExecutionRole, should be configured to allow access to S3 and Kinesis. The container needs to read an environment file stored in S3, pull the data with the necessary credentials (defined in the environment file), and put records into the Kinesis Data Streams. You can test whether a service has the correct permissions with the IAM Policy Simulator. A Cluster VPC will also need to be created to run the Task Definition; it can be created from the ECS GUI.
The Task Definition can be tested directly from the interface. To run the task through the GUI, you will need to select the launch type, i.e., Fargate. The Cluster VPC and its subnets will also need to be selected; you can find a list of subnets in the Virtual Private Cloud console. Auto-Assign Public IP should be set to Enabled. In the advanced options, set the Task Role and Task Execution Role overrides to ecsTaskExecutionRole.
Scheduling the ECS cluster with CloudWatch Rules
As mentioned above, I needed to set the Task Role and Task Execution Role overrides to ecsTaskExecutionRole; without these configurations, I was unable to get the container to access the environment file in the S3 bucket. However, these parameters cannot be set from CloudWatch Rules when configuring the ECS cluster and Task Definition. I ended up writing a Lambda function to run the Task Definition and then scheduled that Lambda with CloudWatch Rules. The Lambda function used to launch the task is shown below. It will need permission to access ECS.
Within the Lambda function, I used Boto3 to run the Fargate task with the required configurations. Boto3 Client Run Documentation.
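Since the original gist is not reproduced here, a minimal sketch of such a scheduler Lambda follows. The cluster name, task definition name, subnet id, and role ARN are all placeholder assumptions; the essential part is passing the role overrides that the CloudWatch Rules GUI could not set.

```python
def run_scraper_task(ecs_client, cluster, task_definition, subnets, role_arn):
    """Launch the containerized scraper as a one-off Fargate task.

    Setting both taskRoleArn and executionRoleArn overrides to the
    ecsTaskExecutionRole is what lets the container read the environment
    file from S3 and put records onto the Kinesis Data Streams.
    """
    return ecs_client.run_task(
        cluster=cluster,
        launchType="FARGATE",
        taskDefinition=task_definition,
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "taskRoleArn": role_arn,
            "executionRoleArn": role_arn,
        },
    )


def lambda_handler(event, context):
    # boto3 is available in the Lambda runtime; imported here so the
    # module can be loaded without the SDK installed.
    import boto3

    ecs = boto3.client("ecs")
    return run_scraper_task(
        ecs,
        cluster="scraper-cluster",             # hypothetical cluster name
        task_definition="scraper-task:1",      # hypothetical task definition
        subnets=["subnet-0123456789abcdef0"],  # replace with your VPC subnets
        role_arn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    )
```

A CloudWatch Rule on a schedule expression (e.g. a daily cron) then invokes this Lambda instead of targeting the ECS task directly.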
I hope this article was helpful to developers who are getting started with AWS/Big Data. I tried to focus on the parts of the data pipeline that were not explained in other articles, such as how to use Lambda functions and which permissions to assign to the Task Role in ECS. The next step of this project will be to post-process the data and aggregate it in a data warehouse or with a Spark job. Once the aggregate tables are built, I plan to schedule stock prediction algorithms to run on top of them.