To download data from an S3 bucket into a local PySpark session, you will need to either 1) set the AWS access environment variables or 2) create a session. To create a session, you will first need to create an IAM user profile and a role with S3 access permissions.

We will not need to install Hadoop on the local system, but we will need to make **hadoop-aws.jar** (and its dependencies) available to the Python environment. Make sure that you choose the version that corresponds to your Spark version. You can download these jars and place them here: **C:\Users\USER_NAME\Anaconda3\envs\ENV_NAME\Lib\site-packages\pyspark\jars**
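As a minimal sketch, once the jars are in place a SparkSession can be configured to read from S3 through the s3a connector. The bucket path and credential values below are placeholders, not details from the original post:

```python
# Sketch: configure a local SparkSession with S3 credentials and read a CSV
# via the s3a connector. Assumes hadoop-aws.jar (and its AWS SDK dependency)
# is already on the PySpark classpath; all credential values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-read-example")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder
    .getOrCreate()
)

# Read a CSV directly from S3 using the s3a scheme (bucket/key are placeholders)
df = spark.read.csv("s3a://your-bucket/path/data.csv", header=True)
df.show(5)
```

Alternatively, setting the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables before starting the session avoids hard-coding credentials in the script.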

…

I’m hesitant to write about this as it touches on a few controversial topics, but if I can help one person understand what I mean, it will be worth it. I often see two sides arguing about a topic while describing the same concern, i.e., what the risk is. I’m not saying that “risk” can be reduced to a semantic argument. Instead, I’m saying that the cross-section of information we look at, or the context, can greatly influence what is deemed a risk.

What is risk? There are whole books on this topic (I recommend reading Risk Savvy…

A brief introduction to building an automated, simple web scraper in AWS

There are many technical articles on this subject that provide comprehensive, step-by-step instructions for building technical workflows. To limit redundant information, I will focus on the specific tasks I needed to complete to make my solution work, and I will link the sources/starter code I used to get started.

This project collects data from a limited set of sources. In order to scale this project to hundreds of sources, further development would be needed. …

Overall, I found the course helpful and insightful, 4.79/5. There were many ideas that I had not considered before so I am posting some of my notes here. More than likely, you have seen most of these ideas so I will try to focus on the most interesting ones. Here is the link to the course.

- **Data Exploration Checklist**
- **Validation**
- **Target Leakage**
- **Metrics and Loss Functions**
- **Metric Optimization**
- **Mean Encoding**
- **Coding Tips**
- **Advanced Feature Engineering**
- **Ensemble Strategies**
- **StackNet**
- **Creating a Diverse Set of Models**
- **Tips on Meta-Learning and Stacking**
- **Text-Based Features in XGBoost**
- **Sequence Feature Extraction (XGBoost)**
- **Semi-supervised &…**

A ranking algorithm can be used if the target variable is numerically ordered. The model will capture the shared variation between adjacent classes. This model can also be useful for semi-supervised classification, where proxy targets are assigned a lower rank than the actual targets.

The simplest ranking model is a regression model. The output of the regression model is rounded to the closest integer, and the values are also clipped to the maximum and minimum of the target. Multi-class linear models are another option, but it should be noted that each class is treated separately and not in…
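The regression-based ranking approach above can be sketched in a few lines. The data here is synthetic for illustration, and the plain least-squares fit stands in for whatever regression model you prefer:

```python
# Sketch of the simplest ranking model: fit a regression, round predictions
# to the nearest integer, and clip them to the target's observed min/max.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Ordered integer target (e.g., ranks 1..5) with a linear signal
y = np.clip(np.round(X @ np.array([1.0, -0.5, 0.3]) + 3), 1, 5)

# Ordinary least squares with a bias column
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Round to the closest integer and clip to the target's range
preds = np.clip(np.round(Xb @ w), y.min(), y.max())
print(preds[:10])
```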

Business-related time series are often non-continuous, or discrete, time-based processes. Operationally, it might be useful for businesses to know, or be able to quantify, when a time series will be at a peak or trough in the future. The goal is to capture the trend and periodic patterns and to forecast the signal more than one sample into the future. The example notebook is available in the TensorFlow Formulation section of this article.

In most business-related applications, the time series have non-constant mean and variance over time, or they can be said to be *non-stationary*. This is contrasted to stationary…
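One common first step for a series with a non-constant mean is differencing. This is a hedged sketch on a synthetic trend-plus-seasonality series, not the article's TensorFlow formulation:

```python
# Sketch: first differencing removes a (locally) linear trend, making the
# mean roughly constant over time. The series here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(300)
# Trend + periodic pattern + noise: non-stationary in the mean
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.1, size=t.size)

diffed = np.diff(series)  # y[t] - y[t-1]

# The mean drifts across halves of the raw series but not the differenced one
print(series[:150].mean(), series[150:].mean())
print(diffed[:150].mean(), diffed[150:].mean())
```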

The purpose, problem statement, and potential applications came from this post on datasciencecentral.com. The goal is to approximate any multi-variate distribution using a weighted sum of *kernels*. Here, a kernel refers to a parameterized distribution. This method of using a decaying weighted sum of kernels to approximate a distribution is similar to a Taylor series, where a function can be approximated, around a point, using the function’s derivatives.

- Approximate any empirical distribution
- Build a parameterized density estimator
- Outlier detection and dataset noise reduction
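As an illustration of the weighted-sum-of-kernels idea, here is a minimal Gaussian kernel density sketch with equal weights and a hand-picked bandwidth; KernelML's actual parameterization and weighting scheme are not shown here:

```python
# Sketch: approximate an empirical distribution with a sum of Gaussian
# kernels (a kernel density estimate). Bandwidth is a hand-picked constant.
import numpy as np

rng = np.random.default_rng(2)
# Bimodal synthetic data: two Gaussian clusters
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 1.0, 500)])

def kde(x, samples, bandwidth):
    # Each sample contributes one Gaussian kernel, equally weighted
    z = (x[:, None] - samples[None, :]) / bandwidth
    k = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / bandwidth

grid = np.linspace(-5, 6, 200)
density = kde(grid, data, bandwidth=0.3)

# A valid density should integrate to roughly 1 over the grid
integral = density.sum() * (grid[1] - grid[0])
print(integral)
```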

The solution I came up with was incorporated into a Python package, KernelML. …

This is a review of a modern genetic algorithm called the multi-offspring improved real-coded genetic algorithm (MOIRCGA). The original paper can be found here. The method used by this algorithm is formally called heuristical normal distribution and direction-based crossover (HNDDBX). The algorithm generates a direction-based candidate solution and modifies it with a parameterized random variable. The coded algorithm can be found on my GitHub.

Genetic algorithms have been used for optimization since the 1960s. With the increased computational power of modern computers, genetic algorithms have gained attention for solving complex, non-linear objectives. …
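To make the direction-based idea concrete, here is a hedged sketch of a crossover step in the spirit of the operator described above, not the paper's exact HNDDBX formulation: an offspring moves from the worse parent toward the better one, perturbed by a normally distributed random variable. The objective function and scale parameter are illustrative choices:

```python
# Sketch of a direction-based crossover: step from the worse parent toward
# the better one, plus Gaussian noise. Not the exact HNDDBX operator.
import numpy as np

rng = np.random.default_rng(3)

def sphere(x):
    # Simple convex objective for illustration (lower is better)
    return np.sum(x**2)

def direction_based_crossover(parent_a, parent_b, scale=0.1):
    # Order the parents by fitness
    better, worse = sorted([parent_a, parent_b], key=sphere)
    direction = better - worse
    # Move the worse parent a random fraction of the way toward the better
    # one, then perturb with a parameterized normal random variable
    return worse + rng.uniform(0, 1) * direction + rng.normal(0, scale, worse.shape)

a, b = rng.normal(size=3), rng.normal(size=3)
child = direction_based_crossover(a, b)
print(sphere(a), sphere(b), sphere(child))
```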

This notebook shows the optimization of a multi-class, linear support vector machine using a simulation-based optimizer. Any simulation-based optimizer could be used with the CUDA kernel in this notebook. I used KernelML, my custom optimizer, in this example. The runtime for this script should be set to use the GPU: Runtime->Change runtime type.

The original SVM formulation can be found in Vapnik 1992. There have been advances to the robustness of the algorithm since then; please see Robust Classifier 2019, section 6.1. The robust implementation looks very tedious to implement. If you are interested in implementing it, please consider emailing…
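As a sketch of what a simulation-based optimizer would evaluate for a multi-class linear SVM, here is a Crammer-Singer-style multiclass hinge objective. This is an assumption about the loss, and NumPy stands in for the notebook's CUDA kernel:

```python
# Sketch: multiclass hinge objective for a linear SVM with one weight
# vector per class. A simulation-based optimizer would minimize this over
# candidate weight matrices W.
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 150, 4, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)

def hinge_objective(W, X, y, margin=1.0, reg=1e-3):
    # W has shape (d, k): one column of weights per class
    scores = X @ W
    correct = scores[np.arange(len(y)), y]
    # Penalize any class score within `margin` of the true class score
    margins = np.maximum(0, scores - correct[:, None] + margin)
    margins[np.arange(len(y)), y] = 0  # no loss for the true class
    return margins.sum(axis=1).mean() + reg * np.sum(W**2)

# Evaluate the objective for one random candidate
W0 = rng.normal(size=(d, k))
print(hinge_objective(W0, X, y))
```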

The covariance matrix has many interesting properties, and it can be found in mixture models, component analysis, Kalman filters, and more. Developing an intuition for how the covariance matrix operates is useful in understanding its practical implications. This article will focus on a few important properties, associated proofs, and then some interesting practical applications, i.e., non-Gaussian mixture models.

*I have often found that research papers do not specify the matrices’ shapes when writing formulas. I have included this and other essential information to help data scientists code their own algorithms.*
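In that spirit, here is a short sketch with the matrix shapes noted explicitly: for data of shape (n, d), the sample covariance matrix is (d, d), symmetric, and positive semi-definite. The data is synthetic:

```python
# Sketch: compute a sample covariance matrix from scratch and check its
# basic properties via an eigendecomposition. X has shape (n, d).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))          # (n, d)
Xc = X - X.mean(axis=0)                # center each column
cov = (Xc.T @ Xc) / (len(X) - 1)       # (d, d) sample covariance

# eigh exploits symmetry; eigenvalues of a covariance matrix are >= 0
eigvals, eigvecs = np.linalg.eigh(cov)
print(cov.shape, eigvals)
```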

The covariance matrix can be decomposed into multiple unique (2x2)…