How to create a local PySpark test environment using an AWS S3 data source


The lack of education on these topics has generated unproductive arguments about risk and probability.



My Notes for “How to Win a Data Science Competition From Top Kagglers” on Coursera.org

https://www.kaggle.com/progression

TOC

  1. Data Exploration Checklist
  2. Validation
  3. Target Leakage
  4. Metrics and Loss Functions
  5. Metric Optimization
  6. Mean Encoding
  7. Coding Tips
  8. Advanced Feature Engineering
  9. Ensemble Strategies
  10. StackNet
  11. Creating a Diverse Set of Models
  12. Tips on Meta-Learning and Stacking
  13. Text Based Features in XGBoost
  14. Sequence Feature Extraction (XGBoost)
  15. Semi-supervised &…


Rank Ordering with TensorFlow 2.0

Introduction

A ranking algorithm can be used when the target variable is numerically ordered. The model captures the variation shared between adjacent classes. It can also be useful for semi-supervised classification, where proxy targets are assigned a lower rank than the actual targets.
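One common way to let a model share information between adjacent classes is to encode each rank as a set of cumulative binary targets, one per threshold. The sketch below illustrates that general idea only; it is not taken from the article, and the function names `encode_ordinal` and `decode_ordinal` are hypothetical.

```python
def encode_ordinal(rank, num_classes):
    """Encode a 0-based rank as num_classes - 1 cumulative binary targets.

    Target i answers the question "is the rank greater than i?", so
    neighbouring ranks differ in exactly one position.
    """
    if not 0 <= rank < num_classes:
        raise ValueError("rank must be in [0, num_classes)")
    return [1 if rank > threshold else 0 for threshold in range(num_classes - 1)]


def decode_ordinal(probabilities, cutoff=0.5):
    """Turn per-threshold probabilities back into a rank.

    Counts the leading thresholds whose probability exceeds the cutoff
    and stops at the first one that does not.
    """
    rank = 0
    for probability in probabilities:
        if probability <= cutoff:
            break
        rank += 1
    return rank


if __name__ == "__main__":
    for rank in range(4):
        print(rank, encode_ordinal(rank, num_classes=4))
```

With this encoding, a model with one sigmoid output per threshold can be trained with binary cross-entropy, and `decode_ordinal` recovers a rank from its predictions.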


Forecasting with TensorFlow 2.0

Trend, Periodicity, and Noise

In most business-related applications, the mean and variance of a time series change over time; that is, the series is non-stationary. This contrasts with stationary…
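As a rough illustration of a non-constant mean, the sketch below builds a synthetic series from a trend, a periodic term, and noise, then compares the mean of its first and second halves before and after first-order differencing. All parameter values and function names are assumptions chosen for the example, not taken from the article.

```python
import math
import random


def build_series(n=120, slope=0.5, period=12, amplitude=2.0, noise_std=1.0, seed=0):
    """Synthetic series: linear trend + sinusoidal periodicity + Gaussian noise."""
    rng = random.Random(seed)
    return [
        slope * t
        + amplitude * math.sin(2.0 * math.pi * t / period)
        + rng.gauss(0.0, noise_std)
        for t in range(n)
    ]


def difference(series):
    """First-order differencing: x[t+1] - x[t], which removes a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


def half_mean_gap(series):
    """Absolute difference between the means of the two halves of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return abs(sum(second) / len(second) - sum(first) / len(first))


if __name__ == "__main__":
    series = build_series()
    print("gap in half means, raw series:        ", round(half_mean_gap(series), 3))
    print("gap in half means, differenced series:", round(half_mean_gap(difference(series)), 3))
```

A large gap between the half means signals a drifting mean; after differencing, the gap shrinks because the linear trend becomes a constant offset.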


Density Estimation using Multi-Agent Optimization & Rewards

Introduction

The purpose, problem statement, and potential applications came from this post on datasciencecentral.com. The goal is to approximate any multivariate distribution using a weighted sum of kernels, where a kernel is a parameterized distribution. Approximating a distribution with a decaying weighted sum of kernels is similar to a Taylor series, in which a function is approximated around a point using the function’s derivatives.
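To make the idea of a decaying weighted sum of kernels concrete, here is a minimal one-dimensional sketch that uses Gaussian kernels and geometrically decaying weights. The choice of Gaussian kernels, the decay rule, and every function name are assumptions for illustration; this is not the KernelML API or the optimization procedure from the article.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def decaying_weights(n_kernels, decay=0.5):
    """Geometrically decaying weights, normalised so that they sum to 1."""
    raw = [decay ** k for k in range(n_kernels)]
    total = sum(raw)
    return [w / total for w in raw]


def mixture_density(x, params, weights):
    """Weighted sum of Gaussian kernels; params is a list of (mean, std) pairs."""
    return sum(w * gaussian_pdf(x, mean, std) for w, (mean, std) in zip(weights, params))


if __name__ == "__main__":
    params = [(0.0, 1.0), (3.0, 0.5), (-2.0, 2.0)]  # hypothetical kernel parameters
    weights = decaying_weights(len(params))
    for x in (-2.0, 0.0, 3.0):
        print(x, round(mixture_density(x, params, weights), 4))
```

Because the weights sum to 1 and each kernel is itself a density, the weighted sum is also a valid density. In a fitted estimator, the kernel parameters would be chosen by an optimizer rather than set by hand.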

Goals

  • Approximate any empirical distribution
  • Build a parameterized density estimator
  • Outlier detection and dataset noise reduction

My Approach

The solution I came up with was incorporated into a Python package, KernelML. …


A Modern Genetic Optimization Method (2019)

Review of Multi-Offspring Improved Real-Coded Genetic Algorithm (MOIRCGA)

Short Overview

Genetic algorithms have been used for optimization since the 1960s. With the increased computational power of modern computers, genetic algorithms have gained attention for solving complex, non-linear objectives. …


Train a Robust Classifier using a GPU


Essential information about the covariance matrix for data scientists

Sub-Covariance Matrices

The covariance matrix can be decomposed into multiple unique (2x2)…
