Machine Learning to Love

Data Scientist, Developer, Lover of Computer Vision and Machine Learning. Technically a Neuroscientist.

Introduction to Neural Networks — March 13, 2022

Two inputs, two outputs.

Eventually comes a day when a random forest won’t cut it. You need to classify images into categories, and the preprocessing alone is killing you. Neural networks to the rescue!

In this tutorial, we’ll start from the most basic of neural networks so you gain a foundational understanding of what they are, how their layers work, and how they can be assembled with multiple inputs and outputs. We’ll build our networks and train them to classify some prepared data.

Unsupervised and Semi-Supervised learning with Images — March 5, 2022

Ada ❤ Compression

I’ve covered unsupervised learning for clustering and anomaly detection, but it has many other applications! In this notebook, we explore how it can be used to compress images at the pixel level. We also use unsupervised and semi-supervised learning to make our image classification algorithm more efficient. Check out the notebook here:

Readable Formatting for Git Log — March 2, 2022

Sometimes you need to jump back pretty far in a git repo, but this gets tough because the default behavior of git log doesn’t give you all that much info at a glance. Here are some helpful variations, and you don’t need to memorize them! Just add them to your gitconfig (described below):

git log --pretty=oneline
Prints each commit on a single line
git log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' --all
Great for branch merges
git log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset)%C(bold yellow)%d%C(reset)%n''          %C(white)%s%C(reset) %C(dim white)- %an%C(reset)' --all
Puts each commit message on its own line beneath a header with more detailed dates

Like I said, you don’t need to memorize these commands. Simply open your ~/.gitconfig file and add them as aliases under the [alias] section, like so:

[alias]
lg = log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' --all

Now, if you type git lg, git will run that entire command for you.

You can make multiple aliases for multiple commands.

Info for this bit is adapted from Slipp D Thompson’s answer here.

Unsupervised learning: Clustering and Anomaly Detection — February 25, 2022

Detecting outlier datapoints is referred to as Anomaly Detection

How does Spotify make such great playlists on the fly just based on a single song? How do credit card companies detect fraud from hundreds of thousands of accounts without using labeled training data? Unsupervised learning! Unlike supervised learning, where we train our algorithm to label data based on previous training sets, unsupervised learning can help us glean information from our data that would otherwise be hidden. I’ve put together a notebook that takes you through K-means clustering (with cluster count optimization) to identify how samples may fall into groups. You’ll also learn about Gaussian mixture models, and how they can help us with anomaly detection.

You can run all the code in Colab. Enjoy!

Quickly visualize a text corpus with WordCloud — February 22, 2022
Common metrics for evaluating natural language processing (NLP) models — February 18, 2022

Logistic regression versus binary classification?

This article is also available on Medium.

You can’t train a good model if you don’t have the right evaluation metric, and you can’t explain your model if you don’t understand the metric you’re using. So, here’s a list of common metrics which are used for ML and NLP models, along with their definitions and common applications. I’ve always had a difficult time remembering these from charts and confusion matrices, so I thought a verbal explanation might work better.

Accuracy
Denotes the fraction of predictions the model gets right out of all the predictions it makes. Best used when the output variable is categorical or discrete. For example, how often a sentiment classification algorithm is correct.

Recall
The percent of actual positive cases that the model correctly identifies: true positives divided by true positives plus false negatives. Particularly helpful when identifying positives is more important than overall accuracy. For example, when screening for a cancer that is prevalent 1% of the time, a model that always spits out “negative” will be 99% accurate, but have 0% recall.

Precision
The percent of positive predictions that are actually positive: true positives divided by true positives plus false positives. In the same example with a rare cancer that is prevalent 1% of the time, a model that makes totally random predictions (50/50) will have 50% accuracy (50/100), 50% recall (0.5/1), and 1% precision (0.5/50).
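
To make the rare-cancer arithmetic concrete, here is a minimal sketch that computes all three metrics from confusion-matrix counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The 10,000-patient cohort and the function name are assumptions made up for illustration. Note that for the always-“negative” model precision is technically undefined (no positive predictions at all), which the sketch reports as None.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominator is zero;
    # returning None in that case is a choice made for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


# Hypothetical cohort: 10,000 patients, 1% of whom (100) have the cancer.
always_negative = confusion_metrics(tp=0, fp=0, fn=100, tn=9900)

# Expected counts for a coin-flip model on the same cohort.
coin_flip = confusion_metrics(tp=50, fp=4950, fn=50, tn=4950)

print(always_negative)  # (accuracy, precision, recall)
print(coin_flip)
```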

F1 Score
Combines precision and recall into a single metric that captures both exactness and completeness: F1 = (2 * Precision * Recall) / (Precision + Recall). Often reported together with accuracy, and useful in sequence-labeling tasks, such as entity extraction, and in retrieval-based question answering.
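
The formula is short enough to write out directly. This is a sketch rather than a library implementation; returning 0.0 when both inputs are zero is an assumption chosen here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumed convention for this sketch: no precision and no recall -> 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A model with 1% precision and 50% recall still gets a very low F1,
# because the harmonic mean is dragged down by the weaker of the two.
print(f1_score(0.01, 0.5))
print(f1_score(0.9, 0.9))
```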

AUC
Area under the curve. Plots the true positive rate against the false positive rate as the prediction threshold is varied (the ROC curve), then measures the area under it. Used to measure the quality of a model independent of prediction threshold; the curve itself helps find the optimal prediction threshold for a classification task.
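
One way to build intuition: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes AUC that way by comparing every positive/negative pair, counting ties as half. It is a plain-Python illustration with made-up scores, not how a library would compute it on large data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly scored above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and model scores for four examples.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```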

MRR
Mean reciprocal rank. Evaluates ranked lists of retrieved responses by where the first correct one appears: the mean, across queries, of the reciprocal of the rank of the first correct result. Used heavily in information-retrieval tasks, including article search and e-commerce search.
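
A minimal sketch of the calculation, assuming each query is represented as a list of booleans marking whether the result at that rank is correct (a made-up representation for illustration). Queries with no correct result contribute 0, which is a common convention but still an assumption here.

```python
def mean_reciprocal_rank(ranked_relevance):
    """Mean over queries of 1 / (rank of the first relevant result)."""
    total = 0.0
    for results in ranked_relevance:
        for rank, is_relevant in enumerate(results, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)


queries = [
    [True, False, False],   # first correct result at rank 1
    [False, False, True],   # first correct result at rank 3
    [False, True, False],   # first correct result at rank 2
]
mrr = mean_reciprocal_rank(queries)
print(mrr)
```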

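MAP
Mean average precision. For each query, precision is calculated at every relevant result in the ranked list and averaged; MAP is the mean of those averages across queries. Used in information-retrieval tasks.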
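
Here is a rough sketch of that calculation, again assuming each query is a list of booleans in ranked order. One simplification to flag: average precision is normalized by the number of relevant results that were retrieved, whereas some definitions divide by all relevant items in the collection.

```python
def average_precision(results):
    """Average of precision@k taken at each rank k that holds a relevant result."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(results, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    # Assumed convention: a query with no relevant results scores 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_relevance):
    """Mean of the per-query average precision."""
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)


queries = [
    [True, False, True],    # precisions 1/1 and 2/3
    [False, True, False],   # precision 1/2
]
map_score = mean_average_precision(queries)
print(map_score)
```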

RMSE
Root mean squared error. A very common way to capture a model’s performance in a real-value prediction task, and a good way to ask “How far off from the answer am I?” Calculated as the square root of the mean of the squared errors across data points. Used in numerical prediction: temperature, stock market price, position in Euclidean space…
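
In code, the calculation reads almost exactly like the name, from the inside out: error, squared, mean, root. The temperature values below are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up actual vs. predicted temperatures.
actual = [20.0, 22.0, 25.0]
predicted = [18.0, 24.0, 25.0]
print(rmse(actual, predicted))
```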

MAPE
Mean absolute percentage error. The average of the absolute percentage error for each data point, used when the output variable is continuous. Often used in conjunction with RMSE to test the performance of regression models.
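
A short sketch of the calculation. Because each error is divided by the true value, MAPE breaks down when a true value is zero; raising an error in that case is a choice made for this example.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


# Made-up values: both predictions are off by 10% of the true value.
print(mape([100, 200], [110, 180]))
```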

BLEU
The cheese that tastes like it sounds. Also, bilingual evaluation understudy. Captures the amount of n-gram overlap between the output sentence and the reference ground-truth sentence. Has many variants, and is mainly used in machine translation tasks. Has also been adapted to text-to-text tasks such as paraphrase generation and summarization.
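
The n-gram overlap at the heart of BLEU can be sketched as a “clipped” n-gram precision: each n-gram in the output only gets credit up to the number of times it appears in the reference. This is just that one building block, assuming whitespace tokenization and a single reference; full BLEU also combines several n-gram sizes and applies a brevity penalty, which are left out here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


# "the" appears 7 times in the output but only twice in the reference,
# so only 2 of the 7 unigrams get credit.
score = clipped_ngram_precision("the the the the the the the", "the cat is on the mat")
print(score)
```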

METEOR
Metric to measure the quality of generated text that combines unigram precision and recall. Sort of a more robust BLEU. Allows synonyms and stemmed words to be matched with the reference word. Mainly used in machine translation.

ROUGE
Like BLEU and METEOR, compares the quality of generated text to reference text, but measures recall. Mainly used for summarization tasks where it’s important to evaluate how much of the reference a model can recall (recall = % of true positives out of all actual positives, that is, true positives plus false negatives).
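
For comparison with an overlap-as-precision view, here is a sketch of n-gram recall in the style of ROUGE-N: the overlap is divided by the number of n-grams in the reference instead of the number in the generated text. It assumes whitespace tokenization and a single reference, and leaves out other ROUGE variants.

```python
from collections import Counter


def ngram_counts(text, n):
    """Counter of contiguous n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    cand_counts = ngram_counts(candidate, n)
    ref_counts = ngram_counts(reference, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the cat sat on the mat"
summary = "the cat on the mat"
print(rouge_n_recall(summary, reference, n=1))  # 5 of 6 reference unigrams recalled
print(rouge_n_recall(summary, reference, n=2))  # 3 of 5 reference bigrams recalled
```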

Perplexity
Measures how confused an NLP model is; derived from cross-entropy in a next-word prediction task. Used to evaluate language models, and in language-generation tasks, such as dialog generation.
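
The link to cross-entropy is direct: perplexity is the exponential of the average negative log-probability the model assigned to each word that actually came next. A minimal sketch, where the input list of probabilities is made up for illustration:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


# A model that gives every correct next word a 1-in-4 chance is as
# "confused" as if it were picking uniformly among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```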

Of course you can find plenty more, but that’s a fairly good list when we’re talking NLP. Thanks for reading, and follow me on Twitter: @SaladZombie

Ensemble Learning —

What’s bad? A badly chosen ML algorithm. What’s better? A well-chosen ML algorithm. What’s even better? A whole bunch of algorithms working together to get an optimal result from a bunch of different predictions.

In this full-code tutorial, you’ll learn about bagging, boosting, and how random forests pull together decisions from multiple decision trees to give you a result.
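
As a taste of the idea, here is a tiny sketch of hard voting, the simplest way to pull several predictions into one. The three threshold rules are hypothetical stand-ins for trained decision trees, not anything from the tutorial itself.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction (first seen wins on a tie)."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, sample):
    """Ask every model for a prediction, then take the majority vote."""
    return majority_vote([model(sample) for model in models])


# Hypothetical "trees": each one is just a threshold rule on a single number.
models = [
    lambda x: x > 3,
    lambda x: x > 5,
    lambda x: x > 7,
]

print(ensemble_predict(models, 6))  # two of three rules say True
print(ensemble_predict(models, 4))  # two of three rules say False
```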

Support Vector Machines, Non-linear Datasets, and GridSearch — March 15, 2021

I’ve created a tutorial to take you through an SVM classifier using a nonlinear dataset – one we can’t just separate with a straight line. Here, we’re also introduced to GridSearch, a super helpful machine learning tool that automatically runs through lots of hyperparameter combinations to find the best ones for our model and dataset. Check it out!
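
Under the hood, the idea is simple: try every combination of the candidate values and keep the one that scores best. The sketch below shows just that loop in plain Python. It is not scikit-learn’s implementation, and toy_score is a hypothetical stand-in for the cross-validated accuracy a real search would measure.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, gamma):
    # Hypothetical scoring function that peaks at C=10, gamma=0.1.
    return -((C - 10) ** 2) - (gamma - 0.1) ** 2


# 4 values of C x 3 values of gamma = 12 combinations to try.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is why it pays to keep the candidate lists short.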

Multiclass Classification and Decision Boundaries — February 8, 2021
Building a prediction pipeline with the Titanic Dataset — February 3, 2021

our algorithm doesn’t account for greediness on shoddy wooden rafts

I’ve created a step-by-step tutorial which will help explain how scikit-learn can be used to build a data pre-processing pipeline. Furthermore, it shows how to load Kaggle data, do some machine learning, and make an output. Check it out! Everything works in Colab, my new BFF.