<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ahmedhani.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://ahmedhani.github.io//" rel="alternate" type="text/html" /><updated>2026-03-21T13:41:30+00:00</updated><id>https://ahmedhani.github.io//feed.xml</id><title type="html">Ahmed Hani</title><subtitle>Talking about ML/NLP/GenAI/MLOps, as well as some personal thoughts!</subtitle><entry><title type="html">The Mandatory Cherry</title><link href="https://ahmedhani.github.io//the-mandatory-cherry/" rel="alternate" type="text/html" title="The Mandatory Cherry" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://ahmedhani.github.io//the-mandatory-cherry</id><content type="html" xml:base="https://ahmedhani.github.io//the-mandatory-cherry/"><![CDATA[<p><img src="/images/the-mandatory-cherry.png" alt="The Mandatory Cherry" /></p>

<p>I love cake.</p>

<p>Not because of the cherry on top. But because of the cake itself, the dough, the cream, the layers. The cherry was always a bonus. A nice touch. Something that made a good thing a little better. But if it wasn’t there, I could still enjoy my cake just fine.</p>

<p>That’s how I used to think about AI.</p>

<p>Four years ago, AI was the cherry. It made things a little smarter, a little faster. Netflix recommended a show you might like. Your email filtered out the spam. A map found you a faster route. These were small, quiet improvements to life. Nobody told you that you <em>needed</em> them. Nobody said you were falling behind without them.</p>

<p>AI was optional. And that felt right.</p>

<hr />

<h2 id="the-kid-who-chose-the-cherry-early">The Kid Who Chose the Cherry Early</h2>

<p>I remember my college days well.</p>

<p>While most of my classmates were focused on web development and mobile apps, the “safe” paths, the ones with clear job offers waiting at the end, I was fascinated by something different. Machine learning. Artificial intelligence. The idea that a machine could learn from data and make decisions felt like magic to me. I wanted to spend my career exploring that.</p>

<p>People thought I was being unrealistic.</p>

<p>Friends, classmates, even some people who meant well would say things like: <em>“AI? That’s very niche. You won’t find a job easily. Focus on web or mobile, that’s where the market is.”</em> Some said it with concern. Some said it with a laugh. But the message was the same: <em>you are choosing the hard road for no good reason.</em></p>

<p>I chose it anyway.</p>

<p>And here is the irony that still makes me smile: the same field that people warned me would leave me unemployed is now the field that everyone is being told they <em>must</em> embrace or they will become irrelevant.</p>

<p>The cherry I picked up quietly, before anyone cared about it, is now being forced onto every plate.</p>

<p>I do not say this to feel superior. I say it because it taught me something important: the value of a thing does not change based on how many people are talking about it. AI was interesting and powerful back then. It is interesting and powerful now. What changed is not the technology. What changed is the noise around it.</p>

<p>And noise, I have learned, is rarely a good guide for important decisions.</p>

<hr />

<h2 id="something-changed">Something Changed</h2>

<p>Then, almost overnight, the story shifted.</p>

<p>It started in late 2022, when AI tools became public and easy to use. Suddenly, everyone had an opinion. Every headline. Every conference. Every LinkedIn post. The message was the same, just dressed differently each time:</p>

<p><em>“Use AI or get left behind.”</em></p>

<p><em>“AI will replace people who don’t adapt.”</em></p>

<p><em>“The future belongs to those who embrace AI now.”</em></p>

<p>And just like that, the cherry became mandatory.</p>

<p>Not because the cake stopped being good without it. But because someone, somewhere, decided that a cake without a cherry is no longer a real cake.</p>

<hr />

<h2 id="did-we-choose-this">Did We Choose This?</h2>

<p>Here is what bothers me the most: I don’t remember voting for this.</p>

<p>I don’t remember a moment where humanity sat down and said, “Yes, we want AI to be at the center of everything we do.” It just… happened. Fast. Faster than we could think about it clearly.</p>

<p>The printing press changed the world, but it took generations to settle into human life. The internet reshaped everything, but we had years to argue about what it meant. With AI, that breathing room was gone. The hype moved faster than the thinking.</p>

<p>And when something moves that fast, you have to ask: <em>who benefits from the speed?</em></p>

<p>The companies building AI tools benefit. The investors behind them benefit. The governments who want to claim they are “leading in AI” benefit.</p>

<p>But did you benefit? Did you get to choose?</p>

<hr />

<h2 id="the-productivity-trap">The Productivity Trap</h2>

<p>The most common argument you hear is this: <em>AI makes you more productive.</em></p>

<p>And maybe it does. But productive at what? For whom?</p>

<p>Productivity is not a goal. It is a tool. A means to an end. If AI helps you do more of something you deeply care about, that is wonderful. But if it just helps you do <em>more</em>: more emails, more reports, more content, without asking whether any of it matters, then you are not living better. You are just running faster on the same wheel.</p>

<p>The “be more productive” message feels empowering on the surface. But underneath it is a quiet assumption: that your value is measured by your output. And that is a very old, very tired idea dressed up in new technology.</p>

<hr />

<h2 id="what-we-are-really-losing">What We Are Really Losing</h2>

<p>Before AI became mandatory, there was something beautiful about struggling with a hard problem yourself.</p>

<p>You sat with it. You thought. You got it wrong. You tried again. And when you finally got it right, or even when you didn’t, something happened inside you. You grew. You learned how to <em>think.</em></p>

<p>When a tool starts doing that thinking for you, the shortcut is obvious. But the loss is invisible.</p>

<p>I am not saying AI is bad. I am saying that when we stop choosing it and start <em>needing</em> it, something shifts. The tool stops serving us. We start serving the tool.</p>

<hr />

<h2 id="the-question-nobody-is-asking">The Question Nobody Is Asking</h2>

<p>Here is a question I rarely hear:</p>

<p><em>Does humanity actually need AI this much?</em></p>

<p>Not “can AI help?” Yes, it can, in many situations. But <em>need</em>? In the deep sense of the word?</p>

<p>Humanity built the pyramids without AI. Shakespeare wrote without AI. We landed on the moon without AI. We fell in love, raised children, made art, and found meaning, all without AI.</p>

<p>None of that is an argument against progress. But it is a reminder that the story of human greatness was written long before the algorithm arrived. And it was written by people who had to think, to struggle, to feel.</p>

<hr />

<h2 id="my-honest-position">My Honest Position</h2>

<p>I am not against AI. I work with it every day. I have seen it solve real problems and create genuine value.</p>

<p>But I am against the pressure. The manufactured urgency. The feeling that if you pause to question whether AI belongs in a particular part of your life, you are somehow naive or falling behind.</p>

<p>The cherry was never mandatory. You can still eat the cake without it.</p>

<p>The best technologies in history found their place quietly, over time, through genuine usefulness, not through a wave of hype that made people afraid to say no.</p>

<p>We deserve the right to choose. To decide where AI fits, and where it does not. To keep some parts of our thinking, our creativity, and our struggle <em>human.</em></p>

<p>Not because we are afraid of technology. But because we know what makes us human, and we are not ready to hand it over just yet.</p>

<hr />

<p><em>The cake was always good. The cherry was always optional. Let’s not forget that.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT</title><link href="https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert/" rel="alternate" type="text/html" title="A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT" /><published>2019-07-01T00:00:00+00:00</published><updated>2019-07-01T00:00:00+00:00</updated><id>https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert</id><content type="html" xml:base="https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2019/07/01/a-study-on-cove-context2vec-elmo-ulmfit-and-bert/">AH’s Blog (WordPress)</a> on July 1, 2019, and has been migrated here.</p>
</blockquote>

<p>A research study on the models that revolutionized NLP through Transfer Learning — covering architecture, key ideas, and personal notes from implementation experience.</p>

<hr />

<h2 id="key-terminology">Key Terminology</h2>

<p><strong>Vector Space Models (VSMs):</strong> Words as unique vectors, feeding downstream ML models.</p>

<p><strong>Word Embedding:</strong> Fixed-size vectors where semantically similar words have small Euclidean distance. Foundation for Language Modeling and Machine Translation.</p>

<p><strong>Sentence Embedding:</strong> Same idea applied to full sentences.</p>

<p><strong>Language Model:</strong> Models a statistical distribution over sentences to predict the next word given context.</p>

<p><strong>Transfer Learning:</strong> Store knowledge learned on one task; reuse and optionally fine-tune it for another task.</p>

<p><strong>Multi-Task Learning:</strong> Train simultaneously on multiple subtasks; the shared representation captures generalizable knowledge.</p>

<p><strong>Domain Adaptation:</strong> A Transfer Learning subfield — adapt a model trained on a source distribution to perform well on a different target distribution.</p>

<hr />

<h2 id="context-vectors-cove">Context Vectors (CoVe)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1708.00107.pdf">arxiv.org/pdf/1708.00107.pdf</a></p>

<p>CoVe vectors are learned on top of existing word vectors (GloVe, Word2Vec, FastText) using the <strong>encoder</strong> of a Neural Machine Translation (NMT) seq2seq model trained on German→English translation. The encoder learns complex semantic relations between words in order to translate, making its hidden representations richer than static embeddings.</p>

<p><strong>Usage:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CoVe = MT-LSTM(GloVe(sentence))
</code></pre></div></div>

<p>Inspired by the success of pre-trained CNNs on ImageNet, CoVe applies the same transfer idea to NLP: train on a large task (NMT), then use the encoder as an initialization layer for downstream tasks.</p>
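<p>To make that usage concrete, here is a minimal sketch of prepending a pretrained encoder to a downstream classifier (my own illustration, assuming PyTorch; the stand-in encoder and all dimensions are made up, not taken from the CoVe release):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class CoVeStyleClassifier(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim, num_classes,
                 freeze_encoder=False):
        super().__init__()
        self.encoder = pretrained_encoder        # e.g., the NMT Bi-LSTM encoder
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False          # use as a fixed feature extractor
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, glove_vectors):
        # glove_vectors: (batch, seq_len, embed_dim) pre-trained word vectors
        states, _ = self.encoder(glove_vectors)  # contextualized, CoVe-like vectors
        return self.head(states.mean(dim=1))     # simple average pooling + classifier

# Stand-in encoder; in real use, load the weights from the NMT checkpoint.
encoder = nn.LSTM(input_size=300, hidden_size=256,
                  batch_first=True, bidirectional=True)
model = CoVeStyleClassifier(encoder, hidden_dim=256, num_classes=2)
logits = model(torch.randn(4, 12, 300))          # 4 sentences, 12 GloVe-300 tokens
</code></pre></div></div>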

<p>The paper introduced <strong>Bi-attentive Classification Network (BCN)</strong> to validate CoVe quality on tasks like Sentiment Analysis and Paraphrase Detection. BCN accepts two inputs (or duplicates one), passes them through the MT-LSTM encoder, then uses a Bi-LSTM + bi-attention architecture ending in a maxout classifier.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-3.20.03-pm.png" alt="BCN architecture" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-4.39.59-pm.png" alt="BCN results" /></p>

<p><strong>Personal notes:</strong></p>
<ul>
  <li>You don’t need BCN — just prepend the frozen (or fine-tuned) encoder to your own model.</li>
  <li>Fine-tuning is generally better than freezing to allow slight task-specific adaptation.</li>
  <li>Use FastText over GloVe when character-level distinctions matter (e.g., named entities).</li>
</ul>

<hr />

<h2 id="context-to-embeddings-context2vec">Context to Embeddings (Context2Vec)</h2>

<p><strong>Paper:</strong> <a href="https://www.aclweb.org/anthology/K16-1006">aclweb.org/anthology/K16-1006</a></p>

<p>Consider the sentence “I can’t find <strong>April</strong>.” Without context, “April” could be a month or a person. Context2Vec extends CBOW Word2Vec by replacing the simple average-of-context-vectors with a richer parametric model — a <strong>Bi-LSTM + feedforward network</strong>.</p>

<p><strong>Three-stage architecture</strong> (a code sketch follows the list):</p>
<ol>
  <li>Bi-LSTM processes left-to-right and right-to-left context.</li>
  <li>Feedforward network learns from the concatenated Bi-LSTM hidden states.</li>
  <li>Objective function (with Word2Vec negative sampling) compares output to target word embedding.</li>
</ol>
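<p>A minimal sketch of the three stages (my own illustration, assuming PyTorch; all sizes are toy values, and the concatenated Bi-LSTM states are summarized by mean pooling for brevity):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 11, 50, 64

embed = nn.Embedding(vocab_size, embed_dim)            # context word embeddings
bilstm = nn.LSTM(embed_dim, hidden_dim,                # stage 1: Bi-LSTM
                 batch_first=True, bidirectional=True)
mlp = nn.Sequential(                                   # stage 2: feedforward net
    nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, embed_dim))
target_embed = nn.Embedding(vocab_size, embed_dim)     # target word embeddings

# Stage 1: run the Bi-LSTM over the context around the target slot.
context_ids = torch.tensor([[6, 8, 9]])                # "the", "blade", "is"
states, _ = bilstm(embed(context_ids))
context_vec = mlp(states.mean(dim=1))                  # stage 2: one context vector

# Stage 3: Word2vec-style objective against the true target ("unseen" = id 7).
score = (context_vec * target_embed(torch.tensor([7]))).sum(-1)
loss = -torch.log(torch.sigmoid(score)).mean()         # plus negative-sample terms
</code></pre></div></div>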

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-5.43.01-pm.png" alt="Context2Vec vs CBOW" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-5.53.11-pm.png" alt="Context2Vec architecture" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-6.05.36-pm.png" alt="Context2Vec closest words sample" /></p>

<p><strong>Personal note:</strong> Similar to Doc2Vec, but uses Bi-LSTM instead of a plain projection layer for deeper contextual representation.</p>

<hr />

<h2 id="embeddings-from-language-models-elmo">Embeddings from Language Models (ELMo)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1802.05365.pdf">arxiv.org/pdf/1802.05365.pdf</a></p>

<p>ELMo addresses the same polysemy problem (a word’s meaning depends on context) by learning embeddings from a <strong>Bi-directional Language Model (BiLM)</strong>:</p>

<ul>
  <li><strong>Forward LM:</strong> Predict a word given the previous words — P(word | left context)</li>
  <li><strong>Backward LM:</strong> Predict a word given the following words — P(word | right context)</li>
</ul>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/07/52861-1pb5hxsxogjrnda_si4nj9q.png" alt="Bidirectional language model" /></p>

<p>Each word’s final ELMo representation is the <strong>weighted element-wise sum</strong> of:</p>
<ol>
  <li>Original word embedding (GloVe/Word2Vec/FastText)</li>
  <li>Forward LSTM hidden state</li>
  <li>Backward LSTM hidden state</li>
</ol>

<p>Weights can be task-specific (learned during fine-tuning).</p>
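<p>A minimal sketch of that weighted sum (my own illustration, assuming NumPy; the layer vectors and task weights are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def elmo_representation(layers, s, gamma=1.0):
    """layers: per-layer vectors for one word; s: raw task-specific weights."""
    w = np.exp(s) / np.exp(s).sum()     # softmax-normalize the task weights
    return gamma * sum(wi * hi for wi, hi in zip(w, layers))

word_embedding = np.array([0.1, 0.9, 0.3])   # GloVe/Word2Vec/FastText vector
forward_state  = np.array([0.4, 0.2, 0.7])   # forward LSTM hidden state
backward_state = np.array([0.6, 0.5, 0.1])   # backward LSTM hidden state

s = np.array([0.2, 1.0, 0.5])                # learned during fine-tuning
vec = elmo_representation([word_embedding, forward_state, backward_state], s)
</code></pre></div></div>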

<p><strong>Usage steps:</strong></p>
<ol>
  <li>Train BiLM on a large corpus.</li>
  <li>Freeze the BiLM encoders and attach them at the bottom of your model.</li>
  <li>Replace raw word indices with their ELMo representations.</li>
</ol>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-10.31.49-pm.png" alt="ELMo benchmark results" /></p>

<p><strong>Personal notes:</strong></p>
<ul>
  <li>Train the LM on domain-specific data for best downstream results.</li>
  <li>Deeper models or CNN character features can improve the LM quality.</li>
</ul>

<hr />

<h2 id="universal-language-model-fine-tuning-ulmfit">Universal Language Model Fine-tuning (ULMFiT)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1801.06146.pdf">arxiv.org/pdf/1801.06146.pdf</a></p>

<p>ULMFiT’s goal: one universal language model that can be fine-tuned for any classification task. The base model is <strong>AWD-LSTM</strong> — a heavily regularized LSTM targeting generalization on long sequences.</p>

<p><strong>AWD-LSTM regularization techniques</strong> (variational dropout is sketched after the list):</p>
<ul>
  <li><strong>DropConnect Mask:</strong> Randomly zeroes weight connections (not activations).</li>
  <li><strong>Variational Dropout:</strong> Same dropout mask applied at every time step within a sequence.</li>
  <li><strong>ASGD (Average SGD):</strong> Averages weights over multiple steps for more stable convergence.</li>
  <li><strong>Variable Length BPTT:</strong> Randomizes truncation length during training.</li>
</ul>
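<p>A minimal sketch of variational dropout (my own illustration, assuming NumPy): one mask is sampled per sequence and reused at every time step, unlike standard dropout, which would resample it each step.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def variational_dropout(x, p=0.5, seed=0):
    """x: (seq_len, hidden_dim). One mask, applied at all time steps."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape[1]) &gt; p) / (1.0 - p)  # scale kept units by 1/(1-p)
    return x * mask                                  # broadcasts over time steps

h = np.ones((4, 6))              # 4 time steps, 6 hidden units
print(variational_dropout(h))    # the same columns are zeroed at every step
</code></pre></div></div>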

<p><strong>ULMFiT introduces two fine-tuning innovations:</strong></p>

<p><strong>Discriminative Fine-tuning (Discr):</strong> Different layers use different learning rates, since lower layers capture more general features (should change slowly) while upper layers capture task-specific features (can change faster).</p>

<p><strong>Slanted Triangular Learning Rates (STLR):</strong> The learning rate increases quickly then decreases slowly — a specific schedule designed for fine-tuning pre-trained models.</p>
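<p>Both tricks fit in a few lines (my own sketch in plain Python; the per-layer factor of 2.6 is the paper's suggestion, the remaining values are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Discriminative fine-tuning: lower layers get smaller learning rates.
def discriminative_lrs(num_layers, top_lr, decay=2.6):
    # The paper suggests lr(layer l-1) = lr(layer l) / 2.6.
    return [top_lr / (decay ** (num_layers - 1 - l)) for l in range(num_layers)]

# Slanted Triangular Learning Rates: short linear warm-up, long linear decay.
def stlr(t, T, lr_max, cut_frac=0.1, ratio=32):
    cut = int(T * cut_frac)
    p = t / cut if t &lt; cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

print(discriminative_lrs(3, top_lr=0.01))   # roughly [0.0015, 0.0038, 0.01]
print([round(stlr(t, T=100, lr_max=0.01), 5) for t in (0, 10, 50, 99)])
</code></pre></div></div>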

<hr />

<h2 id="bert">BERT</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/abs/1810.04805">arxiv.org/abs/1810.04805</a></p>

<p>BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep Transformer encoder that redefined the state of the art across NLP benchmarks. Unlike the models above, which use LSTMs, BERT uses a multi-layer Transformer architecture with self-attention.</p>

<p><strong>Two novel pre-training objectives:</strong></p>

<p><strong>1. Masked Language Model (MLM):</strong> Randomly mask 15% of tokens in the input; train the model to predict those masked tokens. This allows truly bidirectional context — both left and right — unlike unidirectional LMs.</p>

<p><strong>2. Next Sentence Prediction (NSP):</strong> Given two sentences, predict whether sentence B actually follows sentence A in the original document. This captures inter-sentence relationships useful for QA and inference tasks.</p>
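<p>A minimal sketch of the MLM input corruption (my own illustration, assuming NumPy; the token and special-token ids are made up). The paper masks 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
MASK_ID, VOCAB = 103, 30000      # illustrative ids, not a real tokenizer's

def mask_for_mlm(tokens, mask_rate=0.15):
    tokens = tokens.copy()
    labels = np.full(len(tokens), -100)          # -100 = position not predicted
    for i in range(len(tokens)):
        if rng.random() &lt; mask_rate:
            labels[i] = tokens[i]                # the model must recover this id
            r = rng.random()
            if r &lt; 0.8:
                tokens[i] = MASK_ID              # 80%: replace with [MASK]
            elif r &lt; 0.9:
                tokens[i] = rng.integers(VOCAB)  # 10%: replace with random token
            # else: 10% keep the original token
    return tokens, labels

toks, labels = mask_for_mlm(np.array([7, 2101, 88, 4553, 19]))
</code></pre></div></div>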

<p><strong>Two model sizes:</strong></p>
<ul>
  <li>BERT-Base: 12 Transformer layers, 768 hidden units, 12 attention heads (110M parameters)</li>
  <li>BERT-Large: 24 layers, 1024 hidden units, 16 attention heads (340M parameters)</li>
</ul>

<p><strong>Fine-tuning:</strong> Add a task-specific output layer on top of BERT and fine-tune end-to-end. BERT achieved state-of-the-art on 11 NLP tasks including SQuAD, MNLI, and CoLA at time of publication.</p>

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Core Idea</th>
      <th>Architecture</th>
      <th>Key Innovation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CoVe</td>
      <td>NMT encoder as feature extractor</td>
      <td>Bi-LSTM</td>
      <td>Transfer from MT task</td>
    </tr>
    <tr>
      <td>Context2Vec</td>
      <td>BiLM-style context modeling</td>
      <td>Bi-LSTM + FF</td>
      <td>Richer CBOW context</td>
    </tr>
    <tr>
      <td>ELMo</td>
      <td>Contextual word embeddings from BiLM</td>
      <td>Stacked Bi-LSTM</td>
      <td>Per-layer weighted sum</td>
    </tr>
    <tr>
      <td>ULMFiT</td>
      <td>Universal LM fine-tuning</td>
      <td>AWD-LSTM</td>
      <td>Discr LR + STLR</td>
    </tr>
    <tr>
      <td>BERT</td>
      <td>Masked LM + NSP pre-training</td>
      <td>Transformer</td>
      <td>True bidirectionality via masking</td>
    </tr>
  </tbody>
</table>

<p>The trajectory is clear: from static word vectors → context-dependent LSTMs → attention-based Transformers. Each step brought deeper, more context-aware representations that better model language semantics.</p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1708.00107.pdf">CoVe paper</a></li>
  <li><a href="https://www.aclweb.org/anthology/K16-1006">Context2Vec paper</a></li>
  <li><a href="https://arxiv.org/pdf/1802.05365.pdf">ELMo paper</a></li>
  <li><a href="https://arxiv.org/pdf/1801.06146.pdf">ULMFiT paper</a></li>
  <li><a href="https://arxiv.org/abs/1810.04805">BERT paper</a></li>
  <li><a href="https://arxiv.org/pdf/1708.02182.pdf">AWD-LSTM paper</a></li>
  <li><a href="https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27">BiLM explanation — Medium</a></li>
  <li><a href="https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/">AWD-LSTM explanation — Yash Seth</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="bert" /><category term="elmo" /><category term="ulmfit" /><category term="cove" /><category term="context2vec" /><category term="transfer-learning" /><category term="nlp" /><category term="language-models" /><category term="word-embeddings" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on July 1, 2019, and has been migrated here.]]></summary></entry><entry><title type="html">An End-to-End Note About FCIS Graduation Project(GP)</title><link href="https://ahmedhani.github.io//newblog/" rel="alternate" type="text/html" title="An End-to-End Note About FCIS Graduation Project(GP)" /><published>2018-12-03T00:00:00+00:00</published><updated>2018-12-03T00:00:00+00:00</updated><id>https://ahmedhani.github.io//newblog</id><content type="html" xml:base="https://ahmedhani.github.io//newblog/"><![CDATA[<p>This post is inspired by a recent <a href="https://moustaphasaad.github.io/Post_About_FCIS_Graduation_Projects.html">post</a> written by my friend Mustafa Saad. It is a great post, and I recommend you read it and think about the topics and suggested projects mentioned there. I don’t completely agree with him, but Mustafa is an experienced guy whose ideas and points of view should be taken into consideration.</p>

<p>In this post, I mainly talk to the people who will join the <strong>CS</strong> department at the faculty. I joined the CS department in my 4th year, so I have some sense of how the people there think. However, this is just my point of view, and it may be right or wrong.</p>

<hr />

<p>So, you have successfully made it to your final year at college. I still believe that the 3rd year will always be the hardest and toughest year compared to the others. Most of you should have worked on some interesting topics during the previous years. Some of you may have gotten interested in Machine Learning; others may have liked Graphics, Compilers or Architecture (for real .. how could one like such Archi. stuff? It is a curse, bro!). Well, that’s great actually: it is always better to look for a specific field to focus on, either for your GP or as a career path after graduation.</p>

<p>I will mainly talk about Machine Learning based projects. Machine Learning is very trendy nowadays, and it is specifically related to the <strong>CS</strong> and <strong>SC</strong> departments. In general, several applications fall under the umbrella of Machine Learning, such as Natural Language Processing(<strong>NLP</strong>), Automatic Speech Recognition(<strong>ASR</strong>) and Computer Vision(<strong>CV</strong>). These fields work on text, signal and image data respectively.</p>

<p>The main question you need to ask yourself is: “What field am I interested in? What data type do I want to work on (text, speech signal, image, etc.)?”</p>

<p>After answering these questions, you can begin your research and survey of the topic you chose. Your target is to follow the blogs of the top universities’ research labs, which talk about their latest research and its applications.</p>

<p>For example, if you are interested in <strong>NLP</strong>, you need to identify the universities that are well known for their great efforts in <strong>NLP</strong>. You will see that Berkeley <a href="http://www.berkeley.edu/" target="_blank" rel="nofollow noopener">http://www.berkeley.edu/</a> and British Columbia <a href="https://www.ubc.ca/" target="_blank" rel="nofollow noopener">https://www.ubc.ca/</a> are popular universities in that field, so you go to their websites and see their latest papers and their participation in top conferences, such as NIPS <a href="https://nips.cc/" target="_blank" rel="nofollow noopener">https://nips.cc/</a> and EMNLP <a href="http://www.emnlp2016.net/" target="_blank" rel="nofollow noopener">http://www.emnlp2016.net/</a>.</p>

<p>Actually, in Graduation Projects (and maybe M.Sc. too), you don’t need to create something entirely new. You have two options when you begin your journey.</p>
<ul>
	<li><strong>Applied Research Project</strong></li>
</ul>
<p>In such projects, you are not looking to create something that doesn’t already exist. You only seek to learn and increase your programming skills by finding an existing project and either implementing it from scratch as a whole or focusing on a specific part that you find interesting. For such projects, you should have some existing resources to help you during the implementation phase</p>
<ul>
	<li>Papers, clear documents and useful blogs and links</li>
	<li>Open-source projects in the git community. Your implementation should include some important techniques such as Object-Oriented Programming(<strong>OOP</strong>), Data Structures(<strong>DS</strong>) and Algorithms.</li>
	<li>You will need to understand some mathematical and statistical content that may be included in the paper.</li>
</ul>
<p>I prefer this kind of project, because you learn and then implement what you learned. Also, the college likes such projects, since there is an actual output to see.</p>

<ul>
	<li><strong>Research Projects</strong></li>
</ul>
<p>You have an existing solution for a problem, but you have some <strong>theoretical</strong> enhancements in mind. In such projects, expect to run several experiments and search a lot to increase your knowledge and make sure of what you are doing. If you choose such a project, you must be ready to read a lot of theory, papers and some chapters from a reference book.</p>

<p>Also, it is preferable to already have prior background in what you want to do. A team that wants to work on such projects must have</p>
<ul>
	<li>A solid background in mathematics and statistics</li>
	<li>A liking for reading and searching</li>
	<li>The expectation of understanding a lot and coding less</li>
</ul>
<p>To be honest, I don’t prefer such projects, because</p>
<ul>
	<li>The output isn't guaranteed, and the college always expects to see an output and won't appreciate any effort without seeing one</li>
	<li>Such a deep understanding of the theoretical background is very rare at your level</li>
</ul>
<p>So, I consider this kind of projects as an <strong>unnecessary risk</strong>.</p>

<p>If you are going to work on either of these two categories of projects, you must have the following to help you finish the project</p>
<ul>
	<li><strong>Powerful machine:</strong> Machines with GPUs, which you will need to train complex models. The machines can be either online or offline, but in any case, make sure you have access to one</li>
	<li><strong>Available datasets:</strong> Make sure you have at least one dataset to run your experiments on. Avoid collecting the dataset on your own: you won't have time to gather it, and a collected dataset needs diversity to help with generalization and coverage of your patterns</li>
	<li>It is preferable to work on Linux-based operating systems instead of Windows</li>
</ul>
<p>Look .. there are some facts that I want to share with you, so that you know what you are going to face when you begin your project</p>
<ul>
	<li>The basic Machine Learning techniques are considered old school. You won't see many projects nowadays that use classic algorithms and techniques, such as <strong>Naive Bayes</strong> and <strong>Hidden Markov Models</strong>. The research community is shifting towards Deep Learning(<strong>DL</strong>). Deep Learning needs powerful machines and large datasets, and fortunately these are available and abundant compared to the past. This supports the previous notes (the available datasets and powerful machines)</li>
	<li>There are a lot of libraries that make life easier while working on projects. Commonly, the libraries are built on the Python, R and C++ programming languages. They help with training and evaluating models easily, but they have a bad side effect: their abstraction. The libraries are built as black boxes, with several algorithms and techniques running in the backend. Trust me, you can train and produce output without understanding even 30% of what is going on! The most popular libraries are Keras, Tensorflow and PyTorch.</li>
	<li>Don't expect a lot of support in the college from the TAs and Drs. It is your project and you MUST be the one who fully understands what you want to do. Just help yourself!</li>
</ul>
<p>From my experience in mentoring and supervising teams after my graduation, I found that teams get stuck on several recurring problems.</p>
<ul>
	<li>They rely on the seminars and the grades without caring much about the actual output</li>
	<li>They don't make use of the summer vacation, and their preparations aren't organized</li>
	<li>They don't divide the project into several small modules</li>
	<li>They don't get to the point; they waste their time watching courses from A to Z during the semesters</li>
</ul>
<p>I will tell you something .. if you really are planning to work on Machine Learning based projects, you <strong>must be willing to spend part of your vacations on preparation and studying</strong>. If you enter the year without any prior knowledge or without finishing a beginner course in Machine Learning, then <strong>CHOOSE SOMETHING ELSE</strong> or you will end up running some code without understanding what you are doing. At the very least, there should be one member of the team who has some knowledge about the task, so that he/she can lead the team.</p>

<p>So, once you have reached this point, here are the concluding steps that I think are a good start for anyone working on Machine Learning projects</p>
<ul>
	<li>Use the summer vacation to enroll in a Machine Learning course, and make sure to finish it before the beginning of the year and before registering the graduation project</li>
	<li>Find an interesting field of study that is closely related to ML, such as NLP or ASR</li>
	<li>Search for some of its popular topics and the current research progress around them</li>
	<li>Gather the needed materials such as papers, useful links and books</li>
	<li>Find runnable complete/incomplete open-source projects and make sure that you can install and run them on your machine. Also, check the number of stars and forks.</li>
	<li>Run and produce output from the open-source projects</li>
	<li>Implement your own code. Use either Python or C++ to write or rewrite the code. For example, you may think of implementing a Neural Network from scratch or other complex models such as Convolutional Neural Networks(<strong>CNNs</strong>) and Recurrent Neural Networks(<strong>RNNs</strong>)</li>
	<li>If you have time, you may create a desktop application or a web service as an interface for your project</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[This post is inspired by a recent post that is written by my friend Mustafa Saad. It is a great post that I recommend you to read and think about the topics and suggested projects that are mentioned on his post. Actually, I don’t completely agree with him, but Mustafa is an experienced guy whose ideas and points of view should be taken into consideration.]]></summary></entry><entry><title type="html">[Kaggle] SMS Spam Collection</title><link href="https://ahmedhani.github.io//kaggle-sms-spam-detection/" rel="alternate" type="text/html" title="[Kaggle] SMS Spam Collection" /><published>2017-06-30T00:00:00+00:00</published><updated>2017-06-30T00:00:00+00:00</updated><id>https://ahmedhani.github.io//kaggle-sms-spam-detection</id><content type="html" xml:base="https://ahmedhani.github.io//kaggle-sms-spam-detection/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/06/30/kaggle-sms-spam-collection/">AH’s Blog (WordPress)</a> on June 30, 2017, and has been migrated here.</p>
</blockquote>

<p>A short exploration and classification notebook for the <a href="https://www.kaggle.com/uciml/sms-spam-collection-dataset">SMS Spam Collection Dataset</a> on Kaggle.</p>

<p><strong>Results:</strong> loss of <strong>0.1</strong> on the test set and approximately <strong>95% accuracy</strong>.</p>

<p><strong>Notebook:</strong> <a href="https://github.com/AhmedHani/Kaggle-Machine-Learning-Competitions/blob/master/Dataset%20Exploration/SMS%20Spam%20Collection%20Dataset/sms_spam_detection.ipynb">sms_spam_detection.ipynb on GitHub</a></p>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Machine Learning" /><category term="Neural Network" /><category term="Python Notebook" /><category term="Source Code" /><category term="kaggle" /><category term="spam-detection" /><category term="nlp" /><category term="deep-learning" /><category term="python" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on June 30, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Feedback Sequence-to-Sequence Model – Gonna Reverse Them All!</title><link href="https://ahmedhani.github.io//feedback-sequence-to-sequence-model/" rel="alternate" type="text/html" title="Feedback Sequence-to-Sequence Model – Gonna Reverse Them All!" /><published>2017-06-25T00:00:00+00:00</published><updated>2017-06-26T00:00:00+00:00</updated><id>https://ahmedhani.github.io//feedback-sequence-to-sequence-model</id><content type="html" xml:base="https://ahmedhani.github.io//feedback-sequence-to-sequence-model/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/06/25/feedback-sequence-to-sequence-model-gonna-reverse-them-all/">AH’s Blog (WordPress)</a> on June 25, 2017, and has been migrated here.</p>
</blockquote>

<p><em>This tutorial assumes familiarity with Recurrent Neural Networks and Backpropagation Through Time (BPTT).</em></p>

<hr />

<h2 id="terminology">Terminology</h2>

<p><strong>One-to-one:</strong> One input word → one output word (e.g., semantic synonyms like “like” → “love”).</p>

<p><strong>One-to-many:</strong> One input → multiple outputs (e.g., hypernym relations: “vehicle” → [“car”, “bike”, “boat”]).</p>

<p><strong>Many-to-one:</strong> Multiple inputs → one output (e.g., Sentiment Analysis: sentence → polarity label).</p>

<p><strong>Many-to-many:</strong> Multiple inputs → multiple outputs (e.g., Machine Translation: English → French sentence).</p>

<p><strong>Word Embedding:</strong> Fixed-size semantic vectors for words — similar words have similar vectors.</p>

<p><strong>One-hot Encoding:</strong> Naive sparse representation — a vector of zeros with a single 1 at the word’s index. No semantic content; used here for simplicity.</p>

<hr />

<h2 id="dataset">Dataset</h2>

<p>Characters: <code class="language-plaintext highlighter-rouge">'a'</code>, <code class="language-plaintext highlighter-rouge">'b'</code>, <code class="language-plaintext highlighter-rouge">'c'</code> only. Task: reverse a string (e.g., “abc” → “cba”).</p>

<p>One-hot encodings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 0, 0] = 'a'
[0, 1, 0] = 'b'
[0, 0, 1] = 'c'
</code></pre></div></div>

<p>String “abc” is represented as the concatenated vectors [1,0,0, 0,1,0, 0,0,1].</p>

<hr />

<h2 id="encoder-decoder-architecture">Encoder-Decoder Architecture</h2>

<p><img src="https://camo.githubusercontent.com/097ae56dffeca6fb58767a8829d313e4c5fb69c1/687474703a2f2f7777312e73696e61696d672e636e2f6d773639302f36393762303730666a77316632377232346f3263746a3230656130636f3075382e6a7067" alt="Encoder-Decoder overview" /></p>

<p>Any seq2seq model has two components:</p>

<ul>
  <li><strong>Encoder:</strong> Processes the input sequence step by step, updating its hidden state. The final hidden state — the <strong>Thought Vector</strong> — is a fixed-size representation of the entire input.</li>
  <li><strong>Decoder:</strong> Initialized with the Thought Vector, generates output tokens one at a time until an END token is produced or max length is reached.</li>
</ul>

<p>This post demonstrates the <strong>Feedback Encoder-Decoder</strong> variant: the decoder’s output at time <em>t</em> becomes its input at time <em>t+1</em>.</p>

<hr />

<h2 id="the-encoder">The Encoder</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/encoder.png" alt="Encoder diagram" /></p>

<p><strong>Parameters:</strong></p>
<ul>
  <li>Batch size = 1, Input shape = 1×3</li>
  <li>Hidden layer size S(t) = 5</li>
  <li>Input-to-hidden weights <strong>Wh</strong> (3×5), Hidden-to-hidden weights <strong>Ws</strong> (5×5)</li>
  <li>Activation: ReLU — f(x) = max(0, x)</li>
</ul>

<p>Hidden state recurrence: <strong>S(t) = f(x(t) · Wh + S(t-1) · Ws)</strong>, where f is the ReLU activation.</p>

<p>Initial state S(0) = zeros (no prior memory).</p>

<p><strong>Processing “abc”:</strong></p>

<p><strong>S(1)</strong> = x(1) · Wh + 0 · Ws = <strong>[0.1, 0.2, 0.3, 0.4, 0.5]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat1.png" alt="S1 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat11.png" alt="S1 result" /></p>

<p><strong>S(2)</strong> = x(2) · Wh + S(1) · Ws = <strong>[0.98, 1.28, 1.65, 1.74, 0.58]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat2.png" alt="S2 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat22.png" alt="S2 intermediate" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat222.png" alt="S2 result" /></p>

<p><strong>S(3)</strong> = x(3) · Wh + S(2) · Ws = <strong>[1.74, 2.16, 4.56, 4.62, 3.29]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat3.png" alt="S3 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat33.png" alt="S3 intermediate" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat333.png" alt="S3 result" /></p>

<p><strong>Thought Vector = [1.74, 2.16, 4.56, 4.62, 3.29]</strong> — the encoded representation of “abc”.</p>

<blockquote>
  <p>Note: With all-positive weights, ReLU has no effect in this toy example. In practice, weights will be mixed-sign.</p>
</blockquote>
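<p>The whole encoder pass fits in a few lines of NumPy (a sketch with random weights, so the states will not match the matrices above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
Wh = rng.random((3, 5))                # input-to-hidden weights
Ws = rng.random((5, 5))                # hidden-to-hidden weights
relu = lambda v: np.maximum(0, v)

one_hot = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

S = np.zeros(5)                        # S(0): no prior memory
for ch in "abc":
    x = np.array(one_hot[ch])
    S = relu(x @ Wh + S @ Ws)          # S(t) = f(x(t)·Wh + S(t-1)·Ws)

thought_vector = S                     # fixed-size encoding of "abc"
print(thought_vector)
</code></pre></div></div>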

<hr />

<h2 id="the-decoder">The Decoder</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/decoder.png" alt="Decoder diagram" /></p>

<p>Two key differences from the encoder:</p>

<ol>
  <li><strong>Initial state</strong> = Thought Vector (the encoder’s final state).</li>
  <li><strong>Input at t+1</strong> = output produced at time t (feedback loop).</li>
</ol>

<p><strong>Additional parameters:</strong></p>
<ul>
  <li>Hidden-to-output weights <strong>Wo</strong> (5×3)</li>
  <li>Output activation: Softmax</li>
</ul>

<p>The decoder runs for the desired output length, producing one character per step. The character with the highest Softmax probability is selected (argmax), fed back as the next input, and this repeats until the reversed string is complete.</p>
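<p>A matching sketch of the feedback loop (again with random weights, so the decoded string is arbitrary; the point is the mechanics):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
Wh, Ws, Wo = rng.random((3, 5)), rng.random((5, 5)), rng.random((5, 3))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
chars = "abc"

S = rng.random(5)                  # stand-in for the encoder's Thought Vector
x = np.zeros(3)                    # first decoder input (empty/START)
decoded = ""
for _ in range(3):                 # desired output length
    S = np.maximum(0, x @ Wh + S @ Ws)
    probs = softmax(S @ Wo)
    idx = int(np.argmax(probs))    # pick the most probable character
    decoded += chars[idx]
    x = np.eye(3)[idx]             # feed the output back as the next input
print(decoded)
</code></pre></div></div>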

<blockquote>
  <p>This is a mapping problem solvable by an MLP if max string length is fixed. The seq2seq framing here is pedagogical — the real power of these models is in variable-length sequences like Machine Translation.</p>
</blockquote>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/AhmedHani/nlpeast">Source code on GitHub</a> (Thesis project)</li>
  <li><a href="/2014/10/10/data-normalization-and-standardization-for-neural-networks/">Data Normalization post</a> — for softmax background</li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="Source Code" /><category term="seq2seq" /><category term="encoder-decoder" /><category term="rnn" /><category term="lstm" /><category term="nlp" /><category term="deep-learning" /><category term="tutorial" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on June 25, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">[Thesis Tutorials II] Understanding Word2vec for Word Embedding II</title><link href="https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/" rel="alternate" type="text/html" title="[Thesis Tutorials II] Understanding Word2vec for Word Embedding II" /><published>2017-04-27T00:00:00+00:00</published><updated>2017-04-27T00:00:00+00:00</updated><id>https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii</id><content type="html" xml:base="https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/27/thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/">AH’s Blog (WordPress)</a> on April 27, 2017, and has been migrated here.</p>
</blockquote>

<p><em>Continues from <a href="/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/">Thesis Tutorials I — Understanding Word2vec for Word Embedding I</a>.</em></p>

<hr />

<h2 id="the-scalability-problem">The Scalability Problem</h2>

<p>Training Word2vec on a real corpus requires millions of unique words. Recalling the Skip-gram architecture:</p>

<p><img src="https://i.stack.imgur.com/igSuE.png" alt="Skip-gram architecture" /></p>

<p>Each Softmax output layer has <strong>V</strong> neurons — one per vocabulary word. The Softmax formula:</p>

<p><img src="https://i.stack.imgur.com/iP8Du.png" alt="Softmax formula" /></p>

<p>The denominator sums exponentials over all V words, giving <strong>O(V)</strong> complexity per output layer. With V = 1M words and multiple output layers (one per context word in Skip-gram), this becomes prohibitively expensive.</p>

<p>Two optimizations are proposed in the original Word2vec paper to address this: <strong>Hierarchical Softmax</strong> and <strong>Negative Sampling</strong>.</p>

<hr />

<h2 id="dataset-from-part-i">Dataset (from Part I)</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>D = {
  "This battle will be my masterpiece",
  "The unseen blade is the deadliest"
}
</code></pre></div></div>

<p>V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ‖V‖ = 11</p>

<hr />

<h2 id="hierarchical-softmax">Hierarchical Softmax</h2>

<p>Instead of a flat Softmax layer over all V words, Hierarchical Softmax replaces it with a <strong>binary Huffman tree</strong> whose leaves are the vocabulary words.</p>

<p>To compute the probability of a target word (e.g., “unseen” given context “the”, “blade”, “is”), we traverse the tree from root to the word’s leaf, multiplying the probabilities at each binary decision node.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/huffman1.png" alt="Huffman tree" /></p>

<p>For “unseen”, the path might be: <strong>P(right) × P(left) × P(right) × P(right)</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/right1.png" alt="Right path" /></p>

<p>or equivalently: <strong>P(left) × P(right) × P(right) × P(right)</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/left1.png" alt="Left path" /></p>

<h3 id="how-node-probabilities-are-computed">How node probabilities are computed</h3>

<p>Each internal node acts like a logistic regression unit. The input to each node is the hidden layer vector from the neural network; the output is a probability obtained via Sigmoid:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P(node) = Sigmoid(hidden_layer · W_node + b)
</code></pre></div></div>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/c3041-1443842128391.jpg" alt="Logistic node formula" /></p>

<p>Each tree layer has its own associated weight matrix, learned during training.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/loghuff1.png" alt="Hierarchical Softmax with logistic nodes" /></p>

<p>Think of it as a small neural network stacked on top of the hidden layer, where the new network’s output is the tree nodes.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/hlh1.png" alt="Hidden layer feeding tree" /></p>

<p>At each tree layer, we evaluate the node probabilities and follow the path of highest probability to arrive at the predicted word.</p>
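<p>A small sketch of scoring one word through its root-to-leaf path (my own illustration, assuming NumPy; the path and the node weight vectors are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden = rng.random(5)                    # hidden layer vector
path = [("left", rng.random(5)),          # (decision, node weight vector)
        ("right", rng.random(5)),
        ("right", rng.random(5))]

prob = 1.0
for direction, w_node in path:
    p_left = sigmoid(hidden @ w_node)     # each node is a logistic unit
    prob *= p_left if direction == "left" else (1.0 - p_left)

print(prob)   # P(word | context): a product over O(log V) node decisions
</code></pre></div></div>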

<h3 id="complexity-improvement">Complexity improvement</h3>

<ul>
  <li>Flat Softmax: <strong>O(V)</strong></li>
  <li>Hierarchical Softmax: <strong>O(log V)</strong></li>
</ul>

<p>Additionally, since P(right) + P(left) = 1, we only need to compute one branch’s probability at each node — the other is obtained by subtraction, halving the work per decision.</p>

<hr />

<h2 id="negative-sampling">Negative Sampling</h2>

<p>Negative Sampling is the more popular optimization in practice — it is the approach used in TensorFlow’s Word2vec implementation.</p>

<h3 id="the-key-insight">The key insight</h3>

<p>After computing the output layer error, a standard training pass updates all V × hidden_layer_size weights. But in practice, the actual positive word and the incorrect (negative) words are already known. There is no need to update <em>all</em> vocabulary word weights for every training example — only a small selected subset.</p>

<h3 id="how-it-works">How it works</h3>

<p>For each training example, instead of computing Softmax over all V words, we:</p>

<ol>
  <li>Include the <strong>positive word</strong> (the actual target).</li>
  <li>Sample a small number <strong>K</strong> of <strong>negative words</strong> (words that should <em>not</em> be predicted in this context).</li>
  <li>Apply Softmax only over these K+1 words.</li>
</ol>

<p>If K = 10, the output layer has only 11 neurons, and backpropagation updates only <strong>11 × hidden_layer_size</strong> weights instead of <strong>V × hidden_layer_size</strong>.</p>

<h3 id="selecting-negative-samples">Selecting negative samples</h3>

<p>Negative words are sampled based on their frequency in the corpus — more frequent words have a higher probability of being selected as negative samples. The formula is:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/we.png" alt="Negative sampling probability formula" /></p>

<p>Where <strong>c</strong> is a constant exponent set by the model creator (the original paper uses 3/4). The K negative words are then drawn at random from this weighted distribution.</p>

<blockquote>
  <p>In practice, K is typically between 5 and 20 for small datasets and 2 to 5 for large corpora. The original paper found that negative sampling yields results comparable to Hierarchical Softmax while being simpler to implement.</p>
</blockquote>
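<p>A minimal sketch of one negative-sampling update (my own illustration, assuming NumPy; all sizes are toy values, and the 3/4 exponent follows the original paper):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
V, H, K = 11, 5, 3                       # vocab size, hidden size, negatives
W_out = rng.normal(scale=0.1, size=(V, H))
counts = rng.integers(1, 100, size=V)    # word frequencies in the corpus

p = counts ** 0.75                       # frequency weighting
p = p / p.sum()                          # sampling distribution over words

hidden = rng.random(H)                   # hidden layer for the current example
pos = 7                                  # index of the true target word
neg = rng.choice(V, size=K, replace=False, p=p)   # K sampled negatives
# (a real implementation would also exclude the positive word from `neg`)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1
# Only K+1 rows of W_out are updated, instead of all V rows.
W_out[pos] += lr * (1 - sigmoid(W_out[pos] @ hidden)) * hidden
for j in neg:
    W_out[j] -= lr * sigmoid(W_out[j] @ hidden) * hidden
</code></pre></div></div>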

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Complexity</th>
      <th>How it works</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Flat Softmax</td>
      <td>O(V)</td>
      <td>Normalize over all vocabulary words</td>
    </tr>
    <tr>
      <td>Hierarchical Softmax</td>
      <td>O(log V)</td>
      <td>Binary tree path; logistic nodes per decision</td>
    </tr>
    <tr>
      <td>Negative Sampling</td>
      <td>O(K) — K ≪ V</td>
      <td>Update only positive + K sampled negatives</td>
    </tr>
  </tbody>
</table>

<p>Both techniques make training large-vocabulary Word2vec models feasible. Negative Sampling is generally preferred for its simplicity and strong empirical performance.</p>

<p><em>Diagrams created with <a href="http://www.draw.io">draw.io</a>.</em></p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1411.2738.pdf">Word2vec Parameter Learning Explained (Rong, 2014)</a></li>
  <li><a href="http://cs224d.stanford.edu/syllabus.html">CS224d NLP Course — Stanford</a></li>
  <li><a href="https://www.tensorflow.org/tutorials/word2vec">TensorFlow Word2vec Tutorial</a></li>
  <li><a href="http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/">Word2vec Tutorial Part 2: Negative Sampling — McCormick</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="word2vec" /><category term="word-embedding" /><category term="nlp" /><category term="hierarchical-softmax" /><category term="negative-sampling" /><category term="skip-gram" /><category term="optimization" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 27, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">[Thesis Tutorials I] Understanding Word2vec for Word Embedding I</title><link href="https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i/" rel="alternate" type="text/html" title="[Thesis Tutorials I] Understanding Word2vec for Word Embedding I" /><published>2017-04-25T00:00:00+00:00</published><updated>2017-04-25T00:00:00+00:00</updated><id>https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i</id><content type="html" xml:base="https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/">AH’s Blog (WordPress)</a> on April 25, 2017, and has been migrated here.</p>
</blockquote>

<hr />

<h2 id="key-terms">Key Terms</h2>

<p><strong>Vector Space Models (VSMs):</strong> Words represented as unique vectors, used as input to mathematical/statistical ML models.</p>

<p><strong>Word Embedding:</strong> Fixed-size vector representations where semantically similar words have geometrically close vectors (small Euclidean distance). Used in Language Modeling, Machine Translation, and many NLP tasks.</p>

<p><strong>Shallow Neural Networks:</strong> Neural networks with exactly 1 hidden (projection) layer, producing a new feature representation of the input.</p>

<hr />

<h2 id="one-hot-encoding">One-Hot Encoding</h2>

<p>The naive baseline. For a vocabulary V = {I, like, playing, football, basketball} (‖V‖ = 5):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I          = [1, 0, 0, 0, 0]
like       = [0, 1, 0, 0, 0]
playing    = [0, 0, 1, 0, 0]
football   = [0, 0, 0, 1, 0]
basketball = [0, 0, 0, 0, 1]
</code></pre></div></div>

<p><strong>Pros:</strong> Simple, deterministic.</p>

<p><strong>Cons:</strong> Vector size = vocabulary size (1M words → 1M-dim vectors). No semantic information — football and basketball are equally “distant” from each other as from “I”.</p>

<p>Use this when semantic relations don’t matter and vocabulary size is manageable.</p>

<hr />

<h2 id="word2vec-philosophy">Word2vec Philosophy</h2>

<p>Word2vec represents a word using the words that surround it. Given:</p>

<blockquote>
  <p>“I like playing <strong>X</strong>”</p>
</blockquote>

<p>Even without knowing what X is, the context (“like”, “playing”) tells us it’s something enjoyable and playable. This is exactly how humans infer meaning from context. Word2vec formalizes this: train a shallow neural network to predict context from a target word (or vice versa), and the learned weights become the word vectors.</p>

<hr />

<h2 id="dataset">Dataset</h2>

<p>Corpus D:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"This battle will be my masterpiece"
"The unseen blade is the deadliest"
</code></pre></div></div>

<p>Vocabulary V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ‖V‖ = 11</p>

<p>One-hot vectors (size 11) are assigned per word for use in the network.</p>

<hr />

<h2 id="skip-gram-model">Skip-gram Model</h2>

<p><img src="https://i.stack.imgur.com/igSuE.png" alt="Skip-gram architecture" /></p>

<p><strong>Task:</strong> Given a target word, predict its N surrounding context words.</p>

<p><strong>Architecture:</strong> Input = one-hot vector of target word → 1 hidden (projection) layer → N Softmax output layers (one per context word to predict).</p>

<p><strong>Example</strong> — target: “unseen”, context window = 3, embedding dimension = 3.</p>

<p><strong>Input → Hidden (Wh, shape 11×3):</strong></p>

<p>After feedforwarding “unseen” (a one-hot vector, so the multiplication simply selects the corresponding row of Wh), hidden layer H = [0.8, 0.4, 0.5]. This is the initial embedding. Every row of Wh is the current embedding of one vocabulary word.</p>

<p><strong>Hidden → Output:</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho3.png" alt="Hidden to output 1" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho2.png" alt="Hidden to output 2" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho1.png" alt="Hidden to output 3" /></p>

<p>Apply Softmax to each output vector, take the argmax index → predicted context words.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/soft.png" alt="Softmax outputs" /></p>

<p>During training, errors from all N Softmax layers are averaged and backpropagated to update Wh. Repeat until max epochs or target loss is reached. The final input-to-hidden weight matrix is the word embedding.</p>

<blockquote>
  <p>Alternatively, the average of the Hidden-to-output weight matrices can serve as embeddings — but the input-to-hidden matrix is the standard choice.</p>
</blockquote>
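<p>A minimal sketch of one Skip-gram forward pass (my own illustration, assuming NumPy; the weights are random, so the predictions will not match the figures above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

V, D, N = 11, 3, 3                      # vocab, embedding dim, context words
rng = np.random.default_rng(0)
Wh = rng.random((V, D))                 # input-to-hidden: one row per word
Wo = [rng.random((D, V)) for _ in range(N)]   # one output matrix per position

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

target = 7                              # index of "unseen"
h = Wh[target]                          # one-hot input just selects row 7
predictions = [int(np.argmax(softmax(h @ W))) for W in Wo]
print(predictions)                      # indices of the 3 predicted context words
</code></pre></div></div>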

<hr />

<h2 id="continuous-bag-of-words-cbow-model">Continuous Bag of Words (CBOW) Model</h2>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Cbow.png/283px-Cbow.png" alt="CBOW architecture" /></p>

<p><strong>Task:</strong> Given N context words, predict the target word.</p>

<p><strong>Architecture:</strong> N input one-hot vectors → 1 hidden layer (mean of input projections) → 1 Softmax output.</p>

<p><strong>Example</strong> — target: “unseen”, context: “the”, “blade”, “is”.</p>

<p><strong>Input → Hidden:</strong></p>

<p>The hidden layer is the average of each context word’s projection:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihc.png" alt="CBOW input projections" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihccc.png" alt="CBOW average computation" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihcc.png" alt="CBOW hidden values" /></p>

<p>H = [(0.8+0.2+0.2)/3, (0.9+0.8+0.3)/3, (0.1+0.9+0.7)/3] = [0.39, 0.66, 0.56]</p>

<p><strong>Hidden → Output:</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/hoc.png" alt="CBOW hidden to output" /></p>

<p>Apply Softmax → with these initial weights the predicted word is “masterpiece”. Backpropagate the error to update the weights.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Word2vec is the bridge from symbolic NLP to semantic deep learning. Traditional rule-based systems fail to generalize across languages and cannot capture semantic similarity. One-hot encodings are equally blind to meaning. Word2vec vectors encode semantic proximity — enabling downstream models to reason about language.</p>

<p>Both Skip-gram and CBOW produce the same type of output (word embeddings) but differ in architecture: Skip-gram predicts context from a target; CBOW predicts a target from context. Skip-gram generally performs better on infrequent words; CBOW is faster to train on large corpora.</p>

<p><em>Part II covers negative sampling, hierarchical softmax, and practical training details.</em></p>]]></content><author><name></name></author><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="word2vec" /><category term="word-embedding" /><category term="nlp" /><category term="skip-gram" /><category term="cbow" /><category term="vsm" /><category term="neural-network" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 25, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Hello, Valor!</title><link href="https://ahmedhani.github.io//hello-valor/" rel="alternate" type="text/html" title="Hello, Valor!" /><published>2017-04-05T00:00:00+00:00</published><updated>2017-04-05T00:00:00+00:00</updated><id>https://ahmedhani.github.io//hello-valor</id><content type="html" xml:base="https://ahmedhani.github.io//hello-valor/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/05/hello-valor/">AH’s Blog (WordPress)</a> on April 5, 2017, and has been migrated here.</p>
</blockquote>

<p>Being a researcher and programmer can make life quite monotonous. After two months of the same routine — work, study, freelance tasks, waiting for Thursday to meet friends — I decided to break the cycle by learning something completely new.</p>

<p>I love listening to music; I can barely work or study without it. But I’d never thought of myself as someone who could play. So I enrolled in a course and bought a violin. I’m currently at Level 1 out of 7, and I’m excited to see how far I can go by the end of the year.</p>

<p>Welcome, <strong>Valor</strong> 🎻</p>

<p><a href="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/img_0945.jpg"><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/img_0945.jpg" alt="Valor the violin" /></a></p>]]></content><author><name></name></author><category term="Life Events" /><category term="life" /><category term="music" /><category term="violin" /><category term="personal" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 5, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory</title><link href="https://ahmedhani.github.io//generative-adversarial-networks-overview/" rel="alternate" type="text/html" title="Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory" /><published>2017-01-17T00:00:00+00:00</published><updated>2017-01-17T00:00:00+00:00</updated><id>https://ahmedhani.github.io//generative-adversarial-networks-overview</id><content type="html" xml:base="https://ahmedhani.github.io//generative-adversarial-networks-overview/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/01/17/generative-adversarial-networks-when-deep-learning-meets-game-theory/">AH’s Blog (WordPress)</a> on January 17, 2017, and has been migrated here.</p>
</blockquote>

<p>Before diving into Generative Adversarial Networks (GANs), a few foundational concepts are worth establishing.</p>

<hr />

<h2 id="key-concepts">Key Concepts</h2>

<p><strong>Discriminative Models</strong> predict a hidden class given observed features. They model the conditional probability <strong>P(y | x₁, x₂, …, xₙ)</strong>. Examples: SVMs, Feedforward Neural Networks.</p>

<p><strong>Generative Models</strong> learn the joint distribution of features and classes — <strong>P(x₁, x₂, …, xₙ, y)</strong> — enabling them to generate new samples from the learned distribution. Examples: Restricted Boltzmann Machines (RBMs), HMMs. Note: Vanilla Auto-encoders are <em>not</em> generative models (they reconstruct); Variational Auto-encoders (VAEs) are.</p>

<p><strong>Nash Equilibrium</strong> (Game Theory): A stable game state in which no player can improve their outcome by unilaterally changing strategy, given the other players’ strategies. Each player is satisfied with their result given the others’ choices.</p>

<p><strong>Minimax</strong>: An algorithm for two-player games where each player tries to minimize the maximum possible loss the opponent can inflict. Used in Chess, Tic-Tac-Toe, Connect-4, and other rule-based decision games.</p>

<hr />

<h2 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h2>

<p><img src="https://i0.wp.com/www.kdnuggets.com/wp-content/uploads/generative-adversarial-network.png" alt="GAN architecture" /></p>

<p>A GAN consists of two models competing during training:</p>

<ul>
  <li><strong>Generator (G):</strong> Produces fake samples intended to match the distribution of real data.</li>
  <li><strong>Discriminator (D):</strong> Learns to distinguish real samples from the Generator’s fakes.</li>
</ul>

<p>The dynamic is adversarial — G tries to fool D; D tries to catch G. This is precisely the Minimax setup: each player attempts to minimize the worst outcome the other can produce.</p>

<p>Training continues iteratively until G produces samples that D can no longer reliably distinguish from real data. When neither model can improve its outcome by changing strategy unilaterally, the system has reached <strong>Nash Equilibrium</strong>; at that point D’s best response is to output 1/2 for every sample.</p>

<p>During training, a shared loss function drives both models, but each model’s parameters are updated independently via backpropagation; neither model can directly modify the other’s weights.</p>
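<p>A minimal PyTorch sketch of this adversarial loop, on a toy 1-D problem (the architecture, data distribution, and hyperparameters here are arbitrary illustrative choices, not the original paper’s setup):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

# Toy GAN: G learns to turn noise into samples from N(4, 1.25).
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0  # samples from the real distribution
    fake = G(torch.randn(64, 8))            # samples from the Generator

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool D, using the standard flipped-label trick.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward 4.0
</code></pre></div></div>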

<hr />

<h2 id="status">Status</h2>

<p>This was an overview written while still learning GANs. The follow-up post applies the concepts in more detail: <a href="/2017/02/17/generative-adversarial-networks-2-camouflage-your-predator/">GANs Part 2 — Camouflage your Predator!</a></p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1701.00160v1.pdf">Goodfellow et al., NIPS 2016 Tutorial on GANs</a></li>
  <li><a href="http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html">KDnuggets: GANs Overview</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Generative_adversarial_networks">Wikipedia: Generative Adversarial Networks</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Minimax">Wikipedia: Minimax</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Wikipedia: Nash Equilibrium</a></li>
  <li>Stuart Russell and Peter Norvig, <em>Artificial Intelligence: A Modern Approach</em></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Reinforcement Learning" /><category term="gan" /><category term="generative-models" /><category term="deep-learning" /><category term="game-theory" /><category term="nash-equilibrium" /><category term="minimax" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on January 17, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Another LSTM Tutorial</title><link href="https://ahmedhani.github.io//another-lstm-tutorial/" rel="alternate" type="text/html" title="Another LSTM Tutorial" /><published>2016-10-09T00:00:00+00:00</published><updated>2016-10-09T00:00:00+00:00</updated><id>https://ahmedhani.github.io//another-lstm-tutorial</id><content type="html" xml:base="https://ahmedhani.github.io//another-lstm-tutorial/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2016/10/09/another-lstm-tutorial/">AH’s Blog (WordPress)</a> on October 9, 2016, and has been migrated here.</p>
</blockquote>

<p><em>Figures in this post are taken from Christopher Olah’s excellent <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a> blog post.</em></p>

<hr />

<h2 id="recurrent-neural-networks">Recurrent Neural Networks</h2>

<p>Recurrent Neural Networks (RNNs) are designed for sequential data: data in which the order of, and dependencies between, elements matter. Traditional Multi-layer Perceptrons (MLPs) assume inputs are independent of one another, an assumption that breaks down for text or audio.</p>

<p>RNNs contain <strong>self-loops</strong> that carry the previous hidden state forward, allowing the network to “remember” what it has seen.</p>

<p><img src="https://i0.wp.com/colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png" alt="RNN single unit" /></p>

<p>Unrolled over time, the RNN resembles a deep feedforward network where each step receives both the current input and the previous hidden state:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/1.png" alt="Unrolled RNN" /></p>

<hr />

<h2 id="the-long-term-dependencies-problem">The Long-term Dependencies Problem</h2>

<p>Standard RNNs have no mechanism to selectively forget irrelevant context. For a sentence like:</p>

<blockquote>
  <p>“I live in France, I like playing football with my friends and going to school, <strong>I speak French</strong>”</p>
</blockquote>

<p>Predicting “French” requires connecting back to “I live in France”, but the two intermediate clauses introduce noise. Regular RNNs struggle to bridge such long-range dependencies, and this is the main motivation behind <strong>LSTM</strong>.</p>

<hr />

<h2 id="what-is-lstm">What is LSTM?</h2>

<p>Long Short-Term Memory (LSTM) is a variant of RNN that controls the memory process through <strong>gates</strong> within each unit. These gates regulate what information to retain, update, or forget, allowing the network to maintain relevant long-range context.</p>

<p>The analogy: when reading a novel, your brain selectively remembers important events (subject, previous action) while discarding irrelevant details. LSTMs simulate this selective memory.</p>

<hr />

<h2 id="lstm-unit-structure">LSTM Unit Structure</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/2.png" alt="LSTM unit" /></p>

<p>A standard LSTM unit receives the current input x_t together with:</p>
<ul>
  <li><strong>2 recurrent inputs:</strong> previous cell state C_{t-1} and previous output h_{t-1}</li>
  <li><strong>4 layers:</strong> 3 sigmoid activations + 1 tanh activation</li>
  <li><strong>5 pointwise operators:</strong> 3 multiplications, 1 addition, 1 tanh</li>
  <li><strong>2 outputs:</strong> current cell state C_t and current output h_t</li>
</ul>

<p>The <strong>cell state</strong> is the memory backbone. It flows through the unit with minimal modification unless the gates decide to change it.</p>

<hr />

<h2 id="detailed-processing-3-groups">Detailed Processing: 3 Groups</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/21.png" alt="LSTM overview" /></p>

<h3 id="group-11--forget-gate">Group 1.1 — Forget Gate</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/6.png" alt="Forget gate" /></p>

<p>The <strong>forget gate layer</strong> (sigmoid) decides what to discard from the previous cell state. Output of 0 → forget everything; values closer to 1 → retain.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/7.png" alt="Forget gate formula" /></p>

<h3 id="group-12--applying-forget-to-previous-state">Group 1.2 — Applying Forget to Previous State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/8.png" alt="Forget gate × state" /></p>

<p>Element-wise multiply the forget gate output with C_{t-1}. A vector of zeros means we wipe all previous memory.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/9.png" alt="Forget application formula" /></p>

<h3 id="group-21--input-gate-and-candidate-state">Group 2.1 — Input Gate and Candidate State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/10.png" alt="Input gate" /></p>

<p>The <strong>input gate layer</strong> (sigmoid) decides which state values to update. A <strong>tanh</strong> layer generates the candidate new state values to potentially add.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/12.png" alt="Candidate state formula" /></p>

<h3 id="group-22--scaling-new-state">Group 2.2 — Scaling New State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/13.png" alt="Scaling" /></p>

<p>Multiply the candidate state by the input gate output to filter which new information actually gets written.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/15.png" alt="Scaled formula" /></p>

<h3 id="combining-groups-1--2--new-cell-state">Combining Groups 1 + 2 → New Cell State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/16.png" alt="New state" /></p>

<p>Add the filtered old state (Group 1) and filtered new information (Group 2) to get C_t.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/17.png" alt="New state formula" /></p>

<h3 id="group-3--output-gate">Group 3 — Output Gate</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/19.png" alt="Output gate" /></p>

<p>A sigmoid layer decides which parts of the state to output. The state is passed through tanh (to keep values in [-1, 1]) and multiplied element-wise by the sigmoid output.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/20.png" alt="Output formula" /></p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>LSTMs have proven themselves across a wide range of tasks: Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization, and Question Answering. The gating mechanism is what makes them capable of learning which context to carry forward and which to discard.</p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Christopher Olah: Understanding LSTMs</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Wikipedia: LSTM</a></li>
  <li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&amp;rep=rep1&amp;type=pdf">Hochreiter &amp; Schmidhuber, 1997 (original LSTM paper)</a></li>
  <li><a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">WildML: RNN Tutorial Part 1</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="lstm" /><category term="rnn" /><category term="deep-learning" /><category term="nlp" /><category term="neural-network" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on October 9, 2016, and has been migrated here.]]></summary></entry></feed>