<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ahmedhani.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://ahmedhani.github.io//" rel="alternate" type="text/html" /><updated>2026-03-21T13:41:30+00:00</updated><id>https://ahmedhani.github.io//feed.xml</id><title type="html">Ahmed Hani</title><subtitle>Talking about ML/NLP/GenAI/MLOps, as well as some personal thoughts!</subtitle><entry><title type="html">The Mandatory Cherry</title><link href="https://ahmedhani.github.io//the-mandatory-cherry/" rel="alternate" type="text/html" title="The Mandatory Cherry" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://ahmedhani.github.io//the-mandatory-cherry</id><content type="html" xml:base="https://ahmedhani.github.io//the-mandatory-cherry/"><![CDATA[<p><img src="/images/the-mandatory-cherry.png" alt="The Mandatory Cherry" /></p>

<p>I love cake.</p>

<p>Not because of the cherry on top. But because of the cake itself, the dough, the cream, the layers. The cherry was always a bonus. A nice touch. Something that made a good thing a little better. But if it wasn’t there, I could still enjoy my cake just fine.</p>

<p>That’s how I used to think about AI.</p>

<p>Four years ago, AI was the cherry. It made things a little smarter, a little faster. Netflix recommended a show you might like. Your email filtered out the spam. A map found you a faster route. These were small, quiet improvements to life. Nobody told you that you <em>needed</em> them. Nobody said you were falling behind without them.</p>

<p>AI was optional. And that felt right.</p>

<hr />

<h2 id="the-kid-who-chose-the-cherry-early">The Kid Who Chose the Cherry Early</h2>

<p>I remember my college days well.</p>

<p>While most of my classmates were focused on web development and mobile apps, the “safe” paths, the ones with clear job offers waiting at the end, I was fascinated by something different. Machine learning. Artificial intelligence. The idea that a machine could learn from data and make decisions felt like magic to me. I wanted to spend my career exploring that.</p>

<p>People thought I was being unrealistic.</p>

<p>Friends, classmates, even some people who meant well would say things like: <em>“AI? That’s very niche. You won’t find a job easily. Focus on web or mobile, that’s where the market is.”</em> Some said it with concern. Some said it with a laugh. But the message was the same: <em>you are choosing the hard road for no good reason.</em></p>

<p>I chose it anyway.</p>

<p>And here is the irony that still makes me smile: the same field that people warned me would leave me unemployed is now the field that everyone is being told they <em>must</em> embrace or they will become irrelevant.</p>

<p>The cherry I picked up quietly, before anyone cared about it, is now being forced onto every plate.</p>

<p>I do not say this to feel superior. I say it because it taught me something important: the value of a thing does not change based on how many people are talking about it. AI was interesting and powerful back then. It is interesting and powerful now. What changed is not the technology. What changed is the noise around it.</p>

<p>And noise, I have learned, is rarely a good guide for important decisions.</p>

<hr />

<h2 id="something-changed">Something Changed</h2>

<p>Then, almost overnight, the story shifted.</p>

<p>It started in late 2022, when AI tools became public and easy to use. Suddenly, everyone had an opinion. Every headline. Every conference. Every LinkedIn post. The message was the same, just dressed differently each time:</p>

<p><em>“Use AI or get left behind.”</em></p>

<p><em>“AI will replace people who don’t adapt.”</em></p>

<p><em>“The future belongs to those who embrace AI now.”</em></p>

<p>And just like that, the cherry became mandatory.</p>

<p>Not because the cake stopped being good without it. But because someone, somewhere, decided that a cake without a cherry is no longer a real cake.</p>

<hr />

<h2 id="did-we-choose-this">Did We Choose This?</h2>

<p>Here is what bothers me the most: I don’t remember voting for this.</p>

<p>I don’t remember a moment where humanity sat down and said, “Yes, we want AI to be at the center of everything we do.” It just… happened. Fast. Faster than we could think about it clearly.</p>

<p>The printing press changed the world, but it took generations to settle into human life. The internet reshaped everything, but we had years to argue about what it meant. With AI, that breathing room was gone. The hype moved faster than the thinking.</p>

<p>And when something moves that fast, you have to ask: <em>who benefits from the speed?</em></p>

<p>The companies building AI tools benefit. The investors behind them benefit. The governments who want to claim they are “leading in AI” benefit.</p>

<p>But did you benefit? Did you get to choose?</p>

<hr />

<h2 id="the-productivity-trap">The Productivity Trap</h2>

<p>The most common argument you hear is this: <em>AI makes you more productive.</em></p>

<p>And maybe it does. But productive at what? For whom?</p>

<p>Productivity is not a goal. It is a tool. A means to an end. If AI helps you do more of something you deeply care about, that is wonderful. But if it just helps you do <em>more</em>: more emails, more reports, more content, without asking whether any of it matters, then you are not living better. You are just running faster on the same wheel.</p>

<p>The “be more productive” message feels empowering on the surface. But underneath it is a quiet assumption: that your value is measured by your output. And that is a very old, very tired idea dressed up in new technology.</p>

<hr />

<h2 id="what-we-are-really-losing">What We Are Really Losing</h2>

<p>Before AI became mandatory, there was something beautiful about struggling with a hard problem yourself.</p>

<p>You sat with it. You thought. You got it wrong. You tried again. And when you finally got it right, or even when you didn’t, something happened inside you. You grew. You learned how to <em>think.</em></p>

<p>When a tool starts doing that thinking for you, the shortcut is obvious. But the loss is invisible.</p>

<p>I am not saying AI is bad. I am saying that when we stop choosing it and start <em>needing</em> it, something shifts. The tool stops serving us. We start serving the tool.</p>

<hr />

<h2 id="the-question-nobody-is-asking">The Question Nobody Is Asking</h2>

<p>Here is a question I rarely hear:</p>

<p><em>Does humanity actually need AI this much?</em></p>

<p>Not “can AI help?” Yes, it can, in many situations. But <em>need</em>? In the deep sense of the word?</p>

<p>Humanity built the pyramids without AI. Shakespeare wrote without AI. We landed on the moon without AI. We fell in love, raised children, made art, and found meaning, all without AI.</p>

<p>None of that is an argument against progress. But it is a reminder that the story of human greatness was written long before the algorithm arrived. And it was written by people who had to think, to struggle, to feel.</p>

<hr />

<h2 id="my-honest-position">My Honest Position</h2>

<p>I am not against AI. I work with it every day. I have seen it solve real problems and create genuine value.</p>

<p>But I am against the pressure. The manufactured urgency. The feeling that if you pause to question whether AI belongs in a particular part of your life, you are somehow naive or falling behind.</p>

<p>The cherry was never mandatory. You can still eat the cake without it.</p>

<p>The best technologies in history found their place quietly, over time, through genuine usefulness, not through a wave of hype that made people afraid to say no.</p>

<p>We deserve the right to choose. To decide where AI fits, and where it does not. To keep some parts of our thinking, our creativity, and our struggle <em>human.</em></p>

<p>Not because we are afraid of technology. But because we know what makes us human, and we are not ready to hand it over just yet.</p>

<hr />

<p><em>The cake was always good. The cherry was always optional. Let’s not forget that.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT</title><link href="https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert/" rel="alternate" type="text/html" title="A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT" /><published>2019-07-01T00:00:00+00:00</published><updated>2019-07-01T00:00:00+00:00</updated><id>https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert</id><content type="html" xml:base="https://ahmedhani.github.io//a-study-on-cove-context2vec-elmo-ulmfit-and-bert/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2019/07/01/a-study-on-cove-context2vec-elmo-ulmfit-and-bert/">AH’s Blog (WordPress)</a> on July 1, 2019, and has been migrated here.</p>
</blockquote>

<p>A research study on the models that revolutionized NLP through Transfer Learning — covering architecture, key ideas, and personal notes from implementation experience.</p>

<hr />

<h2 id="key-terminology">Key Terminology</h2>

<p><strong>Vector Space Models (VSMs):</strong> Words as unique vectors, feeding downstream ML models.</p>

<p><strong>Word Embedding:</strong> Fixed-size vectors where semantically similar words have small Euclidean distance. Foundation for Language Modeling and Machine Translation.</p>

<p><strong>Sentence Embedding:</strong> Same idea applied to full sentences.</p>

<p><strong>Language Model:</strong> Models a statistical distribution over sentences to predict the next word given context.</p>

<p><strong>Transfer Learning:</strong> Store knowledge learned on one task; reuse and optionally fine-tune it for another task.</p>

<p><strong>Multi-Task Learning:</strong> Train simultaneously on multiple subtasks; the shared representation captures generalizable knowledge.</p>

<p><strong>Domain Adaptation:</strong> A Transfer Learning subfield — adapt a model trained on a source distribution to perform well on a different target distribution.</p>

<hr />

<h2 id="context-vectors-cove">Context Vectors (CoVe)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1708.00107.pdf">arxiv.org/pdf/1708.00107.pdf</a></p>

<p>CoVe vectors are learned on top of existing word vectors (GloVe, Word2Vec, FastText) using the <strong>encoder</strong> of a Neural Machine Translation (NMT) seq2seq model trained on German→English translation. The encoder learns complex semantic relations between words in order to translate, making its hidden representations richer than static embeddings.</p>

<p><strong>Usage:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CoVe = MT-LSTM(GloVe(sentence))
</code></pre></div></div>

<p>Inspired by the success of pre-trained CNNs on ImageNet, CoVe applies the same transfer idea to NLP: train on a large task (NMT), then use the encoder as an initialization layer for downstream tasks.</p>
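<p>To make that usage concrete, here is a minimal sketch of prepending a pretrained encoder to a downstream classifier (my own illustration, assuming PyTorch; the stand-in encoder and all dimensions are made up, not taken from the CoVe release):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class CoVeStyleClassifier(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim, num_classes,
                 freeze_encoder=False):
        super().__init__()
        self.encoder = pretrained_encoder        # e.g., the NMT Bi-LSTM encoder
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False          # use as a fixed feature extractor
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, glove_vectors):
        # glove_vectors: (batch, seq_len, embed_dim) pre-trained word vectors
        states, _ = self.encoder(glove_vectors)  # contextualized, CoVe-like vectors
        return self.head(states.mean(dim=1))     # simple average pooling + classifier

# Stand-in encoder; in real use, load the weights from the NMT checkpoint.
encoder = nn.LSTM(input_size=300, hidden_size=256,
                  batch_first=True, bidirectional=True)
model = CoVeStyleClassifier(encoder, hidden_dim=256, num_classes=2)
logits = model(torch.randn(4, 12, 300))          # 4 sentences, 12 GloVe-300 tokens
</code></pre></div></div>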

<p>The paper introduced <strong>Bi-attentive Classification Network (BCN)</strong> to validate CoVe quality on tasks like Sentiment Analysis and Paraphrase Detection. BCN accepts two inputs (or duplicates one), passes them through the MT-LSTM encoder, then uses a Bi-LSTM + bi-attention architecture ending in a maxout classifier.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-3.20.03-pm.png" alt="BCN architecture" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-4.39.59-pm.png" alt="BCN results" /></p>

<p><strong>Personal notes:</strong></p>
<ul>
  <li>You don’t need BCN — just prepend the frozen (or fine-tuned) encoder to your own model.</li>
  <li>Fine-tuning is generally better than freezing to allow slight task-specific adaptation.</li>
  <li>Use FastText over GloVe when character-level distinctions matter (e.g., named entities).</li>
</ul>

<hr />

<h2 id="context-to-embeddings-context2vec">Context to Embeddings (Context2Vec)</h2>

<p><strong>Paper:</strong> <a href="https://www.aclweb.org/anthology/K16-1006">aclweb.org/anthology/K16-1006</a></p>

<p>Consider the sentence “I can’t find <strong>April</strong>.” Without context, “April” could be a month or a person. Context2Vec extends CBOW Word2Vec by replacing the simple average-of-context-vectors with a richer parametric model — a <strong>Bi-LSTM + feedforward network</strong>.</p>

<p><strong>Three-stage architecture</strong> (a code sketch follows the list):</p>
<ol>
  <li>Bi-LSTM processes left-to-right and right-to-left context.</li>
  <li>Feedforward network learns from the concatenated Bi-LSTM hidden states.</li>
  <li>Objective function (with Word2Vec negative sampling) compares output to target word embedding.</li>
</ol>
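<p>A minimal sketch of the three stages (my own illustration, assuming PyTorch; all sizes are toy values, and the concatenated Bi-LSTM states are summarized by mean pooling for brevity):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 11, 50, 64

embed = nn.Embedding(vocab_size, embed_dim)            # context word embeddings
bilstm = nn.LSTM(embed_dim, hidden_dim,                # stage 1: Bi-LSTM
                 batch_first=True, bidirectional=True)
mlp = nn.Sequential(                                   # stage 2: feedforward net
    nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, embed_dim))
target_embed = nn.Embedding(vocab_size, embed_dim)     # target word embeddings

# Stage 1: run the Bi-LSTM over the context around the target slot.
context_ids = torch.tensor([[6, 8, 9]])                # "the", "blade", "is"
states, _ = bilstm(embed(context_ids))
context_vec = mlp(states.mean(dim=1))                  # stage 2: one context vector

# Stage 3: Word2vec-style objective against the true target ("unseen" = id 7).
score = (context_vec * target_embed(torch.tensor([7]))).sum(-1)
loss = -torch.log(torch.sigmoid(score)).mean()         # plus negative-sample terms
</code></pre></div></div>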

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-5.43.01-pm.png" alt="Context2Vec vs CBOW" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-5.53.11-pm.png" alt="Context2Vec architecture" /></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-6.05.36-pm.png" alt="Context2Vec closest words sample" /></p>

<p><strong>Personal note:</strong> Similar to Doc2Vec, but uses Bi-LSTM instead of a plain projection layer for deeper contextual representation.</p>

<hr />

<h2 id="embeddings-from-language-models-elmo">Embeddings from Language Models (ELMo)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1802.05365.pdf">arxiv.org/pdf/1802.05365.pdf</a></p>

<p>ELMo addresses the same polysemy problem (a word’s meaning depends on context) by learning embeddings from a <strong>Bi-directional Language Model (BiLM)</strong>:</p>

<ul>
  <li><strong>Forward LM:</strong> Predict a word given the previous words — P(word | left context)</li>
  <li><strong>Backward LM:</strong> Predict a word given the following words — P(word | right context)</li>
</ul>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/07/52861-1pb5hxsxogjrnda_si4nj9q.png" alt="Bidirectional language model" /></p>

<p>Each word’s final ELMo representation is the <strong>weighted element-wise sum</strong> of:</p>
<ol>
  <li>Original word embedding (GloVe/Word2Vec/FastText)</li>
  <li>Forward LSTM hidden state</li>
  <li>Backward LSTM hidden state</li>
</ol>

<p>Weights can be task-specific (learned during fine-tuning).</p>
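<p>A minimal sketch of that weighted sum (my own illustration, assuming NumPy; the layer vectors and task weights are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def elmo_representation(layers, s, gamma=1.0):
    """layers: per-layer vectors for one word; s: raw task-specific weights."""
    w = np.exp(s) / np.exp(s).sum()     # softmax-normalize the task weights
    return gamma * sum(wi * hi for wi, hi in zip(w, layers))

word_embedding = np.array([0.1, 0.9, 0.3])   # GloVe/Word2Vec/FastText vector
forward_state  = np.array([0.4, 0.2, 0.7])   # forward LSTM hidden state
backward_state = np.array([0.6, 0.5, 0.1])   # backward LSTM hidden state

s = np.array([0.2, 1.0, 0.5])                # learned during fine-tuning
vec = elmo_representation([word_embedding, forward_state, backward_state], s)
</code></pre></div></div>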

<p><strong>Usage steps:</strong></p>
<ol>
  <li>Train BiLM on a large corpus.</li>
  <li>Freeze the BiLM encoders and attach them at the bottom of your model.</li>
  <li>Replace raw word indices with their ELMo representations.</li>
</ol>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2019/06/screen-shot-2019-06-29-at-10.31.49-pm.png" alt="ELMo benchmark results" /></p>

<p><strong>Personal notes:</strong></p>
<ul>
  <li>Train the LM on domain-specific data for best downstream results.</li>
  <li>Deeper models or CNN character features can improve the LM quality.</li>
</ul>

<hr />

<h2 id="universal-language-model-fine-tuning-ulmfit">Universal Language Model Fine-tuning (ULMFiT)</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/1801.06146.pdf">arxiv.org/pdf/1801.06146.pdf</a></p>

<p>ULMFiT’s goal: one universal language model that can be fine-tuned for any classification task. The base model is <strong>AWD-LSTM</strong> — a heavily regularized LSTM targeting generalization on long sequences.</p>

<p><strong>AWD-LSTM regularization techniques</strong> (variational dropout is sketched after the list):</p>
<ul>
  <li><strong>DropConnect Mask:</strong> Randomly zeroes weight connections (not activations).</li>
  <li><strong>Variational Dropout:</strong> Same dropout mask applied at every time step within a sequence.</li>
  <li><strong>ASGD (Average SGD):</strong> Averages weights over multiple steps for more stable convergence.</li>
  <li><strong>Variable Length BPTT:</strong> Randomizes truncation length during training.</li>
</ul>
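<p>A minimal sketch of variational dropout (my own illustration, assuming NumPy): one mask is sampled per sequence and reused at every time step, unlike standard dropout, which would resample it each step.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def variational_dropout(x, p=0.5, seed=0):
    """x: (seq_len, hidden_dim). One mask, applied at all time steps."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape[1]) &gt; p) / (1.0 - p)  # scale kept units by 1/(1-p)
    return x * mask                                  # broadcasts over time steps

h = np.ones((4, 6))              # 4 time steps, 6 hidden units
print(variational_dropout(h))    # the same columns are zeroed at every step
</code></pre></div></div>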

<p><strong>ULMFiT introduces two fine-tuning innovations:</strong></p>

<p><strong>Discriminative Fine-tuning (Discr):</strong> Different layers use different learning rates, since lower layers capture more general features (should change slowly) while upper layers capture task-specific features (can change faster).</p>

<p><strong>Slanted Triangular Learning Rates (STLR):</strong> The learning rate increases quickly then decreases slowly — a specific schedule designed for fine-tuning pre-trained models.</p>
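<p>Both tricks fit in a few lines (my own sketch in plain Python; the per-layer factor of 2.6 is the paper's suggestion, the remaining values are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Discriminative fine-tuning: lower layers get smaller learning rates.
def discriminative_lrs(num_layers, top_lr, decay=2.6):
    # The paper suggests lr(layer l-1) = lr(layer l) / 2.6.
    return [top_lr / (decay ** (num_layers - 1 - l)) for l in range(num_layers)]

# Slanted Triangular Learning Rates: short linear warm-up, long linear decay.
def stlr(t, T, lr_max, cut_frac=0.1, ratio=32):
    cut = int(T * cut_frac)
    p = t / cut if t &lt; cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

print(discriminative_lrs(3, top_lr=0.01))   # roughly [0.0015, 0.0038, 0.01]
print([round(stlr(t, T=100, lr_max=0.01), 5) for t in (0, 10, 50, 99)])
</code></pre></div></div>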

<hr />

<h2 id="bert">BERT</h2>

<p><strong>Paper:</strong> <a href="https://arxiv.org/abs/1810.04805">arxiv.org/abs/1810.04805</a></p>

<p>BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep Transformer encoder that redefined the state of the art across NLP benchmarks. Unlike the models above, which use LSTMs, BERT uses a multi-layer Transformer architecture with self-attention.</p>

<p><strong>Two novel pre-training objectives:</strong></p>

<p><strong>1. Masked Language Model (MLM):</strong> Randomly mask 15% of tokens in the input; train the model to predict those masked tokens. This allows truly bidirectional context — both left and right — unlike unidirectional LMs.</p>

<p><strong>2. Next Sentence Prediction (NSP):</strong> Given two sentences, predict whether sentence B actually follows sentence A in the original document. This captures inter-sentence relationships useful for QA and inference tasks.</p>
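<p>A minimal sketch of the MLM input corruption (my own illustration, assuming NumPy; the token and special-token ids are made up). The paper masks 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
MASK_ID, VOCAB = 103, 30000      # illustrative ids, not a real tokenizer's

def mask_for_mlm(tokens, mask_rate=0.15):
    tokens = tokens.copy()
    labels = np.full(len(tokens), -100)          # -100 = position not predicted
    for i in range(len(tokens)):
        if rng.random() &lt; mask_rate:
            labels[i] = tokens[i]                # the model must recover this id
            r = rng.random()
            if r &lt; 0.8:
                tokens[i] = MASK_ID              # 80%: replace with [MASK]
            elif r &lt; 0.9:
                tokens[i] = rng.integers(VOCAB)  # 10%: replace with random token
            # else: 10% keep the original token
    return tokens, labels

toks, labels = mask_for_mlm(np.array([7, 2101, 88, 4553, 19]))
</code></pre></div></div>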

<p><strong>Two model sizes:</strong></p>
<ul>
  <li>BERT-Base: 12 Transformer layers, 768 hidden units, 12 attention heads (110M parameters)</li>
  <li>BERT-Large: 24 layers, 1024 hidden units, 16 attention heads (340M parameters)</li>
</ul>

<p><strong>Fine-tuning:</strong> Add a task-specific output layer on top of BERT and fine-tune end-to-end. BERT achieved state-of-the-art on 11 NLP tasks including SQuAD, MNLI, and CoLA at time of publication.</p>

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Core Idea</th>
      <th>Architecture</th>
      <th>Key Innovation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CoVe</td>
      <td>NMT encoder as feature extractor</td>
      <td>Bi-LSTM</td>
      <td>Transfer from MT task</td>
    </tr>
    <tr>
      <td>Context2Vec</td>
      <td>BiLM-style context modeling</td>
      <td>Bi-LSTM + FF</td>
      <td>Richer CBOW context</td>
    </tr>
    <tr>
      <td>ELMo</td>
      <td>Contextual word embeddings from BiLM</td>
      <td>Stacked Bi-LSTM</td>
      <td>Per-layer weighted sum</td>
    </tr>
    <tr>
      <td>ULMFiT</td>
      <td>Universal LM fine-tuning</td>
      <td>AWD-LSTM</td>
      <td>Discr LR + STLR</td>
    </tr>
    <tr>
      <td>BERT</td>
      <td>Masked LM + NSP pre-training</td>
      <td>Transformer</td>
      <td>True bidirectionality via masking</td>
    </tr>
  </tbody>
</table>

<p>The trajectory is clear: from static word vectors → context-dependent LSTMs → attention-based Transformers. Each step brought deeper, more context-aware representations that better model language semantics.</p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1708.00107.pdf">CoVe paper</a></li>
  <li><a href="https://www.aclweb.org/anthology/K16-1006">Context2Vec paper</a></li>
  <li><a href="https://arxiv.org/pdf/1802.05365.pdf">ELMo paper</a></li>
  <li><a href="https://arxiv.org/pdf/1801.06146.pdf">ULMFiT paper</a></li>
  <li><a href="https://arxiv.org/abs/1810.04805">BERT paper</a></li>
  <li><a href="https://arxiv.org/pdf/1708.02182.pdf">AWD-LSTM paper</a></li>
  <li><a href="https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27">BiLM explanation — Medium</a></li>
  <li><a href="https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/">AWD-LSTM explanation — Yash Seth</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="bert" /><category term="elmo" /><category term="ulmfit" /><category term="cove" /><category term="context2vec" /><category term="transfer-learning" /><category term="nlp" /><category term="language-models" /><category term="word-embeddings" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on July 1, 2019, and has been migrated here.]]></summary></entry><entry><title type="html">An End-to-End Note About FCIS Graduation Project(GP)</title><link href="https://ahmedhani.github.io//newblog/" rel="alternate" type="text/html" title="An End-to-End Note About FCIS Graduation Project(GP)" /><published>2018-12-03T00:00:00+00:00</published><updated>2018-12-03T00:00:00+00:00</updated><id>https://ahmedhani.github.io//newblog</id><content type="html" xml:base="https://ahmedhani.github.io//newblog/"><![CDATA[<p>This post is inspired by a recent <a href="https://moustaphasaad.github.io/Post_About_FCIS_Graduation_Projects.html">post</a> written by my friend Mustafa Saad. It is a great post, and I recommend you read it and think about the topics and suggested projects mentioned there. I don’t completely agree with him, but Mustafa is an experienced guy whose ideas and points of view should be taken into consideration.</p>

<p>In this post, I mainly talk to the people who will join the <strong>CS</strong> department at the faculty. I joined the CS department in my 4th year, so I have some sense of how the people there think. However, this is just my point of view, and it may be right or wrong.</p>

<hr />

<p>So, you have successfully made it to your final year at college. I still believe that the 3rd year will always be the hardest and toughest year compared to the others. Most of you should have worked on some interesting topics during the previous years. Some of you may have gotten interested in Machine Learning; others may have liked Graphics, Compilers or Architecture (for real .. how could one like such Archi. stuff? It is a curse, bro!). Well, that’s great actually: it is always better to look for a specific field to focus on, either for your GP or as a career path after graduation.</p>

<p>I will mainly talk about Machine Learning based projects. Machine Learning is very trendy nowadays, and it is specifically related to the <strong>CS</strong> and <strong>SC</strong> departments. In general, several applications fall under the umbrella of Machine Learning, such as Natural Language Processing(<strong>NLP</strong>), Automatic Speech Recognition(<strong>ASR</strong>) and Computer Vision(<strong>CV</strong>). These fields work on text, signal and image data respectively.</p>

<p>The main question you need to ask yourself is: “What field am I interested in? What data type do I want to work on (text, speech signal, image, etc.)?”</p>

<p>After answering these questions, you can begin your research and survey of the topic you chose. Your target is to follow the blogs of the top universities’ research labs, which talk about their latest research and its applications.</p>

<p>For example, if you are interested in <strong>NLP</strong>, you need to identify the universities that are well known for their great efforts in <strong>NLP</strong>. You will see that Berkeley <a href="http://www.berkeley.edu/" target="_blank" rel="nofollow noopener">http://www.berkeley.edu/</a> and British Columbia <a href="https://www.ubc.ca/" target="_blank" rel="nofollow noopener">https://www.ubc.ca/</a> are popular universities in that field, so you go to their websites and see their latest papers and their participation in top conferences, such as NIPS <a href="https://nips.cc/" target="_blank" rel="nofollow noopener">https://nips.cc/</a> and EMNLP <a href="http://www.emnlp2016.net/" target="_blank" rel="nofollow noopener">http://www.emnlp2016.net/</a>.</p>

<p>Actually, in Graduation Projects (and maybe M.Sc. too), you don’t need to create something entirely new. You have two options when you begin your journey.</p>
<ul>
	<li><strong>Applied Research Project</strong></li>
</ul>
<p>In such projects, you are not looking to create something that doesn’t already exist. You only seek to learn and increase your programming skills by finding an existing project and either implementing it from scratch as a whole or focusing on a specific part that you find interesting. For such projects, you should have some existing resources to help you during the implementation phase</p>
<ul>
	<li>Papers, clear documents and useful blogs and links</li>
	<li>Open-source projects in the git community. Your implementation should include some important techniques such as Object-Oriented Programming(<strong>OOP</strong>), Data Structures(<strong>DS</strong>) and Algorithms.</li>
	<li>You will need to understand some mathematical and statistical content that may be included in the paper.</li>
</ul>
<p>I prefer this kind of project, because you learn and then implement what you learned. Also, the college likes such projects, since there is an actual output to see.</p>

<ul>
	<li><strong>Research Projects</strong></li>
</ul>
<p>You have an existing solution for a problem, but you have some <strong>theoretical</strong> enhancements in mind. In such projects, expect to run several experiments and search a lot to increase your knowledge and make sure of what you are doing. If you choose such a project, you must be ready to read a lot of theory, papers and some chapters from a reference book.</p>

<p>Also, it is preferable to already have prior background in what you want to do. A team that wants to work on such projects must have</p>
<ul>
	<li>A solid background in mathematics and statistics</li>
	<li>A liking for reading and searching</li>
	<li>The expectation of understanding a lot and coding less</li>
</ul>
<p>To be honest, I don’t prefer such projects, because</p>
<ul>
	<li>The output isn't guaranteed, and the college always expects to see an output and won't appreciate any effort without seeing one</li>
	<li>Such a deep understanding of the theoretical background is very rare at your level</li>
</ul>
<p>So, I consider this kind of projects as an <strong>unnecessary risk</strong>.</p>

<p>If you are going to work on either of these two categories of projects, you must have the following to help you finish the project</p>
<ul>
	<li><strong>Powerful machine:</strong> Machines with GPUs, which you will need to train complex models. The machines can be either online or offline, but in any case, make sure you have access to one</li>
	<li><strong>Available datasets:</strong> Make sure you have at least one dataset to run your experiments on. Avoid collecting the dataset on your own: you won't have time to gather it, and a collected dataset needs diversity to help with generalization and coverage of your patterns</li>
	<li>It is preferable to work on Linux-based operating systems instead of Windows</li>
</ul>
<p>Look .. there are some facts that I want to share with you, so that you know what you are going to face when you begin your project</p>
<ul>
	<li>The basic Machine Learning techniques are considered old school. You won't see many projects nowadays that use classic algorithms and techniques, such as <strong>Naive Bayes</strong> and <strong>Hidden Markov Models</strong>. The research community is shifting towards Deep Learning(<strong>DL</strong>). Deep Learning needs powerful machines and large datasets, and fortunately these are available and abundant compared to the past. This supports the previous notes (the available datasets and powerful machines)</li>
	<li>There are a lot of libraries that make life easier while working on projects. Commonly, the libraries are built on the Python, R and C++ programming languages. They help with training and evaluating models easily, but they have a bad side effect: their abstraction. The libraries are built as black boxes, with several algorithms and techniques running in the backend. Trust me, you can train and produce output without understanding even 30% of what is going on! The most popular libraries are Keras, Tensorflow and PyTorch.</li>
	<li>Don't expect a lot of support in the college from the TAs and Drs. It is your project and you MUST be the one who fully understands what you want to do. Just help yourself!</li>
</ul>
<p>From my experience in mentoring and supervising teams after my graduation, I found that teams get stuck on several recurring problems.</p>
<ul>
	<li>They rely on the seminars and the grades without caring much about the actual output</li>
	<li>They don't make use of the summer vacation, and their preparations aren't organized</li>
	<li>They don't divide the project into several small modules</li>
	<li>They don't get to the point; they waste their time watching courses from A to Z during the semesters</li>
</ul>
<p>I will tell you something .. if you really are planning to work on Machine Learning based projects, you <strong>must be willing to spend part of your vacations on preparation and studying</strong>. If you enter the year without any prior knowledge or without finishing a beginner course in Machine Learning, then <strong>CHOOSE SOMETHING ELSE</strong> or you will end up running some code without understanding what you are doing. At the very least, there should be one member of the team who has some knowledge about the task, so that he/she can lead the team.</p>

<p>So, once you have reached this point, here are the concluding steps that I think are a good start for anyone working on Machine Learning projects</p>
<ul>
	<li>Use the summer vacation to enroll in a Machine Learning course, and make sure to finish it before the beginning of the year and before registering the graduation project</li>
	<li>Find an interesting field of study that is closely related to ML, such as NLP or ASR</li>
	<li>Search for some of its popular topics and the current research progress around them</li>
	<li>Gather the needed materials such as papers, useful links and books</li>
	<li>Find runnable complete/incomplete open-source projects and make sure that you can install and run them on your machine. Also, check the number of stars and forks.</li>
	<li>Run and produce output from the open-source projects</li>
	<li>Implement your own code. Use either Python or C++ to write or rewrite the code. For example, you may think of implementing a Neural Network from scratch or other complex models such as Convolutional Neural Networks(<strong>CNNs</strong>) and Recurrent Neural Networks(<strong>RNNs</strong>)</li>
	<li>If you have time, you may create a desktop application or a web service as an interface for your project</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[This post is inspired by a recent post that is written by my friend Mustafa Saad. It is a great post that I recommend you to read and think about the topics and suggested projects that are mentioned on his post. Actually, I don’t completely agree with him, but Mustafa is an experienced guy whose ideas and points of view should be taken into consideration.]]></summary></entry><entry><title type="html">[Kaggle] SMS Spam Collection</title><link href="https://ahmedhani.github.io//kaggle-sms-spam-detection/" rel="alternate" type="text/html" title="[Kaggle] SMS Spam Collection" /><published>2017-06-30T00:00:00+00:00</published><updated>2017-06-30T00:00:00+00:00</updated><id>https://ahmedhani.github.io//kaggle-sms-spam-detection</id><content type="html" xml:base="https://ahmedhani.github.io//kaggle-sms-spam-detection/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/06/30/kaggle-sms-spam-collection/">AH’s Blog (WordPress)</a> on June 30, 2017, and has been migrated here.</p>
</blockquote>

<p>A short exploration and classification notebook for the <a href="https://www.kaggle.com/uciml/sms-spam-collection-dataset">SMS Spam Collection Dataset</a> on Kaggle.</p>

<p><strong>Results:</strong> loss of <strong>0.1</strong> on the test set and approximately <strong>95% accuracy</strong>.</p>

<p><strong>Notebook:</strong> <a href="https://github.com/AhmedHani/Kaggle-Machine-Learning-Competitions/blob/master/Dataset%20Exploration/SMS%20Spam%20Collection%20Dataset/sms_spam_detection.ipynb">sms_spam_detection.ipynb on GitHub</a></p>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Machine Learning" /><category term="Neural Network" /><category term="Python Notebook" /><category term="Source Code" /><category term="kaggle" /><category term="spam-detection" /><category term="nlp" /><category term="deep-learning" /><category term="python" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on June 30, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Feedback Sequence-to-Sequence Model – Gonna Reverse Them All!</title><link href="https://ahmedhani.github.io//feedback-sequence-to-sequence-model/" rel="alternate" type="text/html" title="Feedback Sequence-to-Sequence Model – Gonna Reverse Them All!" /><published>2017-06-25T00:00:00+00:00</published><updated>2017-06-26T00:00:00+00:00</updated><id>https://ahmedhani.github.io//feedback-sequence-to-sequence-model</id><content type="html" xml:base="https://ahmedhani.github.io//feedback-sequence-to-sequence-model/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/06/25/feedback-sequence-to-sequence-model-gonna-reverse-them-all/">AH’s Blog (WordPress)</a> on June 25, 2017, and has been migrated here.</p>
</blockquote>

<p><em>This tutorial assumes familiarity with Recurrent Neural Networks and Backpropagation Through Time (BPTT).</em></p>

<hr />

<h2 id="terminology">Terminology</h2>

<p><strong>One-to-one:</strong> One input word → one output word (e.g., semantic synonyms like “like” → “love”).</p>

<p><strong>One-to-many:</strong> One input → multiple outputs (e.g., hypernym relations: “vehicle” → [“car”, “bike”, “boat”]).</p>

<p><strong>Many-to-one:</strong> Multiple inputs → one output (e.g., Sentiment Analysis: sentence → polarity label).</p>

<p><strong>Many-to-many:</strong> Multiple inputs → multiple outputs (e.g., Machine Translation: English → French sentence).</p>

<p><strong>Word Embedding:</strong> Fixed-size semantic vectors for words — similar words have similar vectors.</p>

<p><strong>One-hot Encoding:</strong> Naive sparse representation — a vector of zeros with a single 1 at the word’s index. No semantic content; used here for simplicity.</p>

<hr />

<h2 id="dataset">Dataset</h2>

<p>Characters: <code class="language-plaintext highlighter-rouge">'a'</code>, <code class="language-plaintext highlighter-rouge">'b'</code>, <code class="language-plaintext highlighter-rouge">'c'</code> only. Task: reverse a string (e.g., “abc” → “cba”).</p>

<p>One-hot encodings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 0, 0] = 'a'
[0, 1, 0] = 'b'
[0, 0, 1] = 'c'
</code></pre></div></div>

<p>String “abc” is represented as the concatenated vectors [1,0,0, 0,1,0, 0,0,1].</p>

<hr />

<h2 id="encoder-decoder-architecture">Encoder-Decoder Architecture</h2>

<p><img src="https://camo.githubusercontent.com/097ae56dffeca6fb58767a8829d313e4c5fb69c1/687474703a2f2f7777312e73696e61696d672e636e2f6d773639302f36393762303730666a77316632377232346f3263746a3230656130636f3075382e6a7067" alt="Encoder-Decoder overview" /></p>

<p>Any seq2seq model has two components:</p>

<ul>
  <li><strong>Encoder:</strong> Processes the input sequence step by step, updating its hidden state. The final hidden state — the <strong>Thought Vector</strong> — is a fixed-size representation of the entire input.</li>
  <li><strong>Decoder:</strong> Initialized with the Thought Vector, generates output tokens one at a time until an END token is produced or max length is reached.</li>
</ul>

<p>This post demonstrates the <strong>Feedback Encoder-Decoder</strong> variant: the decoder’s output at time <em>t</em> becomes its input at time <em>t+1</em>.</p>

<hr />

<h2 id="the-encoder">The Encoder</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/encoder.png" alt="Encoder diagram" /></p>

<p><strong>Parameters:</strong></p>
<ul>
  <li>Batch size = 1, Input shape = 1×3</li>
  <li>Hidden layer size S(t) = 5</li>
  <li>Input-to-hidden weights <strong>Wh</strong> (3×5), Hidden-to-hidden weights <strong>Ws</strong> (5×5)</li>
  <li>Activation: ReLU — f(x) = max(0, x)</li>
</ul>

<p>Hidden state recurrence: <strong>S(t) = f(x(t) · Wh + S(t-1) · Ws)</strong>, where f is the ReLU activation.</p>

<p>Initial state S(0) = zeros (no prior memory).</p>

<p><strong>Processing “abc”:</strong></p>

<p><strong>S(1)</strong> = x(1) · Wh + 0 · Ws = <strong>[0.1, 0.2, 0.3, 0.4, 0.5]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat1.png" alt="S1 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat11.png" alt="S1 result" /></p>

<p><strong>S(2)</strong> = x(2) · Wh + S(1) · Ws = <strong>[0.98, 1.28, 1.65, 1.74, 0.58]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat2.png" alt="S2 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat22.png" alt="S2 intermediate" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat222.png" alt="S2 result" /></p>

<p><strong>S(3)</strong> = x(3) · Wh + S(2) · Ws = <strong>[1.74, 2.16, 4.56, 4.62, 3.29]</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat3.png" alt="S3 matrices" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat33.png" alt="S3 intermediate" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/mat333.png" alt="S3 result" /></p>

<p><strong>Thought Vector = [1.74, 2.16, 4.56, 4.62, 3.29]</strong> — the encoded representation of “abc”.</p>

<blockquote>
  <p>Note: With all-positive weights, ReLU has no effect in this toy example. In practice, weights will be mixed-sign.</p>
</blockquote>
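<p>The whole encoder pass fits in a few lines of NumPy (a sketch with random weights, so the states will not match the matrices above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
Wh = rng.random((3, 5))                # input-to-hidden weights
Ws = rng.random((5, 5))                # hidden-to-hidden weights
relu = lambda v: np.maximum(0, v)

one_hot = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

S = np.zeros(5)                        # S(0): no prior memory
for ch in "abc":
    x = np.array(one_hot[ch])
    S = relu(x @ Wh + S @ Ws)          # S(t) = f(x(t)·Wh + S(t-1)·Ws)

thought_vector = S                     # fixed-size encoding of "abc"
print(thought_vector)
</code></pre></div></div>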

<hr />

<h2 id="the-decoder">The Decoder</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/06/decoder.png" alt="Decoder diagram" /></p>

<p>Two key differences from the encoder:</p>

<ol>
  <li><strong>Initial state</strong> = Thought Vector (the encoder’s final state).</li>
  <li><strong>Input at t+1</strong> = output produced at time t (feedback loop).</li>
</ol>

<p><strong>Additional parameters:</strong></p>
<ul>
  <li>Hidden-to-output weights <strong>Wo</strong> (5×3)</li>
  <li>Output activation: Softmax</li>
</ul>

<p>The decoder runs for the desired output length, producing one character per step. The character with the highest Softmax probability is selected (argmax), fed back as the next input, and this repeats until the reversed string is complete.</p>
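<p>A matching sketch of the feedback loop (again with random weights, so the decoded string is arbitrary; the point is the mechanics):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
Wh, Ws, Wo = rng.random((3, 5)), rng.random((5, 5)), rng.random((5, 3))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
chars = "abc"

S = rng.random(5)                  # stand-in for the encoder's Thought Vector
x = np.zeros(3)                    # first decoder input (empty/START)
decoded = ""
for _ in range(3):                 # desired output length
    S = np.maximum(0, x @ Wh + S @ Ws)
    probs = softmax(S @ Wo)
    idx = int(np.argmax(probs))    # pick the most probable character
    decoded += chars[idx]
    x = np.eye(3)[idx]             # feed the output back as the next input
print(decoded)
</code></pre></div></div>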

<blockquote>
  <p>This is a mapping problem solvable by an MLP if max string length is fixed. The seq2seq framing here is pedagogical — the real power of these models is in variable-length sequences like Machine Translation.</p>
</blockquote>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/AhmedHani/nlpeast">Source code on GitHub</a> (Thesis project)</li>
  <li><a href="/2014/10/10/data-normalization-and-standardization-for-neural-networks/">Data Normalization post</a> — for softmax background</li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="Source Code" /><category term="seq2seq" /><category term="encoder-decoder" /><category term="rnn" /><category term="lstm" /><category term="nlp" /><category term="deep-learning" /><category term="tutorial" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on June 25, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">[Thesis Tutorials II] Understanding Word2vec for Word Embedding II</title><link href="https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/" rel="alternate" type="text/html" title="[Thesis Tutorials II] Understanding Word2vec for Word Embedding II" /><published>2017-04-27T00:00:00+00:00</published><updated>2017-04-27T00:00:00+00:00</updated><id>https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii</id><content type="html" xml:base="https://ahmedhani.github.io//thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/27/thesis-tutorials-ii-understanding-word2vec-for-word-embedding-ii/">AH’s Blog (WordPress)</a> on April 27, 2017, and has been migrated here.</p>
</blockquote>

<p><em>Continues from <a href="/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/">Thesis Tutorials I — Understanding Word2vec for Word Embedding I</a>.</em></p>

<hr />

<h2 id="the-scalability-problem">The Scalability Problem</h2>

<p>Training Word2vec on a real corpus requires millions of unique words. Recalling the Skip-gram architecture:</p>

<p><img src="https://i.stack.imgur.com/igSuE.png" alt="Skip-gram architecture" /></p>

<p>Each Softmax output layer has <strong>V</strong> neurons — one per vocabulary word. The Softmax formula:</p>

<p><img src="https://i.stack.imgur.com/iP8Du.png" alt="Softmax formula" /></p>

<p>The denominator sums exponentials over all V words, giving <strong>O(V)</strong> complexity per output layer. With V = 1M words and multiple output layers (one per context word in Skip-gram), this becomes prohibitively expensive.</p>

<p>Two optimizations are proposed in the original Word2vec paper to address this: <strong>Hierarchical Softmax</strong> and <strong>Negative Sampling</strong>.</p>

<hr />

<h2 id="dataset-from-part-i">Dataset (from Part I)</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>D = {
  "This battle will be my masterpiece",
  "The unseen blade is the deadliest"
}
</code></pre></div></div>

<p>V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ‖V‖ = 11</p>

<hr />

<h2 id="hierarchical-softmax">Hierarchical Softmax</h2>

<p>Instead of a flat Softmax layer over all V words, Hierarchical Softmax replaces it with a <strong>binary Huffman tree</strong> whose leaves are the vocabulary words.</p>

<p>To compute the probability of a target word (e.g., “unseen” given context “the”, “blade”, “is”), we traverse the tree from root to the word’s leaf, multiplying the probabilities at each binary decision node.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/huffman1.png" alt="Huffman tree" /></p>

<p>For “unseen”, the path might be: <strong>P(right) × P(left) × P(right) × P(right)</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/right1.png" alt="Right path" /></p>

<p>or equivalently: <strong>P(left) × P(right) × P(right) × P(right)</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/left1.png" alt="Left path" /></p>

<h3 id="how-node-probabilities-are-computed">How node probabilities are computed</h3>

<p>Each internal node acts like a logistic regression unit. The input to each node is the hidden layer vector from the neural network; the output is a probability obtained via Sigmoid:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P(node) = Sigmoid(hidden_layer · W_node + b)
</code></pre></div></div>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/c3041-1443842128391.jpg" alt="Logistic node formula" /></p>

<p>Each tree layer has its own associated weight matrix, learned during training.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/loghuff1.png" alt="Hierarchical Softmax with logistic nodes" /></p>

<p>Think of it as a small neural network stacked on top of the hidden layer, where the new network’s output is the tree nodes.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/hlh1.png" alt="Hidden layer feeding tree" /></p>

<p>At each tree layer, we evaluate the node probabilities and follow the path of highest probability to arrive at the predicted word.</p>
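<p>A small sketch of scoring one word through its root-to-leaf path (my own illustration, assuming NumPy; the path and the node weight vectors are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden = rng.random(5)                    # hidden layer vector
path = [("left", rng.random(5)),          # (decision, node weight vector)
        ("right", rng.random(5)),
        ("right", rng.random(5))]

prob = 1.0
for direction, w_node in path:
    p_left = sigmoid(hidden @ w_node)     # each node is a logistic unit
    prob *= p_left if direction == "left" else (1.0 - p_left)

print(prob)   # P(word | context): a product over O(log V) node decisions
</code></pre></div></div>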

<h3 id="complexity-improvement">Complexity improvement</h3>

<ul>
  <li>Flat Softmax: <strong>O(V)</strong></li>
  <li>Hierarchical Softmax: <strong>O(log V)</strong></li>
</ul>

<p>Additionally, since P(right) + P(left) = 1, we only need to compute one branch’s probability at each node — the other is obtained by subtraction, halving the work per decision.</p>

<hr />

<h2 id="negative-sampling">Negative Sampling</h2>

<p>Negative Sampling is the more popular optimization in practice — it is the approach used in TensorFlow’s Word2vec implementation.</p>

<h3 id="the-key-insight">The key insight</h3>

<p>After computing the output layer error, a standard training pass updates all V × hidden_layer_size weights. But in practice, the actual positive word and the incorrect (negative) words are already known. There is no need to update <em>all</em> vocabulary word weights for every training example — only a small selected subset.</p>

<h3 id="how-it-works">How it works</h3>

<p>For each training example, instead of computing Softmax over all V words, we:</p>

<ol>
  <li>Include the <strong>positive word</strong> (the actual target).</li>
  <li>Sample a small number <strong>K</strong> of <strong>negative words</strong> (words that should <em>not</em> be predicted in this context).</li>
  <li>Apply Softmax only over these K+1 words.</li>
</ol>

<p>If K = 10, the output layer has only 11 neurons, and backpropagation updates only <strong>11 × hidden_layer_size</strong> weights instead of <strong>V × hidden_layer_size</strong>.</p>

<h3 id="selecting-negative-samples">Selecting negative samples</h3>

<p>Negative words are sampled based on their frequency in the corpus — more frequent words have a higher probability of being selected as negative samples. The formula is:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/we.png" alt="Negative sampling probability formula" /></p>

<p>Where <strong>c</strong> is a constant exponent set by the model creator (the original paper uses 3/4). The K negative words are then drawn at random from this weighted distribution.</p>

<blockquote>
  <p>In practice, K is typically between 5 and 20 for small datasets and 2 to 5 for large corpora. The original paper found that negative sampling yields results comparable to Hierarchical Softmax while being simpler to implement.</p>
</blockquote>
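<p>A minimal sketch of one negative-sampling update (my own illustration, assuming NumPy; all sizes are toy values, and the 3/4 exponent follows the original paper):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
V, H, K = 11, 5, 3                       # vocab size, hidden size, negatives
W_out = rng.normal(scale=0.1, size=(V, H))
counts = rng.integers(1, 100, size=V)    # word frequencies in the corpus

p = counts ** 0.75                       # frequency weighting
p = p / p.sum()                          # sampling distribution over words

hidden = rng.random(H)                   # hidden layer for the current example
pos = 7                                  # index of the true target word
neg = rng.choice(V, size=K, replace=False, p=p)   # K sampled negatives
# (a real implementation would also exclude the positive word from `neg`)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1
# Only K+1 rows of W_out are updated, instead of all V rows.
W_out[pos] += lr * (1 - sigmoid(W_out[pos] @ hidden)) * hidden
for j in neg:
    W_out[j] -= lr * sigmoid(W_out[j] @ hidden) * hidden
</code></pre></div></div>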

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Complexity</th>
      <th>How it works</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Flat Softmax</td>
      <td>O(V)</td>
      <td>Normalize over all vocabulary words</td>
    </tr>
    <tr>
      <td>Hierarchical Softmax</td>
      <td>O(log V)</td>
      <td>Binary tree path; logistic nodes per decision</td>
    </tr>
    <tr>
      <td>Negative Sampling</td>
      <td>O(K) — K ≪ V</td>
      <td>Update only positive + K sampled negatives</td>
    </tr>
  </tbody>
</table>

<p>Both techniques make training large-vocabulary Word2vec models feasible. Negative Sampling is generally preferred for its simplicity and strong empirical performance.</p>

<p><em>Diagrams created with <a href="http://www.draw.io">draw.io</a>.</em></p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1411.2738.pdf">Word2vec Parameter Learning Explained (Rong, 2014)</a></li>
  <li><a href="http://cs224d.stanford.edu/syllabus.html">CS224d NLP Course — Stanford</a></li>
  <li><a href="https://www.tensorflow.org/tutorials/word2vec">TensorFlow Word2vec Tutorial</a></li>
  <li><a href="http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/">Word2vec Tutorial Part 2: Negative Sampling — McCormick</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="word2vec" /><category term="word-embedding" /><category term="nlp" /><category term="hierarchical-softmax" /><category term="negative-sampling" /><category term="skip-gram" /><category term="optimization" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 27, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">[Thesis Tutorials I] Understanding Word2vec for Word Embedding I</title><link href="https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i/" rel="alternate" type="text/html" title="[Thesis Tutorials I] Understanding Word2vec for Word Embedding I" /><published>2017-04-25T00:00:00+00:00</published><updated>2017-04-25T00:00:00+00:00</updated><id>https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i</id><content type="html" xml:base="https://ahmedhani.github.io//thesis-tutorials-understanding-word2vec-for-word-embedding-i/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/">AH’s Blog (WordPress)</a> on April 25, 2017, and has been migrated here.</p>
</blockquote>

<hr />

<h2 id="key-terms">Key Terms</h2>

<p><strong>Vector Space Models (VSMs):</strong> Words represented as unique vectors, used as input to mathematical/statistical ML models.</p>

<p><strong>Word Embedding:</strong> Fixed-size vector representations where semantically similar words have geometrically close vectors (small Euclidean distance). Used in Language Modeling, Machine Translation, and many NLP tasks.</p>

<p><strong>Shallow Neural Networks:</strong> Neural networks with exactly 1 hidden (projection) layer, producing a new feature representation of the input.</p>

<hr />

<h2 id="one-hot-encoding">One-Hot Encoding</h2>

<p>The naive baseline. For a vocabulary V = {I, like, playing, football, basketball} (‖V‖ = 5):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I          = [1, 0, 0, 0, 0]
like       = [0, 1, 0, 0, 0]
playing    = [0, 0, 1, 0, 0]
football   = [0, 0, 0, 1, 0]
basketball = [0, 0, 0, 0, 1]
</code></pre></div></div>

<p><strong>Pros:</strong> Simple, deterministic.</p>

<p><strong>Cons:</strong> Vector size = vocabulary size (1M words → 1M-dim vectors). No semantic information — football and basketball are equally “distant” from each other as from “I”.</p>

<p>Use this when semantic relations don’t matter and vocabulary size is manageable.</p>

<hr />

<h2 id="word2vec-philosophy">Word2vec Philosophy</h2>

<p>Word2vec represents a word using the words that surround it. Given:</p>

<blockquote>
  <p>“I like playing <strong>X</strong>”</p>
</blockquote>

<p>Even without knowing what X is, the context (“like”, “playing”) tells us it’s something enjoyable and playable. This is exactly how humans infer meaning from context. Word2vec formalizes this: train a shallow neural network to predict context from a target word (or vice versa), and the learned weights become the word vectors.</p>

<hr />

<h2 id="dataset">Dataset</h2>

<p>Corpus D:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"This battle will be my masterpiece"
"The unseen blade is the deadliest"
</code></pre></div></div>

<p>Vocabulary V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ‖V‖ = 11</p>

<p>One-hot vectors (size 11) are assigned per word for use in the network.</p>

<hr />

<h2 id="skip-gram-model">Skip-gram Model</h2>

<p><img src="https://i.stack.imgur.com/igSuE.png" alt="Skip-gram architecture" /></p>

<p><strong>Task:</strong> Given a target word, predict its N surrounding context words.</p>

<p><strong>Architecture:</strong> Input = one-hot vector of target word → 1 hidden (projection) layer → N Softmax output layers (one per context word to predict).</p>

<p><strong>Example</strong> — target: “unseen”, context window = 3, embedding dimension = 3.</p>

<p><strong>Input → Hidden (Wh, shape 11×3):</strong></p>

<p>After feedforwarding “unseen” (a one-hot vector, so the multiplication simply selects the corresponding row of Wh), hidden layer H = [0.8, 0.4, 0.5]. This is the initial embedding. Every row of Wh is the current embedding of one vocabulary word.</p>

<p><strong>Hidden → Output:</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho3.png" alt="Hidden to output 1" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho2.png" alt="Hidden to output 2" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ho1.png" alt="Hidden to output 3" /></p>

<p>Apply Softmax to each output vector, take the argmax index → predicted context words.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/soft.png" alt="Softmax outputs" /></p>

<p>During training, errors from all N Softmax layers are averaged and backpropagated to update Wh. Repeat until max epochs or target loss is reached. The final input-to-hidden weight matrix is the word embedding.</p>

<blockquote>
  <p>Alternatively, the average of the Hidden-to-output weight matrices can serve as embeddings — but the input-to-hidden matrix is the standard choice.</p>
</blockquote>
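<p>A minimal sketch of one Skip-gram forward pass (my own illustration, assuming NumPy; the weights are random, so the predictions will not match the figures above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

V, D, N = 11, 3, 3                      # vocab, embedding dim, context words
rng = np.random.default_rng(0)
Wh = rng.random((V, D))                 # input-to-hidden: one row per word
Wo = [rng.random((D, V)) for _ in range(N)]   # one output matrix per position

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

target = 7                              # index of "unseen"
h = Wh[target]                          # one-hot input just selects row 7
predictions = [int(np.argmax(softmax(h @ W))) for W in Wo]
print(predictions)                      # indices of the 3 predicted context words
</code></pre></div></div>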

<hr />

<h2 id="continuous-bag-of-words-cbow-model">Continuous Bag of Words (CBOW) Model</h2>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Cbow.png/283px-Cbow.png" alt="CBOW architecture" /></p>

<p><strong>Task:</strong> Given N context words, predict the target word.</p>

<p><strong>Architecture:</strong> N input one-hot vectors → 1 hidden layer (mean of input projections) → 1 Softmax output.</p>

<p><strong>Example</strong> — target: “unseen”, context: “the”, “blade”, “is”.</p>

<p><strong>Input → Hidden:</strong></p>

<p>The hidden layer is the average of each context word’s projection:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihc.png" alt="CBOW input projections" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihccc.png" alt="CBOW average computation" />
<img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/ihcc.png" alt="CBOW hidden values" /></p>

<p>H = [(0.8+0.2+0.2)/3, (0.9+0.8+0.3)/3, (0.1+0.9+0.7)/3] = [0.39, 0.66, 0.56]</p>

<p><strong>Hidden → Output:</strong></p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/hoc.png" alt="CBOW hidden to output" /></p>

<p>Apply Softmax → with these initial weights the predicted word is “masterpiece”. Backpropagate the error to update the weights.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Word2vec is the bridge from symbolic NLP to semantic deep learning. Traditional rule-based systems fail to generalize across languages and cannot capture semantic similarity. One-hot encodings are equally blind to meaning. Word2vec vectors encode semantic proximity — enabling downstream models to reason about language.</p>

<p>Both Skip-gram and CBOW produce the same type of output (word embeddings) but differ in architecture: Skip-gram predicts context from a target; CBOW predicts a target from context. Skip-gram generally performs better on infrequent words; CBOW is faster to train on large corpora.</p>

<p><em>Part II covers negative sampling, hierarchical softmax, and practical training details.</em></p>]]></content><author><name></name></author><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="word2vec" /><category term="word-embedding" /><category term="nlp" /><category term="skip-gram" /><category term="cbow" /><category term="vsm" /><category term="neural-network" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 25, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Hello, Valor!</title><link href="https://ahmedhani.github.io//hello-valor/" rel="alternate" type="text/html" title="Hello, Valor!" /><published>2017-04-05T00:00:00+00:00</published><updated>2017-04-05T00:00:00+00:00</updated><id>https://ahmedhani.github.io//hello-valor</id><content type="html" xml:base="https://ahmedhani.github.io//hello-valor/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/04/05/hello-valor/">AH’s Blog (WordPress)</a> on April 5, 2017, and has been migrated here.</p>
</blockquote>

<p>Being a researcher and programmer can make life quite monotonous. After two months of the same routine — work, study, freelance tasks, waiting for Thursday to meet friends — I decided to break the cycle by learning something completely new.</p>

<p>I love listening to music; I can barely work or study without it. But I’d never thought of myself as someone who could play. So I enrolled in a course and bought a violin. I’m currently at Level 1 out of 7, and I’m excited to see how far I can go by the end of the year.</p>

<p>Welcome, <strong>Valor</strong> 🎻</p>

<p><a href="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/img_0945.jpg"><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2017/04/img_0945.jpg" alt="Valor the violin" /></a></p>]]></content><author><name></name></author><category term="Life Events" /><category term="life" /><category term="music" /><category term="violin" /><category term="personal" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on April 5, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory</title><link href="https://ahmedhani.github.io//generative-adversarial-networks-overview/" rel="alternate" type="text/html" title="Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory" /><published>2017-01-17T00:00:00+00:00</published><updated>2017-01-17T00:00:00+00:00</updated><id>https://ahmedhani.github.io//generative-adversarial-networks-overview</id><content type="html" xml:base="https://ahmedhani.github.io//generative-adversarial-networks-overview/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2017/01/17/generative-adversarial-networks-when-deep-learning-meets-game-theory/">AH’s Blog (WordPress)</a> on January 17, 2017, and has been migrated here.</p>
</blockquote>

<p>Before diving into Generative Adversarial Networks (GANs), a few foundational concepts are worth establishing.</p>

<hr />

<h2 id="key-concepts">Key Concepts</h2>

<p><strong>Discriminative Models</strong> predict a hidden class given observed features. They model the conditional probability <strong>P(y | x₁, x₂, …, xₙ)</strong>. Examples: SVMs, Feedforward Neural Networks.</p>

<p><strong>Generative Models</strong> learn the joint distribution of features and classes — <strong>P(x₁, x₂, …, xₙ, y)</strong> — enabling them to generate new samples from the learned distribution. Examples: Restricted Boltzmann Machines (RBMs), HMMs. Note: Vanilla Auto-encoders are <em>not</em> generative models (they reconstruct); Variational Auto-encoders (VAEs) are.</p>

<p><strong>Nash Equilibrium</strong> (Game Theory): A stable game state in which no player can improve their outcome by unilaterally changing strategy, given the other players’ strategies. Each player is satisfied with their result given the others’ choices.</p>

<p><strong>Minimax</strong>: An algorithm for two-player games where each player tries to minimize the maximum possible loss the opponent can inflict. Used in Chess, Tic-Tac-Toe, Connect-4, and other rule-based decision games.</p>

<hr />

<h2 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h2>

<p><img src="https://i0.wp.com/www.kdnuggets.com/wp-content/uploads/generative-adversarial-network.png" alt="GAN architecture" /></p>

<p>A GAN consists of two models competing during training:</p>

<ul>
  <li><strong>Generator (G):</strong> Produces fake samples intended to match the distribution of real data.</li>
  <li><strong>Discriminator (D):</strong> Learns to distinguish real samples from the Generator’s fakes.</li>
</ul>

<p>The dynamic is adversarial — G tries to fool D; D tries to catch G. This is precisely the Minimax setup: each player attempts to minimize the worst outcome the other can produce.</p>

<p>Training continues iteratively until G produces samples that D can no longer reliably distinguish from real data. When neither model can improve its outcome by changing strategy unilaterally, the system has reached <strong>Nash Equilibrium</strong>; at that point D’s best response is to output 1/2 for every sample.</p>

<p>During training, a shared loss function drives both models, but each model’s parameters are updated independently via backpropagation; neither model can directly modify the other’s weights.</p>
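<p>A minimal PyTorch sketch of this adversarial loop, on a toy 1-D problem (the architecture, data distribution, and hyperparameters here are arbitrary illustrative choices, not the original paper’s setup):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

# Toy GAN: G learns to turn noise into samples from N(4, 1.25).
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0  # samples from the real distribution
    fake = G(torch.randn(64, 8))            # samples from the Generator

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool D, using the standard flipped-label trick.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward 4.0
</code></pre></div></div>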

<hr />

<h2 id="status">Status</h2>

<p>This was an overview written while still learning GANs. The follow-up post applies the concepts in more detail: <a href="/2017/02/17/generative-adversarial-networks-2-camouflage-your-predator/">GANs Part 2 — Camouflage your Predator!</a></p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://arxiv.org/pdf/1701.00160v1.pdf">Goodfellow et al., NIPS 2016 Tutorial on GANs</a></li>
  <li><a href="http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html">KDnuggets: GANs Overview</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Generative_adversarial_networks">Wikipedia: Generative Adversarial Networks</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Minimax">Wikipedia: Minimax</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Wikipedia: Nash Equilibrium</a></li>
  <li>Stuart Russell and Peter Norvig, <em>Artificial Intelligence: A Modern Approach</em></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Reinforcement Learning" /><category term="gan" /><category term="generative-models" /><category term="deep-learning" /><category term="game-theory" /><category term="nash-equilibrium" /><category term="minimax" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on January 17, 2017, and has been migrated here.]]></summary></entry><entry><title type="html">Another LSTM Tutorial</title><link href="https://ahmedhani.github.io//another-lstm-tutorial/" rel="alternate" type="text/html" title="Another LSTM Tutorial" /><published>2016-10-09T00:00:00+00:00</published><updated>2016-10-09T00:00:00+00:00</updated><id>https://ahmedhani.github.io//another-lstm-tutorial</id><content type="html" xml:base="https://ahmedhani.github.io//another-lstm-tutorial/"><![CDATA[<blockquote>
  <p><strong>Note:</strong> This post was originally published on <a href="https://ahmedhanibrahim.wordpress.com/2016/10/09/another-lstm-tutorial/">AH’s Blog (WordPress)</a> on October 9, 2016, and has been migrated here.</p>
</blockquote>

<p><em>Figures in this post are taken from Christopher Olah’s excellent <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a> blog post.</em></p>

<hr />

<h2 id="recurrent-neural-networks">Recurrent Neural Networks</h2>

<p>Recurrent Neural Networks (RNNs) are designed for sequential data: data in which the order of, and dependencies between, elements matter. Traditional Multi-layer Perceptrons (MLPs) assume inputs are independent of one another, an assumption that breaks down for text or audio.</p>

<p>RNNs contain <strong>self-loops</strong> that carry the previous hidden state forward, allowing the network to “remember” what it has seen.</p>

<p><img src="https://i0.wp.com/colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png" alt="RNN single unit" /></p>

<p>Unrolled over time, the RNN resembles a deep feedforward network where each step receives both the current input and the previous hidden state:</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/1.png" alt="Unrolled RNN" /></p>

<hr />

<h2 id="the-long-term-dependencies-problem">The Long-term Dependencies Problem</h2>

<p>Standard RNNs have no mechanism to selectively forget irrelevant context. For a sentence like:</p>

<blockquote>
  <p>“I live in France, I like playing football with my friends and going to school, <strong>I speak French</strong>”</p>
</blockquote>

<p>Predicting “French” requires connecting back to “I live in France”, but the two intermediate clauses introduce noise. Regular RNNs struggle to bridge such long-range dependencies, and this is the main motivation behind <strong>LSTM</strong>.</p>

<hr />

<h2 id="what-is-lstm">What is LSTM?</h2>

<p>Long Short-Term Memory (LSTM) is a variant of RNN that controls the memory process through <strong>gates</strong> within each unit. These gates regulate what information to retain, update, or forget, allowing the network to maintain relevant long-range context.</p>

<p>The analogy: when reading a novel, your brain selectively remembers important events (subject, previous action) while discarding irrelevant details. LSTMs simulate this selective memory.</p>

<hr />

<h2 id="lstm-unit-structure">LSTM Unit Structure</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/2.png" alt="LSTM unit" /></p>

<p>A standard LSTM unit receives the current input x_t together with:</p>
<ul>
  <li><strong>2 recurrent inputs:</strong> previous cell state C_{t-1} and previous output h_{t-1}</li>
  <li><strong>4 layers:</strong> 3 sigmoid activations + 1 tanh activation</li>
  <li><strong>5 pointwise operators:</strong> 3 multiplications, 1 addition, 1 tanh</li>
  <li><strong>2 outputs:</strong> current cell state C_t and current output h_t</li>
</ul>

<p>The <strong>cell state</strong> is the memory backbone. It flows through the unit with minimal modification unless the gates decide to change it.</p>

<hr />

<h2 id="detailed-processing-3-groups">Detailed Processing: 3 Groups</h2>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/21.png" alt="LSTM overview" /></p>

<h3 id="group-11--forget-gate">Group 1.1 — Forget Gate</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/6.png" alt="Forget gate" /></p>

<p>The <strong>forget gate layer</strong> (sigmoid) decides what to discard from the previous cell state. Output of 0 → forget everything; values closer to 1 → retain.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/7.png" alt="Forget gate formula" /></p>

<h3 id="group-12--applying-forget-to-previous-state">Group 1.2 — Applying Forget to Previous State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/8.png" alt="Forget gate × state" /></p>

<p>Element-wise multiply the forget gate output with C_{t-1}. A vector of zeros means we wipe all previous memory.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/9.png" alt="Forget application formula" /></p>

<h3 id="group-21--input-gate-and-candidate-state">Group 2.1 — Input Gate and Candidate State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/10.png" alt="Input gate" /></p>

<p>The <strong>input gate layer</strong> (sigmoid) decides which state values to update. A <strong>tanh</strong> layer generates the candidate new state values to potentially add.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/12.png" alt="Candidate state formula" /></p>

<h3 id="group-22--scaling-new-state">Group 2.2 — Scaling New State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/13.png" alt="Scaling" /></p>

<p>Multiply the candidate state by the input gate output to filter which new information actually gets written.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/15.png" alt="Scaled formula" /></p>

<h3 id="combining-groups-1--2--new-cell-state">Combining Groups 1 + 2 → New Cell State</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/16.png" alt="New state" /></p>

<p>Add the filtered old state (Group 1) and filtered new information (Group 2) to get C_t.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/17.png" alt="New state formula" /></p>

<h3 id="group-3--output-gate">Group 3 — Output Gate</h3>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/19.png" alt="Output gate" /></p>

<p>A sigmoid layer decides which parts of the state to output. The state is passed through tanh (to keep values in [-1, 1]) and multiplied element-wise by the sigmoid output.</p>

<p><img src="https://ahmedhanibrahim.wordpress.com/wp-content/uploads/2016/10/20.png" alt="Output formula" /></p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>LSTMs have proven themselves across a wide range of tasks: Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization, and Question Answering. The gating mechanism is what makes them capable of learning which context to carry forward and which to discard.</p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Christopher Olah: Understanding LSTMs</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Wikipedia: LSTM</a></li>
  <li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&amp;rep=rep1&amp;type=pdf">Hochreiter &amp; Schmidhuber, 1997 (original LSTM paper)</a></li>
  <li><a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">WildML: RNN Tutorial Part 1</a></li>
</ul>]]></content><author><name></name></author><category term="Artificial Intelligence" /><category term="Deep Learning" /><category term="Machine Learning" /><category term="Natural Language Processing" /><category term="Neural Network" /><category term="lstm" /><category term="rnn" /><category term="deep-learning" /><category term="nlp" /><category term="neural-network" /><summary type="html"><![CDATA[Note: This post was originally published on AH’s Blog (WordPress) on October 9, 2016, and has been migrated here.]]></summary></entry></feed>