nanogpt
Commits

List of commits on branch main (all by vvidyabhandary, about 4 months ago):

  • 9a9eee2 - The learnings from Tokenization
  • f9d93f2 - Tokenization Complete
  • baa4724 - With and Without bytefallback
  • d2f60de - Sentence Piece vocab
  • d6968b8 - Minbpe Exercise
  • f8794da - Encoder dictionary

README

nanogpt

NanoGPT from Andrej Karpathy's video tutorial. I watched the entire playlist - very informative and, with a teacher like Andrej, accessible too.

A. Lessons and Observations from Micrograd

  1. PyTorch documentation - the detail, and the depth of understanding of the library it reflects
  2. Lego-block style of building
  3. Understanding the calculus - (n - 1) instead of n !!!
  4. Step-by-step explanation
  5. Understanding the library in depth and even questioning it (repeatedly, yes ....)
  6. Dimensions, dimensions and broadcasting !!!
  7. Subtle bugs - generally because of broadcasting and not understanding the dimensions of the vectors (a small sketch follows the Colab link below)
  8. As he says - "lot of gymnastics around these multi-dimensional arrays, ton of trying to make these shapes work, layers shape"
  9. He prototypes the layers in a Jupyter notebook to check that they work and then transfers them to the working model
  10. Dilated causal convolution layers !!! A mouthful !!!

https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
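
A minimal sketch of the kind of broadcasting bug mentioned in point 7 - my own toy illustration, not code from the repo or the videos: turning a matrix of bigram counts into per-row probabilities, where a missing keepdim=True silently divides by the wrong row sums.

```python
import torch

counts = torch.rand(27, 27)  # stand-in for a (27, 27) matrix of bigram counts

# Buggy: sum(dim=1) has shape (27,), which broadcasts as (1, 27), so entry
# [i, j] gets divided by the sum of row j instead of row i. No error is raised;
# the numbers are just quietly wrong.
probs_bad = counts / counts.sum(dim=1)

# Correct: keepdim=True keeps shape (27, 1), which broadcasts across columns.
probs = counts / counts.sum(dim=1, keepdim=True)

print(probs_bad.sum(dim=1)[:3])  # rows do not sum to 1 -> the silent bug
print(probs.sum(dim=1)[:3])      # each row sums to 1.0
```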

B. Lessons from Tokenization

  1. What I knew of a tokenizer before this tutorial - a way to check how much to pay, based on the number of tokens in the input and output text. And the tokens did not line up with the words; they broke across words.

This tutorial opens up a whole new world of how the tokenizer influences the learning of the LLM. And yes, it is hairy and gnarly, to use Andrej's own words.

  1. Handling of spaces in GPT-2 vs. later tokenizers - in GPT-2, contiguous spaces were not merged and each space stayed separate, so the number of tokens was higher. Later tokenizers do merge contiguous spaces (see the tiktoken sketch below).
  2. It is the tokenizer that decides on the embeddings.
  3. Treatment of punctuation attached to words in GPT-2's encoder.py - e.g. "dog.", "dog!" etc.
  4. The regex explanation yielded some new things for me (a small example with this pattern is sketched below) -
  • \p{L} - any kind of letter from any language. I tested the regex with the Devanagari and Kannada scripts and it worked.
  • \p{N} - any kind of numeric character in any script. Tested with numbers in Kannada and Hindi.
  5. "For reasons that I don't fully understand !!!" :) (1:17 in the GPT Tokenizer video) - I guess there are always things to learn and understand, even for someone in the same field.
  6. The basic transformer attention model is the same, and to some extent so is the tokenizer, even as the modality changes (video, image), which is amazing when you think about it. However, Sora uses visual patches, so that part is different.
  7. Testing ChatGPT using tokens, along with token irregularities - <|endoftext|> - your knowledge of these special tokens potentially ends up being an attack surface (also covered in the tiktoken sketch below).
  8. He knows Korean ! :) An elegant way of displaying this.
  9. Token economy - YAML is better than JSON (see the token-count comparison below).
  10. The best !!! SolidGoldMagikarp - the last 10 minutes of the video.

There are trigger words that make GPT go haywire and behave strangely. One such example traces back to a single Reddit user. The potential reason for the behaviour: that token never appears in the training set for the LLM even though it is present in the tokenizer's data. So the token never gets activated - it is never updated in the embedding table, never sampled, never used, never trained (completely untrained). When it feeds into the transformer, it creates undefined behaviour (a tiny sketch of this untrained-row effect follows below).
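
To make the "never updated, completely untrained" point concrete, here is a tiny sketch - my own toy setup, not the actual GPT-2 training code: a token id that exists in the vocabulary but never appears in the training batches gets zero gradient, so its embedding row stays frozen at its random initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 10, 4
emb = nn.Embedding(vocab_size, dim)
init = emb.weight.detach().clone()

# Token id 9 plays the role of " SolidGoldMagikarp": it is in the vocab
# but never shows up in any training batch below.
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
for _ in range(100):
    batch = torch.randint(0, 9, (32,))   # ids 0..8 only, never 9
    loss = emb(batch).pow(2).mean()      # dummy objective, just to get gradients flowing
    opt.zero_grad()
    loss.backward()
    opt.step()

moved = (emb.weight - init).abs().sum(dim=1)
print(moved)  # rows 0..8 have moved; row 9 is still exactly its random init
```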
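The punctuation and \p{L} / \p{N} observations can be reproduced with the GPT-2 pre-tokenization pattern from encoder.py (the same pattern minbpe uses); the test strings here are my own.

```python
import regex as re  # the third-party 'regex' module, which understands \p{L} and \p{N}

# GPT-2's pre-tokenization split pattern from encoder.py
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Punctuation is split off from the word it follows:
print(gpt2_pat.findall("Hello world, dog. dog! dog?"))
# ['Hello', ' world', ',', ' dog', '.', ' dog', '!', ' dog', '?']

# \p{L} matches Devanagari and Kannada letters, \p{N} matches Devanagari digits:
print(gpt2_pat.findall("नमन ನಮನ १२३"))
# ['नमन', ' ನಮನ', ' १२३']
```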
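A short tiktoken sketch for the whitespace and <|endoftext|> points above; the sample text and the printed counts are just for illustration.

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")           # GPT-2's tokenizer
cl100k = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the GPT-4 era models

snippet = "def f():\n        return 1\n"       # indentation-heavy text
print(len(gpt2.encode(snippet)))    # GPT-2: the indentation spaces stay (mostly) separate tokens
print(len(cl100k.encode(snippet)))  # cl100k merges runs of spaces, so the count is lower

# Special tokens such as <|endoftext|> are not encoded from raw text unless you
# opt in, which is exactly why careless handling of them is an attack surface.
text = "<|endoftext|>"
try:
    gpt2.encode(text)  # raises: special token found in plain text
except ValueError as err:
    print("refused:", err)
print(gpt2.encode(text, allowed_special={"<|endoftext|>"}))  # [50256]
```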
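And for the token-economy point, a rough comparison of the same record serialized as JSON vs. hand-written YAML. The record and the counts are illustrative; the exact numbers depend on the data, but YAML usually comes out cheaper because it drops the braces, quotes and commas.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record = {"name": "nanogpt", "stars": 0, "topics": ["tokenizer", "gpt"]}

as_json = json.dumps(record, indent=2)
as_yaml = "name: nanogpt\nstars: 0\ntopics:\n  - tokenizer\n  - gpt\n"  # hand-written equivalent

print("JSON tokens:", len(enc.encode(as_json)))
print("YAML tokens:", len(enc.encode(as_yaml)))
```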

C. Some learnings from the talk at Microsoft Build - "State of GPT"

  1. Revisit your prompting - ask whether GPT-4 met the requirement; basically, reflection.
  2. Spread the cognitive effort across tokens - the model only gets a fixed amount of compute per token.
  3. 1.4 trillion tokens instead of just 300B tokens - more tokens => better model.
  4. Chain of thought - ask it to go step by step.
  5. The best prompt for the most accurate results (sketched below): "Let's work this out in a step by step way to be sure we have the right answer."
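
A minimal sketch of point 5 with the OpenAI Python client; the model name, the question, and the idea of appending the sentence as a suffix are my own placeholders, not prescriptions from the talk.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
suffix = "Let's work this out in a step by step way to be sure we have the right answer."

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": f"{question}\n\n{suffix}"}],
)
print(resp.choices[0].message.content)
```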