“best way to learn about transformers”
Summary of results
Introductory Resources
- YouTube Videos:
- “Intro to Transformers” on YouTube: “This Intro to Transformers is helpful to get some basic understanding of the underlying concepts and it comes with a really succinct history lesson as well.” https://www.youtube.com/watch?v=XfpMkf4rD6E
- Andrej Karpathy's course: “The best way to understand transformers is to take Andrej Karpathy’s course on YouTube. With a keyboard and a lot of focus time.”
- Computerphile videos by Rob Miles: “Those Computerphile videos by Rob Miles helped me understand transformers. He specifically references the 'Attention is all you need' paper.” https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk
Written Guides and Tutorials
- Transformers from Scratch:
- Peter Bloem's guide: “The one I got the most out of is 'Transformers from Scratch' by Peter Bloem.” https://peterbloem.nl/blog/transformers
- The e2eml.school guide: “This link was posted here recently, and was the most understandable explanation I've found so far.” https://e2eml.school/transformers.html
- The Illustrated Transformer: Jay Alammar's guide: “The Illustrated Transformer is pretty great. I was pretty hazy after reading the paper back in 2017 and this resource helped a lot.” https://jalammar.github.io/illustrated-transformer/
- Other Notable Articles:
- Luis Serrano's article: “I would also add Luis Serrano's article here.” https://txt.cohere.com/what-are-transformer-models/
- “Transformer in 5 Minutes”: “I remember looking into this article. It was really helpful for me to understand transformers. Although the OP's article is detailed, this one is concise.” https://blue-season.github.io/transformer-in-5-minutes
- “The Transformer model in equations” by John Thickstun: “If you'd prefer something readable and explicit, instead of empty handwaving and UML-like diagrams, read 'The Transformer model in equations'.” Link
Foundational Papers
- Attention Is All You Need: The original 2017 Transformers paper. “It's still pretty good,” though later resources “simplify and clarify what's going on.”
Advanced and Specialized Resources
- Formal Algorithms for Transformers: “A more recent academic but high-level explanation of transformers, very good for detail on the different architectural flavors (e.g., encoder-decoder vs. decoder-only),” from DeepMind. https://arxiv.org/abs/2207.09238
- A Mathematical Framework for Transformer Circuits: Recommended as a follow-up “once one has a basic grasp”; it “builds a lot of really useful ideas for understanding how and why transformers work.”
Practical Implementation
- Hugging Face: “Just go to huggingface.co and see what interests you. They have tons of examples to run and they also link to all the papers in the documentation.”
- Hands-on Tutorials:
- Andrej Karpathy's hands-on video: “And for a deeper dive, Andrej Karpathy has this hands-on video where he builds a transformer from scratch.” Link
Additional Tips
- Understanding Basic Concepts: “Understanding transformers will be really hard if you don't understand basic fully connected feedforward networks (multilayer perceptrons).”
- Implementation Experience: “TLDR, just go and implement a transformer. The more from scratch the better. Everyone I know who tried it ended up learning something they hadn't expected.”
This Intro to Transformers is helpful to get some basic understanding of the underlying concepts and it comes with a really succinct history lesson as well. https://www.youtube.com/watch?v=XfpMkf4rD6E
Transformers from scratch:
1.) https://e2eml.school/transformers.html
https://news.ycombinator.com/item?id=35697627 (46 comments, 1 month ago)
https://news.ycombinator.com/item?id=29315107 (17 comments, 2 years ago)
2.) http://www.peterbloem.nl/blog/transformers
https://news.ycombinator.com/item?id=20773992 (28 comments, 4 years ago)
https://news.ycombinator.com/item?id=29280909 (9 comments, 2 years ago)
Source: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
There are millions of "Transformers Explained" blog posts by now. The one I got the most out of is "Transformers from Scratch" by Peter Bloem:
For a while now, a common answer has been to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on grokking transformer computational fundamentals, and they've turned up some helpful later additions that simplify and clarify what's going on.
You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:
(1) Transformers from Scratch: https://peterbloem.nl/blog/transformers
(2) Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...
(3) Formal Algorithms for Transformers: https://arxiv.org/abs/2207.09238
I found this article on “transformers from scratch”[0] to be a perfect (for me) middle ground between high level hand-wavy explanations and overly technical in-the-weeds academic or code treatments.
While they are largely obsolete for practical purposes, learning about them is still valuable, as they illustrate the natural evolution of the thinking that led to transformers.
To be honest, I'd start with some introductory Transformer videos on YouTube. They'll cover a lot of these terms, and you'll then be better equipped to find additional resources.
So how practical is learning to create your own transformers if you can't afford a giant amount of resources to train them?
> Transformer learning explained
Well, "explained" seems to be a stretching term; I would rather call it a mathematical derivation of the operation of a transformer which is certainly interesting for some specialists.
This link was posted here recently, and was the most understandable explanation I've found so far: https://e2eml.school/transformers.html
An early explainer of transformers, which is a quicker read, that I found very useful when they were still new to me, is The Illustrated Transformer[1], by Jay Alammar.
A more recent academic but high-level explanation of transformers, very good for detail on the different architectural flavors (e.g., encoder-decoder vs. decoder-only), is Formal Algorithms for Transformers[2], from DeepMind.
The best way to understand transformers is to take Andrej Karpathy’s course on YouTube. With a keyboard and a lot of focus time.
I remember looking into this article. It was really helpful for me to understand transformers. Although the OP's article is detailed, this one is concise. Here's the link: https://blue-season.github.io/transformer-in-5-minutes
To be honest, for transformers just go to huggingface.co and see what interests you. They have tons of examples to run and they also link to all the papers in the documentation. It doesn't get much easier to get into it. Even for the more recent stuff like vision transformers and diffusion models.
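As a concrete taste of how low the barrier is, here's a minimal sketch using the transformers library's pipeline API (the example sentence is mine, and the default model is whatever the library picks for the task, so treat it as a sketch rather than the canonical usage):

    # pip install transformers torch
    from transformers import pipeline

    # A pipeline bundles tokenizer, model, and post-processing into one call.
    classifier = pipeline("sentiment-analysis")

    # Run inference on a sample sentence.
    print(classifier("Transformers are easier to learn than I expected."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}] (exact score varies by model)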
I'm the author of https://jalammar.github.io/illustrated-transformer/ and have spent years since introducing people to Transformers and thinking of how best to communicate those concepts. I've found that different people need different kinds of introductions, and the thread here includes some often cited resources including:
https://peterbloem.nl/blog/transformers
https://e2eml.school/transformers.html
I would also add Luis Serrano's article here: https://txt.cohere.com/what-are-transformer-models/ (HN discussion: https://news.ycombinator.com/item?id=35576918).
Looking back at The Illustrated Transformer, when I introduce people to the topic now, I find I can hide some complexity by omitting the full encoder-decoder architecture and focusing on only one of the two stacks. Decoders are great because a lot of people now come to Transformers having heard of GPT models (which are decoder-only). So for me, my canonical intro to Transformers now only touches on a decoder model. You can see this narrative here: https://www.youtube.com/watch?v=MQnJZuBGmSQ
Those Computerphile videos[0] by Rob Miles helped me understand transformers. He specifically references the "Attention is all you need" paper.
And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.
[0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk
Jay Alammar's Illustrated Transformer, although this too is detailed. I think it's still worth taking a look, because really I don't think people have yet "compressed" what transformers do intuitively. None of the concepts in these networks involves particularly hard math - it's basic algebra. But the overall construction is complicated.
Here's my attempt at a simple explanation of transformers. I would love feedback on whether I've got it right and how I could improve it. Cheers
For those that want a high level overview of Transformers, we recently covered it in our podcast: https://www.youtube.com/watch?v=Kb0II5DuDE0
Would this teach transformers? Or is that something else?
Also, any tips for finding a study group for learning about large language models? I can’t seem to self-motivate.
Every time I need a refresher on transformers, I read the same author's post on transformers. Looking forward to this one!
For specifically understanding transformers, this (w/ maybe GPT-4 by your side to unpack jargon/math) might be able to get you from lay-person to understanding enough to be dangerous pretty quickly: https://sebastianraschka.com/blog/2023/llm-reading-list.html
Without animated visuals, I don't think any non-math/non-ML person can ever get a good understanding of transformers.
You will need to watch videos.
Watch this playlist and you will understand: https://youtube.com/playlist?list=PLaJCKi8Nk1hwaMUYxJMiM3jTB...
Then watch this and you will understand even more: https://youtu.be/g2BRIuln4uc
Finally, watch this playlist: https://youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_...
Ben has also posted a Part II follow-up:
https://benlevinstein.substack.com/p/a-conceptual-guide-to-t...
Also, another good intro article -
Transformers from Scratch:
If you'd prefer something readable and explicit, instead of empty handwaving and UML-like diagrams, read "The Transformer model in equations" [0] by John Thickstun [1].
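For a taste of that equational style, the core operation everything else is built around is scaled dot-product attention, which "Attention Is All You Need" defines as:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; the sqrt(d_k) scaling keeps large dot products from saturating the softmax.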
Besides everything that was mentioned here, what finally made it click for me early in my journey was running through this excellent tutorial by Peter Bloem multiple times (https://peterbloem.nl/blog/transformers). Highly recommend.
The Illustrated Transformer is pretty great. I was pretty hazy after reading the paper back in 2017 and this resource helped a lot.
I thought I understood transformers well, even though I had never implemented them. Then one day I implemented them, and they didn't work/train nearly as well as the standard PyTorch transformer.
I eventually realized that I had ignored the dropout, because I thought my data could never overfit. (I trained the transformer to add numbers, and I never showed it the same pair twice.) Turns out dropout has a much bigger role than I had realized.
TLDR, just go and implement a transformer.
The more from scratch the better.
Everyone I know who tried it, ended up learning something they hadn't expected.
From how training is parallelized over tokens down to how backprop really works.
It's different for every person.
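To make that advice concrete, here is a minimal sketch (mine, not the commenter's code) of a single decoder-style block in PyTorch, including the dropout that turned out to matter in the anecdote above:

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One pre-norm transformer block: self-attention plus MLP, with dropout."""
        def __init__(self, d_model=64, n_heads=4, dropout=0.1):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
                nn.Dropout(dropout),  # easy to skip, but it matters for training
            )

        def forward(self, x):
            # Causal mask: each position may attend only to earlier positions.
            T = x.size(1)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=mask)
            x = x + attn_out               # residual connection
            x = x + self.mlp(self.ln2(x))  # residual connection
            return x

    # Smoke test: batch of 2 sequences, length 10, model width 64.
    block = TransformerBlock()
    print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])

Stack a few of these with token and position embeddings on the front and an output projection on the end, and you have the skeleton of a GPT-style model.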
I wanted to talk about what powers LLMs, which I believe is important. The answer to that is transformers. While I may not have delved deeper into how a transformer actually works, I tried to explain the concepts in the simplest way possible.
It's also important to learn how to "teach yourself".
Understanding transformers will be really hard if you don't understand basic fully connected feedforward networks (multilayer perceptrons). And learning those is a bit challenging if you don't understand a single unit perceptron.
Transformers have the additional challenge of somewhat weird terminology. Keys, queries, and values kinda make sense from traditional information retrieval literature, but they're more a metaphor in the attention system. "Attention" and other mentalistic/anthropomorphic terminology can also easily mislead intuitions.
Getting a good "learning path" is usually a teacher's main task, but you can learn to figure those by yourself by trying to find some part of the thing you can get a grasp of.
Most complicated seeming things (especially in tech) aren't really that complicated "to get". You just have to know a lot of stuff that the thing builds on.
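To anchor that terminology in something concrete, here is a tiny sketch (made-up shapes, mine) of what the query/key/value metaphor cashes out to numerically:

    import torch
    import torch.nn.functional as F

    d_k = 8
    q = torch.randn(5, d_k)   # queries: what each position is looking for
    k = torch.randn(5, d_k)   # keys:    what each position offers to match on
    v = torch.randn(5, 16)    # values:  the content that actually gets retrieved

    # Scores: how well each query matches each key (scaled dot products).
    scores = q @ k.T / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # each row sums to 1: a soft lookup
    out = weights @ v                     # weighted mix of values per query
    print(out.shape)                      # torch.Size([5, 16])

Unlike a database lookup, every value contributes to every output, just with different weights; that is the sense in which keys, queries, and values are a metaphor rather than literal retrieval.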
Karpathy gave a good high-level history of the transformer architecture in this Stanford lecture: https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618
Is it only me, or is it actually harder to grasp how transformers "really work" after reading this article, with its many high-level, vague phrases and anecdotes that skip the actual essence of the smart tricks making transformers computationally efficient?
I recommend the videos from Andrej Karpathy on this topic. Well delivered, clearly explaining the main techniques, and providing a Python implementation.
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.