best way to learn about transformers

Summary of results

GPT-4o
  • Introductory Resources:

  • Foundational Papers and Articles:

  • Hands-On Learning:

    • Implementing a transformer from scratch is suggested as a practical way to deepen understanding. This approach helps uncover unexpected insights about the model's workings.
    • Andrej Karpathy's course and videos are highlighted for their practical, code-based approach to learning.
  • Online Courses and Tutorials:

  • Additional Resources:

    • Hugging Face's website is recommended for practical examples and links to relevant papers. Link
    • The University of Amsterdam's notebooks, particularly Tutorial 6, cover transformers in detail. Link
  • General Advice:

    • Understanding basic neural network concepts is crucial before diving into transformers.
    • Videos and visual aids are emphasized as essential for grasping the complex concepts involved in transformers.
1.

This Intro to Transformers is helpful for getting a basic understanding of the underlying concepts, and it comes with a really succinct history lesson as well. https://www.youtube.com/watch?v=XfpMkf4rD6E

2.

For a while now, an answer I've seen is to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on grokking transformer computational fundamentals and they've turned up some helpful later additions that simplify and clarify what's going on.

You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:

(1) Transformers from Scratch: https://peterbloem.nl/blog/transformers

(2) Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...

(3) Formal Algorithms for Transformers: https://arxiv.org/abs/2207.09238

3.
4.

Those Computerphile videos[0] by Rob Miles helped me understand transformers. He specifically references the "Attention is all you need" paper.

And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.

[0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk

[1] https://youtu.be/kCc8FmEb1nY

5.

I found this article on “transformers from scratch”[0] to be a perfect (for me) middle ground between high level hand-wavy explanations and overly technical in-the-weeds academic or code treatments.

[0] https://e2eml.school/transformers.html

6.

This concept is really interesting to me. I am very new to transformers but would love to learn more about standard transformers, and differential transformers too.

Can anyone suggest any resources?

7.

To be honest, I'd start with some introduction to Transformer YouTube videos. They'll cover a lot of these terms and you'll then have a better understanding to find additional resources.

8.

So how practical is learning to create your own transformers if you can't afford a giant amount of resources to train them?

9.

An early explainer of transformers, and a quicker read, that I found very useful when they were still new to me, is The Illustrated Transformer[1], by Jay Alammar.

A more recent, academic but high-level explanation of transformers, very good for detail on the different architectural flavors (e.g. encoder-decoder vs decoder-only), is Formal Algorithms for Transformers[2], from DeepMind.

[1] https://jalammar.github.io/illustrated-transformer/

[2] https://arxiv.org/abs/2207.09238

10.

The best way to understand transformers is to take Andrej Karpathy's course on YouTube. With a keyboard and a lot of focused time.

11.

To be honest, for transformers just go to huggingface.co and see what interests you. They have tons of examples to run and they also link to all the papers in the documentation. It doesn't get much easier to get into it. Even for the more recent stuff like vision transformers and diffusion models.

12.

For those that want a high level overview of Transformers, we recently covered it in our podcast: https://www.youtube.com/watch?v=Kb0II5DuDE0

13.

Would this teach transformers? Or is that something else?

Also, any tips for finding a study group for learning about large language models? I can't seem to self-motivate.

14.

Every time I need a refresher on transformers, I read the same author's post on them. Looking forward to this one!

15.

For specifically understanding transformers, this (w/ maybe GPT-4 by your side to unpack jargon/math) might be able to get you from lay-person to understanding enough to be dangerous pretty quickly: https://sebastianraschka.com/blog/2023/llm-reading-list.html

16.

Without animated visuals, I don't think any non-math/non-ML person can ever get a good understanding of transformers.

You will need to watch videos.

Watch this playlist and you will understand: https://youtube.com/playlist?list=PLaJCKi8Nk1hwaMUYxJMiM3jTB...

Then watch this and you will understand even more: https://youtu.be/g2BRIuln4uc

Finally, watch this playlist: https://youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_...

17.

If you'd prefer something readable and explicit, instead of empty hand-waving and UML-like diagrams, read "The Transformer model in equations" [0] by John Thickstun [1].

[0] https://johnthickstun.com/docs/transformers.pdf

[1] https://johnthickstun.com/docs/

18.

The Illustrated Transformer is pretty great. I was pretty hazy after reading the paper back in 2017 and this resource helped a lot.

https://jalammar.github.io/illustrated-transformer/

19.

I thought I understood transformers well, even though I had never implemented them. Then one day I implemented them, and they didn't work/train nearly as well as the standard pytorch transformer.

I eventually realized that I had ignored the dropout, because I thought my data could never overfit. (I trained the transformer to add numbers, and I never showed it the same pair twice.) Turns out dropout has a much bigger role than I had realized.

TL;DR: just go and implement a transformer. The more from scratch, the better.

Everyone I know who tried it ended up learning something they hadn't expected, from how training is parallelized over tokens down to how backprop really works. It's different for every person.
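The dropout surprise above is easy to reproduce in miniature. Here is a minimal sketch of inverted dropout in NumPy (an illustration of why train and eval behave differently, not the PyTorch implementation):

```python
import numpy as np

def dropout(x, p=0.1, training=True, rng=None):
    """Inverted dropout: zero a fraction p of activations during training,
    scaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x  # dropout is the identity at eval time
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True)   # entries are 0.0 or 2.0
eval_out = dropout(x, p=0.5, training=False)   # the input, untouched
```

Forgetting either the mask or the rescaling is exactly the kind of bug a from-scratch implementation surfaces.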

20.

It's also important to learn how to "teach yourself".

Understanding transformers will be really hard if you don't understand basic fully connected feedforward networks (multilayer perceptrons). And learning those is a bit challenging if you don't understand a single unit perceptron.

Transformers have the additional challenge of somewhat odd terminology. Keys, queries and values kind of make sense coming from traditional information retrieval literature, but they're more a metaphor in the attention mechanism. "Attention" and other mentalistic/anthropomorphic terminology can also easily mislead intuitions.

Getting a good "learning path" is usually a teacher's main task, but you can learn to figure those by yourself by trying to find some part of the thing you can get a grasp of.

Most complicated seeming things (especially in tech) aren't really that complicated "to get". You just have to know a lot of stuff that the thing builds on.
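To make the key/query/value metaphor concrete: in code, they are just three linear projections of the same input. A minimal single-head scaled dot-product self-attention sketch in NumPy (shapes are made up for illustration, and this is not a full transformer layer):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    Queries, keys and values are three projections of the same tokens;
    the IR-flavored names are a metaphor, not separate data structures."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # each output is a weighted mixture of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```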

21.

karpathy gave a good high-level history of the transformer architecture in this Stanford lecture https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618

22.

Is it only me, or is it actually harder to grasp how transformers "really work" after reading this article, full of high-level, vague phrases and anecdotes that skip the actual essence of the many smart tricks making transformers computationally efficient?

I recommend the videos from Andrej Karpathy on this topic. Well delivered, clearly explaining the main techniques, and providing a Python implementation.

23.

Think you need to read up on what transformers actually do (they refine syntactically, semantically, and in whatever additional way you wish) and what emergent properties they have.

24.

Knowledge distillation for transformers is already a thing and it is still actively researched since the potential benefits of not having to run these gigantic models are enormous.
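For intuition, the core of classic (Hinton-style) distillation is a cross-entropy between temperature-softened teacher and student distributions. A NumPy sketch, with illustrative function names and toy logits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer distributions."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable
    across temperatures."""
    p = softmax(teacher_logits, T)       # soft targets from the big model
    log_q = np.log(softmax(student_logits, T))
    return -(p * log_q).sum(axis=-1).mean() * T * T

teacher = np.array([[4.0, 1.0, 0.5]])
aligned = distillation_loss(teacher, teacher)  # student matches teacher
off = distillation_loss(np.array([[0.5, 1.0, 4.0]]), teacher)
print(aligned < off)  # True: matching the teacher gives the lower loss
```

In practice this term is usually mixed with the ordinary hard-label loss, but the soft targets are what let the small model inherit the big one's behavior.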

25.

Not a complete answer, but here are the most helpful resources for understanding transformer basics in particular:

Original transformer paper: https://arxiv.org/abs/1706.03762

Illustrated transformer: http://jalammar.github.io/illustrated-transformer/

Transformer visualization: https://bbycroft.net/llm

minGPT (Karpathy): https://github.com/karpathy/minGPT

---

Next, some foundational textbooks for general ML and deep learning:

Elements of Statistical Learning (aka the bible): https://hastie.su.domains/ElemStatLearn/

Probabilistic ML: https://probml.github.io/pml-book/book2.html

Deep Learning Book (Goodfellow/Bengio): https://www.deeplearningbook.org/

Understanding Deep Learning:

https://udlbook.github.io/udlbook/

---

Finally, assorted tutorials/resources/intro courses:

Beyond the Illustrated Transformer: https://news.ycombinator.com/item?id=35712334

AI Zero to Hero: https://karpathy.ai/zero-to-hero.html

AI Canon: https://a16z.com/2023/05/25/ai-canon/

LLM University by Cohere: https://llm.university/

Practical Guide to LLMs: https://github.com/Mooler0410/LLMsPracticalGuide

Practical Deep Learning for Coders: https://course.fast.ai/Lessons/part2.html

---

Hope that helps!

26.

Try this: https://jalammar.github.io/illustrated-transformer/

Attention is explained separately. I have not seen an all-in-one diagram and cannot imagine one being helpful, since there's too much going on.

27.

Tangential, but: the notes that this post refers to are probably not the best way to learn how transformers work. If you want mathematical precision, those notes are based on this paper from DeepMind:

https://arxiv.org/abs/2207.09238

The paper provides mathematically precise definitions of all the parts of a transformer, though it's showing its age (ha!) in that it doesn't include some formalizations that are common in, for example, Llama.

28.

I would also recommend going through Callum McDougall/Neel Nanda's fantastic Transformer from Scratch tutorial. It takes a different approach to conceptualizing the model (or at least, it implements it in a way which emphasizes different characteristics of Transformers and self-attention), which I found deeply satisfying when I first explored them.

https://arena-ch1-transformers.streamlit.app/%5B1.1%5D_Trans...

29.

I'm wondering whether your opinion comes from using the product or from delving into transformers. It seems like the former, and you should probably spend some more time on the latter.

30.

I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY

Additionally, for more comprehensive resources on Transformers, you may find these resources useful:

* The Illustrated Transformer by Jay Alammar: http://jalammar.github.io/illustrated-transformer/

* MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://www.youtube.com/watch?v=ySEx_Bqxvvo

* Karpathy's course, Deep Learning and Generative Models (Lecture 6 covers Transformers): https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs......

These resources cover different aspects of Transformers and can help you grasp the underlying concepts and mechanisms better.

31.

Those interested in this might also be interested in some notes from the University of Amsterdam:

https://uvadlc-notebooks.readthedocs.io/en/latest/index.html

Tutorial 6 covers transformers.

32.

Assuming you are using Transformers, the official notebooks are a logical place to start: https://huggingface.co/docs/transformers/notebooks


Built by @jnnnthnn