Best way to learn about transformers


Summary of results

1.

This Intro to Transformers is helpful for getting a basic understanding of the underlying concepts, and it comes with a really succinct history lesson as well. https://www.youtube.com/watch?v=XfpMkf4rD6E

2.

This link was posted here recently, and was the most understandable explanation I've found so far: https://e2eml.school/transformers.html

3.

I'm the author of https://jalammar.github.io/illustrated-transformer/ and have spent the years since introducing people to Transformers and thinking about how best to communicate these concepts. I've found that different people need different kinds of introductions, and the thread here includes some often-cited resources, including:

https://peterbloem.nl/blog/transformers

https://e2eml.school/transformers.html

I would also add Luis Serrano's article here: https://txt.cohere.com/what-are-transformer-models/ (HN discussion: https://news.ycombinator.com/item?id=35576918).

Looking back at The Illustrated Transformer, when I introduce people to the topic now, I find I can hide some complexity by omitting the encoder-decoder architecture and focusing on just one of the two stacks. Decoders are great for this because a lot of people now come to Transformers having heard of GPT models (which are decoder-only). So my canonical intro to Transformers now touches only on a decoder model. You can see this narrative here: https://www.youtube.com/watch?v=MQnJZuBGmSQ

4.

I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits", which builds up a lot of really useful ideas for understanding how and why transformers work, and for starting to treat them as something other than magical black boxes.

https://transformer-circuits.pub/2021/framework/index.html

5.

Without animated visuals, I don't think any non-math/non-ML person can ever get a good understanding of transformers.

You will need to watch videos.

Watch this playlist and you will understand: https://youtube.com/playlist?list=PLaJCKi8Nk1hwaMUYxJMiM3jTB...

Then watch this and you will understand even more: https://youtu.be/g2BRIuln4uc

Finally, watch this playlist: https://youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_...

6.

This concept is really interesting to me. I'm very new to transformers, but I'd love to learn more about both normal transformers and differential transformers.

Can anyone suggest any resources?

7.

For a while now, an answer I've seen is to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on grokking transformer computational fundamentals and they've turned up some helpful later additions that simplify and clarify what's going on.

You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:

(1) Transformers from Scratch: https://peterbloem.nl/blog/transformers

(2) Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...

(3) Formal Algorithms for Transformers: https://arxiv.org/abs/2207.09238

8.

To be honest, I'd start with some introduction to Transformer YouTube videos. They'll cover a lot of these terms and you'll then have a better understanding to find additional resources.

9.

So how practical is learning to create your own transformers if you can't afford a giant amount of resources to train them?

10.

> Transformer learning explained

Well, "explained" seems like a stretch; I would rather call it a mathematical derivation of the operation of a transformer, which is certainly interesting for some specialists.

11.

An early explainer of transformers that I found very useful when they were still new to me, and which is a quicker read, is The Illustrated Transformer[1] by Jay Alammar.

A more recent academic but high-level explanation of transformers, very good for detail on the different architectural flavors (e.g. encoder-decoder vs. decoder-only), is Formal Algorithms for Transformers[2], from DeepMind.

[1] https://jalammar.github.io/illustrated-transformer/

[2] https://arxiv.org/abs/2207.09238

12.

The best way to understand transformers is to take Andrej Karpathy's course on YouTube. With a keyboard and a lot of focused time.

13.

I remember looking into this article; it really helped me understand transformers. Although the OP's article is detailed, this one is concise. Here's the link: https://blue-season.github.io/transformer-in-5-minutes

14.

To be honest, for transformers just go to huggingface.co and see what interests you. They have tons of examples to run and they also link to all the papers in the documentation. It doesn't get much easier to get into it. Even for the more recent stuff like vision transformers and diffusion models.

15.

Those Computerphile videos[0] by Rob Miles helped me understand transformers. He specifically references the "Attention is all you need" paper.

And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.

[0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk

[1] https://youtu.be/kCc8FmEb1nY

16.

Jay Alammar's Illustrated Transformer, although this too is detailed. I think it's still worth a look, because really I don't think people have yet "compressed" what transformers do into something intuitive. None of the concepts in the network are particularly hard math; it's basic linear algebra. But the overall construction is complicated.

https://jalammar.github.io/illustrated-transformer/

17.

Would this teach transformers? Or is that something else?

Also, any tips for finding a study group for learning large language models? I can't seem to self-motivate.

19.

For specifically understanding transformers, this (w/ maybe GPT-4 by your side to unpack jargon/math) might be able to get you from lay-person to understanding enough to be dangerous pretty quickly: https://sebastianraschka.com/blog/2023/llm-reading-list.html

20.

Besides everything that was mentioned here, what finally made it click for me early in my journey was running through this excellent tutorial by Peter Bloem multiple times: https://peterbloem.nl/blog/transformers. Highly recommend.

21.

I thought I understood transformers well, even though I had never implemented them. Then one day I implemented them, and they didn't work/train nearly as well as the standard PyTorch transformer.

I eventually realized that I had ignored the dropout, because I thought my data could never overfit. (I trained the transformer to add numbers, and I never showed it the same pair twice.) Turns out dropout has a much bigger role than I had realized.

TL;DR: just go and implement a transformer. The more from scratch, the better. Everyone I know who tried it ended up learning something they hadn't expected, from how training is parallelized over tokens down to how backprop really works. It's different for every person.
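
If you take that advice, here's a minimal sketch of a decoder block (PyTorch assumed; the class name and hyperparameters are illustrative, not a standard library layer), showing where the easy-to-ignore dropout actually lives and how the causal mask lets training run over all tokens in parallel:

    import torch
    import torch.nn as nn

    class TinyDecoderBlock(nn.Module):
        def __init__(self, d_model=64, n_heads=4, p_drop=0.1):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=p_drop, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            # The dropout from the anecdote above: applied to the attention
            # weights (inside self.attn) and to both residual branches (below).
            self.drop = nn.Dropout(p_drop)

        def forward(self, x):                      # x: (batch, tokens, d_model)
            n = x.size(1)
            # Causal mask: True means "may not attend". Each token sees only
            # earlier tokens, which lets training parallelize over positions.
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, attn_mask=mask)
            x = x + self.drop(a)
            x = x + self.drop(self.ff(self.ln2(x)))
            return x

    x = torch.randn(2, 10, 64)                     # (batch, tokens, dim)
    print(TinyDecoderBlock()(x).shape)             # torch.Size([2, 10, 64])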

22.

It's also important to learn how to "teach yourself".

Understanding transformers will be really hard if you don't understand basic fully connected feedforward networks (multilayer perceptrons), and learning those is a bit challenging if you don't understand a single-unit perceptron.

Transformers have the additional challenge of somewhat weird terminology. Keys, queries, and values kinda make sense coming from traditional information-retrieval literature, but they're more of a metaphor in the attention mechanism. "Attention" and other mentalistic/anthropomorphic terminology can also easily mislead intuitions.
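
To make the metaphor concrete, here's a minimal sketch of scaled dot-product attention (NumPy assumed; the weights are toy values), showing that queries, keys, and values are just three linear projections of the same tokens:

    import numpy as np

    def attention(x, Wq, Wk, Wv):
        # Three linear projections of the same input: no retrieval magic.
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # Each token's "query" is scored against every token's "key"...
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        # ...and "attention" is just a weighted average of the "values".
        return weights @ V

    x = np.random.randn(5, 16)              # 5 tokens, 16-dim embeddings
    Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
    print(attention(x, Wq, Wk, Wv).shape)   # (5, 16)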

Getting a good "learning path" is usually a teacher's main task, but you can learn to figure those out yourself by finding some part of the thing you can get a grasp of.

Most complicated-seeming things (especially in tech) aren't really that complicated "to get". You just have to know a lot of the stuff the thing builds on.

23.

Karpathy gave a good high-level history of the transformer architecture in this Stanford lecture: https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618

24.

Is it just me, or does this article, with its high-level, vague phrases and anecdotes (skipping the actual essence of the many smart tricks that make transformers computationally efficient), actually make it harder to grasp how transformers "really work"?

I recommend Andrej Karpathy's videos on this topic: well delivered, clearly explaining the main techniques, and providing a Python implementation.

25.

Knowledge distillation for transformers is already a thing, and it's still actively researched, since the potential benefit of not having to run these gigantic models is enormous.
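
The core recipe is surprisingly small; here's a minimal sketch of the classic Hinton-style logit-distillation loss (PyTorch assumed; the temperature and mixing weight are illustrative choices, not canonical values):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soften both distributions with temperature T, then match them with KL.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
        # Blend with ordinary cross-entropy on the true labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    student = torch.randn(8, 100)            # (batch, vocab) student logits
    teacher = torch.randn(8, 100)            # (batch, vocab) teacher logits
    labels = torch.randint(0, 100, (8,))     # true next-token ids
    print(distillation_loss(student, teacher, labels))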

26.

How come you know the efficient-transformers family? When I ask about them in ML interviews, nobody has heard of them. I can't figure out why it's not common knowledge; for years, all the transformer papers were about reducing the O(N^2) attention cost.
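
For context, the O(N^2) is the N-by-N score matrix that vanilla self-attention materializes; a tiny illustration (NumPy assumed):

    import numpy as np

    N, d = 1024, 64                 # sequence length, head dimension
    Q, K = np.random.randn(N, d), np.random.randn(N, d)
    scores = Q @ K.T                # shape (N, N): every token scores every token
    print(scores.shape)             # (1024, 1024); doubling N quadruples this matrix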

27.

Not a complete answer, but here are the most helpful resources for understanding transformer basics in particular:

Original transformer paper: https://arxiv.org/abs/1706.03762

Illustrated transformer: http://jalammar.github.io/illustrated-transformer/

Transformer visualization: https://bbycroft.net/llm

minGPT (Karpathy): https://github.com/karpathy/minGPT

---

Next, some foundational textbooks for general ML and deep learning:

Elements of Statistical Learning (aka the bible): https://hastie.su.domains/ElemStatLearn/

Probabilistic ML: https://probml.github.io/pml-book/book2.html

Deep Learning Book (Goodfellow/Bengio): https://www.deeplearningbook.org/

Understanding Deep Learning: https://udlbook.github.io/udlbook/

---

Finally, assorted tutorials/resources/intro courses:

Beyond the Illustrated Transformer: https://news.ycombinator.com/item?id=35712334

AI Zero to Hero: https://karpathy.ai/zero-to-hero.html

AI Canon: https://a16z.com/2023/05/25/ai-canon/

LLM University by Cohere: https://llm.university/

Practical Guide to LLMs: https://github.com/Mooler0410/LLMsPracticalGuide

Practical Deep Learning for Coders: https://course.fast.ai/Lessons/part2.html

---

Hope that helps!

28.

Try this: https://jalammar.github.io/illustrated-transformer/

Attention is explained separately. I have not seen an all-in-one diagram and cannot imagine one being helpful, since there's too much going on.

29.

Tangential, but: the notes that this post refers to are probably not the best way to learn how transformers work. If you want mathematical precision, those notes are based on this paper from DeepMind:

https://arxiv.org/abs/2207.09238

The paper provides mathematically precise definitions of all the parts of a transformer, though it's showing its age (ha!) in that it doesn't include some formalizations that are common in, for example, Llama.

30.

I would also recommend going through Callum McDougall/Neel Nanda's fantastic Transformer from Scratch tutorial. It takes a different approach to conceptualizing the model (or at least, it implements it in a way which emphasizes different characteristics of Transformers and self-attention), which I found deeply satisfying when I first explored them.

https://arena-ch1-transformers.streamlit.app/%5B1.1%5D_Trans...

31.

I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY

Additionally, for more comprehensive resources on Transformers, you may find these resources useful:

* The Illustrated Transformer by Jay Alammar: http://jalammar.github.io/illustrated-transformer/

* MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://www.youtube.com/watch?v=ySEx_Bqxvvo

* Karpathy's course, Deep Learning and Generative Models (Lecture 6 covers Transformers): https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs......

These resources cover different aspects of Transformers and can help you grasp the underlying concepts and mechanisms better.

32.

Those interested in this might also be interested in some notes from the University of Amsterdam:

https://uvadlc-notebooks.readthedocs.io/en/latest/index.html

Tutorial 6 covers transformers.

33.

Assuming you are using Transformers, the official notebooks are a logical place to start: https://huggingface.co/docs/transformers/notebooks

34.

I’d recommend reading the paper “Attention is all you need” as it lays a lot of the foundational knowledge about transformers.

I’m no math guru, so I had to read the paper like 5 or 6 times to wrap my head around it.

I had to stop trying to understand exactly how the math worked and just accept that it did; then it started to make sense.

Now, going back, I can actually understand some of why the math works.

35.

> transformers are more than meets the eye.

Really?

Transformers are very simple electrical machines. The design process is complex but the object itself is made of mild steel for the tank and supporting frameworks, special steel for the core, copper or aluminium for the windings, paper and resin for insulation, and oil for insulation and cooling.

There are a few ancillary components to do with detecting faults and switching taps and sometimes pumps to circulate the oil through radiators.

Apart from the pumps, circulating oil, and tap changer, there are no moving parts.

There are no compulsory electronic components in even the biggest transformers. Apart from improvements in material quality and the design process, a power or distribution transformer built now is really not very different from those built a hundred years ago. In fact, there are many transformers that were built fifty years ago that are still in service. Occasionally such transformers are repaired rather than replaced when they fail; quite often the original drawings are still available.

