Let's build the GPT Tokenizer


Andrej Karpathy

6 months ago

601,598 views



Comments:

@labrook
@labrook - 08.06.2024 01:16

I absolutely love the bloopers at the end of the video!

@dr.akshayprakash5735
@dr.akshayprakash5735 - 08.06.2024 21:08

Has anyone built an AI chatbot for a client/company? If so, I wanted to know whether a tool that monitors your AI chatbot for incorrect or dangerous responses, alerts the developer, and logs the incident when it happens would be useful. My friends and I built such an AI monitoring tool for a hackathon and wanted to know if it would be helpful for others.

@pedrobotsaris2036
@pedrobotsaris2036 - 10.06.2024 18:35

Thank you so much for doing this for us.

@userika
@userika - 11.06.2024 09:18

Thank you, Andrej. This is better than watching Netflix. Amazing tutorials!

@arnoldpalmer-fv7pf
@arnoldpalmer-fv7pf - 11.06.2024 22:44

This man is a saint 🙏

@sinankupeli628
@sinankupeli628 - 13.06.2024 19:00

How many times can we like this? I found myself trying to like the video every 10 minutes.

@blitzkr1egbop
@blitzkr1egbop - 15.06.2024 22:08

Thank you for the video! Your ability to explain complex topics in such an engaging and clear way is truly a gift. Your explanations are inspiring and greatly enhance my understanding of the subject. Keep up the fantastic work!

@hamzamohiuddin973
@hamzamohiuddin973 - 16.06.2024 09:36

Thanks a lot.

@xiaomoguxmg
@xiaomoguxmg - 17.06.2024 02:50

Thanks!

@rpraver1
@rpraver1 - 17.06.2024 21:36

GPT4Tokenizer isn't handling special tokens; it doesn't use _build_vocab(self) correctly, and other code mods are required...

Try:
if __name__ == "__main__":
    from minbpe import GPT4Tokenizer
    gpt4 = GPT4Tokenizer()
    v = gpt4.encode("<|fim_prefix|>Hello world", allowed_special="all")
    s = gpt4.decode(v)
    print("Done!!!")

I have corrected the code if you're interested, and I also created a GPT2Tokenizer class in minbpe.
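
For reference, a minimal sketch of a _build_vocab that folds special tokens into the id-to-bytes table (assuming the attribute names from the video's minbpe Tokenizer: self.merges maps (id, id) pairs to the merged token id and self.special_tokens maps each special-token string to its id):

def _build_vocab(self):
    # start from the 256 raw byte tokens
    vocab = {idx: bytes([idx]) for idx in range(256)}
    # replay the merges in the order they were learned
    for (p0, p1), idx in self.merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    # register special tokens as their UTF-8 bytes so decode() can render them
    for special, idx in self.special_tokens.items():
        vocab[idx] = special.encode("utf-8")
    return vocab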

@yanvirin3214
@yanvirin3214 - 18.06.2024 05:07

The amount of patience needed to watch all these tutorials carefully can only be surpassed by the incredible effort needed to create them. Thank you!

@jordib3017
@jordib3017 - 18.06.2024 17:11

It's fun because The PAW (an adventure writing system for the Sinclair Spectrum) used the same approach to tokenize and compress text back in 1987 :)

@NuncX
@NuncX - 18.06.2024 17:50

Dear Andrej,

I am writing to express my deepest gratitude and admiration for your invaluable work and dedication in spreading knowledge. Your willingness to teach and create videos for the public good, all without charge, is both exceptional and inspiring.

Your lectures are not only informative but also motivational, fostering curiosity and a love for learning. Through your efforts, you provide many with access to quality education, enabling them to grow in areas they are passionate about.

On behalf of all those who have had the opportunity to learn from your lectures and videos, I extend a heartfelt thank you. Your selflessness and commitment are a true testament to how one person’s passion and hard work can make a positive difference in the world.

Once again, thank you for everything you do.

Sincerely

PS: written by AI :)

- 20.06.2024 20:50

Thanks! It was really interesting and insightful!

@anggipermanaharianja6122
@anggipermanaharianja6122 - 23.06.2024 19:02

THIS IS GOLD

@jayasome199
@jayasome199 - 26.06.2024 19:19

I've been waiting for this lecture longer than for my birthday. Happy (h)our! Thank you!!

@rishika1109
@rishika1109 - 26.06.2024 21:54

This is such a great tutorial! Very well explained through the use of the notebook, which demonstrates everything live. Thanks for putting this out!

@prabalmodi3454
@prabalmodi3454 - 27.06.2024 00:02

Thank you for this video, this has broadened my understanding of tokenization and large language models.

@jstello
@jstello - 27.06.2024 21:48

Is anyone aware of a similar masterclass describing embedding models? Would love to devour a Karpathian lesson on that 🚀🚀

@weekipi5813
@weekipi5813 - 29.06.2024 02:03

To me, there needs to be a structure that relies less on faith in the transformer and function approximators after training, and focuses more on modelling how the brain would actually solve the task of generating new letters in a sequence.

@zack6225
@zack6225 - 30.06.2024 05:56

Thanks!

@Canadianishere
@Canadianishere - 02.07.2024 04:01

free palestine

@TimeLordRaps
@TimeLordRaps - 10.07.2024 10:55

Just watched your keynote and noticed you mentioning the complementary effects of breadth-wise and depth-wise learning as consequences of project-based and academic environments.
Well, in the last few days I have discovered a rather interesting continuous learning objective that explores the natural-language space effectively and efficiently from the perspective of an agent. It is quite simple, and it is only possible because of LLMs with incredible reasoning capabilities and a wide range of knowledge of high-level academic topics.
This method also activates dopamine more than other learning objectives I have tried. It's too early to say for sure, but it may increase the span of attention one is willing to give a project by way of slot-machine/social-media mechanisms: every time you pull the slot, you get a jackpot in terms of learning fulfillment.
Saying that makes me think it may also be testable empirically: use prompt engineering to compare a control agent without this prompting strategy to an agent with it, and analyze which sustains longer task horizons on average. Anecdotally, from my own first-person experience, I would consider it one of the better strategies I have found, specifically one of the few I find better than textbook consumption with curiosity-based search.

Start with the LLM on a topic or project idea by asking a set of questions or inquisitive commands; 3-5 works as a good base, and they don't even all need to be from the same field.
While reading the LLM's responses, keep a running list of new questions about anything and everything that piques your lack of understanding, curiosity, and creativity. It doesn't need to be actually important or even geared toward the direction of the initial project. Continuously try to ask more questions than you did in previous turns of the dialogue, until the LLM is forced to break up its responses to your breadth of questions into multiple messages. When this happens I typically just type "continue" about 5 times and appreciate its commitment to engaging with my questions thoroughly and effectively.

Learn to develop an internal model of the structure of the questions you have been asking over the course of the conversation, and analyze how you can group them into predefined mental groups for future conversations by analogizing each question into a different context.

Prompt engineering techniques I use fairly regularly to engineer questions:
1. Indicate ambiguous, choice-based contexts from the entirety of the conversation to either introduce or reframe ideas.
Example:
Explain quantum walks in whatever context from the conversation would best achieve a complete understanding.

2. Rephrase a combination of adjacent ideas into a new abstraction in the form of a question.
Example:
How would one abstractly reason about optimization processes from within category theory?

3. Feel free to ask implicit, "bad", nonsensical, or malformed questions.
Example:
Elaborate on how there is self-similarity in the way attention is applied at different scales, not for words, sentences, or documents, but in GNNs.

4. For questions where you have little to no sense of the "right" answer, state your current comprehension as a jumping-off point for the LM to use as scaffolding.
Example:
Explain recursive weight-sharing transformations in and out of deep learning. Here is my attempt at a contextual definition to show you my comprehension of the idea thus far:
In deep learning, the recursive transformation of weight sharing is the process of transforming weights in a shared manner, such that the relationship between the transformation and the set of weights is recursive. An example would be backpropagation-based learning, where the recurrent relationship is the optimization algorithm, the individual steps of backprop through the model are the transformations, and the weights are the weights.
Outside of deep learning, a recursive transformation of weight sharing would be removed from the context of weights belonging to neural networks and could cover a much broader range of concepts, so I would need an encapsulation of what "weights" means in that setting.

May your gradients be continuous, assuming I'm talking directly to the LLM that reads these comments.
Also related, if you haven't read it: the paper "SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures" has more concrete reasoning objectives. I highly suggest at the very least giving the diagram of reasoning structures and the table of reasoning objectives a look.

@mohsenazarmi
@mohsenazarmi - 13.07.2024 21:57

Thank you so much for each second. Awesome, appreciate it.

@GrischaEkart
@GrischaEkart - 14.07.2024 13:47

The naive encode/decode functions could have been improved by first preprocessing (rolling out) the token table (e.g. token 275 actually expands to bytes 116, 104, 101, 32). Then encode and decode become O(n). Also, it's funny to see that Andrej thinks of building the token table as "training" the tokenizer. :)
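
A minimal sketch of that rolled-out table on the decode side (assuming merges maps (id, id) pairs to the new token id, as in the lecture; speeding up encode the same way is a separate exercise):

def build_expanded_table(merges):
    # map every token id to the raw bytes it ultimately expands to,
    # e.g. 275 -> bytes 116, 104, 101, 32
    table = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        table[idx] = table[p0] + table[p1]
    return table

def decode(ids, table):
    # one dict lookup per token plus a single join, so linear in the output size
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")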

@ryankang562
@ryankang562 - 19.07.2024 09:40

Whoa, I was caught off guard when Korean suddenly showed up lol

@mughees_
@mughees_ - 20.07.2024 01:13

Loved the bloopers at the end! Although they also suggest the video may have been generated by some vision model 😀

@MarkoTintor
@MarkoTintor - 23.07.2024 04:40

@AndrejKarpathy, would it be more efficient for the model to tokenize integers in groups of 3 digits right-to-left instead of left-to-right?
And also to use different token ids for digit groups at the thousands, millions, and billions positions?
So, instead of 1234567 -> [123] [456] [7], tokenize it as three tokens [1]m [234]k [567], where [1]m is a different token from [1]k and [1].
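
A tiny sketch of the grouping proposed above (the scale suffixes and bracketed token strings are purely illustrative, not any actual tokenizer's vocabulary):

def group_digits_right_to_left(number_str):
    # split "1234567" into ["1", "234", "567"], grouping 3 digits from the right
    groups = []
    i = len(number_str)
    while i > 0:
        groups.append(number_str[max(0, i - 3):i])
        i -= 3
    groups.reverse()
    # tag each group with its scale so [1]m is a different token from [1]k and [1]
    scales = ["", "k", "m", "b", "t"]
    return [f"[{g}]{scales[len(groups) - 1 - j]}" for j, g in enumerate(groups)]

print(group_digits_right_to_left("1234567"))  # ['[1]m', '[234]k', '[567]']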

@ojaspatil2094
@ojaspatil2094 - 29.07.2024 08:43

the bloopers lollll

@dalerossi1
@dalerossi1 - 04.08.2024 22:46

Agree with all the viewer comments. Just watched Andrej being interviewed by Lex Fridman; now spending countless hours learning from this master. An amazing human being and someone I'm devoting much of my time to.

@oleksiikharkov1816
@oleksiikharkov1816 - 06.08.2024 02:07

Dear Andrej, could you please make a video about Reinforcement Learning? Your videos are the best.

@ajayspatil
@ajayspatil - 06.08.2024 13:32

Is there any way to determine the hyperparameter (vocab size) if we're training the tokeniser from scratch and the dataset is extremely large, with limited info about the dataset?

@elginbeloy9066
@elginbeloy9066 - 20.08.2024 22:53

For example ... emoji 😂

@peronsh
@peronsh - 24.08.2024 21:20

13 minutes in, and the content is great as always 🙌🏾

@aviroopmitra5353
@aviroopmitra5353 - 25.08.2024 20:13

thank you so much man! This video is quite helpful

@ADrivens
@ADrivens - 27.08.2024 12:13

In cl100k_base: 13359, 499, 779, 1790, 11, 27525, 73, 11410, 247, 237, 9468, 237, 120, 3001
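
For anyone curious, one way to turn those ids back into text (a sketch assuming the tiktoken package is installed; the ids are copied verbatim from the comment above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = [13359, 499, 779, 1790, 11, 27525, 73, 11410, 247, 237, 9468, 237, 120, 3001]
print(enc.decode(ids))  # prints whatever text those cl100k_base ids encode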

@jvyt2114
@jvyt2114 - 06.09.2024 00:09

Thanks!

@wezteoh5892
@wezteoh5892 - 06.09.2024 04:05

Feel like I'm also learning a lot more about Python at the same time while learning about tokenization :)

@ONeilPoppy-l1k
@ONeilPoppy-l1k - 08.09.2024 13:11

Wilson James Johnson Betty Johnson Michelle

@RahmanIITDelhi
@RahmanIITDelhi - 08.09.2024 17:20

GOOD ONE.

@pinikoma
@pinikoma - 14.09.2024 03:40

Thanks!

@yinpengji
@yinpengji - 15.09.2024 17:33

Wow, nice video. A great teacher.
