This chapter hasn’t been formatted properly just yet; I’ll reformat it and most likely create a copy of it in the blog section.
Preprocessing Practice
Includes:
- tokenization
- converting tokens into token IDs
- adding special context tokens
- byte pair encoding
- data sampling with a sliding window
- creating token embeddings
- encoding word positions
1. Tokenization
('the-verdict.txt', <http.client.HTTPMessage at 0x10806b250>)
Total length of characters in "The Verdict", a short story: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
length of preprocessed text: 4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
These are the first 30 tokens!
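Roughly the code behind the output above (a minimal sketch; the download URL and the exact regex are assumptions, following the usual split-on-punctuation-and-whitespace approach):

```python
import re
import urllib.request

# Download the short story (URL is an assumption based on the book's companion repo)
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/"
       "ch02/01_main-chapter-code/the-verdict.txt")
urllib.request.urlretrieve(url, "the-verdict.txt")

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print('Total length of characters in "The Verdict", a short story:', len(raw_text))
print(raw_text[:99])

# Split on punctuation, double dashes, and whitespace; keep the separators as tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print("length of preprocessed text:", len(preprocessed))
print(preprocessed[:30])
```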
2. Converting tokens into token IDs
vocab_size=1130
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
3. Adding special context tokens
Let’s modify the vocabulary to add two special tokens: <|unk|> and <|endoftext|>.
1130 <class 'list'>
1132 <class 'list'>
Now updating my tokenizer class accordingly!
'" It\' s the last he painted, you know," And from here <|unk|>, <|unk|> <|unk|> about <|unk|>.'
'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.
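A sketch of the updated tokenizer (SimpleTokenizerV2 is my own name): unknown words fall back to <|unk|>, and <|endoftext|> marks boundaries between unrelated text sources.

```python
import re

# Extend the vocabulary with the two special tokens: 1130 -> 1132 entries
all_tokens = sorted(set(preprocessed)) + ["<|endoftext|>", "<|unk|>"]
vocab = {token: integer for integer, token in enumerate(all_tokens)}

class SimpleTokenizerV2:
    """Like V1, but tokens missing from the vocabulary map to <|unk|>."""
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>"
                        for item in preprocessed]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

tokenizer = SimpleTokenizerV2(vocab)
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))
```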
4. Byte pair encoding
Using tiktoken from here:
https://github.com/openai/tiktoken
tiktoken version: 0.8.0
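A sketch of the encode/decode round trip that produced the output below (the sample text is the multilingual string you can see decoded at the end of this section):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "これはペンです。日本語で書いていますが、GPT-2さんはこれを理解出るのでしょうか?<|endoftext|>"
    "LFGGGGGGGGG<|endoftext|>"
    "ठिक छ, केही नेपाली शब्दहरू पनि? के GPT-2 यसलाई इन्कोड र डिकोड गर्न सक्छ?<|endoftext|>"
)

# <|endoftext|> has to be explicitly allowed as a special token
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

# Decoding the IDs reproduces the original string
print(tokenizer.decode(integers))
```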
[46036, 39258, 31676, 1209, 248, 6527, 30640, 33623, 16764, 33768, 98, 17312, 105, 45739, 252, 30640, 162, 249, 116, 18566,
28134, 18566, 30159, 33623, 35585, 23513, 38, 11571, 12, 17, 43357, 22174, 31676, 46036, 39258, 31758, 49426, 228, 164, 100,
96, 49035, 118, 25748, 5641, 30640, 22180, 1792, 229, 29557, 27370, 171, 120, 253, 50256, 43, 37, 40415, 38, 50256,
11976, 254, 11976, 123, 11976, 243, 28225, 249, 11, 28225, 243, 24231, 229, 11976, 117, 24231, 222, 28225, 101, 24231,
229, 11976, 103, 48077, 11976, 110, 24231, 222, 28225, 114, 11976, 105, 24231, 235, 11976, 99, 11976, 117, 11976, 108,
24231, 224, 28225, 103, 11976, 101, 11976, 123, 30, 28225, 243, 24231, 229, 402, 11571, 12, 17, 28225, 107, 11976,
116, 11976, 110, 48077, 11976, 230, 28225, 229, 11976, 101, 24231, 235, 11976, 243, 24231, 233, 11976, 94, 28225, 108,
28225, 94, 11976, 123, 11976, 243, 24231, 233, 11976, 94, 28225, 245, 11976, 108, 24231, 235, 11976, 101, 28225, 116,
11976, 243, 24231, 235, 11976, 249, 30, 50256]
'これはペンです。日本語で書いていますが、GPT-2さんはこれを理解出るのでしょうか?<|endoftext|>LFGGGGGGGGG<|endoftext|>ठिक छ, केही नेपाली शब्दहरू पनि? के GPT-2 यसलाई इन्कोड र डिकोड गर्न सक्छ?<|endoftext|>'
5. Data sampling with a sliding window
Implementing a data loader that fetches the input–target pairs from the training dataset using a sliding window approach (figure 2.12 in the book).
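A sketch of the code behind the printouts below (the offset of 50 into the encoded text is an assumption; it is just meant to skip the opening passage):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
enc_text = tokenizer.encode(raw_text)  # raw_text from the tokenization step above
enc_sample = enc_text[50:]             # skip the opening tokens (offset is an assumption)

context_size = 4
print(enc_sample[:10])

for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "====>", desired)
    print("Decoded text of these token IDs below:")
    print(tokenizer.decode(context), "====>", tokenizer.decode([desired]))
```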
[290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686]
[290] ====> 4920
Decoded text of these token IDs below:
and ====> established
[290, 4920] ====> 2241
Decoded text of these token IDs below:
and established ====> himself
[290, 4920, 2241] ====> 287
Decoded text of these token IDs below:
and established himself ====> in
[290, 4920, 2241, 287] ====> 257
Decoded text of these token IDs below:
and established himself in ====> a
6. Efficient Data Loader Implementation with PyTorch
Using its built-in Dataset and DataLoader classes.
see Datasets & DataLoaders documentation: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
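A sketch of the dataset/data loader pair behind the batches below, following the tutorial's Dataset/DataLoader pattern (the names GPTDatasetV1 and create_dataloader_v1 and the default arguments are my own choices):

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    """Chunks tokenized text into overlapping (input, target) windows."""
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        # Slide a window of size max_length over the token IDs
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      drop_last=drop_last, num_workers=num_workers)

# For example, the [8, 4] batches shown further below:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4,
                                  stride=4, shuffle=False)
inputs, targets = next(iter(dataloader))
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```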
[tensor([[ 40],
[ 367],
[2885],
[1464]]), tensor([[ 367],
[2885],
[1464],
[1807]])]
[tensor([[1807],
[3619],
[ 402],
[ 271]]), tensor([[ 3619],
[ 402],
[ 271],
[10899]])]
Using the data loader to sample with a batch size greater than 1
Inputs:
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]])
Targets:
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
7. Creating Token Embeddings
As a preliminary step, these embedding weights are initialized with random values. This initialization serves as the starting point for the LLM’s learning process. Then later, the embedding weights need to be optimized as part of the LLM training.
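The toy example below uses a vocabulary of 6 and an embedding dimension of 3; a sketch of what produced it (the seed is an assumption, and the input IDs [2, 3, 5, 1] are inferred from which rows of the weight matrix get returned):

```python
import torch

torch.manual_seed(123)  # the specific seed is an assumption; it just fixes the random init
embedding_layer = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)
print(embedding_layer.weight)

# Looking up token IDs [2, 3, 5, 1] returns rows 2, 3, 5, 1 of the weight matrix
input_ids = torch.tensor([2, 3, 5, 1])
print(embedding_layer(input_ids))
```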
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
[ 0.9178, 1.5810, 1.3010],
[ 1.2753, -0.2010, -0.1606],
[-0.4015, 0.9666, -1.1481],
[-1.1589, 0.3255, -0.6315],
[-2.8400, -0.7849, -1.4096]], requires_grad=True)
tensor([[ 1.2753, -0.2010, -0.1606],
[-0.4015, 0.9666, -1.1481],
[-2.8400, -0.7849, -1.4096],
[ 0.9178, 1.5810, 1.3010]], grad_fn=<EmbeddingBackward0>)
This is basically what’s happening: the embedding layer acts as a lookup table, returning the row of its weight matrix that corresponds to each token ID.
8. Encoding word positions
The token embedding I just did does not carry any information about the position or order of the tokens being embedded (it is a deterministic, position-independent lookup of each token ID). The self-attention mechanism of the transformer, however, has no built-in notion of token position or order, so positional information has to be encoded separately and added here (since this is for a transformer).
creating initial position embeddings
Token IDs:
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]])
Inputs shape:
torch.Size([8, 4])
--------------------------------------------
Next batch of Token IDs:
tensor([[ 287, 262, 6001, 286],
[ 465, 13476, 11, 339],
[ 550, 5710, 465, 12036],
[ 11, 6405, 257, 5527],
[27075, 11, 290, 4920],
[ 2241, 287, 257, 4489],
[ 64, 319, 262, 34686],
[41976, 13, 357, 10915]])
Inputs shape:
torch.Size([8, 4])
↑ What is happening?
We know that batch_size=8, which means each batch contains 8 sequences (samples) from the data, and each sequence is max_length=4, so we see 4 token IDs on each row (each row here is 1 of the 8 sequences in the batch). Batches keep coming until the whole raw_text is consumed; what is printed here as Token IDs is just one part of my text. In other words, batch_size=8 doesn’t mean the whole text is split into exactly 8 parts. It simply defines the number of sequences processed together in each batch. The loader will continue creating batches of shape [8, 4] until the entire text has been processed.
===
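A sketch of the token embedding layer behind the [8, 4, 256] tensor below; 50,257 is the GPT-2 BPE vocabulary size, 256 is the embedding dimension chosen for this exercise, and `inputs` is the [8, 4] batch of token IDs printed above:

```python
import torch

vocab_size = 50257   # GPT-2 BPE vocabulary size
output_dim = 256

torch.manual_seed(123)  # seed is an assumption, only there for reproducibility
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# `inputs` is the [8, 4] batch of token IDs from the data loader above
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
print(token_embeddings)
```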
torch.Size([8, 4, 256])
tensor([[[ 0.4913, 1.1239, 1.4588, ..., -0.3995, -1.8735, -0.1445],
[ 0.4481, 0.2536, -0.2655, ..., 0.4997, -1.1991, -1.1844],
[-0.2507, -0.0546, 0.6687, ..., 0.9618, 2.3737, -0.0528],
[ 0.9457, 0.8657, 1.6191, ..., -0.4544, -0.7460, 0.3483]],
[[ 1.5460, 1.7368, -0.7848, ..., -0.1004, 0.8584, -0.3421],
[-1.8622, -0.1914, -0.3812, ..., 1.1220, -0.3496, 0.6091],
[ 1.9847, -0.6483, -0.1415, ..., -0.3841, -0.9355, 1.4478],
[ 0.9647, 1.2974, -1.6207, ..., 1.1463, 1.5797, 0.3969]],
[[-0.7713, 0.6572, 0.1663, ..., -0.8044, 0.0542, 0.7426],
[ 0.8046, 0.5047, 1.2922, ..., 1.4648, 0.4097, 0.3205],
[ 0.0795, -1.7636, 0.5750, ..., 2.1823, 1.8231, -0.3635],
[ 0.4267, -0.0647, 0.5686, ..., -0.5209, 1.3065, 0.8473]],
...,
[[-1.6156, 0.9610, -2.6437, ..., -0.9645, 1.0888, 1.6383],
[-0.3985, -0.9235, -1.3163, ..., -1.1582, -1.1314, 0.9747],
[ 0.6089, 0.5329, 0.1980, ..., -0.6333, -1.1023, 1.6292],
[ 0.3677, -0.1701, -1.3787, ..., 0.7048, 0.5028, -0.0573]],
[[-0.1279, 0.6154, 1.7173, ..., 0.3789, -0.4752, 1.5258],
[ 0.4861, -1.7105, 0.4416, ..., 0.1475, -1.8394, 1.8755],
[-0.9573, 0.7007, 1.3579, ..., 1.9378, -1.9052, -1.1816],
[ 0.2002, -0.7605, -1.5170, ..., -0.0305, -0.3656, -0.1398]],
[[-0.9573, 0.7007, 1.3579, ..., 1.9378, -1.9052, -1.1816],
[-0.0632, -0.6548, -1.0296, ..., -0.9538, -0.5026, -0.1128],
[ 0.6032, 0.8983, 2.0722, ..., 1.5242, 0.2030, -0.3002],
[ 1.1274, -0.1082, -0.2195, ..., 0.5059, -1.8138, -0.0700]]],
grad_fn=<EmbeddingBackward0>)
↑ What is happening?
The 8 × 4 × 256–dimensional tensor output shows that each token ID is now embedded as a 256-dimensional vector. Btw, this [8, 4, 256] shape only refers to the first batch, not to the whole raw_text. If we look at the first of the eight sequences in this output, we see the following (next cell): each row in that sample represents one of the 4 tokens, and each number in a row (i.e., each column component) is one component of the 256-dimensional embedding vector for that token.
tensor([[ 0.4913, 1.1239, 1.4588, ..., -0.3995, -1.8735, -0.1445],
[ 0.4481, 0.2536, -0.2655, ..., 0.4997, -1.1991, -1.1844],
[-0.2507, -0.0546, 0.6687, ..., 0.9618, 2.3737, -0.0528],
[ 0.9457, 0.8657, 1.6191, ..., -0.4544, -0.7460, 0.3483]],
grad_fn=<SliceBackward0>)
For a GPT model’s absolute embedding approach (since that’s what GPT uses), we just need to create another embedding layer that has the same embedding dimension as the token_embedding_layer:
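A minimal sketch, continuing from the token embedding code above and assuming the context length equals max_length=4:

```python
# One learnable 256-dimensional vector per position in the context window
context_length = 4   # same as max_length used by the data loader
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Embed the positions 0, 1, 2, 3
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings)
print(pos_embeddings.shape)
```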
tensor([[ 1.7375, -0.5620, -0.6303, ..., -0.2277, 1.5748, 1.0345],
[ 1.6423, -0.7201, 0.2062, ..., 0.4118, 0.1498, -0.4628],
[-0.4651, -0.7757, 0.5806, ..., 1.4335, -0.4963, 0.8579],
[-0.6754, -0.4628, 1.4323, ..., 0.8139, -0.7088, 0.4827]],
grad_fn=<EmbeddingBackward0>)
torch.Size([4, 256])
↑ As I can see, the positional embedding tensor consists of four 256-dimensional vectors. I can now add these directly to the token embeddings, where PyTorch will add the 4 × 256–dimensional pos_embeddings tensor to each 4 × 256–dimensional token embedding tensor in each of the eight sequences in each batch:
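Continuing the sketch:

```python
# Broadcasting adds the [4, 256] positional embeddings to each of the
# eight [4, 256] token-embedding slices in the [8, 4, 256] batch
input_embeddings = token_embeddings + pos_embeddings
print("Input embeddings:\n", input_embeddings)
print("Shape:", input_embeddings.shape)
```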
Input embeddings:
tensor([[[ 2.2288, 0.5619, 0.8286, ..., -0.6272, -0.2987, 0.8900],
[ 2.0903, -0.4664, -0.0593, ..., 0.9115, -1.0493, -1.6473],
[-0.7158, -0.8304, 1.2494, ..., 2.3952, 1.8773, 0.8051],
[ 0.2703, 0.4029, 3.0514, ..., 0.3595, -1.4548, 0.8310]],
[[ 3.2835, 1.1749, -1.4150, ..., -0.3281, 2.4332, 0.6924],
[-0.2199, -0.9114, -0.1750, ..., 1.5337, -0.1998, 0.1462],
[ 1.5197, -1.4240, 0.4391, ..., 1.0494, -1.4318, 2.3057],
[ 0.2893, 0.8346, -0.1884, ..., 1.9602, 0.8709, 0.8796]],
[[ 0.9662, 0.0952, -0.4640, ..., -1.0320, 1.6290, 1.7771],
[ 2.4468, -0.2154, 1.4984, ..., 1.8766, 0.5595, -0.1423],
[-0.3856, -2.5393, 1.1556, ..., 3.6157, 1.3267, 0.4944],
[-0.2487, -0.5275, 2.0009, ..., 0.2930, 0.5977, 1.3300]],
...,
[[ 0.1219, 0.3991, -3.2740, ..., -1.1921, 2.6637, 2.6728],
[ 1.2438, -1.6436, -1.1101, ..., -0.7464, -0.9816, 0.5118],
[ 0.1439, -0.2428, 0.7786, ..., 0.8001, -1.5986, 2.4871],
[-0.3077, -0.6329, 0.0536, ..., 1.5188, -0.2060, 0.4254]],
[[ 1.6095, 0.0535, 1.0871, ..., 0.1512, 1.0996, 2.5603],
[ 2.1284, -2.4306, 0.6478, ..., 0.5593, -1.6896, 1.4126],
[-1.4224, -0.0750, 1.9386, ..., 3.3712, -2.4016, -0.3237],
[-0.4752, -1.2234, -0.0847, ..., 0.7834, -1.0744, 0.3429]],
[[ 0.7802, 0.1387, 0.7277, ..., 1.7101, -0.3304, -0.1471],
[ 1.5791, -1.3749, -0.8234, ..., -0.5420, -0.3528, -0.5756],
[ 0.1382, 0.1226, 2.6528, ..., 2.9576, -0.2933, 0.5577],
[ 0.4520, -0.5711, 1.2128, ..., 1.3198, -2.5226, 0.4127]]],
grad_fn=<AddBackward0>)
Shape: torch.Size([8, 4, 256])
↑ Important context:
The shape of one batch of token_embeddings is [8, 4, 256], but the shape of pos_embeddings is [4, 256]. How does PyTorch add these tensors of different dimensions?
⇒ With broadcasting.
https://pytorch.org/docs/stable/notes/broadcasting.html
I also wrote some detailed broadcasting examples here.
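A tiny standalone illustration of the same broadcasting rule:

```python
import torch

a = torch.ones(8, 4, 256)   # stand-in for token_embeddings
b = torch.zeros(4, 256)     # stand-in for pos_embeddings

# b is treated as if its shape were [1, 4, 256] and repeated along the batch dim
print((a + b).shape)        # torch.Size([8, 4, 256])
```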
Finally, one last reminder to myself: PyTorch is handling embeddings for all the batches across the full length of raw_text here, even though I’m only seeing one batch printed out with shape [8, 4, 256].
Looking Back
At a high level, this is what I did in this notebook: tokenized the raw text, converted tokens into token IDs, added special context tokens, switched to byte pair encoding with tiktoken, sampled input–target pairs with a sliding window, created token embeddings, and encoded word positions.
fin.