Suppose we have a sentence S = "A boy is not a cat." First, S is split into a bunch of tokens (a token is a word or a sub-word or whatever). Call the tokenization of S $T$, where $T$ is a sequence of string tokens. Then we have a vocabulary that each string token is part of, so we use the vocabulary to turn each string into an index corresponding to that word in the vocab; call that sequence of indices $I$. Then, we one-hot encode each index into a vector and collect all those vectors as the columns of a matrix $X_{oh}$.
Now, suppose our vocab size is $N_v$. Then a matrix (called an embedding matrix) of size $(D_{emb}, N_v)$ (call it $E$) is multiplied with $X_{oh}$, where $X_{oh}$ has size $(N_v, \text{seq\_len})$. Therefore our embedded matrix is $X = E X_{oh}$ and has size/shape $(D_{emb}, \text{seq\_len})$. Effectively, $E$ takes each word's one-hot vector and embeds it into a space of dimension $D_{emb}$. The embedding matrix is also part of the learnable parameters of the model.
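To make the shapes concrete, here's a minimal NumPy sketch of this pipeline. The whitespace tokenizer, the tiny vocabulary, and the random $E$ are all made up for illustration (a real model uses a sub-word tokenizer and a learnt embedding matrix):

```python
import numpy as np

# Toy vocabulary and whitespace "tokenizer" -- purely illustrative;
# real models use sub-word tokenizers (e.g. BPE) and a much larger vocab.
vocab = ["a", "boy", "is", "not", "cat", "."]
N_v = len(vocab)          # vocab size
D_emb = 8                 # embedding dimension (tiny, just for the demo)

# lower-cased, with the period split off, to keep the toy tokenizer trivial
S = "a boy is not a cat ."
T = S.split()                                  # sequence of string tokens
I = [vocab.index(t) for t in T]                # token indices into the vocab
seq_len = len(I)

# One-hot encode each index; collect the one-hot vectors as COLUMNS.
X_oh = np.zeros((N_v, seq_len))
X_oh[I, np.arange(seq_len)] = 1.0              # shape (N_v, seq_len)

# Embedding matrix (learnable in a real model, random here).
E = np.random.randn(D_emb, N_v)

X = E @ X_oh                                   # embedded matrix, shape (D_emb, seq_len)
print(X.shape)                                 # (8, 7)

# Multiplying by a one-hot matrix is just a column lookup:
assert np.allclose(X, E[:, I])
```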
A learnt embedding might associate certain semantics with directions in the embedding space; hence words with similar semantics might have a high cosine similarity.
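As a quick illustration of what "high cosine similarity" means, here is the formula in code (the vectors below are made up; nothing here is a trained embedding):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With a trained embedding matrix E and a vocab list as above, one would hope that
# cosine_similarity(E[:, vocab.index("boy")], E[:, vocab.index("cat")]) comes out
# higher for semantically related word pairs than for unrelated ones.
u, v = np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.1])
print(cosine_similarity(u, v))   # close to 1: nearly the same direction
```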
Notice that in practice, the sequence length has a cap, called the context size $c$, so bigger sequences are broken up into $c$-length chunks. So in practice $X$ has shape $(D_{emb}, c)$; for GPT-2, $c = 1024$.
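A sketch of that chunking with a fake token-index sequence (only the value $c = 1024$ is GPT-2's; everything else is made up):

```python
import numpy as np

c = 1024                                   # context size (GPT-2 uses 1024)
tokens = np.arange(3000)                   # pretend token-index sequence, longer than c

# Break the sequence into chunks of length at most c.
chunks = [tokens[i:i + c] for i in range(0, len(tokens), c)]
print([len(ch) for ch in chunks])          # [1024, 1024, 952]
```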
Now, intuitively, this embedding matrix just takes each "token" and independently assigns a reasonable vector to it (at least after training). However, in language, context is king: the meaning of a word is heavily dependent on the sentence and context it's being used in, so we want to allow the model to learn the inter-relationships between the embedding vectors.
Attention pattern
Consider a chosen dimension $d$. Then consider the "query matrix" $W_Q$ of shape $(d, D_{emb})$.
So it takes in an embedded vector and spits out a vector of $d$ dims, which is essentially "asking something" about that embedded vector. And again, all of these are just vectors with real entries, and it's hard to interpret exactly what the query vector is asking, so it's more of an architectural hope.
So for an embedded vector $X_j$ (the $j$th column of $X$), the query is $q_j = W_Q X_j$.
Similarly, a "key matrix" $W_K$ of shape $(d, D_{emb})$ tries to "answer something or hold some property" about an embedded vector, giving keys $k_i = W_K X_i$.
So now, the dot product of $k_i$ and $q_j$ is hoped to encode (in its magnitude and sign) how relevant some property of token $i$ is to some question about token $j$. A high positive dot product means that there is a high correlation between some property of word $i$ and some question posed by word $j$; a smaller positive value indicates a weaker correlation, and a large negative value might indicate an opposite correlation. (Again, all just hope and vectors.)
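To see the shapes, here's a minimal sketch of a single query/key dot product under the conventions above; the sizes and the random $X$, $W_Q$, $W_K$ are placeholders, not trained weights:

```python
import numpy as np

D_emb, d, c = 8, 4, 7                      # embedding dim, head dim, sequence length
X = np.random.randn(D_emb, c)              # embedded tokens as columns

W_Q = np.random.randn(d, D_emb)            # query matrix
W_K = np.random.randn(d, D_emb)            # key matrix

j, i = 5, 1                                # token j poses a question, token i offers a key
q_j = W_Q @ X[:, j]                        # question posed by token j, shape (d,)
k_i = W_K @ X[:, i]                        # property held by token i, shape (d,)

score = k_i @ q_j                          # how relevant token i's key is to token j's query
print(score)
```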
Now, we would like to compute these dot products between each pair of words.
To do this, we simply use matmuls: $Q = W_Q X$ and $K = W_K X$, and we also want $V = W_V X$ (where $W_V$ is the "value matrix", which just assigns some value vector to each embedded vector).
Then, our "attention matrix" / attention pattern is given by
$$A = V \,\operatorname{softmax}\!\left(\frac{K^\top Q}{\sqrt{d}}\right),$$
with the softmax applied to each column.
Note that in practice, $W_V$ is a product of two matrices $W_a W_b$, where $W_a$ has shape $(D_{emb}, d)$ and $W_b$ has shape $(d, D_{emb})$. This forces a low-rank linear map.
So $W_V = W_a W_b$, and $V = W_V X$ has shape $(D_{emb}, c)$.
In practice, a single sequence of text is broken down into prefixes $t_1, \dots, t_j$, and we predict $t_{j+1}$ for each $j$ during training. Hence, it would be cheating to answer a query of an earlier word in the sequence by looking at the key of a later word; that is, all queries must be answered by keys from words that appear at or before them.
So we want the weight that query $j$ puts on key $i$ to be zero whenever $i > j$. In practice, we compute $\frac{K^\top Q}{\sqrt{d}}$, set the strict lower triangle of this matrix (which has shape $(c, c)$) to $-\infty$, then compute the softmax column-wise, then multiply by $V$ to get the final attention pattern $A$ (of shape $(D_{emb}, c)$).
This computation is called a head of self attention.
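Putting the pieces together, here's a sketch of one head of masked self-attention in the column-vector convention used above, with $W_V$ factored as $W_a W_b$ as described. Everything is random/made up; it's only meant to check the shapes and the masking logic:

```python
import numpy as np

def softmax_columns(M):
    # Softmax over each column (each column corresponds to one query).
    M = M - M.max(axis=0, keepdims=True)   # for numerical stability
    e = np.exp(M)
    return e / e.sum(axis=0, keepdims=True)

def attention_head(X, W_Q, W_K, W_a, W_b):
    D_emb, c = X.shape
    d = W_Q.shape[0]

    Q = W_Q @ X                            # (d, c)      queries, one per column of X
    K = W_K @ X                            # (d, c)      keys
    V = (W_a @ W_b) @ X                    # (D_emb, c)  values, low-rank W_V = W_a W_b

    scores = K.T @ Q / np.sqrt(d)          # (c, c), entry (i, j) = k_i . q_j / sqrt(d)

    # Causal mask: query j must not attend to keys i > j,
    # so set the strict lower triangle to -inf before the softmax.
    mask = np.tril(np.ones((c, c)), k=-1).astype(bool)
    scores[mask] = -np.inf

    A = V @ softmax_columns(scores)        # (D_emb, c)  final attention pattern
    return A

D_emb, d, c = 8, 4, 7
X = np.random.randn(D_emb, c)
W_Q, W_K = np.random.randn(d, D_emb), np.random.randn(d, D_emb)
W_a, W_b = np.random.randn(D_emb, d), np.random.randn(d, D_emb)
print(attention_head(X, W_Q, W_K, W_a, W_b).shape)   # (8, 7)
```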
There is also cross attention, where the keys (and values) are computed from one sequence (say French text) and the queries are computed from another (say English), for translation. (We may not use masking here.)
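A sketch of cross attention under the same conventions: the keys and values come from one (made-up) embedded sequence, the queries from another, and there is no causal mask:

```python
import numpy as np

D_emb, d = 8, 4
X_en = np.random.randn(D_emb, 5)           # embedded English tokens (queries come from here)
X_fr = np.random.randn(D_emb, 6)           # embedded French tokens (keys/values come from here)

W_Q, W_K = np.random.randn(d, D_emb), np.random.randn(d, D_emb)
W_a, W_b = np.random.randn(D_emb, d), np.random.randn(d, D_emb)

Q = W_Q @ X_en                             # (d, 5)
K = W_K @ X_fr                             # (d, 6)
V = (W_a @ W_b) @ X_fr                     # (D_emb, 6)

scores = K.T @ Q / np.sqrt(d)              # (6, 5): no causal mask in cross attention
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)
out = V @ weights                          # (D_emb, 5): one output per English token
print(out.shape)
```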
Parameter calculation so far: we have an embedding matrix $E$ of shape $(D_{emb}, N_v)$, query and key matrices $W_Q, W_K$ of shape $(d, D_{emb})$, and $W_a, W_b$ of shape $(D_{emb}, d)$ and $(d, D_{emb})$.
So the total number of parameters for one attention head is
$$\underbrace{d\,D_{emb}}_{W_Q} + \underbrace{d\,D_{emb}}_{W_K} + \underbrace{D_{emb}\,d}_{W_a} + \underbrace{d\,D_{emb}}_{W_b} = 4\,d\,D_{emb}$$
(the embedding matrix adds another $D_{emb} N_v$, but it is shared by the whole model rather than being per-head).
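As a concrete worked example, plugging in GPT-2-small-ish sizes ($D_{emb} = 768$, head dimension $d = 64$, $N_v = 50257$), used here purely for illustration:

```python
D_emb, d, N_v = 768, 64, 50257   # GPT-2-small-like sizes, used only as an example

per_head = (d * D_emb            # W_Q
            + d * D_emb          # W_K
            + D_emb * d          # W_a (value-up)
            + d * D_emb)         # W_b (value-down)
embedding = D_emb * N_v          # shared across the model, counted once

print(per_head)                  # 196608 = 4 * d * D_emb
print(embedding)                 # 38597376
```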
Now, we generally run $n$ heads of attention in parallel: we embed an input, and then run that embedded matrix through multiple heads of attention in parallel. Another implementation detail: remember that for each head $j$, we only keep the value-down matrix $W_b^{(j)}$ inside the head, and all the $W_a^{(j)}$ of the attention heads are stapled together into what is called the "output matrix": we collect all of the $W_a^{(j)}$ (for each $j$)
and staple them together,
$$W_O = \begin{bmatrix} W_a^{(1)} & W_a^{(2)} & \cdots & W_a^{(n)} \end{bmatrix},$$
and then stack the $n$ per-head outputs on top of each other,
and finally the intermediate output of the multi-headed attention is $W_O$ applied to that stack. There is an obvious shape issue in this modification; I'll deal with that later.
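Here's a sketch of that bookkeeping, assuming each head's output is the $(d, c)$ matrix produced before the value-up map (random weights throughout, so it only demonstrates the wiring):

```python
import numpy as np

def head_output(X, W_Q, W_K, W_b, d):
    # One head, but WITHOUT the value-up map W_a: output has shape (d, c).
    c = X.shape[1]
    Q, K, V_down = W_Q @ X, W_K @ X, W_b @ X
    scores = K.T @ Q / np.sqrt(d)
    scores[np.tril(np.ones((c, c)), k=-1).astype(bool)] = -np.inf   # causal mask
    scores = scores - scores.max(axis=0, keepdims=True)             # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return V_down @ weights                                          # (d, c)

D_emb, d, c, n_heads = 8, 4, 7, 3
X = np.random.randn(D_emb, c)
heads = [dict(W_Q=np.random.randn(d, D_emb), W_K=np.random.randn(d, D_emb),
              W_a=np.random.randn(D_emb, d), W_b=np.random.randn(d, D_emb))
         for _ in range(n_heads)]

# Output matrix: all the per-head W_a's stapled together side by side.
W_O = np.concatenate([h["W_a"] for h in heads], axis=1)        # (D_emb, n_heads * d)

# Stack each head's (d, c) output on top of each other ...
stacked = np.concatenate([head_output(X, h["W_Q"], h["W_K"], h["W_b"], d)
                          for h in heads], axis=0)             # (n_heads * d, c)

# ... and map back to embedding space with the output matrix.
out = W_O @ stacked                                            # (D_emb, c)
print(out.shape)                                               # (8, 7)
```

Note that $W_O$ applied to the stacked outputs equals $\sum_j W_a^{(j)}$ times head $j$'s output, which is how the $(d, c)$ per-head outputs end up back in shape $(D_{emb}, c)$.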
And then we pass the result through some MLPs and layer norms, and then use a softmax, and so on.