Suppose we have a sentence S = "A boy is not a cat." First, S is split into a bunch of tokens (a token is a word or a sub-word or whatever). Call the tokenization of S $T$, where $T$ is a sequence of string tokens. Then we have a vocabulary that each string token is part of, so we use the vocabulary to turn each string into an index corresponding to that word in the vocab; call that sequence of indices $I$. Then, we one-hot encode each index into a vector and collect all those vectors as the columns of a matrix $X_{oh}$.
Now, suppose our vocab size is $N_v$. Then a matrix (called an embedding matrix) of size $(D_{emb}, N_v)$ (call it $E$) is multiplied with $X_{oh}$, where $X_{oh}$ has size $(N_v, \text{seq\_len})$. Therefore our embedded matrix is $X = E X_{oh}$ and has size/shape $(D_{emb}, \text{seq\_len})$. Effectively, $E$ takes each word's one-hot vector and embeds it into a space of dimension $D_{emb}$. The embedding matrix is also part of the learnable parameters of the model.
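To make the shapes concrete, here's a minimal NumPy sketch of this pipeline. The whitespace tokenizer, the tiny vocabulary, and the random $E$ are all made up for illustration (a real model uses a sub-word tokenizer and a learnt embedding matrix):

```python
import numpy as np

# Toy vocabulary and whitespace "tokenizer" -- purely illustrative;
# real models use sub-word tokenizers (e.g. BPE) and a much larger vocab.
vocab = ["a", "boy", "is", "not", "cat", "."]
N_v = len(vocab)          # vocab size
D_emb = 8                 # embedding dimension (tiny, just for the demo)

# lower-cased, with the period split off, to keep the toy tokenizer trivial
S = "a boy is not a cat ."
T = S.split()                                  # sequence of string tokens
I = [vocab.index(t) for t in T]                # token indices into the vocab
seq_len = len(I)

# One-hot encode each index; collect the one-hot vectors as COLUMNS.
X_oh = np.zeros((N_v, seq_len))
X_oh[I, np.arange(seq_len)] = 1.0              # shape (N_v, seq_len)

# Embedding matrix (learnable in a real model, random here).
E = np.random.randn(D_emb, N_v)

X = E @ X_oh                                   # embedded matrix, shape (D_emb, seq_len)
print(X.shape)                                 # (8, 7)

# Multiplying by a one-hot matrix is just a column lookup:
assert np.allclose(X, E[:, I])
```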
A learnt embedding might associate certain semantics with directions in the embedding space; hence words with similar semantics might have a high cosine similarity.
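As a quick illustration of what "high cosine similarity" means, here is the formula in code (the vectors below are made up; nothing here is a trained embedding):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With a trained embedding matrix E and a vocab list as above, one would hope that
# cosine_similarity(E[:, vocab.index("boy")], E[:, vocab.index("cat")]) comes out
# higher for semantically related word pairs than for unrelated ones.
u, v = np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.1])
print(cosine_similarity(u, v))   # close to 1: nearly the same direction
```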
Notice that in practice, the sequence length has a cap, called the context size $c$, so bigger sequences are broken up into $c$-length chunks. So in practice $X$ has shape $(D_{emb}, c)$; for GPT-2, $c = 1024$.
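A sketch of that chunking with a fake token-index sequence (only the value $c = 1024$ is GPT-2's; everything else is made up):

```python
import numpy as np

c = 1024                                   # context size (GPT-2 uses 1024)
tokens = np.arange(3000)                   # pretend token-index sequence, longer than c

# Break the sequence into chunks of length at most c.
chunks = [tokens[i:i + c] for i in range(0, len(tokens), c)]
print([len(ch) for ch in chunks])          # [1024, 1024, 952]
```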
Now, intuitively, this embedding matrix just takes each "token" and independently assigns a reasonable vector to it (at least after training). However, in language, context is king: the meaning of a word is heavily dependent on the sentence and context it's being used in, so we want to allow the model to learn the inter-relationships between the embedding vectors.
Attention pattern
Consider a chosen dimension $d$. Then consider the "query matrix" $W_Q$ of shape $(d, D_{emb})$.
So it takes in an embedded vector and spits out a vector of $d$ dims, which is essentially "asking something" about that embedded vector. And again, all of these are just vectors with real entries, and it's hard to interpret exactly what the query vector is asking, so it's more of an architectural hope.
So for an embedded vector $X_j$ (the $j$th column of $X$), the query is $q_j = W_Q X_j$.
Similarly, a "key matrix" $W_K$ of shape $(d, D_{emb})$ tries to "answer something or hold some property" about an embedded vector, giving keys $k_i = W_K X_i$.
So now, the dot product of $k_i$ and $q_j$ is hoped to encode (in its magnitude and sign) how relevant some property of token $i$ is to some question about token $j$. A high positive dot product means that there is a high correlation between some property of word $i$ and some question posed by word $j$; a smaller positive value indicates a weaker correlation, and a large negative value might indicate an opposite correlation. (Again, all just hope and vectors.)
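To see the shapes, here's a minimal sketch of a single query/key dot product under the conventions above; the sizes and the random $X$, $W_Q$, $W_K$ are placeholders, not trained weights:

```python
import numpy as np

D_emb, d, c = 8, 4, 7                      # embedding dim, head dim, sequence length
X = np.random.randn(D_emb, c)              # embedded tokens as columns

W_Q = np.random.randn(d, D_emb)            # query matrix
W_K = np.random.randn(d, D_emb)            # key matrix

j, i = 5, 1                                # token j poses a question, token i offers a key
q_j = W_Q @ X[:, j]                        # question posed by token j, shape (d,)
k_i = W_K @ X[:, i]                        # property held by token i, shape (d,)

score = k_i @ q_j                          # how relevant token i's key is to token j's query
print(score)
```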
Now, we would like to compute these dot products between each pair of words.
To do this, we simply use matmuls: $Q = W_Q X$ and $K = W_K X$, and we also want $V = W_V X$ (where $W_V$ is the "value matrix", which just assigns some value vector to each embedded vector).
Then, our "attention matrix" / attention pattern is given by
$$A = V \,\operatorname{softmax}\!\left(\frac{K^\top Q}{\sqrt{d}}\right),$$
with the softmax applied to each column.
Note that in practice, $W_V$ is a product of two matrices $W_a W_b$, where $W_a$ has shape $(D_{emb}, d)$ and $W_b$ has shape $(d, D_{emb})$. This forces a low-rank linear map.
So $W_V = W_a W_b$, and $V = W_V X$ has shape $(D_{emb}, c)$.
In practice, a single sequence of text is broken down into prefixes $t_1, \dots, t_j$, and we predict $t_{j+1}$ for each $j$ during training. Hence, it would be cheating to answer a query of an earlier word in the sequence by looking at the key of a later word; that is, all queries must be answered by keys from words that appear at or before them.
So we want the weight that query $j$ puts on key $i$ to be zero whenever $i > j$. In practice, we compute $\frac{K^\top Q}{\sqrt{d}}$, set the strict lower triangle of this matrix (which has shape $(c, c)$) to $-\infty$, then compute the softmax column-wise, then multiply by $V$ to get the final attention pattern $A$ (of shape $(D_{emb}, c)$).
This computation is called a head of self attention.
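Putting the pieces together, here's a sketch of one head of masked self-attention in the column-vector convention used above, with $W_V$ factored as $W_a W_b$ as described. Everything is random/made up; it's only meant to check the shapes and the masking logic:

```python
import numpy as np

def softmax_columns(M):
    # Softmax over each column (each column corresponds to one query).
    M = M - M.max(axis=0, keepdims=True)   # for numerical stability
    e = np.exp(M)
    return e / e.sum(axis=0, keepdims=True)

def attention_head(X, W_Q, W_K, W_a, W_b):
    D_emb, c = X.shape
    d = W_Q.shape[0]

    Q = W_Q @ X                            # (d, c)      queries, one per column of X
    K = W_K @ X                            # (d, c)      keys
    V = (W_a @ W_b) @ X                    # (D_emb, c)  values, low-rank W_V = W_a W_b

    scores = K.T @ Q / np.sqrt(d)          # (c, c), entry (i, j) = k_i . q_j / sqrt(d)

    # Causal mask: query j must not attend to keys i > j,
    # so set the strict lower triangle to -inf before the softmax.
    mask = np.tril(np.ones((c, c)), k=-1).astype(bool)
    scores[mask] = -np.inf

    A = V @ softmax_columns(scores)        # (D_emb, c)  final attention pattern
    return A

D_emb, d, c = 8, 4, 7
X = np.random.randn(D_emb, c)
W_Q, W_K = np.random.randn(d, D_emb), np.random.randn(d, D_emb)
W_a, W_b = np.random.randn(D_emb, d), np.random.randn(d, D_emb)
print(attention_head(X, W_Q, W_K, W_a, W_b).shape)   # (8, 7)
```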
There is also cross attention, where the keys (and values) are computed from one sequence (say French text) and the queries are computed from another (say English), for translation. (We may not use masking here.)
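A sketch of cross attention under the same conventions: the keys and values come from one (made-up) embedded sequence, the queries from another, and there is no causal mask:

```python
import numpy as np

D_emb, d = 8, 4
X_en = np.random.randn(D_emb, 5)           # embedded English tokens (queries come from here)
X_fr = np.random.randn(D_emb, 6)           # embedded French tokens (keys/values come from here)

W_Q, W_K = np.random.randn(d, D_emb), np.random.randn(d, D_emb)
W_a, W_b = np.random.randn(D_emb, d), np.random.randn(d, D_emb)

Q = W_Q @ X_en                             # (d, 5)
K = W_K @ X_fr                             # (d, 6)
V = (W_a @ W_b) @ X_fr                     # (D_emb, 6)

scores = K.T @ Q / np.sqrt(d)              # (6, 5): no causal mask in cross attention
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)
out = V @ weights                          # (D_emb, 5): one output per English token
print(out.shape)
```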
Parameter calculation so far: we have an embedding matrix $E$ of shape $(D_{emb}, N_v)$, query and key matrices $W_Q, W_K$ of shape $(d, D_{emb})$, and $W_a, W_b$ of shape $(D_{emb}, d)$ and $(d, D_{emb})$.
So the total number of parameters for one attention head is
$$\underbrace{d\,D_{emb}}_{W_Q} + \underbrace{d\,D_{emb}}_{W_K} + \underbrace{D_{emb}\,d}_{W_a} + \underbrace{d\,D_{emb}}_{W_b} = 4\,d\,D_{emb}$$
(the embedding matrix adds another $D_{emb} N_v$, but it is shared by the whole model rather than being per-head).
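As a concrete worked example, plugging in GPT-2-small-ish sizes ($D_{emb} = 768$, head dimension $d = 64$, $N_v = 50257$), used here purely for illustration:

```python
D_emb, d, N_v = 768, 64, 50257   # GPT-2-small-like sizes, used only as an example

per_head = (d * D_emb            # W_Q
            + d * D_emb          # W_K
            + D_emb * d          # W_a (value-up)
            + d * D_emb)         # W_b (value-down)
embedding = D_emb * N_v          # shared across the model, counted once

print(per_head)                  # 196608 = 4 * d * D_emb
print(embedding)                 # 38597376
```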
Now, we generally run $n$ heads of attention in parallel: we embed an input, and then run that embedded matrix through multiple heads of attention in parallel. Another implementation detail: remember that for each head $j$, we only keep the value-down matrix $W_b^{(j)}$ inside the head, and all the $W_a^{(j)}$ of the attention heads are stapled together into what is called the "output matrix": we collect all of the $W_a^{(j)}$ (for each $j$)
and staple them together,
$$W_O = \begin{bmatrix} W_a^{(1)} & W_a^{(2)} & \cdots & W_a^{(n)} \end{bmatrix},$$
and then stack the $n$ per-head outputs on top of each other,
and finally the intermediate output of the multi-headed attention is $W_O$ applied to that stack. There is an obvious shape issue in this modification; I'll deal with that later.
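Here's a sketch of that bookkeeping, assuming each head's output is the $(d, c)$ matrix produced before the value-up map (random weights throughout, so it only demonstrates the wiring):

```python
import numpy as np

def head_output(X, W_Q, W_K, W_b, d):
    # One head, but WITHOUT the value-up map W_a: output has shape (d, c).
    c = X.shape[1]
    Q, K, V_down = W_Q @ X, W_K @ X, W_b @ X
    scores = K.T @ Q / np.sqrt(d)
    scores[np.tril(np.ones((c, c)), k=-1).astype(bool)] = -np.inf   # causal mask
    scores = scores - scores.max(axis=0, keepdims=True)             # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return V_down @ weights                                          # (d, c)

D_emb, d, c, n_heads = 8, 4, 7, 3
X = np.random.randn(D_emb, c)
heads = [dict(W_Q=np.random.randn(d, D_emb), W_K=np.random.randn(d, D_emb),
              W_a=np.random.randn(D_emb, d), W_b=np.random.randn(d, D_emb))
         for _ in range(n_heads)]

# Output matrix: all the per-head W_a's stapled together side by side.
W_O = np.concatenate([h["W_a"] for h in heads], axis=1)        # (D_emb, n_heads * d)

# Stack each head's (d, c) output on top of each other ...
stacked = np.concatenate([head_output(X, h["W_Q"], h["W_K"], h["W_b"], d)
                          for h in heads], axis=0)             # (n_heads * d, c)

# ... and map back to embedding space with the output matrix.
out = W_O @ stacked                                            # (D_emb, c)
print(out.shape)                                               # (8, 7)
```

Note that $W_O$ applied to the stacked outputs equals $\sum_j W_a^{(j)}$ times head $j$'s output, which is how the $(d, c)$ per-head outputs end up back in shape $(D_{emb}, c)$.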
And then we pass the result through some MLPs and layer norms, and then use a softmax, and so on.