<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://bsbarkur.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://bsbarkur.github.io/" rel="alternate" type="text/html" /><updated>2026-04-26T17:13:46+00:00</updated><id>https://bsbarkur.github.io/feed.xml</id><title type="html">Bharat / ಭರತ್</title><subtitle>Musings on tech and other stuff.</subtitle><author><name>Bharat</name></author><entry><title type="html">Prompt Engineering as Hyperparameter Search: How DSPy Uses Optuna</title><link href="https://bsbarkur.github.io/2026/04/26/dspy-optuna-prompt-optimization.html" rel="alternate" type="text/html" title="Prompt Engineering as Hyperparameter Search: How DSPy Uses Optuna" /><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2026/04/26/dspy-optuna-prompt-optimization</id><content type="html" xml:base="https://bsbarkur.github.io/2026/04/26/dspy-optuna-prompt-optimization.html"><![CDATA[<p>Most people who pick up DSPy stop at <code class="language-plaintext highlighter-rouge">BootstrapFewShot</code>. It works, it runs fast, it spits out a compiled program with reasonable few-shot demos, and the diminishing-returns instinct kicks in. Why keep tuning?</p>

<p>Because two of DSPy’s other optimizers, <code class="language-plaintext highlighter-rouge">BootstrapFewShotWithOptuna</code> and <code class="language-plaintext highlighter-rouge">MIPROv2</code>, quietly do something more interesting. They reframe prompt engineering as something that should feel familiar to anyone who has ever tuned an XGBoost model: black-box hyperparameter optimization.</p>

<p>Once you see it that way, a lot of folklore around prompts starts looking like an unprincipled grid search you’ve been running in your head.</p>

<h2 id="what-is-actually-being-optimized">What is actually being optimized</h2>

<p>When DSPy “compiles” a program, the parameters it tunes aren’t model weights. They’re the artifacts that shape the prompt sent to the LM:</p>

<ul>
  <li>Few-shot demonstrations per predictor</li>
  <li>Instructions (the natural-language directive at the top of a prompt)</li>
  <li>The combination of those across multiple predictors in a pipeline</li>
</ul>

<p>The search space here is combinatorial and discrete. If you have 16 candidate demos and want to pick 4 per predictor across 3 predictors, you’re already staring at billions of configurations. Grid search is hopeless. Random search burns LM calls for not much in return. You want something smarter.</p>
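<p>A quick back-of-the-envelope check (assuming unordered selection of 4 demos from a shared pool of 16, chosen independently per predictor):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from math import comb

per_predictor = comb(16, 4)   # 1,820 ways to choose 4 demos from 16
total = per_predictor ** 3    # across 3 predictors
print(f"{total:.1e}")         # ~6.0e+09 configurations
</code></pre></div></div>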

<h2 id="optuna-and-tpe-in-one-paragraph">Optuna and TPE in one paragraph</h2>

<p>Optuna is a hyperparameter optimization framework. Its default sampler is TPE, the <span class="hovernote" tabindex="0"><a href="https://arxiv.org/abs/2304.11127">Tree-structured Parzen Estimator</a><span class="note">See Watanabe (2023), <em>Tree-Structured Parzen Estimator: Understanding Its Algorithm Components</em> for a careful walkthrough of TPE.</span></span>. Unlike Gaussian-process Bayesian optimization, which fits a single surrogate model over the objective, TPE fits two densities: <code class="language-plaintext highlighter-rouge">l(x)</code> over trials that scored well and <code class="language-plaintext highlighter-rouge">g(x)</code> over trials that scored poorly. It then samples new candidates where the ratio <code class="language-plaintext highlighter-rouge">l(x) / g(x)</code> is highest. The intuition is just: spend trials in regions that look like past winners, not regions that look like past losers. It’s cheap, it scales to high-dimensional categorical spaces, and it tolerates noisy objectives. All useful properties when your “objective” is an LM-graded metric that wobbles a bit between runs.</p>
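<p>If you haven’t used Optuna directly, the whole loop fits in a few lines. Here’s a minimal sketch over a toy categorical space – the lookup-table objective is a stand-in, but the sampler mechanics are exactly what DSPy drives:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import optuna

# Toy stand-in objective: pretend each (instruction, demo) combo has a fixed score.
SCORES = {("terse", 0): 0.62, ("terse", 1): 0.71,
          ("verbose", 0): 0.55, ("verbose", 1): 0.80}

def objective(trial):
    instruction = trial.suggest_categorical("instruction", ["terse", "verbose"])
    demo = trial.suggest_categorical("demo", [0, 1])
    return SCORES[(instruction, demo)]

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)  # TPE concentrates later trials near past winners
</code></pre></div></div>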

<h2 id="bootstrapfewshotwithoptuna-the-simpler-case">BootstrapFewShotWithOptuna: the simpler case</h2>

<p>The mechanic is straightforward:</p>

<ol>
  <li>Bootstrap N candidate few-shot demonstrations by running the unoptimized program over the trainset and keeping traces that satisfy the metric.</li>
  <li>For each predictor in the program, expose demo selection as a categorical Optuna parameter.</li>
  <li>Each Optuna trial samples one combination, runs the program against a validation set, and returns a score.</li>
  <li>TPE updates its belief over which combinations work and proposes the next trial.</li>
  <li>After <code class="language-plaintext highlighter-rouge">num_candidate_programs</code> trials, return the program that scored highest.</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dspy.teleprompt</span> <span class="kn">import</span> <span class="n">BootstrapFewShotWithOptuna</span>

<span class="n">tp</span> <span class="o">=</span> <span class="n">BootstrapFewShotWithOptuna</span><span class="p">(</span>
    <span class="n">metric</span><span class="o">=</span><span class="n">my_metric</span><span class="p">,</span>
    <span class="n">max_bootstrapped_demos</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
    <span class="n">num_candidate_programs</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>  <span class="c1"># Optuna trials
</span><span class="p">)</span>
<span class="n">compiled</span> <span class="o">=</span> <span class="n">tp</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">program</span><span class="p">,</span> <span class="n">trainset</span><span class="o">=</span><span class="n">train</span><span class="p">,</span> <span class="n">valset</span><span class="o">=</span><span class="n">val</span><span class="p">)</span>
</code></pre></div></div>

<p>Why does TPE beat random search here? Because demos interact. A demonstration that’s perfect for predictor A may quietly poison predictor B’s context window with off-distribution patterns. The reward surface is non-separable. TPE picks up on those interactions across trials without you having to model the joint distribution explicitly.</p>

<h2 id="miprov2-the-richer-case">MIPROv2: the richer case</h2>

<p>MIPROv2 (Multi-prompt Instruction Proposal Optimizer, v2) takes the next step. It optimizes both instructions and demos jointly. The pipeline:</p>

<ol>
  <li>Summarize the dataset and inspect the program’s code structure.</li>
  <li>Use an LM to propose candidate instructions grounded in that context.</li>
  <li>Bootstrap candidate demonstrations, same as before.</li>
  <li>Hand the joint space (<code class="language-plaintext highlighter-rouge">instruction_candidates × demo_candidates</code> per predictor) to Optuna.</li>
  <li>Run Optuna trials, each one evaluating a full configuration against the validation set.</li>
</ol>

<p>If you peek inside <code class="language-plaintext highlighter-rouge">dspy/teleprompt/mipro_optimizer_v2.py</code>, you’ll see the relevant bit:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sampler</span> <span class="o">=</span> <span class="n">optuna</span><span class="p">.</span><span class="n">samplers</span><span class="p">.</span><span class="n">TPESampler</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="n">seed</span><span class="p">,</span> <span class="n">multivariate</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">study</span> <span class="o">=</span> <span class="n">optuna</span><span class="p">.</span><span class="n">create_study</span><span class="p">(</span><span class="n">direction</span><span class="o">=</span><span class="s">"maximize"</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span><span class="p">)</span>
</code></pre></div></div>

<p>Two details there are worth pausing on.</p>

<p>First, <code class="language-plaintext highlighter-rouge">multivariate=True</code>. Standard TPE models each parameter independently. Multivariate TPE models correlations between parameters. For MIPROv2 that matters a lot: the instruction you choose changes which demos work best with it, and the other way round. Independent sampling would miss those interactions. Multivariate captures them.</p>

<p>Second, the candidate instructions are LM-proposed, not human-written. So MIPROv2 is essentially using an LM to expand the search space, then using TPE to navigate it. That’s a tidy decomposition. Generative breadth from the LM, sample-efficient search from Optuna.</p>
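<p>To see the shape of that joint space, here’s a hand-rolled sketch of what such an Optuna objective looks like. This is not MIPROv2’s actual code; <code class="language-plaintext highlighter-rouge">apply_config</code>, <code class="language-plaintext highlighter-rouge">evaluate</code>, and the candidate pools are hypothetical stand-ins:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import optuna

# Hypothetical candidate pools; in MIPROv2 the instructions come from an LM
# proposer and the demo bundles from bootstrapping.
instructions = ["Answer concisely.", "Think step by step, then answer."]
demo_sets = [demos_a, demos_b, demos_c]

def objective(trial):
    # One categorical choice per predictor for each parameter type.
    for name, predictor in program.named_predictors():
        i = trial.suggest_categorical(f"{name}_instruction",
                                      list(range(len(instructions))))
        d = trial.suggest_categorical(f"{name}_demos",
                                      list(range(len(demo_sets))))
        apply_config(predictor, instructions[i], demo_sets[d])  # hypothetical helper
    return evaluate(program, valset, my_metric)  # mean metric over the valset

sampler = optuna.samplers.TPESampler(seed=9, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=30)
</code></pre></div></div>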

<p>(Optuna is an optional dependency for both optimizers. Install with <code class="language-plaintext highlighter-rouge">pip install dspy[optuna]</code>.)</p>

<h2 id="where-this-earns-its-compute-cost">Where this earns its compute cost</h2>

<p>Worth being honest about the tradeoffs:</p>

<ul>
  <li><strong>Each trial costs real money.</strong> A trial is a full program eval against the valset. With 30 trials and a 100-example valset, you’re at 3,000 LM calls before counting any internal calls the program itself makes.</li>
  <li><strong>TPE needs warmup.</strong> It usually takes 20 to 50 trials to convincingly beat random search. Below that, the priors don’t have enough signal to be confident about anything.</li>
  <li><strong>Categorical only.</strong> No continuous knobs in the loop.</li>
  <li><strong>Diminishing returns on small spaces.</strong> If your demo pool is small, <code class="language-plaintext highlighter-rouge">BootstrapFewShotWithRandomSearch</code> often ties, and at lower cost.</li>
</ul>

<p>Reach for the Optuna-based optimizers when:</p>

<ul>
  <li>Your validation set is cheap to score (or your metric is fast).</li>
  <li>Your program has multiple predictors with non-trivial interactions.</li>
  <li>You’re willing to spend compute upfront for a one-time compiled artifact you’ll reuse.</li>
</ul>

<p>Skip them when you’re in early exploration, prototyping a single predictor, or working with a metric so noisy that TPE can’t get a clean signal out of it.</p>

<h2 id="the-reframe">The reframe</h2>

<p>The thing to take away here isn’t “use MIPROv2.” It’s the conceptual shift:</p>

<blockquote>
  <p>Prompt engineering is black-box optimization over a discrete combinatorial space, scored by a metric you define.</p>
</blockquote>

<p>Once you accept that framing, the entire toolbox of classical hyperparameter optimization becomes available: TPE, CMA-ES, Hyperband, multi-objective Pareto search. DSPy happens to wire up Optuna because it’s a clean fit, but nothing stops you from writing your own optimizer over the same primitives. The framework gives you <code class="language-plaintext highlighter-rouge">compile(program, trainset, valset, metric)</code>. What runs inside that call is up to you.</p>

<p>That’s the actual unlock.</p>]]></content><author><name>Bharat</name></author><category term="DSPy" /><category term="Optuna" /><category term="LLMs" /><category term="prompt optimization" /><category term="TPE" /><category term="MIPROv2" /><summary type="html"><![CDATA[Most people who pick up DSPy stop at BootstrapFewShot. It works, it runs fast, it spits out a compiled program with reasonable few-shot demos, and the diminishing-returns instinct kicks in. Why keep tuning?]]></summary></entry><entry><title type="html">Unsupervised Learning, K-Means and Expectation Maximisation algorithm.</title><link href="https://bsbarkur.github.io/2024/08/14/unsupervised-learning-cs229-notes.html" rel="alternate" type="text/html" title="Unsupervised Learning, K-Means and Expectation Maximisation algorithm." /><published>2024-08-14T00:00:00+00:00</published><updated>2024-08-14T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/08/14/unsupervised-learning-cs229-notes</id><content type="html" xml:base="https://bsbarkur.github.io/2024/08/14/unsupervised-learning-cs229-notes.html"><![CDATA[<p>Reference: <a href="https://www.youtube.com/watch?v=LmpkKwsyQj4">Lecture from CS229 Lecture 17 by Anand Avati</a></p>

<h2 id="agenda">Agenda:</h2>
<ol>
  <li>Introduction to Unsupervised Learning
    <ul>
      <li>Contrast with supervised learning and reinforcement learning</li>
      <li>Goal: Find interesting structures in data without labels</li>
      <li>Examples: Clustering, density estimation</li>
    </ul>
  </li>
  <li>K-Means Algorithm
    <ul>
      <li>Purpose: Grouping data into K clusters</li>
      <li>Steps:
        <ul>
          <li>Initialize cluster centroids randomly</li>
          <li>Assign each point to the nearest centroid</li>
          <li>Recalculate centroids based on assignments</li>
          <li>Repeat until convergence</li>
        </ul>
      </li>
      <li>Discussion on convergence and local optima</li>
    </ul>
  </li>
  <li>Gaussian Mixture Models (GMM)
    <ul>
      <li>Extension of K-means with probabilistic assignments</li>
      <li>Model components:
        <ul>
          <li>Latent variable Z (cluster assignment)</li>
          <li>Observed variable X (data point)</li>
          <li>Parameters: means, covariances, and mixing coefficients</li>
        </ul>
      </li>
      <li>Soft assignments instead of hard assignments</li>
    </ul>
  </li>
  <li>Expectation Maximization (EM) Algorithm
    <ul>
      <li>General framework for maximum likelihood estimation with latent variables</li>
      <li>Two main steps:
        <ul>
          <li>E-step: Compute posterior probabilities (soft assignments)</li>
          <li>M-step: Update parameters to maximize the likelihood</li>
        </ul>
      </li>
      <li>Application to GMM</li>
    </ul>
  </li>
  <li>Mathematical Foundations
    <ul>
      <li>Jensen’s Inequality
        <ul>
          <li>Definition for convex and concave functions</li>
          <li>Application in deriving the EM algorithm</li>
        </ul>
      </li>
      <li>Evidence Lower Bound (ELBO)
        <ul>
          <li>Definition and significance in EM</li>
        </ul>
      </li>
      <li>Derivation of EM algorithm using Jensen’s inequality</li>
    </ul>
  </li>
  <li>Convergence of EM Algorithm
    <ul>
      <li>Intuitive explanation using graphical representation</li>
      <li>Brief mention of formal proof (to be covered in next lecture)</li>
    </ul>
  </li>
  <li>Practical Applications
    <ul>
      <li>Brief mention of market segmentation as an example</li>
    </ul>
  </li>
  <li>Relationship to Deep Generative Models
    <ul>
      <li>Importance of understanding EM for modern machine learning techniques</li>
      <li>Mention of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs)</li>
    </ul>
  </li>
</ol>

<h2 id="full-transcript">Full Transcript:</h2>

<p>All right, welcome back everyone. Hope you had a good weekend. So this is lecture 16 of CS229, and today we’re going to start a new chapter on unsupervised learning.</p>

<p>Unsupervised learning will be the broad topic for the rest of this week and parts of next week, and the specific topics for today are the k-means algorithm, mixture of Gaussians (which is also called the GMM or Gaussian Mixture Models), and the expectation maximization algorithm.</p>

<p>To jump into today’s topics, what we’ve seen so far in the first few weeks (first three or four weeks or so) we covered supervised learning where we were trying to learn a function that maps X to Y. We were given pairs of X, Y as our training set or examples. After supervised learning, we went into some learning theory, we studied bias variance trade-off in the bias variance analysis, and saw a little bit into generalization.</p>

<p>Then last week we saw reinforcement learning where the goal, rather than minimizing some kind of a loss function, we want to maximize the value by choosing a suitable policy, right? Here value is the long term cumulative sum of discounted rewards, and we want to maximize the long term reward by choosing a policy.</p>

<p>Now, the new chapter we are going to start today (starting today) is unsupervised learning. In the unsupervised learning case, we are given a data set – generally a collection of X1, X2, etcetera, Xn, right – and we do not have a corresponding Y variable associated with each X variable. We are just given a collection of examples, a set of X’s, where each Xi is in Rd – a d-dimensional real vector space, for example. Our goal is to learn some kind of a structure in these X’s, right?</p>

<p>We do not have a corresponding correct answer. We do not have what’s otherwise called supervision of what the correct answer is for each X. But in general, we are just given some collection of X’s, and our goal is to find some kind of an interesting structure in these X’s that hopefully gives us some new insight, right?</p>

<p>So, we have seen logistic regression before, and in the case of logistic regression, you know,  X1 XD…  let me use a few colors here. We were given a data set like this, and our goal was to find a separating hyperplane. This is supervised learning because we are given the correct answer (that is, the color of each point) along with the point itself, right?</p>

<p>Whereas in the unsupervised learning, the problem would translate to something like this: we are given some points – just the X’s – and our goal now is to learn some kind of an interesting structure here. Previously, we had the correct answer for each input. Now, we are just given a collection of X’s and a reasonable structure to find in this are these two clusters.</p>

<p>So loosely speaking, we want to look for such structures when we are given just X’s. However, this problem is generally not very well defined. In the first case, for each point we were told what the correct answer is. But consider a problem like this. If you are asked to find an interesting substructure in this kind of data, it would be totally reasonable to say this is one cluster, this is one cluster, and this is another cluster, right? Another totally reasonable thing would be to say this is one cluster, and this is one cluster, right? So in a way there is no correct answer, so to speak, and our goal is to learn some kind of an interesting structure in the presence of such kind of ambiguities.</p>

<p>The way you want to think of this is that classification problems in the supervised setting are somewhat related to clustering problems in the unsupervised setting, where the cluster identity is like the class label. We want to, looking at just the X’s, find out both how many classes there are and which class each example belongs to, right?</p>

<p>And why would this be interesting? This would be interesting, for example, in examples where, supposing you have, you’re working at a marketing department and you have information about your customers. The information about your customers can be represented in, you know, in some kind of a vector space where you know you have the age of the customer here, you have, you know, their annual income, and on another axis you might have their, you know, geographical location. Each customer would be a point in the space, right?</p>

<p>Now as a person who is working in marketing, you might be interested to perform some kind of a market segmentation to identify, you know, groups of customers so that you can do some kind of a targeted, you know, advertising or marketing campaign or some such thing, right? So that’s just one example of why unsupervised learning might be, you know, interesting.</p>

<p>So the first unsupervised learning algorithm that will be seeing is something called the k-means clustering algorithm. The k-means clustering algorithm is pretty straightforward. This is probably one of the simplest algorithms for unsupervised learning.</p>

<p><strong>K-means Algorithm</strong></p>

<p>So we are given a data set of n examples: X1 through Xn where each Xi is in Rd. Our goal is to group them into k clusters. For the purpose of the algorithm, we will assume that k is given to us. The algorithm goes like this:</p>

<p>Initialize cluster centroids mu1, mu2, …, muk – so you have one centroid per cluster – where each of them is in Rd, randomly. So each of these mu1 through muk is a vector. Previously, in our notation, having a suffix on a variable generally meant it was a scalar. But in this case, mu1 is a full vector, right, and there are k such full vectors: mu1 through muk, and we initialize them randomly.</p>

<p>And then we repeat this until convergence. Repeat until convergence.</p>

<p>For every i, where i denotes the example number, set $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$.</p>

<p>And then, that’s step one.</p>

<p>Step two: for every j, where j now indicates the cluster identity, set $\mu_j := \frac{\sum_{i=1}^{n} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{c^{(i)} = j\}}$.</p>

<p>So what are we doing here?</p>

<p>K-means is an iterative algorithm where we are given a set of n examples, which we index by i, and we want to identify k clusters, where the k clusters are indexed by j. We initialize the cluster centroids randomly, where mu1 through muk are each a vector in Rd, and we repeat until convergence.</p>

<p>Where for every i, we set Ci… here you can think of C as an array of length n where for each example there is a corresponding Ci. For every Xi, there is a corresponding Ci. And we set Ci to be the identity of the nearest centroid – the minimizing j.</p>

<p>Based on the set Ci vector or Ci array, we then recalculate mu j, where mu j is calculated as the mean of all Xi’s for which Ci equals j.</p>

<p>Question. So as I said already, for now let’s assume k is told to us; we are given what k is. And this is the algorithm, right? It’s a pretty straightforward algorithm where we alternate between one step where we are calculating the cluster identities for each example, and another step where we are recalculating the cluster centroids.</p>
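<p>A minimal numpy sketch of the two alternating steps just described (an illustration, not code from the lecture):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) data matrix. Returns centroids mu (k, d) and assignments c (n,).
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # random init from the data
    for _ in range(n_iters):
        # Step 1: set c_i = argmin_j ||x_i - mu_j||^2 (nearest centroid)
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2: recompute mu_j as the mean of the points assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break  # converged: the centroids stopped moving
        mu = new_mu
    return mu, c
</code></pre></div></div>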

<p>This is probably seen through a simple visualization which we’ll have a quick look at.</p>

<p>Any questions on this so far?</p>

<p>Yes.</p>

<p>Questions, so the question is, what happens… can we use an unsupervised learning setting to learn the different cluster centers and use that as a classification algorithm? Yeah, it might or it might not behave the same way as a supervised learning algorithm.</p>

<p>Yeah, so supposing this is, you know, think of this as a collection of points that are given to us, where each green point is a data point Xi in Rd. The way the algorithm works is this: here we assume k equals 2, and the red X and the blue X you can think of as mu1 and mu2, which are randomly initialized.</p>

<p>In the first step, what we do is for each point, for each point X, we identify the nearest cluster. We, we set Ci to be the identity of that cluster which has the smallest L2 distance between that point and the cluster centroid. So over here, the red dots are those for which the red X is closer, and the blue dots are those X’s for which the blue X is closer, right? So this is like setting the Ci’s in the first iteration.</p>

<p>Once we set the Ci’s, in the next step what we do is recalculate the mu, the mu j’s. So what happened here? I’m going to go back one slide just to see the difference.</p>

<p>So these two points which previously belonged to the old blue centroid now got mapped to the red one.</p>


<p>And then we re-evaluate the centroids again, and the centroids will now move to the center. Once we reach here, in the next iteration (I actually moved a slide) nothing changes, and we declare that the algorithm has converged, right?</p>

<p>It’s a pretty, pretty simple and straightforward algorithm.</p>

<p>Now, a few natural questions to ask are: will this algorithm always converge? And will it always give us the same answers? It can be shown that the algorithm does always converge, though what we mean by convergence in this algorithm has a special meaning.</p>

<p>So we consider this loss function, $J(c, \mu) = \sum_{i=1}^{n} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$.</p>


<p>This is also called the distortion function. The k-means algorithm is basically an algorithm for minimizing this particular loss, or distortion function, in the form of coordinate descent.</p>

<p>So what is coordinate descent? You can think of coordinate descent as a variant of gradient descent where what we are doing is at each step, instead of minimizing the loss with respect to all the input variables, we only minimize the loss with respect to a few variables by holding the others fixed, right?</p>

<p>So step number one corresponds to minimizing the distortion function by holding mu fixed and optimizing C, where we calculate new C’s. Step number two corresponds to minimizing J again by holding the C’s fixed and optimizing it with respect to mu, right? So k-means is coordinate descent on the distortion function J, where in one step we optimize it with respect to C and in the other step with respect to mu, and the result of the optimization is these closed-form rules for recalculating the C’s and mus.</p>
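<p>The distortion itself is one line of numpy (reusing <code class="language-plaintext highlighter-rouge">X</code>, <code class="language-plaintext highlighter-rouge">c</code>, and <code class="language-plaintext highlighter-rouge">mu</code> from the sketch above); each of the two steps can only decrease it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def distortion(X, c, mu):
    # J(c, mu) = sum_i ||x_i - mu_{c_i}||^2, the objective k-means descends
    return ((X - mu[c]) ** 2).sum()
</code></pre></div></div>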

<p>We say that the k-means algorithm converges in the sense that eventually we are going to reach some kind of a local minimum of this J function. It may so happen that we have minimized J, but we may end up toggling between two sets of mus and C’s, alternating once we reach a local minimum. Though that happens extremely rarely in practice, we will eventually reach a state where J is no longer minimized further. We’re going to flatten out in J, and most of the time – pretty much all the time in practice – that’s going to result in some kind of mu and C that does not change anymore. This J is non-convex, which means the mus and the C’s that we end up in can change from run to run. If you start with a different initialization, you may end up with a different set of mus and C’s, right, which kind of ties back to a question that was asked before.</p>

<p>You know, why do we, you know, ever need to use the label identities and not just perform clustering? The answer is that this is basically a non-convex problem, and we can end up with different cluster identities depending on the initialization. Any questions on this?</p>

<p>Yes.</p>

<p>Question. So the question is, by looking at a function, how do we determine whether it is convex or not? In general, the answer is not always straightforward. It is easy to show that something is convex by showing it as, you know, a composition of convex sub-functions, right? However, showing something is not convex is not always that straightforward. Something which may not appear to be convex at first can sometimes, with some kind of reparameterization, end up being convex, etcetera. In this case, it happens to be non-convex. Any questions on this?</p>

<p>Cool.</p>

<p>So given this clustering approach that we have seen, let us move on to something that is slightly different and also somewhat related, which is the problem of density estimation.</p>

<p>So density estimation generally refers to the problem where we are given some number of data points… and this is in R, you know, think of it as the X axis, and these points are residing in a continuous space, right… and we assume that these points are sampled from some kind of a probability distribution. Because these points are coming from a continuous distribution, the corresponding probability distribution has some kind of a density. It is not a probability mass function, but a probability density function, right?</p>

<p>Given these points, points that look like this, the question is, what is the density function from which these points were sampled? In general, it’s a very hard problem because if you want to fit this data really, really well, then the best possible fit would be a density that looks like this, where it has, you know, like a Dirac delta function over every data point – a peak that’s really, really sharp over each data point. This is a valid density, but at the same time it does not feel natural, right?</p>

<p>Another, another equally valid density would be something that looks like this.</p>

<p>Now this is also a valid density from which it could have been sampled because there is nothing to the left, nothing to the right, and there are some values over here, you know, that there are something where there’s some data. So you might, you might have some kind of a density. Also this is also valid, right?</p>

<p>So all these are different possible answers for what the underlying density is from which these points were sampled. Kind of the fundamental problem of density estimation is that the density function has to be a continuous function.</p>

<p>In the case where these were, you know, outcomes of coin tosses, where the support is discrete, maximum likelihood would be pretty straightforward. You could treat them as a multinomial and just count. Whereas in density estimation, we need to come up with a smooth function over discrete observations.</p>

<p>I say discrete observation because we have a fixed number of observations and we want to come up with, with a smooth estimate. So a common approach in density estimation is to use this model called the Gaussian Mixture Model, or it is also called the mixture of Gaussians, right?</p>

<p>The Gaussian Mixture Model… where, given some data points that look like this, right, we make this hypothesis that there are two underlying, distinct Gaussian distributions. There is one Gaussian distribution from which these were sampled, another Gaussian distribution from which these were sampled. Together you can take the sum of these two Gaussian probability distributions and say this entire data set is sampled from the mixture of these two Gaussians.</p>


<p>The choice of k is again similar to k-means. It is something that we choose by, you know, visual inspection or, in general, by seeing how well the data fit a given k.</p>

<p>The problem we have now is, given this data set, to estimate the two Gaussians from which the data set might have come, right? We are not told what the identities of the two Gaussians are. If this were a supervised setting, the points would come with some kind of an identity, and we could have fit one Gaussian here and the other here.</p>

<p>This is exactly what we did in… do any of you remember where we did something like this?</p>

<p>GDA.</p>

<p>Exactly. So in GDA, we were, we were told that X’s come from, are sampled from Gaussians, and there are these two different classes: class 1 and class 2, right? And our goal was to take these X’s along with their cluster identities (the corresponding Y’s) and estimate the mus and sigmas for the two classes, right?</p>

<p>Now in the Gaussian Mixture Model, we are essentially generalizing GDA in a way where we are not given what the Y labels are. We are just given the X’s, and we also relax the constraint we had in GDA that the covariance had to be the same. In this case, the covariances can be different, and our goal is to come up with some kind of a density p of X that allows us to assign probability density to the observed values, right? So that is the setting in which the Gaussian Mixture Model comes into the picture.</p>

<p>The reason why, why we are interested to even calculate this p of X… there are many reasons why calculating p of X could be interesting.</p>

<p>So here’s one example. Here’s a completely made up example. Supposing, supposing you are an aircraft manufacturer where let’s assume the, the parts that we manufacture have two kind of attributes. Let’s say, you know, heat tolerance… and if any of you are, you know, are in aeronautics, what, what might be another, you know… let’s say, let’s, let’s call it,  heat tolerance and, and, whatever… power output, whatever that means.</p>

<p>So let’s assume there are, there are, there are these two kind of, attributes that are, that are, therefore, some part that we manufacture. In general, what we observe, that if we were to plot the, the, the every manufactured part as a point here, we might observe that most of the normal parts fall along some kind of a distribution like this. Maybe there are, you know, two different kinds of, subtypes of parts may be based on the material or something, where some of them fall in distribution, some of them fall in this kind of a distribution, you know, whatever be the reason, but generally, let’s assume that, you know, normal looking parts fall in this kind of, belong to this kind of a probability distribution.</p>

<p>Now suppose we want to have some kind of automated anomaly detection, where we want to detect that some part is faulty for some reason. For example, a part that has this attribute, right? We want to identify that this point over here is faulty. Even though it looks visually far from this distribution, if you were to look at any one of the axes alone, it looks pretty normal.</p>

<p>From the heat tolerance point of view, it’s kind of, you know, in the close to the mean. If you just look at the power output, it is also kind of near the mean. But it is this combination that makes it kind of, you know, an anomalous example, right?</p>

<p>The way this kind of anomaly detection can be carried out in practice is to construct a density estimate p of X for these points, where this p of X assigns high probability to anything that falls in this region, and low probability to anything outside this region, right?</p>
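<p>As a concrete sketch of that recipe (using scikit-learn’s GMM for brevity; the data and the threshold here are made up, like the example itself):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up "normal" parts: columns are heat tolerance and power output.
X_normal = np.concatenate([rng.normal([0, 0], 0.5, size=(250, 2)),
                           rng.normal([3, 3], 0.5, size=(250, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X_normal)
threshold = np.quantile(gm.score_samples(X_normal), 0.01)  # low-density cutoff

x_new = np.array([[0.0, 3.0]])  # normal on each axis alone, odd jointly
print(gm.score_samples(x_new) &lt; threshold)  # flag if the density is too low
</code></pre></div></div>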

<p>A common approach to doing this kind of density estimation is to use a mixture of Gaussians. The way we go about it is this: first we are going to give you an algorithm for the mixture of Gaussians, constructed purely on intuition. Then we are going to describe a general framework called expectation maximization, re-derive the Gaussian Mixture Model using that framework, and see that we end up with the same algorithm we got using intuition.</p>

<p>The expectation maximization that we’re going to see next is a more general framework that works for a broad class of generative models. These are examples of generative models, and the Gaussian Mixture Model is just one such model which can be solved through expectation maximization. Yes, question.</p>

<p>Restrict like… what is the upper bound for the frequency that you will tolerate on the PDF because the k will fit best as the k, as k approaches the number of data points, right?</p>

<p>So the question is, I guess to summarize, how do we find k? It is true that as we increase k, we fit the data better and better. To think about what’s the best value of k, I will leave this as a thought exercise for now, and we’ll come back to it probably next week. Try to see how you can apply what we learned in learning theory – bias and variance – and give some thought to how you can apply bias-variance analysis to this kind of problem, right? For now, for today and for the rest of this week, we’re going to just cover more algorithms, and we’ll come back to it later and see how we can approach it in a more principled way.</p>

<p>So in the mixture of Gaussians or Gaussian Mixture Model… we are given a training set of just X’s, right? And then we are going to assume there is this Zi which is distributed multinomial with parameter phi, where $\phi_j \ge 0$ and $\sum_{j=1}^{k} \phi_j = 1$, right? $\phi_j$ is basically the probability that Zi equals j, right?</p>

<p>And then we have Xi given Zi equals j distributed as a normal distribution with mean mu j and covariance sigma j: $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$, right?</p>

<p>So this is describing the model. The way we assume the model works is that first we sample the class identity Z from some kind of a multinomial distribution, right? Then, once we have sampled the identity, we generate an observation X conditioned on the Z value that we sampled, from some particular Gaussian distribution with mean mu j and covariance sigma j.</p>

<p>This is very similar to GDA, right? The difference between this and GDA is that in GDA, we called Z as Y. Here, we’re just calling it Z, and that is a common pattern that you will see in algorithms where we have full observation, which becomes a supervised setting. <strong>The variables that we call Y we end up calling Z in the unsupervised setting because they are not observed.</strong> In this case, there is this underlying Z that we do not observe that is sampled from some multinomial distribution. Depending on the identity of the cluster that we sample, the observation is then sampled from a corresponding Gaussian distribution that has a mean and covariance specific to that cluster, right?</p>

<p><strong>In these cases, the Zi’s, because they are not observed, are called latent variables. So latent variable is a fancy name for a random variable that we have not observed.</strong> We’ve just not seen what its value is, and that’s why we call it latent.</p>
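<p>The generative story in code – a toy sketch with made-up parameters, just to show where the latent Z enters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.3, 0.7])                     # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])        # per-cluster means
sigma = np.stack([np.eye(2), 2 * np.eye(2)])   # per-cluster covariances

def sample_one():
    z = rng.choice(len(phi), p=phi)               # latent: z ~ Multinomial(phi)
    x = rng.multivariate_normal(mu[z], sigma[z])  # observed: x | z ~ N(mu_z, sigma_z)
    return x  # the training set keeps only x; z is never observed

X = np.stack([sample_one() for _ in range(500)])
</code></pre></div></div>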

<p>Yes.</p>

<p>Question.</p>

<p>So phi over here is the… you can think of it as the class prior. You know, the class prior that we had in GDA. This just tells us, of all the examples, the X’s that we have, what fraction of them belong to cluster j.</p>

<p>Good question.</p>

<p>And now in GDA, we performed maximum likelihood estimation, and our maximum likelihood objective was… in GDA, it was $\log p(x, y; \mu, \Sigma, \phi)$, where mu, sigma, and phi were the parameters, right? So this was the log likelihood objective in GDA, right?</p>

<p>Whereas in the Gaussian Mixture Model, we do not observe Y, right? And so in the Gaussian Mixture Model, our objective will be to maximize $\log p(x; \phi, \mu, \Sigma)$, and this will be our likelihood function. That’s the only difference.</p>

<p>Over here, the objective was the full joint distribution. Over here we would have liked to do the same, but we haven’t observed the corresponding Y’s – which we should call Z’s here, right – they’re not observed. So we cannot construct that likelihood function because we won’t know what value of Z to plug in in this expression, right? If we had observed them, it would have been pretty straightforward. We would have just been doing GDA.</p>

<p>Instead, what we do is maximize log p of X, which can also be written as $\log p(x; \phi, \mu, \Sigma) = \log \sum_{z} p(x, z; \phi, \mu, \Sigma)$. So we write out the full joint distribution and marginalize out the latent variable.</p>

<p>Any questions how we went from this to this?</p>

<p>This question. So the question is, shouldn’t Z also contribute to our likelihood objective? The answer is, if we had observed Z, then yes it should have. But we don’t know what Z is, so it cannot… so the question is, we are assuming there are k clusters; given some value for k, shouldn’t we therefore account for the cluster identity? We are making an assumption about k, but we do not know which of those k clusters each point belongs to, right? So our objective is to now maximize this expression, which is the same as this expression, right?</p>

<p>For the rest of today’s lecture and throughout, it can be useful to set up some terminology. For example, p of Z we will call the class prior. Or in cases where Z is not discrete but continuous, we’ll just call it the prior, right?</p>

<p>P of X comma Z, we will call it the model, because it describes the full data generating process. The joint distribution always describes the full data generating process, and that’s always called the model, right?</p>

<p>Z in unsupervised settings is called latent because we do not observe it. Latent is just a fancy word for unobserved.</p>

<p>P of Z given X, we will call it the posterior.</p>

<p>Finally, p of X… so p of Z is called the prior, and p of X will be called the evidence. Evidence because X is what we observe, that is the evidence based on which we are performing inference, right? This is just terminology, and this terminology is pretty standard and used in many papers, many, many books.</p>

<p>So our goal is to maximize the likelihood using the evidence.</p>

<p>Right? And this… the way we go about doing that is… right?</p>

<p>So to directly maximize this evidence, directly maximizing this likelihood… if we were to attempt it the way we did it with GDA, of taking derivatives, setting them equal to zero and solving for the parameters… you will observe that you won’t get a closed-form expression for this. You can try it out. You won’t get a closed-form expression the way we got it with GDA. In GDA, we got a closed-form expression because we had observed both X’s and Z’s – we called them Y’s – and if we had observed both X’s and Y’s here, we would have gotten a closed-form expression. But because the Z’s are not observed and we are marginalizing them… if you work it out, we will not get closed-form expressions.</p>

<p>So instead what we will do is, just like, you know, taking inspiration from k-means, we are going to first imagine, or come up with, some kind of an estimate for the Zi’s first.</p>

<p>So the algorithm that we are going to run is repeat until convergence, where this is, you know, just inspired by k-means. We’ll call it the E-step.</p>

<p>For each i comma j, set $w_j^{(i)} := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$.</p>

<p>And the M-step: update the parameters.</p>

<p>$\phi_j := \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}$.</p>

<p>$\mu_j := \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}$.</p>

<p>And $\Sigma_j := \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{n} w_j^{(i)}}$.</p>

<p>So what we did is we start with some random initialization. Randomly initialize parameters.</p>

<p>The repeat will start after we randomly initialize it.</p>

<p><strong>Randomly initialize mu phi and sigma.</strong></p>

<p>Start with some random initialization. Think of this as the way we randomly initialize the cluster centroids in k-means, right? Based on the random initialization in k-means, we associated each point to the nearest cluster centroid, right? The similar operation that we are going to do here is, for each point, assign a weight to each cluster centroid specific to that point, where the weight is the posterior distribution p of Z given X. So given a point, we calculate a posterior distribution of the probability that this point belongs to a particular centroid, and this is just the posterior. The posterior distribution, we are going to call it a weight. Once we calculate this weight, we are going to reweight all our data points to calculate the corresponding mus and sigmas.</p>
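<p>Putting the E-step and M-step together, here is an illustrative numpy/scipy sketch of the updates written above (not lecture code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    # X: (n, d). Returns mixing weights phi, means mu, covariances sigma.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.stack([np.eye(d)] * k)
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z_i = j | x_i; phi, mu, sigma) via Bayes' rule
        w = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the weighted update rules from above
        nj = w.sum(axis=0)            # effective number of points per cluster
        phi = nj / n                  # phi_j = (1/n) sum_i w_ij
        mu = (w.T @ X) / nj[:, None]  # weighted means
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / nj[j]
    return phi, mu, sigma
</code></pre></div></div>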

<p>So for example, Xi will have a weight for each centroid… let us say there are three centroids, k equals 3; then $p(z^{(i)} = j \mid x^{(i)})$ could be some kind of multinomial distribution like 0.1, 0.7, 0.2 for k equals 1, k equals 2, k equals 3. If the centroid mu2 was close to Xi, then it would have a higher weight, and, you know, these two others are farther away, so they have lower weights.</p>

<p>In, in case of k-means, we would do a hard assignment of every point to one cluster only. But over here, we are doing a soft assignment where every point has, has a soft assignment in the form of this probability distribution for all the cluster centroids.</p>


<p>The probability assigned to the centroid that is closer to the point… by closer here we mean in a probabilistic sense, where the point has a high likelihood under that cluster’s Gaussian distribution… then it will have a high posterior probability. And we do this kind of a soft assignment of every point to the set of all clusters. Using the calculated weights, we recalculate the mus and sigmas using the weighted data set. Here, every point i contributes to every centroid j, and the contribution is weighted by the corresponding wij. Questions.</p>

<p>Yeah. So the question is, will this have a closed form expression?</p>

<p>Yeah. Yeah.</p>

<p>So, the question is… for this, I would remind you that in the case of GDA we had calculated a posterior that was very similar. If you remember, the posterior had the form of a logistic regression in that case. However, we limited ourselves to two classes, and we also had a constraint that sigma was common to all of them. But in this case, when we relax that constraint, what we will observe is that the posterior takes the form of a softmax.</p>

<p>It takes the form of a softmax that uses quadratic features of the X’s. But for small k’s, you can come up with an expression for this. In fact, you’ll be doing this in your homework, so that will be clarified there as well, right?</p>

<p>So, inspired by k-means, here is a version of… you can think of the Gaussian Mixture Model as soft k-means. People call it soft k-means as well. We call it soft because the assignment phase in k-means was a hard assignment.</p>

<p>This corresponds to the E-step. In k-means, the cluster identity was a hard assignment. You can think of k-means as a way in which we calculate a posterior distribution which is always one-hot.</p>

<p>If we calculate these kinds of posterior distributions, then we effectively get k-means out of GMMs, right? And the equivalent of the M-step was the way in which we were recalculating the mu j’s. We used only those X’s for which the cluster identity matched the corresponding cluster. And over here, instead of having an indicator function, we are going to use this soft assignment weight. Any questions?</p>

<p>Yeah. So if W’s are one hot, then this, this will essentially be, you know,  k-means.</p>

<p>Yes.</p>

<p>Question.</p>

<p>So the question is, how do we calculate $w_j^{(i)}$ if we have not observed the Z’s? Is that the question?</p>

<p>Yeah, so here, in order to calculate this, we don’t need to know the Zi’s, right? We are just constructing the probability that Zi could be equal to j, right? The way we go about doing this is to use Bayes’ rule, right? The way you use Bayes’ rule…</p>

<p>Right, so this can be calculated as $p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')}$, right?</p>

<p>Over here, p of X given Z is a normal distribution, so this is Gaussian, right? And p of Z is just multinomial, right? In the denominator, it is the same terms, the Gaussian and the multinomial, but you sum over all the classes. And you will see that, just the way you showed it in homework one for GDA, where this took the form of a logistic regression, here you can actually show that it takes the form of a softmax. It’s a very similar calculation. Any questions?</p>

<p>Right.</p>

<p>Yeah, so this is basically the Gaussian Mixture Model where we derive the steps to be by taking inspiration from k-means, right?</p>

<p>We are intentionally giving these the names the E-step and the M-step because next we’re going to talk about the EM algorithm, and we’re going to derive it in a more principled way, and we’re going to end up with the same update rules, right? And this is, you know, think of this as soft k-means.</p>

<p>Soft k-means, right?</p>

<p><strong>E-M algorithm.</strong></p>

<p>Now we are going to switch gears and talk about the EM algorithm. The EM algorithm is also called the expectation maximization algorithm; it is an algorithm that gives us a framework for performing maximum likelihood estimation when some variables are unobserved, right?</p>

<p>It is used in cases where we have a functional form for the joint p of X comma Z, and X and Z could be anything, but the Z’s are not observed.</p>

<p><strong>So expectation maximization. Expectation maximization is an algorithm where we perform MLE in the presence of latent variables.</strong></p>

<p>Where the true model is p of X comma Z with some parameters theta. If we observe X and Z jointly… if everything is observed, then the problem is very simple: we perform simple maximum likelihood estimation. But when Z is unobserved, we want to instead maximize the objective $\ell(\theta) = \log p(x; \theta)$, where Z is marginalized. The EM algorithm or the EM framework gives us a way of achieving this where we maximize log p of X in an indirect way, rather than directly taking the derivatives and setting them to 0 and so on.</p>

<p>Now before we go into the EM algorithm, it can be useful to have some context. The EM algorithm was discovered sometime in, I think, the early 1970s, when people were trying to perform MLE in the presence of unobserved data. It so happens that this framework is so general and so powerful that it has been adapted in many different ways, and minor extensions of the EM algorithm exist in many different forms. But this framework is somewhat central, and understanding EM in a deep way will be extremely useful if you are interested in things like deep generative models.</p>

<p>So, over the last few years, there has been tremendous growth in deep generative models, where you might have heard of, you know, variational autoencoders, generative adversarial networks (or GANs), or flow-based models such as Glow. Understanding all of them would be much easier if you really understand the EM algorithm well, because the EM framework gives you a kind of a mind map where you can place all these different algorithms and understand the strengths and weaknesses, what’s common between them, what’s different between them, and so on.</p>

<p>So the EM algorithm is one of the key algorithms even for modern deep learning and deep generative models, right? Before we jump into the EM algorithm, we are going to first look at something called Jensen’s inequality.</p>

<p><strong>Jensen’s Inequality</strong></p>

<p>Jensen’s inequality is a very general probabilistic inequality that’s used in many places in probability theory and applied probability theory, and it will show up in our derivation of the EM algorithm as well. So you can think of Jensen’s inequality as a probabilistic tool that we will use in deriving EM. But Jensen’s inequality by itself is a very generic and commonly used inequality in probability theory.</p>

<p>So let’s assume a function F to be convex. Assume F to be a convex function, which means $f''(x) \ge 0$ for all x, right? We say that F is strictly convex if $f''(x) > 0$ for all x, right?</p>

<p>So the, the mental picture to have is F of X, and this is X… F of X and X.</p>

<p>So this is an example of a convex function. Convex functions are bowl-shaped functions, and they can have a zero second derivative in a few places, as long as $f''(x) \ge 0$ everywhere. But in a strictly convex function, the second derivative is never exactly equal to 0; it is always greater than 0. So if there are straight-line segments for certain input ranges in your function, then it can still be convex, but for a strictly convex function, there cannot be straight lines, right? It should always be curving upwards.</p>

<p>Now, Jensen’s inequality tells us that the expectation of F of X, where X is some random variable, is always greater than or equal to F of the expectation of X, where F is convex: $E[f(X)] \ge f(E[X])$.</p>


<p>So this is Jensen… so Jensen’s inequality tells that the expectation of F of X, where F is a convex function, X is a random variable, and the expectation is taken with respect to the randomness in X, will always be greater than equal to F of the expectation of X, right?</p>

<p>Moreover, if F is strictly convex, then $E[f(X)] = f(E[X])$ implies that $X = E[X]$ with probability 1.</p>

<p>There’s a lot of jargon here. We will dissect it in a moment. So to, to kind of understand this more intuitively, now this, this picture can help, right?</p>

<p>Let this be some function, F of X, and this is basically X. Use a different color here. Let us also assume that X has a probability distribution associated with it, right? So the green dotted line represents the probability density of, of the random variable X. F is some function of X.</p>

<p>Now, expectation of X or E would be somewhere here. So let us call this expectation of X, right? So expectation of X is, is, um… so think of it as like the mu if this is a Gaussian, right? That’s the expectation of the random variable, right?</p>

<p>Now, F of expectation of X… so in, in the case where X… let us assume X takes only two possible values. So let us assume…  let us draw another picture here. So this is X. This is F of X.</p>

<p>Let’s assume X takes only two possible values. Let’s assume it’s a discrete distribution. Here X was continuous, but to understand Jensen’s inequality, let’s assume a discrete setting where X takes only two possible values: this value and this value. Maybe they are, you know, 1 and 10. Let us assume X takes any one of these values.</p>

<p>The mean of X… if it takes the value 1 with probability half and the value 10 with probability half, then the expectation of X will be 5.5.</p>

<p>So expectation of X is 5.5. This over here is F of expectation of X, right? Does that make sense?</p>

<p>This is the expectation of X. You evaluate F at expectation of X, and you get F of expectation of X. That is the right-hand side, right?</p>

<p>Similarly, with probability half, F of X takes this value, and with the other half, F of X takes this value, right? So… let us call these A and B.</p>

<p>So this is F of A, and this is F of B, right? The expectation of F of X is basically the midpoint between F of A and F of B, right? Because with probability half, F of X takes this value, and with probability half, it takes this value. So the expectation of F of X is this one, right? This is expectation of F of X, right?</p>

<p>It, it so happens that this point will always be the midpoint of… will always be the midpoint of the chord connecting F of A and F of B, right? What Jensen’s inequality is telling us is that this point, the point that, that’s the, the midpoint of the chord connecting two points on F will always be higher than this point, right? 
</p>

<p>So, F of expectation of X is always less than the expectation of F of X, right? Is this clear? Can you raise your hand if you understood this?</p>

<p>Some of you have not. Okay, can anybody tell me what’s still confusing here? I can just go over it again.</p>

<p>So F is a convex function which is kind of bending upwards. The X axis denotes some random variable. In this case, just for the purpose of understanding Jensen’s inequality, let’s assume X is discrete and takes one of two values – either 1 or 10 with equal probability. The expectation of X is therefore 5.5. That’s over here.</p>

<p>Now, F of 1… you know, let’s call it A. F of A is this point; the height from the X axis to this point is F of A. Similarly, if this is B, this is F of B. The expectation of F of X is therefore the midpoint of the chord connecting these two points; that comes over here. F of expectation of X, where expectation of X was 5.5, is this point over here. Jensen’s inequality is therefore essentially saying that the chord connecting any two points of a convex function always lies above the function itself.</p>

<p>The expectation of F of X is greater than or equal to F of expectation of X.</p>

<p>Right? Kind of understood?</p>

<p>All right.</p>
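
<p>A quick numerical sanity check of the two-point example (my own addition, not from the lecture), with F of X = X squared:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

f = lambda x: x ** 2                                   # a strictly convex function
xs = [random.choice([1, 10]) for _ in range(100_000)]  # X is 1 or 10, each with probability half

E_X  = sum(xs) / len(xs)                 # about 5.5
E_fX = sum(f(x) for x in xs) / len(xs)   # about (1 + 100)/2 = 50.5
print(f(E_X), E_fX)                      # about 30.25 vs 50.5: E[f(X)] is the larger one
</code></pre></div></div>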

<p>Okay, let’s move on. Jensen’s inequality also tells us that if F is strictly convex, and if expectation of F of X equals F of expectation of X, then X equals the expectation of X itself, which means X is essentially a constant.</p>

<p>What does that mean?</p>

<p>So here is an example of F of X that is strictly convex.</p>

<p>Now, if expectation of F of X equals F of expectation of X, when can that be possible? Let’s assume, you know, A and B are here.</p>

<p>This is expectation of F of X. Let us say this is F of expectation of X.</p>

<p>If F is strictly convex, and if the two are equal – expectation of F of X equals F of expectation of X – then the only way that is possible is if X has a probability density like a Dirac delta function, with all its mass concentrated at just one point. Over here, with this dotted line, I am essentially drawing that probability density of X.</p>

<p>In this case, this is expectation of X, and F of expectation of X is here. Because X always takes on this value with probability 1, F of X also always takes on this value with probability 1, and therefore F of X equals F of expectation of X equals expectation of F of X: essentially, the equivalent of the chord connecting two points has length 0 here. All the values of F of X are always here, and X always evaluates to the same value.</p>

<p>Question. So what’s the expectation of a continuous random variable?</p>

<p>If X is a continuous random variable and it has a PDF, let’s call it small p of X, then the expectation of X is equal to the integral of X times p of X dx. So p in this case is the probability density; the green dotted line is p of X here.</p>

<p>And what is E of F of X? E of F of X is equal to the integral of F of X times p of X dx. Good question.</p>
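
<p>As another aside of mine (not in the lecture): the same check for a continuous X – here a standard normal – with the strictly convex F of X = e to the X, approximating the integrals with simple Riemann sums:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

x = np.linspace(-8.0, 8.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # the density p(x) of a standard normal

E_X  = np.sum(x * p) * dx                      # integral of x p(x) dx, about 0
E_fX = np.sum(np.exp(x) * p) * dx              # integral of e**x p(x) dx, about e**0.5
print(E_fX, np.exp(E_X))                       # 1.6487... vs 1.0: E[f(X)] exceeds f(E[X])
</code></pre></div></div>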

<p>Right, so this is Jensen’s inequality. The reason why we require F to be strictly convex for the equality condition is that if F were merely convex, you could have a case where F of X is flat in some region and X has all of its density in that flat region. Then F of X is the same constant wherever X has mass, so the expectation of F of X would be equal to F of expectation of X even though X is not constant, because F has a flat region somewhere.</p>

<p>This question. So the question is: in the merely convex case, when can the equality hold without X being a constant? For expectation of F of X and F of expectation of X to be the same without X being a constant, all of X should be distributed in a region where F is flat.</p>

<p>Yeah, that’s right.</p>

<p>So what are some examples of convex functions? Anybody… example for convex function?</p>

<p>Y equals X square.</p>

<p>Example for concave function?</p>

<p>Minus X square. Y equals minus X square.</p>

<p>Yeah, yeah, that’s good.</p>

<p>So, examples of convex, concave, and strict: yes or no. We saw X square is convex; minus X square is therefore concave. X square is in fact strictly convex, and minus X square is therefore also strictly concave.</p>

<p>Right?</p>

<p>Now, another function: Y equals, or F of X equals, say, mx plus C – a straight line. By definition it is convex, and it is also concave.</p>

<p>But is it strict?</p>

<p>No.</p>

<p>Right. It’s fine. Great. Now, what about e to the X?</p>

<p>Yes, convex.</p>

<p>Minus e to the X is therefore concave. Is it strict? Yes: e to the X is strictly convex, so minus e to the X is strictly concave.</p>

<p>What about log X? So log X is concave and therefore minus log X is convex, right? And it is strict.</p>

<p>Cool.</p>

<p>Now how does this… how is this useful for expectation maximization?</p>

<p>Yes.</p>

<p>Question.</p>

<p>Yeah. A straight line is always both convex and concave, because…</p>

<p>F double prime is equal to 0, and the definition of convex is that F double prime should be greater than or equal to 0. It is equal to 0, so it satisfies greater than or equal to 0. Similarly for concave, the condition is less than or equal to 0, and it is equal to 0, so it satisfies that as well.</p>

<p>So now, using Jensen’s inequality and these observations – that expectation of F of X is greater than or equal to F of expectation of X, and so on – we can adapt Jensen’s inequality to the concave case. If F is concave – example: F of X equals log X – then the inequality switches.</p>

<p>So the expectation of log X, instead of being greater than or equal to, will be less than or equal to log of the expectation of X. This is also Jensen’s inequality.</p>

<p>So now let us derive the EM algorithm. In the EM algorithm, our goal is to maximize the sum over i equals 1 to n of log p of Xi comma theta, where by theta I mean all the parameters.</p>

<p>You want to maximize this. That is our end goal: we want to maximize log p of X. However, maximizing log p of X can be hard because the Z’s are unobserved. If they were observed, this would be very easy, but the Z’s are unobserved, so it’s hard. That’s the setting we are in.</p>

<p>Now for the derivation, I’m going to assume one example. So I’m just going to write it as log p of X comma theta, and I’m going to leave the summation out; but the entire derivation we are going to do goes through if you include the summation. This is just to simplify notation.</p>

<p>So log p of X… we want to maximize this. That’s our goal, right? So the first thing we are going to do is write log p of X comma theta is equal to log of the sum over Z p of X Z theta, right?</p>

<p>First we are going to marginalize out Z. Then once we do that, we will define an arbitrary probability distribution called q over Z, and write this as the log of the sum over Z of q of Z times p of X comma Z theta divided by q of Z, where q of Z is greater than 0 for all Z.</p>

<p>Some arbitrary probability distribution over Z. It could be anything whatsoever – any probability distribution over Z such that q of Z is greater than 0 everywhere.</p>

<p>This question. So the question is: why is this a hard problem? It is hard because we have a summation inside the log here. In the cases where Z is continuous, this becomes an integral, and that integral can be arbitrarily complex. It can be computationally expensive, and it may have no analytical solution in cases when we want one.</p>

<p>Good question.</p>

<p>So, q can be any distribution whatsoever as long as q of Z is greater than 0 for all Z. Now we can see that this can be written as the log of an expectation.</p>

<p>Right, what did I do here?</p>

<p>Nothing, basically. This is just the definition of expectation: this ratio is some function of Z, q of Z is the probability, and therefore the sum is just an expectation under q. Is this clear?</p>

<p>Yeah.</p>

<p>Okay, so this is just the expectation. Now we make use of Jensen’s inequality, and note that the log of the expectation of something is greater than or equal to the expectation of the log of the same thing.</p>

<p>So this is going to be greater than or equal to the expectation, over Z drawn from q, of log of p of X comma Z theta over q of Z.</p>

<p>This question.</p>

<p>F here is log, and log is concave. This ratio plays the role of our random variable X.</p>

<p>Any, any, any questions on how we apply… how we went from here to here? This is probably the most crucial step, right?</p>

<p>All good?</p>

<p>Okay.</p>

<p>What we see here, we will give a name: we call it the ELBO – the evidence lower bound.</p>

<p>So it is the ELBO of X, q, theta.</p>

<p>Jensen’s inequality tells us that the ELBO, which is defined to be this term, is always less than or equal to the objective that we want to maximize. Which means that if we find thetas and q’s that maximize the ELBO, then implicitly, for the same values of theta, log p of X is also going up. Does it make sense?</p>

<p>By Jensen’s inequality, the ELBO is always a lower bound for log p of X, our likelihood; both of them have theta in them. Now if we find values of theta that maximize the ELBO, then log p of X at that value of theta is at least as high, and Jensen’s inequality gives us that inequality.</p>
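
<p>A small numerical sanity check of this bound (my own sketch, with made-up values of the joint p of X comma Z over four values of Z for one fixed X, and an arbitrary q):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)

p_xz = rng.random(4) / 10.0          # made-up joint values p(x, z); their sum is p(x)
q = rng.random(4)
q = q / q.sum()                      # an arbitrary q(z) with every q(z) positive

log_px = np.log(p_xz.sum())          # log p(x) = log of the sum over z of p(x, z)
elbo = np.sum(q * np.log(p_xz / q))  # expectation under q of log p(x, z) / q(z)
print(log_px, elbo)                  # log p(x) is always at least the ELBO
</code></pre></div></div>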

<p>Yes.</p>

<p>Question.</p>

<p>All right.</p>

<p>So this is the ELBO. This term, ELBO, is something you will very commonly encounter if you’re reading research papers about generative models or deep generative models. It is a widely used term. ELBO means the lower side of Jensen’s inequality applied to log p of X.</p>

<p>Before we go to our goal, let us make a few more observations. Log p of X is greater than or equal to the ELBO at all times; that is what Jensen’s inequality says. But are there cases when log p of X theta is exactly equal to ELBO of X, q, theta?</p>

<p><em>[Figure: log p of X and the ELBO, and where the two touch]</em></p>

<p>The answer is yes, because of the second part of Jensen’s inequality that we saw. We saw that if F is strictly convex, then expectation of F of X equals F of expectation of X if and only if X is a constant – and the same corollary holds for a strictly concave F, such as log, with the inequality flipped.</p>

<p>So this is one side of Jensen’s inequality, and this is the other side. The two will be equal if and only if the term inside is a constant. That’s what Jensen’s inequality told us, because log is strictly concave. This question.</p>

<p>In this case, we want this entire term over here to be a constant. So the next question is: under what circumstances is this entire term over here a constant?</p>

<p>That’s what we’re going to answer next.</p>

<p>It has to be independent of Z, so that it is constant with respect to Z. So, in order to make this inequality an equality – because log is strictly concave – the inequality becomes an equality if and only if p of X comma Z theta over q of Z equals some constant C.</p>

<p>Now this implies, taking q of Z to the other side, that q of Z is equal to 1 over C times p of X comma Z. Since C is just a proportionality constant, we can write this as: q of Z is proportional to p of X comma Z.</p>

<p>So q of Z is proportional to p of X comma Z, and to turn the proportionality into an equality, we just calculate the normalizing constant: q of Z is p of X comma Z theta divided by the sum over Z of p of X comma Z theta. When you marginalize out Z, the denominator is just p of X theta. So this is equal to p of Z given X theta.</p>

<p>So the question is, why did we normalize it with p of X?</p>

<p>Because q of Z is proportional to this, and we know that q is a probability distribution – which means it has to sum up to 1 – the normalizing constant must necessarily be the sum of this term over all possible values of Z. A follow-up question: if this must sum to 1, and p is a distribution over both variables, do we sum over X and Z?</p>

<p>So this is a distribution over Z; p of X comma Z could be anything, and X could even be continuous. q of Z is a distribution over Z that must sum up to 1. It is proportional to p of X comma Z, and the corresponding normalizing constant must necessarily be the sum of the numerator over all possible values of Z, because q must sum up to 1.</p>

<p>So when q of Z equals p of Z given X, then Jensen’s inequality will change into an equality, right? Lots of moving parts.</p>
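
<p>And the companion check (again my own, with the same made-up-numbers idea): plugging in the posterior for q makes the ELBO hit log p of X exactly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
p_xz = rng.random(4) / 10.0                    # made-up joint values p(x, z) for one fixed x

q_post = p_xz / p_xz.sum()                     # q(z) = p(x, z) / p(x) = p(z given x)
elbo = np.sum(q_post * np.log(p_xz / q_post))
print(elbo, np.log(p_xz.sum()))                # identical: the bound is tight at the posterior
</code></pre></div></div>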

<p>Yes.</p>

<p>So we started with log p of X and wrote it out as the sum over Z of the joint – nothing fancy going on here, we are just marginalizing out the Z – and then we multiplied and divided by some arbitrary distribution q. Multiplying and dividing allowed us to write it in the form of an expectation, with the q staying in the denominator inside the log.</p>

<p>Once we wrote it as an expectation, we had log and the expectation, right? So initially we started with the log likelihood, and therefore we use the concave version of Jensen’s inequality.</p>

<p>So this log and this expectation are then swapped: we get greater than or equal to the expectation of the log. Once we get this, we have the one side and the other side of Jensen’s inequality. The lower end of Jensen’s inequality we just called the ELBO – we just gave it a name.</p>

<p>Then we used the corollary of Jensen’s inequality to look for conditions when this inequality becomes an exact equality. Because log is strictly concave, the corollary of Jensen’s inequality gives us a condition when this inequality becomes an equality: this term must be a constant.</p>

<p>Then in order for this to be a constant with respect to Z, it is necessarily the case that q of Z must be equal to the posterior distribution of Z given X.</p>

<p>Whenever q equals p of Z given X at that value of theta, Jensen’s inequality becomes an equality.</p>

<p>Now given these two… yes, questions?</p>

<p>So why is q of Z exactly the posterior? Let me go over that once more.</p>

<p>All right.</p>

<p>So we want the condition that p of X comma Z divided by q of Z is equal to some constant. Let it be equal to some unknown constant C. You can take q of Z to the other side.</p>

<p>Once you take it over, q of Z is proportional to p of X comma Z, up to that constant. We also know that the sum over Z of q of Z equals 1, which means the sum over Z of the numerators must be equal to C itself. We divide by C, and we get exactly the posterior.</p>

<p>Okay, so q of Z is equal to the posterior of Z given X at that value of theta.</p>

<p>Thank you. Thanks for asking.</p>

<p>So based on this, we write the EM algorithm, or the more general form of the EM algorithm. We call it the more general form because throughout this setting we have not assumed any specific form for p of X comma Z. It could be the mixture of Gaussians; it could be any latent variable model. This derivation holds for any such model.</p>

<h2 id="the-em-algorithm">The EM Algorithm</h2>

<p>All right. So that gives us the algorithm. The EM algorithm has two steps. First, the E-step.</p>

<p>For each i, set qi of Zi equal to p of Zi given Xi comma theta.</p>

<p>And the M-step.</p>

<p>Set theta equal to the argmax over theta of the sum over i equals 1 to n of ELBO of Xi, qi, theta.</p>

<p>Right, so what did we do? We get this EM algorithm where the E-step is to set q to be the posterior distribution p of Z given X, and in the M-step we set theta to be the argmax of the ELBO.</p>

<p><em>[Figure: the E-step and M-step of the general EM algorithm]</em></p>

<p>Now why will this work? To see why, let’s look at this diagram. The horizontal axis here is theta – it is not X. This curve is log p of X theta.</p>

<p>As we vary theta, log p of X takes different values. This is not the density; it is the likelihood, because the X axis is theta and not X. It’s a dotted line because we don’t know it: it is hard to calculate, since we have to marginalize out Z, which may be an intractable integral.</p>

<p>What we got from this ELBO is that, for any given value of q, the ELBO of X, q, theta will always be less than or equal to log p of X. That’s what Jensen’s inequality gave us.</p>

<p>So Jensen’s inequality tells us that this curve is one possible ELBO of X, q, theta. Let’s consider another ELBO.</p>

<p>So for example, this is another ELBO of X, q, theta, for a different choice of q.</p>

<p>So for different choices of q, we get different lower bounds of log p of X. What the corollary of Jensen’s inequality tells us is this: for a given value of theta – let’s assume this is our randomly initialized theta 0 – suppose we choose q to be p of Z given X at theta 0. Let us call this choice q0.</p>

<p>For this choice, if this curve is the ELBO for q0, then the ELBO value equals log p of X at theta naught, which means the ELBO touches log p of X at this value of theta naught.</p>

<p>When we are at theta naught, if we choose to construct an ELBO using q as the posterior distribution at that parameter value, then the ELBO will be tight with respect to the invisible objective that we are trying to maximize, at that value of theta. Now if we maximize in the M-step – if we choose a new theta such that we maximize the ELBO…</p>

<p>If this is theta 1 – the theta for the next iteration – we now construct yet another ELBO.</p>

<p>The new ELBO will be constructed using q1, the posterior at theta 1. Then this ELBO will be tight against log p of X at theta 1.</p>

<p>By choosing q equal to the posterior at the corresponding theta value, the ELBO that we get will be tight: it will be touching log p of X. That’s what the corollary of Jensen’s inequality tells us.</p>

<p>So: we start at theta naught and maximize the ELBO – that is the M-step – to get a new theta 1; we construct a new ELBO there and maximize that one to reach theta 2; here we construct yet another ELBO that is tight at theta 2, and so on.</p>

<p>If we repeat this over and over, constructing a different lower bound at each value of theta and maximizing that lower bound, then we will eventually reach a local optimum where the algorithm converges – meaning theta stops changing. That is essentially what the EM algorithm does.</p>
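
<p>Jumping slightly ahead to the Gaussian mixture case promised for the next class, here is a minimal runnable sketch of this loop (my own toy example with two one-dimensional components and variances fixed at 1 – not the course code):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)

# toy data from two clusters (unit variances, to keep the sketch short)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

phi = np.array([0.5, 0.5])   # mixing weights
mu = np.array([-1.0, 1.0])   # component means
for _ in range(50):
    # E-step: responsibilities w[i, z] = p(z given x_i; theta), via Bayes' rule
    joint = phi * np.exp(-(x[:, None] - mu) ** 2 / 2)   # p(x_i, z) up to a shared constant
    w = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize the ELBO in closed form (weighted maximum likelihood)
    phi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(phi, mu)   # roughly [0.3, 0.7] and [-2, 3]
</code></pre></div></div>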

<p>So this is the rough visual intuition of how the EM algorithm works. In the next class, we will go through a proof to show that it actually converges, rather than just drawing some pictures.</p>

<p>Yes, questions?</p>

<p>Yes.</p>

<p>Exactly. So how do we compute the ELBO? The ELBO is exactly this term…</p>

<p>Exactly.</p>

<p>So in the next class, we’ll see an example of how we apply this to Gaussian Mixture Models, where it might be more clear. But for now, for the purpose of this lecture, it’s enough to have this abstract view of how the EM algorithm works in general. In the next class, we’ll apply it to Gaussian Mixture Models and see exactly how each of the steps works.</p>

<p>Is there any other question?</p>

<p>Yes.</p>

<p>The question is whether theta is given, as in Z equals theta transpose X.</p>

<p>Now theta… theta here is some unknown parameter of the model.</p>

<p>No, no – we don’t make any linearity assumptions; we don’t make any assumptions on the form of p of X given Z. It could be any arbitrarily complex function – it could be a neural network or something. This framework is very general, and that’s the beauty of the EM algorithm. The derivation of the EM algorithm that we saw here makes no assumption whatsoever on the form of p of X comma Z or any of the individual distributions. We just require them to be valid probability distributions; they could be arbitrarily complex and this will still hold. Yeah, good question.</p>

<p>Okay, I think we’ll stop here. In the next class, we’ll start from here and we’ll look at an example of how EM can be used to solve the Gaussian Mixture Model, and then look at the proof of convergence of EM algorithm.</p>

<h2 id="important-topics-and-keywords">Important Topics and Keywords:</h2>

<p><strong>Unsupervised Learning:</strong></p>

<ul>
  <li><strong>Definition:</strong> Learning from data without labeled examples or “supervision.”</li>
  <li><strong>Goal:</strong> Discover hidden structures, patterns, and insights within the data.</li>
  <li><strong>Example Applications:</strong> Clustering (e.g., customer segmentation), density estimation (e.g., anomaly detection).</li>
</ul>

<p><strong>K-means Clustering:</strong></p>

<ul>
  <li><strong>Goal:</strong> Partition data points into k clusters.</li>
  <li><strong>Algorithm:</strong> Iteratively assigns points to nearest centroid, then updates centroids based on assigned points.</li>
  <li><strong>Distortion Function:</strong> Objective function minimized by k-means.</li>
  <li><strong>Convergence:</strong> Guaranteed to converge to a local minimum of the distortion function.</li>
  <li><strong>Non-Convexity:</strong> Different initializations can lead to different final clusterings.</li>
</ul>

<p><strong>Density Estimation:</strong></p>

<ul>
  <li><strong>Goal:</strong> Estimate the underlying probability density function from which data points are sampled.</li>
  <li><strong>Challenge:</strong> Finding a smooth function to represent discrete observations.</li>
  <li><strong>Gaussian Mixture Model (GMM):</strong> A common approach using a weighted sum of Gaussian distributions to approximate the density.</li>
</ul>

<p><strong>Gaussian Mixture Model (GMM):</strong></p>

<ul>
  <li><strong>Assumption:</strong> Data points are generated from a mixture of k Gaussian distributions.</li>
  <li><strong>Parameters:</strong> Mixing coefficients (phi), means (mu), and covariances (sigma) of the Gaussian components.</li>
  <li><strong>Latent Variables:</strong> Cluster assignments (Zi) are unobserved.</li>
  <li><strong>Maximum Likelihood Estimation (MLE):</strong> Used to estimate model parameters.</li>
  <li><strong>Soft Clustering:</strong> Each point has a probability of belonging to each cluster.</li>
</ul>

<p><strong>Expectation Maximization (EM) Algorithm:</strong></p>

<ul>
  <li><strong>Goal:</strong> Perform MLE in the presence of latent variables.</li>
  <li><strong>E-step:</strong> Estimate posterior probabilities of latent variables given observed data and current parameter estimates.</li>
  <li><strong>M-step:</strong> Update parameter estimates to maximize the expected log-likelihood (ELBO).</li>
  <li><strong>ELBO (Evidence Lower Bound):</strong> A lower bound on the log-likelihood.</li>
  <li><strong>Jensen’s Inequality:</strong> Used to derive the EM algorithm and guarantee convergence.</li>
  <li><strong>Convergence:</strong>  Iteratively maximizes a lower bound on the log-likelihood, eventually reaching a local maximum.</li>
  <li><strong>Generality:</strong>  Applicable to a wide range of latent variable models, including GMMs.</li>
</ul>

<p><strong>Other Key Terms:</strong></p>

<ul>
  <li>Supervised Learning</li>
  <li>Reinforcement Learning</li>
  <li>Coordinate Descent</li>
  <li>Convex Function</li>
  <li>Concave Function</li>
  <li>Latent Variable</li>
  <li>Prior Probability</li>
  <li>Posterior Probability</li>
  <li>Evidence</li>
  <li>Soft k-means</li>
  <li>Jensen’s Inequality</li>
  <li>ELBO (Evidence Lower Bound)</li>
  <li>Anomaly Detection</li>
  <li>Generative Models</li>
  <li>Variational Autoencoders</li>
  <li>Generative Adversarial Networks (GANs)</li>
  <li>Flow-Based Models</li>
</ul>]]></content><author><name>Bharat</name></author><category term="Machine Learning" /><category term="cs229" /><category term="Algorithms" /><summary type="html"><![CDATA[Reference: Lecture from CS229 Lecture 17 by Anand Avati]]></summary></entry><entry><title type="html">Accelerators for Machine Learning</title><link href="https://bsbarkur.github.io/2024/03/23/accelerators.html" rel="alternate" type="text/html" title="Accelerators for Machine Learning" /><published>2024-03-23T00:00:00+00:00</published><updated>2024-03-23T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/03/23/accelerators</id><content type="html" xml:base="https://bsbarkur.github.io/2024/03/23/accelerators.html"><![CDATA[<p>Recently, <a href="https://twitter.com/rasbt">rasbt</a> open-sourced <a href="https://github.com/Lightning-AI/lightning-thunder">Thunder</a>, a new compiler for PyTorch. It can achieve a 40% speedup compared to regular PyTorch on LLM training tasks (e.g., Llama 2 7B). The Readme.md of the repo has this line “Works on single accelerators and in multi-GPU settings.”</p>

<p>In a forum that I’m part of, someone asked what was an accelerator? Hence wrote a small post on this.</p>

<h2 id="accelerators">Accelerators</h2>

<p>Accelerators are the workhorses that a computing system needs to train and run machine learning models.</p>

<p>Machine learning workloads can be classified into:</p>

<ol>
  <li>Training</li>
  <li>Inference</li>
</ol>

<p>In an LLM, when inferring using a model, the generation happens in sequence – one token at a time.</p>

<p>In this scenario, an accelerator such as a GPU is used. Nowadays, there are many accelerators.</p>

<h2 id="accelerator-glossary">Accelerator glossary</h2>

<p>CPU: Central Processing Unit</p>

<p>• GPU: Graphics Processing Unit:
    - Nvidia A100
    - Nvidia H100
    - AMD MI250
    - Intel Arc</p>

<p>• HPU: Habana Gaudi AI Processor Unit
    - Intel Gaudi AI</p>

<p>• IPU: Intelligence Processing Unit
    - Graphcore</p>

<p>• MME: Matrix Multiplication Engine</p>

<p>• QPU: Quantum Processing Unit:</p>

<p>• RDU: Reconfigurable Dataflow Unit</p>

<p>• TPU: Tensor Processing Unit:
    - Google TPU</p>

<h2 id="spectrum-in-ai-accelerators">Spectrum in AI accelerators</h2>
<p>If one reads the history of GPUs, they can understand that GPUs were not initially designed for machine learning workloads. As the market has evolved, GPU creators and designers have been adapting them for ML workloads and applications.</p>

<p>If one wants specialized AI accelerators, FPGA and ASIC vendors are also available these days. While they can be very efficient, they can also suffer from a lack of re-programmability and lead to vendor lock-in, as with Google’s TPU.</p>

<p>An illustrative figure that helps to understand the spectrum of AI accelerators can be seen in the picture below.</p>

<p><img src="/assets/accel-spectrum.png" alt="Spectrum of AI Accelerators" /></p>

<p>This reference <a href="https://towardsdatascience.com/ai-accelerators-machine-learning-algorithms-and-their-co-design-and-evolution-2676efd47179">blog</a> has a quote that explains the above spectrum:</p>

<blockquote>
  <p>AI accelerators such as Intel Habana Gaudi, AWS Trainium and AWS Inferentia fall somewhere to the right of GPUs. Habana Gaudi offers programmability, but is less versatile than a GPU so we can put it closer to GPUs. AWS Inferentia is not programmable, but offers a list of supported operations it can accelerate, if your machine learning model doesn’t support those operations then AWS Inferentia implements CPU fall back mode for those operations. So, we can put AWS inferentia further right on that spectrum.</p>
</blockquote>

<h2 id="future-innovations">Future innovations</h2>
<p>As the field of AI and LLMs grows at breakneck speed, expect more such innovations in single and multi-node accelerators. Probably I will cover them in future posts as I learn more about them.</p>]]></content><author><name>Bharat</name></author><category term="Machine Learning" /><category term="Accelerators" /><summary type="html"><![CDATA[Recently, rasbt open-sourced Thunder, a new compiler for PyTorch. It can achieve a 40% speedup compared to regular PyTorch on LLM training tasks (e.g., Llama 2 7B). The Readme.md of the repo has this line “Works on single accelerators and in multi-GPU settings.”]]></summary></entry><entry><title type="html">Understanding model architectures</title><link href="https://bsbarkur.github.io/2024/03/20/llm-model-architecture.html" rel="alternate" type="text/html" title="Understanding model architectures" /><published>2024-03-20T00:00:00+00:00</published><updated>2024-03-20T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/03/20/llm-model-architecture</id><content type="html" xml:base="https://bsbarkur.github.io/2024/03/20/llm-model-architecture.html"><![CDATA[<p>We all work with a lot of models during training, experimentation and in various repos.
So how do we understand them in detail?</p>

<p>Well, give this prompt to Gemini and it will make it simpler for you.</p>

<p>Prompt: Given this model architecture, can you please explain the LLM architecture</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
) 
</code></pre></div></div>

<h2 id="gpt2lmheadmodel-architecture-explained">GPT2LMHeadModel Architecture Explained</h2>

<p>This model architecture is based on the GPT-2 (Generative Pre-trained Transformer 2) language model with a head specifically designed for language modeling tasks. Let’s break down the components:</p>

<p><strong>1. GPT2Model:</strong></p>

<p>This is the core of the model, responsible for understanding and encoding the input text. It consists of:</p>

<ul>
  <li><strong>wte (Word Token Embedding):</strong> This layer converts each word in the input sequence into a vector representation. The size of the vocabulary is 50257, and each word is embedded into a 768-dimensional vector.</li>
  <li><strong>wpe (Word Position Embedding):</strong> This layer encodes the position of each word in the sequence. This helps the model understand the order and context of words.</li>
  <li><strong>drop (Dropout):</strong> This layer randomly drops out some neurons during training to prevent overfitting.</li>
  <li><strong>h (ModuleList):</strong> This is a stack of 12 GPT2Block modules. Each block performs the following operations:
    <ul>
      <li>
<p><strong>ln_1 (LayerNorm):</strong> Normalizes the input to improve stability and training speed.</p>
      </li>
      <li><strong>attn (GPT2Attention):</strong> This is a self-attention mechanism that allows the model to attend to different parts of the input sequence and understand the relationships between words.</li>
      <li><strong>ln_2 (LayerNorm):</strong> Normalizes the output of the attention layer.</li>
      <li><strong>mlp (GPT2MLP):</strong> This is a feed-forward neural network that further processes the encoded information.</li>
    </ul>
  </li>
  <li><strong>ln_f (LayerNorm):</strong> This final layer norm normalizes the output of the GPT2Model before it is passed to the language modeling head.</li>
</ul>

<p><strong>2. lm_head (Linear):</strong></p>

<p>This layer takes the encoded representation from the GPT2Model and predicts the next word in the sequence. It has 768 input features (the output dimension of the GPT2Model) and 50257 output features (the size of the vocabulary).</p>
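
<p>If you want to produce such an architecture dump yourself, a minimal sketch using the Hugging Face transformers library (assuming it is installed) looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from transformers import GPT2LMHeadModel

# downloads the pretrained GPT-2 weights and prints the module tree shown above
model = GPT2LMHeadModel.from_pretrained("gpt2")
print(model)
</code></pre></div></div>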

<p><strong>In summary, this GPT2LMHeadModel architecture combines the powerful GPT-2 model for understanding and encoding text with a dedicated language modeling head for predicting the next word in a sequence. This makes it suitable for various language generation tasks, such as text continuation, story writing, and dialogue generation.</strong></p>]]></content><author><name>Bharat</name></author><category term="transformers" /><category term="model architecture" /><summary type="html"><![CDATA[We all work with a lot of models during training, experimentations and in various repos. So how do we understand them in detail.]]></summary></entry><entry><title type="html">ಮಂಕುತಿಮ್ಮ ಹೇಳಿದ ಜೀವನ ಮತ್ತು ಬ್ರಹ್ಮಾಂಡದ ಸ್ವರೂಪ</title><link href="https://bsbarkur.github.io/kannada/2024/03/19/dvg-kagga-life.html" rel="alternate" type="text/html" title="ಮಂಕುತಿಮ್ಮ ಹೇಳಿದ ಜೀವನ ಮತ್ತು ಬ್ರಹ್ಮಾಂಡದ ಸ್ವರೂಪ" /><published>2024-03-19T00:00:00+00:00</published><updated>2024-03-19T00:00:00+00:00</updated><id>https://bsbarkur.github.io/kannada/2024/03/19/dvg-kagga-life</id><content type="html" xml:base="https://bsbarkur.github.io/kannada/2024/03/19/dvg-kagga-life.html"><![CDATA[<p>ಜೀವ ಜಡರೂಪ ಪ್ರಪಂಚವನದಾವುದೋ ।</p>

<p>ಆವರಿಸೆಕೊಂಡುಮಳೆನೆರೆದುಮಿಹುದಂತೆ ।।</p>

<p>ಭಾವಕೊಳಪಡದಂತೆ ಅಳತೆಗಳವಾಡದಂತೆ ।</p>

<p>ಆ ವಿಶೇಷಕೆ ನಮೆಸೊ – ಮಂಕುತಿಮ್ಮ ।।</p>

<p>This poem ponders the nature of life and the universe. The poet, Mankutimma, describes the world as a seemingly inert form (“ಜಡರೂಪ”) that is filled with life (“ಜೀವ”). He wonders what force (“ಯಾವುದೊ ಒಂದು ಶಕ್ತಿ”) pervades and animates this inert form, making it vibrant and alive.</p>

<p>The poem further emphasizes the mysterious and immeasurable nature of this life-giving force. It cannot be grasped by emotions (“ಭಾವ”) or measured by any standard (“ಅಳತೆ”).</p>

<p>In the end, the poet expresses reverence and awe (“ನಮೆಸೊ”) towards this enigmatic power that transcends human understanding.</p>

<p>Here’s a breakdown of the poem:</p>

<ul>
  <li>
    <p><strong>ಜೀವ ಜಡರೂಪ ಪ್ರಪಂಚವನದಾವುದೋ:</strong> This line questions what fills the inert world with life.</p>
  </li>
  <li><strong>ಆವರಿಸೆಕೊಂಡುಮಳೆನೆರೆದುಮಿಹುದಂತೆ:</strong> This line describes how the life-giving force pervades and fills the world, similar to rain soaking into the earth.</li>
  <li><strong>ಭಾವಕೊಳಪಡದಂತೆ ಅಳತೆಗಳವಾಡದಂತೆ:</strong> This line emphasizes that the life-giving force is beyond emotions and measurements.</li>
  <li><strong>ಆ ವಿಶೇಷಕೆ ನಮೆಸೊ – ಮಂಕುತಿಮ್ಮ:</strong> This line expresses the poet’s reverence for this extraordinary phenomenon.</li>
</ul>

<p>The poem beautifully captures the wonder and mystery of life, acknowledging the limitations of human understanding in comprehending its true essence.</p>]]></content><author><name>Bharat</name></author><category term="kannada" /><category term="kannada" /><summary type="html"><![CDATA[ಜೀವ ಜಡರೂಪ ಪ್ರಪಂಚವನದಾವುದೋ ।]]></summary></entry><entry><title type="html">ಕಂಬಾರರು ಬರೆದ ಮೂಡಲಮನೆ ಹಾಡಿನ ಅರ್ಥ</title><link href="https://bsbarkur.github.io/kannada/2024/03/16/kannada-haadu.html" rel="alternate" type="text/html" title="ಕಂಬಾರರು ಬರೆದ ಮೂಡಲಮನೆ ಹಾಡಿನ ಅರ್ಥ" /><published>2024-03-16T00:00:00+00:00</published><updated>2024-03-16T00:00:00+00:00</updated><id>https://bsbarkur.github.io/kannada/2024/03/16/kannada-haadu</id><content type="html" xml:base="https://bsbarkur.github.io/kannada/2024/03/16/kannada-haadu.html"><![CDATA[<p>ಹಾದಿ ಬೀದೆಲ್ಲಾ ತಂಪಾ ನೆರಳ… ಸೋಕಿ</p>

<p>ರೆಂಬೆ ಕೊಂಬಿಯ ಮ್ಯಾಲ ಗೂಡ ಕಟ್ಟಿದಾವ
ರೆಕ್ಕೆ ಬಲಿತ ಹಕ್ಕಿ ಗುಡಿ ಮ್ಯಾಲ ಮಲಿಗ್ಯಾವ ಮರಿ ಹಕ್ಕಿ 
ದೂರ ದೇಶದ ವಲಸಿದ ಹಕ್ಕಿಗೂ ಐತಿ ಜಾಗಾ ಒಳಗ
ಬನ್ನಿರಿ ನೀವು ನಮ್ಮ ಬಳಗ llಪll</p>

<p>ಹಳೆಯ ಭಾವಿಯ ತಳದ ನೀರಿನಾಗ
ಹಸರ ಚಿಗರತಾವ
ಬೇರಿನ ಮೊಳಕೆ ಒಡಿಯತಾವ
ಭೂತ ಪ್ರೇತಾಳ ಜೋತ ಬಾವಲಿ
ಮ್ಯಾಲ ತೂಗತಾವ
ಮರದಾಗ ಕರಗ ಕುಣಿಯತಾವ ll1ll </p>

<p>ಮರದ ಎಲಿ ನೆರಳು ಮನಿಯ ಗ್ವಾಡಿ ಮ್ಯಾಲ
ಆಡತಾವ ಆಟ 
ಮೂಡ್ಯಾವ ತೊಗಲ ಗೊಂಬಿ ಆಟ 
ಕರುಳ ಬಳ್ಳಿಯ ಕಥೆಯ ಹೇಳತಾವ
ನೋಡ್ರಿ ಶಾಂತ ಚಿತ್ತ
ನಾವು ನೀವು ಅದರ ಭಾಗ ಮಾತ್ರ ll2ll</p>

<p>So during some random net surfing, when I came across Chaitra Achaar’s rendition of the above song, I wanted to know its meaning. These stanzas are from Chandrashekhar Kambara’s famous Kannada song, which most of us will have seen in a Kannada school textbook.</p>

<p>So I fired up the Google Gemini UI and asked it with a prompt, to check how an LLM can explain the song.</p>

<p>Prompt: explain meaning of the above song</p>

<p>Here is what Gemini gives as a response:</p>

<h2 id="meaning-of-the-kannada-song">Meaning of the Kannada Song:</h2>

<p>This beautiful song paints a picture of a serene and welcoming natural environment, inviting the listener to become a part of it. Here’s a breakdown of the lyrics:</p>

<p><strong>First Stanza:</strong></p>

<ul>
  <li>
    <p><strong>“ಹಾದಿ ಬೀದೆಲ್ಲಾ ತಂಪಾ ನೆರಳ… ಸೋಕಿ”</strong>: This describes how the paths and streets are covered in cool shade, creating a soothing atmosphere.</p>
  </li>
  <li>
    <p><strong>“ರೆಂಬೆ ಕೊಂಬಿಯ ಮ್ಯಾಲ ಗೂಡ ಕಟ್ಟಿದಾವ”</strong>: Birds have built their nests on the branches, signifying the presence of life and activity.</p>
  </li>
  <li>
    <p><strong>“ರೆಕ್ಕೆ ಬಲಿತ ಹಕ್ಕಿ ಗುಡಿ ಮ್ಯಾಲ ಮಲಿಗ್ಯಾವ ಮರಿ ಹಕ್ಕಿ”</strong>: Strong-winged birds sleep on the temple roof, while young chicks rest nearby. This shows the diversity of life and the sense of security found in this place.</p>
  </li>
  <li>
    <p><strong>“ದೂರ ದೇಶದ ವಲಸಿದ ಹಕ್ಕಿಗೂ ಐತಿ ಜಾಗಾ ಒಳಗ”</strong>: Even migratory birds from faraway lands find a place to rest here, highlighting the inclusivity and welcoming nature of the environment.</p>
  </li>
  <li>
    <p><strong>“ಬನ್ನಿರಿ ನೀವು ನಮ್ಮ ಬಳಗ”</strong>: This is a direct invitation to the listener to join and become part of this harmonious community.</p>
  </li>
</ul>

<p><strong>Second Stanza:</strong></p>

<ul>
  <li>
    <p><strong>“ಹಳೆಯ ಭಾವಿಯ ತಳದ ನೀರಿನಾಗ ಹಸರ ಚಿಗರತಾವ”</strong>: Green shoots sprout from the water at the bottom of the old well, symbolizing renewal and growth.</p>
  </li>
  <li>
    <p><strong>“ಬೇರಿನ ಮೊಳಕೆ ಒಡಿಯತಾವ”</strong>: Roots break through the soil, further emphasizing the thriving life force present.</p>
  </li>
  <li>
    <p><strong>“ಭೂತ ಪ್ರೇತಾಳ ಜೋತ ಬಾವಲಿ ಮ್ಯಾಲ ತೂಗತಾವ”</strong>: Bats hang alongside ghosts and goblins, adding a touch of mystery and intrigue to the scene.</p>
  </li>
  <li>
    <p><strong>“ಮರದಾಗ ಕರಗ ಕುಣಿಯತಾವ”</strong>: Woodpeckers dance on the tree trunks, showcasing the playful and vibrant energy of the ecosystem.</p>
  </li>
</ul>

<p><strong>Third Stanza:</strong></p>

<ul>
  <li>
    <p><strong>“ಮರದ ಎಲಿ ನೆರಳು ಮನಿಯ ಗ್ವಾಡಿ ಮ್ಯಾಲ ಆಡತಾವ ಆಟ”</strong>: Children play games in the courtyard under the shade of the tree leaves, depicting the innocence and joy found in this peaceful setting.</p>
  </li>
  <li>
    <p><strong>“ಮೂಡ್ಯಾವ ತೊಗಲ ಗೊಂಬಿ ಆಟ”</strong>: Traditional games like “Togalu Gombe” are played, highlighting the cultural richness and connection to the past.</p>
  </li>
  <li>
    <p><strong>“ಕರುಳ ಬಳ್ಳಿಯ ಕಥೆಯ ಹೇಳತಾವ”</strong>: Stories of deep connection and kinship are shared, emphasizing the strong bonds within the community.</p>
  </li>
  <li>
    <p><strong>“ನೋಡ್ರಿ ಶಾಂತ ಚಿತ್ತ ನಾವು ನೀವು ಅದರ ಭಾಗ ಮಾತ್ರ”</strong>: The song concludes by reminding us to observe with a calm mind and recognize that we are all simply part of this larger, interconnected web of life.</p>
  </li>
</ul>

<p>Overall, the song celebrates the beauty and harmony of the natural world and encourages the listener to appreciate its wonder and become a responsible participant in its preservation.</p>]]></content><author><name>Bharat</name></author><category term="kannada" /><category term="kannada" /><summary type="html"><![CDATA[ಹಾದಿ ಬೀದೆಲ್ಲಾ ತಂಪಾ ನೆರಳ… ಸೋಕಿ]]></summary></entry><entry><title type="html">Programming Matrix multiplication in a functional level programming</title><link href="https://bsbarkur.github.io/2024/03/04/functional-programming-mat-mul.html" rel="alternate" type="text/html" title="Programming Matrix multiplication in a functional level programming" /><published>2024-03-04T00:00:00+00:00</published><updated>2024-03-04T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/03/04/functional-programming-mat-mul</id><content type="html" xml:base="https://bsbarkur.github.io/2024/03/04/functional-programming-mat-mul.html"><![CDATA[<p>So, thanks to one of the connections I met recently named Sasank, I have been getting a healthy dose of lisp and different way of looking at current systems.</p>

<p>Sasank has been implementing a few nifty things which he hopes will solve problems caused by the von Neumann bottleneck. For more details, the reader is requested to check out his repo at <a href="https://chsasank.com/llama.lisp/">llama.lisp</a>. This post tries to distill some of my learnings from a perusal of the projects he has been working on in his exhaustive style.</p>

<h2 id="so-what-is-von-neumann-bottleneck-">So what is von Neumann Bottleneck ?</h2>
<p>If one checks the Turing lecture paper given by Backus of BNF fame, they will understand why current imperative language paradigms do not scale and generalize well to certain mathematical properties. Backus shows, with very clear descriptions and bulleted points, the shortcomings of so-called von Neumann programming languages.</p>

<p>The famed landmark lecture by John Backus can be read here. <a href="https://cs.wellesley.edu/~cs251/s19/notes/backus-turing-lecture.pdf">Can Programming Be Liberated from the von 
Neumann Style? A Functional Style and Its 
Algebra of Programs </a></p>

<p>So a von Neumann computer simplistically is nothing but a computer having</p>
<ol>
  <li>Central Processing Unit (CPU)</li>
  <li>A store</li>
<li>And a tube that transfers the data between the store and the CPU.</li>
</ol>

<p>Backus shows how an imperative program with assignment operations causes a lot of wastage, owing to data moving to and fro in the tube of the above design. He states that “the assignment statement is the von Neumann bottleneck of programming languages and keeps us thinking in word-at-a-time terms in much the same way the computer’s bottleneck does”.</p>

<h2 id="defining-inner-product-in-functional-level-programming-style">Defining Inner product in functional level programming style</h2>
<p>So the inner product of two vectors x and y is nothing but this formulation: the sum over i of x i times y i.</p>

<p>Sasank has written a neat introduction where he shows via python how this form application works for those of us who are wired to think in terms of modern languages <a href="https://chsasank.com/llama.lisp/dev/interpreter.html">here</a>.</p>

<p>So in pythonic code below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Inner Product
IP = comp(
    insert(add),
    alpha(mul),
    trans
)
</code></pre></div></div>

<h2 id="comp-function">comp function</h2>
<p>To dig this further this let us see how he explains <code class="language-plaintext highlighter-rouge">comp</code>.</p>

<p>Comp is defined like this</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def comp(*fn_list):
    return lambda X: fn_list[0](X) if len(fn_list) == 1 else comp(*fn_list[:-1])(fn_list[-1](X))
</code></pre></div></div>

<ul>
  <li>When you define a function like comp(*fn_list), the <code class="language-plaintext highlighter-rouge">*</code> before fn_list means “collect all positional arguments into a tuple named fn_list”.</li>
  <li>Inside the function, you can use fn_list like a regular tuple, and it will contain all the arguments you passed to the function.</li>
  <li>When you call a function with an argument like *fn_list, it’s doing the opposite: it’s unpacking the tuple (or list) fn_list into separate arguments.</li>
</ul>

<h3 id="lambda-function">Lambda function</h3>

<p>Let us analyse this: <code class="language-plaintext highlighter-rouge">lambda X: fn_list[0](X)</code></p>

<p>The expression <code class="language-plaintext highlighter-rouge">lambda X: fn_list[0](X)</code> is a lambda function in Python. A lambda function is a small anonymous function that is defined with the lambda keyword, and it can take any number of arguments, but can only have one expression.</p>

<p>Here’s a breakdown of this lambda function:</p>

<ul>
  <li>lambda is the keyword that starts the definition of the lambda function.</li>
  <li>X is the argument to the lambda function. You can pass a value to the function using this argument.</li>
  <li>: separates the arguments from the body of the lambda function.</li>
  <li><code class="language-plaintext highlighter-rouge">fn_list[0](X)</code> is the body of the lambda function. It calls the first function in fn_list with X as the argument.</li>
</ul>

<p>This lambda function takes one argument X, and applies the first function in fn_list to X.</p>

<h3 id="how-comp-works">How comp works?</h3>

<p>So if the length of the fn_list argument is 1, then we return the lambda expressions detailed above.</p>

<p>Else, we branch to
<code class="language-plaintext highlighter-rouge">comp(*fn_list[:-1])(fn_list[-1](X))</code></p>

<p>The expression <code class="language-plaintext highlighter-rouge">comp(*fn_list[:-1])(fn_list[-1](X))</code> is part of the recursive function comp shown earlier. This expression is used to create a composition of functions from the list fn_list.</p>

<p>Here’s a breakdown:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">fn_list[:-1]</code> gives all elements in fn_list except for the last one. The * operator is used to unpack these elements as separate arguments to the comp function.</li>
  <li><code class="language-plaintext highlighter-rouge">comp(*fn_list[:-1])</code> recursively calls the comp function with the unpacked elements, which results in a composition of all functions in fn_list except for the last one.</li>
  <li><code class="language-plaintext highlighter-rouge">fn_list[-1]</code> applies the last function in fn_list to X.</li>
  <li><code class="language-plaintext highlighter-rouge">comp(*fn_list[:-1])(fn_list-1)</code> applies the composition of all functions except for the last one to the result of applying the last function to X.
In other words, if fn_list is a list of functions [f, g, h], then comp(*fn_list[:-1])(fn_list-1) would be equivalent to <code class="language-plaintext highlighter-rouge">f(g(h(X)))</code>. This is a way to create a composition of functions.</li>
</ul>
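
<p>Here is a tiny usage example of comp (my own, with two hypothetical helpers square and inc):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def square(x):
    return x ** 2

def inc(x):
    return x + 1

f = comp(square, inc)   # composition applies right to left: square(inc(X))
print(f(2))             # prints 9
</code></pre></div></div>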

<h2 id="trace-back-to-inner-product-definition">Trace back to inner product definition</h2>
<p>So, now we understand inner product is composition of these functions:</p>

<ol>
  <li>insert(add)</li>
  <li>alpha(mul)</li>
  <li>transpose</li>
</ol>

<h3 id="transpose-of-a-matrix">Transpose of a matrix</h3>
<p>Transpose function implementation is straightforward.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def trans(X):
    return [
        [X[n][m] for n in range(len(X))]
        for m in range(len(X[0]))
    ]
</code></pre></div></div>

<h3 id="mul-and-add-functions">Mul and Add functions</h3>

<p>The below are also straightforward.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def mul(X):
    return X[0] * X[1]

def add(X):
    return X[0] + X[1]
</code></pre></div></div>

<h3 id="alpha-function">alpha function</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def alpha(fn):
    return lambda X: [fn(x) for x in X]
</code></pre></div></div>

<p>This is nothing but mapping the function fn over every element of the list X passed to the lambda.</p>
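
<p>For example, reusing the mul defined above (a toy call of mine):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print(alpha(mul)([[1, 2], [3, 4]]))   # prints [2, 12]
</code></pre></div></div>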

<h3 id="insert-function">insert function</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def insert(f):
    return lambda X: X[0] if len(X) == 1 else f([X[0], insert(f)(X[1:])])
</code></pre></div></div>

<ul>
  <li>
    <p>lambda X:: This defines an anonymous function (a function without a name), which takes one argument X. The X is expected to be a list.</p>
  </li>
  <li>
    <p>X[0] if len(X) == 1: If the length of X is 1 (i.e., X contains only one element), the function returns the first (and only) element of X.</p>
  </li>
  <li>
    <p>else f([X[0], insert(f)(X[1:])]): If X contains more than one element, the function calls another function f with a list as an argument. This list contains the first element of X and the result of calling the function insert(f) with the rest of X (i.e., X without its first element) as an argument.</p>
  </li>
</ul>
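
<p>A quick check of insert, and then the whole inner product pipeline end to end, on my own toy inputs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print(insert(add)([1, 2, 3, 4]))   # prints 10, computed as 1 + (2 + (3 + 4))

X = [[1, 2, 3], [4, 5, 6]]
# trans(X)    gives [[1, 4], [2, 5], [3, 6]]   (pair up the coordinates)
# alpha(mul)  gives [4, 10, 18]                (multiply each pair)
# insert(add) gives 32                         (fold the products with addition)
print(IP(X))                       # prints 32
</code></pre></div></div>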

<h3 id="index-functions">index functions</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def idx_0(X):
    return X[0]

def idx_1(X):
    return X[1]
</code></pre></div></div>

<p>The indexing functions, given a list X as shown above, are simple to understand.</p>

<h3 id="distl-function">distl function</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def distl(X):
    assert len(X) == 2
    return [[X[0], z] for z in X[1]]

</code></pre></div></div>

<p>distl(X) is designed to distribute the first element of a list X across each element of the second element of X, which is expected to be a list itself.</p>

<p>Here’s a breakdown of how the function works:</p>

<ul>
  <li>The function takes one argument, X, which is expected to be a list of two elements. The first element can be of any type, and the second element should be a list.</li>
  <li>The assert len(X) == 2 statement ensures that X has exactly two elements. If X has more or fewer than two elements, the function will raise an AssertionError.</li>
  <li>The list comprehension <code class="language-plaintext highlighter-rouge">[[X[0], z] for z in X[1]]</code> creates a new list. For each element z in the second element of X (which is a list), it creates a new list [X[0], z] and adds this list to the new list.</li>
</ul>
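
<p>A toy call of mine:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print(distl([7, [1, 2, 3]]))   # prints [[7, 1], [7, 2], [7, 3]]
</code></pre></div></div>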

<h3 id="distr-function">distr function</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def distr(X):
    assert len(X) == 2
    return [[y, X[1]] for y in X[0]]
</code></pre></div></div>

<p>This must be obvious to decipher after reading the above explanation.</p>
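
<p>And the mirror-image toy call:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print(distr([[1, 2, 3], 7]))   # prints [[1, 7], [2, 7], [3, 7]]
</code></pre></div></div>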

<h3 id="cat-function">cat function</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def cat(*fns):
    return lambda X: [fn(X) for fn in fns]
</code></pre></div></div>

<p>This Python function, cat(*fns), is designed to create a new function that applies a list of functions to an input. Here’s a breakdown of how it works:</p>

<ul>
  <li>
    <p>The function takes any number of arguments, *fns, which are expected to be functions. The * operator in the function definition allows for a variable number of arguments to be passed.</p>
  </li>
  <li>
    <p>It returns a lambda function, which takes one argument X.</p>
  </li>
  <li>
    <p>This lambda function applies each function in fns to X and returns a list of the results.</p>
  </li>
</ul>

<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def square(x):
    return x**2

def cube(x):
    return x**3

f = cat(square, cube)
print(f(2))  # prints [4, 8]
</code></pre></div></div>

<h2 id="understanding-execution-of-mat-mul">understanding execution of mat mul</h2>

<p>So, now we can see how it all ties up to implement mat mul, which is the basic building block of all machine learning and deep learning algorithms and model inferencing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MM = comp(
    alpha(alpha (IP)),
    alpha(distl),
    distr,
    cat(idx_0, comp(trans, idx_1))
)
</code></pre></div></div>
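
<p>Before walking through each stage, here is an end-to-end trace on my own 2x2 toy matrices, assuming all the definitions above are in scope:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(MM([A, B]))   # prints [[19, 22], [43, 50]], the usual matrix product of A and B
</code></pre></div></div>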

<p>cat(idx_0, comp(trans, idx_1)): This function takes a list X as input and returns a new list.</p>

<p>The first element of the new list is the first element of X (obtained by idx_0(X)), and the second element is the transpose of the second element of X (obtained by comp(trans, idx_1)(X)).</p>

<p>distr: This function takes a list X as input, where X is expected to have exactly two elements. It returns a new list where each element is a list containing an element from the first element of X and the second element of X.</p>

<p>alpha(distl): This function applies the distl function to each element in a list X.</p>

<p>alpha(alpha (IP)): This function applies IP to every element of every sublist of X – a doubly nested map that computes the dot product for each row–column pair.</p>]]></content><author><name>Bharat</name></author><summary type="html"><![CDATA[So, thanks to one of the connections I met recently named Sasank, I have been getting a healthy dose of lisp and different way of looking at current systems.]]></summary></entry><entry><title type="html">Building GPT from scratch in PyTorch via understanding from Karpathy’s tutorials</title><link href="https://bsbarkur.github.io/2024/02/06/building-a-gpt-karpathy.html" rel="alternate" type="text/html" title="Building GPT from scratch in PyTorch via understanding from Karpathy’s tutorials" /><published>2024-02-06T00:00:00+00:00</published><updated>2024-02-06T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/02/06/building-a-gpt-karpathy</id><content type="html" xml:base="https://bsbarkur.github.io/2024/02/06/building-a-gpt-karpathy.html"><![CDATA[<p>Andrej Karpathy is one of the pioneers in Machine learning who served as the director of artificial intelligence and Autopilot Vision at Tesla. He currently works for OpenAI, where he specializes in deep learning and computer vision. He is well known for amazing tutorials explained via code sessions on YouTube and for his landmark, well-curated lectures at Stanford University.</p>

<p>Generative Pre-trained Transformers have set the world of AI and machine learning on fire in this decade. OpenAI’s GPT models have set the world ablaze with their nearly accurate, human-mimicking abilities on certain natural language tasks such as text summarization, code interpretation, text generation and translation.</p>

<p>In this blog, I try to capture the important notes as I dig into the code explained and orchestrated by Karpathy in his YouTube session. I played around with this code and also trained it on a tiny Kannada dataset taken from Wikipedia (for the lack of any other open-license datasets).</p>

<h2 id="youtube-link-for-lecture-by-karpathy">Youtube link for lecture by Karpathy</h2>
<p><a href="https://www.youtube.com/watch?v=kCc8FmEb1nY&amp;list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&amp;index=7&amp;t=13s">Youtube session by Karpathy on building a GPT</a></p>

<h2 id="transformer">Transformer</h2>
<p>The Transformer is the neural network that does all the heavy lifting. Its genesis lies in the 2017 paper written by Ashish Vaswani et al. at Google, called “Attention Is All You Need”.</p>

<h2 id="intro-notes-from-karpathys-session">Intro notes from Karpathy’s session</h2>
<p>Karpathy lays down the agenda: the goal of his session is not to build ChatGPT, because that is a serious production-grade system that leverages models pre-trained on internet data and chunks of documents.</p>

<p>So instead we learn how to build a transformer-based language model at the character level.</p>

<h2 id="datasets-used">Datasets used</h2>
<p>Tiny Shakespeare, a collection of all of Shakespeare’s works in a single file.</p>

<h2 id="nanogpt">nanoGPT</h2>
<p>nanoGPT is a repository for training transformers, written by Karpathy. Trained on OpenWebText, it can reproduce the performance of OpenAI’s GPT-2.</p>

<h2 id="constructing-a-language-model-to-generate-texts-characters">Constructing a language model to generate texts (characters)</h2>

<ul>
  <li>Load the dataset: read the tiny Shakespeare dataset into a single string.</li>
  <li>Take the set of the string from step 1, i.e., the unique characters appearing anywhere in the corpus.</li>
  <li>Turn the set created in step 2 into a list.</li>
  <li>Sort the list created in step 3, giving a canonical ordering of all the characters in the tiny Shakespeare dataset (see the sketch below).</li>
</ul>
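
<p>A minimal sketch of those four steps, assuming the dataset has been saved locally as <code class="language-plaintext highlighter-rouge">input.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    # step 1: read the corpus into one string
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # steps 2-4: unique characters, as a sorted list
    chars = sorted(list(set(text)))
    print(''.join(chars))  # the whole "alphabet" of the dataset
    print(len(chars))      # the vocabulary size (65 for tiny Shakespeare)
</code></pre></div></div>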

<h3 id="vocabulary-size">Vocabulary Size</h3>
<ul>
  <li>The number of characters obtained in step 4 earlier is going to be the vocabulary size.</li>
  <li>These are going to be possible elements of our sequence.</li>
  <li>Vocabulary is the possible set of characters the model can see or emit.</li>
</ul>

<h3 id="tokenizer">Tokenizer</h3>
<ul>
  <li>A tokenizer converts text into integers drawn from a vocabulary.</li>
  <li>Here we convert characters to integers using the vocabulary constructed above (see the sketch below).</li>
</ul>
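
<p>A minimal sketch of such a character-level tokenizer, mirroring the mapping used in the full script later in this post:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    # string-to-integer and integer-to-string lookup tables
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }
    encode = lambda s: [stoi[c] for c in s]          # string -&gt; list of ints
    decode = lambda l: ''.join([itos[i] for i in l]) # list of ints -&gt; string

    print(encode("hii there"))
    print(decode(encode("hii there")))  # round-trips back to "hii there"
</code></pre></div></div>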

<p>Google’s <code class="language-plaintext highlighter-rouge">Sentencepiece</code> also encodes text into integers, but with a different scheme and a different vocabulary. It is a sub-word tokenizer: you encode neither whole words nor single characters, but sub-words.</p>

<p>Open AI uses the <code class="language-plaintext highlighter-rouge">Tiktoken</code> library which uses Byte Pair Encoding (BPE) to encode tokens. This is what GPT uses.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	enc = tiktoken.get_encoding("gpt2")
	enc.n_vocab
</code></pre></div></div>

<p>This tells us that OpenAI’s GPT-2 encoding has a vocabulary of just over 50,000 tokens (50257, to be precise).</p>

<h3 id="training-and-validation-data">Training and validation data</h3>
<p>The first 90% of the data goes into the training set and the remaining 10% into the validation set.</p>

<h2 id="transformer-training">Transformer training</h2>
<p>During training, chunks of data are sampled at random from the training set. These chunks have a maximum length, called the block size; you will also see it under different names, such as context length.</p>

<p>X is the input to the transformer.
  Y is the same chunk shifted by one position, i.e., characters 1 to block_size + 1, so that Y at each position is the target for the corresponding prefix of X.</p>
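
<p>A minimal sketch of that offset, assuming <code class="language-plaintext highlighter-rouge">data</code> is the encoded training tensor from the split above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    block_size = 8
    x = data[:block_size]      # inputs
    y = data[1:block_size+1]   # targets, shifted one position to the right

    for t in range(block_size):
        context = x[:t+1]
        target = y[t]
        print(f"when input is {context.tolist()} the target is {target}")
</code></pre></div></div>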

<h3 id="preparation-script">Preparation script</h3>
<p><a href="https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/data/shakespeare_char/prepare.py">prepare.py for the character-level Shakespeare dataset in nanoGPT</a></p>

<h3 id="trainer">Trainer</h3>
<p>With a batch size of 4, each batch holds 4 independent chunks; within each chunk we loop over the block_size positions of inputs and targets.</p>

<h2 id="bigram-baseline-language-model">Bigram baseline language model</h2>
<p>Next, he shows how to implement a bigram language model using the PyTorch framework.
BigramLanguageModel is implemented as a subclass of nn.Module from PyTorch.
We construct a lookup embedding table of shape (vocab_size, vocab_size), where vocab_size is the size of the vocabulary.</p>
<ul>
  <li>Logits are read off this table using the index tensor (idx).</li>
  <li>Understand what the forward pass of an nn.Module does.</li>
  <li>Understand what F.cross_entropy does.</li>
</ul>

<h2 id="training-the-model">Training the model</h2>
<ul>
  <li>AdamW optimizer is selected</li>
  <li>Learning rate is kept as 10^-3</li>
  <li>The batch size of 4 is increased to 32, and a loop of 100 steps is attempted</li>
  <li>At each step, we sample a batch of data from train data created earlier.</li>
  <li>Logits and losses are computed</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">optimizer.zero_grad</code> - this if set to set_to_none = True will stop gradients from updating the parameters</p>

<p>If we run for 100 iterations and the printed loss keeps decreasing, optimization is definitely happening.</p>
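
<p>The canonical order of operations per step, as a minimal sketch (assuming <code class="language-plaintext highlighter-rouge">get_batch</code>, <code class="language-plaintext highlighter-rouge">model</code>, and <code class="language-plaintext highlighter-rouge">optimizer</code> are defined as in the script below):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    for step in range(100):
        xb, yb = get_batch('train')            # sample a batch
        logits, loss = model(xb, yb)           # forward pass
        optimizer.zero_grad(set_to_none=True)  # clear old gradients
        loss.backward()                        # backpropagate
        optimizer.step()                       # update parameters
        print(loss.item())
</code></pre></div></div>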

<p>The bigram notebook code, ported to a Python script:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch
    import torch.nn as nn
    from torch.nn import functional as F
    torch.manual_seed(1337)

    class BigramLanguageModel(nn.Module):

        def __init__(self, vocab_size):
            super().__init__()
            # each token directly reads off the logits for the next token from a lookup table
            self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        def forward(self, idx, targets=None):

            # idx and targets are both (B,T) tensor of integers
            logits = self.token_embedding_table(idx) # (B,T,C)
            
            if targets is None:
                loss = None
            else:
                B, T, C = logits.shape
                logits = logits.view(B*T, C)
                targets = targets.view(B*T)
                loss = F.cross_entropy(logits, targets)

            return logits, loss
        
        def generate(self, idx, max_new_tokens):
            # idx is (B, T) array of indices in the current context
            for _ in range(max_new_tokens):
                # get the predictions
                logits, loss = self(idx)
                # focus only on the last time step
                logits = logits[:, -1, :] # becomes (B, C)
                # apply softmax to get probabilities
                probs = F.softmax(logits, dim=-1) # (B, C)
                # sample from the distribution
                idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
                # append sampled index to the running sequence
                idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            return idx

    m = BigramLanguageModel(vocab_size)
    logits, loss = m(xb, yb)
    print(logits.shape)
    print(loss)

    print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
</code></pre></div></div>

<p>The code implements a bigram language model using PyTorch. A bigram language model is a statistical language model that predicts the next token in a sequence based only on the current token; here the tokens are characters.</p>

<p>The model consists of a single embedding table of shape (vocab_size, vocab_size).
For the current token, the corresponding row of the table is read off directly as the logits, i.e., an unnormalized distribution over the next token.
No separate linear layer is needed at this stage.</p>

<p>The forward method of the model takes two arguments:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">idx</code>: A tensor of integers representing the indices of the words in the current context.</li>
  <li><code class="language-plaintext highlighter-rouge">targets</code> (optional): A tensor of integers representing the indices of the target words.</li>
</ul>

<p>If <code class="language-plaintext highlighter-rouge">targets</code> is not provided, the model simply returns the logits for the next token. 
Otherwise, it also computes the cross-entropy loss between the logits and the targets.</p>

<p>The generate method of the model takes two arguments:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">idx</code>: A tensor of integers representing the indices of the words in the current context.</li>
  <li><code class="language-plaintext highlighter-rouge">max_new_tokens</code>: The maximum number of new tokens to generate.</li>
</ul>

<p>The generate method extends a sequence: given the current context, the model predicts a distribution over the next token, a token is sampled from it and appended to the context, and the process repeats until max_new_tokens new tokens have been produced.</p>

<p>The following is an example of how to use the model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    # Create a bigram language model using the vocabulary size computed
    # from the dataset earlier (not an arbitrary value like 10000).
    model = BigramLanguageModel(vocab_size)

    # Generate a sequence of 100 new tokens (characters), starting
    # from a single zero token as context.
    idx = torch.zeros((1, 1), dtype=torch.long)
    generated = model.generate(idx, max_new_tokens=100)

    # Decode the generated token ids back into text and print.
    print(decode(generated[0].tolist()))
</code></pre></div></div>

<p>This prints a sequence of 100 characters sampled from the model; before any training, the output is essentially random.</p>

<h2 id="building-a-transformer-that-can-generate-characters">Building a transformer that can generate characters</h2>

<p>The code below extends the baseline bigram model developed above into a full transformer-based model, iterating on it piece by piece.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    #hyperparameters
    batch_size = 16 # how many independent sequences will we process in parallel?
    block_size = 32 # what is the maximum context length for predictions?
    max_iters = 5000
    eval_interval = 100
    learning_rate = 1e-3
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    eval_iters = 200
    n_embd = 64
    n_head = 4
    n_layer = 4
    dropout = 0.0
    # ------------

    torch.manual_seed(1337)


    # read the corpus (assumes the tiny Shakespeare file has been saved as input.txt)
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # here are all the unique characters that occur in this text
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    #create a mapping from characters to integers
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }
    encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
    decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

    #Train and test splits
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9*len(data)) # first 90% will be train, rest val
    train_data = data[:n]
    val_data = data[n:]

    #data loading
    def get_batch(split):
        # generate a small batch of data of inputs x and targets y
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i+block_size] for i in ix])
        y = torch.stack([data[i+1:i+block_size+1] for i in ix])
        x, y = x.to(device), y.to(device)
        return x, y

    @torch.no_grad()
    def estimate_loss():
        out = {}
        model.eval()
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
        model.train()
        return out

    class Head(nn.Module):
        """ one head of self-attention """

        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B,T,C = x.shape
            k = self.key(x)   # (B,T,head_size)
            q = self.query(x) # (B,T,head_size)
            # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
            wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B,T,hs) @ (B,hs,T) -&gt; (B,T,T)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
            wei = F.softmax(wei, dim=-1) # (B, T, T)
            wei = self.dropout(wei)
            # perform the weighted aggregation of the values
            v = self.value(x) # (B,T,head_size)
            out = wei @ v # (B,T,T) @ (B,T,hs) -&gt; (B,T,hs)
            return out

    class MultiHeadAttention(nn.Module):
        """ multiple heads of self-attention in parallel """

        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embd, n_embd)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            out = self.dropout(self.proj(out))
            return out

    class FeedForward(nn.Module):
        """ a simple linear layer followed by a non-linearity """

        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        """ Transformer block: communication followed by computation """

        def __init__(self, n_embd, n_head):
            # n_embd: embedding dimension, n_head: the number of heads
            super().__init__()
            head_size = n_embd // n_head
            self.sa = MultiHeadAttention(n_head, head_size)
            self.ffwd = FeedForward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x

    # the class keeps the baseline's "bigram" name, but it is now a full transformer model
    class BigramLanguageModel(nn.Module):

        def __init__(self):
            super().__init__()
            # each token directly reads off the logits for the next token from a lookup table
            self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
            self.position_embedding_table = nn.Embedding(block_size, n_embd)
            self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(n_embd) # final layer norm
            self.lm_head = nn.Linear(n_embd, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape

            # idx and targets are both (B,T) tensor of integers
            tok_emb = self.token_embedding_table(idx) # (B,T,C)
            pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
            x = tok_emb + pos_emb # (B,T,C)
            x = self.blocks(x) # (B,T,C)
            x = self.ln_f(x) # (B,T,C)
            logits = self.lm_head(x) # (B,T,vocab_size)

            if targets is None:
                loss = None
            else:
                B, T, C = logits.shape
                logits = logits.view(B*T, C)
                targets = targets.view(B*T)
                loss = F.cross_entropy(logits, targets)

            return logits, loss

        def generate(self, idx, max_new_tokens):
            # idx is (B, T) array of indices in the current context
            for _ in range(max_new_tokens):
                # crop idx to the last block_size tokens
                idx_cond = idx[:, -block_size:]
                # get the predictions
                logits, loss = self(idx_cond)
                # focus only on the last time step
                logits = logits[:, -1, :] # becomes (B, C)
                # apply softmax to get probabilities
                probs = F.softmax(logits, dim=-1) # (B, C)
                # sample from the distribution
                idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
                # append sampled index to the running sequence
                idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            return idx

    model = BigramLanguageModel()
    m = model.to(device)
    # print the number of parameters in the model
    print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for iter in range(max_iters):

        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # generate from the model
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
</code></pre></div></div>

<p>This code implements a transformer-based language model for text generation. Here’s a simplified explanation:</p>

<ol>
  <li><strong>Data Preparation</strong>:
    <ul>
      <li>The text is tokenized into a sequence of integers, where each integer represents a unique character.</li>
      <li>The data is split into training and validation sets.</li>
    </ul>
  </li>
  <li><strong>Model Architecture</strong>:
    <ul>
      <li>The model consists of multiple layers of transformer blocks.</li>
      <li>Each transformer block contains a self-attention layer and a feed-forward layer.</li>
      <li>The self-attention layer allows the model to learn relationships between different parts of the input sequence.</li>
      <li>The feed-forward layer adds non-linearity to the model.</li>
    </ul>
  </li>
  <li><strong>Training</strong>:
    <ul>
      <li>The model is trained using the AdamW optimizer.</li>
      <li>During training, the model learns to predict the next character in a sequence given the previous characters.</li>
      <li>The loss function is the cross-entropy loss between the predicted probabilities and the actual next character.</li>
    </ul>
  </li>
  <li><strong>Evaluation</strong>:
    <ul>
      <li>The model’s performance is evaluated on the validation set periodically during training.</li>
      <li>The evaluation metric is the average loss over a set of validation examples.</li>
    </ul>
  </li>
  <li><strong>Generation</strong>:
    <ul>
      <li>After training, the model can generate new text by starting with a short sequence of characters and then predicting the next characters one by one.</li>
      <li>The model samples each next character from the softmax probability distribution (via <code class="language-plaintext highlighter-rouge">torch.multinomial</code>), rather than greedily picking the single most likely character.</li>
    </ul>
  </li>
</ol>

<p>Here’s an example of how the model can generate text:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
</code></pre></div></div>

<p>This code generates a sequence of 2000 characters starting from an empty context. The generated text is then decoded from the sequence of integers back to a string.</p>

<p>The model can generate locally coherent text, but it may contain occasional errors or repetitions: it is a small character-level model trained on a limited dataset.</p>

<h2 id="hyperparameters">Hyperparameters</h2>

<h3 id="block-size">block size</h3>
<p>A larger <code class="language-plaintext highlighter-rouge">block_size</code> allows the model to consider a longer context when making predictions, which can improve the quality of the generated text. However, a larger <code class="language-plaintext highlighter-rouge">block_size</code> also increases the computational cost of training and inference.</p>

<p>In the provided code, the <code class="language-plaintext highlighter-rouge">block_size</code> is set to 32, a deliberately small value so the lecture model trains quickly.</p>

<p><strong>n_embd</strong>: This hyperparameter specifies the dimension of the embedding vectors used to represent the input characters. A larger embedding dimension allows the model to capture more information about each character, but it also increases the computational cost of training and inference.</p>

<p><strong>n_head</strong>: This hyperparameter specifies the number of attention heads in the transformer blocks. Each attention head learns to attend to different parts of the input sequence, allowing the model to capture different types of relationships between characters. A larger number of attention heads can improve the model’s performance, but it also increases the computational cost.</p>

<p><strong>n_layer</strong>: This hyperparameter specifies the number of transformer blocks in the model. Each transformer block consists of a self-attention layer and a feed-forward layer. A larger number of layers can improve the model’s performance, but it also increases the computational cost.</p>

<p><strong>dropout</strong>: This hyperparameter specifies the dropout rate used to prevent overfitting. Dropout randomly drops out some of the neurons in the model during training, which helps to prevent the model from learning too much from the training data and generalizing poorly to new data. A higher dropout rate can help to prevent overfitting, but it can also reduce the model’s performance.</p>

<p><strong>batch size</strong>: The batch size determines how many independent sequences the model will process in parallel during training and inference. A larger batch size can improve the efficiency of training and inference, but it can also increase the memory requirements.</p>

<p><strong>block_size</strong>: As noted above, the block size caps the context: the model attends over at most <code class="language-plaintext highlighter-rouge">block_size</code> tokens at a time, so no prediction can draw on context beyond that window.</p>

<p><strong>max_iters</strong>: The <code class="language-plaintext highlighter-rouge">max_iters</code> hyperparameter specifies the maximum number of iterations for training the model. One iteration consists of a forward pass through the model on a batch of data, followed by a backward pass to compute the gradients of the loss function with respect to the model’s parameters. The model is trained for a fixed number of iterations, or until it reaches a desired level of performance.</p>

<p>In the provided code, the hyperparameters are set to the following values:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">batch_size = 16</code></li>
  <li><code class="language-plaintext highlighter-rouge">block_size = 32</code></li>
  <li><code class="language-plaintext highlighter-rouge">max_iters = 5000</code></li>
</ul>

<p>These are modest values chosen so that the model trains quickly; they can be tuned to optimize the model’s performance on a specific dataset.</p>

<p>Here’s an example to illustrate the difference between <code class="language-plaintext highlighter-rouge">batch_size</code> and <code class="language-plaintext highlighter-rouge">block_size</code>:</p>

<p>Consider a dataset of text sequences, where each sequence is 100 characters long.</p>

<ul>
  <li>If the <code class="language-plaintext highlighter-rouge">batch_size</code> is set to 16 and the <code class="language-plaintext highlighter-rouge">block_size</code> is set to 32, then the model will process 16 sequences of length 32 in parallel during each iteration.</li>
  <li>If the <code class="language-plaintext highlighter-rouge">batch_size</code> is set to 32 and the <code class="language-plaintext highlighter-rouge">block_size</code> is set to 16, then the model will process 32 sequences of length 16 in parallel during each iteration.</li>
</ul>

<p>The choice of <code class="language-plaintext highlighter-rouge">batch_size</code> and <code class="language-plaintext highlighter-rouge">block_size</code> depends on the available computational resources and the desired trade-off between efficiency and performance.</p>
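
<p>A quick shape check makes the distinction concrete; a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">get_batch</code> and the hyperparameters from the script above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    xb, yb = get_batch('train')
    print(xb.shape)  # torch.Size([16, 32]) -- (batch_size, block_size)
    print(yb.shape)  # torch.Size([16, 32]) -- same shape, shifted one step right
</code></pre></div></div>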

<h2 id="embedding-table">Embedding table</h2>

<p>What does this do?
<code class="language-plaintext highlighter-rouge">token_embedding_table = nn.Embedding(vocab_size, n_embd)</code></p>

<p>The line <code class="language-plaintext highlighter-rouge">token_embedding_table = nn.Embedding(vocab_size, n_embd)</code> creates an embedding table for the input tokens in the transformer-based language model. Here’s how it works:</p>

<ul>
  <li>
    <p><strong>Embedding Table</strong>: An embedding table is a lookup table that maps each token in the vocabulary to a fixed-size vector. The size of the embedding vector is specified by the <code class="language-plaintext highlighter-rouge">n_embd</code> hyper-parameter.</p>
  </li>
  <li>
    <p><strong>Vocabulary Size</strong>: The <code class="language-plaintext highlighter-rouge">vocab_size</code> parameter specifies the number of unique tokens in the vocabulary. In this case, the vocabulary size is equal to the number of unique characters in the text dataset.</p>
  </li>
  <li>
    <p><strong>Embedding Dimension</strong>: The <code class="language-plaintext highlighter-rouge">n_embd</code> parameter specifies the dimension of the embedding vectors. A larger embedding dimension allows the model to capture more information about each token, but it also increases the computational cost of training and inference.</p>
  </li>
</ul>

<p>When the model processes a sequence of tokens, it first looks up the corresponding embedding vectors in the embedding table. These embedding vectors are then used as input to the transformer blocks.</p>

<p>The embedding table is a learned parameter of the model. During training, the model learns to adjust the embedding vectors to better represent the relationships between tokens.</p>

<p>Here’s an example to illustrate how the embedding table works:</p>

<p>Consider a vocabulary of 100 unique characters. The embedding dimension is set to 64.</p>

<ul>
  <li>The embedding table is a 2-dimensional matrix with 100 rows and 64 columns.</li>
  <li>Each row in the embedding table corresponds to a unique character in the vocabulary.</li>
  <li>Each column in the embedding table represents a dimension of the embedding vector.</li>
</ul>

<p>When the model processes a sequence of characters, it looks up the corresponding embedding vectors in the embedding table. For example, if the sequence is “hello”, the model would look up the embedding vectors for the characters ‘h’, ‘e’, ‘l’, ‘l’, and ‘o’.</p>

<p>These embedding vectors are then used as input to the transformer blocks. The transformer blocks learn to combine the embedding vectors in a way that captures the relationships between the characters in the sequence.</p>
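
<p>A minimal sketch of that lookup for the toy setup above (100 characters, embedding dimension 64):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch
    import torch.nn as nn

    emb = nn.Embedding(100, 64)  # 100 rows, one 64-dimensional vector per character
    idx = torch.tensor([[7, 4, 11, 11, 14]])  # "hello" under a hypothetical a=0..z=25 mapping
    vectors = emb(idx)
    print(vectors.shape)  # torch.Size([1, 5, 64]) -- (batch, sequence, embedding dim)
</code></pre></div></div>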

<p>The embedding table is a crucial component of the transformer-based language model. It allows the model to represent the input tokens in a way that is suitable for learning long-range dependencies in the text.</p>

<h2 id="what-is-the-purpose-of-position-embedding-table">What is the purpose of position embedding table</h2>

<p><code class="language-plaintext highlighter-rouge">position_embedding_table = nn.Embedding(block_size, n_embd)</code></p>

<p>The purpose of the position embedding table is to encode the position of each token in a sequence. This matters because self-attention on its own is permutation-invariant: without positional information, the model has no way to know where each token sits relative to the others when deciding which tokens to attend to.</p>

<p>The position embedding table is a learned embedding table, which means that its weights are updated during training. This allows the model to learn the optimal way to encode the positional information of the tokens in the sequence.</p>

<p>The dimension of the position embedding table matches the dimension of the token embeddings (<code class="language-plaintext highlighter-rouge">n_embd</code>, which is 64 in this script and typically 512 or 1024 in larger models). Each token in the sequence is therefore represented by an <code class="language-plaintext highlighter-rouge">n_embd</code>-dimensional vector that combines the token embedding and the position embedding.</p>

<p>The position embedding table is added to the token embeddings before the transformer model is applied. This allows the model to learn to attend to the tokens in the sequence in a way that is informed by their positions.</p>
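
<p>In code, this addition is a single broadcasted sum, mirroring the forward pass in the script above (the two table names refer to the embeddings created in the model’s <code class="language-plaintext highlighter-rouge">__init__</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    T = idx.shape[1]                                     # sequence length
    tok_emb = token_embedding_table(idx)                 # (B, T, n_embd)
    pos_emb = position_embedding_table(torch.arange(T))  # (T, n_embd)
    x = tok_emb + pos_emb                                # broadcasts to (B, T, n_embd)
</code></pre></div></div>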

<p>For a diagram showing how the positional encoding enters the transformer, see <a href="https://datascience.stackexchange.com/questions/51065/what-is-the-positional-encoding-in-the-transformer-model">this Data Science Stack Exchange answer</a>.</p>

<p>The position embedding table is a key part of the transformer model, and it allows the model to learn to attend to the tokens in a sequence in a way that is informed by their positions. This is essential for the model to be able to understand the meaning of the sequence.</p>

<p><a href="https://www.youtube.com/watch?v=dichIcUZfOw">This video</a> helped me understand this visually.</p>

<h2 id="block-in-a-transformer">Block in a transformer</h2>

<p>A block in a Transformer model is a building block that performs two main operations: communication and computation. Here is the code, followed by a simplified explanation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    class Block(nn.Module):
        """ Transformer block: communication followed by computation """

        def __init__(self, n_embd, n_head):
            # n_embd: embedding dimension, n_head: the number of heads we'd like
            super().__init__()
            head_size = n_embd // n_head
            self.sa = MultiHeadAttention(n_head, head_size)
            self.ffwd = FeedForward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x
</code></pre></div></div>


<p><strong>Communication:</strong></p>
<ul>
  <li>Imagine a group of people (called “heads”) standing in a circle, passing messages to each other.</li>
  <li>Each person (head) receives information from everyone else in the circle.</li>
  <li>This communication allows each person to gain a broader understanding of the overall situation.</li>
</ul>

<p><strong>Computation:</strong></p>
<ul>
  <li>After the communication phase, each person (head) performs some calculations on the information they have gathered.</li>
  <li>They combine and process the information to make better decisions or predictions.</li>
</ul>

<p>Example:</p>

<ul>
  <li>Think of a scenario where you’re trying to predict the weather.</li>
  <li>The communication phase is like gathering data from weather stations, satellites, and other sources.</li>
  <li>The computation phase is where you process this data to make a prediction about the weather.</li>
</ul>

<p>In a Transformer model, the blocks are stacked together to form the entire model. Each block communicates and computes to learn patterns and relationships in the data. This allows the model to make predictions or solve problems.</p>

<h2 id="attention-head">Attention head</h2>

<p><code class="language-plaintext highlighter-rouge">head_size = n_embd // n_head</code></p>

<p>The line <code class="language-plaintext highlighter-rouge">head_size = n_embd // n_head</code> calculates the size of each attention head in the Transformer block.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">n_embd</code> is the embedding dimension, which is the size of the input and output vectors for each block.</li>
  <li><code class="language-plaintext highlighter-rouge">n_head</code> is the number of attention heads in the block.</li>
</ul>

<p>The floor-division (<code class="language-plaintext highlighter-rouge">//</code>) operator in Python divides two numbers and rounds the result down to a whole number.</p>

<p>So, <code class="language-plaintext highlighter-rouge">head_size = n_embd // n_head</code> gives each attention head an integer size. For the heads to tile the embedding exactly, <code class="language-plaintext highlighter-rouge">n_embd</code> should be divisible by <code class="language-plaintext highlighter-rouge">n_head</code>, so that concatenating the head outputs reconstructs a vector of dimension <code class="language-plaintext highlighter-rouge">n_embd</code>.</p>

<p>For example, if <code class="language-plaintext highlighter-rouge">n_embd</code> is 512 and <code class="language-plaintext highlighter-rouge">n_head</code> is 8, then <code class="language-plaintext highlighter-rouge">head_size</code> will be 512 // 8 = 64. This means that each attention head will have a size of 64.</p>

<p>In general, the size of the attention heads is a hyperparameter that can be tuned to achieve the best performance for a given task.</p>

<p>What is an attention head? An attention head is a component within a Transformer neural network architecture that allows the model to focus on specific parts of the input data. It helps the model learn relationships and dependencies between different parts of the input.</p>

<p>Here’s a simplified analogy to understand attention heads:</p>

<p>Imagine you’re reading a long document and trying to understand its main points. You might quickly skim through the document to get a general idea, but then you start to focus on specific sections, sentences, or even words that you find particularly relevant.</p>

<p>Attention heads work in a similar way. They allow the model to focus on important parts of the input data and ignore the less relevant parts. This helps the model learn more effectively and make better predictions.</p>

<p>In a Transformer model, there are multiple attention heads, each focusing on different aspects of the input. The outputs of these attention heads are then combined to create a more comprehensive understanding of the input.</p>

<p>For example, in a natural language processing task, different attention heads might focus on different words or phrases in a sentence. This allows the model to understand the meaning of the sentence and perform tasks like sentiment analysis or machine translation.</p>

<p>The number of attention heads in a Transformer model is a hyperparameter that can be tuned to optimize performance. More attention heads can improve performance up to a point, but they also increase the computational cost.</p>

<p>Overall, attention heads are important components of Transformer models that enable them to learn relationships and dependencies in the input data and perform complex tasks effectively.</p>

<h2 id="head-in-attention-layer-in-the-transformer">Head in attention layer in the transformer</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B,T,hs) @ (B,hs,T) -&gt; (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B,T,T) @ (B,T,hs) -&gt; (B,T,hs)
        return out
</code></pre></div></div>
<p>In the provided code, the <code class="language-plaintext highlighter-rouge">Head</code> class represents a single head of a self-attention mechanism within a transformer model. Each head is responsible for calculating attention scores between different positions in a sequence and then using these scores to compute a weighted aggregation of the input sequence.</p>

<p>Here’s a breakdown of what each part of the <code class="language-plaintext highlighter-rouge">Head</code> class does:</p>

<ol>
  <li><strong>Initialization (<code class="language-plaintext highlighter-rouge">__init__</code> method):</strong>
    <ul>
      <li><code class="language-plaintext highlighter-rouge">head_size</code>: The dimension of the output embeddings from each head.</li>
      <li><code class="language-plaintext highlighter-rouge">key</code>, <code class="language-plaintext highlighter-rouge">query</code>, and <code class="language-plaintext highlighter-rouge">value</code>: These are linear layers (also known as projections) that map the input vectors to the key, query, and value vectors used in the attention computation. All of these have the shape <code class="language-plaintext highlighter-rouge">(n_embd, head_size)</code>.</li>
      <li><code class="language-plaintext highlighter-rouge">tril</code>: A buffer that contains a pre-computed lower triangular matrix of ones. This is used to mask out the diagonal and upper triangular elements in the attention matrix, preventing self-attention at the same position.</li>
      <li><code class="language-plaintext highlighter-rouge">dropout</code>: A dropout layer used for regularization.</li>
    </ul>
  </li>
  <li><strong>Forward Pass (<code class="language-plaintext highlighter-rouge">forward</code> method):</strong>
    <ul>
      <li>The forward method takes an input sequence <code class="language-plaintext highlighter-rouge">x</code> of shape <code class="language-plaintext highlighter-rouge">(B, T, C)</code>, where <code class="language-plaintext highlighter-rouge">B</code> is the batch size, <code class="language-plaintext highlighter-rouge">T</code> is the sequence length, and <code class="language-plaintext highlighter-rouge">C</code> is the embedding dimension.</li>
      <li>The linear layers <code class="language-plaintext highlighter-rouge">key</code>, <code class="language-plaintext highlighter-rouge">query</code>, and <code class="language-plaintext highlighter-rouge">value</code> are applied to <code class="language-plaintext highlighter-rouge">x</code> to obtain the key, query, and value vectors, all of shape <code class="language-plaintext highlighter-rouge">(B, T, head_size)</code>.</li>
      <li>The attention scores (<code class="language-plaintext highlighter-rouge">wei</code>) are calculated by multiplying the query and key vectors and scaling by the inverse square root of the head size (1/sqrt(d_k)).</li>
      <li>The attention scores are masked using the <code class="language-plaintext highlighter-rouge">tril</code> buffer so that no position can attend to future positions (causal masking).</li>
      <li>The attention scores are normalized using the softmax function along the last dimension, yielding a probability distribution over the sequence positions.</li>
      <li>Dropout is applied to the attention scores.</li>
      <li>The value vectors are multiplied by the attention scores, resulting in a weighted aggregation of the input sequence. This weighted sum is the output of the head.</li>
    </ul>
  </li>
</ol>

<p>The output of the <code class="language-plaintext highlighter-rouge">Head</code> class is a sequence of vectors of shape <code class="language-plaintext highlighter-rouge">(B, T, head_size)</code>. This output is typically concatenated with the outputs of other heads (if there are multiple heads) and linearly transformed to produce the final output of the transformer layer.</p>

<p>The <code class="language-plaintext highlighter-rouge">Head</code> class allows multiple heads to attend to different aspects of the input sequence simultaneously, capturing diverse relationships between elements in the sequence. This is a key component of the transformer architecture’s ability to learn long-range dependencies and perform tasks such as machine translation and language modeling.</p>

<h2 id="references">References:</h2>

<ol>
  <li><a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin</li>
  <li><a href="https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a>, Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov</li>
  <li><a href="https://github.com/karpathy/ng-video-lecture/tree/master">Karpathy Nano GPT Lecture Code</a>, Neural Networks: Zero To Hero video lecture series.</li>
</ol>]]></content><author><name>Bharat</name></author><summary type="html"><![CDATA[Andrej Karpathy is one of the pioneers of machine learning; he served as the director of AI and Autopilot Vision at Tesla and currently works at OpenAI, where he specializes in deep learning and computer vision. He is well known for his amazing code-walkthrough tutorials on YouTube and for his landmark, well-curated lectures at Stanford University.]]></summary></entry><entry><title type="html">Grouped Query Attention</title><link href="https://bsbarkur.github.io/2024/02/04/grouped-query-attention.html" rel="alternate" type="text/html" title="Grouped Query Attention" /><published>2024-02-04T00:00:00+00:00</published><updated>2024-02-04T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/02/04/grouped-query-attention</id><content type="html" xml:base="https://bsbarkur.github.io/2024/02/04/grouped-query-attention.html"><![CDATA[]]></content><author><name>Bharat</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Llama 2: Open Foundation and Fine-Tuned Chat Models</title><link href="https://bsbarkur.github.io/2024/02/04/llama2-hasgeek-paper-session.html" rel="alternate" type="text/html" title="Llama 2: Open Foundation and Fine-Tuned Chat Models" /><published>2024-02-04T00:00:00+00:00</published><updated>2024-02-04T00:00:00+00:00</updated><id>https://bsbarkur.github.io/2024/02/04/llama2-hasgeek-paper-session</id><content type="html" xml:base="https://bsbarkur.github.io/2024/02/04/llama2-hasgeek-paper-session.html"><![CDATA[<p>In this blog, I capture my notes from the paper session on Meta’s Llama 2, conducted by the <a href="https://hasgeek.com/fifthelephant/call-for-papers/">paper reading community</a> under the aegis of the Fifth Elephant community, orchestrated by Hasgeek. Sachin and Anjineyulu presented the paper recently, and the session was a very interesting discussion of the salient, high-level points in Meta’s paper. I capture them below.</p>

<h2 id="paper-link">Paper link</h2>
<p><a href="https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/">Llama2 Paper by Meta</a></p>

<h2 id="high-level-architecture-of-the-model-training-pipeline-followed-in-llama2-model-training">High level architecture of the model training pipeline followed in LLAMA2 model training</h2>
<p><img src="https://github.com/bsbarkur/bsbarkur.github.io/assets/106684/21797517-1c11-48bd-b215-10dc0d152219" alt="High level architecture of llama2 training pipeline" /></p>

<h2 id="why-continual-pre-training-is-hard">Why continual pre-training is hard?</h2>
<p>Continual pre-trained models are difficult to train because they must learn to perform well on a wide range of tasks, often without access to a lot of data for each task.</p>

<p>This can make it difficult to find a set of hyper-parameters that work well for all tasks. Additionally, continual pre-trained models must be able to learn new tasks without forgetting what they have already learned. This is a difficult problem, as it requires the model to be able to distinguish between different tasks and to update its knowledge in a way that does not interfere with its performance on previous tasks.</p>

<p>Here are some specific challenges associated with continual pre-trained models:</p>
<ul>
  <li><strong>Data scarcity:</strong> Continual pre-trained models often have access to limited data for each task. This can make it difficult to learn to perform well on a wide range of tasks.</li>
  <li><strong>Hyper-parameter tuning:</strong> Finding a set of hyper-parameters that work well for all tasks can be difficult. This is because different tasks may require different settings in order to achieve good performance.</li>
  <li><strong>Catastrophic forgetting:</strong> Continual pre-trained models must be able to learn new tasks without forgetting what they have already learned. This is a difficult problem, as it requires the model to be able to distinguish between different tasks and to update its knowledge in a way that does not interfere with its performance on previous tasks.</li>
  <li><strong>Negative transfer:</strong> Continual pre-trained models may experience negative transfer, where learning a new task can hurt performance on previous tasks. This can be caused by the model learning to focus on features that are specific to the new task, at the cost of features that are important for previous tasks.</li>
</ul>

<h2 id="mixture-of-experts"><strong>Mixture of Experts</strong></h2>
<p>Mixture of experts is a way to scale up the parameter count while keeping inference latency low, since only a subset of experts is active per token. More details at the <a href="https://huggingface.co/blog/moe">HF blog on MoE</a>.</p>

<h2 id="responsible-ai"><strong>Responsible AI:</strong></h2>
<p>The pre-training dataset had its documents filtered for PII. So it is easier to fine-tune on the Llama 2 base model without worrying about hateful content.</p>

<h2 id="supervised-fine-tuning-sft"><strong>Supervised fine-tuning [SFT]</strong></h2>
<p>The Flan dataset by Google was used.
In addition, 27,540 instruction and response pairs were manually annotated.</p>

<p>Roughly 27k instruction and response pairs appear to be sufficient for the fine-tuning task. Whether this holds only for English or also for multilingual tasks has to be considered and evaluated.</p>

<h2 id="comparing-of-pre-trained-hyperparameters-and-the-sft-hyperparameters">Comparing of pre-trained hyperparameters and the SFT hyperparameters.</h2>

<ul>
  <li>The cosine learning rate in SFT is reduced by one order of magnitude. This is likely because in SFT we want to change the style of the responses, not add new information, hence the lower learning rate.</li>
  <li>The weight decay remains the same</li>
  <li>Sequence length also remains the same.</li>
</ul>

<p>In pre-training, we ask the model to learn the next token.
In SFT, we ask the model to learn the response tokens.
We don’t care about the instruction tokens: the loss on the prompt is zeroed out, and we backpropagate only on the response (see the sketch below).</p>
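
<p>A minimal sketch of that loss masking, assuming tokenized tensors and a hypothetical <code class="language-plaintext highlighter-rouge">prompt_len</code> marking where the response begins (the usual shift-by-one between inputs and targets is omitted for brevity):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch.nn.functional as F

    # input_ids, logits, prompt_len are assumed placeholders
    targets = input_ids.clone()
    targets[:, :prompt_len] = -100          # F.cross_entropy ignores -100 by default
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),   # (B*T, vocab)
        targets.view(-1),                   # (B*T,)
        ignore_index=-100,
    )
</code></pre></div></div>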

<h2 id="rlhf-data-collection"><strong>RLHF data collection</strong></h2>
<p>Annotators made a binary comparison of one response versus another.
They used four degrees of preference:</p>
<ul>
  <li>Significantly better</li>
  <li>Better</li>
  <li>Slightly better</li>
  <li>Unsure</li>
</ul>

<h2 id="reward-model">Reward model</h2>
<p>The reward model was used in RLHF fine-tuning for weeks, until they were confident of the improvements.</p>

<p>The ranking objective for the reward model is a binary ranking loss: L = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected) - m(r))), where r is the reward score and m(r) is a margin term.</p>

<p>If the chosen response does not out-score the rejected one by at least the margin, the loss grows large, and vice versa.</p>

<p>The margin term gives them more granular control over how strongly the reward model must separate responses of different quality.</p>
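
<p>A minimal PyTorch sketch of that loss, assuming reward-score tensors <code class="language-plaintext highlighter-rouge">r_chosen</code> and <code class="language-plaintext highlighter-rouge">r_rejected</code> and a margin tensor <code class="language-plaintext highlighter-rouge">m</code> are given:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch.nn.functional as F

    # binary ranking loss with a margin: low when the chosen response
    # out-scores the rejected one by more than the margin
    loss = -F.logsigmoid(r_chosen - r_rejected - m).mean()
</code></pre></div></div>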

<p>Only one epoch of reward-model training was done, so that it would not overfit. (DPO sometimes has this same overfitting issue.)</p>

<p>For the reward model, the learning rate is reduced by a further order of magnitude.</p>

<h2 id="rlhf-rl-training">RLHF: RL training</h2>

<ul>
  <li>In RLHF, the agent (our fine-tuned instruct LLM) in its environment (the context window) takes one action (generating the next token) from the available actions in the action space (the entire vocabulary of tokens in the LLM).</li>
</ul>

<h3 id="rejection-sampling">Rejection sampling</h3>
<ul>
  <li>Close to SFT in spirit.</li>
  <li>Given a prompt, generate 10 samples at 10 different temperatures, ask the reward model which of these samples earned the maximum reward, then fine-tune on that particular prompt-response pair (see the sketch below).</li>
</ul>
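
<p>A minimal sketch of that loop; <code class="language-plaintext highlighter-rouge">generate</code> and <code class="language-plaintext highlighter-rouge">reward_model</code> are hypothetical placeholders for the policy’s sampler and the trained reward model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    temperatures = [0.1 * (i + 1) for i in range(10)]

    sft_pairs = []
    for prompt in prompts:  # prompts: an assumed list of training prompts
        # sample one candidate response per temperature
        candidates = [generate(prompt, temperature=t) for t in temperatures]
        # keep the candidate the reward model scores highest
        best = max(candidates, key=lambda y: reward_model(prompt, y))
        sft_pairs.append((prompt, best))  # then fine-tune on these pairs, as in SFT
</code></pre></div></div>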

<h3 id="ppo---proximal-policy-optimization">PPO - proximal policy optimization</h3>
<ul>
  <li>We update our policy to get the maximum amount of reward</li>
  <li>Two reward models were trained, one for safety and one for helpfulness</li>
  <li>If the safety score was below 0.15, they did not look at helpfulness
    <ul>
      <li>In that case, they take the safety model’s output alone as the reward</li>
    </ul>
  </li>
  <li>If the safety score is above the 0.15 threshold, they determine the response to be safe and optimize for the helpfulness score.</li>
</ul>

<p>AdamW is used as the optimizer because it handles weight decay in a cleaner, decoupled way compared to the standard Adam optimizer.</p>

<h2 id="context-distillation">Context distillation</h2>
<p>In this stage, you set the context for the model using a system prompt such as “You are a safe and responsible assistant” and fine-tune on those responses.</p>

<h2 id="ghost-attention">Ghost attention:</h2>
<p>This is essentially context distillation for the dialogue setting: a synthetic instruction is prepended to all dialogue turns, and the model is then fine-tuned on the resulting conversations.</p>

<h2 id="interesting-findings">Interesting findings</h2>
<p>Temperature rescaling: higher temperatures give more creative generations, and lower temperatures give more factual ones.</p>

<p>The model understands time. For example, if you set a system prompt tied to a specific date such as 1940 and ask a question about events after that date, it may say it doesn’t know about them.</p>

<p>Emergent tool use: the model is able to do zero-shot function calling.</p>

<h2 id="difference-from-llama-1-to-llama2">Difference from llama 1 to llama2</h2>

<p>From the paper: “We adopt most of the pretraining setting and model architecture from Llama 1. We use the standard transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings (RoPE, Su et al. 2022). The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA).”</p>

<h2 id="grouped-query-attention-gqa">Grouped Query Attention GQA</h2>
<p><img src="https://github.com/bsbarkur/bsbarkur.github.io/assets/106684/224abbbc-4010-4b6b-8976-1c1112b4667e" alt="gqa" /></p>

<p><a href="https://www.youtube.com/watch?v=o68RRGxAtDo">This video</a> explains how Grouped Query Attention works.</p>

<p><img src="https://github.com/bsbarkur/bsbarkur.github.io/assets/106684/9d7bdfad-cd8f-4d68-8eca-6154baed1d9d" alt="Self Attention head Blocks in transformer" /></p>

<p>The attention score in the above diagram is calculated as follows (a code sketch follows the list):</p>

<ul>
  <li>The Query and Key matrices are multiplied.</li>
  <li>The resulting dot product is scaled by dividing by the square root of d_k.</li>
  <li>A mask can also be applied (optional).</li>
  <li>Softmax is applied to this scaled Q x K^T product.</li>
  <li>The result is multiplied by the Value matrix to get the attention output.</li>
</ul>
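
<p>A minimal PyTorch sketch of that computation; the function name and shapes are assumptions for illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V, mask=None):
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k**0.5       # (..., T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        return F.softmax(scores, dim=-1) @ V              # weighted sum of values
</code></pre></div></div>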

<p>In multi-head attention, we have <code class="language-plaintext highlighter-rouge">h</code> heads, as shown in the middle of the figure. Each head produces a scaled dot-product attention as described earlier; the head outputs are concatenated and fed into a linear layer.</p>

<h2 id="challenge-of-multi-head-attention"><strong>Challenge of Multi-head attention</strong></h2>

<p>The crux of the issue lies in the memory overhead.
Each decoding step in autoregressive models like transformers requires loading the decoder weights along with all attention keys and values.
This process is not only computationally intensive but also memory-bandwidth-intensive. As model sizes grow, this overhead grows with them, making scaling up an increasingly arduous task.</p>
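
<p>A minimal sketch of the key/value sharing that GQA uses to cut that memory traffic; the head counts and the <code class="language-plaintext highlighter-rouge">repeat_interleave</code> trick are illustrative assumptions, not Llama 2’s exact implementation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    import torch

    B, T, hs = 2, 16, 64          # batch, sequence length, head size
    n_q_heads, n_kv_heads = 8, 2  # 8 query heads share 2 key/value heads
    group = n_q_heads // n_kv_heads

    q = torch.randn(B, n_q_heads, T, hs)
    k = torch.randn(B, n_kv_heads, T, hs)   # the KV cache stores only 2 heads, not 8
    v = torch.randn(B, n_kv_heads, T, hs)

    # each kv head is reused by a group of query heads
    k = k.repeat_interleave(group, dim=1)   # (B, n_q_heads, T, hs)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / hs**0.5
    out = torch.softmax(scores, dim=-1) @ v
    print(out.shape)  # torch.Size([2, 8, 16, 64])
</code></pre></div></div>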

<p>The figure below shows how a grouped-query attention scenario works.
<img src="https://github.com/bsbarkur/bsbarkur.github.io/assets/106684/4ad9cdaf-13da-4d51-98ad-3e1e18878ffa" alt="grouped-query attention" /></p>]]></content><author><name>Bharat</name></author><summary type="html"><![CDATA[In this blog, I capture my notes from the paper session on Meta’s Llama 2, conducted by the paper reading community under the aegis of the Fifth Elephant community, orchestrated by Hasgeek. Sachin and Anjineyulu presented the paper recently, and the session was a very interesting discussion of the salient, high-level points in Meta’s paper. I capture them below.]]></summary></entry></feed>