
12.05.2024

Welcome to Mutual Information. My name is DJ. In this video we're going to talk about Jensen's inequality. It's a famous theorem that is needed in some of the most foundational algorithms of machine learning. Let me take a minute to convince you of that. First, it enables the EM algorithm, which gives us the ability to learn model parameters in the presence of missing variables. In fact, that undersells it: the EM algorithm expanded the space of models we could practically optimize by orders of magnitude, and it couldn't be done without this inequality. Also, it shows up a lot in information theory. One of the most central results is showing that the KL divergence, which is a measure of the dissimilarity between two given probability distributions, is non-negative. That theorem is then used in several other core theorems.
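That non-negativity claim is easy to check numerically. Here's a minimal sketch; the two distributions are arbitrary examples chosen for illustration, not anything from the video:

```python
import math

# Two example probability distributions over the same four outcomes
# (arbitrary numbers, purely for illustration).
p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]

# KL divergence D(p || q) = sum_i p_i * log(p_i / q_i)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Jensen's inequality guarantees this is never negative,
# with equality only when p and q are identical.
print(kl >= 0.0)  # True
```

Swapping in any other pair of valid distributions keeps `kl` non-negative, which is exactly the theorem the inequality buys us.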

These include the non-negativity of mutual information, a few bounds on the entropy of a random variable, and the fact that conditioning on one random variable always reduces the entropy of another. If these don't mean a lot to you, that's fine; they aren't necessary for this video. The point is, information theory couldn't exist without Jensen's inequality. And beyond that, variational inference leans heavily on information theory, and in machine learning, variational inference is a big deal. To obnoxiously oversimplify, it's a big group of methods used to approximately calculate otherwise intractable distributions, and those distributions are what we need to model big, complicated things that we find out there in the wild. So variational inference is yet another avalanche of utility kicked off by Jensen's inequality, and it's not super surprising that it's

everywhere. It makes a useful statement about how two super fundamental objects interact: convex functions and distributions. Whenever those two things show up together, Jensen's inequality is likely right around the corner. So all this is to say: you've got to learn this, unfortunately. Jensen's inequality is not hard to learn; I think once you see it from the right angle, it'll be hard not to think that it's true. So let's begin. But before I do, a quick caveat: I'll be explaining the form that shows up in statistics. There's another form involving a weighted average of two values, which may be simpler but isn't utilized as readily in the applications we care about, hence why I'm going with this version. Cool, okay, back to it. The inequality says: if you're given a convex function and a random variable X, then this

inequality holds, which I'll break down. First, if you don't know what a convex function is, that's all right; we'll define it visually in a minute. For now, take it as a special, simple kind of function. With that, let's explain the left side. This box is telling you how to calculate this value. All it's saying is: you draw n values of X, where n is some very big number, and then you calculate their average. Then you take that average and pass it into our convex function, giving us back a number. That number is the left side, and I'll refer to it as the output of the average input. Next, on the right side, we draw a lot of samples of X again, but this time we pass each one into the function, giving us many outputs. Then we take the average of those outputs, giving us the number on the right side. I'll refer to that as the average output.
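The two quantities just described take only a few lines to compute. A minimal sketch, using `exp` as a stand-in convex function and a uniform random variable as a stand-in for X (both are my choices, not the video's):

```python
import math
import random

random.seed(0)  # reproducible samples

# Draw n values of X (a uniform variable as an example).
n = 100_000
xs = [random.uniform(0.0, 2.0) for _ in range(n)]

# Left side: average the inputs first, then apply the convex function.
output_of_average_input = math.exp(sum(xs) / n)

# Right side: apply the function to every sample, then average the outputs.
average_output = sum(math.exp(x) for x in xs) / n

# Jensen's inequality: left side <= right side.
print(output_of_average_input <= average_output)  # True
```

Note the inequality here holds for the samples themselves, not just in the limit: it is Jensen's inequality applied to the empirical distribution of the draws.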

Okay, and now we can state it quickly: Jensen's inequality says, for a convex function, the output of the average input is less than the average output. Okay, at this point we should have the statement in our heads. We don't need to understand it; we just need the statement in our heads. In fact, pause the video until you do. Good, that's done; everyone's on the same page. Let's move on to the next part. The first thing we need is a convex function; let's go with this one. As promised, I'll tell you what it means to be convex. Well, there are a few equivalent definitions, but the one I find most intuitive, and easiest to extrapolate to higher dimensions, is that the epigraph, which is the space of points above the function, is a convex set. That means if you pick any two points in that epigraph and draw a line between

them, then every point along that line is also within the epigraph. As you can imagine, there are many different breeds of convex functions, so try to imagine the one we're using as a representative of all of them. Next, we need to add the random variable X. Here, I won't tell you which type of random variable it is, because it doesn't matter, but I will show samples of it. With that, we can calculate one side of the inequality: first, we calculate the average of X and then pass it into our function, giving us the output of the average input. Easy. Naturally, let's calculate the other value. This time we pass each sample into the function, giving us many outputs, and then we just take the average of those outputs. This is what I called earlier the average output. Also easy. And now we can see it: the output of the average input

in red is less than the average output in green. Jensen's inequality says that must be true for any convex function and any random variable. That's it. But now, what mathematicians will ask, and engineers probably won't, is why. Why is this true? Well, I can provide the main gist of the proof. Basically, you can show that the output of the average input is equal to the average output if the function is linear. So for a convex function, you find a related line, like this line. Now you know the two values are equal when the function is a line, so we ask: what happens when you bend the line back into the convex function? Noticing anything? Anything jumping out? When we move from the line to the convex function, every sampled output moves up, so their average must move up, and that gives us the inequality. Once again, this is easy.
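The proof sketch can be checked numerically. The "related line" is a supporting line that touches the convex function at the mean; every sampled output sits on or above it, and averaging then yields the inequality. A minimal sketch, again with `exp` as a stand-in convex function and Gaussian samples as an assumed example:

```python
import math
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(50_000)]
mu = sum(xs) / len(xs)  # sample mean of X

f = math.exp        # the convex function
f_prime = math.exp  # its derivative (exp is its own derivative)

def supporting_line(x):
    """The line that touches f at the mean and never rises above f."""
    return f(mu) + f_prime(mu) * (x - mu)

# Convexity: every sampled output is on or above the line...
assert all(f(x) >= supporting_line(x) for x in xs)

# ...so the average output is at least the line's average, and the
# line's average is exactly f(mu), the output of the average input.
average_output = sum(f(x) for x in xs) / len(xs)
print(average_output >= f(mu))  # True
```

The second step works because the line's average over the samples is `f(mu) + f_prime(mu) * (mean - mu)`, and `mean - mu` is zero by construction.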

Now, I'm showing this proof for one reason: I think it comes with good intuition. The inequality comes from the difference the convex function has with the line. With that, I think it's pretty easy to make some other conclusions. First, the inequality is related to the variance of the random variable: if the variance is very small, then the convex function is effectively line-like and the difference is small; if the variance is large, the difference is large. Second, the inequality is related to the curvature of the convex function: more curved functions yield bigger differences. Hopefully, from this graphic, this is all pretty obvious. And the last thing I'll say is that this inequality generalizes easily to higher dimensions: samples of X become vectors, and the function accepts a vector as input.
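The variance connection can be made exact in one special case I'll add here: for the convex function f(x) = x², the difference between the average output and the output of the average input is precisely the variance of X. A minimal sketch of that, with arbitrary example distributions:

```python
import random

random.seed(2)

def jensen_gap(xs):
    """E[X^2] - (E[X])^2: the gap between the two sides of Jensen's
    inequality for f(x) = x^2, which is exactly the sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

small_spread = [random.gauss(0.0, 0.1) for _ in range(20_000)]
large_spread = [random.gauss(0.0, 2.0) for _ in range(20_000)]

# More spread in X means a bigger gap between the two sides.
print(jensen_gap(small_spread) < jensen_gap(large_spread))  # True
```

For other convex functions the gap is no longer exactly the variance, but the same qualitative picture holds: more spread, or more curvature, means a bigger gap.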

In the 2D case, this convex function would be a bowl-like thing. Given how little the idea changes and how hard it is to animate, I'll just leave that to your imagination. And there you have it: that's all there is to Jensen's inequality. If you didn't know it before, you can now follow a new logical step in the justification for those foundational algorithms. And finally, thank you for your focus. If you like content like this and would like to learn more about machine learning and statistics, please like and subscribe. Content like this is the content I'll continue to make, especially if I get your support.