Distribution attributes: an economical approach

Discuss all kinds of algorithms and data structures from both their mathematical and programming sides.

Moderators: Darobat, RecursiveS, Dante Shamest, Bugdude, Wizard

Distribution attributes: an economical approach

Bear with me, as I've a feeling this will be a long post...

Consider a frog that can jump left or right any number of units. For example, he may jump left 4 units (i.e. -4), then right 2 units (i.e. +2), then right 1 unit (i.e. +1), et cetera. If I record the lengths of all of his jumps, after a time I will have a probability mass function (as space is discrete in this problem, not continuous) that describes the distribution of his movements. With me so far? I believe I can determine the mean by a running total (called total) adding on each jump length (called d), as below:
Code: Select all
`total += d; /* for each jump */ ... total /= n; /* at end of simulation */`
where n is the total number of jumps.

If the distribution is symmetric, the variance could be calculated as
Code: Select all
`variance += pow(d, 2); /* for each jump */ ... variance /= n; /* at end of simulation */`
So, if I knew a priori that the distribution would be symmetric, this would be an economical way of calculating the variance. My question: is there a similar economical approach whereby I don't have to record the entire probability mass function for a skewed or non-symmetric distribution? I have been trying to figure out what the variance (as calculated above) actually means in the case of asymmetry, but with no success so far.

Hah. Actually not such a long post!

DannyBoy

Posts: 1160
Joined: Fri Feb 13, 2004 12:56 pm
Location: In the Billiard Room with the Lead Pipe

I think calculating it without recording the entire probability mass function means using a generating function... one that will produce/predict an estimated total t for the number of jumps n, given the values of some initial jumps... I think it would be safe to assume that would be a uniform way to calculate such a variance in the case of asymmetry... just a guess though
insert something clever
gotclout

Posts: 106
Joined: Thu Jan 06, 2005 2:22 pm

Re: Distribution attributes: an economical approach

DannyBoy wrote:Consider a frog that can jump left or right any number of units. For example, he may jump left 4 units (i.e. -4), then right 2 units (i.e. +2), then right 1 unit (i.e. +1), et cetera. If I record the lengths of all of his jumps, after a time I will have a probability mass function (as space is discrete in this problem, not continuous) that describes the distribution of his movements. With me so far?

Well... not really. There is nothing probabilistic in the problem you have described. A frog makes a series of jumps. Fine, but what is probabilistic about it? Does the frog pick the lengths from a random distribution, which you are trying to figure out? I'll continue the problem assuming that's what you mean.

I believe I can determine the mean by a running total (called total) adding on each jump length (called d), as below:
Code: Select all
`total += d; /* for each jump */ ... total /= n; /* at end of simulation */`
where n is the total number of jumps.

If the distribution is symmetric, the variance could be calculated as
Code: Select all
`variance += pow(d, 2); /* for each jump */ ... variance /= n; /* at end of simulation */`

Well, that's how you estimate the mean and the variance. You can't compute them exactly with a finite amount of data. There are reasons to divide by (n-1) instead of n, so that your estimate is unbiased. You also have to subtract the square of the mean, because the variance measures deviation from the mean, not from zero. See http://en.wikipedia.org/wiki/Variance for the details.
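In symbols (using x_i for the jump lengths, n for the number of jumps, and x-bar for the sample mean, all as in the thread), the unbiased estimate Alvaro describes is:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```

The (n-1) denominator is the usual Bessel correction for a sample variance.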

So, if I knew a priori that the distribution was to be symmetric, this would be an economical way of calculating the variance.

Actually, this works just fine even if the distribution is not symmetric. Oh, wait! I think by symmetric you mean zero mean. Well, I already answered: subtract the square of the mean at the end.

My question: is there a similar economical approach whereby I don't have to record the entire probability mass function for a skewed or non-symmetric distribution? I have been trying to figure out what variance (as calculated above) actually means in the case of asymmetry, but with no success so far.

It's just a measure of how far from the mean typical values are. It really has nothing to do with symmetry.

Alvaro
Moderator

Posts: 5185
Joined: Mon Sep 22, 2003 4:57 pm
Location: NY, USA

Re: Distribution attributes: an economical approach

Alvaro wrote:Well... not really. There is nothing probabilistic in the problem you have described. A frog makes a series of jumps. Fine, but what is probabilistic about it? Does the frog pick the lengths from a random distribution, which you are trying to figure out? I'll continue the problem assuming that's what you mean.
Apologies. That's what I mean: the jump lengths are chosen randomly.

Alvaro wrote:Actually, this works just fine even if the distribution is not symmetric. Oh, wait! I think by symmetric you mean zero mean. Well, I already answered: subtract the square of the mean at the end.
I'm afraid I mixed up "symmetric" and "zero mean", which are obviously not the same thing. Sorry.

I don't understand what you mean by "subtract the square of the mean at the end", as the variance is the average of the squared differences between the jump lengths and the mean jump length.

Alvaro wrote:It's just a measure of the how far from the mean typical values are. It really has nothing to do with symmetry.
I realise this, but in the zero-mean case the calculation of the variance is easy, and can be done on the fly. However, to calculate the variance in the non-zero-mean case I have to know the mean. Typically, you'd just take your distribution, calculate the mean, then go on to calculate the variance. But I want to know if there is an economical way of doing this as I go along, e.g. avoiding having to store the entire distribution in an array.

(Am I making sense, as I realise I probably haven't described this well?)

DannyBoy

Posts: 1160
Joined: Fri Feb 13, 2004 12:56 pm
Location: In the Billiard Room with the Lead Pipe

Believe me: subtracting the square of the mean at the end of your computation will give you the correct value.

Let's do some math. We'll call E(X) the expected value of a random variable X. The variance is defined as
Var(X) := E((X-E(X))^2)

You can prove that E((X-E(X))^2) = E(X^2) - E(X)^2. The first term is what you computed (the expected value of the square of the random variable), from which you have to subtract the square of the mean. To prove that formula, you just have to use the fact that E is linear (i.e., E(X+Y)=E(X)+E(Y) and E(k*X)=k*E(X) if X and Y are random variables and k is a real number).
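The linearity argument can be spelled out in a few lines. Writing m = E(X) for the mean (a constant, so E of it is just m), expand the square and apply linearity term by term:

```latex
\begin{aligned}
E\big((X-m)^2\big) &= E\big(X^2 - 2mX + m^2\big) \\
                   &= E(X^2) - 2m\,E(X) + m^2 \\
                   &= E(X^2) - 2m^2 + m^2 \\
                   &= E(X^2) - m^2.
\end{aligned}
```

No assumption about the distribution is used anywhere, only linearity of E.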

Alvaro
Moderator

Posts: 5185
Joined: Mon Sep 22, 2003 4:57 pm
Location: NY, USA

Some code:
[syntax="cpp"]double compute_variance(double *d, int n)
{
    double sum_x = 0.0;
    double sum_x2 = 0.0;

    for (int i = 0; i < n; ++i) {
        double x = d[i];
        sum_x += x;
        sum_x2 += x * x;
    }

    // Unbiased sample variance: subtract n * mean^2, then divide by (n - 1)
    double variance = (sum_x2 - sum_x * sum_x / n) / (n - 1);
    return variance;
}[/syntax]

It's not exactly what I told you (subtracting the square of the mean), because I also introduced the unbiasing correction and the formula ends up being slightly more complicated.

Alvaro
Moderator

Posts: 5185
Joined: Mon Sep 22, 2003 4:57 pm
Location: NY, USA

Alvaro wrote:Believe me: subtracting the square of the mean at the end of your computation will give you the correct value.

Let's do some math. We'll call E(X) the expected value of a random variable X. The variance is defined as
Var(X) := E((X-E(X))^2)

You can prove that E((X-E(X))^2) = E(X^2) - E(X)^2. The first term is what you computed (the expected value of the square of the random variable), from which you have to subtract the square of the mean. To prove that formula, you just have to use the fact that E is linear (i.e., E(X+Y)=E(X)+E(Y) and E(k*X)=k*E(X) if X and Y are random variables and k is a real number).
Thanks Alvaro, I wasn't doubting what you said, I simply didn't get it! It was E((X-E(X))^2) = E(X^2) - E(X)^2 that tripped me up. I'll have to set about proving that to myself. One further question: is E being linear making any sort of assumption about the distribution, or is E((X-E(X))^2) = E(X^2) - E(X)^2 true for all distributions?

DannyBoy

Posts: 1160
Joined: Fri Feb 13, 2004 12:56 pm
Location: In the Billiard Room with the Lead Pipe

DannyBoy wrote:[...]One further question: is E being linear making any sort of assumption about the distribution, or is E((X-E(X))^2) = E(X^2) - E(X)^2 true for all distributions?

It's true for all distributions.

Alvaro
Moderator

Posts: 5185
Joined: Mon Sep 22, 2003 4:57 pm
Location: NY, USA

Alvaro wrote:
DannyBoy wrote:[...]One further question: is E being linear making any sort of assumption about the distribution, or is E((X-E(X))^2) = E(X^2) - E(X)^2 true for all distributions?

It's true for all distributions.
Muchas gracias, amigo. You don't realise how much that has helped me!

DannyBoy

Posts: 1160
Joined: Fri Feb 13, 2004 12:56 pm
Location: In the Billiard Room with the Lead Pipe