gauge
Hodgson's Paradox and the Statisticians' Ostrich [Post type: Original]
Suppose $X, Y$ are independent random variables, each following a normal distribution (for example, a measurement error, a person's height, or a person's IQ). Suppose further that $X$ and $Y$ have the same mean, though not necessarily the same variance. It is easy to show that the ratio $X/Y$ of two such random variables follows a Cauchy distribution. The Cauchy distribution has neither a mean nor a variance. This conflicts with our usual view of errors: it implies that relative error is meaningless. This contradiction is called Hodgson's paradox.
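As a quick illustration (not taken from Hodgson's paper), here is a minimal simulation sketch in Python. It assumes, for concreteness, that $X$ and $Y$ are independent standard normals with common mean 0, the case in which $X/Y$ is exactly standard Cauchy; the sample size and seed are arbitrary choices.

```python
# Minimal illustration: the running mean of normal samples settles down,
# while the running mean of the ratio of two independent normals (a
# standard Cauchy variable) does not.  Sample size and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.normal(loc=0.0, scale=1.0, size=n)   # X ~ N(0, 1)
y = rng.normal(loc=0.0, scale=1.0, size=n)   # Y ~ N(0, 1), independent of X
ratio = x / y                                # distributed as standard Cauchy

for m in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(f"n = {m:>7d}   mean of X = {x[:m].mean():+.4f}   "
          f"mean of X/Y = {ratio[:m].mean():+.4f}")

# The mean of X converges to 0 (law of large numbers); the mean of X/Y
# keeps fluctuating, since the Cauchy distribution has no expectation and
# the law of large numbers does not apply to it.
```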
This paradox comes from a 1979 paper by Hodgson in the American Journal of Physics, 47(12). Is our intuitive understanding at fault, or is something else going on? In any case we must regard relative error as meaningful, so it seems the problem should not lie with our intuition. In explaining this paradox, statisticians have adopted the famous ostrich policy: they argue that the phenomenon arises because the normal distribution is not the true distribution of those random variables but only an approximation. The example in Hodgson's paper is human height, and human height cannot truly follow a normal distribution, since height is always positive while a normal distribution allows negative values. The same argument applies to distributions of errors. Such an explanation does not amount to much, because nearly every statistical paradox can be explained away in this manner.
R. T. Hodgson. "The problem of being a normal deviate", American Journal of Physics 47(12), December 1979.
Posted: 2007-02-08, 21:41:15
星空浩淼
Re: Hodgson's Paradox and the Statisticians' Ostrich [Post type: Original]
Back when I was a technician, I took over someone else's product and changed the way the ingredients were proportioned. My basis was a different theory of statistical error (the previous technician had made a mistake common among technicians at that unit). The result was a great success: the yield improved substantially.
One may view the world with the p-eye and one may view it with the q-eye but if one opens both eyes simultaneously then one gets crazy
Posted: 2007-02-09, 04:56:07
大漠孤狼
Re: Hodgson's Paradox and the Statisticians' Ostrich [Post type: Original]
:: It is easy to show that the ratio $X/Y$ of two such random variables follows a Cauchy distribution.
:: The Cauchy distribution has neither a mean nor a variance. This conflicts with our usual view of errors:
:: it implies that relative error is meaningless.
I don't follow this part. Do you mean that if the distribution followed by $X/Y$ (supposing it were not a Cauchy distribution) did have a mean and a variance, then relative error would be meaningful and consistent with our intuition?
Posted: 2007-02-09, 23:30:52
gauge
Re: Hodgson's Paradox and the Statisticians' Ostrich [Post type: Original]
Nobody can prove what our intuition actually is. This paradox also differs from many other paradoxes: it is not a logical paradox.
Suppose we have a set of data $x_1, \ldots, x_n$ with mean $a = (x_1 + \cdots + x_n)/n$. At first glance, the relative error should be expressed as $x_i/a$.
If these data are generated by a normal random variable, then $a$ also follows a normal distribution. That is roughly where Hodgson's paradox comes from. Hodgson himself, however, does not seem to put it this way. I phrase this in tentative language because I have not read his paper word for word; the paper is short and simple, and a Google search will turn it up.
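A small numerical sketch of this reading (sample sizes, parameters, and the seed below are arbitrary choices for illustration): when the true mean is far from 0, the relative errors $x_i/a$ stay moderate; when the true mean is 0 and thus comparable to the noise, the denominator $a$ can come close to 0 and the ratios blow up.

```python
# Relative "errors" x_i / a, where a is the sample mean.  When the true
# mean is far from 0 the ratios are tame; when it is 0, the denominator a
# can come arbitrarily close to 0 and the ratios explode.
# All parameters below are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(1)

def largest_relative_error(true_mean, sigma=1.0, n=50, trials=2000):
    """Largest |x_i / a| observed over repeated samples of size n."""
    worst = 0.0
    for _ in range(trials):
        x = rng.normal(true_mean, sigma, size=n)
        a = x.mean()                 # a is itself (approximately) normal
        worst = max(worst, np.max(np.abs(x / a)))
    return worst

print("true mean 10:", largest_relative_error(10.0))  # stays of order 1
print("true mean  0:", largest_relative_error(0.0))   # occasionally enormous
```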
As for why relative error should have a mean and a variance, the reason is simple: a distribution without a mean and a variance is considered bad in statistics. Or, put differently, statisticians do not like such things. Of course, this is not an objective criterion. It is the statisticians' other ostrich.
Posted: 2007-02-10, 00:49:50
Omni
Re: Hodgson's Paradox [Post type: Original]
I did some extensive reading on this previously unfamiliar topic tonight. According to Wikipedia, the so-called "Hodgson's paradox" is defined as:
"Hodgson's paradox is the observation that the ratio of two normally distributed random variables, both with equal mean, has neither mean nor variance, and thus no well-defined expectation. This appears to be inconsistent with conventional views of error estimation."
The best statistics textbook addressing issues related to this topic (although without even mentioning this so-called paradox) is "Statistical Inference" by Casella & Berger (2nd ed., 2001). It has become my favorite statistics theory book ever since I read its excellent coverage of the "Monty Hall problem" brought up by many people at the old Shining Stars Forum:
http://www.changhai.org/bbs/load_article.php?fid=5&aid=1153705369
After reading Casella & Berger and several other online resources, I think the use of the term "paradox" was an overstatement by Hodgson. I would say that the ratio of two independent normal random variables follows an ill-behaved p.d.f. known as the Cauchy distribution. Here are my specific comments:
>> The Cauchy distribution has neither a mean nor a variance. This conflicts with our usual view of errors: it implies that relative error is meaningless.
Gauge's rewording of "Hodgson's paradox" went a step further than Hodgson's overstatement. I don't think you can conclude that the relative error (defined as the ratio of two Gaussian error random variables) is meaningless just because no moments of the Cauchy distribution exist (in other words, all of its absolute moments are infinite).
It can be easily shown that the $\mu$ parameter in the Cauchy p.d.f. is the median of the distribution although this distribution doesn't have a mean. In many statistical applications, a median is almost as important as a mean in measuring the "center" of a probability distribution! Furthermore, we can show that $\mu+\sigma$ and $\mu-\sigma$ are the quartiles of the Cauchy distribution, that is,
$$P(X \ge \mu+\sigma) = P(X \le \mu-\sigma) = \tfrac{1}{4}.$$
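Both facts follow directly from the Cauchy c.d.f. $F(x) = 1/2 + \arctan((x-\mu)/\sigma)/\pi$. As a quick numerical check, using scipy.stats (my choice of tool rather than anything mentioned in the thread) with arbitrary values for $\mu$ and $\sigma$:

```python
# Check that mu is the median of Cauchy(mu, sigma) and that mu +/- sigma
# are its quartiles.  The values of mu and sigma here are arbitrary.
from scipy.stats import cauchy

mu, sigma = 2.0, 3.0
dist = cauchy(loc=mu, scale=sigma)

print(dist.median())             # 2.0  (the median equals mu)
print(dist.cdf(mu - sigma))      # 0.25 (P(X <= mu - sigma) = 1/4)
print(1 - dist.cdf(mu + sigma))  # 0.25 (P(X >= mu + sigma) = 1/4)
```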
>> As for why relative error should have a mean and a variance, the reason is simple: a distribution without a mean and a variance is considered bad in statistics. Or, put differently, statisticians do not like such things. Of course, this is not an objective criterion. It is the statisticians' other ostrich.
The use of "median" plus "interquartile range" to describe a probability distribution is as intuitive as the use of the more familiar "mean" plus "variance" (standard deviation) combination. The use of the well-known box-and-whisker plot (box plot) to visualize the median and interquartile range is almost as popular as the use of a histogram with error bars. Of course, mean and variance are mathematically more tractable than median and quartiles. But we cannot use mathematical tractability to make a "good vs. bad" judgment in statistics.
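To make this concrete, here is a small sketch (sample sizes and seed are arbitrary choices for illustration) showing that the sample median and interquartile range of standard Cauchy data stabilize near their theoretical values (median 0, IQR 2, since the quartiles are at $\pm 1$), while the sample mean and variance never settle down.

```python
# Median and IQR are stable summaries of Cauchy samples; the sample mean
# and variance are not.  Sample sizes and the seed are arbitrary.
import numpy as np

rng = np.random.default_rng(2)

for n in (10**3, 10**4, 10**5, 10**6):
    s = rng.standard_cauchy(n)
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    print(f"n = {n:>7d}   median = {med:+.3f}   IQR = {q3 - q1:.3f}   "
          f"mean = {s.mean():+10.3f}   var = {s.var():.3e}")

# The estimated median and IQR approach 0 and 2 respectively, while the
# sample mean and variance are dominated by the occasional huge outlier.
```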
BTW, the use of error bars is limited to symmetric distributions. When statisticians deal with asymmetric distributions, they have to resort to confidence intervals. In the same spirit as my statements above, we can't say that error bars are better than confidence intervals.
Casella & Berger commented that "the Cauchy distribution plays a special role in the theory of statistics. It represents an extreme case against which conjectures can be tested. But do not make the mistake of considering the Cauchy distribution to be only a pathological case, for it has a way of turning up when you least expect it. ... Taking ratios can lead to ill-behaved distributions". Also note that the standard Cauchy(0,1) distribution arises as a special case of Student's t distribution with one degree of freedom.
In summary, I don't think there is any paradox with the Cauchy distribution. Therefore, Hodgson's so-called resolution of the paradox by saying that "random variables are never exactly Gaussian" is also meaningless. He posed a "pseudo-question" which doesn't really require an answer.
Posted: 2007-02-10, 02:04:22