Statistical Inferences and Creative Thinking
Measures Of Central Tendency
Syed Imtiaz Ahmad | 17/09/2002
We begin by raising issues and asking questions about phenomena in
nature: things that are not necessarily man-made, although man may
have contributed positively or negatively to them. For our
purpose, phenomena are things that we observe in time and space, e.g.
the amount of rainfall, people dying in accidents, the incidence of
certain diseases, the performance of students in classes, and so on.
A thoughtful question to ask is whether there is an observable pattern
or perceived nature of things in the phenomena. Why should we look
for patterns or perceived nature of things in phenomena? From the
thinking perspective, it is simply to gain insight into what
happens. For example, it may be comforting simply to know the
pattern that rainfall follows in a given region as we move in time
through the year, or to know the pattern of rainfall as we move in
space from region to region.
Knowledge brings comfort. Knowing what may happen next, in time or
space, is at least comforting even if we are not able to do much
about it. Knowledge is also a potential source of power. We can use
the knowledge to be in the right place at the right time, improve the
quality of life, make financial gains, or avoid financial losses.
In addition to discovering useful features of phenomena and making
use of these features in times of need, we may also examine whether
certain phenomena show a cause-and-effect relationship.
Once a pattern is discovered, it can be refined with repeated
observations and used as a basis for making inferences about new
observations.
Experience has shown that the observed values of features in many
phenomena have a certain tendency, i.e. the values tend to cluster
around what may be called a mid-point and gradually disperse away
from this mid-point in decreasing numbers. This is called the central
tendency. Of course, we assume that what we are observing is random,
i.e. it is based on the nature of things that happen on their own
rather than being driven by some systematic force aligning the values
in its favor. The average of all observed values, also called the
mean, is a measure of central tendency in that it identifies a
mid-point. However, without knowing how the values disperse away from
this mid-point, the mean value by itself may not serve much useful
purpose. The dispersion, or variability, of the values is measured by
a statistic called the variance. The mean and the variance are common
measures of central tendency and variability, respectively, in
observed values. These are the statistics of common choice when the
observed values behave as discussed in this paragraph and the sample
of values is randomly drawn so as to make it representative of the
population at large.
Let the random sample of data values be denoted as $x_1, x_2, x_3,
\ldots, x_n$ for n values of data. Then the mean value is
calculated as:
$$\text{mean} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \qquad (8)$$
We may also denote the mean as:
$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (9)$$
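As a minimal sketch of equations (8) and (9), the mean can be computed as below. The list `scores` is a hypothetical set of values made up for illustration; it is not the data of Table 1.

```python
# Hypothetical scores for illustration only; NOT the values from Table 1.
scores = [40, 50, 55, 60, 70]

def mean(values):
    """Equations (8)/(9): the sum of the values divided by their count n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n

print(mean(scores))  # prints 55.0
```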
Applying equation (8) or (9) to the data in Table 1, the calculated
value is:
mean = 55.723, or
mean = 55.73 (approximate but a better choice) (10)
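Equations (8) and (9) can be easily adapted for the summarized
(frequency distribution) data in Table 2, as described below. Here
$x_j$ is the value representing class $j$, $f_j$ is its frequency,
and $k$ is the number of classes.
$$\text{mean} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_k x_k}{f_1 + f_2 + f_3 + \cdots + f_k} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_k x_k}{n} \qquad (11)$$
or
$$\text{mean} = \frac{1}{n}\sum_{j=1}^{k} f_j x_j \qquad (12)$$
Using (11) or (12), the calculated value is:
mean = 55 (13)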
Note that the value in (13) differs only marginally from the value in
(10). Generally, classifying raw data into frequency distribution
form has only a marginal effect on the calculated statistics.
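The effect of grouping can be sketched as follows. The raw values, the class width of 10, and the use of class midpoints as the representative values are all assumptions made for illustration; they are not the data or the classes of Table 1 and Table 2.

```python
# Hypothetical raw scores for illustration only; NOT the data of Table 1 or Table 2.
raw = [41, 44, 47, 52, 55, 58, 63, 66, 69, 75]

def mean(values):
    """Equations (8)/(9): mean of the raw values."""
    return sum(values) / len(values)

def class_frequencies(values, lower, width, k):
    """Count how many values fall in each of k classes of equal width."""
    freqs = [0] * k
    for v in values:
        freqs[int((v - lower) // width)] += 1
    return freqs

def grouped_mean(midpoints, freqs):
    """Equations (11)/(12): sum of f_j * x_j divided by n = sum of f_j."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n

# Assumed classes: 40-49, 50-59, 60-69, 70-79, each represented by its midpoint.
lower, width, k = 40, 10, 4
freqs = class_frequencies(raw, lower, width, k)
midpoints = [lower + j * width + (width - 1) / 2 for j in range(k)]

print(mean(raw), grouped_mean(midpoints, freqs))  # prints 57.0 56.5
```

In this made-up case the grouped mean differs from the raw mean by 0.5, which is small compared with the class width.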
What is important here is not simply to calculate the mean but to
determine how usefully it characterizes the sample values and the
population that the sample represents. Computational aids for
statistics are within easy reach via a calculator or through
software packages in the Microsoft Windows environment. However, the
relative ease with which these statistics can be derived should not
lead to their thoughtless and misleading use.
For the data in Table 1, the calculated mean in (10) is 55.73. Let us
simply use mean=55 in our discussion. How well does this mean value
characterize the data shown in Table 1? Can we use this mean value to
say that the majority of students are scoring at or near 55? Of
course, we can answer this question by using the methods discussed in
the previous section. Here, we would like to draw inferences by using
the mean and the related statistic of variance. In addition to
using the mean to characterize the sample, we are interested in
finding whether we can generalize. After all, the purpose of
statistics is not simply to calculate a statistic. It is more to see
whether we can draw broader inferences from what we have calculated.
In addition to asking whether a sample mean characterizes the
population from which the sample is drawn, we should also ask whether
the mean we have calculated would repeat its value, or very nearly
repeat it, if we recalculated it from many more samples.
We know that for data on the entire population, the calculated mean
value is exact. For sample data from the population, the
calculated mean is only an estimate of the population mean. It is a
fair estimate of the population mean if the sample is random in
character. A sample is called random if every member of the
population has an equal chance of being included in the sample and
each selection of data is made independently of all others.
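A random sample in this sense can be sketched with Python's standard `random` module. The population below is generated artificially for illustration, and the seed and sizes are arbitrary assumptions. Each draw uses `choice`, so every member is equally likely on every draw and the draws do not depend on one another (this is sampling with replacement).

```python
import random

rng = random.Random(2002)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 scores, generated only for illustration.
population = [rng.randint(0, 100) for _ in range(1000)]

def draw_random_sample(pop, n, rng):
    """Draw n values; each draw picks any member with equal probability,
    independently of the other draws."""
    return [rng.choice(pop) for _ in range(n)]

sample = draw_random_sample(population, 50, rng)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The sample mean is only an estimate of the population mean.
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```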
How good is a sample mean in terms of the likelihood that most values
encountered in the population would be in proximity of this mean? To
find an answer, we develop a measure of variation from the mean
value. Let us start by determining how sample values deviate from the
calculated mean. We calculate the difference of each sample value
from the mean. The differences, or deviations, take both positive and
negative values depending on whether a sample value is larger or
smaller than the calculated mean. If we simply sum the differences,
the positive values cancel the negative values and the total is
always zero, so it tells us nothing about the accumulated difference
from the mean. The accumulated difference can instead be found by
squaring each difference from the mean, summing the squares, taking
the average, and finally taking the square root to obtain a value
that is in the same units as the mean and can be compared with it.
The average of the squared differences is called the variance, and
its square root is called the standard deviation. Dividing the sum of
squared differences by the number of values, n, gives the variance of
the sample values themselves. To estimate the variance of the
population from a sample, however, we divide by n-1. Why use n-1 and
not n? We may view it as compensating for the use of sample data, and
of the sample mean in place of the unknown population mean, as
opposed to data for the entire population. Given n sample data
values, n-1 denotes the degrees of freedom. The degrees of freedom
are derived by taking the total number of values in the sample and
reducing it by the number of restrictions placed on calculations from
the sample data values. Here there is one restriction: the
differences from the sample mean must sum to zero. Once we have found
n-1 differences, the nth difference cannot vary freely; it is fixed
by the other n-1. The calculation of standard deviation may now be
described as follows:
Variance (of differences from the mean):
$$\text{var} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \text{mean})^2 \qquad (14)$$
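Equation (14) and the degrees-of-freedom argument can be sketched as follows. The list `scores` is hypothetical and is not the data of Table 1. The standard-library function `statistics.variance` also divides by n-1 and is used here only as a cross-check.

```python
import statistics

# Hypothetical scores for illustration only; NOT the values from Table 1.
scores = [40, 50, 55, 60, 70]

n = len(scores)
m = sum(scores) / n
deviations = [x - m for x in scores]

# The deviations from the sample mean always sum to zero, so the last one
# is fixed once the first n-1 are known: one restriction, n-1 degrees of freedom.
last_from_others = -sum(deviations[:-1])

def sample_variance(values):
    """Equation (14): sum of squared differences from the mean, divided by n - 1."""
    count = len(values)
    if count < 2:
        raise ValueError("at least two values are needed")
    centre = sum(values) / count
    return sum((x - centre) ** 2 for x in values) / (count - 1)

print(sum(deviations))                    # prints 0.0
print(last_from_others, deviations[-1])   # prints 15.0 15.0
print(sample_variance(scores))            # prints 125.0
```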
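The variance is often calculated more easily by hand with the
following form, since the mean need not be computed and rounded
before the differences are taken:
$$\text{var} = \frac{\text{sumsq} - \text{sum}^2/n}{n-1} \qquad (15)$$
where sumsq is the sum of the squares of the given data values, sum
is the sum of the given data values, and n is the number of data
values. In floating-point computation, however, this form can lose
precision when the mean is large relative to the spread of the
values. Equations (14) and (15) can be easily modified for data
represented in the form of a frequency distribution.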
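That (14) and (15) give the same value can be checked by expanding the square and substituting mean = sum/n. The frequency-distribution form on the last line is one common adaptation, assuming each class $j$ is represented by a single value $x_j$ with frequency $f_j$ and that $n$ is the total of the frequencies; the symbols follow equations (11) and (12).

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \text{mean})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\,\text{mean}\sum_{i=1}^{n} x_i + n\,\text{mean}^2 \\
  &= \text{sumsq} - 2\,\frac{\text{sum}^2}{n} + \frac{\text{sum}^2}{n} \\
  &= \text{sumsq} - \frac{\text{sum}^2}{n} \\[1ex]
\text{var} &= \frac{1}{n-1}\sum_{j=1}^{k} f_j\,(x_j - \text{mean})^2,
  \qquad n = \sum_{j=1}^{k} f_j
\end{align*}
```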
Standard deviation is then calculated by taking the square root of
the variance, i.e.
$$\text{stdev} = \sqrt{\text{var}} \qquad (16)$$
The calculated value of standard deviation for the above data is
19.28. Standard deviation is an indicator of how much the sample
values may vary from the mean. How do we use this value of standard
deviation in speaking about the mean value we have calculated? Can we
say that the mean values we may find when we take more and more
samples from the population would be near the one we have calculated?
In other words, would the mean value remain stable from sample to
sample? Can we come up with a statement of confidence about the
calculated mean value versus the mean values from other samples? To
answer these questions, we introduce a measure of variation for the
sample mean. It is called the standard error of the mean and is
calculated as follows:
Standard error of the mean:
$$\text{stderr} = \frac{\text{stdev}}{\sqrt{n}} \qquad (17)$$
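Equations (14), (16) and (17) chain together as in the sketch below. The list `scores` is hypothetical and is not the data of Table 1, so the results differ from the 19.28 quoted in the text.

```python
import math

# Hypothetical scores for illustration only; NOT the values from Table 1.
scores = [40, 50, 55, 60, 70]

def sample_variance(values):
    """Equation (14): squared differences from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def standard_deviation(values):
    """Equation (16): square root of the variance."""
    return math.sqrt(sample_variance(values))

def standard_error(values):
    """Equation (17): standard deviation divided by the square root of n."""
    return standard_deviation(values) / math.sqrt(len(values))

stdev = standard_deviation(scores)
stderr = standard_error(scores)
print(f"stdev = {stdev:.2f}, stderr = {stderr:.2f}")  # prints stdev = 11.18, stderr = 5.00
```

Because n appears under a square root, the sample size must be quadrupled to halve the standard error for the same standard deviation.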
A small value for the standard error of the mean implies that the
calculated mean will not differ much from means calculated with
other samples of the same size from the given population. This is a
significant advance in that we can generalize the mean calculated
from the given sample to mean values from other samples drawn from
the same population. If we imagine a very large, almost infinite,
number of such samples whose mean values all lie close to the one
just calculated, then this calculated mean may in fact be claimed to
be very nearly the true mean of the entire population.
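The idea of taking many samples can be tried out in a small simulation. Everything below is an assumption made for illustration: the population is generated artificially (its centre of 55 and spread of 19 are chosen only to resemble the magnitudes mentioned in the text), and the seed, the sample size of 40, and the count of 2000 repeated samples are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(17)  # fixed seed so the sketch is repeatable

# Hypothetical population, generated only for illustration.
population = [rng.gauss(55, 19) for _ in range(10_000)]
n = 40  # size of each sample

def draw_sample():
    """Draw n values, each chosen with equal probability and independently."""
    return [rng.choice(population) for _ in range(n)]

# Standard error estimated from a single sample, as in equation (17).
first = draw_sample()
stderr_first = statistics.stdev(first) / math.sqrt(n)

# Means of many further samples, and how much those means actually vary.
means = [statistics.fmean(draw_sample()) for _ in range(2000)]
spread_of_means = statistics.stdev(means)

print(f"stderr from one sample:          {stderr_first:.2f}")
print(f"observed spread of sample means: {spread_of_means:.2f}")
print(f"population mean:                 {statistics.fmean(population):.2f}")
print(f"average of the sample means:     {statistics.fmean(means):.2f}")
```

Under these assumptions, the standard error computed from one sample should be of the same order as the observed spread of the sample means, and the average of the sample means should lie close to the population mean.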
1. Introduction
2. Creative Thinking and Statistics
3. Raw Data And Data Aggregations By Categories
4. Measures Of Central Tendency
5. Assessing Sample Values On The Basis Of Sample Statistics
6. Conclusions
7. Cited References
