Statistical Inferences and Creative Thinking
Raw Data And Data Aggregations By Categories
Syed Imtiaz Ahmad | 17/09/2002
Sometimes the raw data is simply overwhelming. We may not be able to
deal with it without suitable aggregation by categories, even if
those categories are not obvious, as may be seen in (1) and (5)
described previously. As before, these aggregations can help us to
see underlying patterns more easily. Consider the example of 40 data
values shown in Table 1. One may say that 40 values of some data
need not be overwhelming in many situations. That is very true. We
are not concerned with the number 40 itself but with whatever number
is found to be overwhelming in a given situation. Sometimes, it is
also a question of effectiveness. If a reduced set of things is
likely to lead us to the same conclusions as a large set, then why
not conserve our resources in measuring, handling, and computing by
using the reduced set? We may also be more efficient in our work and
more effective in conveying it.
Figure 3: Frequency distribution of heights shown as a curve indicating the
peak and tapering of the values away from the peak.
Table 1: Raw Sample Data Values
67 53 83 69 33 17 39 74 85 41 35 70 58 49 42 43 62 40 63 21
66 79 76 27 60 51 70 96 60 71 90 37 56 45 48 54 25 75 55 44
This data represents examination scores in a class. However, the
same data may represent a random sample of annual income of people
(in thousands) for some population, or their ages, or the number of
highway accidents in various regions or over a period of time, or
the number of people with AIDS in various cities, etc.
Let us summarize this data into categories or classes as shown in
Table 2.
Table 2: Summarization of Raw Data into Intervals and Frequencies
(Frequency Distribution)
| No. (i) | Class Interval | Midpoint (mi) | Frequency (f) |
| 1 | 11-20 | 15 | 1 |
| 2 | 21-30 | 25 | 3 |
| 3 | 31-40 | 35 | 5 |
| 4 | 41-50 | 45 | 7 |
| 5 | 51-60 | 55 | 8 |
| 6 | 61-70 | 65 | 7 |
| 7 | 71-80 | 75 | 5 |
| 8 | 81-90 | 85 | 3 |
| 9 | 91-100 | 95 | 1 |
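The tally in Table 2 can be reproduced mechanically. The following is a minimal Python sketch, assuming the class boundaries and the midpoint convention (15, 25, ...) used in Table 2; the helper name class_number is hypothetical and introduced only for illustration.

```python
from collections import Counter

# Raw scores copied from Table 1.
scores = [
    67, 53, 83, 69, 33, 17, 39, 74, 85, 41, 35, 70, 58, 49, 42, 43, 62, 40, 63, 21,
    66, 79, 76, 27, 60, 51, 70, 96, 60, 71, 90, 37, 56, 45, 48, 54, 25, 75, 55, 44,
]


def class_number(score):
    """Return the Table 2 class number: 11-20 -> 1, 21-30 -> 2, ..., 91-100 -> 9."""
    return (score - 1) // 10


# Count how many scores fall into each class.
counts = Counter(class_number(s) for s in scores)

table = []
for i in range(1, 10):
    low, high = 10 * i + 1, 10 * i + 10
    midpoint = 10 * i + 5  # midpoint convention taken from Table 2
    table.append((i, f"{low}-{high}", midpoint, counts[i]))

for row in table:
    print(*row, sep="\t")
```

Each printed row corresponds to one row of Table 2, and the frequencies add up to the 40 raw values.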
This summarized data, or the frequency distribution of raw data
values, appears to be more meaningful in form. It shows that most
values (32 of the 40) fall within the range 31 to 80, that the most
frequently occurring values lie in the interval 51-60, and that very
few values lie at the low or high end.
The frequency distribution is often displayed pictorially as shown
in Figure 4.
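A rough text rendering of such a picture can be produced from the frequencies alone. This is only an illustrative Python sketch, not the method used to draw the figure in this paper; the function name text_histogram and the bar character are assumptions.

```python
# Class intervals and frequencies copied from Table 2.
classes = ["11-20", "21-30", "31-40", "41-50", "51-60",
           "61-70", "71-80", "81-90", "91-100"]
freqs = [1, 3, 5, 7, 8, 7, 5, 3, 1]


def text_histogram(labels, counts, mark="#"):
    """Return one line per class, with a bar whose length equals the frequency."""
    return [f"{label:>6} | {mark * count} {count}"
            for label, count in zip(labels, counts)]


bars = text_histogram(classes, freqs)
print("\n".join(bars))
```

The bars rise to a peak at 51-60 and taper off on both sides, which is the pattern the frequency distribution graph is meant to convey.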
The frequency distribution graph in Figure 4 shows the pattern of
distribution for data values in a more noticeable form. This point
has also been discussed in the preceding section. In Figure 4, we
see that most of the values are clustered at the middle, and that
the central class, in this case, is the interval 51-60 (despite
any apparent misalignment of the horizontal scale in the figure as
drawn). The value at this central point is the midpoint of 51-60,
i.e. 55. Because the frequencies in Table 2 are symmetric about this
class, 55 is also the mean of the grouped data. This is called the
mean value.
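The mean can be checked both from the grouped data and from the raw scores. The Python sketch below assumes the usual grouped-data formula (sum of midpoint times frequency, divided by the total frequency), which the text does not state explicitly.

```python
from statistics import mean

# Raw scores copied from Table 1.
scores = [
    67, 53, 83, 69, 33, 17, 39, 74, 85, 41, 35, 70, 58, 49, 42, 43, 62, 40, 63, 21,
    66, 79, 76, 27, 60, 51, 70, 96, 60, 71, 90, 37, 56, 45, 48, 54, 25, 75, 55, 44,
]

# Midpoints and frequencies copied from Table 2.
midpoints = [15, 25, 35, 45, 55, 65, 75, 85, 95]
freqs = [1, 3, 5, 7, 8, 7, 5, 3, 1]

# Grouped mean: each midpoint stands in for every value in its class.
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)
raw_mean = mean(scores)

print(grouped_mean)  # 55.0
print(raw_mean)      # 55.725
```

The grouped figure of 55 differs from the raw mean of 55.725 by less than one point, even though the nine classes discard the individual values.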
What is the significance of this mean value? How can we use it to
think about and articulate the characteristics of the measured
scores for the sample and the general population? We will deal with
these questions in the next section. For now, we may make the
following observations. If the frequency data plot takes the form of
a sharp bell curve, then the sample mean value is a very good
estimator of the population mean value. If the bell is stretched
out, then the mean value is likely to change considerably from
sample to sample, and thus the mean value by itself cannot be a very
good indicator of the population mean. If the bell is not
symmetrical on its sides, then the sample median may become a better
indicator of the population's central value than the sample mean. We
say that the bell curve is negatively skewed if it rises slowly from
the start to its peak (a long tail on the left) and positively
skewed if it falls slowly from the peak to its right (a long tail on
the right).
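These observations can be made a little more concrete with standard measures. The Python sketch below computes the spread, the standard error of the mean, the median, and a skewness coefficient for the Table 1 scores. The standard error and the moment coefficient of skewness are common textbook measures brought in here as assumptions; they are not named in the text above.

```python
from math import sqrt
from statistics import mean, median, pstdev, stdev

# Raw scores copied from Table 1.
scores = [
    67, 53, 83, 69, 33, 17, 39, 74, 85, 41, 35, 70, 58, 49, 42, 43, 62, 40, 63, 21,
    66, 79, 76, 27, 60, 51, 70, 96, 60, 71, 90, 37, 56, 45, 48, 54, 25, 75, 55, 44,
]

n = len(scores)
m = mean(scores)
med = median(scores)

# A wider (stretched-out) bell means a larger standard deviation, and hence
# a larger standard error: the mean varies more from sample to sample.
s = stdev(scores)  # sample standard deviation (n - 1 divisor)
standard_error = s / sqrt(n)

# Moment coefficient of skewness: negative for a long left tail,
# positive for a long right tail, near zero for a symmetric bell.
sigma = pstdev(scores)  # population standard deviation (n divisor)
skewness = sum((x - m) ** 3 for x in scores) / (n * sigma ** 3)

print(f"mean={m:.3f} median={med} stdev={s:.2f}")
print(f"standard error={standard_error:.2f} skewness={skewness:.3f}")
```

For this data the mean and median nearly coincide and the skewness is close to zero, which is consistent with a roughly symmetric bell.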
As mentioned earlier, the shape of the frequency distribution for
many phenomena takes the form of a bell curve, i.e. a frequency
distribution with a peak in the middle and tapering off on both
sides symmetrically. This is called a normal distribution.
The question is how normal is a situation that looks quite normal?
We may visualize that the top points of the bars in Figure 4, when
connected together, produce a good bell shape, although in this case
it appears to be a little stretched out. We will also examine how
the shape of this bell may influence us in making abstractions or
generalizations from the given values of data.
Figure 4: Frequency Distribution Graph for Data in Tables 1 and 2
The bibliographic references on statistics as well as the Qur’an and
Sunnah provide very useful material on discerning patterns, and
reflecting on patterns to develop insights on situations that may
appear to be different on the surface but largely rest on the same
basic foundations. This is discussed elsewhere in the work related
to this paper [Ahmad et al, 1997].
Returning now to the data in Table 2, we may notice that the reduced
form of data lacks the precision seemingly contained in the original
data as listed in Table 1. However, in most practical cases this
loss of precision does not affect the resulting statistics in any
significant way. The question is whether not expressing something
precisely is undesirable. The answer in many cases is no. First, the
precision used in recording measurements may itself be misleading.
For data in Table 1, when a number such as 67 is recorded, how sure
are we that it could not be 65 or 69 or something like it? For this
particular set of score values, there may have been limitations
which make the score of 67 no more a true reflection of a student's
performance than, say, 65 or 69. We may be on much safer ground if
we make the score somewhat fuzzy, indicative of a range of values
rather than a single value. This is possibly what happens when we
assign letter grades to numeric scores recorded for students. Let us
say that we are recording temperature values in a particular place
in our home. The measuring instrument may not be very precise, so
that any value we record is suspect, but we may be more certain if
we record a range given the known limitations of the measuring
instrument used. The problem is further compounded by the fact that
different instances of the same type of measuring instrument may not
record exactly the same value. These remarks apply equally to a
student's work being graded by different instructors. It is highly
unlikely that different instructors, even teaching the same material
to different sections of a course, would give the same numeric score
for a given piece of work.
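Recording a range instead of a single value is easy to express in code. The Python sketch below is illustrative only: score_to_interval reuses the class intervals of Table 2, while reading_to_range and its tolerance parameter are hypothetical, standing in for whatever is known about an instrument's limitations.

```python
def score_to_interval(score, width=10):
    """Return the (low, high) class interval containing an integer score.

    Intervals follow Table 2: 11-20, 21-30, ..., 91-100.
    """
    low = ((score - 1) // width) * width + 1
    return (low, low + width - 1)


def reading_to_range(reading, tolerance):
    """Return the range implied by an instrument accurate to +/- tolerance.

    The tolerance is an assumed, known limitation of the instrument.
    """
    return (reading - tolerance, reading + tolerance)


# Scores of 65, 67, and 69 all fall in the same interval.
for score in (65, 67, 69):
    print(score, score_to_interval(score))

# A hypothetical temperature reading with an assumed tolerance of 0.5.
print(reading_to_range(21.5, 0.5))
```

Under this scheme the scores 65, 67, and 69 become indistinguishable, which is the point: the recorded range claims no more precision than the measurement can support.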
Making fuzzy statements about situations may often be more accurate
and meaningful than precise statements. Fuzzy values and how to draw
inferences from fuzzy values is an area that we will take up
elsewhere.
An equally important consideration is the issue of judicious use of
resources. Let us say that we are given a thousand items, and
somehow we are able to select a representative sample of, say, 20 or
fewer items. If we are able to work with this small sample to draw
whatever inferences we wish as well as we could with the full
thousand, then working with a few is both efficient and effective.
Efficient because fewer resources and less energy will be used in
working with a small sample. Effective because it is a lot easier to
grasp and articulate with a small number rather than a large number
of items. Small may turn out to be quite beautiful in many
situations.
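A small simulation illustrates the idea. In the Python sketch below the population of 1000 values is entirely hypothetical (generated at random, not data from this paper), and a simple random sample stands in for the "representative sample" of the text; how representative any one sample turns out to be is itself subject to chance.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 items (assumed, not data from this paper).
population = [random.gauss(55, 19) for _ in range(1000)]

# Simple random sample of 20 items, drawn without replacement.
sample = random.sample(population, 20)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {abs(sample_mean - population_mean):.2f}")
```

Changing the seed gives a different sample and a different gap between the two means; the size of that gap from sample to sample is what the shape of the bell, discussed above, governs.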
1. Introduction
2. Creative Thinking and Statistics
3. Raw Data And Data Aggregations By Categories
4. Measures Of Central Tendency
5. Assessing Sample Values On The Basis Of Sample Statistics
6. Conclusions
7. Cited References
