## Introduction

The data collected in its original form is unorganised. Hence we call it raw data. This raw data is to be organised or classified so that it will become meaningful for the purpose of further statistical analysis.

Have you ever observed sorting of letters in a post office?

Letters collected in a post office are sorted into different lots on a geographical basis. They are then put in separate bags, each containing letters with a common characteristic, viz., having the same destination. In other words, they are classified to form groups of homogeneous character.

Similarly, when you arrange your books in a certain order, they become easier to handle. You may group (classify) them according to subject, so that each subject becomes a group or a class. If you need a book, say, Economics I, you only have to search within the group ‘Economics’. Otherwise, you would have to search through the entire collection to find it.

The activity taking place in the above instances is called classification. Similarly, the raw data collected have to be organised or classified to make them useful for statistical interpretation.

## Raw Data

The data collected in its original form is highly disorganised. It is often very large and cumbersome to handle, and drawing meaningful conclusions from such raw data is a tedious task. Therefore, proper organisation and presentation of the data is required before any systematic statistical analysis is undertaken. Hence, after collecting data, the next and most important step is to organise and present it in a classified form.

Suppose you want to know the performance of students in Economics I. You collect data on the marks in Economics I of 100 students of your school. The data presented in a table will appear as follows:

Table 13.1 Marks in Economics I of 100 Students
46 44 11 10 50 55 48 36 88 41
61 58 57 56 57 48 54 57 100 40
43 69 63 60 59 58 59 64 65 51
65 29 37 51 54 55 71 81 57 91
70 50 52 49 48 49 70 71 54 54
48 46 49 54 55 54 60 59 56 66
49 45 63 63 62 61 59 59 49 65
65 41 26 25 24 27 18 24 39 23
45 52 45 44 43 45 46 12 24 34
47 45 58 57 59 58 59 60 74 24

The data shown in Table 13.1 is raw data or unclassified data. The numerical values are not shown in any order. If we want to know how many students got marks above 40, or between 40 and 60, etc., it is very difficult to find out from the table. For proper analysis we have to classify the data. Through classification, units having a common characteristic are placed in one class, and the whole data set is thus divided into a number of classes.
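To see why unordered figures are awkward, the sketch below answers both questions using only the first row of Table 13.1. The variable names are illustrative, and reading "between 40 and 60" as including both ends is an assumption.

```python
# First row of Table 13.1 (raw, unordered marks).
marks = [46, 44, 11, 10, 50, 55, 48, 36, 88, 41]

# With raw data, every question needs a full scan of all the values.
above_40 = sum(1 for m in marks if m > 40)

# Assumption: "between 40 and 60" includes both 40 and 60.
between_40_and_60 = sum(1 for m in marks if 40 <= m <= 60)

print("Above 40:", above_40)
print("Between 40 and 60:", between_40_and_60)
```

Once the data are classified, such counts can be read directly from the table instead of being recomputed each time.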

## Classification of Data

Classification of data stands for the grouping of related facts into classes. Facts in one class differ from those of another class with respect to some characteristics they possess. These characteristics form the basis of classification. The grouping into classes can be done in many ways.

## Objectives of Classification

1. To condense the data for easy understanding

2. To help comparison

3. To eliminate unnecessary details

4. To make decision making possible

5. To enable further statistical treatments

6. To identify main features of the data

## Types of Classification

Data can be classified on the following four bases:

1. Geographical, i.e., area wise

2. Chronological, i.e., on the basis of time

3. Qualitative, i.e., according to some attributes

4. Quantitative, i.e., in terms of magnitudes

i. Chronological Classification

It is the arrangement of data in ascending or descending order with reference to time. When data are observed over a period of time the type of classification is known as chronological classification. For example, population of India may be observed for a number of years and shown timewise.

13.2 Chronological Classification
Year Population
2009 567
2010 638
2011 736
2012 758

ii. Geographical Classification

In this type, data are classified on the basis of geographical location, such as countries or states. It is also called spatial classification. For example, the population of India can be shown state-wise. The production of rice in different states of India is given in the table below.

13.3 Geographical Classification
States Production of Rice
Kerala 830

iii. Qualitative Classification

Under this method, data are classified on the basis of some quality or attribute such as sex, colour of hair, literacy, religion, etc. Attributes are not measurable; only their presence or absence can be known.

Classification according to attributes may be (a) simple or (b) manifold.

In simple classification, data are divided on the basis of only one attribute. For example, the population under study may be divided into two categories on the basis of sex, as male and female. In manifold classification, data are divided on the basis of more than one attribute. For example, the population of India may be divided on the basis of sex and literacy, so that there are four groups: (1) male literate, (2) male illiterate, (3) female literate and (4) female illiterate.
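The difference between simple and manifold classification can be sketched in a few lines. The eight records below are hypothetical and serve only to illustrate the grouping; `collections.Counter` is one convenient way to count group members.

```python
from collections import Counter

# Hypothetical records: (sex, literacy) for eight people.
people = [
    ("male", "literate"), ("female", "literate"),
    ("male", "illiterate"), ("female", "literate"),
    ("male", "literate"), ("female", "illiterate"),
    ("female", "literate"), ("male", "literate"),
]

# Simple classification: one attribute (sex) only.
by_sex = Counter(sex for sex, _ in people)

# Manifold classification: both attributes taken together.
by_sex_and_literacy = Counter(people)

print(dict(by_sex))
print(dict(by_sex_and_literacy))
```

Two attributes with two categories each give the four groups listed above.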

13.4 Qualitative Classification
States Literacy
Kerala 99.5%
Karnataka 95.6%
Bihar 68%

iv. Quantitative Classification

Quantitative classification refers to the classification of data according to some quantitative measurement, such as height, weight, etc.

13.5 Quantitative Classification
Companies Sales
Hyundai 800
Tata 638
Maruti 736

Quantitative data:

Data that can be measured numerically, e.g., income, production, price, cost.

Qualitative data:

Data that cannot be measured numerically, e.g., health, intelligence, ability. Such characteristics are also termed attributes.

Before discussing the process of classification, let us consider certain terms which are commonly used in our study.

## Variables and Attributes

Variation is the order of the day. People have different life styles, habits, physical features, age, income, etc. Characteristics like height, weight, etc., are called quantitative characteristics while characteristics like sex, colour of hair, literacy, religion, etc., are called qualitative characteristics. A characteristic that can be measured numerically is called a quantitative characteristic. A characteristic that cannot be numerically measured but can only be expressed on the basis of quality or attributes is called qualitative characteristic. A quantitative characteristic which varies from unit to unit is a variable or variate. Thus weight, height, etc., are variables. Here we shall discuss the variation in characteristics which can be expressed quantitatively.

## Continuous and Discrete Variables

In the last chapter you learnt the term variable, but that does not tell you how a variable varies. A variable is a characteristic whose value is capable of changing from unit to unit. Suppose the weight of one student in a class is 45 kg and that of another student is 52 kg. The quantitative characteristic, i.e., weight, changes its value from unit to unit. Hence weight is a variable.

Different variables vary differently. In other words, they differ on the basis of specific criteria. They are broadly classified into two types:

1. Continuous
2. Discrete

A continuous variable is one that can take any numerical value. It can take integers (3, 4, 5, 6, ...), fractional values ($$\frac{1}{2}$$, $$\frac{3}{4}$$, $$\frac{2}{3}$$, ...), and values that are not exact fractions (irrational numbers) like √3 or √7. Take the example of the height of a student. The height of a student as he/she grows, say, from 80 cm to 150 cm, would take all the values in between. It can take whole-number values like 85 cm, 100 cm, 112 cm, 145 cm, etc. It can also take fractional values like 91.45 cm, 103.35 cm, 148.89 cm, etc., which are not whole numbers.

Other examples of a continuous variable are weight, time, distance etc.

Unlike continuous variables, discrete variables are those which can take only certain values. Those values are isolated and discontinuous. A discrete variable changes only by finite jumps: it jumps from one value to another but does not assume any intermediate value between them. For example, a variable like the number of employees in a firm would, across different firms, assume only whole-number values. It cannot take a fractional value like 0.5, because half an employee is absurd. Therefore, it cannot take a value like 28.5 between 28 and 29; it must be either 28 or 29. As its value changes from 28 to 29, the values in between, i.e., the fractions, are skipped.

The number of students in each plus one class in your school can be taken as an example of a discrete variable. But we should not be under the impression that a discrete variable cannot take any fractional value. Suppose X is a variable that takes values like $$\frac{1}{7}$$, $$\frac{1}{8}$$, $$\frac{1}{32}$$, $$\frac{1}{46}$$, ... Is it discrete or continuous? It is definitely discrete, because although X takes fractional values, it cannot take any value in between $$\frac{1}{7}$$ and $$\frac{1}{8}$$, or between $$\frac{1}{8}$$ and $$\frac{1}{32}$$. In other words, X cannot take continuous values.

### Statistical Series

When the items collected are arranged according to some logical order, they form a series. Statistical series may be divided into three types on the basis of their construction:

• Individual series.

• Discrete series.

• Continuous series.

#### Individual Series (Simple Array)

In this type the items are listed singly, showing the observations relating to them. Each value of the variable occurs usually once. It can be arranged either in ascending or descending order. It may also be called a simple array. For example, the wages earned by 3 workers a day can be shown in an individual series as follows:

13.6 Individual Series
Number of workers Wage (Rs)
1 500
2 600
3 550

#### Discrete Series (Frequency Array)

Certain items occur many times in the data. In a discrete series, the items are arranged (in ascending or descending order) together with the number of times each item occurs. A discrete series is also called a frequency array. In a discrete series, the statistical unit is either not divisible or is not divided. Each class is distinct and different from every other class.

13.7 Discrete Series
Number of Children per couple Number of Couples (Frequency)
0 21
1 19
2 10
Total 50

#### Continuous Series

In a continuous series, different values of the variable are stated in a continuous manner together with their frequencies. The statistical unit is capable of division and can be measured in fractions of any size. The values are expressed in class intervals, which are continuous from beginning to end.

13.8 Continuous Series
Marks (Class) Number of Students (Frequency)
0 - 10 5
10 - 20 10
20 - 30 17
30 - 40 13
40 - 50 5
Total 50

## The Array

The first step in organising raw data is to arrange them by magnitude. A mass of raw data put into an orderly arrangement by magnitude (ascending or descending order) is called an array. The following example will make this clear.

Suppose, the raw data obtained from a business unit with regard to the daily wages in rupees of 20 workers are as follows:

20, 35, 31, 33, 30, 27, 36, 21, 39, 23

34, 41, 47, 32, 46, 49, 42, 43, 52, 50

The raw data when arranged in ascending and descending order is shown below:

13.9 Array in Ascending Order
Rs Rs
20 36
21 39
23 41
27 42
30 43
31 46
32 47
33 49
34 50
35 52

13.10 Array in Descending Order
Rs Rs
52 35
50 34
49 33
47 32
46 31
43 30
42 27
41 23
39 21
36 20

A look at the arrayed figures in Table 13.9 or 13.10 gives us information about the lowest wage (Rs. 20) and the highest wage (Rs. 52). We also know that the range (the difference between the lowest and the highest wage) is Rs. 32 (Rs. 52 - Rs. 20). We also notice a concentration of wages between Rs. 30 and Rs. 40.
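The arrays and the observations drawn from them can be reproduced with a short sketch. The variable names are illustrative, and counting the wages "between Rs. 30 and Rs. 40" as 30 up to but not including 40 is an assumption.

```python
# Raw daily wages (in rupees) of the 20 workers.
wages = [20, 35, 31, 33, 30, 27, 36, 21, 39, 23,
         34, 41, 47, 32, 46, 49, 42, 43, 52, 50]

# An array is simply the raw data ordered by magnitude.
ascending = sorted(wages)
descending = sorted(wages, reverse=True)

# Range: highest wage minus lowest wage.
wage_range = ascending[-1] - ascending[0]

# Assumption: the 30 to 40 band covers 30 <= wage < 40.
in_thirties = [w for w in ascending if 30 <= w < 40]

print(ascending)
print(descending)
print("Range:", wage_range)
print("Wages from 30 up to 40:", in_thirties)
```

Eight of the twenty wages fall in that band, which is the concentration visible in the array.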

An array is useful when the number of items in the raw data is small. But if the items run into hundreds or thousands, an array is very difficult to handle and time-consuming to prepare. Hence the data must be condensed, and that is the second step in organising data. Here condensation or simplification of data is done through a process of classification into groups or classes.

## The Frequency Array

While making an array, some values may occur repeatedly. The number of times a value or item occurs in a series is called its frequency. If we mark the number of times each value appears in the series, we get what is known as a frequency array. The frequency array is useful only when the number of items in the raw data is small. It exhibits the frequency of observations and indicates the concentration of items around certain values.

Let us arrange the following raw data of daily wages (in rupees) of 20 employees in a factory into a frequency array.

50, 54, 50, 55, 56, 54, 50, 57, 50, 55

56, 54, 54, 50, 54, 56, 55, 54, 50, 56

13.11 Frequency Array
Daily Wages No. of Employees
50 6
54 6
55 3
56 4
57 1
Total 20
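The same frequency array can be obtained by counting how often each wage occurs. This is a minimal sketch using Python's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Raw daily wages (in rupees) of the 20 employees.
wages = [50, 54, 50, 55, 56, 54, 50, 57, 50, 55,
         56, 54, 54, 50, 54, 56, 55, 54, 50, 56]

# Count the number of times each distinct wage occurs.
frequency = Counter(wages)

# Print the frequency array in ascending order of wage.
for wage in sorted(frequency):
    print(wage, frequency[wage])
print("Total", sum(frequency.values()))
```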

## Frequency Distribution

A frequency distribution is an orderly arrangement of data classified according to the magnitude of observations. When data are grouped into classes of appropriate size indicating the number of observations in each class we get a frequency distribution. For example, the students of a college may be classified according to weight as follows:

13.12
Weight (in Kg) No. of Students
40 - 45 40
45 - 50 110
50 - 55 35
55 - 60 240
60 - 65 355
65 - 70 20
Total 800

There are two elements, viz., (1) the variable, i.e., the weight and (2) the frequency.

## Construction of Frequency Distribution

Selection of Class

• There is no hard and fast rule to determine the number of classes.
• A class should be neither too big nor too small.
• There should be neither too many classes nor too few.
• Example: 0 - 10, 10 - 20, 20 - 30, etc.

Class Limit

• The class limits are the lowest and the highest values that can be included in the class.
• They are the two ends of a class.
• In the class 20 – 30, 20 is called the lower class limit and 30 is called the upper class limit.

Class Interval

• It is the difference between the upper and lower class limits.
• Class interval is also known as class width or class size.
• The class interval of the class 50 – 100 is 50 (100 – 50 = 50).

Class Midpoint

• It is the middle value of a class. It is also known as mid value or class mark.
• It lies half way between the lower and upper class limits of a class.

Magnitude of Class Interval

• The difference between the lower and upper class boundaries is called the magnitude of a class interval.

Class Frequency

• The number of observations corresponding to a particular class is known as the class frequency.

### Construction of Frequency Distribution

The following technical terms are important when a frequency distribution is formed:

1. Selection of class: The quality of a frequency distribution is determined by a wise choice of the number of classes. There is no hard and fast rule to determine the number of classes. Ordinarily, a frequency distribution should not contain more than 20 to 25 classes or fewer than 6 to 8 classes, depending on the total number of items in the series. Suppose, for example, that 100 entries are given and the lowest value is 3 and the highest 96. In such a case we can have 10 classes: 0 - 10, 10 - 20, ..., 90 - 100.
2. Class limits: The class limits are the lowest and the highest values that can be included in the class. For example, if we take the class 20 - 30, the lowest value 20 is the lower limit and 30 is the upper limit.
3. Class intervals: The difference between the upper and lower limits is known as the class interval. In the class 100 - 200, the class interval is 100 (i.e., 200 minus 100).
4. Class mid-point or class mark: It is the value lying half-way between the lower and upper class limits of a class interval. Class mark = 1/2 (lower limit + upper limit).
5. Magnitude of class interval: The difference between the lower and upper class boundaries is called the magnitude of a class interval.
6. Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class or the class frequency.
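The class interval and class mark defined above are simple arithmetic, sketched here with two helper functions. The function names `class_interval` and `class_mark` are illustrative, not standard library names.

```python
def class_interval(lower, upper):
    """Class interval (width): upper limit minus lower limit."""
    return upper - lower


def class_mark(lower, upper):
    """Class mark (mid-point): half the sum of the two limits."""
    return (lower + upper) / 2


# The class 100 - 200 has an interval of 100.
print(class_interval(100, 200))

# The class 20 - 30 has a class mark of 25.
print(class_mark(20, 30))
```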

## How to Find the Frequency of a Distribution?

We have seen that frequency means the number of times a value or item occurs; to get the frequency we count the number of times each value of the variable is repeated in the data. If the data set is large, plain counting invites errors. To avoid them we use the method of tally marks. Tally marks are vertical bars (/) used for counting.

Let us create a frequency distribution for the following data:

70, 54, 35, 45, 45, 73, 56, 46, 3, 42, 43, 43, 43, 36, 47, 23, 57, 45, 25, 43, 55, 21, 65, 78, 39, 28, 42, 21, 27, 70, 23, 85, 41, 71, 24, 43, 17, 26, 56, 39, 87, 43, 8, 38, 12, 71, 68, 28, 47, 23, 67, 60, 34, 59, 2, 77, 91, 56, 28, 43, 40, 21, 80, 56, 55, 51, 34, 58, 28, 28, 54, 34, 68, 30, 45, 24, 32, 34, 21, 54, 7, 16, 49, 32, 26, 21, 5, 26, 29, 37, 34, 21, 29, 71, 35, 8, 34, 20, 21, 80.

Using tally marks, we can create a frequency distribution. First we draw a table with three columns: the class in the first column, the tally marks in the second, and the frequency in the third. All the entries in the first column are filled in with the classes. Now look at the data. The first entry is 70, which falls in the class 70 - 80. Strike off the entry 70 in the data and put a tally mark in the second column against the class 70 - 80. The second entry is 54, which falls in the class 50 - 60. Strike off the entry 54 and put a tally mark against the class 50 - 60. Repeat this process until all the entries in the data have been struck off. One more thing to notice: after placing four tally marks vertically, we draw the fifth across the first four, which gives a block of 5. For the sixth we put another vertical tally mark, leaving some space after the first block. The table below was completed by this process.

13.13 Frequency Distribution with Tally Mark
Class Tally Marks Frequency
0 - 10 //// / 6
10 - 20 /// 3
20 - 30 //// //// //// //// //// 25
30 - 40 //// //// //// / 16
40 - 50 //// //// //// //// 19
50 - 60 //// //// /// 13
60 - 70 //// 5
70 - 80 //// /// 8
80 - 90 //// 4
90 - 100 / 1
Total 100
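The tallying process can be mirrored in code: each value is assigned to its class and that class's count goes up by one. This is a minimal sketch assuming classes of width 10 starting at 0, with the upper limit of each class excluded; the names `width` and `counts` are illustrative.

```python
data = [
    70, 54, 35, 45, 45, 73, 56, 46, 3, 42, 43, 43, 43, 36, 47, 23, 57, 45, 25, 43,
    55, 21, 65, 78, 39, 28, 42, 21, 27, 70, 23, 85, 41, 71, 24, 43, 17, 26, 56, 39,
    87, 43, 8, 38, 12, 71, 68, 28, 47, 23, 67, 60, 34, 59, 2, 77, 91, 56, 28, 43,
    40, 21, 80, 56, 55, 51, 34, 58, 28, 28, 54, 34, 68, 30, 45, 24, 32, 34, 21, 54,
    7, 16, 49, 32, 26, 21, 5, 26, 29, 37, 34, 21, 29, 71, 35, 8, 34, 20, 21, 80,
]

width = 10          # class interval
counts = [0] * 10   # one counter per class: 0 - 10, 10 - 20, ..., 90 - 100

for value in data:
    # Integer division picks the class: 70 // 10 == 7, i.e. the class 70 - 80.
    counts[value // width] += 1

for i, frequency in enumerate(counts):
    print(f"{i * width} - {(i + 1) * width}: {frequency}")
print("Total:", sum(counts))
```

The integer division works here only because the largest value is 91; a value of exactly 100 would need an extra class or a special case.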

## Exclusive Method

When the class intervals are so fixed that the upper limit of one class is the lower limit of the next class, it is known as the exclusive method of classification. The classes are written, for example, as 5-10, 10-15, etc. Here a value of 10 is not included in the first class 5-10; it is included in the second class, 10-15.

13.14 Exclusive Classes
Marks (Class)
0 - 10
10 - 20
20 - 30

## Inclusive Method

Under the inclusive method of classification, the upper limit of a class is included in that class itself. The classes under this method are written, for example, as 5-9, 10-14, etc. Here a value of 9 is included in the first class 5-9.

13.15 Inclusive Classes
Marks (Class)
0 - 9
10 - 19
20 - 29

## How to Convert Inclusive Classes into Exclusive Classes?

Find the difference between the upper limit of a class and the lower limit of the next class. Find half the difference. Subtract this number from all the lower limits and add this number to all the upper limits.

Let us convert the inclusive classes given below into exclusive classes.

13.16 Inclusive Classes
Marks (Class)
0 - 9
10 - 19
20 - 29

Given classes: 0 – 9, 10 – 19, 20 – 29

Difference between the upper limit of a class and the lower limit of the next class = 10 – 9 = 1

Half the difference = $$\frac{1}{2}$$ (or 0.5)

Now we get the exclusive classes given below.

13.17 Exclusive Classes
Marks (Class)
-0.5 - 9.5
9.5 - 19.5
19.5 - 29.5
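The conversion rule can be sketched as a small function. The name `to_exclusive` is illustrative, and the sketch assumes the gap between consecutive classes is the same throughout, so it is measured once from the first two classes.

```python
def to_exclusive(classes):
    """Convert inclusive (lower, upper) classes into exclusive ones."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = classes[1][0] - classes[0][1]
    half = gap / 2
    # Subtract half the gap from every lower limit, add it to every upper limit.
    return [(lower - half, upper + half) for lower, upper in classes]


inclusive = [(0, 9), (10, 19), (20, 29)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```

After the conversion, the upper limit of each class coincides with the lower limit of the next, as the exclusive method requires.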

## Cumulative Series

In a cumulative series the frequencies are progressively totalled and aggregates are shown.

13.18 Cumulative Series
Marks (Class) Number of Students (Frequency)
Marks below 10 12
" below 20 18
" below 30 24
" below 40 30
" below 50 36

The cumulation may be upward or downward.
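Both directions of cumulation can be sketched with running totals. The frequencies below are those of Table 13.8. Reading "upward" as totalling from the first class onwards and "downward" as totalling from the last class backwards is an assumption about the terms.

```python
from itertools import accumulate

# Class frequencies from Table 13.8 (classes 0 - 10, 10 - 20, ..., 40 - 50).
frequencies = [5, 10, 17, 13, 5]

# Upward cumulation: students below 10, below 20, ..., below 50.
upward = list(accumulate(frequencies))

# Downward cumulation: students at 0 or above, 10 or above, ..., 40 or above.
downward = list(accumulate(reversed(frequencies)))[::-1]

print(upward)
print(downward)
```

In both cases the largest cumulative figure equals the total number of students, 50.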

### Loss of Information

When we classify data into a frequency distribution there is an inherent shortcoming. When it summarises the raw data to make it concise, it fails to give all details that are found in raw data. That is, while summarising it as a classified data, there is a loss of information. We noted that once the data are grouped into classes, an individual observation has no significance in further statistical computations. Consider an example of a class 30 - 40 containing 6 observations, 35, 35, 30, 32, 35 and 38. When we use the frequency table for further analysis, we will not attach any importance to the actual value of the items. We consider only the total number of items (6). All values in the class are taken to be equal to the middle value of the class interval (i.e., 35); individual values are not considered. This is true for other classes as well. Thus the use of mid value of each class in place of actual values of the observations in statistical methods involves considerable loss of information.
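The size of the loss in this example can be worked out directly. The sketch compares the total of the six actual observations with the total implied by treating each of them as the mid value 35; the variable names are illustrative.

```python
# The six actual observations in the class 30 - 40.
observations = [35, 35, 30, 32, 35, 38]

lower, upper = 30, 40
mid_value = (lower + upper) / 2  # 35.0

# Total and mean from the raw values.
actual_total = sum(observations)
actual_mean = actual_total / len(observations)

# Total implied by the frequency table: every item is taken as the mid value.
grouped_total = mid_value * len(observations)

print("Actual total:", actual_total, "Actual mean:", round(actual_mean, 2))
print("Grouped total:", grouped_total, "Grouped mean:", mid_value)
```

The grouped figures overstate the total by 5, a small illustration of the information lost through grouping.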

## Open end Class

If the lower limit of the first class or the upper limit of the last class is not given, the series is called an open end class series.

13.19 Open end Class
Marks (Class) Number of Students (Frequency)
Marks below 10 4
10 - 20 6
20 - 30 6
30 - 40 9
40 and above 5

## Unequal Class

We are now familiar with frequency distributions of equal class intervals. But in some cases, frequency distributions with unequal class intervals will be more appropriate. If the class intervals in a distribution are not all equal, it is called an unequal class distribution. Observe the frequency distribution given below:

13.20 Frequency distribution of Marks in Economics
Marks (Class) Mid Value Number of Students (Frequency)
0 - 10 5 2
10 - 20 15 8
20 - 30 25 5
30 - 40 35 6
40 - 50 45 24
50 - 60 55 18
60 - 70 65 20
70 - 80 75 7
80 - 90 85 6
90 - 100 95 4

In the above frequency distribution we notice that most of the observations are concentrated in the classes 40 - 50, 50 - 60 and 60 - 70, with frequencies 24, 18 and 20 respectively. That is, the majority of items (62 out of 100, or 62 per cent) fall in the middle range 40 - 70, and only 38 per cent of the data are in the other seven classes, which are sparsely populated. Further, observations in the three concentrated classes deviate more from their respective class marks than those in the other classes. Hence smaller classes are more suitable here, and unequal class intervals are more appropriate for this distribution.

What we do is split each of the classes with the highest concentration (40 - 50, 50 - 60 and 60 - 70) into two: 40 - 50 into 40 - 45 and 45 - 50, 50 - 60 into 50 - 55 and 55 - 60, and 60 - 70 into 60 - 65 and 65 - 70. We retain the other classes as before (i.e., with a class interval of 10).

The new classification, along with class marks and frequencies, is given in the following table. The new class mark values are more representative of the data in these classes than the old values.

13.21 Frequency distribution of unequal classes
Marks (Class) Mid Value Number of Students (Frequency)
0 - 10 5 2
10 - 20 15 8
20 - 30 25 5
30 - 40 35 6
40 - 45 42.5 11
45 - 50 47.5 13
50 - 55 52.5 8
55 - 60 57.5 10
60 - 65 62.5 9
65 - 70 67.5 11
70 - 80 75 7
80 - 90 85 6
90 - 100 95 4
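A quick consistency check on the regrouping: each pair of narrower classes should together hold as many students as the class it replaced, and each new mid value should lie half way between its limits. The dictionaries and the helper `split_total` are illustrative names, not part of the source.

```python
# Frequencies of the three concentrated classes before the split.
old = {(40, 50): 24, (50, 60): 18, (60, 70): 20}

# Frequencies of the six narrower classes after the split.
new = {(40, 45): 11, (45, 50): 13, (50, 55): 8,
       (55, 60): 10, (60, 65): 9, (65, 70): 11}


def split_total(lower, upper):
    """Sum the frequencies of the new classes lying inside lower - upper."""
    return sum(f for (a, b), f in new.items() if lower <= a and b <= upper)


# Each old class frequency should equal the sum of its two halves.
checks = {cls: split_total(*cls) == f for cls, f in old.items()}

# Mid values of the new classes.
mid_values = {cls: (cls[0] + cls[1]) / 2 for cls in new}

print(checks)
print(mid_values)
```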

## Univariate Distribution

The frequency distribution of a single variable is called a univariate frequency distribution. The table below shows the univariate distribution of the single variable ‘marks’; the number of students is the frequency.

13.22 Univariate Distribution
Marks Number of Students
40 - 50 5
50 - 60 8
60 - 70 15
70 - 80 20
80 - 90 7
90 - 100 2

## Bivariate Distribution

A bivariate frequency distribution is the frequency distribution of two variables.

The following table shows the frequency distribution of two variables, sales and advertisement expenditure. The classes of sales are given in columns and the classes of advertisement expenditure (cost) are shown in rows.

13.23 Bivariate distribution

| Cost | Sales 100 - 200 | Sales 200 - 300 | Sales 300 - 400 | Sales 400 - 500 |
|---|---|---|---|---|
| 40 - 50 | 5 | 3 | 2 | 1 |
| 50 - 60 | 8 | 4 | 3 | 1 |
| 60 - 70 | 8 | 3 | 1 | 1 |
| 70 - 80 | 6 | 1 | 2 | 1 |
| 80 - 90 | 4 | 1 | 1 | 2 |
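A bivariate table of this kind is built by classifying each unit on both variables at once and counting the pairs of classes. The sketch below uses six hypothetical firms, not the data behind Table 13.23; the class widths (100 for sales, 10 for cost) follow the table, and the helper names are illustrative.

```python
from collections import Counter

# Hypothetical (sales, advertisement cost) figures for six firms.
pairs = [(150, 45), (250, 55), (120, 42), (310, 58), (180, 47), (260, 52)]


def sales_class(sales):
    """Exclusive sales class of width 100, e.g. 150 falls in (100, 200)."""
    lower = (sales // 100) * 100
    return lower, lower + 100


def cost_class(cost):
    """Exclusive cost class of width 10, e.g. 45 falls in (40, 50)."""
    lower = (cost // 10) * 10
    return lower, lower + 10


# Each cell of the table is keyed by (cost class, sales class).
table = Counter((cost_class(c), sales_class(s)) for s, c in pairs)

for (cost, sales), frequency in sorted(table.items()):
    print(f"cost {cost[0]} - {cost[1]}, sales {sales[0]} - {sales[1]}: {frequency}")
```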