**What is Statistics?**

Statistics is the branch of science that deals with the collection, organization, analysis, interpretation, and presentation of data. It studies the methods of collecting, tabulating, and summarizing data and of drawing conclusions from them so that informed choices can be made.

**Since When Have We Been Using Statistical Metrics?**

Probably since the days we could compute jyotish nakshatras, but evidence is scarce, and we still do not understand the observational methods of the oral tradition of the Vedic period. Interestingly, administrative data collection was seen around 300 BC, during the reign of Chandragupta Maurya in India. In the late 16th century, Abul Fazl and Faizi talk of surveys in Ain-i-Akbari. States needed to base policy on demographic and economic data and employed statistical thinking for it; hence the etymology from ‘stat’ (state) stuck. In the early 19th century in the West, the discipline broadened its scope, and the word statistics came to be used formally for the collection, classification, and analysis of data.

**Why Should Doctors Or Any Professional Study Or Practice Statistics?**

Statistics is the collection of methods and techniques used to analyse numerical data, or non-numerical data that can be converted into numerical form without loss of interpretation. It is used in diverse fields such as economics, biology, psychology, and business management. With increased computational capabilities, most specialized fields do not require in-depth knowledge of every aspect of statistics; a thorough understanding of certain aspects, with basic familiarity and awareness of the others, suffices. The oft-quoted eighty-twenty rule states that eighty percent of the work in a field is accomplished using twenty percent of the tools, and the remaining twenty percent of the work calls for the other eighty percent. Hence, students should learn the ropes of research methods and the interpretation of evidence.

**What is Biostatistics?**

Biostatistics, the branch of statistics that deals with data relating to living organisms, is the application of statistics to biology, public health and other biomedical sciences. It covers the design of biological experiments in medicine, pharmacy, agriculture, veterinary science and fisheries; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. A major branch is medical biostatistics.

Biostatistics is a special branch because subjects (patients, mice, cells, etc.) vary in their response to various factors, owing to their genotype or to the physical factors that interact with it (phenotype). Therefore, observed differences could be due to different treatments, or they may be attributable to chance, measurement error, or other characteristics of the individual subjects. In fact, it was agricultural experiments that turned R. A. Fisher, a British statistician and geneticist, into “a genius who almost single-handedly created the foundations for modern statistical science” and “the single most important figure in 20th century statistics”. Using these tools of statistics, we can answer pressing research questions in medicine and public health, such as: does a new drug work? What causes cancer (cf. Hill and Doll)? How long is a person with a certain illness likely to survive?

**How To Practice The Scientific Rigour Of Evidence Generation And Interpretation?**

The research process is a series of steps or actions, taken in a desirable sequence, for conducting research on a given problem. The steps are as follows:

a. Formulation of the research question

b. Execution of extensive and exhaustive literature search

c. Development of a hypothesis

d. Preparation of a research design

e. Collection of data

f. Analysis of data

g. Testing of the hypothesis

h. Generalization and interpretation of findings

i. Monitoring and evaluation of the project

j. Preparation of a report with a formal write-up of conclusions.

These steps are neither mutually exclusive nor strictly sequential; they may overlap rather than being separate and distinct.

**What are Levels of Measurement?**

Measurement is a process of assigning numerals to a characteristic or an attribute of an object to represent its qualities according to some rules. It is a way of using symbols to represent the properties of persons, objects, events or states. Measurement has four levels of hierarchy which can be loosely called the four Levels of Measurement:

• Nominal

• Ordinal

• Interval and

• Ratio

These are referred to as ‘primary scales of measurement’.

**What is the importance of ‘Primary Scales of Measurement’?**

These scales or levels of measurement are important in deciding how to interpret data of the variable and what statistical tools are appropriate for the data. This is because most tests use some underlying assumptions which must be met to use the given method of evaluation in inferential statistics.

**Nominal:** This is the most basic scale, in which numbers are simply assigned as tags or labels to the study objects. For example, we can assign roll numbers to the students of a class or house numbers to apartments in a building. This provides unique and mutually exclusive ‘numbers/names’ for the attribute or qualitative variable. These measurements allow only simple operations such as counts and frequency tabulation; statistical operations such as the mean are not possible. They are the least informative and the weakest.

**Ordinal:** These are ranking scales or ordered data. The ranking or ordering is carried out on the basis of a specific attribute or characteristic of interest, in a specific direction. These scales provide all the information provided by the nominal scale and, in addition, an order, for example mild, moderate and severe. An ordinal scale cannot convey the size of the difference between ranks: the interval between ranks is not interpretable. Equivalent units get equal ranks, but the gap between mild and moderate (or grade 1 and grade 2) need not be the same as that between moderate and severe (or grade 3 and grade 4). The ordinal scale is used for qualitative data. It provides limited information and has low statistical power. It allows a few positional statistics such as the median and quartiles; the mean should not be used on ranked data.

**Interval:** Interval scales use a proper unit of measurement, but the zero point may be arbitrary and does not signify absence. For example, temperatures of 80° F and 90° F differ by 10° F, which can be compared with the difference between 30° F and 40° F; the scale therefore allows meaningful statements about differences between two values. It is mathematically powerful and is used for quantitative data. The main drawback of this scale is that the ratio of two interval-scale values is not interpretable: a temperature of 40° F is not twice as hot as 20° F.

**Ratio:** These measurements have an absolute zero that is meaningful and signifies the absence of the attribute or characteristic. Ratio measurements are statistically the strongest as they allow all statistical operations. These scale values reflect equal ratios; for example, a length of 8 cm is twice a length of 4 cm.
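
The difference between interval and ratio scales can be checked numerically. The following is a minimal Python sketch rather than part of the original discussion; the helper name `fahrenheit_to_kelvin` is introduced here only for illustration. It suggests that the ratio of two Fahrenheit readings changes once the arbitrary zero is removed, whereas the ratio of two lengths stays the same in any unit.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert degrees Fahrenheit (interval scale) to kelvin (ratio scale)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


hot_f, cold_f = 40.0, 20.0

# Dividing two interval-scale readings gives a number with no physical meaning.
naive_ratio = hot_f / cold_f

# On a scale with an absolute zero, the same two temperatures are nearly equal.
kelvin_ratio = fahrenheit_to_kelvin(hot_f) / fahrenheit_to_kelvin(cold_f)

# Length has a true zero, so the ratio does not depend on the unit chosen.
ratio_in_cm = 8.0 / 4.0
ratio_in_mm = 80.0 / 40.0

print(f"Ratio of Fahrenheit readings: {naive_ratio:.2f}")
print(f"Ratio on the kelvin scale:    {kelvin_ratio:.3f}")
print(f"Length ratio in cm and in mm: {ratio_in_cm:.1f}, {ratio_in_mm:.1f}")
```

Differences, by contrast, remain comparable on the interval scale, which is why subtraction is meaningful there while division is not.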

Discrete data are data in which only a countable set of separate values is possible and the values cannot be subdivided meaningfully. Such data typically take only integer values, for example the number of births, pregnancies or deaths; they cannot take fractional values.

Continuous data is information that can be measured on a continuum. It can assume any numeric value and can be meaningfully subdivided into finer and finer increments, depending upon the precision of the measurement system. It can assume fractional values; for example, a weight of 42.75 kilograms.

Significance of nominal and ordinal scaled data is tested only through non-parametric tests while parametric tests can be utilized for interval and ratio scaled data if they follow the normal distribution or are expected to reasonably follow the normal distribution.
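
To make the distinction between nominal and ordinal data concrete, here is a small Python sketch using only the standard library. The blood groups and severity grades are hypothetical values invented for illustration; they are not data from this text. Nominal data are only counted, while ordinal data are first mapped to their rank order so that a positional statistic (the median) can be read off.

```python
from collections import Counter
from statistics import median_low

# Nominal data (hypothetical): labels with no order, so we can only count them.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]
group_counts = Counter(blood_groups)
most_common_group = group_counts.most_common(1)[0][0]

# Ordinal data (hypothetical): labels with an order but no measurable distance.
grade_order = ["mild", "moderate", "severe"]
grades = ["mild", "severe", "moderate", "mild", "moderate"]

# Replace each grade by its rank, then take a positional statistic.
ranks = [grade_order.index(grade) for grade in grades]
median_grade = grade_order[median_low(ranks)]

print("Frequency table:", dict(group_counts))
print("Most frequent group:", most_common_group)
print("Median grade:", median_grade)
```

`median_low` is used so that the result is always one of the observed ranks and never an average of two ranks, which would have no meaning on an ordinal scale.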

Numerical statistics refers to numbers, e.g. tables. Pictorial statistics takes numerical data and presents it in pictures or graphs. Data visualization in the form of a graphic allows complex and confusing information to be presented in a simpler and more straightforward manner.

**What are branches of statistics?**

There are two main branches of statistics: descriptive statistics and inferential statistics.

What Is Descriptive Statistics?

Descriptive statistics are numbers and graphical methods that are used to summarize and describe data. They help to characterize data based on its properties. There are four major types of descriptive statistics:

1. Measures of Frequency

Count, Percent, Frequency which shows how often some event occurs.

2. Measures of Central Tendency

Mean, Median, and Mode, which locate the centre of the distribution in different ways. These are used when the average or the most commonly indicated response is of interest.

3. Measures of Dispersion or Variation

Range, Variance, Mean Deviation, Standard Deviation which identify the spread of scores by stating intervals to show how “spread out” the data are. It is helpful to know when your data is so spread out that it affects the mean.

4. Measures of Position

Percentile Ranks, Quartile Ranks show relation of the responses to one another.

We will use an example, our OPD data, to understand the practical implications of the use of biostatistics in what follows.

Proforma of Data Collected

Now we have 30 patients for whom we collected the proformas. Do the 30 proformas by themselves convey any information? The answer is no. We need some way of converting these data into meaningful information.

To do that, we tabulate the data from the raw format into forms or charts or master chart as shown in Table 2 on following page.

For the sake of brevity, we will consider only the first three parameters in Table 2 on the following page.

This again does not convey information. So, we describe the information using some statistical metrics. The mean age of these sixty subjects was 59.57 years with a standard deviation of 6.01 years. There was equal representation of the sexes.

Now we know that the data can be described so that the reader can understand what we did.

The different methods of measuring the central tendency are:

• Arithmetic Mean

• Median, Quartile and Percentile

• Mode

• Geometric Mean

• Harmonic Mean

Arithmetic Mean

$$\text{Simple Arithmetic Mean } \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x}{n}$$

Where, $x_1, x_2, x_3, \dots, x_n$ = Various values of the variable ‘x’

$\sum x$ = Sum of all values of ‘x’

$n$ = Total number of Observations.

For example, for the series 7 8 9 10 11 12 13 15 35

Arithmetic Mean $\bar{x} = \frac{120}{9} = 13.33$
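
The arithmetic above can be reproduced in a few lines of Python. This is only an illustrative sketch using the standard `statistics` module; the median is printed alongside as an extra, to hint at how the single large value 35 pulls the mean upwards.

```python
from statistics import mean, median

series = [7, 8, 9, 10, 11, 12, 13, 15, 35]

total = sum(series)             # sum of all values of x
n = len(series)                 # total number of observations
arithmetic_mean = total / n     # same result as statistics.mean(series)

print(f"Sum = {total}, n = {n}")
print(f"Arithmetic mean = {arithmetic_mean:.2f}")
print(f"statistics.mean = {mean(series):.2f}")
print(f"Median          = {median(series)}")
```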

Is Central Tendency Enough To Describe My Data?

The answer is no. It is the barest minimum. Now let us consider two datasets

8, 9, 10, 11, 12, 13, 14, 15, 16 and

2 , 2 , 2, 11, 14, 15, 16, 38

The mean is 12 for the first dataset and 12.5 for the second, and the medians are likewise 12 and 12.5. The centres are almost the same, but the two are not similar distributions. Clearly, we need something more to define the data. So we wish to see the spread of the data also. This is done by using measures of dispersion. For a detailed discussion on dispersion, the reader is referred to “Basics of Biostatistics: A Manual for the Medical Practitioners” or other texts.

The different types of measures of dispersion are:

• Range

• Mean Deviation

• Standard Deviation

Range

Range is the difference between the highest and the lowest values of a dataset. To find the range, first order the data from least to greatest. Then subtract the smallest value from the largest value in the set.

In the two datasets

8, 9, 10, 11, 12, 13, 14, 15, 16 and

2 , 2 , 2, 11, 14, 15, 16, 38

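The range is 16 − 8 = 8 units for the first dataset and 38 − 2 = 36 units for the second. The range is a crude measure because it depends only on the two extreme values, so other measures such as the standard deviation are preferred.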
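
As a quick check, the range of the two datasets can be computed in Python. The function name `value_range` is introduced here for illustration only. The last line is an added demonstration of how strongly the range depends on a single extreme observation.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


first = [8, 9, 10, 11, 12, 13, 14, 15, 16]
second = [2, 2, 2, 11, 14, 15, 16, 38]

range_first = value_range(first)
range_second = value_range(second)

# Dropping the single extreme value 38 changes the range considerably.
range_without_extreme = value_range(second[:-1])

print("Range of first dataset: ", range_first)
print("Range of second dataset:", range_second)
print("Second dataset without its largest value:", range_without_extreme)
```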

**Standard Deviation**

Standard Deviation is the most important and also the most popular measure of dispersion. It is denoted by the Greek letter sigma (σ). It is important for effect size and for the normalization or standardization of data.

Standard Deviation of the distribution is calculated by the given formula:

For Individual Observations

$$\sigma = \sqrt{\frac{\sum_{i} (x_i - \bar{x})^2}{N}}$$

For Discrete and Continuous Distributions

$$\sigma = \sqrt{\frac{\sum_{i} f_i (x_i - \bar{x})^2}{N}}$$

*Where*

*$N = \sum f_i$ = Sum of all frequencies (or the number of observations, in case of individual observations)*

*$x_i$ = Variable value (class mid point in case of Continuous distribution)*

*$f_i$ = Corresponding Frequencies and*

*$\bar{x}$ = Mean of the Distribution*

It is different from Root Mean Square Deviation. Please refer to “Basics of Biostatistics: A Manual for the Medical Practitioners” for further details. For the two datasets 8, 9, 10, 11, 12, 13, 14, 15, 16 and 2, 2, 2, 11, 14, 15, 16, 38, the standard deviations computed with the sample divisor (n − 1) are 2.74 and 11.98 respectively (2.58 and 11.20 with the divisor N). This means that the values of the second dataset are more spread out than those of the first.
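
The standard deviation depends on the divisor used, so it helps to compute it both ways for the two datasets as printed above. The sketch below uses Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n − 1; which of the two a given textbook or software package reports is something to verify rather than assume.

```python
from statistics import pstdev, stdev

first = [8, 9, 10, 11, 12, 13, 14, 15, 16]
second = [2, 2, 2, 11, 14, 15, 16, 38]

results = {}
for name, data in (("first", first), ("second", second)):
    results[name] = {
        "divisor_N": pstdev(data),          # population formula
        "divisor_n_minus_1": stdev(data),   # sample formula
    }
    print(
        f"{name}: divisor N -> {results[name]['divisor_N']:.2f}, "
        f"divisor n-1 -> {results[name]['divisor_n_minus_1']:.2f}"
    )
```

With either divisor the second dataset comes out several times more dispersed than the first, even though the two centres are close.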

**What is The Bare Minimum For Describing Data?**

The previous measures were the ones most commonly used in the past. Today, however, with computation being done by automated means, it makes more sense to report the

• Centre

• Spread

• Shape of Data and

• Any unusual features in the data.

For age and Axial Length, let us look at the histograms.

**How do we describe data? Is there a hard and fast rule in it?**

It is useful to summarize data using a combination of tabulated description (i.e., tables), graphical description (i.e., graphs and charts) and statistical commentary (i.e., a discussion of the results). The purpose of descriptive statistics is to convey the shape, symmetry and any unusual features of the data series. The data should tell its own story. There is no hard and fast convention or rule; different branches of science have followed their own traditions, and the accent on data visualization has increased in recent years. Convey information in a manner that is easily understood. In datasets with many outliers or a skewed distribution, the mean alone can be misleading for making nuanced decisions. Similarly, the standard deviation is deceptive if taken alone: if the data follow a non-normal curve or contain a large number of outliers, it can be fallacious. So always look at the shape and the outliers. A detailed discussion is out of the scope of this manuscript, and the reader is well advised to refer to “Basics of Biostatistics: A Manual for the Medical Practitioners” or other texts for further analysis.

**What Is Univariate Analysis?**

Univariate analysis, by definition, describes the distribution of a single variable. The description should include central tendency (the mean, median, and mode) and dispersion (the range and quartiles of the dataset, and measures of spread such as variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Other characteristics of the distribution can be described in tabular or graphical form, for example with histograms, Q-Q plots and stem-and-leaf displays.
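
A univariate description of this kind can be assembled with Python's standard library. The following is a hedged sketch: the function name `describe` is introduced here for illustration, the skewness is the simple moment-based index (third central moment divided by the variance raised to the power 1.5), and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from other software. The data are the second example dataset used earlier.

```python
from statistics import mean, median, mode, pstdev, pvariance, quantiles


def describe(values):
    """Return common univariate descriptive statistics as a dictionary."""
    centre = mean(values)
    variance = pvariance(values)  # divisor N
    # Third central moment, used for a moment-based skewness index.
    third_moment = sum((v - centre) ** 3 for v in values) / len(values)
    q1, q2, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "mean": centre,
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "variance": variance,
        "sd": pstdev(values),
        # Assumes the variance is non-zero (the values are not all equal).
        "skewness": third_moment / variance ** 1.5,
    }


summary = describe([2, 2, 2, 11, 14, 15, 16, 38])
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

A positive skewness here reflects the long right tail created by the value 38; a symmetric dataset would give a value of zero.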

**Bivariate and Multivariate Analysis**

When more than one variable is described, descriptive statistics can also describe the relationship between pairs of variables.

In this case, descriptive statistics include:

• Cross-tabulations and contingency tables

• Graphical representation via scatterplots

• Quantitative measures of dependence

• Descriptions of conditional distributions
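
Two of the items listed above, a cross-tabulation and a quantitative measure of dependence, can be sketched in Python as follows. All values are hypothetical and invented for illustration; they are not the OPD data. The helper `pearson_r` is written out by hand here so that the calculation is visible, and Pearson's correlation is only one of several possible measures of dependence.

```python
from collections import Counter
from math import sqrt

# Hypothetical categorical variables for six subjects.
sex = ["F", "M", "F", "M", "F", "M"]
severity = ["mild", "mild", "moderate", "severe", "moderate", "mild"]

# Cross-tabulation: count each (sex, severity) combination.
crosstab = Counter(zip(sex, severity))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical paired continuous measurements.
x_values = [50, 55, 60, 65, 70]
y_values = [22.8, 23.1, 23.0, 23.6, 23.9]
r = pearson_r(x_values, y_values)

for (s, grade), count in sorted(crosstab.items()):
    print(f"{s} / {grade}: {count}")
print(f"Pearson r = {r:.2f}")
```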

**To summarize**

Descriptive statistics include frequencies and percentages for categorical (ordinal and nominal) data, and averages (means and/or medians), ranges and standard deviations for continuous data. Frequency is the number of participants that fit into a certain category or group; it is also useful to know the percentage of the sample that falls into that category or group, which is calculated from the frequency. Typically, the average that is calculated and presented is the mean. Means describe the average unit for a continuous item, and standard deviations describe the spread of those units in reference to the mean.

**What is a hypothesis?**

A hypothesis is defined as “a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.” It must be stated in advance and have clear inclusion and exclusion parameters.

**How to test the hypothesis?**

The established steps in testing of hypothesis are:

This is the common scenario used by doctors to understand the 2×2 contingency table. Now we extend it to statistical tests. A Type I Error is a false positive: the results of a hypothesis test suggest that the Alternate Hypothesis (HA) is true when in fact the Null Hypothesis (H0) holds. A Type II Error is a false negative: the results suggest that H0 is true when in fact HA is.

The ideal diagnostic test should have no false positives or false negatives. In statistics, however, the problem is one of defining the acceptance or critical region. Type I and Type II errors do not mean that the investigator is making a mistake; they are a reflection of the acceptance/rejection region. For more details on the issue, please refer to “Basics of Biostatistics: A Manual for Medical Practitioners”. The significance level, denoted alpha or α, is the probability of rejecting the null hypothesis when the null hypothesis is true. It is the probability of wrongly concluding that a difference exists when there is no actual difference. Thanks to R.A. Fisher, convention typically uses an alpha level of 0.05; however, lower or higher levels can be and are used when dictated by the importance of the research problem. A significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. The P value is the probability of obtaining an effect at least as extreme as the one observed in the sample data, assuming that the null hypothesis is true. If the P value is less than alpha, the corresponding (1 − α) confidence interval will not contain the null hypothesis value.
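
The link between alpha, the P value and the confidence interval can be illustrated with a short Python sketch. It takes the summary figures quoted earlier (mean age 59.57 years, standard deviation 6.01 years, sixty subjects) purely as illustrative inputs, and tests them against a null value of 58 years, which is a hypothetical number chosen for this example. A large-sample normal (z) approximation is assumed for simplicity; a t-test would be the more usual choice when the standard deviation is estimated from the sample.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 59.57   # illustrative input taken from the summary in the text
sample_sd = 6.01      # illustrative input taken from the summary in the text
n = 60                # illustrative input taken from the summary in the text
null_mean = 58.0      # hypothetical null hypothesis value (an assumption)
alpha = 0.05          # conventional significance level

standard_error = sample_sd / sqrt(n)
z = (sample_mean - null_mean) / standard_error

# Two-sided P value under the normal approximation.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Matching (1 - alpha) confidence interval for the mean.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
ci_low = sample_mean - z_critical * standard_error
ci_high = sample_mean + z_critical * standard_error

reject_null = p_value < alpha
null_inside_ci = ci_low <= null_mean <= ci_high

print(f"z = {z:.2f}, P = {p_value:.3f}")
print(f"{100 * (1 - alpha):.0f}% CI: {ci_low:.2f} to {ci_high:.2f}")
print("Reject H0:", reject_null, "| null value inside CI:", null_inside_ci)
```

Under these assumptions the two decisions agree: the null hypothesis is rejected exactly when the null value lies outside the interval.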