Statistics for Data Science.
What is statistics? A branch of mathematics? Or is it a way to make sense of the chaos that is called data, and to use that data to make predictions that will never be 100 percent accurate? Statistics forms the mathematical background of data science, and in this and several coming blog posts we will explore just that. Let's get our hands dirty with the basics here.
Individuals: These are the objects described by a set of data.
Variables: A variable is any characteristic of an individual.
Population: The entire group of individuals from which the statistical data is drawn. Its size is represented by N.
Example: Suppose we have data giving the age of every person in a town. The individuals are the people, and the characteristic being described, age, is the variable. Since different people generally have different ages, a variable generally takes different values for different individuals. The collection of these ages is the data. The people of the town are the population here.
Sample: It is a portion of the population. Its size is represented by n.
Example: In the example above, if we consider only the people of Hemmingway (a colony of that town), then they form a sample.
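The town example can be sketched in a few lines of Python. All the ages below are made up for illustration, and the colony names other than Hemmingway are hypothetical placeholders.

```python
# Hypothetical data: ages grouped by colony (all numbers are made up).
town = {
    "Hemmingway": [34, 45, 29, 61, 8],
    "Colony B": [72, 80, 68, 75],
    "Colony C": [19, 21, 20, 22, 18, 23],
}

# Population: every individual in the town.
population = [age for ages in town.values() for age in ages]
N = len(population)  # population size

# Sample: only the people of Hemmingway.
sample = town["Hemmingway"]
n = len(sample)  # sample size

print(f"N = {N}, n = {n}")
```

Here each person is an individual, the age is the variable, and the list of ages is the data.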
Representative sample: Suppose you are asked to draw some statistical inferences about the ages of the people of the town, and it is not possible for you to knock on every door. What do you do? You pick some colonies that you think represent the entire town. You include a colony that contains an old-age home, a colony that contains a boarding school, a colony with many college students' hostels, and so on, each in proportion to its share of the town's population. The inferences you draw from such data will be similar to the ones you would have drawn had you knocked on every door of the town.
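One way to pick groups "in proportion" is to sample from each group according to its share of the town. The sketch below assumes a made-up town of 1000 people split into four hypothetical groups; the group names, sizes and age ranges are illustrative only.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical town: ages grouped by the kind of colony (made-up data).
town = {
    "old_age_home": [random.randint(65, 90) for _ in range(100)],
    "boarding_school": [random.randint(6, 17) for _ in range(200)],
    "college_hostel": [random.randint(18, 24) for _ in range(300)],
    "family_housing": [random.randint(1, 70) for _ in range(400)],
}
N = sum(len(ages) for ages in town.values())  # population size
n = 100  # desired sample size

# Take people from each group in proportion to the group's share of the town.
sample = []
for group, ages in town.items():
    k = round(n * len(ages) / N)  # this group's share of the sample
    sample.extend(random.sample(ages, k))

print(len(sample))
```

With these sizes the groups contribute 10, 20, 30 and 40 people, mirroring their 10%, 20%, 30% and 40% shares of the town.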
Biased sample: If your sample contains more colonies with old-age homes than colonies with boarding schools, the inferences you draw from that data will lean towards the older age group and away from the kids. This is called a biased sample.
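A tiny numeric sketch shows the effect. The ages are made up, and the town is assumed to have as many school children as elderly residents.

```python
from statistics import mean

# Made-up ages: equal numbers of elderly residents and school children.
old_age_home = [70, 75, 80, 85]
boarding_school = [10, 12, 14, 16]
population = old_age_home + boarding_school

# Biased sample: three elderly residents but only one child.
biased_sample = old_age_home[:3] + boarding_school[:1]

print(mean(population))     # 45.25
print(mean(biased_sample))  # 58.75
```

The average age in the biased sample is well above the town's true average because older people are over-represented.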
Census: If you conduct the survey by knocking on the door of every single household, you get a census.
Sample data: If you survey only a portion of the population, the data you collect is called sample data, as explained earlier.
Sample data is used more frequently in research than census data, because surveying every individual is often not practical.
Parameter: It is a measure that describes the entire population.
Statistic: It is a measure that describes a sample (a portion of the population).
Example: If we knock on every door of the town to note people's ages, a measure computed from that data (such as the average age) is a parameter. If we compute the same measure from data collected in only a few colonies (a representative or a biased sample), it is a statistic.
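The difference can be sketched by computing the same measure twice, once on the whole population and once on a sample. The ages are randomly generated stand-ins for a real census, and the average age is just one possible measure.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Made-up "census": the age of every person in a hypothetical town.
population = [random.randint(1, 90) for _ in range(5000)]

# Made-up survey: 200 people picked at random from the town.
sample = random.sample(population, 200)

parameter = mean(population)  # computed from everyone: a parameter
statistic = mean(sample)      # computed from the sample: a statistic

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

The two numbers are usually close but not identical, which is why a statistic is treated as an estimate of the parameter.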
Descriptive statistics: These are descriptive coefficients that summarize a data set. Descriptive statistics also involves drawing tables, bar graphs, pie charts, etc., which help to describe the data set. For example:
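A minimal sketch of such a summary in Python, using made-up ages. The list `ages`, the helper `age_group` and the age bands are hypothetical choices for illustration, and the text bar chart is only a stand-in for a real plot.

```python
from collections import Counter
from statistics import mean, median

# Made-up ages for illustration.
ages = [5, 12, 19, 23, 31, 38, 42, 45, 45, 47, 51, 58, 64, 72]

def age_group(age):
    """Map an age to a coarse, hypothetical age band."""
    if age < 20:
        return "0-19"
    if age < 40:
        return "20-39"
    if age < 60:
        return "40-59"
    return "60+"

# Frequency table: how many people fall into each band.
counts = Counter(age_group(a) for a in ages)

# Summary coefficients followed by a text bar chart of the table.
print(f"mean = {mean(ages):.1f}, median = {median(ages)}")
for band in ["0-19", "20-39", "40-59", "60+"]:
    print(f"{band:>5} | {'#' * counts[band]}")
```

The mean and median summarize the data in single numbers, while the frequency table and its bars describe how the ages are spread.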
Inferential statistics: It uses a sample to infer things about the population the sample was drawn from. Because it works from a sample, it can only give an approximate idea of the real situation, not the exact answer. For example, if the diagram above describes a sample, we can infer that most people in the town are middle-aged. The sample tells us that this age is around 45 years, but had the entire census been used, the age might have been anywhere between 40 and 50 years. So it is only an estimate of how things really are in the town.
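One common way to express "around 45, but possibly a bit more or less" is an estimate with a margin of error. The post does not name a method, so the sketch below assumes a normal-approximation 95% interval for the mean, and the sample values are randomly generated rather than real survey data.

```python
import math
import random
from statistics import mean, stdev

random.seed(7)  # fixed seed so the sketch is repeatable

# Made-up sample of 100 ages centred near 45 (not real survey data).
sample = [random.gauss(45, 12) for _ in range(100)]

x_bar = mean(sample)                         # point estimate from the sample
se = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
margin = 1.96 * se                           # approx. 95% margin (normal approximation)
low, high = x_bar - margin, x_bar + margin

print(f"estimate: {x_bar:.1f}, plausible range: {low:.1f} to {high:.1f}")
```

The range is a statement about uncertainty: the sample gives an idea of the population value without pinning it down exactly.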
Types of Variables: