Statistics is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data including the planning of data collection in terms of the design of surveys and experiments.
Statistical Terminologies
The following statistical terms are used extensively:
- Data – facts, observations, and information that come from investigations.
- Measurement data – sometimes called quantitative data; the result of using some instrument to measure something (e.g., test score, weight).
- Categorical data – also referred to as frequency or qualitative data. Things are grouped according to some common property, and the number of members in each group is recorded (e.g., males/females, vehicle type).
- Variable – property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, etc.
- Discrete Variable – a variable that can take only a limited number of distinct values (e.g., gender: male/female).
- Continuous Variable – a variable that can take on many different values; in theory, any value between the lowest and highest points on the measurement scale.
- Independent Variable – a variable that is manipulated, measured, or selected by the user as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the effect.
- Dependent Variable – a variable that is not under the user’s control. It is the variable that is observed and measured in response to the independent variable.
- Outlier – an observation that is distant from the other observations in your dataset. It may be due to variability in the measurement, or it may indicate an experimental error, in which case you may want to exclude it from your dataset. Outliers can occur by chance in any distribution, but they often indicate either a measurement error or a population with a heavy-tailed distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct subpopulations or may reflect correct trials versus measurement errors. This situation is often modeled with a mixture model.
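One common rule of thumb for flagging outliers, which is not prescribed by the definitions above, is to mark any value lying more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below assumes that rule; the function name `find_outliers`, the multiplier `k`, and the sample values are made up for illustration.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR fence is a convention, not a law: a flagged value is a
    candidate for inspection, not proof of a measurement error.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical test scores with one suspicious entry.
scores = [10, 12, 11, 13, 12, 11, 14, 95]
print(find_outliers(scores))
```

Whether a flagged value should be removed still depends on its cause: a recording error can reasonably be dropped, while a genuine extreme value from a heavy-tailed population usually should be kept.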
Inferential Statistics
Inferential statistics, sometimes called analytical statistics, are more than descriptions. They are used to reach conclusions that go beyond the straightforward presentation of data, allowing you to make inferences about the characteristics of a population based on a sample.
Arguably, the Chi-square statistic is the best-known test for nominal data. The Chi-square test assesses statistical significance: how likely it is that results at least as extreme as those in your analysis would arise by chance alone if there were no real association. It uses nominal data to determine whether a relationship between two variables in a sample is likely to reflect a real association between those variables in the population as a whole. Simply put, it tests the goodness of fit between an observed and an expected distribution.
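As a rough illustration of the Chi-square test described above, the following sketch computes the statistic for a table of observed counts by comparing each count with the count expected if the two variables were unrelated. The function names and the 2x2 table are hypothetical, and the p-value helper is valid only for one degree of freedom (a 2x2 table); larger tables would need a statistics library or a Chi-square table.

```python
import math

def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_one_dof(statistic):
    """Upper-tail probability of a Chi-square variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical counts: rows are two groups, columns are two vehicle types.
observed = [[30, 20],
            [20, 30]]

statistic, dof = chi_square_independence(observed)
p_value = p_value_one_dof(statistic)
print(f"chi-square = {statistic:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests that the association seen in the sample is unlikely to be due to chance alone; it does not measure how strong the association is.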
Descriptive Statistics
Descriptive statistics, sometimes called enumerative statistics, are used to describe the basic features of data. They are simple summaries that state what the sample and the measures show about the data. Descriptive statistics are straightforward and easy to interpret. Paired with simple graphical analysis, they are the foundation of quantitative data analysis.
A frequency distribution is a list of the values that a variable takes in a sample, together with the number of times each value occurs. It can be used with nominal, ordinal, interval, and ratio data. Because frequency distributions are usually presented in graphic form, such as a bar chart or pie chart, they are useful for determining the nature and shape of your data.
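A frequency distribution can be built with a simple tally. The sketch below counts how often each value occurs, reports each count as a share of the sample, and prints a crude text bar chart; the vehicle-type observations and the function name `frequency_distribution` are invented for illustration.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, count, relative_frequency) tuples, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]

# Hypothetical nominal data: the type of each vehicle observed.
vehicle_types = ["car", "truck", "car", "bus", "car", "truck"]

for value, count, share in frequency_distribution(vehicle_types):
    # One '#' per observation gives a rough bar chart in plain text.
    print(f"{value:<6} {count:>2} {share:>6.1%} {'#' * count}")
```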
Population and sample data
A population is the complete collection of units that you want to study. The units can be people, places, objects, time periods, pharmaceuticals, procedures, and many other things. Much of statistics is concerned with understanding the numerical properties, or parameters, of an entire population, often by using a random sample drawn from that population. Defining a population means specifying a well-defined collection of objects or individuals that share a common or binding characteristic or trait that you are interested in. A sample is a collection of units from within the population, and sampling is the process of selecting a subset of units that is representative of the entire population. The sample needs to be large enough to warrant a statistical analysis.
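The relationship between a population and a random sample can be sketched in a few lines. The example below assumes a simulated population of 10,000 heights (the distribution, the sizes, the seeds, and the function name are all arbitrary choices for illustration) and draws a simple random sample without replacement.

```python
import random
import statistics

def draw_simple_random_sample(population, size, seed=None):
    """Draw `size` units without replacement; every unit is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, size)

# Simulated population: 10,000 heights in cm (illustrative values only).
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(10_000)]

sample = draw_simple_random_sample(population, 100, seed=1)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will generally be close but not identical, and the gap tends to shrink as the sample size grows, which is why the sample must be large enough for the analysis you intend.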
Population parameters
When dealing with population parameters it is important to understand several key quantities (mean, standard deviation, and variance) as well as the sample statistic that corresponds to each. The population mean is represented by the Greek letter mu (μ), and its sample statistic is x-bar (x̄). The standard deviation, which measures how far values typically spread to either side of the mean, is represented by lowercase sigma (σ), and its sample statistic is a lowercase s. The variance is the standard deviation squared, written sigma squared (σ²), and its sample statistic is a lowercase s squared (s²).
Parameters and statistics measure the same kinds of quantities, but they describe different things. A parameter describes the entire population and is usually unknown; a statistic is calculated from a sample and is used to estimate the corresponding parameter.
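The distinction between parameters and sample statistics shows up directly in how they are computed. The sketch below uses Python's standard `statistics` module on a tiny made-up population and a made-up sample taken from it: the population functions divide by N, while the sample functions divide by n - 1.

```python
import statistics

# A tiny, made-up population and a sample taken from it.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sample = [2, 4, 5, 9]

# Population parameters: mu, sigma squared, sigma (divide by N).
mu = statistics.fmean(population)
sigma_squared = statistics.pvariance(population)
sigma = statistics.pstdev(population)

# Sample statistics: x-bar, s squared, s (divide by n - 1).
x_bar = statistics.fmean(sample)
s_squared = statistics.variance(sample)
s = statistics.stdev(sample)

print(f"mu = {mu}, sigma^2 = {sigma_squared}, sigma = {sigma}")
print(f"x-bar = {x_bar}, s^2 = {s_squared:.3f}, s = {s:.3f}")
```

Dividing by n - 1 rather than n makes the sample variance an unbiased estimate of the population variance, which is why the two sets of functions give different results on the same data.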