The statistical process
This page discusses some of the basic ideas in statistical research.
- Why collect data
- How to collect data
- Analysis of the data
- Interpretation and presentation
- Check list
Data
Broadly speaking, there are two types of data collection: observational studies and experiments. For observational studies, we let the world take its course while we observe and record the issues of interest to us. With experiments we take an active hand and deliberately influence what is happening in the world while observing and recording the issues of interest.
If I were to measure and record how tall you are, I would produce a data point. If I were to record your height and age, I would have two data points.
Data points record something about an individual. An individual is the person or thing on which a measurement is made. For example, when I record your height, you are the individual that I am measuring. If I were to record the length of my pet cockroach, the cockroach is the individual I am measuring. If I were to record total rainfall in January this year, January this year would be the individual that I am measuring.
When we make the same measurements across a number of different individuals, we produce a data set. For example, I could produce data sets by:
- recording the heights and ages of every student in your school;
- recording the lengths of 100 cockroaches of the same species; or
- recording the total rainfall each month for 12 months.
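To make the terms concrete, here is a minimal sketch of a data set in Python. The student labels and the height and age values are invented for illustration; they are not real measurements. Each record is one individual, and each value recorded for that individual is one data point.

```python
# Hypothetical data set: each dict is one individual (a student) and each
# measurement key holds one data point. All values are invented.
students = [
    {"student": "A", "height_cm": 152, "age_years": 12},
    {"student": "B", "height_cm": 160, "age_years": 13},
    {"student": "C", "height_cm": 148, "age_years": 12},
]

def count_data_points(data_set, variables):
    """Count one data point per individual per recorded variable."""
    return sum(1 for row in data_set for v in variables if v in row)

n_individuals = len(students)
n_points = count_data_points(students, ["height_cm", "age_years"])
print(n_individuals, "individuals,", n_points, "data points")
```

Recording height and age for three students gives two data points per individual, so six data points in total.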
Why collect data?
We learn by observing the world. We learn, for example, that most things we drop fall to the floor, not the ceiling. We learn which types of clouds are rain clouds and which are just fluff balls. We learn to tell when the teacher is really angry and when she is just trying to keep things under control. All of these things we learn by remembering what happened in the past and applying this knowledge to the present.
Data is just another way of recording previous experiences. It allows us to do more complex learning.
Consider this. You know that if you leave an egg out of the fridge, it'll go off. You know that from previous learning and experience. But you may not know how long it will take to go off. You probably don't have a lot of previous experience in timing eggs going rotten. Without that previous experience it's hard to estimate how long that process will take.
If we really wanted an answer to that question, we'd need some experience. That would mean we'd have to deliberately leave an egg out of the fridge and time how long it takes to go rotten.
Of course, we'd probably need to do it more than once to check that the result is consistent from egg to egg. We may also need to take into account other factors like temperature and location. Different temperatures and different locations could change how long the eggs take to go off.
It's getting pretty complex already. We're going to need to keep an eye on multiple eggs at multiple locations and multiple temperatures... We'll really need to start writing down the details of each egg before it all gets hopelessly confusing.
The details we record for each egg are, of course, our data set. The variables in the data set are the location of the egg, the temperature and the time until the egg goes rotten.
The role of the data set is to provide an accurate record of all of the observations (experience) we are obtaining so that we can use this experience to learn new knowledge and answer complex questions.
How to collect data
Data is usually collected from a sample of all possible individuals. That is, to investigate how long it takes for eggs to go rotten, we look at a limited number of eggs. It costs money to buy eggs and people want the eggs to eat; they don't want us making them all rotten!
A sample is a small group of individuals used to represent some wider population. For the sample to give reasonably accurate information about the population of interest, the sample needs to be as similar as possible to that population. For example, even though small eggs are cheaper, we can't only use small eggs in our study. Small eggs may go off faster or slower than larger eggs. If we only studied small eggs, our results may not be applicable to the whole population.
We should try to choose our samples at random. That is, give every individual in the population the same chance of being included in the sample. The problem with eggs is that they usually come pre-sorted into different sizes. To obtain a random sample, we'd really need to get them from the farm, before they were sorted (and where we could be sure they were all of equal freshness).
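As a rough sketch of what "the same chance of being included" means in practice, the Python below draws a simple random sample. The population of 200 numbered eggs, the sample size of 12 and the helper name `simple_random_sample` are all assumptions made up for this illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw a sample without replacement; every individual is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: 200 unsorted eggs, identified by number.
eggs = [f"egg-{i}" for i in range(1, 201)]

sample = simple_random_sample(eggs, 12, seed=1)
print(sample)
```

Letting the computer pick the eggs removes the temptation to choose the ones that are cheapest or closest to hand.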
Be careful of measurement bias too. Measurement bias occurs when we make the same error in all our measurements. In the egg example above, measurement is a potential problem. How do we define when an egg is rotten? How do we measure this accurately? Eggs that go rotten will eventually float in water, but they start to taste rotten before they start to float. If we wait for them to float we may consistently overestimate the time it takes for eggs to go rotten.
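The effect of making the same error in every measurement can be sketched in a few lines of Python. The times below are invented, and the constant three-day gap between tasting rotten and floating is purely an assumption for illustration; a real gap would vary from egg to egg.

```python
from statistics import mean

# Hypothetical true times (in days) until each egg first tastes rotten.
true_days = [20, 22, 25, 21, 22]

# Assumed constant delay (in days) between tasting rotten and floating.
FLOAT_LAG_DAYS = 3

# Waiting for the egg to float adds the same error to every measurement.
measured_days = [d + FLOAT_LAG_DAYS for d in true_days]

bias = mean(measured_days) - mean(true_days)
print("average overestimate in days:", bias)
```

Measuring more eggs in the same way would not help: the average stays too high by the same amount, because every measurement carries the same error.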
Analysis of data
Data often contains too much information or detail. We are usually interested in the overall characteristics rather than the details of every individual member. To obtain overall results from individual detail, data must be summarised.
A number of different tools can be used to summarise data: some give numeric summaries (called statistics) and some give graphical summaries. Statistics provide less detailed information and are easier to compare between groups. Graphs can contain more detailed information in a more easily understood form.
The choice of which summary is correct for any situation depends on what variables are to be analysed. Broadly speaking we can divide variables into two types: categorical and quantitative. A quantitative variable is measured on some mathematical measurement scale; it covers all variables that you measure or count, like kilograms for weight or millimetres for rainfall. A categorical variable has categories which don't have any mathematical relationship between them, like your town of birth. For example:
- 10 cm + 15 cm = 25 cm (centimetres is a mathematical scale)
- 1 defect + 3 defects = 4 defects (number of defects is a mathematical scale)
- blue eyed + green eyed = cross eyed (eye colour is not a mathematical scale)
- male + female = ?? (definitely not measured on a mathematical scale)
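The same distinction shows up as soon as we try to summarise each type of variable. In the Python sketch below, the rainfall and eye colour observations are invented for illustration. Arithmetic makes sense for the quantitative variable, while the categorical variable can only be counted.

```python
from collections import Counter

# Hypothetical observations, invented for illustration.
rainfall_mm = [10, 15, 0, 5]                      # quantitative
eye_colour = ["blue", "green", "blue", "brown"]   # categorical

# Quantitative values sit on a mathematical scale, so adding them is meaningful.
total_rainfall = sum(rainfall_mm)

# Categories have no scale, so we count how often each one occurs instead.
counts = Counter(eye_colour)
proportions = {colour: n / len(eye_colour) for colour, n in counts.items()}

print(total_rainfall, dict(counts), proportions)
```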
The following table summarises some descriptive analysis techniques we recommend. These techniques are described in Appendix B of the Teacher's Pack.
|Variable||Numerical summary||Graphical summary|
|Single categorical variable||Frequency count; table of proportions||Bar chart|
|Single quantitative variable||Mean and standard deviation; median and inter-quartile range||Stem and leaf plot|
|Two categorical variables||Two-way table; row or column percentages|| |
|Two quantitative variables|| ||Scatterplot|
|A categorical variable and a quantitative variable|| ||Side-by-side boxplots|
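As a small illustration of the numerical summaries listed for a single quantitative variable, the sketch below uses Python's standard `statistics` module. The times are invented, and the function name `summarise_quantitative` is our own. Quartiles can be calculated by several slightly different conventions; this sketch simply uses the module's default.

```python
import statistics

def summarise_quantitative(values):
    """Return mean, standard deviation, median and inter-quartile range."""
    # quantiles(..., n=4) returns the three quartile cut points
    # (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical times (in days) for seven eggs to go rotten.
days_to_rot = [18, 20, 21, 22, 24, 25, 31]

summary = summarise_quantitative(days_to_rot)
print(summary)
```

The median and inter-quartile range are less affected by an unusually large value, such as the 31 here, than the mean and standard deviation are.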
Interpretation and presentation
There is rarely a correct answer in statistics. The focus is on appropriate analysis and correct interpretation of the collected data. If you find some inaccuracy or bias in your data set, bring this out and discuss it. The onus is on the readers to decide whether or not your findings are of interest to them. The onus is on you to assist the reader to make that decision by discussing the likely strengths and weaknesses of your research.
No study is perfect, even among professional researchers. You show your expertise more by identifying the weaknesses of your own study than by believing you have achieved perfection.
The competition's posters present the story of the research project.
The important issues for the reader are:
To ensure you have the best possible poster, check the list of what the judges are looking for in this Competition.