Exploring Data Science
The process data scientists use to analyze data tends to follow a recurring structure. It typically consists of four stages:
- Asking a question is the starting point for any analysis, and it guides the rest of the process. The question can be very specific, such as "are global temperatures increasing over time?", which happens to be the question we'll work with in this chapter. We can also work with vaguer questions, which arise when we have a data set but only a rough idea of the patterns or associations we expect to find in it.
- Exploratory data analysis occurs when we obtain data, clean it up, and check that it makes sense. Don't assume that this is trivial; it is often the most involved part of the entire process. We may have to sift through many sources of data before we find ones that address our question. Real-world data is often messy and incomplete, and it requires significant transformation before it is in a form we can draw reliable conclusions from.
- Inference
- Prediction
I'm assuming that you have little or no experience with statistics or data science. If so, this part of the book will introduce a number of new concepts. It's a lot easier to understand new ideas when we have concrete examples of them, so in this chapter we will work through a complete example of the data science lifecycle. This will ground the concepts that we will go through in more detail in later chapters.
We are going to analyze climate data. We are all familiar with the weather, and climate is nothing more than the weather taken over time and space. Climate data is also easy to obtain, with records going back hundreds of years in some locations. The data we are using is freely available and is averaged over the Northern Hemisphere. You may wish to find data for your local area and perform a similar analysis on it. Finally, climate change is topical, and analyzing climate data lets us take our own look at the issue.
Enough background; let's get on with it!