Initial Investigations

Now that we have data, the natural next step is to perform some initial investigations. For example, how many measurements do we have?

data.size
// res0: Int = 2082

This tells us we have 2082 records, one for each month from January 1850 to June 2023. This isn't a huge amount, but it's certainly too much to analyze just by looking at it. We'll see some approaches to analyzing the data in just a moment, but for now let's continue exploring.
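
As a quick sanity check, the count is what we would expect from one record per month. The sketch below simply redoes the arithmetic; the names are ours, and it assumes the series covers every month of 1850 through 2022 plus the first six months of 2023.

```scala
// Complete years in the series: 1850 through 2022 inclusive.
val fullYears = 2023 - 1850

// Months recorded in the final, partial year (an assumption based on
// the last record having month = 6).
val monthsInFinalYear = 6

// One record per month.
val expectedRecords = fullYears * 12 + monthsInFinalYear
// expectedRecords: Int = 2082
```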

Perhaps the next obvious step is to look at some of the data. Here's the first element.

data.head
// res1: Record = Record(
//   year = 1850,
//   month = 1,
//   anomaly = -0.67456436,
//   lower = -0.98177195,
//   upper = -0.3673568
// )

This tells us that this element refers to January 1850, when the average temperature was approximately 0.67C below the 1961-1990 baseline, with lower and upper error bounds of approximately -0.98C and -0.37C. (How do we know the meaning of these fields? By reading the documentation, in particular the linked paper.)

We can also look at the last element.

data.last
// res2: Record = Record(
//   year = 2023,
//   month = 6,
//   anomaly = 1.0509841,
//   lower = 0.9951409,
//   upper = 1.1068273
// )

Here we have information for June of 2023, and the temperature is now above the baseline. The difference between the first and last measurements supports the idea that temperatures are rising over time. However, we could quite rightly suggest this may just be a coincidence. What do the measurements between these points show? Looking at the same month from every year is likely to still be too much to read, but we could look at the same month from each decade.

val decades = data.filter(r => r.year % 10 == 0 && r.month == 6)
// decades: ArraySeq[Record] = ArraySeq(
//   Record(
//     year = 1850,
//     month = 6,
//     anomaly = -0.34424013,
//     lower = -0.60947233,
//     upper = -0.079007894
//   ),
//   Record(
//     year = 1860,
//     month = 6,
//     anomaly = -0.21145956,
//     lower = -0.45290247,
//     upper = 0.029983336
//   ),
//   Record(
//     year = 1870,
//     month = 6,
//     anomaly = -0.23123583,
//     lower = -0.4569157,
//     upper = -0.005555956
//   ),
//   Record(
//     year = 1880,
//     month = 6,
//     anomaly = -0.38053036,
//     lower = -0.57394856,
//     upper = -0.18711212
//   ),
//   Record(
//     year = 1890,
//     month = 6,
//     anomaly = -0.49480304,
//     lower = -0.6899875,
//     upper = -0.2996186
//   ),
//   Record(
//     year = 1900,
//     month = 6,
//     anomaly = -0.21710625,
//     lower = -0.4151495,
//     upper = -0.019062985
//   ),
//   Record(
//     year = 1910,
//     month = 6,
//     anomaly = -0.5593434,
//     lower = -0.7473471,
//     upper = -0.37133965
// ...

With only 18 measurements, this is more manageable. Overall, the data does seem to show a trend of increasing temperatures, but once again we've made some rather arbitrary choices: to only sample each decade and to choose June as our month of interest within the decade. At this point we have, however, gained some knowledge of our data. We know how many measurements we have, the time span they cover, and what each measurement contains. We're lucky that we're using carefully prepared data, and as such we don't have any missing or erroneous measurements. Usually things will not be so simple.
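
With less carefully prepared data we would want to check for such problems ourselves. Here is a minimal sketch of two checks we might run. The Record class is a stand-in for the one used in this chapter (we assume Double fields), and withinBounds, monthIndex, and hasNoGaps are names we made up for illustration.

```scala
// Stand-in for the chapter's Record; the field types are an assumption.
final case class Record(
    year: Int,
    month: Int,
    anomaly: Double,
    lower: Double,
    upper: Double
)

// True when the anomaly lies within its own lower and upper bounds.
def withinBounds(r: Record): Boolean =
  r.lower <= r.anomaly && r.anomaly <= r.upper

// Number months consecutively, so adjacent months differ by exactly one.
def monthIndex(r: Record): Int =
  r.year * 12 + (r.month - 1)

// True when the records are in order with no month skipped or repeated.
def hasNoGaps(records: Seq[Record]): Boolean =
  records.map(monthIndex).sliding(2).forall {
    case Seq(a, b) => b == a + 1
    case _         => true // zero or one record: nothing to compare
  }
```

With these in scope, data.forall(withinBounds) and hasNoGaps(data) should both be true if the data is as clean as we believe.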

In the next section we'll look at how we can more systematically analyze the data. Before we get there, however, it's time for you to do a bit of analysis on your own.

Exercise: Shall I compare thee to a summer's day?

In this chapter we're learning about data analysis, but we're also learning how to work with collections of data such as List.

When we selected data by decades, we rather arbitrarily chose June as our month of interest. Write code that instead selects data from January. Do you still see a similar trend?

This is a small modification of the original code. Instead of looking for r.month == 6 we look for r.month == 1, which is the numeric code corresponding to January.

val januaryByDecades = data.filter(r => r.year % 10 == 0 && r.month == 1)
// januaryByDecades: ArraySeq[Record] = ArraySeq(
//   Record(
//     year = 1850,
//     month = 1,
//     anomaly = -0.67456436,
//     lower = -0.98177195,
//     upper = -0.3673568
//   ),
//   Record(
//     year = 1860,
//     month = 1,
//     anomaly = -0.39058298,
//     lower = -0.6584608,
//     upper = -0.12270518
//   ),
//   Record(
//     year = 1870,
//     month = 1,
//     anomaly = -0.21106681,
//     lower = -0.4358849,
//     upper = 0.013751276
//   ),
//   Record(
//     year = 1880,
//     month = 1,
//     anomaly = -0.39386344,
//     lower = -0.60645384,
//     upper = -0.181273
//   ),
//   Record(
//     year = 1890,
//     month = 1,
//     anomaly = -0.53390056,
//     lower = -0.68983096,
//     upper = -0.37797013
//   ),
//   Record(
//     year = 1900,
//     month = 1,
//     anomaly = -0.5065199,
//     lower = -0.6711187,
//     upper = -0.34192112
//   ),
//   Record(
//     year = 1910,
//     month = 1,
//     anomaly = -0.4327842,
//     lower = -0.5747264,
//     upper = -0.29084206
// ...

The trend is not exactly the same as before, but it is similar enough.
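
The two filters differ only in the month they select, so if we wanted to try other months we could make the month a parameter. This is one possible sketch, not the only way to do it. Record is a stand-in for the chapter's class (we assume Double fields), and byDecade is a hypothetical name.

```scala
// Stand-in for the chapter's Record; the field types are an assumption.
final case class Record(
    year: Int,
    month: Int,
    anomaly: Double,
    lower: Double,
    upper: Double
)

// Keep the records for the given month (1 = January) in years
// that are divisible by ten.
def byDecade(records: Seq[Record], month: Int): Seq[Record] =
  records.filter(r => r.year % 10 == 0 && r.month == month)
```

Calling byDecade(data, 1) then selects the same records as januaryByDecades, and byDecade(data, 6) the same records as decades.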

Exercise: Statistics is the Grammar of Science

Can you think of other ways we could analyze the data to see if there is a difference in temperature over time? This is an open-ended question; any answers are good answers!

There are no right and wrong answers to this. Here are just a few ideas:

  • The main data point is the divergence from the baseline of 1961-1990. So perhaps we could sum the divergences before that period and compare them to the sum of divergences after that baseline? We would have to normalize the sums by the number of years, as there are more years before the baseline than after it. So in effect we would compare average divergence before and after the baseline. We could also compute averages by month, and compare those, if we suspect the divergence changes by month.

  • We could plot the data, which allows us to easily see trends in the data that we cannot see when reading numbers.
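
The first idea above can be sketched in a few lines. This is only one way to do it. Record is a stand-in for the chapter's class (we assume Double fields), and meanAnomaly, meanBefore, meanAfter, and meanByMonth are names we made up for illustration.

```scala
// Stand-in for the chapter's Record; the field types are an assumption.
final case class Record(
    year: Int,
    month: Int,
    anomaly: Double,
    lower: Double,
    upper: Double
)

// Mean anomaly of the given records, or None if there are no records.
def meanAnomaly(records: Seq[Record]): Option[Double] =
  if (records.isEmpty) None
  else Some(records.map(_.anomaly).sum / records.size)

// Mean anomaly over the years before the 1961-1990 baseline period.
def meanBefore(records: Seq[Record]): Option[Double] =
  meanAnomaly(records.filter(_.year < 1961))

// Mean anomaly over the years after the 1961-1990 baseline period.
def meanAfter(records: Seq[Record]): Option[Double] =
  meanAnomaly(records.filter(_.year > 1990))

// Mean anomaly for each calendar month (1 = January).
// groupBy never produces an empty group, so the division is safe.
def meanByMonth(records: Seq[Record]): Map[Int, Double] =
  records
    .groupBy(_.month)
    .view
    .mapValues(rs => rs.map(_.anomaly).sum / rs.size)
    .toMap
```

Dividing by the number of records is the normalization step: it lets us compare the two periods even though they contain different numbers of measurements. To compare months before and after the baseline, we could filter the data by year first and then call meanByMonth on each part.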