1.3 Types of data
After taking measurements (e.g. measuring the temperature of patients), an analyst typically obtains data. According to the Cambridge Dictionary, data is “information, especially facts or numbers, collected to be examined and considered and used to help decision-making” (“Data,” 2022). Data can be unprocessed (raw), containing a disorganised array of recordings about the process under consideration, or it can be cleaned, having a proper structure with correctly encoded variables and no obvious mistakes or missing values. When first dealing with data, an analyst needs to clean it and transform it into a usable format before extracting useful information from it or applying models.
The process of data cleaning depends on a variety of factors, most importantly on what the analyst wants to do. For example, if an analyst is interested in getting insights about daily A&E attendance and has records of every patient arriving at A&E, the first step would be to aggregate those records into daily buckets, obtaining the number of A&E arrivals per day. As another example, if an analyst conducts a survey to find the brand preferences of the citizens of Lancaster, then after collecting the survey data they need to transform some of the questions into appropriate variables to be able to work with them (e.g. a multiple-choice question with several options needs to be transformed into a set of variables, one for each option).
A question closely related to data cleaning is what type of data the analyst is working with. There are three fundamental types:
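The two cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamps, the brand names and the variable names (`arrivals`, `options`, `answers`) are all hypothetical, and real data would need more careful handling.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw records: one timestamp per patient arriving at A&E.
arrivals = [
    "2022-03-01 08:15",
    "2022-03-01 13:40",
    "2022-03-01 23:05",
    "2022-03-02 02:30",
    "2022-03-02 17:55",
]

# Aggregate individual arrivals into daily buckets (number of arrivals per day).
daily_counts = Counter(
    datetime.strptime(stamp, "%Y-%m-%d %H:%M").date().isoformat()
    for stamp in arrivals
)

# Hypothetical multiple-choice question: the brands each respondent selected.
options = ["BrandA", "BrandB", "BrandC"]
answers = [
    {"BrandA", "BrandC"},  # respondent 1
    {"BrandB"},            # respondent 2
    set(),                 # respondent 3 selected nothing
]

# Transform the question into one 0/1 variable per option for every respondent.
dummies = [
    {option: int(option in chosen) for option in options}
    for chosen in answers
]

print(dict(daily_counts))
print(dummies)
```

In both cases the raw recordings are kept unchanged, and the cleaned version is derived from them, so the transformation can be repeated if the aim of the analysis changes.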
- Cross-sectional data;
- Time series;
- Panel data.
In the example above, cross-sectional data is the data collected by conducting a survey in a specific location over a fixed period of time. We end up with the answers to the questions (and thus the values of variables) of different respondents at a fixed time. Mathematically, we denote these observations with the index \(j\), which separates, for example, one respondent from another, and in this textbook we will use the letter \(n\) to denote the number of elements (respondents) in our sample. So, \(y_j\) denotes the value of a variable for the \(j^{\mathrm{th}}\) respondent.
Time series data is typically measured for one and the same object over time. In the example above, the A&E arrivals form a time series, where we observe a value (the number of patients arriving) over time (daily). Mathematically, we denote observations over time with the index \(t\), separating, for example, one day from another. The last available observation will be denoted with the capital letter \(T\). In our notation, the variable \(y_t\) contains the value at a specific moment in time.
Remark. In time series, the observations typically do not happen at random: for example, the number of A&E arrivals will depend on the time of day, day of week and month of year. This is an important characteristic of this specific type of data, and we will come back to it later in this textbook.
Finally, in some situations we might be able to measure data for several objects over time. For example, we could have daily A&E arrivals in several hospitals. This type of data is called panel data, and in this situation we use both indices \(j\) and \(t\), ending up with a variable \(y_{j,t}\), showing, for example, the number of patients arriving at a specific hospital at a specific moment in time.
In this textbook we will focus on cross-sectional data first and then move on to time series. We will also briefly touch on panel data models, but will not discuss them in detail, as such data becomes available to analysts less often than the other two types.
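As a rough illustration of the three data types and their indices, the sketch below stores each of them in a simple Python structure. All numbers and helper names (`cross_value`, `series_value`, `panel_value`) are hypothetical. Note that the text counts from 1, while Python lists count from 0, hence the `- 1` in the helpers.

```python
# Cross-sectional data: y_j for respondents j = 1, ..., n (hypothetical survey scores).
y_cross = [7, 4, 9, 6]
n = len(y_cross)

# Time series: y_t for days t = 1, ..., T (hypothetical daily A&E arrivals).
y_time = [120, 135, 128, 140, 150]
T = len(y_time)

# Panel data: y_{j,t} for hospital j on day t (hypothetical values).
y_panel = [
    [120, 135, 128],  # hospital j = 1
    [80, 95, 90],     # hospital j = 2
]


def cross_value(j):
    """Return y_j using the 1-based index of the text."""
    return y_cross[j - 1]


def series_value(t):
    """Return y_t using the 1-based index of the text."""
    return y_time[t - 1]


def panel_value(j, t):
    """Return y_{j,t}: object j observed at time t (both 1-based)."""
    return y_panel[j - 1][t - 1]


print(cross_value(n))     # value for the last respondent, y_n
print(series_value(T))    # last available observation, y_T
print(panel_value(2, 3))  # hospital 2 on day 3, y_{2,3}
```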
We have already used the term “variable” several times in this chapter, assuming that the reader is familiar with it. In mathematics, a variable is a symbol that represents any of a set of potential values. In this textbook, we will encounter several types of variables. We will work with a response variable, a placeholder for the quantity of main interest in our analysis, which is formed by some assumed mechanism. It will be denoted with the letter \(y\). We will also work with explanatory variables, which are supposed to explain how the response variable is formed and are denoted using the letter \(x\) with optional subscripts, e.g. \(x_1\), \(x_2\) etc., representing the first, the second etc. variable. In some cases, we will also use the terms “exogenous” and “endogenous” variables, where the former means a variable that is formed on its own and is not impacted by any of the variables under consideration, while the latter is a variable that is formed by a combination of the variables under consideration. Sometimes the terms “response” and “endogenous” are used as synonyms, as are the terms “explanatory” and “exogenous”.
The basic model with one response variable \(y_j\) and one explanatory variable \(x_j\) can be written as: \[\begin{equation*} y_j = \beta_0 + \beta_1 x_j + \epsilon_j , \end{equation*}\] where \(\beta_0\) and \(\beta_1\) are the parameters of the model and \(\epsilon_j\) is the error term. This model is discussed in more detail in Chapter 10.
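To make the "assumed mechanism" more tangible, the following sketch generates a response variable from the basic model and then recovers the parameters from the generated data. Everything specific here is an assumption made for illustration: the parameter values, the sample size, the uniform distribution of \(x_j\), the normal distribution of \(\epsilon_j\), and the use of the standard least squares formulas as one possible way of estimating \(\beta_0\) and \(\beta_1\).

```python
import random

random.seed(42)  # fixed seed so that the example is reproducible

# Assumed "true" parameters of the mechanism (hypothetical values).
beta0, beta1 = 2.0, 0.5
n = 200  # number of observations in the sample

# Explanatory variable x_j and error term epsilon_j (assumed distributions).
x = [random.uniform(0, 10) for _ in range(n)]
epsilon = [random.gauss(0, 1) for _ in range(n)]

# Response variable formed by the model: y_j = beta0 + beta1 * x_j + epsilon_j.
y = [beta0 + beta1 * x_j + e_j for x_j, e_j in zip(x, epsilon)]

# Least squares estimates for a model with one explanatory variable.
x_mean = sum(x) / n
y_mean = sum(y) / n
b1 = (
    sum((x_j - x_mean) * (y_j - y_mean) for x_j, y_j in zip(x, y))
    / sum((x_j - x_mean) ** 2 for x_j in x)
)
b0 = y_mean - b1 * x_mean

print(f"estimated intercept: {b0:.3f}, estimated slope: {b1:.3f}")
```

The estimates should land near the assumed values of 2.0 and 0.5, but not exactly on them, because the error term adds randomness to every observation.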