MEASURING TREATMENT EFFECTS OF ONLINE VIDEOS ON ACADEMIC PERFORMANCE USING DIFFERENCE-IN-DIFFERENCES ESTIMATIONS

Supplementing student learning with online videos has produced mixed results with respect to improving academic performance. This study proposed that these mixed results, before COVID-19, are largely due to most studies of online videos being observational and not taking confounding factors into account. This study applied the difference-in-differences (DID) technique, which measures treatment effects in observational studies. Using student grades data from an engineering mechanics course, the treatment effects of videos were measured 1) on the entire group that used the videos, 2) on groups defined by initially failing or passing grades, and 3) on groups defined by grade percentage ranges. It was found that the videos had no effect on the entire group but did significantly affect those with initial failing grades, specifically grades of 40% to 49%. A main finding was that initial grades are an indication of how effective the online videos are in improving grades.


INTRODUCTION
The use of supplementary online videos to improve student learning promises much but, in the past, before COVID-19, produced mixed results. For example, the literature shows videos having a positive impact on performance [1,2,3], a negative impact [4,5], and no impact [6,7]. The mixed results have been attributed to different learning strategies [5], passive learning [7], attending fewer face-to-face lectures [1,4,5], and videos potentially not being effective across all fields of study [3].
Although these may be factors, this study proposes that a reason for the mixed results is that most of the studies are observational [8], not randomised control trials, and therefore were not measured using the appropriate methods. Unlike randomised control trials, participants in observational studies are not randomly assigned to the control and treatment groups. In these studies, students self-selected to participate in an intervention (such as watching online videos). This potentially results in a selection bias [9,10] and thus in biased treatment effects.
To measure the treatment effect of an observational study such as an academic intervention (in our case, watching online videos), it is crucial that appropriate measurement techniques be used. For observational studies, if the treatment group (the group that watched videos) had overall better (or worse) results than the control group (the group that elected not to watch videos), it cannot be concluded that the intervention caused a better (or a worse) performance, owing to the self-selection. It may be caused by confounding features, such as the treatment group students being more diligent and harder working or more aware of their failing grades. For this reason, the grades of groups that use online videos cannot simply be compared with groups that do not use online videos.
An example is Traphagan et al., who, in their observational study, first found that those watching videos performed worse than the no-videos group. Thereafter they controlled for grade point average (GPA) and found that the two groups performed similarly [4].
Based on these findings, and because the authors had only grade information for the case study that follows, it was proposed to use the difference-in-differences (DID) method to measure the treatment effect on student performance of the videos as an intervention. To the authors' knowledge, this method has thus far not been used in the literature to measure the effectiveness of academic interventions. It is proposed because it does not need a large number of features that describe the students: it simply requires the grades of the students before and after the intervention.

LITERATURE REVIEW
According to Muller et al. the belief that multimedia will improve learning has been around for almost a hundred years [11]. Even before COVID-19, the area of multimedia that saw the most growth was reported to be online education videos [12]. Although the use of videos promises much, the literature reports that their impact on student learning varied. A major reason for videos not delivering what they promise is attributed by the literature [5,7,13] to superficial or passive learning.
The attitude [14], motivation [15,16], and attention [17] of the student have also been shown to be important in learning from multimedia. When investigating why videos are or are not effective, the abovementioned reasons are certainly valid ones.

Observational studies
In general, there are two types of study to measure intervention or treatment effectiveness. Observational studies are those in which the participants (i.e., students) self-select to use the intervention (in this case, videos) [8]. This contrasts with randomised controlled trials, in which participants are randomly assigned to either a control or a treatment group. Although randomised controlled trials are seen as the 'gold standard' for evaluating the efficacy of an intervention, ethical issues may arise from randomly assigning students to watch or not watch videos [8].
Furthermore, the benefit of observational studies is that the natural behaviour of participants can be observed; i.e., participants will act naturally when using the intervention, and not be affected by being monitored in a study setting. Therefore, simply allowing the students to self-select to watch the videos is seen as the appropriate way to provide this intervention.
Self-selection, however, generally produces biased results when measuring the effectiveness of the treatment (referred to as 'treatment effects' hereafter), owing to confounding factors.

Previous case studies on online video performance
In an observational study of the effects of online videos on performance, Traphagan et al. used simple correlation to relate the amount of video watching to performance [4]. The problem with regression and correlation studies is that they essentially measure correlation, not causal (treatment) effects [7].
When measuring treatment effects in observational studies, the aim should be to balance out the features so that the treatment group and the control group have balanced features. The aim is to create a counterfactual group to compare with the treatment group. For observational studies, the main way of carrying this out is to use propensity score matching [18]; this is beneficial if the study has a large range of features describing the participants and the effects are measured using cross-sectional data. In the absence of a range of descriptive features, the difference-in-differences (DID) method can be used, which uses data captured over time.

Difference-in-differences
The major benefit of DID is that it can measure treatment effects without needing a large range of features; it simply requires data captured over time. DID takes into account changes in outcome (e.g., grades) in order to measure effects. The intuition behind DID can be explained as follows. Consider measuring the change in grades (positive or negative) of the students who self-selected to watch videos (the treatment group). After watching the videos, it could erroneously be thought that the change in grades was due to the videos (the intervention). However, to measure the videos' effect properly, there needs to be a counterfactual group to compare against, i.e., a similar group that did not receive the intervention (the control group).
The counterfactual control group's average change in grades should be measured and compared with the treatment group's change in grades. Whenever treatment effects are measured, there is the requirement to compare a treatment group with a counterfactual group. The control group is the counterfactual group that aims to explain what the treatment group would have obtained had they not received the treatment.
DID assumes that only the difference in the change in grades between the treatment group and the control group needs to be compared. This is based on the parallel trends assumption [19,20], which states that the treatment (videos) group, in the absence of the treatment, would have had the same change in grades as the control group (no-videos) over the time period being considered.
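As a purely hypothetical illustration (these numbers are not from this study): suppose the videos group's average grade rises from 45% at mid-year to 52% at year-end (a change of $+7$ percentage points), while the no-videos group's average rises from 50% to 54% ($+4$) over the same period. Under the parallel trends assumption, the videos group would also have gained $+4$ without the intervention, so the DID estimate of the treatment effect is $7 - 4 = 3$ percentage points.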
The DID method aims to remove or control for the effects of selection bias. In the literature there are examples that use this technique to measure important concerns such as the effect that raising the minimum wage has on employment [21], estimating the impact of training programmes on earnings [22], and how disability benefits affect time off work after injury [23].
Equation (1) provides an intuition for the treatment effect, or DID [24]:

$$D = (\bar{y}_{tf} - \bar{y}_{ti}) - (\bar{y}_{cf} - \bar{y}_{ci}) \quad (1)$$

where $D$ is the DID value, and $\bar{y}_{tf}$ and $\bar{y}_{ti}$ refer respectively to the average grade at the end and at the beginning of the time span for the treatment group. Similarly, $\bar{y}_{cf}$ and $\bar{y}_{ci}$ refer respectively to the average grade at the end and at the beginning of the time span for the control group. The equation shows that the change in the control group's average grade from the beginning to the end is subtracted from the change in the treatment group's average grade over the same time.
Although Equation (1) is a simple way to calculate the DID value, p-values must also be calculated to determine whether these values are statistically significant. For this, ordinary least squares (OLS) regression is used, based on Equation (2) [23,24,25]:

$$y = \beta_0 + \beta_1 T + \beta_2 S + \beta_3 (T \times S) + \varepsilon \quad (2)$$

where $y$ refers to the grade (either mid-year or year-end); $T$ is a binary variable indicating whether the grade is at the start or the end of the time period (mid-year, $T = 0$; year-end, $T = 1$); and $S$ is a binary variable indicating whether the student received treatment (no-videos, $S = 0$; videos, $S = 1$). Finally, $T \times S$ is an interaction variable. In terms of the coefficients, $\beta_0$ refers to the control group's initial grade; $\beta_1$ indicates how much the control group's grade changes over the time span of the intervention; $\beta_2$ is the difference between the initial grades of the treatment group and the control group; and $\beta_3$ is the DID coefficient, which estimates the treatment effect.
The p-value represents the likelihood of the statistical significance of evidence between the predictor and the response [26]. To determine whether the OLS regression model provides a $\beta_3$ coefficient that accurately predicts the treatment effect on student performance, the p-value for the two-sided t-test is calculated. In this case, the null hypothesis is that the $\beta_3$ coefficient is zero [25,26], which means that the treatment (watching the videos) had no effect on grades.
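As a minimal sketch of how Equation (2) could be estimated in practice, assuming long-format data with one row per student per time point (the column names and numbers below are illustrative, not from this study's data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per student per time point.
# T: 0 = mid-year, 1 = year-end. S: 0 = no-videos, 1 = videos.
df = pd.DataFrame({
    "grade": [48, 55, 52, 56, 45, 54, 50, 53, 47, 58, 49, 52],
    "T":     [0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1],
    "S":     [1,  1,  0,  0,  1,  1,  0,  0,  1,  1,  0,  0],
})

# OLS with an interaction term: grade = b0 + b1*T + b2*S + b3*(T*S) + error.
model = smf.ols("grade ~ T + S + T:S", data=df).fit()

print(model.params["T:S"])   # b3: the DID estimate of the treatment effect
print(model.pvalues["T:S"])  # two-sided t-test p-value for H0: b3 = 0
```

The coefficient on the interaction term corresponds to $\beta_3$ in Equation (2), and its p-value is the two-sided t-test described above.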

RESEARCH METHODOLOGY
In this study, grades from a first-year engineering mechanics course in 2017 and in 2018 were used to analyse the actual video watch time as a predictor of student performance. The course had an estimated 1 000 students registered per year. Supplementary videos, each typically between five and ten minutes long, were recorded to cover both conceptual and practice content.
The analysis focused on comparing the students' video watch time for the second semester of the course. The reason for focusing on the second-semester grades was that no videos were available in the first semester of 2017, and students already had a starting grade to serve as the reference against which changes in grades could be compared in the difference-in-differences statistics. In our case, the mid-year grades (after the July exam), calculated from the Semester 1 assessments, were used as the starting grades for the DID comparisons. In 2017, a total of 80 videos were made available for Semester 2, and in 2018, 129 videos. The video hosting platform, Panopto™ [27], was configured to record the exact number of minutes that each student watched each video.
Because the students self-selected to watch the videos, the study is seen as observational [8]. Grades versus actual video watch time were first directly compared to evaluate whether the results would be similar to those reported in the literature, i.e., that there was no relationship between videos and performance [4,5,6,7] when ignoring confounding features arising from the self-selecting nature of the study. Next, the DID values for the changes in grades over time were compared to measure the treatment effect of watching videos more appropriately.

Video watch time as a predictor of grades
For the direct comparisons, mid-year grades (from the first semester of the course, running February to July) were compared with three major assessments in the second semester: two major tests (the September test, covering two textbook chapters, and the October test, covering one chapter) and the final November exam (covering four chapters). These were also compared with the overall year-end grades for the course, based on a combination of the mid-year results and all of the second-semester assessments and tutorial tests.
To determine whether there was any association between minutes watched and performance, the following analyses were carried out for each of the three assessments in each year (a minimal sketch of steps 2 and 3 follows this list):
1. Plots were generated of grades versus minutes watched for all of the students for each test or exam.
2. The beta value of an ordinary least squares (OLS) linear regression model between the total number of minutes of videos watched and the students' grades was calculated.
3. T-tests were used to calculate the p-value, which was used to determine whether the linear regression model in the previous step could be used as a significant predictor of student performance when comparing the videos and the no-videos groups.
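A minimal sketch of steps 2 and 3 for a single assessment, assuming hypothetical per-student arrays (the variable names and values are illustrative only, not the study's data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-student data for one assessment (videos group only).
minutes_watched = np.array([5, 12, 30, 45, 60, 90, 120, 150])
grades          = np.array([52, 48, 55, 60, 50, 58, 62, 57])

# Step 1 would plot grades versus minutes watched (e.g., a scatter plot).

# Step 2: OLS of grade on minutes watched; the slope is the beta value.
X = sm.add_constant(minutes_watched)  # adds the intercept column
fit = sm.OLS(grades, X).fit()
beta = fit.params[1]

# Step 3: the t-test p-value on the slope indicates whether the linear
# model is a significant predictor (H0: beta = 0).
p_value = fit.pvalues[1]
print(f"beta = {beta:.4f}, p = {p_value:.3f}")
```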

Difference-in-differences (DID)
For the DID comparisons, the mid-year grades were compared with the final November exam. The DID was first computed for the entire cohort of students. The aim here was to see whether the videos had a treatment effect on the entire group that watched them.
Next, DID estimates were also computed, based on whether students initially passed or failed at mid-year. The aim was to see whether the initial performance was an indicator of the treatment effects. Thereafter, the initial mid-year grades were further broken down into ranges of grades from 30% to above 80% in ranges of 10%. The aim here was to see whether different ranges had different treatment effects.
For all of the DID estimates, the p-values were calculated to determine whether the DID was a significant predictor of the treatment effect.
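A sketch of how these subgroup estimates could be obtained, fitting the Equation (2) regression separately within each initial-grade range (the DataFrame layout, column names, and numbers are assumptions for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(df):
    """Fit grade ~ T + S + T:S and return the DID coefficient and its p-value."""
    fit = smf.ols("grade ~ T + S + T:S", data=df).fit()
    return fit.params["T:S"], fit.pvalues["T:S"]

def did_by_range(students, ranges=((30, 39), (40, 49), (50, 59),
                                   (60, 69), (70, 79), (80, 100))):
    """Compute DID estimates separately for each initial mid-year grade range."""
    results = {}
    for lo, hi in ranges:
        subset = students[students["midyear_grade"].between(lo, hi)]
        if len(subset) >= 8:  # skip ranges with too few observations
            results[(lo, hi)] = did_estimate(subset)
    return results

# Tiny synthetic example so the sketch runs end-to-end: long-format rows
# (two per student), with each student's initial grade repeated on both rows.
students = pd.DataFrame({
    "grade":         [42, 50, 44, 46, 41, 43, 45, 44],
    "T":             [0,  1,  0,  1,  0,  1,  0,  1],
    "S":             [1,  1,  1,  1,  0,  0,  0,  0],
    "midyear_grade": [42, 42, 44, 44, 41, 41, 45, 45],
})
print(did_by_range(students))
```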

Analysis of video watching time
Before comparing student performance, the amount of video watching time before assessments was analysed. The distribution of the September 2017 test grades compared with the frequency of videos watched is shown in Figure 1; 44 videos covered the topics for this test. The graph reflects an approximately normal distribution with mean = 51.96%, std = 17.00%. Figure 2 illustrates the independent variable 'minutes watched', which is skewed to the right. Given that skewness, Spearman's correlation was calculated for all of the assessments in 2017, and it consistently showed no relationship between the minutes watched and the grades in any of the assessments.
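A small sketch of this check, with hypothetical arrays standing in for the watch-time and grade data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-student data: watch times are right-skewed, as in Figure 2.
minutes_watched = np.array([0, 3, 5, 8, 12, 20, 45, 200])
grades          = np.array([55, 48, 60, 52, 58, 50, 62, 54])

# Spearman's rank correlation is used rather than Pearson's because it
# does not assume normally distributed (or linearly related) variables.
rho, p = stats.spearmanr(minutes_watched, grades)
print(f"rho = {rho:.3f}, p = {p:.3f}")
```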

Figure 3: November 2017 exam (grades of the videos group)
To confirm that there was no direct relationship between the number of minutes watched and the grades, ordinary least squares (OLS) regression models were used. Fitting a linear regression model for the students who watched videos yielded a beta coefficient and an associated p-value for each assessment. The results are shown in Table 1. The beta coefficients for the majority of the assessments are close to zero, consistent with the null hypothesis that the beta coefficient is zero and that there is therefore no linear relationship between the minutes watched and the grades. The June 2018 exam was the only assessment with a beta coefficient noticeably above zero, and thus the only one in which a possible linear relationship was found. However, the p-value for this assessment was 0.52, indicating that the relationship was not significant at α = 0.05.
The p-values for most of the assessments were also above the α = 0.05 significance level, confirming that there was no significant relationship between the minutes watched and the grades. The November 2018 exam was the only assessment with a significant p-value, but its beta coefficient was only 0.003. Rather than contradicting the overall picture, this indicates that, even where a relationship was statistically detectable, the minutes watched had only a very slight impact on the grades.
As argued in Section 1, attempting to measure the effect of video watching on grades directly is not ideal; and in our case, no relationship between minutes watched and student performance could be found.
When it comes to observational studies that have inherent bias, the question should not be, "Do the videos cause significantly better performance in the treatment group than in the control group?", but rather, "Do the videos cause a significant positive change (or improvement) in grades for the treatment group compared with the control group?" In observational studies the aim should be to look at changes in the treatment group's grades, not at the actual grades, and to compare these with changes in the control group's grades. For this reason, the next section focuses on this comparison.

Comparison between the mean grades for the video and no-video groups
Figure 4 shows that the treatment group consistently had higher average grades than the control group for the 2017 assessments.
For the 2018 assessments, a third group, called 'treatment 2nd', was identified. These were students who did not watch videos in the first semester but started watching in the second semester. They were therefore part of the control group for Semester 1 and part of the treatment group for Semester 2.
In Figure 5, the treatment group consistently got better marks than the control group. However, as discussed in Section 1, because this is an observational study, the difference in marks could be due to confounding variables, such as student motivation.
The treatment 2nd group, however, showed a dip below the control group for the June 2018 exam. After this, when they started watching videos, their marks began to follow the treatment group's marks and were higher than the control group's. The exception was the November 2018 exam, for which the average marks of the treatment 2nd group were the same as those of the control group.
The dip of the treatment 2nd group below the control group may be an instance of Ashenfelter's dip [28]: their drop in grades in the June exam might be why they began watching videos in the second semester.
To investigate the differences in grades further, two-sided t-tests were used to calculate p-values, which were used to assess whether the differences in grades between the video and the no-video groups were significant. For most of the assessments, the p-values were higher than α = 0.05, indicating no significant differences. For the October 2017 test, however, the mean of the video group was significantly higher (p = 0.016) than that of the no-video group. The reason for this significance may be the short period, only one month, between the September 2017 test and the October 2017 test. This suggests that the test content and the 19 videos created for one chapter were very focused. There was thus a low variance in the range of material covered, and perhaps this low variance made the videos effective.
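A minimal sketch of one such comparison, with hypothetical grade arrays for the two groups (the values are illustrative, not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical grades for one assessment (e.g., the October 2017 test).
videos_grades    = np.array([58, 62, 55, 60, 65, 59, 61])
no_videos_grades = np.array([52, 50, 57, 49, 54, 51])

# Two-sided independent-samples t-test on the difference in group means.
t_stat, p_value = stats.ttest_ind(videos_grades, no_videos_grades)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p < 0.05 => significant
```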

DID values based on initial grades
In this section, the DID values to estimate treatment effects were computed for the periods from mid-year to the November exam for both 2017 and 2018. First, the DID value for the entire cohort, broken up into the treatment and control groups, was computed for both years. Next, the DID values for passing and failing students, based on their initial mid-year grades, were computed. This was done to investigate how the initial grades of the students at mid-year would affect (1) how they performed in the second semester, and (2) whether the use of videos was effective. We therefore controlled for mid-year passing or failing grades and estimated the DID values. Table 2 presents the results for the period from mid-year to the November exam for both 2017 and 2018.
Although not significant at α = 0.05, the failing group for 2017 showed a significant DID estimate of 4.3% at α = 0.1, whereas all of the other groups had non-significant DID estimates. Finally, the initial mid-year grades were further broken up into control groups, based on the ranges in grades. The aim was to see how different initial ranges of mid-year grades affected the DID values. Table 3 presents the DID values for these sub-groups from mid-year to the November exam for both years. The <30% range contained too few students in both years and was left out of the analysis. For the group that obtained between 40% and 49% at mid-year 2017, significant DID values were computed.
For 2018, this group had a p-value of 0.11, lower than any of the other p-values for that year, suggesting that at a less stringent significance level the DID value would be significant. The results suggest that those who were just under the passing grade were the most affected by the videos, and that just missing the 50% mark may strongly motivate students to use the video intervention. Further study is needed to determine why this specific sub-group was affected while a nearby sub-group, such as the 50%-59% group, was not; this will require interviewing students in these sub-groups to obtain further insight.
In 2017, the >80 group also had a significant DID value at α = 0.1, but not at α = 0.05. This indicates that there could be some benefit for distinction students in keeping their grades high by watching the videos. Unlike for the 40%-49% group, however, this effect was not significant in 2018.

DISCUSSION
Although the direct comparison of grades between those who watched videos and those who did not found no association except for one assessment (the October 2017 test), the DID values did find treatment effects. A main finding of this study, using the DID estimations, is that the initial grade is a predictor of the treatment effect for certain percentage ranges; put the other way around, the treatment effect depends on the initial grade, and for those initially failing, the intervention was found to be effective. The literature has linked initial factors to performance, but none of the literature has linked initial factors to the treatment effect. This finding suggests that those who are failing (and specifically those in the 40%-49% range) use the videos in the most productive way. A reason for the videos being effective for this group might lie more in the student than in the videos. For example, these students might use the videos in a targeted or strategic way [7], meaning that they use them to focus on topics with which they are struggling.
Another reason for the videos' effectiveness in this group might be intrinsic motivation [11]. Those who initially perform poorly may have a strong incentive to use the available resources as effectively as possible. Therefore, although the intervention is seen as a valuable resource, the motivation of the student is seen to be the source of the videos' effectiveness.
This study has perhaps found a proxy for motivation: the initial grade that a student obtains early in a semester. This study suggests that an intervention has no value in itself, but that it needs to be combined with other factors within the student. A similar finding by Boaler et al. showed that mindset beliefs impact academic performance [16].
Although future interventions will always be open to all students, this finding suggests that the focus needs to be placed on the group of students who are just under the passing grade, encouraging them to use the videos as much as possible. The idea is that this practice would improve the pass rate in the course.

CONCLUSION
This study proposed that a reason for the mixed results in student performance reported in the literature, but often overlooked, is the use of inappropriate measurement techniques to measure treatment effects in observational studies. To improve interventions, it is crucial to measure their effectiveness accurately. A potential reason provided in this research for the mixed results in the effect of videos on grades is that the majority of the studies are observational (students self-selecting to watch the videos rather than being placed in randomised control groups). Thus, the literature is mostly measuring video effectiveness without considering confounding factors (e.g., students being more motivated or diligent owing to an awareness of their failing grades, and so taking the initiative to make a greater effort, including watching videos).
In the current observational study, carried out over two years, we also did not find any association between videos and performance when comparing grades with the amount of time spent watching videos. The difference-in-differences (DID) method was then applied to measure the treatment effects on changes in grades more accurately. DID estimates were computed (1) for the entire cohort; (2) based on initially passing or failing grades; and (3) based on ranges of initial grade percentages. The findings were that (1) there was no significant impact on the overall group; and (2) the grades at the beginning of the time span were a predictor of the videos' impact; more specifically, the sub-group with initial failing grades of between 40% and 49% was significantly affected by the videos.

FURTHER WORK
Intrinsic factors within a student, such as their motivation and their method of video usage, are seen as major factors in how effective the video intervention is. Therefore, in follow-up studies, the entire cohort of students could be interviewed and asked various questions about their motivational state and their method of usage. This opens an important window into combining actual watch time, academic performance, and the motivational state of the students to establish the differences between cohorts.
A more thorough study of the quality and objective of the videos, especially after the start of COVID-19, in combination with students' views on the videos, could also provide a better understanding of how to improve performance. This could also be used to investigate how COVID-19 has impacted student behaviour and whether there is a difference in the impact of video watching on student performance. If the objective of the videos focuses more on laying the foundations of concepts, it makes sense that failing students would benefit more from the videos, while distinction students would benefit by spending more of their time on mastering more complicated topics not addressed in the videos. It is believed that the students themselves play an integral part in moving the conversation forward about using videos as an online learning tool.
The amount of video watch time could also benefit from further in-depth analysis and from more studies of where the actual minutes were spent, whether the videos were watched to the end or repeatedly, and the students' attention span while watching the videos.
Finally, although global (group and sub-group) analyses have some value, the literature increasingly shows the need to measure individualised treatment effects that are shown to be more effective for each student [9,29,30]. Therefore, further research could focus on individualised treatment effects that could assist the students with personalised learning [31].