Open data on countries/areas with reported cases of COVID-19 has been available since 27 February 2020. Open data listing the buildings that house persons under mandatory home quarantine (home confinees) under Cap. 599C of the Laws of Hong Kong has been available since 29 February 2020.
Data downloaded from: https://data.gov.hk/en-data/dataset/hk-dh-chpsebcddr-novel-infectious-agent
dim(df_HK)
[1] 101 10
names(df_HK)
[1] "Case no." "Report date" "Date of onset"
[4] "Gender" "Age" "Name of hospital admitted"
[7] "Hospitalised/Discharged/Deceased" "HK/Non-HK resident" "Case classification*"
[10] "Confirmed/probable"
str(df_HK)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 101 obs. of 10 variables:
$ Case no. : num 1 2 3 4 5 6 7 8 9 10 ...
$ Report date : chr "23/01/2020" "23/01/2020" "24/01/2020" "24/01/2020" ...
$ Date of onset : chr "21/01/2020" "18/01/2020" "20/01/2020" "23/01/2020" ...
$ Gender : chr "M" "M" "F" "F" ...
$ Age : num 39 56 62 62 63 47 68 64 73 72 ...
$ Name of hospital admitted : chr "Princess Margaret Hospital" "Princess Margaret Hospital" "Princess Margaret Hospital" "Princess Margaret Hospital" ...
$ Hospitalised/Discharged/Deceased: chr "Discharged" "Discharged" "Discharged" "Hospitalised" ...
$ HK/Non-HK resident : chr "Non-HK resident" "HK resident" "Non-HK resident" "Non-HK resident" ...
$ Case classification* : chr "Imported" "Imported" "Imported" "Imported" ...
$ Confirmed/probable : chr "Confirmed" "Confirmed" "Confirmed" "Confirmed" ...
- attr(*, "spec")=
.. cols(
.. `Case no.` = col_double(),
.. `Report date` = col_character(),
.. `Date of onset` = col_character(),
.. Gender = col_character(),
.. Age = col_double(),
.. `Name of hospital admitted` = col_character(),
.. `Hospitalised/Discharged/Deceased` = col_character(),
.. `HK/Non-HK resident` = col_character(),
.. `Case classification*` = col_character(),
.. `Confirmed/probable` = col_character()
.. )
The variable Report date was loaded as a character string. To use it as a date, we need to convert it to the Date class.
# convert `Report date` to a `Date` and order by `Report date`
library(lubridate) # provides dmy()
df_HK$`Report date` <- dmy(df_HK$`Report date`)
df_HK <- df_HK[order(df_HK$`Report date`), ]
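The same conversion can be sketched in base R, without lubridate, assuming the day/month/year layout seen in the str() output above. The vector `raw_dates` is a hypothetical stand-in for the column, not the real data.

```r
# Hypothetical stand-in for df_HK$`Report date` (same dd/mm/yyyy layout)
raw_dates <- c("02/02/2020", "23/01/2020", "24/01/2020")

# Base R counterpart of lubridate::dmy(): the format is stated explicitly
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Ordering Date values is chronological; ordering the raw strings
# would compare them character by character, starting with the day
sorted <- parsed[order(parsed)]

print(class(parsed))
print(sorted)
```

Stating the format explicitly makes a mismatch visible: a string that does not follow dd/mm/yyyy comes back as NA instead of being silently misread.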
summary(df_HK$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.00 46.00 59.00 56.56 68.00 96.00
table(df_HK$Gender)
F M
51 50
barplot(table(df_HK$Gender))
barplot(table(df_HK$`Hospitalised/Discharged/Deceased`))
boxplot(df_HK$Age)
boxplot(Age ~ Gender, data = df_HK)
We will use a one-sided two-sample t-test to determine whether the mean age of female patients is greater than the mean age of male patients.
women <- df_HK$Age[df_HK$Gender == "F"]
men <- df_HK$Age[df_HK$Gender == "M"]
summary(women)
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.00 50.50 61.00 59.33 69.00 96.00
summary(men)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.00 39.75 57.50 53.74 68.00 80.00
t.test(women, men, alternative = "greater", var.equal = TRUE, paired = FALSE)
Two Sample t-test
data: women and men
t = 1.6264, df = 99, p-value = 0.05352
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-0.1169456 Inf
sample estimates:
mean of x mean of y
59.33333 53.74000
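To see where the reported p-value and confidence bound come from, the sketch below recomputes them from the printed t statistic, degrees of freedom, and group means. The variable names are illustrative, and because the inputs are the rounded values from the output, the results should only match the printed ones approximately.

```r
# Values copied from the t.test() output above
t_stat <- 1.6264
dof <- 99
diff_means <- 59.33333 - 53.74000   # mean of x minus mean of y

# One-sided p-value for alternative = "greater": P(T > t_stat)
p_one_sided <- pt(t_stat, df = dof, lower.tail = FALSE)

# The standard error of the difference is recovered as difference / t
se_diff <- diff_means / t_stat

# Lower bound of the one-sided 95% confidence interval
lower_bound <- diff_means - qt(0.95, df = dof) * se_diff

print(p_one_sided)
print(lower_bound)
```

Because the p-value is slightly above 0.05, the test does not reject the null hypothesis at the 5% level. This is consistent with the one-sided interval, whose lower bound falls just below 0.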
# build the cumulative number of reported cases by report date
# (uses dplyr for %>%, mutate(), group_by(), summarise())
date_start <- dmy("22-01-2020")
infection_evol <- df_HK %>%
  mutate(Time = `Report date` - date_start) %>% # days since 22 January 2020
  group_by(`Report date`) %>%
  summarise(n_case = n(), Time = first(Time)) %>% # new cases on each date
  ungroup() %>%
  mutate(n_case = cumsum(n_case)) # running total
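The counting step can also be sketched in base R on a small, made-up example, which makes it easier to check by hand. The dates in `report_dates` and the name `toy_evol` are hypothetical and are not taken from the Hong Kong data.

```r
# Hypothetical report dates: several cases can share the same date
report_dates <- as.Date(c("2020-01-23", "2020-01-23", "2020-01-24",
                          "2020-01-26", "2020-01-26", "2020-01-26"))
date_start <- as.Date("2020-01-22")

# Count cases per date; the table is ordered by date
daily <- table(report_dates)

toy_evol <- data.frame(
  Time   = as.numeric(as.Date(names(daily)) - date_start), # days since start
  n_case = cumsum(as.vector(daily))                        # running total
)

print(toy_evol)
```

Dates with no reported case (here 25 January) produce no row, just as in the pipeline above, so the points are not evenly spaced in time.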
plot(x = infection_evol$Time,
y = infection_evol$n_case,
xlab = "Time (Days)",
ylab = "Cases")
l_model <- lm(n_case ~ Time, data = infection_evol)
summary(l_model)
Call:
lm(formula = n_case ~ Time, data = infection_evol)
Residuals:
Min 1Q Median 3Q Max
-7.8596 -2.7270 0.0741 2.6901 9.5244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.00777 1.72593 -5.798 2.18e-06 ***
Time 2.74171 0.06983 39.264 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.59 on 31 degrees of freedom
Multiple R-squared: 0.9803, Adjusted R-squared: 0.9797
F-statistic: 1542 on 1 and 31 DF, p-value: < 2.2e-16
abline(l_model, col = "red")
What conclusions can we draw from this fit?
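As a starting point for that question, the sketch below reads the fitted line directly from the coefficients printed by summary(l_model). The helper `fitted_cases` and the other names are illustrative. The numbers are copied from the output above, so this is an interpretation aid, not a refit of the model.

```r
# Coefficients and standard error copied from summary(l_model) above
intercept <- -10.00777
slope     <- 2.74171   # fitted increase in cumulative cases per day
slope_se  <- 0.06983

# Fitted cumulative case count at day t (days since 22 January 2020)
fitted_cases <- function(t) intercept + slope * t

# Day at which the fitted line crosses zero cases
zero_day <- -intercept / slope

# Approximate 95% interval for the slope (31 residual degrees of freedom)
slope_ci <- slope + c(-1, 1) * qt(0.975, df = 31) * slope_se

print(fitted_cases(c(10, 20, 30)))
print(zero_day)
print(slope_ci)
```

Over the observed period the straight line describes the cumulative count well (R-squared about 0.98), with a slope of roughly 2.7 reported cases per day. Two cautions apply. Cumulative counts are strongly autocorrelated, so the standard errors from lm() are likely too small. The line also says nothing reliable about days outside the observed range.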