- Use R to compare two or more groups in a crude analysis
- Use R to estimate the effect (odds ratio) of one exposure, controlled for the effect of a second variable, and obtain a confidence interval for this odds ratio
- Use logistic regression to assess if the effect of an exposure (e.g. visual impairment) is confounded by another variable (e.g. age)

Ensure to download and store the data file into your PC from the link: https://github.com/winterwang/logistic-reg-CSS/raw/gh-pages/dataset/mortality.dta

Ensure you have installed R, Rstudio, and the following packages by commands below:

```
requiredPackages = c('haven','Epi','epiDisplay', 'tidyverse')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
```

- If data is downloaded to your desktop, use the following command to read the data into R:

```
library(haven)
mortality <- read_dta("C:/Users/{YOUR.USER.NAME}/Desktop/mortality.dta")
```

- If data is successfully loaded into R, you can see the dataname, number of variables, number of observations on the upper right side of the Rstudio program:

- You may go ahead and explore the dataset as you wish by clicking on the name of the data, or by commands such as
`summary()`

or etc. to see more details of the data.

`library(epiDisplay)`

`## Loading required package: foreign`

`## Loading required package: survival`

`## Loading required package: MASS`

`## Loading required package: nnet`

`with(mortality, tabpct(mfgrp, died, graph = FALSE))`

```
##
## Original table
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## 1 1609 62 1671
## 2 996 33 1029
## 3 193 9 202
## Total 4072 133 4205
##
## Row percent
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## (97.8) (2.2) (100)
## 1 1609 62 1671
## (96.3) (3.7) (100)
## 2 996 33 1029
## (96.8) (3.2) (100)
## 3 193 9 202
## (95.5) (4.5) (100)
##
## Column percent
## died
## mfgrp 0 % 1 %
## 0 1274 (31.3) 29 (21.8)
## 1 1609 (39.5) 62 (46.6)
## 2 996 (24.5) 33 (24.8)
## 3 193 (4.7) 9 (6.8)
## Total 4072 (100) 133 (100)
```

The outcome variable is

`died`

, which is coded as 1 for an individual who has died and 0 for an individual who is alive after 3 years of follow-up. Analyse this data set treating it as a cohort study with fixed follow-up time (ie., assume that all individuals were followed for the same length of time and analyse the data using odds).To use logistic regression to examine the association between

`died`

and`vimp`

, type

```
vimp_logm0 <- glm(died ~ vimp, data = mortality,
family = binomial(link = "logit"))
summary(vimp_logm0)
```

```
##
## Call:
## glm(formula = died ~ vimp, family = binomial(link = "logit"),
## data = mortality)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5108 -0.2224 -0.2224 -0.2224 2.7247
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6873 0.1028 -35.870 <2e-16 ***
## vimp 1.7167 0.1976 8.687 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1213.8 on 4297 degrees of freedom
## Residual deviance: 1154.7 on 4296 degrees of freedom
## AIC: 1158.7
##
## Number of Fisher Scoring iterations: 6
```

Check that the the output corresponds to that in the slides.

Logistic regression estimates are derived by starting with a guess at the parameter estimates, then using the result to compute a better guess (nearer to the maximum likelihood estimates). This is known as **iteration.** The log likelihood for each iteration is shown. The procedure stops when there is no further increase in the log likelihood.

- So far, all the output has been on the log scale. This was needed in order to explain how confidence intervals and Wald tests (z-tests) are derived. In fact, R allows us to derive estimates on the odds ratio scale, which is much more convenient for reporting results. Type:

```
library(epiDisplay)
logistic.display(vimp_logm0, decimal = 3)
```

```
##
## Logistic regression predicting died
##
## OR(95%CI) P(Wald's test) P(LR-test)
## vimp: 1 vs 0 5.566 (3.779,8.199) < 0.001 < 0.001
##
## Log-likelihood = -577.36596
## No. of observations = 4298
## AIC value = 1158.73192
```

The results are now shown as odds ratios. Note that:

- The baseline term (intercept) represents the odds of outcome in the baseline group of the variable – so in this example it is the odds of death amongst those visually unimpaired.
- The confidence interval is also derived using the standard error for the log odds ratio, as shown in the slides. Hence, the confidence interval is correct.

- Example of how results may be summarised in a sentence.

**“Visual impairment is strongly associated with risk of death (P<0.0001). Those visually impaired have nearly 6 times the odds of death compared to those with unimpaired vision (OR 5.57 95% CI: 3.78-8.20).”**

- When tabulating variables for a cohort (or cross-sectional) study, use row percentages if the explanatory variable is the row variable, column percentages if it is the column variable. For example:

```
with(mortality, tabpct(mfgrp, died,
percent = "row",
graph = FALSE))
```

```
##
## Row percent
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## (97.8) (2.2) (100)
## 1 1609 62 1671
## (96.3) (3.7) (100)
## 2 996 33 1029
## (96.8) (3.2) (100)
## 3 193 9 202
## (95.5) (4.5) (100)
```

We can then describe the risk of death over a 3 year period for those with no microfilarial infection (2.2%), <10 microfilarial load/mg infection (3.7%), etc.

- As explained in the slides, we need to use indicator variables to examine the association between microfilarial load (4 levels) and death. We need to recode the varibales, type

```
library(tidyverse)
mortality <- mortality %>%
mutate(mfgrp = factor(mfgrp)) %>%
mutate(mfgrp = fct_recode(mfgrp,
"Uninfected" = "0",
"< 10" = "1",
"10-49" = "2",
"50+" = "3"))
vimp_logm1 <- glm(died ~ mfgrp, data = mortality,
family = binomial(link = "logit"))
summary(vimp_logm1)
```

```
##
## Call:
## glm(formula = died ~ mfgrp, family = binomial(link = "logit"),
## data = mortality)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3019 -0.2750 -0.2553 -0.2122 2.7587
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7826 0.1878 -20.143 <2e-16 ***
## mfgrp< 10 0.5264 0.2281 2.308 0.0210 *
## mfgrp10-49 0.3754 0.2580 1.455 0.1457
## mfgrp50+ 0.7172 0.3893 1.842 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1180.4 on 4204 degrees of freedom
## Residual deviance: 1173.7 on 4201 degrees of freedom
## (93 observations deleted due to missingness)
## AIC: 1181.7
##
## Number of Fisher Scoring iterations: 6
```

`To get the output on the odds ratio scale, type:`

`logistic.display(vimp_logm1, decimal = 3)`

```
##
## Logistic regression predicting died
##
## OR(95%CI) P(Wald's test) P(LR-test)
## mfgrp: ref.=Uninfected 0.0822
## < 10 1.693 (1.083,2.647) 0.021
## 10-49 1.456 (0.878,2.414) 0.1457
## 50+ 2.049 (0.955,4.394) 0.0655
##
## Log-likelihood = -586.86507
## No. of observations = 4205
## AIC value = 1181.73014
```

Note that there are three odds ratios each of which refers to the same baseline group (those uninfected). The odds ratio is - 1.69 for those with microfilarial load <10 mf/mg (coded as 1) compared to those uninfected (coded as 0) - 1.46 for those microfilarial load 10-49 mf/mg (coded as 2) compared to those uninfected (0) and - 2.05 for those microfilarial load ≥50 mf/mg (coded as 3) compared to those uninfected (0).

- Write 2-3 sentences describing the relationship between microfilarial infection and death

There is weak evidence of an association between microfilarial infection (measured on 4 levels) and death (p=0.08). The risk of death among those uninfected is 2.2% which is lower than those infected. The odd ratios for those with <10, 10-49 and 50+ microfilarial /mg compared with being infected were 1.69 (95% CI: 1.08-2.64), 1.46 (0.88-2.41) and 2.05 (0.96-4.39), respectively.

- Logistic regression model with agegrp as an explanatory variable:

```
mortality <- mortality %>%
mutate(agegrp = factor(agegrp)) %>%
mutate(agegrp = fct_recode(agegrp,
"15-34" = "0",
"35-54" = "1",
"55-64" = "2",
"65+" = "3"))
vimp_logm2 <- glm(died ~ vimp + agegrp,
data = mortality,
family = binomial(link = "logit"))
logistic.display(vimp_logm2, decimal = 3)
```

```
##
## Logistic regression predicting died
##
## crude OR(95%CI) adj. OR(95%CI)
## vimp: 1 vs 0 5.566 (3.779,8.199) 2.202 (1.409,3.443)
##
## agegrp: ref.=15-34
## 35-54 2.578 (1.633,4.07) 2.355 (1.484,3.737)
## 55-64 6.934 (4.058,11.847) 5.415 (3.083,9.513)
## 65+ 14.802 (8.809,24.872) 9.901 (5.541,17.692)
##
## P(Wald's test) P(LR-test)
## vimp: 1 vs 0 < 0.001 < 0.001
##
## agegrp: ref.=15-34 < 0.001
## 35-54 < 0.001
## 55-64 < 0.001
## 65+ < 0.001
##
## Log-likelihood = -545.8609
## No. of observations = 4298
## AIC value = 1101.7218
```

There is strong evidence against the null hypothesis of no association of age with death (p<0.001). The odds of death increase with increasing age.