Introduction:

In 2018, Nature Scientific Reports published an article by Australian researchers published describing the relationship between posture and “enlarged protuberances”, especially among younger subjects. In this study, the researchers analyzed data from individuals with different ages and sex, and concluded that prominent exostosis were more prevalent in young adults than older groups. However, skepticism arose about this study and a published author correction indicated errors existed in the original work. In this project, dataset from this study was used to test whether issues exist and reproduced results are consistent with that reported in the article.

1. Dataset

1.1 Description

raw_data = 
  read_excel("./data/p8105_mtp_data.xlsx", skip = 8)
mtp_data =   
  raw_data %>% 
  janitor::clean_names() %>% 
  replace_na(list(eop_size_mm = 0, eop_shape = 0)) %>% 
  mutate(
    sex = as.character(sex),
    age = as.integer(age),
    age_group = case_when(
      age_group %in% c('6', '7', '8') ~ '6+', 
      TRUE ~ age_group ),
    age_group = factor(age_group, levels = c('1', '2', '3', '4', '5', '6+')),
    eop_size = factor(eop_size, levels = c('0', '1', '2', '3', '4', '5')),
    eop_visibility_classification = factor(eop_visibility_classification, levels = c('0', '1', '2')),
    eop_shape = as.character(eop_shape),
    fhp_category = factor(fhp_category, levels = c('0', '1', '2', '3', '4', '5', '6', '7'))
  )

str(mtp_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1221 obs. of  9 variables:
##  $ sex                          : chr  "1" "1" "0" "1" ...
##  $ age                          : int  18 26 22 20 27 28 17 21 25 30 ...
##  $ age_group                    : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 1 2 2 2 ...
##  $ eop_size_mm                  : num  14.8 17.1 0 23.9 0 7.9 6.4 9.4 0 16.5 ...
##  $ eop_size                     : Factor w/ 6 levels "0","1","2","3",..: 3 4 1 5 1 2 2 2 1 4 ...
##  $ eop_visibility_classification: Factor w/ 3 levels "0","1","2": 3 3 2 3 1 3 3 3 1 3 ...
##  $ eop_shape                    : chr  "3" "3" "0" "1" ...
##  $ fhp_size_mm                  : num  34.7 32.9 28.1 11 19.3 27.6 30.6 27.8 26.2 21 ...
##  $ fhp_category                 : Factor w/ 8 levels "0","1","2","3",..: 4 4 3 2 2 3 4 3 3 3 ...

In the data cleaning, missing values were filled based on given information, variable types were modified and orders were given to factor variables. Individuals older than 60 were all assigned as ‘6+’ under the age_group variables to facilitate further analyses.

The resulting dataset mtp_data contains 1221 observations (i.e. the number of participants) and 9 variables. Key variables: The average age was 45.5118755. The average EOP size was 6.8330549 mm. The average FHP size was 26.1013416 mm. NA exist in the fhp_size_mm and eop_shape. Since no information was provided, NA’s here were left blank.

A table below shows the age and gender distribution.

mtp_data %>% 
  mutate(
    age_group = case_when(
      age_group == '1' ~ 'less than 18',
      age_group == '2' ~ '18-30',
      age_group == '3' ~ '31-40',
      age_group == '4' ~ '41-50',
      age_group == '5' ~ '51-60',
      age_group == '6+' ~ 'larger than 60'
    )
  ) %>% 
  group_by(sex, age_group) %>% 
  summarize(n = n()) %>%
  pivot_wider(
    names_from = sex,
    values_from = n
  ) %>% 
  knitr::kable(
    col.names = c('Age', 'Female', 'Male'),
  )
Age Female Male
18-30 151 152
31-40 102 102
41-50 106 101
51-60 99 101
larger than 60 155 150
less than 18 1 1

1.2 Issues

mtp_data %>% 
  ggplot(aes(x = age_group, y = age, color = age_group)) +
  geom_point() +
  geom_hline(yintercept = c(18, 31, 41, 51, 61), color = 'red', linetype = 'dotted') + 
  xlab('Age Group') +
  ylab('Age (years)') +
  scale_color_hue(name = "age group") +
  scale_x_discrete(
    labels = c("<18's", "18-29 years","30's", "40's", "50's", ">60's")) +
  scale_color_hue(name = "age group",
                  labels = c("<18's", "18-29 years","30's", "40's", "50's", ">60's"))

  • An outlier in ‘<18’s’ group indicates a person older than 18 was mistakenly labeled into this group.
mtp_data %>% 
  ggplot(aes(x = eop_size, y = eop_size_mm, color = eop_size)) +
  geom_point() +
  geom_hline(yintercept = c(5, 10, 15, 20, 25), color = 'red', linetype = 'dotted') + 
  xlab('EOP Size Group') +
  ylab('EOP Size (mm)') +
  scale_color_hue(name = "eop size group")

  • Misclassifications existed in EOP size group, especially in group 1, 3, 4, 5.
row_na = which(is.na(pull(mtp_data, eop_size)))
raw_data[row_na,] %>% 
  knitr::kable()
Sex Age Age group EOP size (mm) EOP size EOP visibility Classification EOP Shape FHP size (mm) FHP Category
1 62 6 15 14.6 2 2 40.8 4

The NA in x-axis was due to an abnormal value ‘14.6’ in ‘EOP size’ in raw dataset which supposed to be 0 ~ 5.

mtp_data %>% 
  ggplot(aes(x = eop_visibility_classification, y = eop_size_mm, color = eop_visibility_classification)) +
  geom_point() +
  geom_hline(yintercept = c(0, 5), color = 'red', linetype = 'dotted') + 
  xlab('EOP Visibility Classification') +
  ylab('EOP Size (mm)') +
  scale_color_hue(name = "EOP visibility classification")

  • Misclassifications existed in all three groups in EOP visibility classification.
mtp_data %>% 
  ggplot(aes(x = fhp_category, y = fhp_size_mm, color = fhp_category)) +
  geom_point() +
  geom_hline(yintercept = c(0, 10, 20, 30, 40, 50, 60, 70), color = 'red', linetype = 'dotted') + 
  xlab('FHP Size Group') +
  ylab('FHP Size (mm)') +
  scale_color_hue(name = "FHP size group")

  • Misclassifications existed in FHP size group, especially in group 1, 2, 4.
row_na = which(is.na(pull(mtp_data, fhp_category)))
raw_data[row_na,] %>% 
  knitr::kable()
Sex Age Age group EOP size (mm) EOP size EOP visibility Classification EOP Shape FHP size (mm) FHP Category
0 51 5 NA 0 0 NA 30.3 30.8

The NA in the x-axis was due to an abnormal value ‘30.8’ in ‘FHP Category’ in raw dataset which supposed to be 0 ~ 7.

2. Visualization Improvement

fhp_plot =
  mtp_data %>% 
  filter(age_group != '1') %>% 
  drop_na(fhp_size_mm) %>% 
  ggplot(aes(x = age_group, y = fhp_size_mm, fill = sex)) +
  geom_boxplot() +
  stat_summary(
    fun.y = mean, aes(colour = "Mean"), 
    geom = "point", 
    shape = 18, 
    size = 3,
    position = position_dodge(width = 0.75)) +
  stat_summary(
    fun.data = mean_se, 
    geom = "errorbar",
    width = 0.15,
    position = position_dodge(width = 0.75)) +
  scale_fill_hue(
    name = "Sex", 
    labels = c("Females", "Males")) +
  scale_colour_manual(
    values = c("Mean" = "darkred"), 
    labs(colour = '')) +
  xlab('Age Group (years)') +
  ylab('FHP Size (mm)') +
  scale_x_discrete(
    labels = c("18-29 years","30's", "40's", "50's", ">60's"))
eeop_plot = 
  mtp_data %>% 
  filter(age_group != '1') %>% 
  mutate(
    eeop = case_when(
      eop_size_mm > 10 ~ 'y',
      eop_size_mm <= 10 ~ 'n'
    )
  ) %>% 
  group_by(age_group, sex) %>% 
  count(eeop) %>%
  mutate(rate = prop.table(n)) %>% 
  filter(eeop == 'y') %>% 
  ggplot(aes(y = rate, x = age_group)) +
  geom_line(aes(group = sex, color = sex)) +
  geom_point() +
  scale_color_hue(
    name = "Sex", 
    labels = c("Females", "Males")) +
  xlab('Age Group (years)') +
  ylab('EEOP Rate (%)') +
  scale_x_discrete(
    labels = c("18-29 years","30's", "40's", "50's", ">60's"))
# create a two-panel figure and show the plot

fhp_plot / eeop_plot

From upper panel, mean value equals to median in most of the age groups, indicating a non-skewed distributions in those data. For females in 30’s, 50’s, >60’s group, data distributions were right-skewed. Also, the males had relatively larger FHP size than females.

From lower panel, overall trend of EEOP rate decreased first and then increased. It had largest rate in 18-29 years group and lowest rate in 40’s group. Also, EEOP rates in males were higher than that in females at all age groups.

fhp_eop_plot = 
  mtp_data %>% 
  filter(age_group != '1') %>% 
  drop_na(fhp_size_mm) %>% 
  filter(eop_size_mm != 0) %>% 
  ggplot(aes(x = fhp_size_mm, y = eop_size_mm, color = sex)) +
  geom_point() +
  geom_smooth(
    method = "lm", 
    color = 'black',
    se = FALSE) +
  facet_grid(
    sex ~ age_group, 
    labeller = as_labeller(c('0' = 'Female', '1' = 'Male', '2' = '18-29 years', 
                             '3' = "30's", '4' = "40's", '5' = "50's", '6+' = ">60's"))) +
  xlab('FHP Size (mm)') +
  ylab('EOP Size (mm)') +
  scale_color_hue(
    name = "Sex", 
    labels = c("Females", "Males"))

fhp_eop_plot

For male group, the EOP size and FHP size had slightly positive relationships in all age groups. However, for female group, the relationships were not identical among different age groups.

3. Reproduce Reported Results

mtp_data %>% 
  mutate(
    age_group = case_when(
      age_group == '1' ~ 'less than 18',
      age_group == '2' ~ '18-30',
      age_group == '3' ~ '31-40',
      age_group == '4' ~ '41-50',
      age_group == '5' ~ '51-60',
      age_group == '6+' ~ 'larger than 60'
    )
  ) %>% 
  group_by(age_group) %>% 
  summarize(n = n()) %>% 
  knitr::kable(
    col.names = c('Age Group', 'Sample Size')
  )
Age Group Sample Size
18-30 303
31-40 204
41-50 207
51-60 200
larger than 60 305
less than 18 2
# overall mean

mean(pull(mtp_data, fhp_size_mm), na.rm = TRUE)
## [1] 26.10134
# mean and std by sex group

mtp_data %>% 
  mutate(
    sex = case_when(
      sex == '0' ~ 'Female',
      sex == '1' ~ 'Male'
    )
  ) %>%
  group_by(sex) %>% 
  summarize(
    maen = mean(fhp_size_mm, na.rm = TRUE), 
    sd = sd(fhp_size_mm, na.rm = TRUE)
  ) %>% 
  knitr::kable(
    col.names = c('Sex', 'Mean (FHP size)', 'Standard Deviation (FHP size)')
  ) 
Sex Mean (FHP size) Standard Deviation (FHP size)
Female 23.72580 10.61789
Male 28.51234 14.66670
mtp_data %>% 
  mutate(
    eeop = case_when(
      eop_size_mm > 10 ~ 'EEOP',
      eop_size_mm <= 10 ~ 'Not EEOP'
    )
  ) %>% 
  count(eeop) %>%
  mutate(rate = prop.table(n)) %>% 
  knitr::kable(
    col.names = c('EEOP or not', 'Number', 'Prevalence')
  )
EEOP or not Number Prevalence
EEOP 392 0.3210483
Not EEOP 829 0.6789517
fhp_violin = 
  mtp_data %>% 
  filter(age_group != '1') %>% 
  drop_na(fhp_category) %>% 
  ggplot(aes(x = age_group, y = fhp_size_mm)) +
  geom_violin() +
  xlab('Age Group') +
  ylab('FHP Size (mm)') +
  scale_x_discrete(
    labels = c("18-29 years","30's", "40's", "50's", ">60's"))
  
fhp_col = 
  mtp_data %>% 
  filter(age_group != '1') %>% 
  drop_na(fhp_category) %>% 
  group_by(age_group) %>% 
  count(fhp_category) %>% 
  mutate(rate = prop.table(n)) %>%  
  ggplot(aes(x = fhp_category, y = rate)) +
  geom_col() +
  xlab('FHP Categroy') +
  ylab('Percentage') +
  facet_grid(~age_group) 

fhp_violin / fhp_col

mtp_data %>% 
  filter(age_group != '1') %>% 
  drop_na(fhp_category) %>% 
  group_by(age_group) %>% 
  count(fhp_category) %>% 
  mutate(rate = prop.table(n)) %>%
  filter(age_group == '6+' & fhp_category %in% c('4', '5', '6', '7')) %>% 
  summarize(sum = sum(rate)) %>% 
  knitr::kable(
    col.names = c('Age Group', 'Percentage of FHP >40 mm')
  )
Age Group Percentage of FHP >40 mm
6+ 0.3245902

4. Discussions

Based on the available data, lots of misclassification were found in category variables and mistakes were found in data entry, which made the data analyses in original report questionable. Besides, some statistical values were found not consistent with that reported in the article. Although board trends were consistent with the article that younger individuals had higher EEOP rates, we still cannot conclude cell phones cause hron growth. Extra data showing the prevalence of cell-phones usage among different age groups will be necessary to better address such hypothesis.