DACSS 603

Introduction to Quantitative Research

By Kristina Becvar in DACSS UMass Amherst Data Analytics R

May 12, 2022

Commentary

This course provided a thorough overview of statistical analysis methods through four in-depth assignments using R to practice regression analysis.

The final project asked us to apply all of the tools we learned through the semester to a data set of our choice.

Poster Presentation

You can check out the final project by clicking here.

Project Data Sources

DDOS User Observations

The primary data is a set of observations of users of a novice “hacking tool” used to engage in DDOS (distributed denial of service) attacks against Russian targets in March 2022. The data contains cumulative user totals for each day of the series, March 2 through March 11, and the users represent participants from 98 countries.

WVS/EVS

I will also be using a data set of observations from the World Values Survey conducted from 2017-2021 as a joint project between the World Values Survey and the European Values Study. This data was released in July 2021 and contains responses from roughly 135,000 respondents across 95 countries.

Spike/Newswhip

The third is a data set of media coverage (media articles and social media mentions) of the Ukrainian minister’s call for volunteers for the “IT Army of Ukraine” to help fight the Russian invasion on the digital front.

Data Analysis

DDOS Users

I reshaped the data into several forms to explore the best ways to analyze it.

DDOS Daily Observations

  • Country Name
  • Population (as indicated by the U.S. CIA World factbook website)
  • Region (as indicated by the U.S. CIA World factbook website)
  • Columns of DDOS user counts from each country for each date observed, March 2 - March 11. These are difficult to use for analysis because the daily observations do NOT represent new users added on that day; rather, each daily value is the cumulative number of users from that country as of that day (see the sketch after the user totals below for one way to recover per-day increases).
#packages used throughout: tidyverse (readr/dplyr/ggplot2/tidyr), cowplot (theme_minimal_hgrid), scales (rescale)
library(tidyverse); library(cowplot); library(scales)
#load the data
ddos_daily <- read_csv("ddos_observations.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   population = col_number(),
##   region = col_character(),
##   `3/2/2022` = col_double(),
##   `3/3/2022` = col_double(),
##   `3/4/2022` = col_double(),
##   `3/5/2022` = col_double(),
##   `3/6/2022` = col_double(),
##   `3/7/2022` = col_double(),
##   `3/8/2022` = col_double(),
##   `3/9/2022` = col_double(),
##   `3/10/2022` = col_double(),
##   `3/11/2022` = col_double()
## )

#assign column names to represent variables accurately
colnames(ddos_daily) <- c("Country", "Population", "Region", "March2", "March3", "March4", "March5", "March6", "March7", "March8", "March9", "March10", "March11")
#summarize the data
options(scipen = 999)
head(ddos_daily)

## # A tibble: 6 x 13
##   Country   Population Region   March2 March3 March4 March5 March6 March7 March8
##   <chr>          <dbl> <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Aland          29789 Europe        1      1      1      1      1      1      1
## 2 Albania      3088385 Europe       19     22     22     23     32     44     51
## 3 Algeria     43576691 Africa        0      8      8      8      8      8      8
## 4 Andorra        85645 Europe        2      6      6      6      6      6      6
## 5 Argentina   45864941 South A~      9      9      9     11     11     11     11
## 6 Armenia      3011609 Asia          1      7      7      9      9     13     13
## # ... with 3 more variables: March9 <dbl>, March10 <dbl>, March11 <dbl>

The total DDOS user count began at 7,850 on the first day of observations, March 2, 2022, and grew to 48,879 by the last day available for observation, March 11, 2022.

sum(ddos_daily$March2)

## [1] 7850

sum(ddos_daily$March11)

## [1] 48879
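
Because each daily column holds a running total rather than new sign-ups, one way to recover per-day increases is to reshape the daily columns to long format and difference the cumulative counts within each country. This is a minimal sketch (not part of the original analysis), assuming the column names assigned above and treating the March 2 value as that day's increase:

#sketch: derive new users per day from the cumulative daily columns
ddos_new_users <- ddos_daily %>%
  pivot_longer(March2:March11, names_to = "day", values_to = "cumulative") %>%
  group_by(Country) %>%
  mutate(new_users = cumulative - lag(cumulative, default = 0)) %>%
  ungroup()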

DDOS Cumulative Observations

However, I am not going to examine the panel data; I am only going to look at the cumulative data, that is, the count of users as of the last day of observations, March 11. So this looks at:

  • Country Name
  • Population (as indicated by the U.S. CIA World factbook website)
  • Region (as indicated by the UN classifications)
  • Users of the DDOS tool from each country as of the last day observed, March 11
#load the data
ddos_cumulative <- read_csv("ddos_cumulative.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   population = col_number(),
##   region = col_character(),
##   users = col_double()
## )

#summarize the data
options(scipen = 999)
head(ddos_cumulative)

## # A tibble: 6 x 4
##   country   population region        users
##   <chr>          <dbl> <chr>         <dbl>
## 1 Aland          29789 Europe            1
## 2 Albania      3088385 Europe           57
## 3 Algeria     43576691 Africa           10
## 4 Andorra        85645 Europe            6
## 5 Argentina   45864941 South America    11
## 6 Armenia      3011609 Asia             16

DDOS Regional Observations

It is still useful to visualize the dramatic change in user count over time, even though I am not analyzing the time series itself. I experimented with displaying the increase as a whole and by region. So this looks at:

  • Date of observations
  • Users of the DDOS tool from each region as of the given day
ddos_regions <- read_csv("ddos_by_region.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_regions <- as_tibble(ddos_regions) 
ddos_regions

## # A tibble: 10 x 10
##    Date       Africa  Asia Europe Middle_East North_America Oceania
##    <date>      <dbl> <dbl>  <dbl>       <dbl>         <dbl>   <dbl>
##  1 2022-03-02     15   180   4863          72          1208      90
##  2 2022-03-03     32   419   6994         115          1723     119
##  3 2022-03-04     39   467   9069         137          1905     135
##  4 2022-03-05     59   604  17392         163          2416     177
##  5 2022-03-06     77   694  18447         184          2653     195
##  6 2022-03-07     88   867  20999         206          3057     245
##  7 2022-03-08    129  1143  27081         306          4028     363
##  8 2022-03-09    137  1171  27996         320          4245     580
##  9 2022-03-10    156  1308  30141         353          4548     623
## 10 2022-03-11    164  1443  34439         390          5245     718
## # ... with 3 more variables: South_America <dbl>, Southeast_Asia <dbl>,
## #   Ukraine <dbl>

ggplot(ddos_regions, aes(x = Date)) +
  geom_line(aes(y = Africa, colour = "Africa")) +
  geom_line(aes(y = Asia, colour = "Asia")) +
  geom_line(aes(y = Europe, colour = "Europe")) +
  geom_line(aes(y = Middle_East, colour = "Middle East")) +
  geom_line(aes(y = North_America, colour = "North America")) +
  geom_line(aes(y = Oceania, colour = "Oceania")) +
  geom_line(aes(y = South_America, colour = "South America")) +
  geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) + 
  geom_line(aes(y = Ukraine, colour = "Ukraine")) +
  scale_colour_discrete(name = "Region") +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Regional Users by Date") +
  theme_minimal()

If we exclude the region with by far the most users (Europe), it is easier to see how users from the remaining regions increased over time.

ggplot(ddos_regions, aes(x = Date)) +
  geom_line(aes(y = Africa, colour = "Africa")) +
  geom_line(aes(y = Asia, colour = "Asia")) +
  geom_line(aes(y = Middle_East, colour = "Middle East")) +
  geom_line(aes(y = North_America, colour = "North America")) +
  geom_line(aes(y = Oceania, colour = "Oceania")) +
  geom_line(aes(y = South_America, colour = "South America")) +
  geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) + 
  geom_line(aes(y = Ukraine, colour = "Ukraine")) +
  scale_colour_discrete(name = "Region") +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Non-European Users by Date") +
  theme_minimal()
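
Both plots above repeat one geom_line() call per region. An equivalent, more compact approach (a sketch using tidyr's pivot_longer, not the code used for the figures above) reshapes the regional columns to long format and maps Region to colour once:

#sketch: long-format version of the regional plot
ddos_regions_long <- ddos_regions %>%
  pivot_longer(-Date, names_to = "Region", values_to = "Users")
ggplot(ddos_regions_long, aes(x = Date, y = Users, colour = Region)) +
  geom_line() +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Regional Users by Date") +
  theme_minimal()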

And the total users over time.

ddos_time <- read_csv("daily_observations.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_time <- as_tibble(ddos_time) 
gg <- ggplot(ddos_time, aes(x = Date)) +
  geom_line(aes(y = Total)) +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Total Users by Date") +
  theme_minimal()
gg

Population & User Data Only

I’ll start with a basic visualization of the relationship between each country’s population and the number of DDOS tool users from that country:

#create plot
ggplot(ddos_cumulative, aes(x = log(population), y = log(users), color = region)) +
  geom_point () +
  facet_wrap("region")

Linear Model of Population and Users

What I want to look at is a linear model of the relationship between each country’s population and the number of participating users observed from that country.

I’ll first simplify my data set to only contain the columns I am looking at here.

pop_users <- ddos_cumulative %>% 
  select(c(population, users))
gg1 <- ggplot(pop_users, aes(x=population, y=users)) +
   geom_point() +
   geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="cornflowerblue") +
   labs(title= "Population and Users",
        x= "Population",
        y = "Users") +
    theme_minimal_hgrid()
gg1

## `geom_smooth()` using formula 'y ~ x'

That’s a mess. I want to take the log() of the data to get a better look at the model.

gg1b <- ggplot(pop_users, aes(x=log(population), y=log(users))) +
  geom_point() +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="cornflowerblue") +
   labs(title= "Log: Population and Users",
        x= "Population (log)",
        y = "Users (log)") +
   theme_minimal_hgrid()

gg1b

## `geom_smooth()` using formula 'y ~ x'

On first look at this relationship, there appears to be no correlation between a country’s population and the number of users of the DDOS tool.

pop_users_lm <- lm(population~users, data = pop_users)
summary(pop_users_lm)

## 
## Call:
## lm(formula = population ~ users, data = pop_users)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -68122950  -59609854  -52417541  -18720617 1334465479 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 63129462   21142912   2.986  0.00359 **
## users           4092      12789   0.320  0.74972   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199600000 on 96 degrees of freedom
## Multiple R-squared:  0.001065,   Adjusted R-squared:  -0.009341 
## F-statistic: 0.1024 on 1 and 96 DF,  p-value: 0.7497
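
Since the scatter plot above used logged values, a natural companion check (not shown in the original output) is the log-log model with users as the outcome; a sketch, assuming all counts are positive:

#sketch: log-log version of the population/users model
pop_users_loglm <- lm(log(users) ~ log(population), data = pop_users)
summary(pop_users_loglm)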

IVS Data

The next data source I want to explore is the IVS data set.

Reading in Data

This brings in an overwhelming 135,000 observations of 231 variables. I selected the columns I am interested in working with and saved them as a .csv file, which I will read in for the rest of the analysis.

A full accounting of the variables and descriptions is in the “About” tab of this GitHub Page.

To make matching easier, I used the “countrycode” package to assign proper country names to the ISO-3 numeric code from the data set.

#read in .dta file
#library(haven)
#ivs_data <- read_dta("data/ivs/ZA7505_v2-0-0.dta")
#head(ivs_data[33:42])
#write.csv(ivs_data, file = "./data/ivs/ivs_data.csv", row.names = TRUE)
#select relevant data
#ivs_subset <- select(ivs_data,10,34,35,40:50,106,109:114,119:138,150:162,166,188:196,199,201,210:214,222,224,225,230,231)
#ivs_df <- as.data.frame(ivs_subset)
#load package for converting country codes
#library(countrycode)
#ivs_df$country.name <- countrycode(ivs_df$cntry, origin = "iso3n", destination = "country.name")

ivs_clean <- read.csv("ivs-df-clean.csv")
ivs_clean <- as_tibble(ivs_clean)
names(ivs_clean)[1] <- 'country'
head(ivs_clean)

## # A tibble: 6 x 72
##   country weight imp_family imp_friends imp_leisure imp_politics imp_work
##   <chr>    <dbl>      <int>       <int>       <int>        <int>    <int>
## 1 Albania  0.697          2           1           2            3        2
## 2 Albania  0.697          1           1           4            4        1
## 3 Albania  0.697          1           2           2            4        1
## 4 Albania  0.697          1           2           2            4        1
## 5 Albania  0.697          1           1           2            4        1
## 6 Albania  0.697          1           3           3            4        1
## # ... with 65 more variables: imp_religion <int>, sat_happiness <int>,
## #   sat_health <int>, sat_life <int>, sat_control <int>,
## #   willingness_fight <int>, interest_politics <int>, prop_petition <int>,
## #   prop_boycotts <int>, prop_demonstrations <int>, prop_strikes <int>,
## #   self_position <int>, conf_churches <int>, conf_armed <int>,
## #   conf_press <int>, conf_unions <int>, conf_police <int>,
## #   conf_parliament <int>, conf_civilservices <int>, conf_regional <int>, ...

Transforming IVS Data

Preprocessing

The original IVS data includes non-substantive value labels such as “Not asked,” “NA,” and “DK,” and some responses are coded with negative values. I also excluded variables whose response structure does not match the structure shared by the majority of the responses.

Grouping by Mean

There are some changes needed to make the data more manageable. I cleaned up the data by recoding the negative values, which represent various codes for a missing observation, to NA where applicable. I then took means where applicable and saved the resulting country-level means offline as a series of data sets that I will import below.

Example of how I manipulated the data before saving:

#select relevant data
#ivs_important <- select(ivs_clean,1:8)

#find mean of each column
#important <- ivs_important %>%
  #group_by(country) %>%
  #summarise(
    #family = mean(imp_family, na.rm = TRUE),
    #friends = mean(imp_friends, na.rm = TRUE),
    #leisure = mean(imp_leisure, na.rm = TRUE),
    #politics = mean(imp_politics, na.rm = TRUE),
    #work = mean(imp_work, na.rm = TRUE),
    #religion = mean(imp_religion, na.rm = TRUE)
    #)
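
The negative-to-NA recoding described above is not shown in the saved example; a minimal sketch of that step (assuming every negative value in the numeric columns is a missing-data code):

#sketch: recode negative missing-data codes to NA across all numeric columns
ivs_recode <- ivs_clean %>%
  mutate(across(where(is.numeric), ~ replace(.x, .x < 0, NA)))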

Looking at data frames representing country means:

important <- read_csv("important.csv")

## Warning: Missing column names filled in: 'X1' [1]

## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   country = col_character(),
##   family = col_double(),
##   friends = col_double(),
##   leisure = col_double(),
##   politics = col_double(),
##   work = col_double(),
##   religion = col_double()
## )

head(important)

## # A tibble: 6 x 8
##      X1 country   family friends leisure politics  work religion
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>    <dbl> <dbl>    <dbl>
## 1     1 Albania     1.02    1.73    2.01     3.30  1.20     2.15
## 2     2 Andorra     1.12    1.54    1.42     2.94  1.53     2.97
## 3     3 Argentina   1.09    1.54    1.81     2.81  1.47     2.39
## 4     4 Armenia     1.11    1.74    1.99     2.79  1.47     1.83
## 5     5 Australia   1.11    1.48    1.65     2.41  2.00     2.90
## 6     6 Austria     1.20    1.45    1.63     2.51  1.67     2.64

Matching Data

When eliminating from my observation data the countries that did not have a profile in the IVS dataset, I lost approximately 2,000 observations and was left with 67 countries to compare. I created a data frame of this information to use going forward.
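
The matching itself was done offline and saved as integrated_data.csv; a minimal sketch of the kind of join involved (shown only for the “importance” means, and assuming country names match exactly across the two sources):

#sketch: keep only the countries present in both the DDOS data and the IVS country means
ddos_ivs_matched <- ddos_cumulative %>%
  inner_join(important, by = "country")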

all_data <- read_csv("integrated_data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   country = col_character(),
##   population = col_number(),
##   region = col_character()
## )
## i Use `spec()` for the full column specifications.

head(all_data)

## # A tibble: 6 x 23
##   country population region users family friends leisure politics  work religion
##   <chr>        <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl> <dbl>    <dbl>
## 1 Albania    3088385 South~    57   1.02    1.73    2.01     3.30  1.20     2.15
## 2 Andorra      85645 South~     6   1.12    1.54    1.42     2.94  1.53     2.97
## 3 Argent~   45864941 South~    11   1.09    1.54    1.81     2.81  1.47     2.39
## 4 Armenia    3011609 Weste~    16   1.11    1.74    1.99     2.79  1.47     1.83
## 5 Austra~   25809973 Ocean~   717   1.11    1.48    1.65     2.41  2.00     2.90
## 6 Austria    8884864 Weste~  3276   1.20    1.45    1.63     2.51  1.67     2.64
## # ... with 13 more variables: willingness <dbl>, petition <dbl>, boycott <dbl>,
## #   demonstration <dbl>, strikes <dbl>, identity <dbl>, marital <dbl>,
## #   parents <dbl>, children <dbl>, household <dbl>, education <dbl>,
## #   income <dbl>, scaled_weights <dbl>

Using Scaled Data

Normalization

Some of the variables have different value labels and maximum values, even within the same family of topics. For example, I may want to rescale the user counts to match the first set of variables, which have responses on a scale of 1 to 4.

all_data <- read.csv("integrated_data.csv")
scale_4 <- rescale(all_data$users, to=c(1,4))
summary(scale_4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.004   1.019   1.159   1.109   4.000

scale_users_4 <- as.data.frame(scale_4)
head(scale_users_4)

##    scale_4
## 1 1.012742
## 2 1.001138
## 3 1.002275
## 4 1.003413
## 5 1.162912
## 6 1.745165
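
For reference, rescale() from the scales package with to = c(1, 4) is just a linear map of the user counts onto the interval [1, 4]; an equivalent hand computation as a check:

#sketch: hand-computed equivalent of scales::rescale(users, to = c(1, 4))
manual_scale_4 <- (all_data$users - min(all_data$users)) /
  (max(all_data$users) - min(all_data$users)) * (4 - 1) + 1
all.equal(manual_scale_4, scale_4)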

Linear Regression: Scaled Data

#Join scaled values of users to summary data
all_data$scaled_users <- scale_4
#Linear regression of "importance" variables + scaled user variable  
lm_imp <- lm(scaled_users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude)
summary(lm_imp)

## 
## Call:
## lm(formula = scaled_users ~ family + friends + leisure + politics + 
##     work + religion, data = all_data, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38183 -0.13888 -0.07790  0.02971  2.60239 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.90748    1.14576   0.792    0.431
## family       0.76647    1.01582   0.755    0.453
## friends     -0.16610    0.33585  -0.495    0.623
## leisure      0.03493    0.28318   0.123    0.902
## politics    -0.31986    0.19966  -1.602    0.114
## work         0.23951    0.32225   0.743    0.460
## religion     0.03842    0.11651   0.330    0.743
## 
## Residual standard error: 0.4202 on 60 degrees of freedom
## Multiple R-squared:  0.1275, Adjusted R-squared:  0.04025 
## F-statistic: 1.461 on 6 and 60 DF,  p-value: 0.2069

Linear Regression: Unscaled Data

Compare that to the unscaled user data. I’m not sure that scaling will make a difference for the regression analysis going forward.

However, this is very informative to me as a novice user of linear models: because the scaled variable is just a linear transformation of the user counts, scaling changes the coefficient estimates and residual standard error but not the degrees of freedom, adjusted R-squared, or p-values.

#Linear regression of "importance" variables + unscaled user variable  
lm_imp2 <- lm(users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude)
summary(lm_imp2)

## 
## Call:
## lm(formula = users ~ family + friends + leisure + politics + 
##     work + religion, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1678.1  -610.4  -342.4   130.6 11437.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -405.6     5035.6  -0.081    0.936
## family        3368.6     4464.5   0.755    0.453
## friends       -730.0     1476.1  -0.495    0.623
## leisure        153.5     1244.6   0.123    0.902
## politics     -1405.8      877.5  -1.602    0.114
## work          1052.6     1416.3   0.743    0.460
## religion       168.8      512.1   0.330    0.743
## 
## Residual standard error: 1847 on 60 degrees of freedom
## Multiple R-squared:  0.1275, Adjusted R-squared:  0.04025 
## F-statistic: 1.461 on 6 and 60 DF,  p-value: 0.2069

Multiple Linear Regression

Importance to Respondents

#Linear regression of "importance" variables
mlm1 <- lm(users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude)
summary(mlm1)

## 
## Call:
## lm(formula = users ~ family + friends + leisure + politics + 
##     work + religion, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1678.1  -610.4  -342.4   130.6 11437.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -405.6     5035.6  -0.081    0.936
## family        3368.6     4464.5   0.755    0.453
## friends       -730.0     1476.1  -0.495    0.623
## leisure        153.5     1244.6   0.123    0.902
## politics     -1405.8      877.5  -1.602    0.114
## work          1052.6     1416.3   0.743    0.460
## religion       168.8      512.1   0.330    0.743
## 
## Residual standard error: 1847 on 60 degrees of freedom
## Multiple R-squared:  0.1275, Adjusted R-squared:  0.04025 
## F-statistic: 1.461 on 6 and 60 DF,  p-value: 0.2069

Removing the largest p-value first:

#Linear regression of "importance" variables
mlm1b <- lm(users ~ family + friends + politics + work + religion, data = all_data, na.action = na.exclude)
summary(mlm1b)

## 
## Call:
## lm(formula = users ~ family + friends + politics + work + religion, 
##     data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1698.1  -614.1  -341.5   117.6 11443.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -164.8     4604.1  -0.036    0.972
## family        3289.9     4382.8   0.751    0.456
## friends       -656.9     1340.8  -0.490    0.626
## politics     -1407.4      870.3  -1.617    0.111
## work          1068.7     1398.8   0.764    0.448
## religion       157.4      499.5   0.315    0.754
## 
## Residual standard error: 1832 on 61 degrees of freedom
## Multiple R-squared:  0.1273, Adjusted R-squared:  0.05574 
## F-statistic: 1.779 on 5 and 61 DF,  p-value: 0.1305

Removing the next largest p-value:

#Linear regression of "importance" variables
mlm1c <- lm(users ~ family + friends + politics + work, data = all_data, na.action = na.exclude)
summary(mlm1c)

## 
## Call:
## lm(formula = users ~ family + friends + politics + work, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1738.3  -621.8  -293.5    60.6 11473.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -416.8     4501.0  -0.093    0.927
## family        3689.2     4165.0   0.886    0.379
## friends       -745.2     1301.6  -0.573    0.569
## politics     -1407.4      863.9  -1.629    0.108
## work          1263.9     1245.1   1.015    0.314
## 
## Residual standard error: 1819 on 62 degrees of freedom
## Multiple R-squared:  0.1259, Adjusted R-squared:  0.06946 
## F-statistic: 2.232 on 4 and 62 DF,  p-value: 0.07575

Removing the next largest p-value:

#Linear regression of "importance" variables
mlm1d <- lm(users ~ family + politics + work, data = all_data, na.action = na.exclude)
summary(mlm1d)

## 
## Call:
## lm(formula = users ~ family + politics + work, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1744.5  -645.5  -276.6    63.3 11525.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -970.1     4372.6  -0.222   0.8251  
## family        3039.9     3986.2   0.763   0.4485  
## politics     -1512.6      839.7  -1.801   0.0764 .
## work          1474.1     1183.4   1.246   0.2175  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1809 on 63 degrees of freedom
## Multiple R-squared:  0.1212, Adjusted R-squared:  0.07939 
## F-statistic: 2.897 on 3 and 63 DF,  p-value: 0.04195

Removing the next largest p-value:

#Linear regression of "importance" variables
mlm1e <- lm(users ~ politics + work, data = all_data, na.action = na.exclude)
summary(mlm1e)

## 
## Call:
## lm(formula = users ~ politics + work, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1775.5  -704.3  -219.2    45.5 11535.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1463.3     2979.8   0.491   0.6250  
## politics     -1417.5      827.6  -1.713   0.0916 .
## work          1927.1     1020.2   1.889   0.0634 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1803 on 64 degrees of freedom
## Multiple R-squared:  0.1131, Adjusted R-squared:  0.08541 
## F-statistic: 4.082 on 2 and 64 DF,  p-value: 0.02146
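
The manual backward elimination above can also be approximated with base R's step(), which drops terms by AIC rather than by individual p-values, so it will not necessarily stop at the same model. A sketch, starting from the full "importance" model:

#sketch: automated backward elimination by AIC
mlm1_step <- step(mlm1, direction = "backward", trace = FALSE)
summary(mlm1_step)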

Political Inclinations of Respondents

#Linear regression of "politics" variables
mlm2 <- lm(users ~ willingness + petition + boycott + demonstration + strikes + identity, data = all_data, na.action = na.exclude)
summary(mlm2)

## 
## Call:
## lm(formula = users ~ willingness + petition + boycott + demonstration + 
##     strikes + identity, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2609.2  -623.3  -264.6   -26.5 10359.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -212.45    5053.29  -0.042    0.967  
## willingness    -687.64    1753.88  -0.392    0.697  
## petition      -1328.59    1274.62  -1.042    0.302  
## boycott        1454.15    1765.01   0.824    0.414  
## demonstration -3373.22    1984.10  -1.700    0.095 .
## strikes        3377.05    1452.75   2.325    0.024 *
## identity         31.85     588.76   0.054    0.957  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1893 on 53 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.1779, Adjusted R-squared:  0.08481 
## F-statistic: 1.911 on 6 and 53 DF,  p-value: 0.09602

Removing the highest p-value

#Linear regression of "politics" variables
mlm2b <- lm(users ~ willingness + petition + boycott + demonstration + strikes, data = all_data, na.action = na.exclude)
summary(mlm2b)

## 
## Call:
## lm(formula = users ~ willingness + petition + boycott + demonstration + 
##     strikes, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2569.1  -609.5  -241.1    20.6 10478.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     -967.7     3878.5  -0.250   0.8038  
## willingness     -435.5     1560.6  -0.279   0.7811  
## petition       -1366.1     1175.7  -1.162   0.2498  
## boycott         1371.3     1630.5   0.841   0.4036  
## demonstration  -2960.9     1719.6  -1.722   0.0902 .
## strikes         3327.3     1347.2   2.470   0.0163 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1770 on 61 degrees of freedom
## Multiple R-squared:  0.1856, Adjusted R-squared:  0.1189 
## F-statistic: 2.781 on 5 and 61 DF,  p-value: 0.02506

Removing the next highest p-value

#Linear regression of "politics" variables
mlm2c <- lm(users ~ petition + boycott + demonstration + strikes, data = all_data, na.action = na.exclude)
summary(mlm2c)

## 
## Call:
## lm(formula = users ~ petition + boycott + demonstration + strikes, 
##     data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2700.7  -641.6  -283.2    57.8 10493.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -1602       3119  -0.514   0.6092  
## petition         -1498       1069  -1.401   0.1661  
## boycott           1500       1552   0.966   0.3376  
## demonstration    -2954       1707  -1.731   0.0884 .
## strikes           3266       1319   2.476   0.0160 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1756 on 62 degrees of freedom
## Multiple R-squared:  0.1846, Adjusted R-squared:  0.132 
## F-statistic: 3.509 on 4 and 62 DF,  p-value: 0.01202

Removing the next highest p-value

#Linear regression of "politics" variables
mlm2d <- lm(users ~ petition + demonstration + strikes, data = all_data, na.action = na.exclude)
summary(mlm2d)

## 
## Call:
## lm(formula = users ~ petition + demonstration + strikes, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2590.3  -688.4  -235.1   113.7 10732.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     -80.13    2690.23  -0.030   0.9763  
## petition       -870.42     848.77  -1.026   0.3090  
## demonstration -2611.25    1668.38  -1.565   0.1226  
## strikes        3321.60    1317.16   2.522   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1755 on 63 degrees of freedom
## Multiple R-squared:  0.1723, Adjusted R-squared:  0.1329 
## F-statistic: 4.372 on 3 and 63 DF,  p-value: 0.007358

Removing the next highest p-value

#Linear regression of "politics" variables
mlm2e <- lm(users ~ demonstration + strikes, data = all_data, na.action = na.exclude)
summary(mlm2e)

## 
## Call:
## lm(formula = users ~ demonstration + strikes, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2615.7  -689.7  -272.3    99.1 10722.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      815.7     2545.5   0.320 0.749678    
## demonstration  -3877.1     1122.9  -3.453 0.000989 ***
## strikes         3417.7     1314.3   2.600 0.011554 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1756 on 64 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.1322 
## F-statistic: 6.027 on 2 and 64 DF,  p-value: 0.003997

Looking at only “demonstration”

#Linear regression of "politics" variables
mlm2f <- lm(users ~ demonstration, data = all_data, na.action = na.exclude)
summary(mlm2f)

## 
## Call:
## lm(formula = users ~ demonstration, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1266.7  -766.3  -367.9    54.0 11672.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5120.7     2017.4   2.538   0.0135 *
## demonstration  -1906.5      864.6  -2.205   0.0310 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1832 on 65 degrees of freedom
## Multiple R-squared:  0.0696, Adjusted R-squared:  0.05528 
## F-statistic: 4.862 on 1 and 65 DF,  p-value: 0.031

Plotting this model

plot(mlm2f)

Looking at only “strikes”

#Linear regression of "politics" variables
mlm2g <- lm(users ~ strikes, data = all_data, na.action = na.exclude)
summary(mlm2g)

## 
## Call:
## lm(formula = users ~ strikes, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -827.2  -676.0  -592.6  -182.7 12475.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -222.8     2731.8  -0.082    0.935
## strikes        355.3     1048.2   0.339    0.736
## 
## Residual standard error: 1898 on 65 degrees of freedom
## Multiple R-squared:  0.001764,   Adjusted R-squared:  -0.01359 
## F-statistic: 0.1149 on 1 and 65 DF,  p-value: 0.7358

Plotting the best model

plot(mlm2e)
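
plot() on an lm object draws the standard diagnostic panels (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) one at a time; a small convenience sketch to view them together:

#sketch: show the four default lm diagnostic panels in a single 2x2 layout
par(mfrow = c(2, 2))
plot(mlm2e)
par(mfrow = c(1, 1))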

Demographics of Respondents

#Linear regression of "demographics" variables
mlm3 <- lm(users ~ marital + parents + children + household + education + income, data = all_data, na.action = na.exclude)
summary(mlm3)

## 
## Call:
## lm(formula = users ~ marital + parents + children + household + 
##     education + income, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1678.0  -748.7  -399.7   136.5 11849.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3333.9     4885.7   0.682    0.498
## marital       -437.5      692.4  -0.632    0.530
## parents      -2125.9     2979.5  -0.714    0.478
## children      -182.7      905.7  -0.202    0.841
## household     -125.1      798.5  -0.157    0.876
## education      534.3      405.4   1.318    0.193
## income        -111.3      482.0  -0.231    0.818
## 
## Residual standard error: 1863 on 59 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1261, Adjusted R-squared:  0.03726 
## F-statistic: 1.419 on 6 and 59 DF,  p-value: 0.2226

Remove highest p-value first

#Linear regression of "demographics" variables
mlm3b <- lm(users ~ marital + parents + children + education + income, data = all_data, na.action = na.exclude)
summary(mlm3b)

## 
## Call:
## lm(formula = users ~ marital + parents + children + education + 
##     income, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1705.3  -732.8  -371.8   160.1 11859.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3484.2     4751.5   0.733    0.466
## marital       -431.2      685.6  -0.629    0.532
## parents      -2522.6     1558.7  -1.618    0.111
## children      -259.1      757.3  -0.342    0.733
## education      562.5      360.3   1.561    0.124
## income        -126.4      468.5  -0.270    0.788
## 
## Residual standard error: 1847 on 60 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1258, Adjusted R-squared:  0.05291 
## F-statistic: 1.726 on 5 and 60 DF,  p-value: 0.1424

Remove next highest p-value

#Linear regression of "demographics" variables
mlm3b <- lm(users ~ marital + parents + education + income, data = all_data, na.action = na.exclude)
summary(mlm3b)

## 
## Call:
## lm(formula = users ~ marital + parents + education + income, 
##     data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1687.4  -785.4  -314.5   132.6 11881.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2784.6     4257.6   0.654   0.5156  
## marital       -363.3      651.5  -0.558   0.5791  
## parents      -2579.5     1538.6  -1.677   0.0988 .
## education      589.4      349.0   1.689   0.0964 .
## income        -118.2      464.5  -0.255   0.7999  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1834 on 61 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1241, Adjusted R-squared:  0.06662 
## F-statistic:  2.16 on 4 and 61 DF,  p-value: 0.08419

Remove next highest p-value

#Linear regression of "demographics" variables
mlm3c <- lm(users ~ marital + parents + education, data = all_data, na.action = na.exclude)
summary(mlm3c)

## 
## Call:
## lm(formula = users ~ marital + parents + education, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1674.3  -777.4  -306.5    72.5 11854.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2201.8     3528.0   0.624   0.5348  
## marital       -364.8      639.4  -0.570   0.5704  
## parents      -2479.7     1455.2  -1.704   0.0933 .
## education      568.9      319.1   1.783   0.0794 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1806 on 63 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.08268 
## F-statistic: 2.983 on 3 and 63 DF,  p-value: 0.03786

Remove next highest p-value

#Linear regression of "demographics" variables
mlm3d <- lm(users ~ parents + education, data = all_data, na.action = na.exclude)
summary(mlm3d)

## 
## Call:
## lm(formula = users ~ parents + education, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1564.6  -760.7  -299.0    92.3 11920.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    874.7     2638.4   0.332   0.7413  
## parents      -2186.2     1354.0  -1.615   0.1113  
## education      559.7      317.0   1.766   0.0822 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1796 on 64 degrees of freedom
## Multiple R-squared:  0.1199, Adjusted R-squared:  0.09235 
## F-statistic: 4.358 on 2 and 64 DF,  p-value: 0.01682

Users to Politics and Education

Next, I’ll look at models that combine variables across the sets. This one pairs the importance-of-politics variable with education.

#Linear regression of "demographics" variables
mlm4 <- lm(users ~  politics + education, data = all_data, na.action = na.exclude)
summary(mlm4)

## 
## Call:
## lm(formula = users ~ politics + education, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1586.2  -754.7  -274.5   122.4 11471.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1724.5     2634.3   0.655   0.5151  
## politics     -1595.8      801.2  -1.992   0.0507 .
## education      691.4      295.5   2.340   0.0224 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1778 on 64 degrees of freedom
## Multiple R-squared:  0.1375, Adjusted R-squared:  0.1105 
## F-statistic:   5.1 on 2 and 64 DF,  p-value: 0.008809

Users to Strike Propensity and Education

This one pairs strike propensity with education.

#Linear regression of "demographics" variables
mlm5 <- lm(users ~ strikes + education, data = all_data, na.action = na.exclude)
summary(mlm5)

## 
## Call:
## lm(formula = users ~ strikes + education, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1746.0  -732.5  -386.1   192.2 12134.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -4479.7     3138.2  -1.427   0.1583  
## strikes        635.5     1015.2   0.626   0.5336  
## education      756.8      304.6   2.485   0.0156 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1827 on 64 degrees of freedom
## Multiple R-squared:  0.08957,    Adjusted R-squared:  0.06112 
## F-statistic: 3.148 on 2 and 64 DF,  p-value: 0.04964

One more combination, to see whether different variables can create a better model:

#Linear regression of "politics" variables
mlm7 <- lm(users ~ demonstration + strikes + education, data = all_data, na.action = na.exclude)
summary(mlm7)

## 
## Call:
## lm(formula = users ~ demonstration + strikes + education, data = all_data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2708.0  -701.5  -351.0   210.3 10702.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    -2395.9     3043.2  -0.787  0.43406   
## demonstration  -3377.8     1134.9  -2.976  0.00414 **
## strikes         3225.9     1294.4   2.492  0.01534 * 
## education        547.2      296.0   1.849  0.06918 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1724 on 63 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.1638 
## F-statistic: 5.309 on 3 and 63 DF,  p-value: 0.002512

par(mfrow = c(1,1))
plot(mlm7, 1:6)

Simple Linear Regression

Potential Correlation with Strike Propensity

I am going to look further at the potential correlation between a country’s propensity to engage in strikes and its engagement in DDOS attacks.

#Linear regression of "politics" variable "strikes"
lm_strikes <- lm(users ~ strikes, data = all_data, na.action = na.exclude)
summary(lm_strikes)

## 
## Call:
## lm(formula = users ~ strikes, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -827.2  -676.0  -592.6  -182.7 12475.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -222.8     2731.8  -0.082    0.935
## strikes        355.3     1048.2   0.339    0.736
## 
## Residual standard error: 1898 on 65 degrees of freedom
## Multiple R-squared:  0.001764,   Adjusted R-squared:  -0.01359 
## F-statistic: 0.1149 on 1 and 65 DF,  p-value: 0.7358

Potential Correlation with Education Level

I am going to look further at the potential correlation between education level and engagement in DDOS attacks.

#Linear regression of "educational level" and "users"
lm_education <- lm(users ~ education, data = all_data, na.action = na.exclude)
summary(lm_education)

## 
## Call:
## lm(formula = users ~ education, data = all_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1533.6  -810.1  -400.2   154.9 12164.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -2730.8     1422.6  -1.920   0.0593 .
## education      735.6      301.3   2.441   0.0174 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1818 on 65 degrees of freedom
## Multiple R-squared:  0.084,  Adjusted R-squared:  0.06991 
## F-statistic: 5.961 on 1 and 65 DF,  p-value: 0.01736

It may be that there are no correlations to be found here. But I want to run the analyses again using a column from the codebook indicating population ‘weights’ that should be used when comparing multiple variables, to account for population size.

Adding Population Weights

Importance to Respondents

I’ll re-run the model with users and the importance variables, but using the weights column. This gives me a warning that this is an “essentially perfect fit” and that the summary may be unreliable. This result is consistent for each of the weighted models.

#Linear regression of "importance" variables + weighted   
mlm1w <- lm(users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude, weights)
summary(mlm1w)

## Warning in summary.lm(mlm1w): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = users ~ family + friends + leisure + politics + 
##     work + religion, data = all_data, subset = weights, na.action = na.exclude)
## 
## Residuals:
##          1        1.1 
## -5.024e-15  5.024e-15 
## 
## Coefficients: (6 not defined because of singularities)
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 5.700e+01  5.024e-15 1.134e+16   <2e-16 ***
## family             NA         NA        NA       NA    
## friends            NA         NA        NA       NA    
## leisure            NA         NA        NA       NA    
## politics           NA         NA        NA       NA    
## work               NA         NA        NA       NA    
## religion           NA         NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.105e-15 on 1 degrees of freedom

Political Inclinations of Respondents

#Linear regression of "politics" variables + weighted
mlm2w <- lm(users ~ willingness + petition + boycott + demonstration + strikes + identity, data = all_data, na.action = na.exclude, weights)
summary(mlm2w)

## Warning in summary.lm(mlm2w): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = users ~ willingness + petition + boycott + demonstration + 
##     strikes + identity, data = all_data, subset = weights, na.action = na.exclude)
## 
## Residuals:
##          1        1.1 
## -5.024e-15  5.024e-15 
## 
## Coefficients: (6 not defined because of singularities)
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)   5.700e+01  5.024e-15 1.134e+16   <2e-16 ***
## willingness          NA         NA        NA       NA    
## petition             NA         NA        NA       NA    
## boycott              NA         NA        NA       NA    
## demonstration        NA         NA        NA       NA    
## strikes              NA         NA        NA       NA    
## identity             NA         NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.105e-15 on 1 degrees of freedom

Demographics of Respondents

#Linear regression of "demographics" variables + weighted
mlm3w <- lm(users ~ marital + parents + children + household + education + income, data = all_data, na.action = na.exclude, weights)
summary(mlm3w)

## Warning in summary.lm(mlm3w): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = users ~ marital + parents + children + household + 
##     education + income, data = all_data, subset = weights, na.action = na.exclude)
## 
## Residuals:
##          1        1.1 
## -5.024e-15  5.024e-15 
## 
## Coefficients: (6 not defined because of singularities)
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 5.700e+01  5.024e-15 1.134e+16   <2e-16 ***
## marital            NA         NA        NA       NA    
## parents            NA         NA        NA       NA    
## children           NA         NA        NA       NA    
## household          NA         NA        NA       NA    
## education          NA         NA        NA       NA    
## income             NA         NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.105e-15 on 1 degrees of freedom

The best model I generated, with the weights argument added:

#Linear regression of "politics" variables
mlm7b <- lm(users ~ demonstration + strikes + education, data = all_data, na.action = na.exclude, weights)
summary(mlm7b)

## Warning in summary.lm(mlm7b): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = users ~ demonstration + strikes + education, data = all_data, 
##     subset = weights, na.action = na.exclude)
## 
## Residuals:
##          1        1.1 
## -5.024e-15  5.024e-15 
## 
## Coefficients: (3 not defined because of singularities)
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)   5.700e+01  5.024e-15 1.134e+16   <2e-16 ***
## demonstration        NA         NA        NA       NA    
## strikes              NA         NA        NA       NA    
## education            NA         NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.105e-15 on 1 degrees of freedom
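
A likely reason for the “essentially perfect fit” warnings: in each weighted call, weights is passed as a bare, unnamed argument. Because lm()'s argument order is (formula, data, subset, weights, na.action, ...), and na.action is already named, the bare weights lands in the subset argument, which is exactly what the printed calls show (subset = weights). Only a sliver of the data survives that subsetting, so the fit is degenerate. To actually weight the regression, the argument needs to be named and pointed at a numeric column; a sketch, assuming the scaled_weights column in all_data is the intended weighting variable:

#sketch: pass the survey weights explicitly by name
#(assumes scaled_weights in all_data is the intended weighting variable)
mlm7_weighted <- lm(users ~ demonstration + strikes + education,
                    data = all_data, weights = scaled_weights, na.action = na.exclude)
summary(mlm7_weighted)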

Government Right to Access

ivs_right <- read.csv("government_rights.csv")
names(ivs_right)[1] <- 'country'
head(ivs_right)

##     country surveillance  monitor  collect users
## 1   Albania     2.443206 3.027178 2.951220    57
## 2   Andorra     2.261952 3.588645 3.569721     6
## 3 Argentina     2.499501 3.044865 2.775673    11
## 4   Armenia     2.359333 2.657333 2.552000    16
## 5 Australia     1.787645 2.877551 2.896856   717
## 6   Austria     2.495134 3.085158 3.333333  3276

Exploratory Model

right <- lm(users ~ surveillance + monitor + collect, data = ivs_right)
summary(right)

## 
## Call:
## lm(formula = users ~ surveillance + monitor + collect, data = ivs_right)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1285.4  -692.4  -360.4    10.9 11564.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -788.8     1407.0  -0.561    0.577
## surveillance   -992.7      847.5  -1.171    0.246
## monitor       -1297.9     1726.1  -0.752    0.455
## collect        2516.6     1533.5   1.641    0.106
## 
## Residual standard error: 1855 on 63 degrees of freedom
## Multiple R-squared:  0.0762, Adjusted R-squared:  0.0322 
## F-statistic: 1.732 on 3 and 63 DF,  p-value: 0.1695

Removing the highest p-value

right2 <- lm(users ~ surveillance + collect, data = ivs_right)
summary(right2)

## 
## Call:
## lm(formula = users ~ surveillance + collect, data = ivs_right)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1537.0  -711.1  -431.3    10.0 11813.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -876.5     1397.4  -0.627   0.5327  
## surveillance  -1278.4      755.1  -1.693   0.0953 .
## collect        1487.9      690.5   2.155   0.0350 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1848 on 64 degrees of freedom
## Multiple R-squared:  0.0679, Adjusted R-squared:  0.03878 
## F-statistic: 2.331 on 2 and 64 DF,  p-value: 0.1054

Removing the next highest p-value

right3 <- lm(users ~ collect, data = ivs_right)
summary(right3)

## 
## Call:
## lm(formula = users ~ collect, data = ivs_right)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1070.4  -707.3  -573.1  -203.9 12247.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1136.8     1408.7  -0.807    0.423
## collect        620.0      469.2   1.321    0.191
## 
## Residual standard error: 1875 on 65 degrees of freedom
## Multiple R-squared:  0.02616,    Adjusted R-squared:  0.01117 
## F-statistic: 1.746 on 1 and 65 DF,  p-value: 0.191

Media Data

Read in data for both measures of media interest gathered from the Spike/Newswhip media site

Public Interest = Social media interactions on articles

Media Interest = Number of articles published

Public Interest

#public interest
public_interest <- read_csv("IT_Army_ Public_Interest.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
public_interest <- as_tibble(public_interest)
head(public_interest)

## # A tibble: 6 x 10
##   Date       All_Locations Europe North_America Oceania South_America  Asia
##   <date>             <dbl>  <dbl>         <dbl>   <dbl>         <dbl> <dbl>
## 1 2022-02-26         19811   7777          4763    2301             0  4960
## 2 2022-02-27        410149 346425         56690    3703           171  1765
## 3 2022-02-28        295464 238816         53122     278           166   532
## 4 2022-03-01        202270 171920         28329     662            34  3250
## 5 2022-03-02         95478  73628         21099     836            12  1786
## 6 2022-03-03         33566   8929         24598     164             6   482
## # ... with 3 more variables: Africa <dbl>, Middle_East <dbl>,
## #   Southeast_Asia <dbl>

Media Interest

#media interest
media_interest <- read_csv("IT_Army_ Media_Interest.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
media_interest <- as_tibble(media_interest)
head(media_interest)

## # A tibble: 6 x 10
##   Date       All_Locations Europe North_America Oceania South_America  Asia
##   <date>             <dbl>  <dbl>         <dbl>   <dbl>         <dbl> <dbl>
## 1 2022-02-26           134     15            61      16             0    39
## 2 2022-02-27            89     14            42       4             2    25
## 3 2022-02-28           356     46           249       4             1    59
## 4 2022-03-01           106     29            46       8             0    27
## 5 2022-03-02           152     26            75       3             0    45
## 6 2022-03-03           109     11            62       1             0    33
## # ... with 3 more variables: Africa <dbl>, Middle_East <dbl>,
## #   Southeast_Asia <dbl>

Media, Population, and User Data

All measures are in one dataset for analysis.

#data frame with all regional observations
regional_all <- read_csv("comprehensive_by_region.csv") 

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Region = col_character(),
##   Total_Population = col_number(),
##   Sample_Population = col_number(),
##   DDOS_Users = col_double(),
##   Public_Interest = col_double(),
##   Media_Interest = col_double()
## )

regional_all <- as_tibble(regional_all)
regional_all

## # A tibble: 8 x 6
##   Region         Total_Population Sample_Population DDOS_Users Public_Interest
##   <chr>                     <dbl>             <dbl>      <dbl>           <dbl>
## 1 Africa               1234685606         459973244        164            7666
## 2 Asia                 4434971532        3314507290       1443           29078
## 3 Europe                748481333         745852328      40262          914764
## 4 Middle_East           393498810         303663005        390            1511
## 5 North_America         600504974         512581441       5245          291484
## 6 Oceania                43693399          30801415        718            9700
## 7 South_America         437360279         404401418        393             735
## 8 Southeast_Asia        680855171         614900482        264            9033
## # ... with 1 more variable: Media_Interest <dbl>

Proportional Data

#proportions
proportions <- read.csv("proportions.csv")
proportions

##           Region Total_Population Sample_Population   DDOS_Users
## 1         Africa      0.082521173       0.030742670 1.096110e-08
## 2           Asia      0.296414773       0.221527673 9.644400e-08
## 3         Europe      0.050025332       0.049849621 2.690942e-06
## 4    Middle_East      0.026299799       0.020295553 2.606600e-08
## 5  North_America      0.040135217       0.034258779 3.505537e-07
## 6        Oceania      0.002920282       0.002058637 4.798810e-08
## 7  South_America      0.029231315       0.027028483 2.626650e-08
## 8 Southeast_Asia      0.045505485       0.041097352 1.764460e-08
##   Public_Interest Media_Interest
## 1    5.123631e-07    1.33670e-09
## 2    1.943451e-06    2.05186e-08
## 3    6.113896e-05    1.53054e-08
## 4    1.009889e-07    2.20560e-09
## 5    1.948156e-05    6.02190e-08
## 6    6.483070e-07    4.01010e-09
## 7    4.912430e-08    3.34200e-10
## 8    6.037276e-07    6.54990e-09

Basic Pearson Correlation

cor(regional_all$DDOS_Users, regional_all$Public_Interest, method = "pearson")

## [1] 0.9811875
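
cor() returns only the point estimate; cor.test() would add a confidence interval and a significance test, though with only eight regional observations the test has limited power. A sketch:

#sketch: Pearson correlation with a confidence interval and significance test
cor.test(regional_all$DDOS_Users, regional_all$Public_Interest, method = "pearson")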

Linear Regression

Simple Correlation of Population and Sample Population

This compares the total population of each region being examined with the population of the countries represented in my sample. These are highly correlated.

#Linear regression
pop_lm <- lm(Total_Population ~ Sample_Population, data = regional_all, na.action = na.exclude)
summary(pop_lm)

## 
## Call:
## lm(formula = Total_Population ~ Sample_Population, data = regional_all, 
##     na.action = na.exclude)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -253750322 -121385069  -57833324    -973754  611162518 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.419e+07  1.298e+08   0.109    0.917    
## Sample_Population 1.325e+00  1.032e-01  12.837 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283600000 on 6 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.959 
## F-statistic: 164.8 on 1 and 6 DF,  p-value: 1.373e-05

Multiple Linear Correlation of Variables

The explanatory variables are the public interest and media interest counts, with the DDOS user counts as the outcome variable.

#Linear regression user and media variables
regional_mlm <- lm(DDOS_Users ~ Public_Interest + Media_Interest, data = regional_all, na.action = na.exclude)
summary(regional_mlm)

## 
## Call:
## lm(formula = DDOS_Users ~ Public_Interest + Media_Interest, data = regional_all, 
##     na.action = na.exclude)
## 
## Residuals:
##         1         2         3         4         5         6         7         8 
## -819.0092 2051.3093  167.6466 -199.0906 -704.4965   -0.2989 -410.6496  -85.4111 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     815.022433 478.700459   1.703  0.14938    
## Public_Interest   0.045171   0.001319  34.234    4e-07 ***
## Media_Interest   -8.914703   1.411800  -6.314  0.00147 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1060 on 5 degrees of freedom
## Multiple R-squared:  0.9958, Adjusted R-squared:  0.9942 
## F-statistic: 599.5 on 2 and 5 DF,  p-value: 1.112e-06
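
Confidence intervals for the multiple-regression coefficients, and a quick look at how strongly the two predictors are correlated with each other (relevant for collinearity), can be pulled with base R. A minimal sketch on the model and data already fit above:

#95% confidence intervals for the coefficients
confint(regional_mlm, level = 0.95)

#correlation between the two explanatory variables
cor(regional_all$Public_Interest, regional_all$Media_Interest)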

library(stargazer)

## Warning: package 'stargazer' was built under R version 4.1.2

## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(regional_mlm, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             DDOS_Users         
## -----------------------------------------------
## Public_Interest              0.045***          
##                               (0.001)          
##                                                
## Media_Interest               -8.915***         
##                               (1.412)          
##                                                
## Constant                      815.022          
##                              (478.700)         
##                                                
## -----------------------------------------------
## Observations                     8             
## R2                             0.996           
## Adjusted R2                    0.994           
## Residual Std. Error     1,060.060 (df = 5)     
## F Statistic           599.468*** (df = 2; 5)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
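
The same stargazer call can also write the table to a file for the poster or a web page. This is a sketch only; the output file name here is illustrative, not one used in the project.

#export the regression table as HTML (file name is illustrative)
stargazer(regional_mlm, type = "html", out = "regional_mlm.html")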

#plot log media interest against log DDOS users, colored by region
regional_plot <- ggplot(regional_all, aes(x = log(Media_Interest), y = log(DDOS_Users), color = Region)) +
   geom_point() +
   geom_smooth(method = lm, se = TRUE, fullrange = TRUE, color = "cornflowerblue") +
   labs(title = "Media Interest and Users",
        x = "log: Media Interest",
        y = "log: DDOS Users") +
   theme_minimal_hgrid()
regional_plot

## `geom_smooth()` using formula 'y ~ x'

#plot log public interest against log DDOS users, colored by region
regional_plot2 <- ggplot(regional_all, aes(x = log(Public_Interest), y = log(DDOS_Users), color = Region)) +
   geom_point() +
   geom_smooth(method = lm, se = TRUE, fullrange = TRUE, color = "cornflowerblue") +
   labs(title = "Public Interest and Users",
        x = "log: Public Interest",
        y = "log: DDOS Users") +
   theme_minimal_hgrid()
regional_plot2

## `geom_smooth()` using formula 'y ~ x'

Interaction Term

Fitting another model, this time adding an interaction term between public interest and media interest.

#interaction model: both main effects plus the Public_Interest:Media_Interest term
mlm3d <- lm(DDOS_Users ~ Public_Interest + Media_Interest + Public_Interest*Media_Interest, data = regional_all)

summary(mlm3d)

## 
## Call:
## lm(formula = DDOS_Users ~ Public_Interest + Media_Interest + 
##     Public_Interest * Media_Interest, data = regional_all)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -342.124   81.899    2.874  202.827   -4.138  111.255  252.096 -304.690 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.008e+02  1.583e+02   0.637  0.55879    
## Public_Interest                 5.302e-02  1.050e-03  50.513 9.19e-07 ***
## Media_Interest                  2.492e-01  1.214e+00   0.205  0.84739    
## Public_Interest:Media_Interest -4.010e-05  5.038e-06  -7.959  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 288.8 on 4 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9996 
## F-statistic:  5405 on 3 and 4 DF,  p-value: 1.141e-07
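
To check whether the interaction term actually improves on the additive specification, the two nested models can be compared with an F-test. A minimal sketch using the models already fit above:

#nested model comparison: additive model vs. model with the interaction term
anova(regional_mlm, mlm3d)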

Single Models

Public Interest

pi <- lm(DDOS_Users ~ Public_Interest, data = regional_all)
summary(pi)

## 
## Call:
## lm(formula = DDOS_Users ~ Public_Interest, data = regional_all)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6530.1   465.6   853.1   930.3  2035.0 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.955e+02  1.158e+03  -0.514    0.625    
## Public_Interest  4.244e-02  3.409e-03  12.449 1.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2899 on 6 degrees of freedom
## Multiple R-squared:  0.9627, Adjusted R-squared:  0.9565 
## F-statistic:   155 on 1 and 6 DF,  p-value: 1.641e-05

Media Interest

mi <- lm(DDOS_Users ~ Media_Interest, data = regional_all)
summary(mi)

## 
## Call:
## lm(formula = DDOS_Users ~ Media_Interest, data = regional_all)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5676  -5161  -4585  -4362  33997 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    4678.357   6515.562   0.718    0.500
## Media_Interest    6.928     18.681   0.371    0.723
## 
## Residual standard error: 14850 on 6 degrees of freedom
## Multiple R-squared:  0.02241,    Adjusted R-squared:  -0.1405 
## F-statistic: 0.1375 on 1 and 6 DF,  p-value: 0.7235
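
A compact way to line the candidate models up is an information-criterion table. A minimal sketch over the four models fit so far; with only eight observations these comparisons are suggestive at best.

#AIC comparison of the single-predictor, additive, and interaction models
AIC(pi, mi, regional_mlm, mlm3d)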

Multiple Lines

#overlay the media-interest and public-interest fits on the same log-log axes
gg2 <- ggplot() +
   geom_smooth(aes(log(DDOS_Users), log(Media_Interest)), data = regional_all, fullrange = TRUE,
               method = "lm", se = TRUE, color = "red3") + 
   geom_smooth(aes(log(DDOS_Users), log(Public_Interest)), data = regional_all, fullrange = TRUE,
               method = "lm", se = TRUE, color = "forestgreen") +
   ggtitle("Media Interest and Public Interest & DDOS Users", subtitle = "Green Indicating Public Interest, Red Indicating Media Interest") +
   xlab("log: DDOS Users") +
   ylab("log: Interest Count") +
   theme_minimal_hgrid()

gg2

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
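
An alternative to hard-coding the two line colors is to pivot the data into long format so that ggplot draws one smoothed line per interest type and builds the legend automatically. This is a minimal sketch assuming the tidyr package is installed alongside the tidyverse packages used above; interest_long and Interest_Type are names introduced here for illustration.

#pivot the two interest measures into one column, then map the type to color
interest_long <- tidyr::pivot_longer(regional_all, c(Media_Interest, Public_Interest),
                                     names_to = "Interest_Type", values_to = "Interest")

ggplot(interest_long, aes(x = log(DDOS_Users), y = log(Interest), color = Interest_Type)) +
  geom_smooth(method = "lm", se = TRUE, fullrange = TRUE) +
  labs(title = "Media and Public Interest vs. DDOS Users",
       x = "log: DDOS Users", y = "log: Interest Count") +
  theme_minimal_hgrid()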

Stargazer

Fitting the two single-predictor interest models with stargazer, reporting robust (HC1) standard errors computed with the sandwich package.

library(stargazer)
library(tinytex)
library(sandwich)

## Warning: package 'sandwich' was built under R version 4.1.2

m1 = lm(DDOS_Users ~ Media_Interest, data = regional_all)
m2 = lm(DDOS_Users ~ Public_Interest, data = regional_all)

model.lst = list(m1, m2)

stargazer(m1,
          m2,
          title = "Displaying results for multiple explanatory variables",
          type = "text",
          float = TRUE,
          report = "vcsp",
          se = lapply(model.lst, function(x) sqrt(diag(vcovHC(x, type = "HC1")))),
          no.space = TRUE,
          header = FALSE,
          single.row = TRUE,
          font.size = "small",
          intercept.bottom = FALSE,
          covariate.labels = c("Intercept", "Media Interest", "Public Interest"),
          column.labels = c("Media Interest", "Public Interest"),
          column.separate = c(1, 1),
          digits = 3,
          t.auto = TRUE,
          p.auto = TRUE
          )

## 
## Displaying results for multiple explanatory variables
## =====================================================================
##                                        Dependent variable:           
##                              ----------------------------------------
##                                             DDOS_Users               
##                                 Media Interest      Public Interest  
##                                       (1)                 (2)        
## ---------------------------------------------------------------------
## Intercept                    4,678.357 (5,155.242) -595.471 (807.308)
##                                    p = 0.365           p = 0.461     
## Media Interest                   6.928 (8.020)                       
##                                    p = 0.388                         
## Public Interest                                      0.042 (0.003)   
##                                                        p = 0.000     
## ---------------------------------------------------------------------
## Observations                           8                   8         
## R2                                   0.022               0.963       
## Adjusted R2                         -0.141               0.957       
## Residual Std. Error (df = 6)      14,846.860           2,898.960     
## F Statistic (df = 1; 6)              0.138             154.983***    
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
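
As a cross-check on the robust standard errors stargazer is displaying, the same HC1 covariance matrix can be passed to coeftest(). This sketch assumes the lmtest package (not loaded above) is available alongside sandwich.

#robust (HC1) coefficient tests for the public-interest model
library(lmtest)
coeftest(m2, vcov. = vcovHC(m2, type = "HC1"))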