Acknowledgement

Special thanks to my friend Ömer Bayraklı for his help with the preparation of this workshop.

Structure of this workshop

  • November 24 - Introduction to R programming

  • December 1 - Descriptive statistics with Tidyverse

  • December 8 - Inferential statistics (linear regression) and RMarkdown

  • December 15 - Machine learning and Twitter data analysis

Note: This is a beginner course! If you already know advanced statistics or R, this workshop may not be for you.

Note 2: There will be mini exercises (which can be done in the second hour of the workshop). To get a certificate from Compec, I will ask you to open a GitHub repository and upload your solutions/notes there. Optionally, you can complete a final hands-on data science project, which can be useful for your future job applications. After today’s workshop, I will share a Google Sheet with you, and you can enter your GitHub repo link there. You can also collaborate on exercises. If you have not used GitHub before, don’t worry. I’ll send you a video explaining how to open an account and push your code to a repository.

About the instructor

More info about me

Week 1 (Nov 24, 2023)

Intro to R

What is it?

R is a programming language widely used in data science. It supports data mining, data processing, statistical programming, and machine learning, and it is a leading language among researchers in the natural and social sciences.

Why R?

  • Completely free and open source

  • Open science and scientific reproducibility

  • Customized analyses

How will we code?

Throughout the workshop, we will use RStudio by Posit, which is the most popular IDE for R. There are also other options like RStudio Cloud or Google Colab that allow you to write code in your web browser.

Real world use cases of R:

Python vs. R?

Let’s start coding!

# Addition
2 + 2
## [1] 4
# Subtraction
3 - 2
## [1] 1
# Multiplication
3 * 2
## [1] 6
# Division
3 / 2
## [1] 1.5
# Exponentiation
2 ^ 5
## [1] 32
# Order of operations
2 + 3 * 4
## [1] 14
(2 + 3) * 4
## [1] 20

Functions in R

In R, instead of using mathematical operators like this, we will primarily use “functions” that allow us to perform various tasks. Each function takes specific arguments. Arguments are the inputs to the function, i.e., the objects on which the function operates. Some of these arguments may be required to be explicitly specified. If a function requires multiple arguments, the arguments are separated by commas.
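
For example, the built-in log() function takes a number as its first argument and an optional base as its second; the two arguments are separated by a comma:

log(100, base = 10)
## [1] 2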

Functions are a way to package up and reuse code.

The function below is called “add_two” and it adds two to any number you give it.

add_two <- function(x) {
  x + 2
}

Now we can use the function we just created.

add_two(3)
## [1] 5

Other functions are built into R. For example, the log() function computes the natural logarithm, sqrt() computes the square root, and abs() returns the absolute value.

log(10)
## [1] 2.302585
sqrt(4)
## [1] 2
abs(-2)
## [1] 2

You can also use functions inside other functions.

log(sqrt(4))
## [1] 0.6931472

Variables in R

A variable is a named object stored in the computer’s memory. We can give it any name and value we want. The computer stores the values we assign to variables in memory, and later we can access those values through the variable’s name.

In R, we assign variables using the <- operator.

# this code will not produce any output but will assign the value 100 to the variable 'var'
var <- (2*5)^2

# if we want to see the value of the variable, we can just type the name of the variable or print it to the console
var
## [1] 100
print(var)
## [1] 100

Operations with variables

# we can use variables in operations
var + 1
## [1] 101
var2 <- sqrt(16)

var2 + var
## [1] 104
var2 * var
## [1] 400

Logical operators

Using the <, >, <=, >=, ==, !=, |, and & operators, we can perform comparisons between two variables. As a result, these operators will give us either TRUE, meaning the comparison is true, or FALSE, meaning the comparison is false.

var < 105 # smaller than
## [1] TRUE
var > 1 # bigger than
## [1] TRUE
var <= 8 # smaller than or equal to
## [1] FALSE
var >= 8 # bigger than or equal to
## [1] TRUE
var == 8 # equal to
## [1] FALSE
var != 6 # not equal to
## [1] TRUE
var == 4 | var == 8 # equal to either 4 or 8
## [1] FALSE
var == 4 & var == 8 # equal to both 4 and 8 (never TRUE for a single value)
## [1] FALSE

Careful: writing var == 4 | 8 is a common beginner mistake. It evaluates as (var == 4) | 8, and since any non-zero number counts as TRUE, the result is always TRUE.

Note: You can always get help about a specific function or operator by using the help() command.

help(log)

help("+")

Data types in R

In R, values can have different types. The main data types include integer, double (for real numbers), character, and logical. You can use the typeof() function to determine the data type of a variable.

Here’s an example:

var <- as.integer(2)
var2 <- 2.2
var3 <- "hey learning R is cool"
var4 <- TRUE

typeof(var)
## [1] "integer"
typeof(var2)
## [1] "double"
typeof(var3)
## [1] "character"
typeof(var4)
## [1] "logical"

Vectors

Numeric vectors

A vector is a collection of values of the same type. We can create a vector using the c() function. The c() function takes any number of arguments and combines them into a vector.

# create a vector of numbers
numbers <- c(1, 2, 3, 4, 5)

print(numbers)
## [1] 1 2 3 4 5
# use length() to get the length of a vector
length(numbers)
## [1] 5
# consecutive numbers can be created using the : operator
5:90
##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [26] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [51] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## [76] 80 81 82 83 84 85 86 87 88 89 90
# or use seq() to create a sequence of numbers
seq(5, 90, by = 2)
##  [1]  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53
## [26] 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89
# use rep() and seq() to create a vector of repeated numbers
rep(seq(1,10,3),5)
##  [1]  1  4  7 10  1  4  7 10  1  4  7 10  1  4  7 10  1  4  7 10

Some functions that you can use with numeric vectors:

# sum() adds up all the numbers in a vector
sum(numbers)
## [1] 15
# mean() computes the mean of all the numbers in a vector
mean(numbers)
## [1] 3
# max() and min() return the maximum and minimum values in a vector
max(numbers)
## [1] 5
min(numbers)
## [1] 1
# sort() sorts the numbers in a vector in ascending order
sort(numbers)
## [1] 1 2 3 4 5
# you can also sort in descending order
sort(numbers, decreasing = TRUE)
## [1] 5 4 3 2 1
# sd() computes the standard deviation of the numbers in a vector
sd(numbers)
## [1] 1.581139
# median() computes the median of the numbers in a vector
median(numbers)
## [1] 3

Operations with vectors:

# you can add two vectors together
numbers + c(1, 2, 3, 4, 5)
## [1]  2  4  6  8 10
# you can multiply two vectors together
numbers * c(1, 2, 3, 4, 5)
## [1]  1  4  9 16 25

Indexing vectors:

# you can access the elements of a vector using the [] operator
new_vector <- 7:21

new_vector[1]
## [1] 7
new_vector[2:7]
## [1]  8  9 10 11 12 13
new_vector[c(1, 3, 5, 7)]
## [1]  7  9 11 13
new_vector[-1]
##  [1]  8  9 10 11 12 13 14 15 16 17 18 19 20 21
new_vector[-(1:3)]
##  [1] 10 11 12 13 14 15 16 17 18 19 20 21

Logical vectors

Logical vectors are vectors that contain TRUE and FALSE values. You can create logical vectors using the c() function.

# create a logical vector
logical_vector <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# operators like <, >, <=, >=, ==, !=, |, and & can be used to create logical vectors
new_vector <- 1:8

new_vector < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
new_vector == 7
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
new_vector != 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Filtering vectors:

# you can use logical vectors to filter other vectors

new_vector[new_vector < 3] # returns all values in new_vector that are smaller than 3
## [1] 1 2
new_vector[new_vector == 7] # returns all values in new_vector that are equal to 7
## [1] 7

Character vectors

Character vectors are vectors that contain strings. You can create character vectors using the c() function.

# create a character vector
character_vector <- c("hello", "learning", "R", "is", "cool")
print(character_vector)
## [1] "hello"    "learning" "R"        "is"       "cool"
# you can use the nchar() function to get the number of characters in each string
nchar(character_vector)
## [1] 5 8 1 2 4
# you can use the paste() function to concatenate strings
paste("hello", "learning", "R", "is", "cool")
## [1] "hello learning R is cool"
# you can use the strsplit() function to split a string into a vector of substrings
strsplit("hello learning R is cool", " ")
## [[1]]
## [1] "hello"    "learning" "R"        "is"       "cool"

Data frames

Data frames are used to store tabular data. You can create a data frame using the data.frame() function.

# create a data frame
df <- data.frame(
  age = c(55, 95, 67, 89, 24),
  height = c(1.78, 1.65, 1.90, 1.45, 1.67)
)

print(df)
##   age height
## 1  55   1.78
## 2  95   1.65
## 3  67   1.90
## 4  89   1.45
## 5  24   1.67
# you can use the $ operator or [] to access a column in a data frame
df$age
## [1] 55 95 67 89 24
df['age']
##   age
## 1  55
## 2  95
## 3  67
## 4  89
## 5  24

There are some built-in datasets in R, like state.x77. You can use data() to view the other datasets available in R.

This is a matrix with 50 rows and 8 columns giving the following statistics in the respective columns.

  • Population: population estimate as of July 1, 1975.

  • Income: per capita income (1974)

  • Illiteracy: illiteracy (1970, percent of population)

  • Life Exp: life expectancy in years (1969-71)

  • Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)

  • HS Grad: percent high-school graduates (1970)

  • Frost: mean number of days with minimum temperature below freezing (1931-1960) in capital or large city

  • Area: land area in square miles

Source: U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States, and U.S. Department of Commerce, Bureau of the Census (1977) County and City Data Book.

# save the dataset to a variable as a dataframe object in R
df <- as.data.frame(state.x77)

# view the df:
#View(state.x77)
head(state.x77) 
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
tail(state.x77)
##               Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
## Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
## Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
## Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
## West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
## Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
## Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
# you can use the str() function to get information about the structure of a data frame
str(df)
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...
# you can use the summary() function to get summary statistics about a data frame
summary(df)
##    Population        Income       Illiteracy       Life Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81162  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432
# get the dimension
dim(df)
## [1] 50  8

Quick visualization

Let’s create some dummy data and visualize it.

x <- 0:10
y <- x^2

# you can use the plot() function to create a scatter plot
plot(x, y, 
     xlab = "X-axis title",
     ylab = "Y-axis title")

# simulate different data
teams <- c("Team A", "Team B", "Team C", "Team D", "Team E") # generating team names

scores <- sample(50:100, length(teams)) # generating the scores for each team

# bar plot
barplot(scores, 
        main = "Scores by Teams",
        xlab = "Teams",
        ylab = "Scores",
        col = "lightgreen",
        border = "black",
        names.arg = teams)

We will learn later how to create more advanced visualizations using the ggplot2 package.

In-Class Exercises

Exercise 1

Today is Monday. What day of the week will it be 9, 54, 306, and 8999 days from now?

Note: Create a character vector containing the days of the week and repeat this vector 9000 times. Then, use indexing to find the desired day. Hint: Write the days of the week in the character vector starting from Tuesday.

days <- c("Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Monday")

# you complete...

Exercise 2

Create a vector containing the numbers 1 to 100. Then, find the sum of the numbers that are divisible by 3 or 5.

Tip: Use the %% operator to find the remainder of a division.
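
If you have not seen %% before, here is how it behaves:

10 %% 3 # remainder of 10 divided by 3
## [1] 1
9 %% 3 # 9 is divisible by 3, so the remainder is 0
## [1] 0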

# answer:
numbers <- 1:100

# you complete...

Exercise 3

You are taking measurements every 5 days throughout the year. Create a number sequence that shows on which days you take measurements and assign it to a variable named “measurement_days”. The result should look like this: 5, 10, 15, 20, …, 365.

# answer:


# you complete...

Practice questions for next week

Q1: Create a vector containing 50 random numbers with a normal (Gaussian) distribution, mean 20 and standard deviation 2. You can do this with the rnorm() function. Then assign the numbers to a variable and use that variable as an argument to the sample() function to randomly select 10 samples from that vector. Run ?rnorm and ?sample to see how the functions work and what arguments they take.

Q2: Install and load the “LearnBayes” package and take a look at the first few rows of the dataset called “studentdata”.

Answer the following questions:

2.1. Remove rows that include NA observations.

2.2. Get the number of female students.

2.3. Number of students who are taller than 180 cm (tip: the height is given in inches; first convert it to cm by multiplying the observations by 2.54).

2.4. Plot the relationship between height and sex.

See you all next Friday!

Week 2 (Dec 1, 2023)

Answers to Exercise Questions

Exercise 1

Today is Monday. What day of the week will it be 9, 54, 306, and 8999 days from now?

Note: Create a character vector containing the days of the week and repeat this vector 9000 times. Then, use indexing to find the desired day. Hint: Write the days of the week in the character vector starting from Tuesday.

days <- c("Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Monday")

# you complete...

days_rep <- rep(days,9000)

days_rep[9]
## [1] "Wednesday"
days_rep[54]
## [1] "Saturday"
days_rep[306]
## [1] "Saturday"
days_rep[8999]
## [1] "Friday"
# or::

days_rep[c(9,54,306,8999)]
## [1] "Wednesday" "Saturday"  "Saturday"  "Friday"

Exercise 2

Create a vector containing the numbers 1 to 100. Then, find the sum of the numbers that are divisible by 3 or 5.

Tip: Use the %% operator to find the remainder of a division.

# answer:
numbers <- 1:100

# find numbers that are divisible by 3 or 5:

numbers_3_5 <- numbers[numbers %% 3 == 0 | numbers %% 5 == 0]

sum(numbers_3_5)
## [1] 2418

Exercise 3

You are taking measurements every 5 days throughout the year. Create a number sequence that shows on which days you take measurements and assign it to a variable named “measurement_days”. The result should look like this: 5, 10, 15, 20, …, 365.

# answer:

measurement_days <- seq(5, 365, 5)

# e.g. the 44th measurement day:
measurement_days[44]
## [1] 220

Practice questions for next week

Q1: Create a vector containing 50 random numbers with a normal (Gaussian) distribution, mean 20 and standard deviation 2. You can do this with the rnorm() function. Then assign the numbers to a variable and use that variable as an argument to the sample() function to randomly select 10 samples from that vector. Run ?rnorm and ?sample to see how the functions work and what arguments they take.

dist <- round(rnorm(n=50, mean=20, sd=2))

sample(dist, 10)
##  [1] 21 21 18 22 20 23 20 18 18 18

Q2: Install and load the “LearnBayes” package and take a look at the first few rows of the dataset called “studentdata”.

Answer the following questions:

2.1. Remove rows that include NA observations.

#install.packages("LearnBayes")
library(LearnBayes)
df <- studentdata

dim(df)
## [1] 657  11
sum(is.na(df))
## [1] 118
df <- na.omit(df)

dim(df)
## [1] 559  11
sum(is.na(df))
## [1] 0

2.2. Get the number of female students.

# alt 1:
gender <- df$Gender
length(gender[gender == "female"])
## [1] 364
# alt 2:
length(df$Gender[df$Gender == "female"])
## [1] 364
# alt 3:
#install.packages("tidyverse")
library(tidyverse)
nrow(filter(df, Gender == "female"))
## [1] 364
# Find the percentage of female students
nrow(filter(df, Gender == "female")) / nrow(df)*100
## [1] 65.11628

2.3. Number of students who are taller than 180 cm (tip: the height is given in inches; first convert it to cm by multiplying the observations by 2.54).

# You can use dplyr's mutate() function to change values in a dataset
library(dplyr)

df <- mutate(df, Height = Height * 2.54)

nrow(filter(df, Height > 180))
## [1] 115

2.4. Plot the relationship between height and sex.

# since Gender is a factor, plot() draws box plots of Height by Gender
plot(df$Gender, df$Height)

Data Manipulation: In-class practice

Now, let’s practice a bit more. Follow these steps:

We will now work with COVID-19 dataset. Let’s import it to our R session.

We can use readr’s read_csv() function to import a spreadsheet in .csv format.

Source: https://github.com/sadettindemirel/Covid19-Turkey

When working with data, it’s important to create a workspace folder that contains both your data and your R script. You can either download the .csv file and import it from that folder, or read it directly from a URL as below. If you import a local file, make sure that you set your working directory to the correct folder by going to Session -> Set Working Directory -> To Source File Location.

df_covid <- read_csv("http://kelesonur.github.io/compec-r/covid_sayi.csv")
head(df_covid)
## # A tibble: 6 × 12
##   tarih      gunluk_test gunluk_vaka gunluk_olum gunluk_iyilesen toplam_test
##   <date>           <dbl>       <dbl>       <dbl>           <dbl>       <dbl>
## 1 2020-03-11           0           1           0               0           0
## 2 2020-03-12           0           0           0               0           0
## 3 2020-03-13           0           4           0               0           0
## 4 2020-03-14           0           1           0               0           0
## 5 2020-03-15           0          12           1               0           0
## 6 2020-03-16           0          29           0               0           0
## # … with 6 more variables: toplam_vaka <dbl>, toplam_olum <dbl>,
## #   toplam_iyilesen <dbl>, yogun_bakim_hasta <dbl>, toplam_intube <dbl>,
## #   agir_hasta <dbl>
  1. Create a histogram (of the daily death toll) and a time-series plot (x: time, y: daily death toll). You can use the hist() and plot() functions respectively. Add axis labels.
# Your turn:
  2. How many cases of COVID were detected and how many people died on July 20, 2020?
# Your turn:
  3. Can you show the total number of COVID deaths and patients in intensive care between 2020-03-11 and 2020-07-20 in a timeline graph? Tip: You can use indexing to filter the dates and save them as a new data frame (see the sketch after this list).
# Your turn:
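
For question 3, here is one possible sketch of the indexing approach, using the column names shown in head(df_covid) above (tarih, toplam_olum, yogun_bakim_hasta):

# filter the rows between the two dates and save them as a new data frame
df_early <- df_covid[df_covid$tarih >= as.Date("2020-03-11") &
                     df_covid$tarih <= as.Date("2020-07-20"), ]

# plot total deaths over time and add ICU patients as a second line
plot(df_early$tarih, df_early$toplam_olum, type = "l",
     xlab = "Date", ylab = "Count", main = "Total deaths and ICU patients")
lines(df_early$tarih, df_early$yogun_bakim_hasta, col = "red")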

Data Manipulation with Tidyverse

So far, we have examined data frames using Base R, that is, the functions and operators native to R. However, today most data scientists using R do not process data with these, but with the more modern “Tidyverse” packages.

These packages make organizing data much easier and more practical (Wickham, 2017).

Let’s look at some important packages within Tidyverse:

  1. tibble: as_tibble()
  2. readr: read_csv()
  3. dplyr: select(), filter(), mutate(), summarize() and more…
  4. magrittr: %>% operator
  5. ggplot2: ggplot()
library(tidyverse)

Tibble

Tibble is essentially a more modern version of the data frame in R. Here is where tibbles are superior to data frames: there is no need to use the head() function, because printing a tibble automatically shows only the first 10 rows; there is no need to use nrow() and ncol(), because the dimensions are printed with the table; and the printout tells us the type of vector in each column (character, integer, etc.).

df <- starwars
df <- as_tibble(df) # as.tibble() is deprecated; use as_tibble()
head(df)
## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Data manipulation with dplyr

dplyr is the most practical package for editing data frames or tibbles. Let’s see what we can do with the functions included in it. For this we will use a dataset from a package called gapminder:

country      continent  year  lifeExp       pop  gdpPercap
Afghanistan  Asia       1952   28.801   8425333   779.4453
Afghanistan  Asia       1957   30.332   9240934   820.8530
Afghanistan  Asia       1962   31.997  10267083   853.1007
Afghanistan  Asia       1967   34.020  11537966   836.1971
Afghanistan  Asia       1972   36.088  13079460   739.9811
Afghanistan  Asia       1977   38.438  14880372   786.1134

Let’s load the data:

library(gapminder)
  
df <- gapminder

filter() function

We can easily filter the data we want by using this function:

filter(df,country == "Turkey") 

filter(df, year > 2002) 

filter(df, year > 2002 & country == "Germany")

How can you filter for multiple countries? Any guesses?
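
One answer: the %in% operator lets you match against a vector of values (we will use it again later for the bar plots):

filter(df, country %in% c("Turkey", "Germany"))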

select() function:

select() function allows us to select only the columns we want:

select(df, country, pop, year) # take only the country, pop, and year columns

select(df, -continent) # take all columns except continent

select(df, pop:gdpPercap) # take all columns from pop to gdpPercap

rename() function:

We can change the names of the columns we want with this. If we want this to be saved, we need to overwrite the variable:

library(dplyr)
df <- rename(df, Country = country, CONTINENT = continent, YEAR = year, LifeExpectancy = lifeExp)

head(df)

# Extra: clean the column names
library(janitor)
df_cleaned <- clean_names(df)

mutate() function:

We can make changes to a tibble with the mutate() function.

For example, let’s create a new column that expresses the population in millions:

df_mutated <- mutate(df_cleaned, pop_million = pop / 1000000)

head(df_mutated)
## # A tibble: 6 × 7
##   country     continent  year life_expectancy      pop gdp_percap pop_million
##   <fct>       <fct>     <int>           <dbl>    <int>      <dbl>       <dbl>
## 1 Afghanistan Asia       1952            28.8  8425333       779.        8.43
## 2 Afghanistan Asia       1957            30.3  9240934       821.        9.24
## 3 Afghanistan Asia       1962            32.0 10267083       853.       10.3 
## 4 Afghanistan Asia       1967            34.0 11537966       836.       11.5 
## 5 Afghanistan Asia       1972            36.1 13079460       740.       13.1 
## 6 Afghanistan Asia       1977            38.4 14880372       786.       14.9

arrange() function:

# sort in ascending order

arrange(df_mutated, continent)

arrange(df_mutated, gdp_percap)

# sort in descending order

arrange(df_mutated, desc(year))

Pipe with the magrittr package

The pipe operator is represented by the symbol %>%. This operator takes the value on its left and passes it as the first argument of the function on its right. We will prefer it to nested, hard-to-read function calls, because it lets us follow a linear order:

filter(df_mutated, country == "Turkey") # previous method

df_mutated %>% filter(country == "Turkey") # with pipe

select(filter(df_mutated, country == "Turkey" & year == "2007"), pop) # previous method

df_mutated %>% filter(country == "Turkey" & year == "2007") %>% select(pop) # pipe ile

Visualization with the ggplot2 package

Scatterplots

You can use geom_point().

library(ggplot2)

df_mutated %>% filter(year == 2007) %>%
ggplot(aes(x=gdp_percap, y=life_expectancy)) +
 geom_point() 

Or you can try using geom_jitter() or geom_violin():

df_mutated %>% filter(year == 2007) %>%
ggplot(aes(x=gdp_percap, y=life_expectancy)) +
 geom_violin() 

You can use text labels.

p1 <- df_mutated %>% filter(year == 2007) %>%
  ggplot(aes(x=gdp_percap, y=life_expectancy, label = country)) +
  geom_point() +
  geom_text()

p1 + ggtitle("Plot title") + xlab("GDP") + ylab("Life Exp")

# save it
# ggsave('plot1.png', width = 8, height = 6)

Using the color option:

df_mutated %>% filter(year == 2007) %>%
  ggplot(aes(x=gdp_percap, y=life_expectancy, label = country, color = continent)) +
  geom_point() +
  geom_text()

Add new facets:

 df_mutated %>% filter(year == 2007) %>%
  ggplot(aes(x=gdp_percap, y=life_expectancy, label = country, color = continent)) +
  geom_point() +
  geom_text() + 
  facet_wrap(~continent)

Bar plots

df_mutated %>% filter(country %in% c("Turkey","Brazil","Thailand","Nigeria","New Zealand")) %>%
  ggplot(aes(x = reorder(continent,gdp_percap), y = gdp_percap, fill = continent)) +
  geom_bar(stat="identity")

In-class Exercises

  1. What years of information are included in the Gapminder data? Tip: You can find this out with the unique() command.
# Your answer:
  2. What are the mean and median of life expectancy in 1962 and 2002? Tip: You can do this with the filter() and summarize() functions.
# Your answer:
  3. Can you create a scatterplot of income (x-axis) and life expectancy (y-axis) for the European continent in 1962, 1982 and 2002 using ggplot() and facet_wrap()? Country names must appear as text. Also color by continent. Tip: after filtering the data by these years you will need to call facet_wrap(~year).
# Your answer:
  4. Can you create a line plot and show Turkey’s population growth? Can you name the plot and axes? Tip: you will need to use the geom_line() function to create the line chart.
# Your answer:

Week 3 (Dec 8, 2023)

Answers to Exercise Questions

  1. What years of information are included in the Gapminder data? Tip: You can find this out with the unique() command.
# Your answer:
df <- gapminder
unique(df$year)
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
  2. What are the mean and median of life expectancy in 1962 and 2002? Tip: You can do this with the filter() and summarize() functions.
# Your answer:
df %>% 
  filter(year == 1962 | year == 2002) %>% 
  group_by(year) %>%
  summarise(mean(lifeExp), median(lifeExp))
## # A tibble: 2 × 3
##    year `mean(lifeExp)` `median(lifeExp)`
##   <int>           <dbl>             <dbl>
## 1  1962            53.6              50.9
## 2  2002            65.7              70.8
  3. Can you create a scatterplot of income (x-axis) and life expectancy (y-axis) for the European continent in 1962, 1982 and 2002 using ggplot() and facet_wrap()? Country names must appear as text. Tip: after filtering the data by these years you will need to call facet_wrap(~year).
# Your answer:
library(ggrepel)

df %>% 
  filter(year == 1962 | year == 1982 | year == 2002) %>% 
  filter(continent == "Europe") %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, label = country)) +
  geom_text_repel(color="red") +
  facet_wrap(~year) +
  labs(x = "GDP per capita", y = "Life expectancy", title = "Life expectancy vs. GDP per capita") +
  scale_x_continuous(labels = scales::comma) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  4. Can you create a line plot and show Turkey’s population growth? Can you name the plot and axes? Tip: you will need to use the geom_line() function to create the line chart.
# Your answer:
df %>% 
  filter(country == "Turkey") %>% 
  ggplot(aes(x = year, y = pop)) +
  geom_line() +
  labs(x = "Year", y = "Population", title = "Population growth in Turkey") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  # no scientific notation
  scale_y_continuous(labels = scales::comma)

Creating summary tables and performing simple data analysis with dplyr

We will now work with the “Prestige” dataset from the car package. The data look like the following:

                     education  income  women  prestige  census  type
gov.administrators       13.11   12351  11.16      68.8    1113  prof
general.managers         12.26   25879   4.02      69.1    1130  prof
accountants              12.77    9271  15.70      63.4    1171  prof
purchasing.officers      11.42    8865   9.11      56.8    1175  prof
chemists                 14.62    8403  11.68      73.5    2111  prof
physicists               15.64   11030   5.13      77.6    2113  prof

Let’s import it first:

#install.packages('car')
library(car)

df <- Prestige
head(df)

group_by() and summarize()

We can extract summary tables using these functions from the dplyr package. group_by() defines the groups, and summarize() performs the calculations we want within each group:

df %>%
  group_by(type) %>%
  summarize(mean_prestige = mean(prestige),
            mean_income = mean(income)) %>%
  arrange(mean_income)
## # A tibble: 4 × 3
##   type  mean_prestige mean_income
##   <fct>         <dbl>       <dbl>
## 1 <NA>           34.7       3344.
## 2 wc             42.2       5052.
## 3 bc             35.5       5374.
## 4 prof           67.8      10559.

Let’s omit the NA values:

df <- na.omit(df)

Find out the percentage of female workers according to the types of professions:

df %>%
  group_by(type) %>%
  summarize(woman_perc = mean(women))
## # A tibble: 3 × 2
##   type  woman_perc
##   <fct>      <dbl>
## 1 bc          19.0
## 2 prof        25.5
## 3 wc          52.8

We want to see the relationship between education and income. Let’s create a scatter plot for this:

df %>% 
  ggplot(aes(x = education, y = income)) +
  geom_point() +
  labs(x = "Education", y = "Income", title = "Income vs. Education") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  # no scientific notation
  scale_y_continuous(labels = scales::comma)

Linear Regression

We noticed that there is a relationship between education and income, but can we confirm this with statistics? Why do we need statistics?

If we want to go beyond our data and find out whether the results generalize beyond our sample, we need a model. With a linear regression analysis, we will create a line that explains our data best. We will also be able to predict values that are not observed in our data set. Let’s first understand what linear regression is:

In the context of simple linear regression, the value of the dependent variable (the outcome variable) is determined by a linear function of the predictor variable, expressed as:

y = α + βx

Let’s delve into the interpretation of these components:

Y (dependent variable) = α (intercept) + β (slope) × X (predictor)

In this equation:

  • Y represents the dependent variable,

  • α is the intercept, an additive term,

  • β is the slope, the coefficient that multiplies the predictor,

  • X is the predictor variable.

Mathematically, a line is characterized by an intercept and a slope. The slope (β) signifies the change in y for a one-unit change in x.

In simpler terms, the slope represents the rate of change of the dependent variable y with respect to changes in the predictor variable x.

Let’s create a linear regression model for our data:

model <- lm(income ~ education, data = df)
summary(model)
## 
## Call:
## lm(formula = income ~ education, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5524  -2400   -186   1398  17647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2593.4     1431.4  -1.812   0.0731 .  
## education      883.0      128.5   6.870  6.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3480 on 96 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.3226 
## F-statistic:  47.2 on 1 and 96 DF,  p-value: 6.404e-10

The first thing we need to look at is the coefficient. The coefficient is the slope of the line. In our case, the coefficient for education is 883.0. This means that for every one-unit increase in education, income increases by about 883.

Then we can look at the p-value. The p-value is the probability of observing data at least this extreme under the assumption that education has no effect on income. In our case, the p-value is 6.4e-10, a very small number close to 0, so we can reject that assumption. In other words, we can say that there is a statistically significant relationship between education and income.

Finally, let’s examine the R-squared value. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (or the coefficient of multiple determination for multiple regression). The definition of R-squared is fairly straightforward: it is the percentage of the variation in the response variable that is explained by the linear model.

In our case, the R-squared value is 0.3296. This means that about 33% of the variation in income is explained by education. This is a good value for the social sciences.
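
As a quick check: in simple linear regression, R-squared equals the squared correlation between the predictor and the response, so you can verify it yourself:

# squared Pearson correlation; this should match the Multiple R-squared above
cor(df$education, df$income)^2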

Let’s plot the regression line:

df %>% 
  ggplot(aes(x = education, y = income)) +
  geom_point() +
  labs(x = "Education", y = "Income", title = "Income vs. Education") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  # no scientific notation
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE)

We can also predict future values with our model with the predict() function.

# predict income for 10 years of education
new_data <- data.frame(education = 10)
predict(model, new_data)
##        1 
## 6236.767
# predict income for 20 years of education
new_data <- data.frame(education = 20)
predict(model, new_data)
##        1 
## 15066.95
# predict income for 30 years of education
new_data <- data.frame(education = 30)
predict(model, new_data)
##        1 
## 23897.14

Time Series Data Analysis

Now let’s work with time series data. We will use the AirPassengers dataset in R. This dataset contains the monthly number of international airline passengers between 1949 and 1960. Let’s import it first:

df <- AirPassengers
head(df)
## [1] 112 118 132 129 121 135

A time series object is a special type of object in R. AirPassengers already comes in this format, but let’s see how to create one explicitly with the ts() function. We need to specify the frequency of the data (12 for monthly data) and the start and end dates.

# convert the data to a time series object
df <- ts(df, frequency = 12, start = c(1949, 1), end = c(1960, 12))
head(df)
## [1] 112 118 132 129 121 135
# plot the data with axis names:
plot(df, main = "Air Passengers", xlab = "Year", ylab = "Number of Passengers")

We can see that there is an increasing trend in the data. We can also see that there is a seasonal component. We can decompose the data into trend, seasonal, and random components with the decompose() function.

Let’s explain the components: Trend is the long-term progression of the series. Seasonality is a short-term cycle that occurs regularly. Random is the residual variation that cannot be explained by the trend or the seasonal components.

df_decomposed <- decompose(df)


# plot the decomposed data
plot(df_decomposed)

We can use linear regression to model the trend component. We will use the lm() function for this. We will use the time() function to create a variable for time.

# turn our ts object into a data frame
df_time_series <- data.frame(Y=as.matrix(df), Time = time(df))

Model the trend component:

model <- lm(Y ~ Time, data = df_time_series)
# summary(model)

# plot the regression line:
df_time_series %>% 
  ggplot(aes(x = Time, y = Y)) +
  geom_point() +
  labs(x = "Time", y = "Number of Passengers", title = "Air Passengers") +
  geom_smooth(method = "lm", se = FALSE)

Now let’s predict the number of passengers for the years 1970 to 1980:

new_data <- data.frame(Time = 1970:1980)
predict(model, new_data)
##         1         2         3         4         5         6         7         8 
##  759.9203  791.8065  823.6927  855.5789  887.4651  919.3513  951.2375  983.1238 
##         9        10        11 
## 1015.0100 1046.8962 1078.7824

In-class Exercises

1.1. Work with the mtcars dataset. Use group_by() and summarize() to find the average miles per gallon for each number of cylinders.

1.2. Create a linear regression model. Predict the miles per gallon for a car with 6 cylinders and a weight of 3000 lbs.

1.3. Plot the regression line.


2.1. Work with another time series dataset called BJsales in R. Sales, 150 months; taken from Box and Jenkins (1970). Visualize the time series data.

2.2. Use linear regression to model the trend component. Predict the number of sales for the next 10 months

2.3. Plot the regression line.

Week 4 (Dec 15, 2023)

Answers to Exercise Questions

In-class Exercises

1.1. Work with the mtcars dataset. Use group_by() and summarize() to find the average miles per gallon for each number of cylinders.

library(dplyr)

avg_mpg_per_cyl <- mtcars %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg))

avg_mpg_per_cyl
## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1

1.2. Create a linear regression model. Predict the miles per gallon for a car with 6 cylinders and a weight of 3000 lbs.

model <- lm(mpg ~ cyl + wt, data = mtcars)
predicted_mpg <- predict(model, newdata = data.frame(cyl = 6, wt = 3000/1000)) # weight converted to 1000s lbs
predicted_mpg
##        1 
## 21.06658

1.3. Plot the regression line.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_point(aes(x = 3, y = predicted_mpg), color = "red", size = 3) +
  labs(title = "Regression Line: MPG vs. Weight",
       x = "Weight (1000 lbs)", y = "Miles per Gallon")


2.1. Work with another time series dataset called BJsales in R. Sales, 150 months; taken from Box and Jenkins (1970). Visualize the time series data.

plot(BJsales, main = "BJ Sales Time Series", ylab = "Sales", xlab = "Months")

2.2. Use linear regression to model the trend component. Predict the number of sales for the next 10 months

df_time_series <- data.frame(Y=as.matrix(BJsales), Time = time(BJsales))
model_BJ <- lm(Y ~ Time, data = df_time_series)
future_time <- data.frame(Time = c(151:160))
predicted_sales <- predict(model_BJ, newdata = future_time)

2.3. Plot the regression line.

df_time_series %>% 
ggplot(aes(Time, Y)) +
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  # add future predictions
  geom_point(data = data.frame(Time = future_time$Time, Y = predicted_sales), color = "red")

Twitter Analysis with R

Why Perform Data Mining on Twitter?

Twitter, with its 217 million active users, sees an average of 500 million tweets per day. This high volume makes social media platforms like Twitter a rich source of user-generated textual data. By processing and interpreting these tweets, we can gain insights into people’s preferences, sentiments, and trends on various topics.

Twitter mining can be utilized for analyzing advertising campaigns, studying customer behavior, predicting election outcomes, and even in academic research.

Why Choose R over Other Platforms (e.g., Sprout Social, Talkwalker)?

  • R is free, while other services are not.
  • R offers wider opportunities and customization for analysis.
  • You can analyze any user or group of users, not just your own account.

Source: A Guide to Mining and Analysing Tweets with R

Installing Necessary Packages

#install.packages("rtweet") 
#install.packages("wordcloud")
#install.packages("stopwords")
#install.packages("syuzhet")
#install.packages("xlsx")

Loading Packages

library(stringr)    # for str_replace_all function (cleaning tweets)
library(dplyr)      # for data frame manipulation functions like anti_join, inner_join, count, ungroup
library(magrittr)   # for pipe operator %>%
library(ggplot2)    # for data visualization
library(readr)      # for reading data frames
library(rtweet)     # for fetching tweets
library(wordcloud)  # for creating word clouds
library(stopwords)  # for a package of stop words
#library(syuzhet)    # for sentiment analysis in English
library(xlsx)       # for Excel

Twitter Settings

To analyze Twitter data, you need access to the Twitter API. This requires an application for a Twitter Developer account. Once approved, you will be provided with personal keys for access.

Twitter API Settings

Note: The following code will not work as the keys are placeholders. Replace them with your personal keys.

token <- create_token(
  app = "mytwitterapp",
  consumer_key = "your_consumer_key",
  consumer_secret = "your_consumer_secret",
  access_token = "your_access_token",
  access_secret = "your_access_secret",
  set_renv = TRUE
)

Timeline Analysis

Let’s fetch tweets from a user’s timeline and create a word cloud based on the most frequently used words.

Fetching Tweets from an Account

my_tweets <- get_timeline("username", n = 20, include_rts = FALSE)

Let’s look at the column names:

colnames(my_tweets)

You can download the Tweets data I scraped with this link: http://kelesonur.github.io/compec-r/tweets_ince.csv

Cleaning the Tweets

Before analyzing the content, it’s essential to clean the tweets to remove unwanted characters, URLs, mentions, hashtags, and convert them to lower case for uniformity.

clean_tweets <- function(x) {
  x %>%
    str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    str_replace_all("&amp;", "and") %>%
    str_replace("RT @[a-z,A-Z]*: ","") %>%
    str_remove_all("[[:punct:]]") %>%
    str_replace_all("@[a-z,A-Z]*","") %>%
    str_replace_all("#[a-z,A-Z]*","") %>%
    str_remove_all("^RT:? ") %>%
    str_remove_all("@[[:alnum:]]+") %>%
    str_remove_all("#[[:alnum:]]+") %>%
    str_replace_all("\\\n", " ") %>%
    str_to_lower() %>%
    str_trim("both")
}

clean_tweet = gsub("&amp", "", my_tweets$text)
clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)
clean_tweet = gsub("@\\w+", "", clean_tweet)
clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
clean_tweet = gsub("http\\w+", "", clean_tweet)
clean_tweet = gsub("[ \t]{2,}", "", clean_tweet)
clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)

my_tweets$text_clean <- clean_tweet %>% clean_tweets

Creating a Stop Words List for Turkish

stop_turkish <- data.frame(word = stopwords::stopwords("tr", source = "stopwords-iso"), stringsAsFactors = FALSE)

head(stop_turkish)

Tokenizing and Cleaning

library(tidytext)

tweets_clean <- my_tweets %>% 
  select(text_clean) %>% 
  unnest_tokens(word, text_clean) %>% 
  anti_join(stop_words) %>% 
  anti_join(stop_turkish)

tweets_clean %<>% rename(Word = word)

Counting Words

words <- tweets_clean %>%
  count(Word, sort = TRUE) %>%
  ungroup()

head(words)

Creating a Word Cloud

library(wordcloud) 
library(RColorBrewer)

wordcloud(words = words$Word, freq = words$n, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Hashtag and Timeline Tweet Analysis

Analyzing Tweets with a Specific Hashtag

tesla_tweets <- search_tweets("#Tesla", lang = "tr", n = 30, include_rts = FALSE) 

head(tesla_tweets)

Timeline Analysis

ts_plot(tesla_tweets, "hours", trim = 0L)

Sentiment Analysis in Turkish

Sentiment analysis, also known as opinion mining, is a method used in natural language processing to identify and categorize opinions expressed in a text. The goal is to determine the writer’s or speaker’s attitude towards a particular topic, product, or service as positive, negative, or neutral. This technique is widely used to analyze customer feedback, social media conversations, and product reviews, helping businesses and organizations gauge public opinion, monitor brand and product sentiment, and understand customer needs and concerns. Advanced sentiment analysis may also capture emotional nuances and intensity, providing deeper insights into the underlying sentiments.

Import the lexicon

To do a sentiment analysis in Turkish, we need a lexicon. You can download the lexicon with this link: http://kelesonur.github.io/compec-r/Turkish-tr-NRC-VAD-Lexicon.txt

Lexicon <- read_delim(file = "Turkish-tr-NRC-VAD-Lexicon.txt", "\t", 
                      locale = locale(date_names = "tr", encoding = "UTF-8"))

Examine the lexicon

head(Lexicon, 10)

Prepare the Turkish lexicon

Lexicon %<>% rename(turkishword = "Turkish-tr")
TR_Lexicon <- Lexicon %>% select(-Word)
TR_Lexicon %<>% rename(Word = turkishword)
TR_Lexicon = TR_Lexicon[!duplicated(TR_Lexicon$Word),]

Get the words and calculate sentiment

sentiment <- tweets_clean %>%
  inner_join(TR_Lexicon) %>%
  count(Word, Arousal, Valence, Dominance, sort = TRUE) %>%
  ungroup()

head(sentiment)

Arousal, dominance, and valence are three dimensions often used in psychology to describe and measure emotions:

  1. Arousal: This dimension refers to the level of alertness or stimulation an emotion provokes. It ranges from calm or low arousal (e.g., relaxed, bored) to excited or high arousal (e.g., angry, ecstatic). It represents the intensity of the emotion but not its nature (positive or negative).

  2. Dominance: This dimension pertains to the sense of control or power associated with an emotion. Low dominance indicates feelings of being controlled or submissive (e.g., scared, anxious), while high dominance involves feeling in control or empowered (e.g., authoritative, independent). It reflects the degree of control a person feels they have in a particular emotional state.

  3. Valence: Valence is about the intrinsic attractiveness (positive valence) or averseness (negative valence) of an emotion. Simply put, it describes the pleasantness or unpleasantness of an emotion. Happiness, joy, and love are examples of emotions with positive valence, whereas sadness, anger, and fear are associated with negative valence.

These three dimensions are used in models like the PAD (Pleasure-Arousal-Dominance) emotional state model, which is applied in various fields including psychology, affective computing, and even marketing research, to understand and predict human emotions and behaviors.
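
As a quick illustration (a sketch, assuming the sentiment tibble built above with its Arousal, Valence, and Dominance columns), we can summarize the average level of each dimension:

sentiment %>%
  summarize(mean_arousal = mean(Arousal),
            mean_valence = mean(Valence),
            mean_dominance = mean(Dominance))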

Visualize arousal and valence

p_boxplot <- boxplot(sentiment$Arousal, sentiment$Valence, sentiment$Dominance,
                     main = "Sentiment Analysis",
                     names = c("Arousal", "Valence", "Dominance"))

# boxplot() also returns the underlying summary statistics; inspect them with View(p_boxplot)

Most positive and negative words used by the person

positive <- sentiment %>%
  subset(Valence > 0.90) %>% 
  group_by(Word) %>%
  top_n(10) %>% 
  ungroup() %>%
  mutate(Word = reorder(Word, Valence)) %>%
  ggplot(aes(Word, Valence)) +
  geom_point(show.legend = FALSE) +
  labs(title = "Most Positive Words",
       y = "Valence",
       x = NULL) +
  coord_flip()

positive

# Negativity

negative <- sentiment %>%
  subset(Valence < 0.10) %>% 
  group_by(Word) %>%
  top_n(10) %>% 
  ungroup() %>%
  mutate(Word = reorder(Word, Valence)) %>%
  ggplot(aes(Word, Valence)) +
  geom_point(show.legend = FALSE) +
  labs(title = "Most negative words",
       y = "Valence",
       x = NULL) +
  coord_flip()
negative

Use get_nrc_sentiment() for English sentiment analysis

library(syuzhet)    # provides get_nrc_sentiment()

# get_nrc_sentiment() expects a character vector, so pass the Word column
# after stripping non-ASCII characters
words_ascii <- iconv(words$Word, from = "UTF-8", to = "ASCII", sub = "")

ew_sentiment <- get_nrc_sentiment(words_ascii)

sentimentscores <- data.frame(colSums(ew_sentiment))

names(sentimentscores) <- "Score"

sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)

rownames(sentimentscores) <- NULL

ggplot(data = sentimentscores, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  xlab("Sentiments") + ylab("Scores") +
  ggtitle("Total sentiment based on scores") +
  theme_minimal() +
  theme(legend.position = "none") # set after theme_minimal(), which would otherwise reset it

In-class Exercises

We fetched Muharrem İnce’s tweets for you and shared them as a .csv file earlier. Do the following tasks in groups using this csv:

  1. Sentiment Analysis

  2. Tweet Series Analysis

  3. Word Cloud

RMarkdown

RMarkdown Tutorial

Introduction to RMarkdown

RMarkdown is a powerful tool for creating dynamic documents, presentations, and reports that combine R code with written narratives. It allows you to embed R code within Markdown documents, which is particularly useful for data analysis, academic research, and reproducible reporting.

Getting Started with RMarkdown

To use RMarkdown, you need to have R and RStudio installed. RStudio is an integrated development environment (IDE) for R that provides a convenient interface for working with RMarkdown. You have already completed this step!

Installing R and RStudio

  1. Install R: Download and install R from CRAN.

  2. Install RStudio: Download and install RStudio from RStudio’s website.

Installing the RMarkdown Package

Open RStudio and install the rmarkdown package by running:

# install.packages("rmarkdown")

Creating Your First RMarkdown Document

  1. Create a New RMarkdown Document: In RStudio, go to File > New File > R Markdown.... You’ll be prompted to create a new document.

  2. Choose Document Type: Select the type of document you want to create (e.g., HTML, PDF, Word). Click OK.

  3. RMarkdown Structure: An RMarkdown file opens with some default content. It contains:

    • YAML Header: At the top, enclosed within ---, where you specify document settings like title, author, and output format.

    • Markdown Text: For writing narrative text. Markdown is a simple formatting syntax.

    • Code Chunks: Enclosed in ```{r} and ```, where you write R code.

Example Document

---
title: "RMarkdown Example"
author: "Your Name"
date: "Today's Date"
output: html_document
---

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.
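
A code chunk embedded in the document might look like this (a minimal example using the built-in cars dataset):

```{r}
summary(cars)
```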

Compiling the Document

To compile the document into your chosen format (HTML, PDF, or Word), click the `Knit` button in RStudio. This will execute the R code within the document and combine the results with the narrative text.
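
If you prefer working from the console, you can produce the same result with rmarkdown::render() (assuming your file is named report.Rmd):

rmarkdown::render("report.Rmd")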

To conclude, RMarkdown is a versatile tool for combining code, data analysis, and narrative in a single document. It’s highly useful for reproducible research and reporting.

Let’s now do some hands-on tutorial on RMarkdown!

In-class Exercises:

Write an RMarkdown report (PDF, Word, or HTML document) that includes your analysis of Muharrem İnce’s tweets.