# Dataset Preparation
if (!require("gapminder", quietly = TRUE))
install.packages("gapminder")
library(gapminder)
aust <- gapminder[gapminder$country == "Australia",]
head(aust)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 69.1 8691212 10040.
## 2 Australia Oceania 1957 70.3 9712569 10950.
## 3 Australia Oceania 1962 70.9 10794968 12217.
## 4 Australia Oceania 1967 71.1 11872264 14526.
## 5 Australia Oceania 1972 71.9 13177000 16789.
## 6 Australia Oceania 1977 73.5 14074100 18334.
dim(aust)
## [1] 12 6Exploring Data Frames
1. Adding Columns and Rows
We will still use the gapminder dataset as demo dataste.
1.1 Individually Adding Columns
You can add a new column by simply assigning values to a name that doesn’t exist yet using the $ operator or bracket [ , ] notation.
Crucial Rule: The new column must have the exact same number of rows as the existing data frame.
Method A: Using the $ operator
# Method A: Using the $ operator
# Adding a random column for "mean_children"
mean_children <- sample(1:10, nrow(aust), replace = TRUE)
aust$mean_children <- mean_children
# the same as: aust$mean_children <- sample(1:10, nrow(aust), replace = TRUE)
# Adding a calculated column (GDP = population * GDP per capita)
aust$GDP <- aust$pop * aust$gdpPercap
head(aust)# A tibble: 6 × 8
country continent year lifeExp pop gdpPercap mean_children GDP
<fct> <fct> <int> <dbl> <int> <dbl> <int> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040. 1 8.73e10
2 Australia Oceania 1957 70.3 9712569 10950. 9 1.06e11
3 Australia Oceania 1962 70.9 10794968 12217. 6 1.32e11
4 Australia Oceania 1967 71.1 11872264 14526. 6 1.72e11
5 Australia Oceania 1972 71.9 13177000 16789. 2 2.21e11
6 Australia Oceania 1977 73.5 14074100 18334. 2 2.58e11
Understanding sample():
1:10: The range of numbers to choose from.nrow(aust): Tells R exactly how many numbers to generate (12 in this case).replace = TRUE: Allows the same number to be picked more than once.
Method B: Using bracket notation
# Method B: Using bracket notation
mean_bikes <- sample(1:4, nrow(aust), replace = TRUE)
aust[, "mean_bikes"] <- mean_bikes
str(aust)tibble [12 × 9] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 5 5 5 5 5 5 5 5 5 5 ...
$ year : int [1:12] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:12] 69.1 70.3 70.9 71.1 71.9 ...
$ pop : int [1:12] 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243 ...
$ gdpPercap : num [1:12] 10040 10950 12217 14526 16789 ...
$ mean_children: int [1:12] 1 9 6 6 2 2 2 3 4 5 ...
$ GDP : num [1:12] 8.73e+10 1.06e+11 1.32e+11 1.72e+11 2.21e+11 ...
$ mean_bikes : int [1:12] 2 4 1 1 2 4 2 2 2 2 ...
What happens if the lengths don’t match?
If you try to “force” a vector into a table that is a different height, R will protect the integrity of your data and throw an error.
mean_bikes <- sample(1:4, 5, replace = TRUE)
aust[, "mean_bikes"] <- mean_bikesError in `[<-`:
! Assigned data `mean_bikes` must be compatible with existing data.
✖ Existing data has 12 rows.
✖ Assigned data has 5 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 5 to size 12.
Why did this fail? R uses a concept called Recycling.
If you provide 1 value: R repeats it for every row (this is allowed).
If you provide 12 values: It’s a perfect fit (this is allowed).
If you provide 5 values: R doesn’t know what to do with the remaining 7 rows, so it stops the code to prevent errors in your analysis.
Best Practice: Use nrow()
Always use nrow(your_dataframe) instead of typing a specific number (like 12). If you later filter your data or add more patients, your code will automatically adjust the length of your new columns!
1.2 Combining Data Frames by Columns
Method A: Using the cbind()
If you have a whole table of new data, you can “staple” it to the side of your original data using cbind() (column bind). Unlike adding columns one by one, cbind() allows you to merge two entire data frames into one.
aust <- gapminder[gapminder$country == "Australia", ]
# Create a separate data frame with two new variables
df <- data.frame(mean_children = sample(1:10, nrow(aust), replace = TRUE),
mean_bikes = sample(1:4, nrow(aust), replace = TRUE))
head(df)
## mean_children mean_bikes
## 1 9 4
## 2 6 1
## 3 10 2
## 4 7 3
## 5 6 1
## 6 3 1
aust <- cbind(aust,df)
head(aust)
## country continent year lifeExp pop gdpPercap mean_children mean_bikes
## 1 Australia Oceania 1952 69.12 8691212 10039.60 9 4
## 2 Australia Oceania 1957 70.33 9712569 10949.65 6 1
## 3 Australia Oceania 1962 70.93 10794968 12217.23 10 2
## 4 Australia Oceania 1967 71.10 11872264 14526.12 7 3
## 5 Australia Oceania 1972 71.93 13177000 16788.63 6 1
## 6 Australia Oceania 1977 73.49 14074100 18334.20 3 1⚠️ A Critical Warning: The Row Order
While cbind() is powerful, it is also “blind.”
It assumes that Row 1 of your first table belongs with Row 1 of your second table.
In medical research, this is very dangerous if your tables aren’t sorted perfectly. If your first table has patients in the order P001, P002, P003 and your second table has them as P003, P001, P002, cbind() will mismatch the data!
Best Practice: Use merge() or left_join()
If you aren’t 100% sure the rows are in the same order, don’t use cbind(). Instead, use merge() (Base R) or left_join()(dplyr). These functions look at an ID column (like a patient ID) and automatically align the rows for you.
Method B: Using the merge()
To avoid mismatching data, we use merge(). This function looks at a specific Key Column (like Patient_ID or country) and aligns the rows automatically, even if they are in a different order.
# 1. Prepare a subset of data
countries <- gapminder[gapminder$country %in% c("Taiwan", "Canada"), ]
# 2. Create a secondary table with a "Key Column" (country)
# Note: This table could be in any order!
df_info <- data.frame(
country = c("Canada", "Taiwan"),
avg_coffee_price = c(4.50, 3.20)
)
# 3. Merge based on the "country" column
# R will find where "Taiwan" is in both tables and match them up
countries_merged <- merge(countries, df_info, by = "country")
head(countries_merged) country continent year lifeExp pop gdpPercap avg_coffee_price
1 Canada Americas 1952 68.75 14785584 11367.16 4.5
2 Canada Americas 1957 69.96 17010154 12489.95 4.5
3 Canada Americas 1962 71.30 18985849 13462.49 4.5
4 Canada Americas 1967 72.13 20819767 16076.59 4.5
5 Canada Americas 1972 72.88 22284500 18970.57 4.5
6 Canada Americas 1977 74.21 23796400 22090.88 4.5
# 1. Prepare a subset of data
patient_info <- data.frame(
patientID = c("P1", "P2", "P3", "P100"),
BW = c(50, 62, 73, 85)
)
# 2. Create a secondary table
patient_BG <- data.frame(
patientID = c("P1", "P100", "P3", "P2"),
BG = c(7.1, 8.3, 9.5, 11)
)
# 3. Merge based on the "country" column
# R will find where "Taiwan" is in both tables and match them up
data_merged <- merge(patient_info, patient_BG, by = "patientID")
head(data_merged) patientID BW BG
1 P1 50 7.1
2 P100 85 8.3
3 P2 62 11.0
4 P3 73 9.5
💡 Why merge() is better than cbind()
Order Doesn’t Matter: In the example above,
df_infolists Canada first, but in thecountriesdataset, the rows might appear in a different order.merge()finds the match regardless of where the row is located.Flexible Lengths: Notice that
countrieshas many years of data for each country, butdf_infoonly has one price per country.merge()is smart enough to “broadcast” (repeat) that one coffee price across all the matching years for that country.Pattern Matching: Instead of “mechanically” pasting columns,
merge()looks for a specific pattern (the country name). If a country in your first table isn’t found in your second table, R can be told exactly how to handle that “missing” data.
1.3 Adding Rows (rbind)
Adding rows is slightly more complex because a row usually contains mixed data types (text, numbers, etc.).
Why use a list for a new row? A data frame is a list of vectors. A single row crosses those vectors, so it must be able to hold different types of data at once. A list is the perfect container for this.
aust <- gapminder[gapminder$country == "Australia",]
# Create a new row as a list
new_row <- list(
"country" = "Australia",
"continent" = "Oceania",
"year" = 2022,
"lifeExp" = mean(aust$lifeExp),
"pop" = mean(aust$pop),
"gdpPercap" = mean(aust$gdpPercap)
)
# Append the row to the bottom
aust <- rbind(aust, new_row)
tail(aust)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Australia Oceania 1987 76.3 16257249 21889.
## 2 Australia Oceania 1992 77.6 17481977 23425.
## 3 Australia Oceania 1997 78.8 18565243 26998.
## 4 Australia Oceania 2002 80.4 19546792 30688.
## 5 Australia Oceania 2007 81.2 20434176 34435.
## 6 Australia Oceania 2022 74.7 14649312. 19981.2. Removing and Filtering Data
2.1 Removing Columns and Rows
We use negative indexing to drop data.
Method A: Removing by Index (Position)
If you know exactly where a column is located, you can drop it using its number.
aust <- gapminder[gapminder$country == "Australia",]
dim(aust)
## [1] 12 6
colnames(aust)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
# 1. Remove the LAST column dynamically
# ncol(aust) finds the last number for you, so you don't have to count!
aust <- aust[, -ncol(aust)]
colnames(aust)
## [1] "country" "continent" "year" "lifeExp" "pop"
# 2. Remove multiple specific columns (e.g., the 1st and 3rd)
aust <- aust[, -c(1, 3)]
colnames(aust)
## [1] "continent" "lifeExp" "pop"
dim(aust)
## [1] 12 3Method B: Removing by Name (The Safer Way)
Using numbers like -3 is risky because if your dataset changes, the 3rd column might become something else! It is safer to remove columns by their actual Name.
You ask R to keep every column except the one using the “Not Equal To” operator (!=).
# Keep everything where the column name is NOT "pop"
colnames(aust)
## [1] "continent" "lifeExp" "pop"
aust <- aust[, colnames(aust) != "pop"]
colnames(aust)
## [1] "continent" "lifeExp"Method C: Removing Rows
The logic for rows is the same: use a negative sign before the row indices.
# Remove the first 10 rows
# Both of these are correct; the first is the most professional standard
aust_reduced <- aust[-c(1:10), ]
## aust_reduced <- aust[-(1:10), ]
dim(aust_reduced)[1] 2 2
2.2 Applying Filters (Logical Subsetting)
Filtering (or subsetting) is how you “query” your data to find specific samples or variables. In R, we use Logical Operators (like >=, ==, or %in%) to tell R exactly what to keep.
Method A: Filtering Columns by Name
If you have a table with 20,000 genes but only want to see two specific ones, you can use the %in% operator.
aust <- gapminder[gapminder$country == "Australia",]
# Keep only the columns "year" and "pop"
# which() identifies the position (index) of the names that match our list
aust_col <- aust[ , which(colnames(aust) %in% c("year", "pop"))]
colnames(aust_col)
## [1] "year" "pop"Why use which()?
You might notice that colnames(aust) %in% c("year", "pop") returns a list of TRUE and FALSE. Wrapping it in which() converts those into actual numbers (the column positions), which is often safer and faster when dealing with very large datasets.
colnames(aust)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
colnames(aust) %in% c("year", "pop")
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
which(colnames(aust) %in% c("year", "pop"))
## [1] 3 5Method B: Filtering Rows by Condition
This is where you ask R to find samples that meet a specific clinical or biological threshold.
aust <- gapminder[gapminder$country == "Australia",]
# 1. Filter by a fixed threshold
# Keep only rows where Life Expectancy is 70 or higher
long_life <- aust[aust$lifeExp >= 70, ]
dim(long_life)
## [1] 11 6
# 2. Filter by a calculated threshold
# Keep only rows where GDP is above average
high_gdp <- aust[aust$gdpPercap >= mean(aust$gdpPercap), ]
dim(high_gdp)
## [1] 5 6💡 Understanding the “Comma” Rule
The most common mistake for beginners is forgetting the comma inside the brackets: [rows, columns].
aust[aust$lifeExp >= 70 ]→ Error! R doesn’t know if this is a row or column instruction.aust[aust$lifeExp >= 70 , ]→ Success! The comma tells R: “Filter the rows based on this math, but keep all columns.”
3. Cleaning Your Data
Raw datasets often contain “noise”—such as duplicate entries from technical replicates or empty rows created during data entry.
3.1 Handling Duplicates
Duplicates can artificially inflate your sample size and skew your statistics. The unique() function scans your data frame and keeps only the first instance of any identical rows.
aust <- gapminder[gapminder$country == "Australia",]
# Create a dataset with duplicates (binding the data to itself)
aust_double <- rbind(aust, aust)
dim(aust_double) # Should show 24 rows
## [1] 24 6
# Remove duplicate rows
aust_unique <- unique(aust_double)
dim(aust_unique) # Back to 12 rows!
## [1] 12 63.2 Handling Missing Values (NA)
In R, missing data is represented by NA (Not Available). Most statistical functions (like mean() or sd()) will fail or return NA if even one value is missing, so clearing them is essential.
A. Identifying NAs: is.na()
You cannot use == NA to find missing data because NA is not a value; it is a placeholder. Instead, we use the logical function is.na().
B. Filtering Out Empty Rows
We use the “Logical NOT” operator (!) to tell R: “Keep only the rows where the column is NOT NA.”
# 1. Let's create an empty row for practice
# rep(NA, ncol(aust)) creates a row of NAs the same width as our table
na_row <- rep(NA, ncol(aust))
aust <- rbind(aust, na_row)
tail(aust) # You will see the last row is all NA
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1987 76.3 16257249 21889.
## 2 Australia Oceania 1992 77.6 17481977 23425.
## 3 Australia Oceania 1997 78.8 18565243 26998.
## 4 Australia Oceania 2002 80.4 19546792 30688.
## 5 Australia Oceania 2007 81.2 20434176 34435.
## 6 <NA> <NA> NA NA NA NA
# 2. Remove rows where 'country' is NA
# We use !is.na() to keep the "Not NA" data
aust <- aust[!is.na(aust$country), ]
tail(aust) # The empty row is gone!
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1982 74.7 15184200 19477.
## 2 Australia Oceania 1987 76.3 16257249 21889.
## 3 Australia Oceania 1992 77.6 17481977 23425.
## 4 Australia Oceania 1997 78.8 18565243 26998.
## 5 Australia Oceania 2002 80.4 19546792 30688.
## 6 Australia Oceania 2007 81.2 20434176 34435.Remove any row that has even one NA anywhere in it: na.omit
If you want to remove any row that has even one NA anywhere in it, you can use the “nuclear option”:
aust_clean <- na.omit(aust)Warning: Be careful with na.omit(). If a patient is missing just one minor piece of information (like a middle name), na.omit() will delete their entire record, including their vital clinical data! It is usually safer to filter by specific, essential columns using the !is.na() method.
3.3 Editing Specific Elements
If you find a typo in your data, you can target a single “cell” using [row, col].
# Add 1 year to the Life Expectancy of the first row
aust[1, "lifeExp"] <- aust[1, "lifeExp"] + 1
aust[1, ]
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 70.1 8691212 10040.
# Specify the Life Expectancy of specific cell
aust[1, "lifeExp"] <- 50
aust[1, ]
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 50 8691212 10040.