Key Fields in the Dataset¶
Here is a brief explanation of the key fields in the dataset to help understand the factors analysed in the study:
ADR (Average Daily Rate): The average revenue generated per day for a hotel room. It reflects the economic aspect of bookings and is measured in monetary units.
Lead Time: The number of days between the booking date and the arrival date. Longer lead times often indicate a higher likelihood of cancellations as plans are more likely to change over time.
Stays in Weekend Nights: The number of weekend nights (Friday and Saturday) included in the booking. This reflects patterns in leisure travel behaviour.
Stays in Week Nights: The number of weekday nights (Monday to Thursday) included in the booking. This provides insight into business or non-leisure travel trends.
Deposit Type: Indicates whether a deposit was made and its type:
- No Deposit: No upfront payment.
- Non-Refundable: Payment made is not refundable.
- Refundable: Payment is refundable under certain conditions.
Market Segment: Categorises how the booking was made:
- Examples include Online Travel Agencies (OTA), Groups, and Corporate bookings.
Is Canceled: The target variable indicating whether a booking was cancelled (
1
) or not (0
).
These fields were analysed to determine their influence on cancellations and included in the predictive models.
We start by loading the necesssary libraries needed for this analysis
# Load necessary libraries
library(tidyverse)
library(readr)
library(naniar)
library(patchwork)
library(ggplot2)
library(ggcorrplot)
library(caret)
library(dplyr)
library(pROC)
library(randomForest)
Access the file and check the dataset¶
We start by accessing the file, checking the structure of the dataset and finding out how many rows and columns are in the dataset. From the results there are 119,390 rows with 32 columns
# Define the file path
file_path <- "C:/Users/sheyi/OneDrive/Documents/rassignment/hotel_bookings.csv"
# Load the dataset without showing column types message
hotel_data <- read_csv(file_path, show_col_types = FALSE)
# Check the structure of the dataset
glimpse(hotel_data)
# Print the number of rows and columns
cat("Dataset Dimensions:", dim(hotel_data), "\n")
Rows: 119,390 Columns: 32 $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort… $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, … $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, … $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201… $ arrival_date_month <chr> "July", "July", "July", "July", "July",… $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,… $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, … $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, … $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB… $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR… $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat… $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat… $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",… $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",… $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit… $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "… $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",… $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ customer_type <chr> "Transient", "Transient", "Transient", … $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,… $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, … $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", … $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20… Dataset Dimensions: 119390 32
Checking for Missing Values¶
To ensure data quality and completeness, I checked for missing values across all variables in the dataset
# Check for missing values
missing_summary <- hotel_data %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "MissingCount")
# Display only variables with missing values
missing_summary_with_na <- missing_summary %>%
filter(MissingCount > 0)
print(missing_summary_with_na)
# Visualise missing values
gg_miss_var(hotel_data) +
labs(title = "Missing Values by Variable")
# A tibble: 1 × 2 Variable MissingCount <chr> <int> 1 children 4
Findings:¶
- Only the
children
variable has missing values, with 4 missing entries.
Visualisation:¶
The plot above highlights the number of missing values by variable. Since the missing data is minimal, it can be handled effectively.
Imputing Missing Values¶
To handle the missing values in the dataset, the children
column was identified as having missing entries. These missing values were imputed using the median value of the column. After imputation, the dataset was verified to ensure there were no remaining missing values.
# Impute missing values in the 'children' column with the median
hotel_data <- hotel_data %>%
mutate(children = ifelse(is.na(children), median(children, na.rm = TRUE), children))
# Verify that there are no missing values remaining
missing_summary_after_imputation <- hotel_data %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "MissingCount")
print(missing_summary_after_imputation)
#visualise if there are any missing variables
gg_miss_var(hotel_data) +
labs(title = "Missing Values by Variable")
# A tibble: 32 × 2 Variable MissingCount <chr> <int> 1 hotel 0 2 is_canceled 0 3 lead_time 0 4 arrival_date_year 0 5 arrival_date_month 0 6 arrival_date_week_number 0 7 arrival_date_day_of_month 0 8 stays_in_weekend_nights 0 9 stays_in_week_nights 0 10 adults 0 # ℹ 22 more rows
Exploring the Distribution of Numerical Variables¶
To better understand the data, histograms were plotted for key numerical variables such as lead_time
, adr
(Average Daily Rate), stays_in_weekend_nights
, and stays_in_week_nights
. These visualisations provide insights into the spread and skewness of the data.
# Define numerical variables
numeric_vars <- c("lead_time", "adr", "stays_in_weekend_nights", "stays_in_week_nights")
# Plot histograms for numerical variables
library(ggplot2)
plots <- lapply(numeric_vars, function(var) {
ggplot(hotel_data, aes_string(x = var)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of", var), x = var, y = "Frequency") +
theme_minimal()
})
# Arrange plots
library(patchwork)
wrap_plots(plots, ncol = 2)
Observations:¶
Lead Time: The distribution is highly right-skewed, with most bookings occurring within a shorter lead time. Some bookings have very high lead times, which might indicate outliers.
ADR (Average Daily Rate): The majority of values are concentrated on the lower end, but there are a few instances of very high rates.
Stays in Weekend Nights: Most bookings involve a few nights, with a sharp drop-off for longer stays.
Stays in Week Nights: Similar to weekend nights, shorter stays are far more common, with very few extended stays.
# Plot cancellation distribution
ggplot(hotel_data, aes(x = factor(is_canceled), fill = factor(is_canceled))) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("steelblue", "orange"), labels = c("Not Canceled", "Canceled")) +
labs(title = "Booking Cancellation Rates", x = "Booking Status", y = "Count", fill = "Cancellation") +
theme_minimal()
Booking Cancellation Rates¶
This visualisation illustrates the distribution of bookings that were either cancelled or not cancelled. It provides an overview of the cancellation rates, allowing us to understand the balance between the two outcomes.
Observation:¶
A significant proportion of bookings were not cancelled compared to those that were cancelled. This imbalance highlights the need to account for cancellation rates in our analysis, as they may impact the prediction model.
# Plot distribution of hotel types
ggplot(hotel_data, aes(x = hotel, fill = hotel)) +
geom_bar(alpha = 0.8) +
labs(title = "Hotel Types Distribution", x = "Hotel Type", y = "Count", fill = "Hotel Type") +
theme_minimal()
Hotel Types Distribution¶
This visualisation shows the distribution of hotel types in the dataset. It highlights that the majority of bookings are for City Hotels compared to Resort Hotels.
Insights:¶
- The dataset is heavily skewed towards City Hotels, which make up a significantly larger proportion of the bookings.
- This imbalance might affect cancellation trends and should be considered during analysis.
# creating total stay variable for visualisation
hotel_data <- hotel_data %>%
mutate(total_stay = stays_in_weekend_nights + stays_in_week_nights)
# Lead Time
ggplot(hotel_data, aes(x = factor(is_canceled), y = lead_time, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Lead Time by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Lead Time",
fill = "Cancellation"
) +
theme_minimal()
Lead Time by Cancellation Status¶
This boxplot illustrates the relationship between lead time (the number of days between booking and check-in) and booking cancellation status.
Insights:¶
- Bookings that are cancelled (cancellation status = 1) tend to have significantly higher lead times compared to bookings that are not cancelled (cancellation status = 0).
- This indicates that customers who book far in advance are more likely to cancel their reservations, possibly due to changes in their plans over time.
- Outliers can be observed for both groups, with extremely high lead times for some bookings.
# Total Stay
ggplot(hotel_data, aes(x = factor(is_canceled), y = total_stay, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Total Stay by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Total Stay (Nights)",
fill = "Cancellation"
) +
theme_minimal()
Total Stay by Cancellation Status¶
This boxplot depicts the relationship between the total stay duration (in nights) and the cancellation status of bookings.
Insights:¶
- The median total stay is similar for both cancelled (cancellation status = 1) and non-cancelled bookings (cancellation status = 0), suggesting that total stay duration is not a strong differentiator between the two groups.
- A few extreme outliers are present, representing bookings with exceptionally long total stays.
- The interquartile ranges and distributions for both groups are largely overlapping, further indicating minimal variation in total stay duration based on cancellation status.
# Market Segment
ggplot(hotel_data, aes(x = market_segment, fill = factor(is_canceled))) +
geom_bar(position = "dodge", alpha = 0.7) +
labs(
title = "Market Segment by Cancellation Status",
x = "Market Segment",
y = "Count",
fill = "Cancellation"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Market Segment by Cancellation Status¶
This bar chart compares the distribution of cancellations across different market segments.
Insights:¶
- The Online TA (Travel Agents) segment shows the highest number of bookings, with a significant portion of those being cancelled. This indicates that bookings from online travel agents are more prone to cancellations compared to other segments.
- The Groups segment also has a high cancellation rate, suggesting a strong relationship between group bookings and cancellations.
- Direct and Corporate bookings have noticeably lower cancellation rates, implying that these segments are more stable and less likely to result in cancellations.
- Other segments, such as Complementary and Aviation, have minimal booking volumes and cancellations, making them less impactful overall.
This visualisation highlights how different market segments contribute to booking cancellations, providing actionable insights for targeted interventions in specific segments.
# Deposit Type
ggplot(hotel_data, aes(x = deposit_type, fill = factor(is_canceled))) +
geom_bar(position = "dodge", alpha = 0.7) +
labs(
title = "Deposit Type by Cancellation Status",
x = "Deposit Type",
y = "Count",
fill = "Cancellation"
) +
theme_minimal()
Deposit Type by Cancellation Status¶
This bar chart illustrates the relationship between deposit types and booking cancellation rates.
Insights:¶
- No Deposit: This category accounts for the majority of bookings, with a significant proportion resulting in cancellations. The absence of a deposit may reduce the customer's commitment to completing their stay, increasing the likelihood of cancellations.
- Non-Refundable: Bookings in this category exhibit higher commitment, as cancellations are comparatively fewer. The financial implication of non-refundable deposits likely discourages cancellations.
- Refundable: This category shows the lowest volume of bookings and cancellations, suggesting limited usage of refundable deposits in the dataset.
The visualisation highlights that bookings requiring non-refundable deposits tend to have lower cancellation rates, whereas "No Deposit" bookings are associated with higher cancellation rates.
Outlier Detection and Winsorisation¶
Outlier Detection¶
To detect outliers in the dataset, the Interquartile Range (IQR) method was applied to numerical variables: adr
, lead_time
, stays_in_week_nights
, and stays_in_weekend_nights
. Outliers were identified as values falling outside the range defined by:
- Lower Bound: Q1 - 1.5 * IQR
- Upper Bound: Q3 + 1.5 * IQR
Summary of Outliers:
- ADR: 3,793 outliers (3.18% of the data)
- Lead Time: 3,005 outliers (2.52% of the data)
- Stays in Week Nights: 3,354 outliers (2.81% of the data)
- Stays in Weekend Nights: 265 outliers (0.22% of the data)
Outlier detection revealed that adr
(Average Daily Rate) and lead_time
had the most significant number of outliers.
Winsorisation¶
Winsorisation was applied to cap outliers at the upper and lower bounds calculated using the IQR method. This ensured that extreme values did not overly influence the analysis while preserving the overall data distribution.
Post-Winsorisation Summary:
- ADR: Values capped between -6.38 and 211.06
- Lead Time: Values capped between 0.0 and 373.0
- Stays in Week Nights: Values capped between 0.0 and 6.0
- Stays in Weekend Nights: Values capped between 0.0 and 5.0
Impact of Winsorisation¶
After Winsorisation:
- The data range for each variable is reduced, eliminating extreme values.
- The statistical summaries (mean, median, quartiles) are less influenced by extreme outliers.
This process ensures that the dataset remains robust for further analysis while minimising the influence of extreme values on the results.
# Define a function to summarise outliers using IQR
summarise_outliers <- function(data, variable) {
Q1 <- quantile(data[[variable]], 0.25, na.rm = TRUE)
Q3 <- quantile(data[[variable]], 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Count outliers
num_outliers <- sum(data[[variable]] < lower_bound | data[[variable]] > upper_bound, na.rm = TRUE)
total_values <- sum(!is.na(data[[variable]]))
percentage_outliers <- (num_outliers / total_values) * 100
return(data.frame(
Variable = variable,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Num_Outliers = num_outliers,
Total_Values = total_values,
Percentage_Outliers = percentage_outliers
))
}
# Select numerical variables
numerical_vars <- c("adr", "lead_time", "stays_in_week_nights", "stays_in_weekend_nights")
# Apply the function to each numerical variable
outliers_summary <- do.call(rbind, lapply(numerical_vars, function(var) summarise_outliers(hotel_data, var)))
# Print the summary of outliers
print(outliers_summary)
# Define a function for Winsorisation
winsorise <- function(data, variable) {
Q1 <- quantile(data[[variable]], 0.25, na.rm = TRUE)
Q3 <- quantile(data[[variable]], 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Cap the values
data[[variable]] <- ifelse(data[[variable]] < lower_bound, lower_bound, data[[variable]])
data[[variable]] <- ifelse(data[[variable]] > upper_bound, upper_bound, data[[variable]])
return(data)
}
# Variables to Winsorise
numerical_vars <- c("adr", "lead_time", "stays_in_week_nights", "stays_in_weekend_nights")
# Apply Winsorisation to each variable
for (var in numerical_vars) {
hotel_data <- winsorise(hotel_data, var)
}
# Confirm Winsorisation
print("Winsorisation completed for selected variables.")
summary(hotel_data[numerical_vars])
Variable Lower_Bound Upper_Bound Num_Outliers Total_Values 25% adr -15.775 211.065 3793 119390 25%1 lead_time -195.000 373.000 3005 119390 25%2 stays_in_week_nights -2.000 6.000 3354 119390 25%3 stays_in_weekend_nights -3.000 5.000 265 119390 Percentage_Outliers 25% 3.1769830 25%1 2.5169612 25%2 2.8092805 25%3 0.2219616 [1] "Winsorisation completed for selected variables."
adr lead_time stays_in_week_nights stays_in_weekend_nights Min. : -6.38 Min. : 0.0 Min. :0.000 Min. :0.0000 1st Qu.: 69.29 1st Qu.: 18.0 1st Qu.:1.000 1st Qu.:0.0000 Median : 94.58 Median : 69.0 Median :2.000 Median :1.0000 Mean :100.66 Mean :102.2 Mean :2.406 Mean :0.9227 3rd Qu.:126.00 3rd Qu.:160.0 3rd Qu.:3.000 3rd Qu.:2.0000 Max. :211.06 Max. :373.0 Max. :6.000 Max. :5.0000
# Define numerical variables for plotting
numerical_vars <- c("adr", "lead_time", "stays_in_week_nights", "stays_in_weekend_nights")
# Create histograms for the numerical variables
plots <- lapply(numerical_vars, function(var) {
ggplot(hotel_data, aes_string(x = var)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste(var, "After Winsorisation"), x = var, y = "Frequency") +
theme_minimal()
})
# Arrange plots in a grid
library(patchwork)
wrap_plots(plots, ncol = 2)
Histograms of Variables After Winsorisation¶
The histograms above display the distribution of numerical variables after Winsorisation:
ADR (Average Daily Rate):
- The values are capped between -6.38 and 211.06.
- The distribution remains positively skewed, with most values concentrated around 50 to 150.
Lead Time:
- The range is limited between 0 and 373 days.
- The distribution exhibits a heavy right skew, with most bookings having a lead time of fewer than 100 days.
Stays in Week Nights:
- Values are capped between 0 and 6 nights.
- The data is relatively spread out, with peaks at 1, 2, and 3 nights.
Stays in Weekend Nights:
- The range is capped between 0 and 5 nights.
- The distribution shows that most stays involve 0 or 1 weekend night.
Insights:¶
- Winsorisation successfully mitigates the impact of extreme outliers while preserving the overall structure of the data.
- The distributions indicate that the majority of the data lies within expected ranges for each variable, making the dataset more robust for further analysis.
# Boxplots for Winsorised Data
boxplots <- lapply(numerical_vars, function(var) {
ggplot(hotel_data, aes(y = .data[[var]])) +
geom_boxplot(fill = "orange", alpha = 0.7) +
labs(title = paste(var, "After Winsorisation"), y = var) +
theme_minimal()
})
# Arrange boxplots
wrap_plots(boxplots, ncol = 2)
Boxplots of Variables After Winsorisation¶
The boxplots above present the distribution of numerical variables after Winsorisation, showing the range, median, and spread of the data:
ADR (Average Daily Rate):
- The distribution is compact, with a median around 100.
- Most values are within the interquartile range (IQR), and extreme outliers have been capped.
Lead Time:
- The median lead time is approximately 100 days.
- The spread of the data indicates a right-skewed distribution with longer lead times capped.
Stays in Week Nights:
- The median stay is around 2 nights.
- The data is concentrated within a narrow range, with few outliers remaining.
Stays in Weekend Nights:
- The median is close to 1 night.
- The data shows a compact distribution, with most stays involving up to 2 nights.
Insights:¶
- Winsorisation effectively caps outliers, ensuring that the visualisation highlights the central tendency and variability of the data.
- The resulting boxplots are more representative of the typical values, which is beneficial for modelling and analysis.
# Compute correlation matrix for numerical variables
cor_matrix <- cor(hotel_data %>% select(adr, lead_time, stays_in_week_nights, stays_in_weekend_nights, is_canceled))
# Load ggcorrplot for visualisation
library(ggcorrplot)
# Plot correlation heatmap
ggcorrplot(cor_matrix, lab = TRUE, outline.color = "white", colors = c("red", "white", "blue")) +
labs(title = "Correlation Heatmap")
Correlation Heatmap After Winsorisation¶
The heatmap above displays the correlation matrix for the numerical variables in the dataset after Winsorisation. Key observations include:
lead_time
andis_canceled
: A moderately positive correlation of 0.3, suggesting that longer lead times are associated with a higher likelihood of cancellations.stays_in_week_nights
andstays_in_weekend_nights
: A positive correlation of 0.38, indicating that guests who stay longer during the week also tend to have longer weekend stays.adr
(Average Daily Rate): Shows weak correlations with all other variables, indicating minimal influence of daily rate on cancellation or stay characteristics.is_canceled
andstays_in_week_nights
/stays_in_weekend_nights
: Weak correlations, suggesting the duration of stay has limited direct impact on cancellations.
Insights:¶
- The correlation matrix highlights key relationships between variables, particularly between
lead_time
andis_canceled
, which could be an important predictor for cancellation likelihood. - The weak correlations for most variables with
is_canceled
indicate the need for more complex models or interactions to understand cancellations better.
Feature Selection and Data Preprocessing¶
Feature Selection:¶
- The dataset has been reduced to include only the relevant features for predicting cancellations:
is_canceled
: The target variable.- Predictors include:
- Numerical features:
lead_time
,adr
(Average Daily Rate),stays_in_week_nights
, andstays_in_weekend_nights
. - Categorical features:
hotel
,market_segment
,deposit_type
, andcustomer_type
.
- Numerical features:
Encoding Categorical Variables:¶
- Categorical variables (
hotel
,market_segment
,deposit_type
,customer_type
) were encoded using one-hot encoding, creating binary columns for each category. This ensures compatibility with machine learning models.
Train-Test Split:¶
- The dataset was split into:
- Training Set: 70% of the data (83,573 rows).
- Testing Set: 30% of the data (35,817 rows).
Normalisation:¶
- Numerical features (
lead_time
,adr
,stays_in_week_nights
,stays_in_weekend_nights
) were normalised using z-score scaling to standardise the feature range. This step is particularly important for models sensitive to feature scaling, such as logistic regression.
Importance of Preprocessing:¶
- These steps prepare the dataset for modelling by ensuring:
- Balanced and representative data for training and testing.
- Consistency in feature scales for numerical predictors.
# Select features
selected_features <- hotel_data %>%
select(is_canceled, lead_time, adr, stays_in_week_nights, stays_in_weekend_nights,
hotel, market_segment, deposit_type, customer_type)
# Encode categorical variables using one-hot encoding
dummy_vars <- dummyVars("~ .", data = selected_features, fullRank = TRUE)
encoded_data <- data.frame(predict(dummy_vars, newdata = selected_features))
# Split into training and testing subsets
set.seed(123) # For reproducibility
train_index <- createDataPartition(encoded_data$is_canceled, p = 0.7, list = FALSE)
train_data <- encoded_data[train_index, ]
test_data <- encoded_data[-train_index, ]
# Normalise numerical variables (for Logistic Regression)
num_vars <- c("lead_time", "adr", "stays_in_week_nights", "stays_in_weekend_nights")
preprocess <- preProcess(train_data[, num_vars], method = c("center", "scale"))
train_data[, num_vars] <- predict(preprocess, train_data[, num_vars])
test_data[, num_vars] <- predict(preprocess, test_data[, num_vars])
# Confirm the structure of training and testing data
cat("Training Data Dimensions:", dim(train_data), "\n")
cat("Testing Data Dimensions:", dim(test_data), "\n")
Training Data Dimensions: 83573 18 Testing Data Dimensions: 35817 18
# Train a logistic regression model
logistic_model <- glm(
is_canceled ~ .,
data = train_data,
family = binomial
)
# Display model summary
summary(logistic_model)
# Predict probabilities on the testing dataset
test_data$predicted_prob <- predict(logistic_model, newdata = test_data, type = "response")
# Convert probabilities to binary predictions (threshold = 0.5)
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, 1, 0)
Call: glm(formula = is_canceled ~ ., family = binomial, data = train_data) Coefficients: (1 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.478764 0.200560 -7.373 1.67e-13 *** lead_time 0.410698 0.010349 39.684 < 2e-16 *** adr 0.123881 0.009218 13.440 < 2e-16 *** stays_in_week_nights 0.076243 0.009578 7.960 1.72e-15 *** stays_in_weekend_nights 0.064280 0.009239 6.958 3.46e-12 *** hotelResort.Hotel -0.271546 0.019315 -14.059 < 2e-16 *** market_segmentComplementary -0.066100 0.232504 -0.284 0.77618 market_segmentCorporate -0.169554 0.200171 -0.847 0.39697 market_segmentDirect -0.490157 0.195961 -2.501 0.01237 * market_segmentGroups 0.433559 0.197457 2.196 0.02811 * market_segmentOffline.TA.TO -0.393231 0.195465 -2.012 0.04424 * market_segmentOnline.TA 0.510557 0.193860 2.634 0.00845 ** market_segmentUndefined NA NA NA NA deposit_typeNon.Refund 5.847317 0.130783 44.710 < 2e-16 *** deposit_typeRefundable -0.010815 0.217401 -0.050 0.96033 customer_typeGroup -0.220390 0.168583 -1.307 0.19111 customer_typeTransient 0.552017 0.052911 10.433 < 2e-16 *** customer_typeTransient.Party 0.042291 0.057109 0.741 0.45898 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 110085 on 83572 degrees of freedom Residual deviance: 81867 on 83556 degrees of freedom AIC: 81901 Number of Fisher Scoring iterations: 7
Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : "prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Logistic Regression Model Summary¶
The logistic regression model tries to predict if a booking will be cancelled (is_canceled
) based on several features, such as lead time, average daily rate (ADR), and market segment. It assigns a weight (called a "coefficient") to each feature to understand how important it is for predicting cancellations.
Key points from the model:
- Positive coefficients (e.g.,
lead_time
andadr
) mean that as these values increase, the likelihood of a cancellation increases. - Negative coefficients (e.g.,
hotelResort.Hotel
) mean that these features reduce the likelihood of a cancellation. - Some categories, like
market_segmentComplementary
, are less significant in predicting cancellations because their p-values are large (e.g., 0.776), which means they don't have a strong relationship with cancellations.
Model Fit and Performance¶
- The model successfully found the best fit for the data (this is called "convergence").
- The AIC (Akaike Information Criterion) value is 81901, which helps measure how well the model fits the data. A lower AIC means a better fit.
# Confusion Matrix
conf_matrix <- confusionMatrix(
factor(test_data$predicted_class),
factor(test_data$is_canceled)
)
print(conf_matrix)
# AUC-ROC Curve
roc_curve <- roc(test_data$is_canceled, test_data$predicted_prob)
plot(roc_curve, col = "blue", main = "ROC Curve for Logistic Regression")
auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
Confusion Matrix and Statistics Reference Prediction 0 1 0 21625 7919 1 837 5436 Accuracy : 0.7555 95% CI : (0.751, 0.76) No Information Rate : 0.6271 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.4143 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.9627 Specificity : 0.4070 Pos Pred Value : 0.7320 Neg Pred Value : 0.8666 Prevalence : 0.6271 Detection Rate : 0.6038 Detection Prevalence : 0.8249 Balanced Accuracy : 0.6849 'Positive' Class : 0
Setting levels: control = 0, case = 1 Setting direction: controls < cases
AUC: 0.7893547
Logistic Regression Model Evaluation¶
The model was evaluated using a confusion matrix and additional metrics to measure its performance.
Key Metrics:¶
Accuracy: 75.55% of the predictions made by the model were correct.
- This means the model correctly identified cancellations and non-cancellations 75.55% of the time.
Sensitivity (Recall for Non-Cancellations): 96.27%
- The model is very good at identifying bookings that were not cancelled (
class 0
).
- The model is very good at identifying bookings that were not cancelled (
Specificity (Recall for Cancellations): 40.70%
- The model struggles more with identifying cancellations (
class 1
).
- The model struggles more with identifying cancellations (
Positive Predictive Value (Precision for Non-Cancellations): 73.20%
- Of all the bookings predicted as non-cancelled, 73.20% were correct.
Negative Predictive Value (Precision for Cancellations): 86.66%
- Of all the bookings predicted as cancelled, 86.66% were correct.
Balanced Accuracy: 68.49%
- This is an average of sensitivity and specificity, giving a more balanced view of the model’s performance.
AUC (Area Under the Curve): 0.789
- The AUC measures how well the model distinguishes between cancellations and non-cancellations. A score of 0.789 indicates decent performance.
Observations:¶
- The model performs well in predicting non-cancellations but is less effective at identifying cancellations.
- The low specificity indicates the need for improvement in detecting cancellations.
- The AUC of 0.789 suggests that the model has a good ability to separate cancelled and non-cancelled bookings overall.
Note on Warnings:¶
The output also includes information that the model correctly set the positive class (0
for not cancelled) and the direction for calculations. This helps ensure accurate results.
# Ensure the target variable is a factor
hotel_data$is_canceled <- as.factor(hotel_data$is_canceled)
# Create an 80-20 train-test split
set.seed(123)
train_index <- createDataPartition(hotel_data$is_canceled, p = 0.8, list = FALSE)
# Split data
train_data <- hotel_data[train_index, ]
test_data <- hotel_data[-train_index, ]
# Confirm the structure of train and test data
cat("Training Data Dimensions:", dim(train_data), "\n")
cat("Testing Data Dimensions:", dim(test_data), "\n")
# Convert categorical variables to factors
train_data <- train_data %>%
mutate(
hotel = as.factor(hotel),
market_segment = as.factor(market_segment),
deposit_type = as.factor(deposit_type),
customer_type = as.factor(customer_type),
is_canceled = as.factor(is_canceled)
)
test_data <- test_data %>%
mutate(
hotel = as.factor(hotel),
market_segment = as.factor(market_segment),
deposit_type = as.factor(deposit_type),
customer_type = as.factor(customer_type),
is_canceled = as.factor(is_canceled)
)
# Verify transformation
str(train_data)
# Train the model
set.seed(123)
rf_model <- randomForest(
is_canceled ~ lead_time + adr + stays_in_week_nights + stays_in_weekend_nights +
hotel + market_segment + deposit_type + customer_type,
data = train_data,
ntree = 500,
mtry = 3,
importance = TRUE
)
# Print model summary
print(rf_model)
# Ensure test data levels match train data levels
test_data$hotel <- factor(test_data$hotel, levels = levels(train_data$hotel))
test_data$market_segment <- factor(test_data$market_segment, levels = levels(train_data$market_segment))
test_data$deposit_type <- factor(test_data$deposit_type, levels = levels(train_data$deposit_type))
test_data$customer_type <- factor(test_data$customer_type, levels = levels(train_data$customer_type))
# Verify that levels match
str(test_data)
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_data)
# Convert predictions and actual values to factors
rf_predictions <- as.factor(rf_predictions)
test_data$is_canceled <- as.factor(test_data$is_canceled)
# Confusion Matrix
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$is_canceled)
# Print the confusion matrix results
print(rf_conf_matrix)
Training Data Dimensions: 95513 33 Testing Data Dimensions: 23877 33 tibble [95,513 × 33] (S3: tbl_df/tbl/data.frame) $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ... $ is_canceled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ... $ lead_time : num [1:95513] 342 373 7 14 14 0 9 85 75 23 ... $ arrival_date_year : num [1:95513] 2015 2015 2015 2015 2015 ... $ arrival_date_month : chr [1:95513] "July" "July" "July" "July" ... $ arrival_date_week_number : num [1:95513] 27 27 27 27 27 27 27 27 27 27 ... $ arrival_date_day_of_month : num [1:95513] 1 1 1 1 1 1 1 1 1 1 ... $ stays_in_weekend_nights : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ stays_in_week_nights : num [1:95513] 0 0 1 2 2 2 2 3 3 4 ... $ adults : num [1:95513] 2 2 1 2 2 2 2 2 2 2 ... $ children : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ babies : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ meal : chr [1:95513] "BB" "BB" "BB" "BB" ... $ country : chr [1:95513] "PRT" "PRT" "GBR" "GBR" ... $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 7 7 4 4 7 6 7 ... $ distribution_channel : chr [1:95513] "Direct" "Direct" "Direct" "TA/TO" ... $ is_repeated_guest : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ previous_cancellations : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ previous_bookings_not_canceled: num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ reserved_room_type : chr [1:95513] "C" "C" "A" "A" ... $ assigned_room_type : chr [1:95513] "C" "C" "C" "A" ... $ booking_changes : num [1:95513] 3 4 0 0 0 0 0 0 0 0 ... $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ... $ agent : chr [1:95513] "NULL" "NULL" "NULL" "240" ... $ company : chr [1:95513] "NULL" "NULL" "NULL" "NULL" ... $ days_in_waiting_list : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ... $ adr : num [1:95513] 0 0 75 98 98 ... $ required_car_parking_spaces : num [1:95513] 0 0 0 0 0 0 0 0 0 0 ... $ total_of_special_requests : num [1:95513] 0 0 0 1 1 0 1 1 0 0 ... $ reservation_status : chr [1:95513] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ... $ reservation_status_date : Date[1:95513], format: "2015-07-01" "2015-07-01" ... $ total_stay : num [1:95513] 0 0 1 2 2 2 2 3 3 4 ... Call: randomForest(formula = is_canceled ~ lead_time + adr + stays_in_week_nights + stays_in_weekend_nights + hotel + market_segment + deposit_type + customer_type, data = train_data, ntree = 500, mtry = 3, importance = TRUE) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 3 OOB estimate of error rate: 19.78% Confusion matrix: 0 1 class.error 0 57333 2800 0.04656345 1 16097 19283 0.45497456 tibble [23,877 × 33] (S3: tbl_df/tbl/data.frame) $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ... $ is_canceled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ... $ lead_time : num [1:23877] 13 37 72 77 118 69 43 1 1 14 ... $ arrival_date_year : num [1:23877] 2015 2015 2015 2015 2015 ... $ arrival_date_month : chr [1:23877] "July" "July" "July" "July" ... $ arrival_date_week_number : num [1:23877] 27 27 27 27 27 27 27 27 27 27 ... $ arrival_date_day_of_month : num [1:23877] 1 1 1 1 1 2 2 2 2 2 ... $ stays_in_weekend_nights : num [1:23877] 0 0 2 2 4 2 1 0 0 0 ... $ stays_in_week_nights : num [1:23877] 1 4 4 5 6 4 3 1 1 2 ... $ adults : num [1:23877] 1 2 2 2 1 2 3 2 2 2 ... $ children : num [1:23877] 0 0 0 0 0 0 0 0 2 0 ... $ babies : num [1:23877] 0 0 0 0 0 0 0 0 0 0 ... $ meal : chr [1:23877] "BB" "BB" "BB" "BB" ... $ country : chr [1:23877] "GBR" "PRT" "PRT" "PRT" ... $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 3 6 4 7 4 6 7 7 4 7 ... $ distribution_channel : chr [1:23877] "Corporate" "TA/TO" "Direct" "TA/TO" ... $ is_repeated_guest : num [1:23877] 0 0 0 0 0 0 0 0 0 0 ... $ previous_cancellations : num [1:23877] 0 0 0 0 0 0 0 0 0 0 ... $ previous_bookings_not_canceled: num [1:23877] 0 0 0 0 0 0 0 0 0 0 ... $ reserved_room_type : chr [1:23877] "A" "E" "A" "A" ... $ assigned_room_type : chr [1:23877] "A" "E" "A" "A" ... $ booking_changes : num [1:23877] 0 0 1 0 2 0 0 0 0 0 ... $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ... $ agent : chr [1:23877] "304" "8" "250" "240" ... $ company : chr [1:23877] "NULL" "NULL" "NULL" "NULL" ... $ days_in_waiting_list : num [1:23877] 0 0 0 0 0 0 0 0 0 0 ... $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 1 3 3 3 3 3 3 3 3 ... $ adr : num [1:23877] 75 97.5 84.7 94 62 ... $ required_car_parking_spaces : num [1:23877] 0 0 0 0 0 0 0 1 1 0 ... $ total_of_special_requests : num [1:23877] 0 0 1 0 2 0 0 0 2 1 ... $ reservation_status : chr [1:23877] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ... $ reservation_status_date : Date[1:23877], format: "2015-07-02" "2015-07-05" ... $ total_stay : num [1:23877] 1 4 6 7 14 6 4 1 1 2 ... Confusion Matrix and Statistics Reference Prediction 0 1 0 14332 3951 1 701 4893 Accuracy : 0.8052 95% CI : (0.8001, 0.8102) No Information Rate : 0.6296 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.5481 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.9534 Specificity : 0.5533 Pos Pred Value : 0.7839 Neg Pred Value : 0.8747 Prevalence : 0.6296 Detection Rate : 0.6002 Detection Prevalence : 0.7657 Balanced Accuracy : 0.7533 'Positive' Class : 0
Random Forest Model Evaluation¶
Training and Testing Data¶
- The dataset was split into:
- Training Data: 80% (95,513 records).
- Testing Data: 20% (23,877 records).
- Categorical variables were converted to factors to ensure compatibility with the Random Forest model.
Model Summary¶
- Number of Trees: 500
- Variables Tried at Each Split: 3
- Out-of-Bag (OOB) Error Rate: 19.78%
- This is an internal estimate of prediction error during training.
Confusion Matrix on Testing Data:¶
Reference Class 0 | Reference Class 1 | |
---|---|---|
Predicted 0 | 14,332 | 3,951 |
Predicted 1 | 701 | 4,893 |
Key Metrics:¶
- Accuracy: 80.52%
- The model correctly predicted cancellations and non-cancellations 80.52% of the time.
- Sensitivity (Recall for Non-Cancellations): 95.34%
- The model effectively identified bookings that were not cancelled (
class 0
).
- The model effectively identified bookings that were not cancelled (
- Specificity (Recall for Cancellations): 55.33%
- The model had moderate success in identifying cancellations (
class 1
).
- The model had moderate success in identifying cancellations (
- Positive Predictive Value (Precision for Non-Cancellations): 78.39%
- Of all predictions for non-cancelled bookings, 78.39% were correct.
- Negative Predictive Value (Precision for Cancellations): 87.47%
- Of all predictions for cancelled bookings, 87.47% were correct.
- Balanced Accuracy: 75.33%
- This metric averages sensitivity and specificity for a balanced assessment of the model's performance.
Observations:¶
- The Random Forest model performed better than Logistic Regression in terms of accuracy and its ability to handle complex patterns in the data.
- However, the relatively lower specificity indicates room for improvement in predicting cancellations (
class 1
).
Insights:¶
- The high sensitivity makes the model reliable for identifying non-cancelled bookings.
- Improving specificity could enhance the model's ability to detect cancellations accurately, potentially by tuning hyperparameters or balancing the dataset.
# Get predicted probabilities
rf_probabilities <- predict(rf_model, test_data, type = "prob")[,2]
# Compute ROC curve
rf_roc_curve <- roc(test_data$is_canceled, rf_probabilities)
# Plot ROC Curve
plot(rf_roc_curve, col = "red", main = "ROC Curve - Random Forest", lwd = 2)
# Compute AUC
rf_auc <- auc(rf_roc_curve)
# Add AUC to the plot
legend("bottomright", legend = paste("AUC =", round(rf_auc, 3)), col = "red", lwd = 2)
Setting levels: control = 0, case = 1 Setting direction: controls < cases
ROC Curve for Random Forest Model¶
The ROC Curve (Receiver Operating Characteristic) visualises the trade-off between the model's sensitivity (true positive rate) and specificity (false positive rate) at various thresholds.
Key Observations:¶
- The curve rises sharply, indicating that the model performs well in distinguishing between cancellations and non-cancellations.
- AUC (Area Under the Curve): 0.87
- An AUC value of 0.87 suggests strong discriminatory power, where the model effectively differentiates between the two classes (cancellations and non-cancellations).
Insights:¶
- The high AUC reflects the Random Forest model's ability to predict cancellations with a good balance of sensitivity and specificity.
- This supports the model's robustness for cancellation prediction in hotel bookings.
# Plot feature importance
varImpPlot(rf_model, main = "Feature Importance - Random Forest")
# Extract numerical importance values
feature_importance <- as.data.frame(importance(rf_model))
feature_importance$Feature <- rownames(feature_importance)
# Print top features
#feature_importance <- feature_importance[order(-feature_importance$MeanDecreaseGini), ]
#print(feature_importance)
Feature Importance - Random Forest Model¶
The Feature Importance Plot shows which variables are most important for making accurate predictions in the Random Forest model. It uses two key measures:
- Mean Decrease Accuracy: This shows how much the model's accuracy drops if a variable is removed. Higher values mean the variable is very important.
- Mean Decrease Gini: This measures how much a variable helps reduce "disorder" (or impurity) in the data when the decision tree splits. Lower impurity means better separation of classes.
Key Findings:¶
- Most Important Variables:
- Deposit Type: The strongest predictor of cancellations, showing that the type of deposit is crucial in determining whether a booking gets cancelled.
- Lead Time: Highly influential, with longer lead times being associated with more cancellations.
- ADR (Average Daily Rate): Reflects how pricing impacts cancellations.
- Other significant variables:
- Stays in Week Nights and Stays in Weekend Nights: Capture patterns in the length of stay.
- Market Segment: Highlights how customer groups relate to cancellations.
Insights:¶
- The deposit type and lead time are the most critical factors influencing cancellations. This suggests that financial and booking behaviour play a major role.
- Hotels can focus on these key variables to develop strategies for reducing cancellations, such as offering flexible deposit options or targeting specific customer segments.
# Boxplot for Lead Time by Cancellation Status
boxplot_lead_time <- ggplot(hotel_data, aes(x = factor(is_canceled), y = lead_time, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Lead Time by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Lead Time (Days)",
fill = "Cancellation"
) +
theme_minimal()
# Boxplot for ADR by Cancellation Status
boxplot_adr <- ggplot(hotel_data, aes(x = factor(is_canceled), y = adr, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "ADR by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Average Daily Rate (ADR)",
fill = "Cancellation"
) +
theme_minimal()
# Combine the two plots side by side
boxplot_lead_time + boxplot_adr
Boxplots of Key Features by Cancellation Status¶
The boxplots below examine the relationship between two critical features and cancellation status:
Lead Time by Cancellation Status:
- Bookings that were cancelled tend to have significantly higher lead times compared to those that were not cancelled.
- This indicates that bookings made far in advance are more likely to be cancelled, highlighting lead time as a crucial predictor of cancellations.
Average Daily Rate (ADR) by Cancellation Status:
- The ADR distribution shows a slight difference between cancelled and non-cancelled bookings.
- Cancelled bookings generally exhibit a marginally higher ADR, suggesting that price sensitivity might influence cancellation behaviour.
Insights:¶
- Lead Time is a strong differentiator between cancelled and non-cancelled bookings, making it an essential factor for predictive modelling.
- ADR provides additional insights into the economic aspects of cancellations, albeit with a less pronounced difference.
# Boxplot for Stays in Week Nights by Cancellation Status
boxplot_week_nights <- ggplot(hotel_data, aes(x = factor(is_canceled), y = stays_in_week_nights, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Stays in Week Nights by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Stays in Week Nights",
fill = "Cancellation"
) +
theme_minimal()
# Boxplot for Stays in Weekend Nights by Cancellation Status
boxplot_weekend_nights <- ggplot(hotel_data, aes(x = factor(is_canceled), y = stays_in_weekend_nights, fill = factor(is_canceled))) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Stays in Weekend Nights by Cancellation Status",
x = "Cancellation Status (0 = Not Canceled, 1 = Canceled)",
y = "Stays in Weekend Nights",
fill = "Cancellation"
) +
theme_minimal()
# Combine the two plots side by side
boxplot_week_nights + boxplot_weekend_nights
Stays in Week Nights and Weekend Nights by Cancellation Status¶
The boxplots above compare the number of Stays in Week Nights and Stays in Weekend Nights between bookings that were cancelled (1
) and those that were not (0
).
Observations:¶
- For both Stays in Week Nights and Stays in Weekend Nights, there is no significant difference in the median values between cancelled and non-cancelled bookings.
- However, the distributions indicate that non-cancelled bookings may slightly favour shorter stays, while cancelled bookings include a slightly broader range of stay lengths.
- The variability in stay durations appears similar for both week and weekend stays, suggesting that length of stay alone may not be a strong predictor of cancellations.
Insights:¶
These findings highlight that while length of stay contributes some information, it is not a definitive factor in determining cancellation likelihood. Hotels may need to look at other factors, such as booking behaviour or customer segments, to better understand cancellation trends.
Business Insights & Conclusion¶
Based on our analysis, hotels can take the following actionable steps:
- Reduce long lead-time cancellations: Offer discounts for early bookings that are non-refundable to encourage commitment.
- Encourage deposits: Customers who pay upfront are more likely to follow through with their bookings, reducing cancellations.
- Adjust pricing strategies: Tailor pricing for different market segments to minimise high-risk cancellations.
- Monitor online travel agency bookings: These bookings tend to have a higher cancellation rate and may require additional policies or incentives.
Summary¶
This analysis explored factors influencing booking cancellations and assessed the predictive accuracy of machine learning models. Through data visualisation, statistical summaries, and model evaluation, several key findings were identified.
Key Influencing Factors:¶
- Lead Time: Longer lead times significantly increase the likelihood of cancellations, suggesting customers booking far in advance are more prone to changing plans.
- Deposit Type: Non-refundable deposits strongly reduce cancellations, highlighting their effectiveness in securing bookings.
- Market Segment: Online bookings and groups are more likely to cancel, reflecting distinct customer behaviours in these segments.
- Economic Factors (ADR): Higher average daily rates (ADR) are associated with increased cancellations, potentially due to price sensitivity.
Predictive Modelling:¶
- Two models were developed to predict booking cancellations: Logistic Regression and Random Forest.
- The Random Forest model achieved higher predictive accuracy (80.5%) compared to Logistic Regression (75.5%), with an AUC of 0.87, indicating a strong ability to distinguish between cancelled and non-cancelled bookings.
- Feature importance analysis revealed that deposit type, lead time, and ADR are the most critical predictors of cancellations.
Conclusion:¶
The analysis highlights that financial conditions (e.g., deposit type) and booking behaviours (e.g., lead time and market segment) are key factors driving cancellations. The Random Forest model demonstrated robust performance, providing an effective tool for predicting cancellations and enabling hotels to take proactive measures to reduce them. Insights from this study can help hotels optimise booking policies, improve revenue management, and enhance customer satisfaction.