Project Data Analysis : Investigate Hotel Business

Bela Moneta
7 min readAug 26, 2023

--

Investigating Hotel Business using Python

In the business world, looking at how a company is doing by using data analysis is really important. This helps us understand a lot of important things that can help us make better decisions, plan smartly, and be successful overall.

Right now, I’m going to focus on hotels. I want to learn more about how people act when they’re booking hotel rooms. Later, I’ll show this information using graphs to make it easier to understand.

For this project, I’ll be using a set of organized information from Kaggle. It’s like a treasure trove of data that includes details about bookings in both city hotels and resort hotels. This data tells us things like when the bookings were made, how long people stayed, if there were parking spots available, and more.

What are we aiming to discover?
Based on the information provided, our objective is to uncover the following insights:

  1. Let’s analyze the cancellations for both resort and city hotels over the months.
  2. We will examine the trends in reservations for both resort and city hotels across the months.
  3. Compare the reservation statuses between the resort hotel and the city hotel.
  4. We’ll identify the primary countries from which the majority of guests originate.

Dataset Brief

Before delving into this project, the initial step is to ensure the dataset is ready for analysis. In this dataset, I’ve already taken care of tasks such as handling missing values, adjusting data types, and removing rows where the ‘adults’, ‘children’, and ‘babies’ columns all held zero values. This prepares the data for further processing and analysis.

However, before we start, it’s essential to ensure that the required libraries for this project are installed.

A. Import the libraries

import pandas as pd
import sort_dataframeby_monthorweek as sd
import matplotlib.pyplot as plt
import plotly.express as px
import folium

what these libraries mean
pandas: It provides data structures and functions to help us work with structured data, such as CSV files, Excel sheets, SQL databases, and much more.

sort_dataframeby_monthorweek: This is a third-party library designed for sorting data frames by months or weeks.

matplotlib.pyplot: This module is widely used for visualization in Python.

plotly.express: Similar to matplotlib, but Plotly supports a variety of chart types, including bar charts, line charts, scatter plots, and more.

folium: It’s a Python library used to create interactive maps.

B. Import the data

#import data
df = pd.read_csv('cleaned_data.csv')

Let’s Answer All the Tasks!

A. Analyze the cancellations for both resort and city hotels over the months.

If you look at the information in the ‘hotel’ column, you’ll see that this dataset contains details about reservations in both city hotels and resort hotels. Because we’re dealing with two different kinds of hotels, we should organize the data for cancellations separately for each type. This way, we can better understand and study the cancellations for each type of hotel.

# month highest cancellation for city hotel 
# filter city hotel with cancellation > 0
city_hotel_cancellation = df[(df['hotel']== 'City Hotel') & (df['is_canceled']>0)]
# group by city hotel with cancellation > 0 and arrival date month
month_city_hotel_cancellation = city_hotel_cancellation['arrival_date_month'].value_counts().reset_index()
month_city_hotel_cancellation.columns=['month','no of cancellations city hotel']
month_city_hotel_cancellation

Once you take a look at the data, you might notice that the months are not arranged in a sequential order. Because we plan to create graphs using this data, it’s important to rearrange the months in the correct order. This way, when we plot the data, it will be easier to understand and interpret.

#sort the month city hotel cancellation 
def sort_month(df, column_name):
return sd.Sort_Dataframeby_Month(df, column_name)
sort_month_city_hotel_cancellation= sort_month(month_city_hotel_cancellation, 'month')
sort_month_city_hotel_cancellation

Once we’re finished working with the data for city hotels, let’s repeat the same process for the data related to resort hotels.

#month highest cancellation for resort hotel 
#filter resort hotel cancellation > 0
resort_hotel_cancellation = df[(df['hotel']=='Resort Hotel') & (df['is_canceled']>0)]
#group by resort hotel cancellation > 0 and arrival date month
month_resort_hotel_cancellation = resort_hotel_cancellation['arrival_date_month'].value_counts().reset_index()
month_resort_hotel_cancellation.columns=['month','no of cancellations resort hotel']
month_resort_hotel_cancellation

don’t forget to arrange the months in sequential order.

#sort the month city hotel cancellation 
def sort_month(df,column_name):
return sd.Sort_Dataframeby_Month(df, column_name)
sort_month_resort_hotel_cancellation= sort_month(month_resort_hotel_cancellation, 'month')
sort_month_resort_hotel_cancellation

After we’ve separately worked with both the city hotel and resort hotel data, the next step is to combine or merge these two sets of data into a single table. This will give us a comprehensive view of the information from both types of hotels together.

#merge resort and city cancellation
final_cancellation = sort_month_resort_hotel_cancellation.merge(sort_month_city_hotel_cancellation, on = 'month')
final_cancellation.columns = ['month','no of cancellations resort hotel', 'no of cancellations city hotel']
final_cancellation

let’s turn these data into visualization

#graph cancellation on resort city hotel
plt.figure(figsize=(19,10))
px.line(final_cancellation, x='month', y=['no of cancellations resort hotel','no of cancellations city hotel'], title='Resort and City Hotel Cancellations over the Months', template ='plotly_white')

B. Trends in reservations for both resort and city hotels across the months.

In this code, there’s a column called ‘is_canceled’ that tells us if a hotel reservation was canceled or not. When someone makes a reservation and actually shows up, the value in that column is set to 0.

#resort_hotel reservation month 
#resort reservation with 0 cancellation
#groupby resort reservation with 0 cancellation and month
resort_hotel_reservation = df[(df['hotel']=='Resort Hotel')&(df['is_canceled']==0)]
month_resort_hotel_reservation= resort_hotel_reservation['arrival_date_month'].value_counts().reset_index()
month_resort_hotel_reservation.columns=['month','no of reservation resort hotel']
month_resort_hotel_reservation

Just like in the previous step, we’re going to sort the months to help us see the data more clearly.

#sort month resort hotel reservation 
def sort_month(df, column_name):
return sd.Sort_Dataframeby_Month(df, column_name)
sort_month_resort_hotel_reservation=sort_month(month_resort_hotel_reservation,'month')
sort_month_resort_hotel_reservation

Repeat the exact same step for city hotel reservation data

#city hotel reservation month 
#city hotel reservation with 0 cancellation
#group by city hotel reservation with 0 cancellation with month
city_hotel_reservation= df[(df['hotel']=='City Hotel')&(df['is_canceled']==0)]
month_city_hotel_reservation = city_hotel_reservation['arrival_date_month'].value_counts().reset_index()
month_city_hotel_reservation.columns=['month','no of reservation city hotel']
month_city_hotel_reservation

don’t forget to arrange the months

#sort city hotel reservation 
def sort_month(df,column_name):
return sd.Sort_Dataframeby_Month(df,column_name)
sort_month_city_hotel_reservation = sort_month(month_city_hotel_reservation,'month')
sort_month_city_hotel_reservation

then merge both resort hotel reservation and city hotel reservation data.

#merge resort and city hotel reservation
final_reservation= sort_month_resort_hotel_reservation.merge(sort_month_city_hotel_reservation, on='month')
final_reservation.columns=['month','no of reservation resort hotel', 'no of reservation city hotel']
final_reservation

Now, we’re going to transform this data into a static visualization

#graph of merge resort and city hotel reservation 
plt.figure(figsize=(19,10))
px.line(final_reservation, x='month', y=['no of reservation resort hotel','no of reservation city hotel'], title='Resort anc City Hotel Reservation over the Months', template='plotly_white')

By looking at this graph, we can come to the conclusion that both the resort hotel and the city hotel had a significant number of reservations in the month of August. While the reservations for both the resort hotel and the city hotel show a decrease over the months after August, there is a small increase in reservations for the resort hotel, specifically in October.

C. Compare the reservation statuses between the resort hotel and the city hotel.

Resort Hotel Reservation Status

In order to discover the reservation status for the Resort Hotel, we should extract two data frames that we located in the previous step. These data frames are named “resort_hotel_cancellation” and “resort_hotel_reservation.”

Before calculating the percentage, utilize the len() function in Python to determine the count or the total number of each categories.

#calculate the number of reservation for each category
num_resort_hotel_cancellation =len(resort_hotel_cancellation)
num_resort_hotel_reservation=len(resort_hotel_reservation)
#calculate the percentage of cancellation
percentage = (num_resort_hotel_cancellation/(num_resort_hotel_reservation + num_resort_hotel_cancellation))*100
#calculate the percentage for the pie chart
percentage_for_chart=[percentage, 100-percentage]
#labels for the pie chart
labels = ['canceled', 'not canceled']
#create the pie chart
plt.figure(figsize=(6,6))
plt.pie(percentage_for_chart,labels=labels, autopct='%.2f%%', startangle=140, colors=['green','red'])
plt.title("Resort Hotel Reservation Status")
plt.axis('equal') #Equal aspect ratio ensures that pie chart is drawn as a circle
plt.show()

let’s see the output

Perform a similar process for the reservation status of the city hotel.

#make a percentage of City Hotel Reservation Status 
#calculate the number of reservation for each category
num_city_hotel_cancellation =len(city_hotel_cancellation)
num_city_hotel_reservation=len(city_hotel_reservation)
#calculate the percentage of cancellation
percentage = (num_city_hotel_cancellation/(num_city_hotel_reservation + num_city_hotel_cancellation))*100
#calculate the percentage for the pie chart
percentage_for_chart=[percentage, 100-percentage]
#labels for the pie chart
labels = ['canceled', 'not canceled']
#create the pie chart
plt.figure(figsize=(6,6))
plt.pie(percentage_for_chart,labels=labels, autopct='%.2f%%', startangle=140, colors=['green','red'])
plt.title("City Hotel Reservation Status")
plt.axis('equal') #Equal aspect ratio ensures that pie chart is drawn as a circle
plt.show()

D. Identify the primary countries from which the majority of guests originate.

To answer this task, I will retrieve the data with a value of 0 in the “is_canceled” column. The reason behind this choice is to extract data about countries where the guests actually arrived and didn’t cancel their reservations.

#country guests list 
country_reservations =df[df['is_canceled']==0]['country'].value_counts().reset_index()
country_reservations.columns=['country','no of guests']
country_reservations

Now let’s visualize it on a map.

#make it into a map 
basemap = folium.Map()
reservations_map=px.choropleth(country_reservations, locations=country_reservations['country'], color=country_reservations['no of guests'], hover_name=country_reservations['country'])
reservations_map.show()

From this map, we can see that the majority of guests come from Portugal, followed by Great Britain, and then other European countries.

--

--