Predicting Airbnb Listing Price in Sydney

Rosa Caminal
30 min read · Sep 27, 2020
Source: Unsplash.com

1. Introduction

1.1 Business problem

The use of analytics to predict real estate prices has become very popular in recent years. Entrepreneurs, real estate companies and investors alike are discovering the benefits of using predictive analytics to make more informed decisions.

In this report, the business problem we will be focusing on is predicting the nightly price of an Airbnb listing in Sydney given a range of features such as the number of bedrooms and bathrooms, the property type, amenities, host status and many more.

Our main goal is to use the data provided by Airbnb listings to create a model that accurately estimates the value of a listing per night. To do so, we first conduct Exploratory Data Analysis (EDA). Secondly, we apply feature engineering, which is fundamental for our methodology section, where we select and validate a range of advanced predictive models. Finally, our best models allow us to produce key insights that help stakeholders make better decisions.

1.2 Final Solution

Once the models were validated using the training data, we chose our best model, the one providing the most accurate estimate of the price per night of a property. Our best performing model is XGBoost with Bayesian hyperparameters.

1.3 Key Insights

After analysing the models, some of our key insights are:

  • Becoming a verified host or a superhost makes little difference to the price of a property, so investing in either status may not be a priority for hosts.
  • Properties with more bedrooms are associated with higher prices, as they can accommodate more people. However, the number of bedrooms must be proportionate to the number of bathrooms.
  • Listings within the Eastern Suburbs are more likely to command a higher nightly price due to the highly touristic destinations and attractions in their vicinity.

2. Data processing and exploratory data analysis

2.1 Dropping parameters

The dataset consists of 10,635 listing entries, 82 features and 1 response variable. A domain-knowledge analysis was conducted to remove features that do not help predict the rental price of a listing. In addition, features were checked and dropped for the reasons outlined below.

First, features containing free text are dropped because substantial processing is needed to make use of them. These include name, summary, space, notes, house rules, etc. We keep one text feature, amenities, as it appears important for predicting the price and it consists of sub-categories that are less complex than the other free-text features.

Next, there are features with similar meanings, some of them produced by Airbnb's own attempts to clean up the data. When such similar columns are found, we keep only one, choosing the most insightful yet simple option for model building and interpretation. For example, the features neighbourhood, neighbourhood_cleansed, city, zip_code and smart_location all provide location information for the property, so we keep only neighbourhood_cleansed, Airbnb's cleaned-up version of the location data. Another example is minimum_nights, chosen over minimum_minimum_nights and its variants. Similarly, availability_365 is preferred over availability_30, availability_60 and availability_90, as it gives a broader view of the year's availability; we do not know at what time of year the other measurements were taken, which might strongly affect their values. Finally, review_scores_rating is kept because it summarises all the review scores from the other predictors.

Features with a high proportion of null values (more than 90%) are dropped, as they do not give enough information to the model and may introduce bias. These include weekly_discount (93% missing), square_feet (almost 100% missing) and monthly_discount (95% missing). Moreover, categorical features highly dominated by a single value are also dropped, as they give no insight into price differences between categories. These include experiences_offered (only "none"), requires_license (only "f"), and require_guest_profile_picture and require_guest_phone_verification (99.5% "f").

The initial dropping leaves us with 18 features which will be individually checked with the EDA in the next section. Further EDA and figures can be found in the appendix.

2.2 Response variable: Price

The distribution of the response variable "Price" is right-skewed. A log transformation of "Price" is applied to make the distribution more normal, which helps meet the normality assumptions of several models.

Figure 1 & 2: Histograms illustrating the distribution of Price and LogPrice.
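As a minimal sketch of the transformation (with made-up prices; the real dataset has 10,635 listings), the log transform is a one-liner with NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of nightly prices (AUD); illustrative values only
prices = pd.Series([65, 120, 250, 90, 1500, 300], name="Price")

# Natural-log transform to reduce the right skew of Price
log_price = np.log(prices).rename("LogPrice")

print(log_price.round(2).tolist())
```

Extreme values such as 1500 are pulled in towards the bulk of the distribution, which is what makes the histogram in Figure 2 look far more symmetric than Figure 1.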

2.3 Variables EDA

For the purpose of this report, the Exploratory Data Analysis (EDA) is divided into categorical and numerical to better outline common insights and to apply feature engineering more efficiently. The training data is split into a train set and a validation set with a ratio of 7:3. The training set is used for EDA and initial analysis, while the validation set is used for model selection and evaluation (more figures can be found in the appendix).
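The 7:3 split described above can be sketched with scikit-learn's `train_test_split` (the toy frame below is a hypothetical stand-in for the cleaned listings data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cleaned listings data
df = pd.DataFrame({
    "bedrooms": range(10),
    "LogPrice": [4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.5, 5.8, 6.0, 6.3],
})

# 7:3 train/validation split; random_state makes the split reproducible
train, valid = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(valid))
```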

2.3.1 Categorical EDA

Host is Superhost

Figure 22: Bar chart illustrating the difference in number of superhosts and non-superhosts.

The feature host_is_superhost indicates whether the host is a superhost, a status given to hosts who provide a shining example for other hosts and extraordinary experiences for their guests (Airbnb, 2020). It takes the values "t" (true) and "f" (false). Out of 7,444 entries, 6,310 hosts are not superhosts, almost 85% of the data (Figure 22).

Figure 23: Boxplot illustrating the differences between superhosts and non-superhosts.

However, Figure 23 shows that the mean log prices of properties owned by superhosts and non-superhosts are not significantly different (5.08 and 5.04 respectively). The medians are also very similar (a difference of 0.069), as are the price ranges of the two categories. Hence, from the EDA, there seems to be no significant relationship between this feature and the response variable.

Neighbourhood Cleansed

The neighbourhood_cleansed feature refers to the location of the property, with categories already standardised by Airbnb. There are 38 categories in this feature. The training set properties are mostly located in Sydney (1,998), Waverley (1,084) and Randwick (634), as illustrated in Figure 3. The other neighbourhoods contain very few entries, which makes it hard to conclude whether there is a significant relationship between location and log price.

Figure 3: Bar chart illustrating the number of properties in each of the neighbourhoods

In Figure 26, we can observe that some neighbourhoods have higher prices than others (with Pittwater having the highest), with varying price ranges. The neighbourhoods will be recategorised in feature engineering (Section 3.3), merging those in the same area, to better view the association between location and price.

Figure 26: Boxplot illustrating the differences in LogPrice between the neighbourhood

Property Type

Figure 27: Bar chart illustrating the number of types of property

In Figure 27 we can observe that there are 26 property types in the training set, including the "other" category. However, the majority of the properties are apartments (4,487) and houses (1,894). The other types contain very few entries, which makes it hard to draw conclusions about the relationship between property type and log price.

Figure 4: Boxplot illustrating the differences in LogPrice between the different types of property

Figure 4 shows the mean prices of the different property types, with castles having the highest mean price. There are some potential outliers in the apartment type that may affect the analysis, and conclusions are hard to draw given the numerous categories and the small samples in many of them. Therefore, the rare property types will be merged into the "other" category in feature engineering (Section 3.3).

Room Type

Figure 29: Bar chart illustrating the different numbers for different room type

As seen in Figure 29, there are 4 room types in the training set. Most properties offer the "entire home/apt" room type (5,106). Figure 30 shows that the room types have different log prices. The distribution of each category is mostly right-skewed, with potential outliers shown in the boxplot for the "entire home/apt" and "private room" categories, potentially because of the small number of observations in the other categories.

Figure 30: Boxplot illustrating the difference in LogPrice for different room types

From Figure 30, we can also see that the "entire home/apt" room type has the highest log price. Interestingly, renting an entire home or apartment costs more than renting a hotel room, while a hotel room is more expensive than a shared or private room.

Review Score Rating

Figure 31: Bar chart illustrating the distribution of review scores.
Figure 32: Scatterplot illustrating the relationship between review scores rating and LogPrice.

The review_scores_rating feature has many null values (27%). The distribution is highly left-skewed, as the majority of inputs are rated between 95 and 100 (Figure 31). The scatter plot shows that log price is somewhat higher when the rating is higher, but not markedly so. In order to create a more normalised distribution and a clearer interpretation of the relationship, we carried out feature engineering in Section 3.3.

Cancellation Policy

Figure 34: Bar chart illustrating the LogPrice of properties for each of the cancellation policies.
Figure 35: Box Plot illustrating the LogPrice between properties with different cancellation policies.

The cancellation_policy feature consists of 6 categories. Most properties have a "strict 14 with grace period" cancellation policy, followed by the flexible and moderate policies. In Figure 34, we can see that properties with stricter cancellation policies are associated with higher rental prices. However, the other categories do not have enough observations to tell whether the association is real or a biased conclusion. Therefore, the categories will be merged into the three main ones in feature engineering (Section 3.3).

Availability 365

Figure 37: Scatterplot illustrating the relationship between availability and LogPrice

Availability_365 refers to the number of nights available to be booked over 365 days (1 year). Its distribution is very right-skewed (Figure 36). Figure 37 shows that the more days of availability a property has, the higher its price. Considering the vast number of values this variable can take, and its right-skewed distribution, we decided to bin it into several categories in feature engineering (Section 3.3).

Amenities

Amenities is a feature containing free-text lists of the amenities a property offers, with over 200 different amenity types. However, because this feature is free text entered by the host, some entries share the same meaning (for example, internet and wifi) and can be merged into the same group. This merging will be done in feature engineering (Section 3.3).
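A minimal sketch of this merging, assuming the raw amenities strings look like the Airbnb export's brace-delimited lists (the sample strings and synonym map below are illustrative):

```python
import pandas as pd

# Hypothetical raw amenities strings as they appear in the Airbnb export
amenities = pd.Series([
    '{TV,Wifi,"Air conditioning",Kitchen}',
    '{Internet,Heating,Washer}',
])

# Map synonymous labels onto one canonical group (e.g. Wifi -> Internet)
synonyms = {"Wifi": "Internet", "Wireless Internet": "Internet"}

def parse_amenities(raw: str) -> set:
    # Strip the braces, split on commas, drop quotes, then merge synonyms
    items = [a.strip().strip('"') for a in raw.strip("{}").split(",")]
    return {synonyms.get(a, a) for a in items if a}

parsed = amenities.apply(parse_amenities)
print(parsed.tolist())
```

The resulting sets can then be turned into one dummy column per amenity group, as described in Section 3.4.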

2.3.2 Numerical EDA

Bathrooms

Figure 40: Bar chart illustrating the distribution of properties sorted by number of bathrooms.

We can observe from Figure 40 that having 1 bathroom is by far the most common option, with more than 7,000 observations. Having 2 or 3 bathrooms is also more common than average, with counts close to 2,000. Also notable is the occurrence of "half bathrooms": bathrooms with only a sink and a toilet.

Figure 5: Box Plot illustrating the difference in LogPrice with different number of bathrooms

As expected, the number of bathrooms and the log price of a listing are positively correlated: as the number of bathrooms increases, so does the price (Figure 5). However, Figure 5 also shows outliers relative to this relationship, whereby a property with 6 bathrooms is listed at a log price of 4.5 while a property with only 2 bathrooms is listed at a log price of 8. Most likely, the location and the rating of the property are factors that influence this relationship.

Security deposit %

Figure 44: Bar chart illustrating the distribution of properties sorted by security deposit fee.
Figure 45: Scatterplot illustrating the relationship between security deposit fee and LogPrice.

In this case, any missing value has been interpreted as a security deposit of zero. Figure 45 shows outliers where the security deposit percentage exceeds 5,000. The security deposit percentage indicates the up-front amount a guest gives the host to cover any damage to the property. A high security deposit may be explained by a high minimum number of nights.

However, Figure 45 also indicates a weak negative relationship between the security deposit percentage and the price per night: as the price increases, the deposit percentage decreases. This is because the percentage is calculated relative to the nightly price; as the price goes up, the percentage is proportionally smaller, even though the absolute sum to be paid is considerably higher.

Minimum nights

Figure 48: Scatterplot illustrating the relationship between minimum nights and LogPrice

After conducting some research, it was found that the law limits the number of nights a property can be let on Airbnb in Sydney to 180 nights per year. Therefore, all listings with minimum nights higher than 180 have been capped at 180. Figure 48 shows a positive relationship between minimum nights and price: the higher the minimum nights, the higher the price per night.

It is also interesting to observe that:

  • Most values lie between 1 and 7 minimum nights.
  • Beyond those, most values cluster around 30, 60 and 90 nights, explained by typical month-based short-term leases.
  • Finally, there is also a peak around 180, as it has been set as the maximum.
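The 180-night cap described above is a single `clip` in pandas (the column values below are illustrative):

```python
import pandas as pd

# Hypothetical minimum_nights values, including entries above the 180-night cap
minimum_nights = pd.Series([1, 3, 30, 90, 365, 1000])

# Cap everything above 180 at 180, matching the Sydney short-stay limit
capped = minimum_nights.clip(upper=180)
print(capped.tolist())  # [1, 3, 30, 90, 180, 180]
```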

3. Feature engineering

3.1 Fixing missing values

The missing values that appeared in the dataset were dealt with in different ways. First of all, variables where more than 90% of the data was missing, such as the weekly and monthly discounts (92.0% and 95.0% respectively), were removed from the selected features.

Furthermore, the missing values in the remaining features were handled according to the data type. For example, in the case of the review score rating, certain properties did not have a score. As we decided to transform this variable into a categorical one, the missing values were recorded as "no reviews" instead of being assigned a score of 0.

Figure 7: Bar chart illustrating the number of properties in each review score rating

When analysing the cleaning fee variable, it was found to have 1,755 missing values in the training set. These were assigned a value of 0, as a missing entry most likely means there is no cleaning fee.

A similar approach was taken for the security deposit percentage, which had over 2,400 missing values in the training set. Again, a missing value was assumed to mean no security deposit was required, so it was set to 0. Furthermore, any value over 5,000 was treated as an outlier, as all properties with such values had minimum night stays of 0.
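A minimal sketch of this fill-and-filter step (the column names and values below are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical columns with the missing-value patterns discussed above
df = pd.DataFrame({
    "cleaning_fee": [50.0, np.nan, 80.0],
    "security_deposit_perc": [200.0, np.nan, 6000.0],
})

# A missing fee/deposit is read as "no fee/deposit", so fill with 0
df["cleaning_fee"] = df["cleaning_fee"].fillna(0)
df["security_deposit_perc"] = df["security_deposit_perc"].fillna(0)

# Treat deposits over 5,000 as outliers (dropped here; capping is an alternative)
df = df[df["security_deposit_perc"] <= 5000]
print(df)
```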

3.2 Dropping variables

Variables in this dataset were dropped for various reasons. Some features were removed because of missing values, as stated before. When handling missing data, it is generally preferable to have no more than 15% to 20% missing in a studied variable (Dong, Y. et al. 2013). Therefore, the following variables were removed for lack of data:

  • weekly_discount: 92.0% of the data is missing
  • monthly_discount: 95.0% of the data is missing

Furthermore, features whose outcomes were too similar across categories, especially those with only true/false values, were also removed (Figures 2 & 4). The boxplots show no significant difference in price between hosts with a verified identity and those without; the same holds whether or not the host is a superhost.

3.3 Creating new columns

When the dataset was analysed, new columns were created in order to work with better-structured information. The response variable was examined first: the raw prices are right-skewed.

In order to normalise the distribution, which is discussed further in the following parts, we created a new variable, LogPrice, by applying a log transformation to Price. LogPrice is considerably more centred and follows a roughly normal distribution.

Various locations in the neighbourhood variable were grouped to create a new, less crowded feature. This gives a better distribution of the data and facilitates later interpretation. The initial neighbourhood variable consisted of 38 different areas.

Figure 8: Boxplot illustrating the differences in LogPrice across different neighbourhoods

The neighbourhoods were then merged into 7 bigger groups, each containing from 3 to 12 of the previously recorded groups. Furthermore, with this transformation, all of the groups have a substantial quantity allowing for better future statistical inference. Indeed, all had at least 470 observations in the training set, whereas, before this transformation, some had as little as 7 and 10.

Also, when paying more attention to the data, we can see that all the categories are positively skewed, meaning there is a bigger price difference between the third and fourth quartiles than between the first and second: prices spread more among the more expensive properties of each category than among the cheaper ones. The medians and means all differ, yet statistical significance cannot be asserted, as the interquartile ranges all overlap. The widest range of prices is for properties in the Northern Beaches area. Finally, the highest prices seem to belong to Airbnbs in the Northern Beaches and the lowest to South Sydney or the Greater Western Sydney area.

The cancellation policy also had some of its values transformed. Initially, the feature had 6 categories, two of which had fewer than 25 observations.

Figure 9: Bar chart illustrating the number of properties in each of the cancellation policies

This limited number of observations made merging categories necessary for further statistical inference: all samples need at least 25 observations, and this work aims for a minimum of 100 per category. The 6 categories were merged into 3 larger ones with a minimum of 1,500 observations each, as shown below.

The review scores were also changed, this time from a numerical to a categorical variable to fit Airbnb's classification categories. The initial variable provided ratings from 0 to 100, which were grouped as follows: 0 to 79, 80 to 95, 95+, and "no reviews" for the missing values. A rating of 80 to 95 would be roughly equivalent to a 4-out-of-5-star rating, and 95+ to a 5-star result. Furthermore, 0 to 79 would be equivalent to a three-star rating or less, which Airbnb considers a "bad" rating. This makes sense, as the average Airbnb rating was found to be 4.7 out of 5 stars (Zervas, G. et al. 2017).
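The banding above can be sketched as a small mapping function (the sample ratings are illustrative; NaN stands for a listing with no reviews):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, with NaN for listings that have no reviews yet
ratings = pd.Series([100, 96, 88, 72, np.nan])

def rating_band(score):
    # Bands follow the grouping described above; NaN becomes "no reviews"
    if pd.isna(score):
        return "no reviews"
    if score >= 95:
        return "95+"
    if score >= 80:
        return "80 to 95"
    return "0 to 79"

bands = ratings.apply(rating_band)
print(bands.tolist())
```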

Figure 10: Bar Chart illustrating the numbers of the different in property types

A similar approach to the review scores was taken for the property type. The initial feature had 26 different property types, which were engineered into 3 larger categories: apartment, house and other. With this transformation, the smallest category has over 500 observations. In Figure 10, we see that the price is lower when the property is an apartment compared to a house, and lower still when it is something else, such as a cabin or a cottage.

Finally, the availability of the properties over the year was transformed from a numerical variable into a categorical one with 4 levels: 0–2 weeks, 2–8 weeks, 2–4 months and 4+ months. We made the final category 4+ months, rather than showing further months, because Australian law does not allow short-term rental for over 180 days; all values above this threshold were treated as 180 days of availability.
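Assuming the four categories map onto day counts of roughly 0–14, 15–56, 57–120 and 121–180 (the exact cut points are an assumption, expressed in days), the binning is a single `pd.cut`:

```python
import pandas as pd

# Hypothetical availability_365 values, already capped at 180 days
availability = pd.Series([0, 10, 40, 100, 180])

# Bin edges matching the four categories above (weeks/months expressed in days)
bins = [-1, 14, 56, 120, 180]
labels = ["0-2 weeks", "2-8 weeks", "2-4 months", "4+ months"]

binned = pd.cut(availability, bins=bins, labels=labels)
print(binned.tolist())
```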

More categories were created when categorical variables were subdivided into new groups, such as specific amenities types (i.e. air conditioning or pools) through dummy encoding.

3.4 Dummy encoding

In order to build most of our models, we first needed to encode the categorical values, since some models require all features to be numerical. Encoding the categorical features therefore lets us implement any model. We used dummy encoding on all remaining categorical variables in our dataset.

Dummy variables take only the values 0 or 1, and dummy encoding is the process of converting a categorical variable into multiple dummy variables, one per category. This allows easy interpretation and calculation of odds ratios, and increases the stability and significance of the coefficients (Garavaglia, S. et al. 1998). The following variables were encoded: neighbourhood_cleansed, property_type, room_type, review_scores_rating, cancellation_policy, availability_365 and amenities.
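In pandas this is `get_dummies`; a minimal sketch on two of the columns listed above (the category values are illustrative):

```python
import pandas as pd

# Hypothetical slice of the categorical columns listed above
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Hotel room"],
    "cancellation_policy": ["strict", "flexible", "moderate"],
})

# drop_first avoids the dummy-variable trap (perfect multicollinearity)
encoded = pd.get_dummies(df, columns=["room_type", "cancellation_policy"],
                         drop_first=True)
print(encoded.columns.tolist())
```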

In the case of Amenities, once the variable was transformed into different types of property characteristics (i.e. balcony, internet), some of them were removed as they did not provide enough information or did not have enough values.

Figures 10 & 11: Bar charts illustrating the difference in numbers and price for TV and Internet

Internet and TV were two variables that were kept: most properties have these amenities, so they provide enough information. On the other hand, some variables (such as balcony) did not prove important, so they were removed. Initially, 27 variables were created out of amenities, but only 16 were kept.

3.5 Standardization and Normalization

To ensure the best results from our models, we normalise the values to avoid bias when fitting. Normalising ensures that all attributes are scaled in the same way, measured in the range [0, 1]. This is an important step: without normalisation we could end up with biased results, especially when using a regressor based on a distance measure. The attributes in the dataset are expressed in different units, so normalisation is critical for accurately predicting the price of new properties.

Standardisation and normalisation were carried out on the following features: 'Accommodates', 'Bathrooms', 'Bedrooms', 'Beds', 'Security_deposit_perc', 'Extra_people_perc', 'Cleaning_fee_perc', 'Guests_included' and 'Minimum_nights'.
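A minimal sketch of min-max scaling into [0, 1] with scikit-learn (two illustrative columns standing in for the features above):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical features on very different scales
df = pd.DataFrame({
    "Accommodates": [2, 4, 8, 16],
    "Minimum_nights": [1, 3, 30, 180],
})

# Min-max scaling maps every column into the range [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled.round(3))
```

Note that the scaler must be fitted on the training set only and then applied to the validation set, so no information leaks across the split.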

Figure 12: Bar charts illustrating the distribution of all numerical categories before
Figure 13: Bar charts illustrating the distribution of all numerical categories after

As shown in Figure 13, this appears to have helped some of the distributions, although some (e.g. bathrooms, bedrooms, beds, guests_included, minimum_nights) contain a large number of 0s, which means these features are not normally distributed. Most importantly, however, the target variable price now appears much more normally distributed.

3.6 Correlation and Multicollinearity

We then explored the correlation and multicollinearity between the features. From the heatmap (figure x), we find no multicollinearity between the features, except for beds, bedrooms, bathrooms and accommodates. However, given their high correlation with the response variable, we decided to keep them; these are in fact the features most correlated with the response.
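The check itself is a correlation matrix plus a threshold; a minimal sketch on synthetic data (the features and the 0.8 cut-off are illustrative, chosen so that beds tracks bedrooms closely, mimicking the collinearity found above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic correlated features: "beds" tracks "bedrooms" closely
bedrooms = rng.integers(1, 5, size=200)
df = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 2, size=200),
    "minimum_nights": rng.integers(1, 30, size=200),
})

corr = df.corr()
# Flag pairs above a typical multicollinearity threshold (excluding the diagonal)
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print([(a, b) for a in corr for b in corr if a < b and high.loc[a, b]])
```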

4. Methodology

4.1 Models created

We created a total of 17 models to solve the business problem, from simple models such as linear regression (used as a benchmark) to more complex models such as gradient boosting. The models were the following:

  1. Linear Regression
  2. Quadratic
  3. Quadratic Splines
  4. XGBoost
  5. XGBoost with Bayesian Hyperparameter (xbst_opt)
  6. Gradient Boosting
  7. Gradient Boosting with Bayesian Hyperparameter (gb_opt)
  8. KNN
  9. Random Forest
  10. Regression Tree
  11. Local General Additive model
  12. General additive model Splines
  13. Model Average (XGBoost & Local GAM)
  14. Model Average (gb_opt, xbst_opt, Random Forest)
  15. Model Average (gb_opt, xbst_opt, GAM Spline)
  16. Model Average (gb_opt & xbst_opt)
  17. Model Stacking (gb_opt, xbst, Random Forest, xbst_opt (metamodel))

We will go into more detail on three of the models we created in the following sections.

4.2 Data mining model (Regression Tree)

Firstly, we had to build a model that is easy to interpret, better known as the data mining model. In order to evaluate what the best hosts are doing, we needed a model that is easily interpretable and can be visualised, so we decided to build a regression tree.

We chose the regression tree as it is easy to read and understand, even for non-technical Airbnb hosts:

Figure 14: Regression tree illustrating the model

On top of that, the regression tree has a few key advantages. There is no need to carry out feature engineering, and trees have no issues with mixed data types or missing values. They perform feature selection automatically and are quite good at approximating complex interactions. However, regression trees also come with disadvantages: the predictive results we obtained were less than desirable.

When it comes to the regression tree, there is one main hyperparameter to choose: the maximum depth of the tree. To select it, we performed hyperparameter tuning using grid search cross-validation (the most common, and most expensive, of the methods). Setting a maximum depth is important because it controls over-fitting. The grid search found the optimal depth for this model to be 3. This model helps us identify what the most successful hosts are doing and quantify those insights: an Airbnb host can be told that the value of their listing will change by X if Y is done, everything else fixed. For example, the tree shows that choosing TV over no TV, everything else fixed, changes LogPrice by 0.649 (more detail in Figure 17).
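The depth search above can be sketched as follows (synthetic data stands in for the engineered listing features; the candidate depths are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Synthetic features/target standing in for the engineered listing data
X = rng.normal(size=(300, 5))
y = X[:, 0] * 0.6 + X[:, 1] * 0.3 + rng.normal(scale=0.1, size=300)

# Grid search over maximum depth, scored by (negative) RMSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```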

4.3 Best Model (XGBoost with Bayesian Hyperparameter)

An additional model we decided to implement is Extreme Gradient Boosting (XGB). XGB is one of the most successful and widely adopted machine learning models in business applications. Its flexibility and accuracy led our team to opt for this model: its sequential ensemble approach delivers speed and accuracy while remaining computationally cheap, thanks to a 'slow' learning approach that makes minor adjustments to ensure high performance (UC, 2018). After choosing the model, our next challenge was to pick the best hyper-parameters, a key determinant of performance, as we want them to minimise the loss of the objective function. Therefore, we conducted a random search cross-validation over a range for each parameter (details in Section 7.3).

In this specific scenario, the learning rate determines the speed at which the algorithm proceeds towards the optimum. The number of estimators simply refers to the number of trees; the maximum depth defines the number of splits in each tree, and thus the complexity of the model; and the subsample controls the fraction of training observations used. These hyper-parameters are shown in the table above.

After implementing a randomized search cross-validation, we obtain the best hyper-parameters for the model.
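As a minimal sketch of this random search, using scikit-learn's GradientBoostingRegressor as a stand-in for the XGBoost estimator (the xgboost package may not be installed everywhere; the same pattern applies to XGBRegressor, and both the data and the parameter ranges are illustrative):

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
# Synthetic stand-in for the engineered listing features and LogPrice
X = rng.normal(size=(300, 5))
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Random search over the hyper-parameters named above; ranges are illustrative
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": uniform(0.005, 0.095),  # 0.005-0.1
        "n_estimators": randint(100, 400),
        "max_depth": randint(2, 6),
        "subsample": uniform(0.6, 0.4),          # 0.6-1.0
    },
    n_iter=5, cv=3, scoring="neg_root_mean_squared_error", random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```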

Figure 15 shows that bedrooms is by far the most influential variable, followed by the entire home/apartment room type and the private room type. On the other hand, the Inner West Suburbs neighbourhood seems to influence the nightly price the least, followed by the "other" property type and TV. This information will be useful when determining what the best hosts should do to optimise revenue or profit.

Figure 15: Bar chart illustrating the importance of each of the variables for XGB

Our next step consists of fitting XGBoost with the tuned hyper-parameters and evaluating it on our validation set. The table below summarises some key information.

The table above shows that, as expected, the XGBoost model performs relatively well on the validation set. Even though the validation RMSE is not too high (0.345), the gap to the train RMSE (0.281) may be due to overfitting, whereby the model has 'picked up' random patterns generated by the independent variables.

Nevertheless, the performance of our XGB model can be improved with Bayesian hyper-parameter tuning, which can reach higher test performance with fewer iterations, optimising our model even further. XGB with Bayesian hyper-parameters uses more accurate approximations to find the best tree model. As before, we establish a range for each hyper-parameter (further details in Section 7.3).

The rationale behind the choice of the Range in the table shown above is as follows:

  • We select small lower and larger upper boundaries to allow the tuning to find its optimal learning rate (0.005–0.1).
  • We choose a relatively small number of estimators and a relatively shallow depth to ensure slow learning and low model complexity.
  • We allow the model to choose over a representative fraction of the train sample.
  • We include the hyper-parameter colsample_bytree, the subsample ratio of columns used when constructing each tree; this subsampling occurs once per tree.

After running the Bayesian optimisation, we obtain the best hyper-parameters for the model.

We now repeat the process, fitting XGBoost with Bayesian hyper-parameters and evaluating it on our validation set. The table below summarises some key information.

Interestingly, the adoption of XGBoost with Bayesian hyper-parameters yielded the lowest RMSE and the best R2 and MAE recorded so far on both the train and validation sets, making it our best model. In fact, with a slightly lower RMSE than XGBoost (0.341 vs 0.345), XGBoost with Bayesian hyper-parameters becomes the most effective model for predicting the nightly price of an Airbnb listing.

Another interesting aspect to mention is the updated variable importance according to XGB with Bayesian hyper-parameters (Figure 16). According to our best model, the entire home/apartment room type is the most influential variable, followed by the private room type and bedrooms. We also notice changes at the bottom of the graph, where air conditioning is the least important variable after the hotel room type and the number of guests included.

Figure 16: Bar chart illustrating the variable importance for XGB with BH

4.4 Stack Model

To improve the performance of our models, we decided to ensemble our best performing models via model stacking, using three base models and one meta-model. We stack only two levels because the computational cost of stacking these models is very high. Each model is tuned to its optimal hyperparameters individually before being used as a base model. The models we stack are those with the highest accuracy while being as diverse as possible: combining models that are less correlated minimises risk and consequently produces more accurate predictions (Juhi, 2019). We chose XGBoost with Bayesian hyperparameters as the meta-model as it is our current best performing model.

We use the “StackingRegressor” method from sklearn to fit the model stack. We start by fitting the base models individually, then use their concatenated prediction outputs as the input for the meta-model. The meta-model is trained through cross-validation to avoid overfitting; we chose k = 4, which is quite low, to save computational cost while staying within a reasonable ratio. Ideally, the base models would be fitted through cross-validation as well; however, “StackingRegressor” fits the base models on the full X while the meta-model uses cross-validated predictions of the base estimators (scikit-learn, n.d.), which may lead to overfitting. We overcome this by cross-validating the base models earlier.
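A minimal sketch of the stacking setup described above, using sklearn's `StackingRegressor` with cv=4. The base and meta-models here are illustrative stand-ins (a `GradientBoostingRegressor` in place of the tuned XGBoost), not the report's tuned models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

# Three diverse base models (illustrative choices; the report tuned each one beforehand)
base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
]

# GradientBoostingRegressor stands in for the tuned XGBoost meta-model;
# cv=4 matches the k chosen above to limit computational cost.
# StackingRegressor refits the base models on the full X, while the
# meta-model is trained on their cross-validated predictions.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=GradientBoostingRegressor(random_state=0),
    cv=4,
)

# Synthetic data standing in for the listings training set
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
stack.fit(X, y)
print(stack.predict(X[:3]))
```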

Model Stacking Performance

The model stacking does not seem to improve on our best performing model. This might be due to overfitting, since cross-validation is not performed when fitting the base models. There may also be a high correlation between the base models, so stacking offers only a small improvement over the average score.

Beyond its high computational cost, the main disadvantage of model stacking is the reduced interpretability that comes with the added complexity, which makes it very difficult to draw crucial business insights from the model. Moreover, selecting the models for the ensemble is an art that requires a lot of experience to master (Juhi, 2019). On top of that, the improvement from model stacking is usually not very significant; the extra work and complexity involved rarely pay off in business cases.

5. Model validation

We then proceeded to validate our models using the validation set. Model evaluation is also done by fitting the best models from the validation set to the test set and submitting the predictions to Kaggle. These are the results we obtained from each of the models:
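For reference, the metrics compared across the models (RMSE, MAE and R²) can be computed with sklearn; the values below are toy numbers on the log-price scale, not the report's results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy predictions on the log-price scale (illustrative only)
y_true = np.array([5.0, 4.8, 5.3, 5.1])
y_pred = np.array([5.1, 4.7, 5.2, 5.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
r2 = r2_score(y_true, y_pred)                       # coefficient of determination
print(rmse, mae, r2)
```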

6. What are the best hosts doing?

The analysis for this section is based on the regression tree model, as the tree visualisation gives it a high level of interpretability. To provide solutions and actionable insights for all types of hosts, we divide the insights we found into two categories: short-term decisions and long-term decisions.

This is because we found that some of the most profitable decisions would be substantial long-term changes to the properties, even renovating fixed features of an apartment or house such as bathrooms or bedrooms. We therefore focused on generating useful insights for hosts considering an expansion or diversification in the characteristics of their short-rental properties.

However, this paper acknowledges that hosts' financial capacity to invest capital limits long-term investment plans. We therefore include a section on short-term decisions and small investments a host could consider for his or her current listing, which can help optimize the price of the property within its capacities.

6.1 Short term decisions

6.1.a Amenities

When we looked into short-term investments, the first thing that came to mind was the amenities each property includes, as they can differentiate a listing from otherwise similar properties in terms of size, rooms or location. When investigating which amenities had the highest impact on the price of the property using data mining, we found that the most important ones were TV and internet (Figure 16).

Therefore, by incorporating the relatively low-cost amenities mentioned above, a property can reach the highest nightly rate for its number of bedrooms, property type and location. Using a regression tree, we identified key insights such as: including a TV in a property (everything else fixed) increases the LogPrice of the apartment by 0.649.

Figure 17: Extract of the regression tree nº2
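Because the response variable is log-transformed, a +0.649 change in LogPrice acts multiplicatively on the dollar price; a quick check:

```python
import math

# A +0.649 change in LogPrice multiplies the nightly price by exp(0.649)
multiplier = math.exp(0.649)
print(round(multiplier, 2))  # roughly 1.91, i.e. about a 91% higher nightly price
```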

According to our model with the lowest RMSE, if you need to invest in a single amenity to add to your apartment, it should be the internet as it is the amenity with the highest variable importance in the model, as seen below.

6.1.b Minimum nights

The next short-term decision a host can make is to require a minimum number of nights rather than allow one-night rentals, as this maximizes price and revenue, the response variable we focus on here. Unlike the previous decisions, which involved an economic investment, this change requires no external outlay and is simply an organisational one.

Requiring a minimum number of nights also mitigates certain risks. If no minimum is required, the host risks the property sitting empty between one-night stays; with a minimum in place, the host has some assurance that the guest will remain for several nights.

Figure 19: Extract of regression tree nº3

For example, using the regression tree we found that, keeping all other criteria constant, for a one-bedroom property in Western Sydney that is not a full house or apartment, the price increased when a minimum of at least three nights was required. The price is $153.70 (log-transformed value of 5.035) with minimum nights, and it decreases by over 15% to $128.77 (log-transformed value of 4.858) without.

Figure 20: Extract of regression tree nº4

Another example is a property with 4 or more bedrooms and a high cleaning fee. Without a minimum stay, the average nightly price is around $168.85 (log value of 5.129); with a minimum of 3 nights, the price rises by over 30% to $220.74 (log value of 5.397).
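The dollar figures in these examples are back-transformed from the tree's LogPrice leaves; a quick sanity check on the Western Sydney case:

```python
import math

# Back-transform the tree's LogPrice leaves to nightly dollar prices
with_min_nights = math.exp(5.035)  # about $153.70 with a 3-night minimum
without_min = math.exp(4.858)      # about $128.77 without one
pct_drop = (with_min_nights - without_min) / with_min_nights * 100
print(round(with_min_nights, 2), round(without_min, 2), round(pct_drop, 1))
```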

Therefore, a host can implement a minimum number of nights and positively impact the final outcome, as illustrated by the two situations above.

6.2 Long term decisions

6.2.a Long term renovation

The regression tree shows that the most important feature is the number of bedrooms a property has. The root node splits on whether the property has fewer than 1.4 bedrooms, which we round down to 1 bedroom since 1.4 bedrooms is not meaningful. Properties with at most 1 bedroom are associated with a lower nightly price ($106.27) compared to those with more than 1 bedroom ($262.70). The number of samples in the two nodes is almost equal, which suggests this association holds across the dataset.

In the next node, we see that when the number of bedrooms is more than 3, the rental price is significantly higher than for properties with 2 to 3 bedrooms ($570.78 and $221.41 respectively). However, an increasing number of bedrooms must be matched by a proportionate number of bathrooms to justify setting a higher price with confidence.

Therefore, a host can invest in adding bedrooms to their property to increase the rental price. The additional bedrooms are best accompanied by a proportionate number of bathrooms to improve usability, justifying a logically higher rent.
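As an illustration of where a split threshold like 1.4 comes from, the sketch below fits a depth-1 regression tree to synthetic data (not the report's dataset) in which the nightly price jumps at 2 bedrooms; sklearn places the root threshold at the midpoint between adjacent feature values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic listings: nightly price jumps once a listing has 2+ bedrooms
bedrooms = np.array([[0], [1], [1], [2], [2], [3], [4], [5]], dtype=float)
price = np.array([105, 106, 107, 260, 261, 262, 263, 264], dtype=float)

tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(bedrooms, price)

# Root split threshold: the midpoint between 1 and 2 bedrooms (here 1.5;
# the report's 1.4 arises the same way from its own data)
print(tree.tree_.threshold[0])
```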

6.2.b Long term investments

By consulting our data mining model, we identified long-term investments as an opportunity for hosts to optimize the nightly stay price of an Airbnb listing in the future. In fact, we considered a likely business scenario whereby a property investor or simply an independent entrepreneur decides to expand his or her investment portfolio by adopting the Airbnb model.

Figure 21: Map of Sydney illustrating the range of LogPrice in different locations

Data mining allowed us to discover that, across all the neighbourhoods, only the Eastern Suburbs consistently yielded a higher price; every other Sydney suburb generated a lower outcome. These insights indicate that if an investor were considering a location to either purchase in or initiate a leasing strategy with Airbnb, choosing the Eastern Suburbs would be a more secure, lower-risk investment than any other suburb in Sydney.
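A comparison like this can be produced with a simple group-by over the listings; the column names and values below are assumed for illustration, not the report's actual schema:

```python
import pandas as pd

# Toy listing data (column names and values are assumptions)
listings = pd.DataFrame({
    "neighbourhood": ["Eastern Suburbs", "Eastern Suburbs", "North Shore",
                      "Northern Beaches", "Western Sydney", "North Shore"],
    "log_price": [5.5, 5.3, 4.9, 5.0, 4.7, 4.8],
})

# Average LogPrice per neighbourhood, highest first
avg = listings.groupby("neighbourhood")["log_price"].mean().sort_values(ascending=False)
print(avg.index[0])  # -> "Eastern Suburbs"
```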

That is because Sydney's main tourist attractions and most popular suburbs lie within the vicinity of the Eastern Suburbs: areas such as Bondi, Double Bay, Surry Hills and part of Sydney's CBD are well known for their unique beaches, clubs, nightlife, shops and monuments.

Therefore, given that Airbnb's business model offers a relatively low-cost stay to travellers, it is no surprise that demand for a stay in the Eastern Suburbs is higher than anywhere else in Sydney, which eventually leads owners to set a more competitive price. However, there are limitations to consider when generating this insight: across all the neighbourhoods or suburbs, the Eastern Suburbs contributed the largest number of observations within this categorical variable.

This could mean that the model predicts prices in other suburbs less accurately than in the Eastern Suburbs. We further acknowledge that this may be because more business is centred around the Eastern Suburbs of Sydney, such as Surry Hills and Bondi, while fewer businesses are established in the North Shore and Northern Beaches suburbs.

6.3 Limitations

Several limitations have arisen and need to be accounted for. The dataset itself provides deep insight into the various characteristics of Airbnb properties in Sydney, along with certain host information; however, there is no data on the rental periods or the success of each property. This means a host's acceptance rate could be very low simply because the property is only listed seasonally and is in high demand during that period. Another problematic situation is an individual listing a property at a very high price, with the characteristics of a highly coveted property, while achieving a very low rental rate. We use the review ratings to mitigate this limitation, on the assumption that a place that has been rented will have more ratings, which is not necessarily always the case.

Another limitation lies in the method itself, the regression tree. Low predictive accuracy can be an issue, though the large size of the dataset and the transformed variables we added help limit the loss of accuracy. Secondly, instability is a known problem with this data mining method: a small change in the data can have a butterfly effect on the rest of the tree, since trees tend to overfit and be influenced by noise. Due to the hierarchical nature of the regression tree, a small change leads to significant changes in results, especially in the lower levels of the tree. Finally, regression trees have an innate difficulty capturing additive structure: to interpret an effect precisely, you must take into account all the levels above the one observed. The large number of levels also leads to a lack of statistical significance, as the sample sizes become smaller.

Nonetheless, the method also has advantages that outweigh these disadvantages. Its easy interpretability allows it to be used in settings other than technical discussions, making it accessible to more people. It also serves many practical purposes: missing values are accounted for, variable selection is done automatically and there is little need to engineer the features.

7. Bibliography

Airbnb. (2020). What is a superhost? Retrieved from https://www.airbnb.com.au/help/article/828/what-is-a-superhost

Airbnb. (2013). Airbnb announces “verified identification”. Retrieved from https://www.airbnb.com.au/press/news/airbnb-announces-verified-identification

Dong, Y., & Peng, C. Y. J. (2013). Principled missing data methods for researchers. SpringerPlus, 2(1), 222.

Garavaglia, S., & Sharma, A. (1998, October). A smart guide to dummy variables: Four applications and a macro. In Proceedings of the Northeast SAS Users Group Conference (p. 43).

Juhi. (2019). Simple guide for ensemble learning methods. Retrieved from https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2

Scikit-learn. (n.d.). sklearn.ensemble.StackingRegressor. Retrieved from scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html

UC. (2018). UC Business Analytics R Programming Guide. Retrieved from Gradient Boosting Machines: http://uc-r.github.io/gbm_regression#idea

Zervas, G., Proserpio, D., & Byers, J. W. (2017). The rise of the sharing economy: Estimating the impact of Airbnb on the hotel industry. Journal of Marketing Research, 54(5), 687–705.

Written by Salvatore Sidoti, Gabby Joanne Christie Wijaya, Vladimir Tesniere & Rosa Caminal Díaz
