Predicting AirBnB Prices in NYC
The goal of this project was to build an AirBnB clone focused solely on NYC. I worked on a team of UI, front-end, back-end, and data science developers. My job was to build the API that connected the model to the back-end.
While another data scientist worked on a stronger model for predicting price, I wrote a quick baseline model to plug into the API. I got my data from Kaggle and hosted it on my Github here for personal ease of access.
I started where every DS project starts: data exploration and cleaning. Let’s take a look at the dataframe:
Apologies for the width…there are a lot of features here. It looks like we have plenty of great features to train a model with, but that’s not actually the case. I was limited to only using the latitude, longitude, and minimum_nights columns. More on that later.
Let’s look at some descriptive elements of the dataset. Nulls, means, standard deviations, etc.:
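For readers following along without the screenshots, here is a minimal sketch of the kind of checks involved. The dataframe below is a toy stand-in with made-up values, and the reviews_per_month column is only a hypothetical example of a column containing nulls; the real numbers come from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings dataframe; every value here is made up.
df = pd.DataFrame({
    "latitude": [40.64, 40.75, 40.80, 40.68],
    "longitude": [-73.97, -73.98, -73.94, -73.95],
    "minimum_nights": [1, 1, 3, 2],
    "reviews_per_month": [0.21, np.nan, 4.64, np.nan],  # hypothetical column with nulls
    "price": [149, 225, 150, 89],
})

# Count of null values per column.
null_counts = df.isnull().sum()

# Count, mean, standard deviation, min, quartiles, and max per numeric column.
summary = df.describe()

print(null_counts)
print(summary.loc[["mean", "std"]])
```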
As we can see in the first image, this data doesn’t contain many null values, and it doesn’t contain any in the features we’re interested in. In the second image, the main thing we need to take note of is the mean value for the price column. Our baseline for predictions is the mean value of the target feature. This means that 152.7 is our baseline prediction for this project.
We can calculate how far off the baseline would be on average by subtracting the mean prediction from each true price and storing the results in an errors variable. Then we take the mean of the absolute value of each entry in errors. This gives us the mean absolute error of our baseline: 92.4.
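The calculation described above can be sketched like this. The five prices are made up so the arithmetic is easy to check by hand; they are not from the real dataset.

```python
import pandas as pd

# Toy prices standing in for the real price column.
df = pd.DataFrame({"price": [100, 150, 200, 50, 250]})

# Baseline: predict the mean price for every listing.
baseline = df["price"].mean()

# Error of the baseline for each listing, then the mean of the absolute errors.
errors = baseline - df["price"]
baseline_mae = errors.abs().mean()

print(f"Baseline prediction: {baseline:.1f}")
print(f"Baseline mean absolute error: {baseline_mae:.1f}")
```

With these toy numbers the baseline is 150 and the errors are 50, 0, -50, 100, and -100, so the mean absolute error is 60.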
Relative to the mean price in this dataset, 152, that’s pretty far off. That’s why it’s just a baseline, though. It gives you a place to start improving from.
The next step is to create x_train and y_train dataframes that contain our desired features and target. As I said earlier, we’re limited to only latitude, longitude, and minimum_nights in this project.
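A minimal sketch of that selection, using a toy dataframe with made-up rows. The name column is only there to show that unused columns get dropped.

```python
import pandas as pd

# Toy rows standing in for the listings data; the values are made up.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "latitude": [40.64, 40.75, 40.80, 40.68],
    "longitude": [-73.97, -73.98, -73.94, -73.95],
    "minimum_nights": [1, 1, 3, 2],
    "price": [149, 225, 150, 89],
})

features = ["latitude", "longitude", "minimum_nights"]
target = "price"

# Keep only the three allowed features, and pull out the target separately.
x_train = df[features]
y_train = df[target]

print(x_train.shape, y_train.shape)
```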
After that, all that’s left before we can train our model is to split the data to give us a test set. I’ll use scikit-learn’s train_test_split function for this. My test set will be 1/4 of the full set.
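Here is a rough sketch of the split on synthetic data. The X and y names, the random values, and the random_state seed are all assumptions made for illustration; only the 1/4 test size comes from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target; values are random, not real.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "latitude": rng.uniform(40.5, 40.9, n),
    "longitude": rng.uniform(-74.2, -73.7, n),
    "minimum_nights": rng.integers(1, 30, n),
})
y = pd.Series(rng.uniform(30, 500, n), name="price")

# Hold out a quarter of the rows as the test set.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(x_train.shape, x_test.shape)
```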
Finally we can train the model. I’ll be using a simple linear regression model in this case.
Let’s take a look at how this simple model performs:
We have a mean prediction value of 152.7, barely higher than the mean price feature in the raw data. The amount of error we have in our model’s predictions is 92.1, which is only technically better than our mean baseline.
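For reference, fitting and scoring a model like this might look roughly as follows. The data is synthetic (random prices with no real signal), so the printed numbers will not match the 152.7 and 92.1 above; the point is only the shape of the workflow, and the variable names are my own choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings data; values are random, not real prices.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "latitude": rng.uniform(40.5, 40.9, n),
    "longitude": rng.uniform(-74.2, -73.7, n),
    "minimum_nights": rng.integers(1, 30, n),
})
y = pd.Series(rng.uniform(30, 500, n), name="price")

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the linear regression on the training split only.
model = LinearRegression()
model.fit(x_train, y_train)

# Score the model on the held-out test split.
predictions = model.predict(x_test)
model_mae = mean_absolute_error(y_test, predictions)

# Score the mean baseline on the same test split for comparison.
baseline_mae = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))

print(f"Mean prediction: {predictions.mean():.1f}")
print(f"Model MAE: {model_mae:.1f}")
print(f"Baseline MAE: {baseline_mae:.1f}")
```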
Kind of a disappointing result, but all that this tells me is that it’s hard to predict price with only latitude, longitude, and minimum_nights as inputs into a linear regression model. This would feel like a waste of time if it had taken longer than a few minutes. Remember, this was only designed to be a dummy model so that I had something to throw into the API while I waited for the real model.
The next step in building the API is to write the predict_price function. The API will call this function to pass the inputs into the model and receive the output.
The predict_price function takes in three parameters: the input received from the backend, the model, and the pipeline.
After we pass the input into the function, it will need to be transformed according to the pipeline. Once transformed, we return the output of model.predict(x).
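A minimal sketch of what such a function could look like. The post doesn’t say what the pipeline contains, so this example assumes a fitted scikit-learn StandardScaler purely as a placeholder; the toy training rows and parameter names are also made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def predict_price(x, model, pipeline):
    """Transform the raw input with the fitted pipeline, then predict."""
    x_transformed = pipeline.transform(x)
    return model.predict(x_transformed)


# Toy training data (made-up values) so the example runs on its own.
train = pd.DataFrame({
    "latitude": [40.64, 40.75, 40.80, 40.68, 40.73],
    "longitude": [-73.97, -73.98, -73.94, -73.95, -74.00],
    "minimum_nights": [1, 1, 3, 2, 5],
})
prices = [149, 225, 150, 89, 200]

# Assumption: the "pipeline" is a single fitted scaler. Any object with a
# transform method would work the same way here.
pipeline = StandardScaler().fit(train)
model = LinearRegression().fit(pipeline.transform(train), prices)

# One new listing, shaped as a one-row dataframe with the same columns.
new_listing = pd.DataFrame(
    [{"latitude": 40.70, "longitude": -73.95, "minimum_nights": 2}]
)
prediction = predict_price(new_listing, model, pipeline)
print(prediction)
```

The function returns whatever model.predict returns, which for scikit-learn is a NumPy array with one entry per input row.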
Finally, we can move on to building the API.
I’ll be using Flask API since I was already familiar with it. My API is very simple, taking in only three inputs: latitude, longitude, and min_nights (the minimum number of nights stayed). The API will feed the inputs into the model, and then it will return the price.
I won’t dive into how to build an API with Flask since there are plenty of tutorials out there that cover the subject better than I could ever hope to. With that being said, let’s get started.
The path we want our inputs to be passed into will be called the ‘/predict’ route. Inside this route will be the predict function.
First, we need to get the json data from the user. We can do that easily through Flask with the flask.request.json line. This will collect any json data that is sent to the /predict route in a request.
Second, that json data needs to be converted into something a little easier for me to work with. The python module json has the json.dumps() function, which serializes a Python object into a JSON-formatted string.
Third, I want to store the json dump into a DataFrame. I chose to store the dump in a df because it needed to be transposed and it was easy to access the data from inside. I could have chosen another data type for this job (and perhaps I should’ve) and it would’ve done the job just as well.
Fourth, the df can be passed into the predict_price function.
Fifth, we want to send the backend the response in json format. To do that, we’ll use the Flask function jsonify. jsonify can’t serialize the NumPy array that a scikit-learn model’s predict method returns, so we’ll convert the output of the predict_price function to a dict of plain Python values first. All that’s left is to return the json output and we’re done!
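Putting the five steps together, a route along these lines would do the job. This is a simplified sketch, not the deployed code: it builds the one-row DataFrame directly from the parsed request instead of going through json.dumps and a transpose, it assumes a StandardScaler as the pipeline, it trains on made-up rows, and the "price" response key is my own choice.

```python
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["latitude", "longitude", "minimum_nights"]

# Toy training data (made-up values) so the example runs on its own.
train = pd.DataFrame({
    "latitude": [40.64, 40.75, 40.80, 40.68, 40.73],
    "longitude": [-73.97, -73.98, -73.94, -73.95, -74.00],
    "minimum_nights": [1, 1, 3, 2, 5],
})
prices = [149, 225, 150, 89, 200]

# Assumption: a fitted scaler stands in for the real pipeline.
pipeline = StandardScaler().fit(train)
model = LinearRegression().fit(pipeline.transform(train), prices)


def predict_price(x, model, pipeline):
    """Transform the input with the pipeline, then return the prediction."""
    return model.predict(pipeline.transform(x))


app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # 1. Parsed JSON body of the request, as a Python dict.
    data = request.json
    # 2-3. One-row DataFrame; rename the API's min_nights to the model's column name.
    df = pd.DataFrame([data]).rename(columns={"min_nights": "minimum_nights"})
    # 4. Run the model on the three features, in training order.
    prediction = predict_price(df[FEATURES], model, pipeline)
    # 5. Convert the NumPy value to a plain float inside a dict for jsonify.
    return jsonify({"price": float(prediction[0])})


# Exercise the route with Flask's built-in test client (no server needed).
with app.test_client() as client:
    response = client.post(
        "/predict",
        json={"latitude": 40.70, "longitude": -73.95, "min_nights": 2},
    )
    result = response.get_json()
    print(result)
```

The test client at the bottom plays the role of the backend here; against a running server, the same JSON body can be sent as a POST request from a tool like Postman.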
One of the many things I learned during this project was how to work on a diverse team toward a common goal. There were many people depending on me to get the API up and running, and I was depending mostly on the ML engineer to get me a good model. I learned about effective communication: not only asking others what was expected of me and my API, but also expressing what was and wasn’t possible given the timeline of the project.
Unfortunately, our ML engineer wasn’t able to complete the good model in time. As a result, the API, which I deployed on Heroku, isn’t very good at predicting. It was still good practice in building APIs, and a great learning experience all around. I’ll leave you with a screenshot of it working live. Feel free to try it out yourself in something like Postman!