How to handle NaN values in rental price prediction project - python

I am working on a rental price prediction project where I web scraped data from Facebook Marketplace. When extracting the areas of the properties, I am encountering many NaN values.
I am web scraping from a small city and it is unlikely that I will be able to find more data. How can I effectively handle the NaN values in my data? Are there any machine learning algorithms or external sources of information that can be used to impute missing values in this situation?
Any suggestions or advice would be greatly appreciated. Thank you in advance!
I have considered using the mean or median based on property type, number of bedrooms, and bathrooms, but I am not sure if this is the best approach.

There are many methods you can use to handle missing values in your data. As you mentioned, the general approach is to fill with the mean or median. I recommend grouping the rows first, then filling with the group mean or median:
df['a'] = df['a'].fillna(df.groupby('b')['a'].transform('mean'))
I reckon you can use the zipcode or something similar to group them. Another thing you can do, before filling the empty places, is to create another column that indicates whether the value was missing. This may help your model treat those rows differently and avoid overfitting on the imputed values.
For further info, see this link.
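A minimal sketch of both ideas together (the column names `area` and `zipcode` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame; "area" and "zipcode" are hypothetical column names
df = pd.DataFrame({
    "zipcode": ["11111", "11111", "22222", "22222"],
    "area":    [55.0, np.nan, 80.0, 75.0],
})

# 1) Keep a flag for rows where the value was originally missing
df["area_missing"] = df["area"].isna().astype(int)

# 2) Fill NaNs with the mean area of the same zipcode group
df["area"] = df["area"].fillna(df.groupby("zipcode")["area"].transform("mean"))

print(df)
```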

Related

How to forecast data based on variables from different datasets?

For a university project I'm trying to see what relation oil production/consumption and the crude oil price have to certain oil stocks, and I'm a bit confused about how to organize this data.
I basically have 4 datasets:
-Oil production
-Oil consumption
-Crude oil price
-Historical price of certain oil company stock
If I am trying to find a way these 4 tables relate, what is the recommended way of organizing the data? Should I manually combine all this data into a single Excel sheet (seems like the most straightforward way), or is there a more efficient way to go about this?
I am brand new to PyTorch and data, so I apologise if this is a very basic question. Also, the data can basically get infinitely larger, by adding data from additional countries, other stock indexes, etc. So is there a way I can organize the data so it’s easy to add additional related data?
Finally, I have the month-to-month values for certain data (eg: oil production), and day-to-day values for other data (eg: oil price). What is the best way I can adjust the data to make up for this discrepancy?
Thanks in advance!
You can use pandas.DataFrame to create four dataframes, one for each dataset, then combine them into a single dataframe using merge.
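As a rough sketch of that idea, plus one way to handle the daily-vs-monthly mismatch by resampling the daily series to monthly before merging (the file and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical files: daily crude price and monthly production figures
price = pd.read_csv("crude_price.csv", parse_dates=["date"])          # daily rows
production = pd.read_csv("oil_production.csv", parse_dates=["date"])  # monthly rows

# Bring the daily series down to monthly frequency (e.g. the mean price per month)
price_monthly = (
    price.set_index("date")
         .resample("MS")   # month-start bins
         .mean()
         .reset_index()
)

# Combine on the shared date column; "outer" keeps months that appear in either table
combined = price_monthly.merge(production, on="date", how="outer")
```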

If I'm trying to predict a label for a sample, but the sample is missing features, how should I deal with it?

I'm having a conceptual issue right now: I know that sklearn does not like it when .predict() is used on examples with NaN values, but what should I do if I want to predict a label for an example with NaN/missing features?
Currently, I'm replacing the NaN cells with -999 as a placeholder measure, but I'm not sure if that's a good idea. Unfortunately, searching about missing values in prediction samples doesn't yield helpful results.
One approach you could try is to fill in the missing value in your test example with the value you use to fill in missing values in your training dataset. For example, if you fill in missing values for that feature with the mean of the training data, you could use that mean to fill in the missing value in your test example.
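A minimal sketch of that idea, with hypothetical column names: compute the fill values on the training data only, then reuse them for any test sample.

```python
import numpy as np
import pandas as pd

# Hypothetical training data and a single test sample with a missing feature
X_train = pd.DataFrame({"sqft": [600.0, 900.0, np.nan, 1200.0],
                        "beds": [1, 2, 2, 3]})
x_test = pd.DataFrame({"sqft": [np.nan], "beds": [2]})

# Compute the fill values on the training data only...
train_means = X_train.mean()

# ...and reuse those exact values for both training and test rows
X_train_filled = X_train.fillna(train_means)
x_test_filled = x_test.fillna(train_means)
```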
Machine learning models generally perform better when your data is complete, so it is advisable to impute missing values with a summary statistic or with information from closely located data points (using KNN, for instance).
Scikit-learn contains a suite of algorithms to impute missing values. The most common method is to use the SimpleImputer with a "mean" strategy.
You can also use simpler approaches and use Pandas to either fill all NAs in your dataset with fillna() or remove the NAs with dropna().
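A short sketch of both options (toy data for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# SimpleImputer with the "mean" strategy replaces NaNs with each column's mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# The pandas equivalents
df = pd.DataFrame(X, columns=["a", "b"])
df_filled = df.fillna(df.mean())  # fill NAs with column means
df_dropped = df.dropna()          # or drop incomplete rows entirely
```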
It is important that you familiarize yourself with the data that you are working with. Sometimes missing data has a meaning to it. For instance, when working with income data, some very affluent people have refused to disclose their income, whereas people with low income would always disclose it. In this case, if the income of the former group was replaced with 0 or the mean, the results of the prediction could have been invalid.
Have a look at this step-by-step guide on how to handle missing data in Python.

How can I find a similarity value of a data point to a specific group of data?

I have a data set. It contains some information about customers, and its columns contain numbers.
The data describes the behavior of these customers in the month before they left the system, so we know for certain that these customers left the system within a month.
An up-to-date customer behavior data set is also available, but since it is current, we do not know whether these customers will leave the system or not.
Both data sets contain the same features.
In fact, I would like to use what I learn from the first data set to find, for each customer in the second data set, a probability of leaving the system, i.e. how similar they are to the customers in the first data set.
I tried many methods for this, but none of them worked well, because I cannot create a data set of customers who will not leave the system. For this reason, the classic algorithms in the sklearn library (classifiers or regression algorithms) couldn't solve my problem in real life, because I can't determine the Y column content precisely.
I am not sure if this question can be asked here.
But what method should I follow for such a problem? Is there an algorithm that can solve this? What kind of research should I do? With which keywords can I analyze the solution of the problem?
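One keyword that may help the search is "one-class classification" (also "positive-unlabeled learning"): you fit a model only on the group you do have labels for, then score how similar each new customer is to that group. A rough sketch of the idea, assuming both tables are purely numeric and share the same columns (the file names below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical files: customers known to have left, and current customers
churn_df = pd.read_csv("churned_customers.csv")    # all numeric feature columns
current_df = pd.read_csv("current_customers.csv")  # same columns, unknown outcome

# Put the features on a comparable scale, fitted on the known-churn group
scaler = StandardScaler().fit(churn_df)
X_churn = scaler.transform(churn_df)
X_current = scaler.transform(current_df)

# Fit a one-class model on the churn group only, then score the current customers:
# higher score_samples values mean "more similar to the churn group"
model = OneClassSVM(nu=0.1, gamma="scale").fit(X_churn)
current_df["churn_similarity"] = model.score_samples(X_current)
```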

ML Models that are null invariant

My company's hosting a friendly data science competition based on the PhysioNet challenge here
But with a much simpler scoring system and unfortunately no prize XD. Some quick background: I'm not a data science/machine learning expert by any means. I've done some simple projects messing around with standard library models like kNN, regression, etc., and I've worked with data with missing values, but in those cases it wasn't 95+% missing and you could safely impute using the mean or median.
A quick overview of the problem: much of the data is missing because the values being measured are test results, and due to costs, tests are taken infrequently and only when ordered for specific reasons (most likely because a physician suspects something's wrong with the patient). I've read a bunch of the papers from the PhysioNet challenge submissions and have come up with a few points. I've chosen features based on those used to identify sepsis and then used PCA to see if any of them are highly correlated with other sparse features that I could drop. Missing values are imputed with forward-fill if possible, but if there's no previous value they remain NaN. For features that are decently populated, I can fill missing values with the mean or median depending on the shape of the data.
Also I'm working in python, if that makes a difference for modules and such.
But that's where I hit a bit of a problem. There are still a lot of null values left, and imputing doesn't make sense (if a feature is 98% null, imputing with the mean would introduce tremendous bias), but discarding them also seems bad. For example, a patient who stayed for a very short time and left without getting any tests taken because they had a quick recovery would be missing all their test values. So, ironically, the lack of data actually tells you something in some cases.
I was trying to do some research on what models can handle null values, and so far the internet has given me regression trees and gradient boosting models, both of which I've never used, so I don't know how well they work with missing values. Also, I've read some conflicting information that some of them actually just use mean imputation under the hood to fill in the values? (Though I might be wrong since, again, I don't have any first-hand experience with these models. But see for example this post.)
So tl;dr: if your dataset has null values you don't want to throw out, what are some models that would handle that well? And for regression trees and gradient boosting models, how do they handle null values? It seems like they replace them in some cases, but there are a lot of different sources conflicting on how. The most popular model I seem to be running into is XGBoost, which also does well with structured data (which this is).
EDIT: Some relevant analysis: the data is highly skewed. There are 400K+ entries and 90% of them are non-sepsis. Also, the fields with high sparsity are 90-99% null, which is why imputing with the mean or median doesn't make sense to me. Forward-filling lowers that number by quite a bit, but there's still a significant amount. There are also cases where a patient will have 100% null values for a field because they never had a test requested (so there's no way to impute even if you wanted to).
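For what it's worth, as I understand it both XGBoost and scikit-learn's HistGradientBoostingClassifier accept NaN inputs directly: at each split they learn a default branch direction for missing values rather than mean-imputing them. A minimal sketch (the toy matrix and parameter values below are made up for illustration):

```python
import numpy as np
from xgboost import XGBClassifier  # scikit-learn's HistGradientBoostingClassifier also accepts NaNs

# Toy feature matrix with missing test results left as NaN (no -999 placeholder, no imputation)
X = np.array([[1.2, np.nan, 0.3],
              [np.nan, 4.1, 0.9],
              [2.2, 3.8, np.nan],
              [0.7, np.nan, 1.5]])
y = np.array([0, 1, 1, 0])

# XGBoost treats np.nan as "missing" and learns a default direction per split;
# scale_pos_weight is one knob for the 90/10 class skew mentioned above.
model = XGBClassifier(n_estimators=50, max_depth=3, scale_pos_weight=9, eval_metric="logloss")
model.fit(X, y)
print(model.predict(X))
```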

How to find out how many items in a dataframe are within a certain range?

I'm currently doing some analysis on the stats of my podcast, and I have merged my Spotify listening numbers with the ones from my RSS-feed in pandas. So far so good, I now have a dataframe with a column of "Total" which tells me how many listeners I had on each episode and what the average number of listeners is.
Now, what I want to do is see how many of my episodes fit into (at least) three categories: Good, Normal and Bad. So I need to divide my Totals into three ranges and then see how many of my episodes land within each of those ranges. I have some limited experience messing around with Python and pandas, but it's been a while since I last sat down with it and I don't really know how to approach this problem.
Any help is highly appreciated!
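One way to do this with pandas is pd.cut (your own fixed cut-offs) or pd.qcut (three equal-sized groups), followed by value_counts. A small sketch with made-up listener numbers:

```python
import pandas as pd

# Hypothetical "Total" listener counts per episode
df = pd.DataFrame({"Total": [120, 340, 95, 410, 220, 180, 510, 60]})

# Option 1: your own fixed cut-offs for Bad / Normal / Good
df["category"] = pd.cut(df["Total"],
                        bins=[0, 150, 350, float("inf")],
                        labels=["Bad", "Normal", "Good"])

# Option 2: three equal-sized groups derived from the data itself
df["category_q"] = pd.qcut(df["Total"], q=3, labels=["Bad", "Normal", "Good"])

# Count how many episodes fall into each range
print(df["category"].value_counts())
```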
