Pandas: filling NaN has poor performance - avoid iterating over rows? - python

I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use this to fill every NaN with a different value.
I created a lookup table:
srch_destination_id | Value
2                   | 0.0110
3                   | 0.0000
5                   | 0.0207
7                   | NaN
8                   | NaN
9                   | NaN
10                  | 0.1500
12                  | 0.0114
For each srch_destination_id, this table contains the value to use when replacing a NaN in the dataset.
# Iterate over the dataset row by row. If a value is missing (NaN), fill in the
# min. value found in the lookup table.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        dataset.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?

pd.Series.fillna accepts a Series or a dictionary, as well as scalar replacement values.
Therefore, you can create a Series mapping from the lookup table:
s = lookuptable.set_index('srch_destination_id')['Value']
Then use it to fill the NaN values in the dataset:
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(dataset['srch_destination_id'].map(s.get))
Notice that the argument passed to fillna is built by mapping each row's srch_destination_id through the lookup Series with pd.Series.map.
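Here is a minimal, self-contained sketch of that vectorised approach. It uses the column names from the question; the small frames and their values are made up purely for illustration.
import numpy as np
import pandas as pd

# Toy data mimicking the question's layout (values are invented).
dataset = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 2, 10],
    'prop_location_score2': [0.5, np.nan, 0.3, np.nan, np.nan],
})
lookuptable = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 10],
    'Value': [0.0110, 0.0000, 0.0207, 0.1500],
})

# Series indexed by srch_destination_id, holding the replacement values.
s = lookuptable.set_index('srch_destination_id')['Value']

# Vectorised fill: map each row's id to its lookup value and use it only
# where prop_location_score2 is NaN.
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s)
)
print(dataset)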

Related

Search long series for non-NaN entries

I am looking through a DataFrame with different kinds of data whose usefulness I'm trying to evaluate. So I am looking at each column and checking what kind of data it contains. E.g.
print(extract_df['Auslagenersatz'])
For some I get responses like this:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
263 NaN
264 NaN
265 NaN
266 NaN
267 NaN
I would like to check whether that column contains any information at all so what I am looking for is something like
s = extract_df['Auslagenersatz']
print(s.loc[s == True])
where I am assuming that NaN is interpreted as False in the same way an empty set is. I would like it to return only those elements of the series that satisfy this condition (being not empty). The code above does not work however, as I get an empty set even for columns that I know have non-NaN entries.
I oriented myself with this post: How to select rows from a DataFrame based on column values, but I can't figure out where I'm going wrong or how to do this instead. The problem comes up often, so any help is much appreciated.
import pandas as pd
df = pd.DataFrame({'A':[2,3,None, 4,None], 'B':[2,13,None, None,None], 'C':[None,3,None, 4,None]})
If you want to see non-NA values of column A then:
df[~df['A'].isna()]
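If the goal is only to check whether a column contains any information at all, a couple of one-liners along these lines may be enough (a sketch, assuming the extract_df frame and column name from the question):
s = extract_df['Auslagenersatz']

# Does the column contain any non-NaN values at all?
print(s.notna().any())

# Keep only the non-NaN entries of the series.
print(s.dropna())

# Count the non-NaN entries per column for the whole frame.
print(extract_df.notna().sum())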

Add missing dates in pandas df, but date range has (valid) duplicates

I have a dataset that has multiple values received per second - up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time more granularly than the second, so multiple rows have the same hh:mm:ss timestamp. These are fine, but I also have several seconds missing across the set, i.e., not showing up at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from this, I don't want to aggregate but I do need to create the missing seconds, insert them and fill (NaN, etc) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a nested if loop which looks at the previous timestamp: if it is the same as the current cell (pt == ct) then no action, if it is 1 second less (pt == ct-1) then no action, but if it is behind the current cell by 2 or more seconds (pt <= ct-2), insert the missing ones. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here as well as in various functions on pandas.pydata.org but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.
This can be solved by creating a pandas Series containing all the timepoints you want to consider and then merging it with the original dataframe.
For example:
start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
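Put together as a runnable sketch, assuming the two columns are called date and value as in the output above:
import pandas as pd

# Sample data from the question; the 03:36:54 second is missing entirely.
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-24 03:36:53', '2020-08-24 03:36:53',
                            '2020-08-24 03:36:53', '2020-08-24 03:36:55']),
    'value': [5, 8, 6, 8],
})

# One timepoint per second over the full span; the outer merge keeps the
# legitimate duplicates and adds the missing seconds as new rows filled with 0.
start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
print(df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0))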

Does storing a large amount of NaN values in a large pandas dataframe massively affect performance and memory usage?

I have several large dataframes which are built up from a vehicle log. Only one message can be present on the CAN bus (a vehicle communication protocol) at any time.
This is a simplified dataframe without any interpolation:
time  messageA1  messageA2  messageA3  messageB1  messageB2  messageC1  messageC2
0     1          2          1          NaN        NaN        NaN        NaN
1     NaN        NaN        NaN        NaN        NaN        3          2
2     NaN        NaN        NaN        3          7          NaN        NaN
And this can continue for millions of rows, with NaN values making up about 95% of the entire dataframe. I have read that when a NaN/Null/None value is stored in a dataframe it is a float64 value.
My questions:
Is a float64 value allocated for every NaN value?
If yes, does it do this memory efficiently?
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to process performance?
Is a float64 value allocated for every NaN value?
Yes it is;
If yes, does it do this memory efficiently?
No, it does not; instead you are supposed to use a sparse data structure.
Will having a large dataframe, with 95% of it NaN values, be inefficient when it comes to process performance?
Yes, it will, for all those operations that are O(f(N)), depending on f(N). Think of averaging the data, for instance: you have to check whether each value is NaN and, if so, skip it (or maybe treat it as 0, it depends), and that check is pure overhead.
You might want to compare the sheer size of the dense structure (your current implementation) against a sparse data structure in your case:
print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))
The two numbers should be pretty different.
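The answer above never shows how sdf is built. One way to obtain a sparse copy of a NaN-heavy frame (a sketch, with the frame named df as in the snippet above and invented dimensions) is pandas' sparse dtypes:
import numpy as np
import pandas as pd

# A dense frame where most cells are NaN, roughly like the CAN-bus example.
df = pd.DataFrame(np.nan, index=range(1_000_000), columns=list('ABCD'))
df.loc[::20, 'A'] = 1.0  # populate only a small fraction of one column

# Sparse representation: NaN is the fill value and is not stored explicitly.
sdf = df.astype(pd.SparseDtype('float64', np.nan))

print('dense : {:0.2f} Kbytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} Kbytes'.format(sdf.memory_usage().sum() / 1e3))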

How to check the type of missing data in Python (randomly missing or not)?

I have a large amount of data (93 files, ~150 MB each). The data is a time series, i.e., information about a given set of coordinates (3.3 million latitude-longitude values) is recorded and stored every day for 93 days, and the whole dataset is broken up into 93 files, one per day. Example of two such files:
Day 1:
lon   lat   A    B    day1
68.4  8.4   NaN  20   20
68.4  8.5   16   20   18
68.6  8.4   NaN  NaN  NaN
.
.
Day 2:
lon   lat   C    D    day2
68.4  8.4   NaN  NaN  NaN
68.4  8.5   24   25   24.5
68.6  8.4   NaN  NaN  NaN
.
.
I am interested in understanding the nature of the missing data in the columns 'day1', 'day2', 'day3', etc. For example, if the values missing in these columns are evenly distributed over the whole set of coordinates, then the data is probably missing at random; but if the missing values are concentrated in a particular set of coordinates, then my data will become biased. Also, since my data is divided into multiple large files and isn't in a very standard form to operate on, it is harder to use some existing tools.
I am looking for a diagnostic tool or visualization in python that can check/show how the missing data is distributed over the set of coordinates so I can impute/ignore it appropriately.
Thanks.
P.S: This is the first time I am handling missing data so it would be great to see if there exists a workflow which people who do similar kind of work follow.
Assuming that you read a file into a DataFrame named df, you can count the NaNs using:
df.isnull().sum()
It will return the number of NaNs per column.
You could also use:
df.isnull().sum(axis=1).value_counts()
This, on the other hand, will sum the number of NaNs per row and then count how many rows have no NaNs, 1 NaN, 2 NaNs, and so on.
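As a quick illustration on a small made-up frame (not the poster's actual data):
import pandas as pd

df = pd.DataFrame({
    'lon':  [68.4, 68.4, 68.6],
    'lat':  [8.4, 8.5, 8.4],
    'day1': [20, 18, None],
    'day2': [None, 24.5, None],
})

print(df.isnull().sum())                       # NaNs per column
print(df.isnull().sum(axis=1).value_counts())  # how many rows have 0, 1, 2, ... NaNs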
Regarding working with files of this size, to speed up loading and processing the data I recommend using Dask and changing the format of your files, preferably to Parquet, so that you can read and write them in parallel.
You can easily recreate the function above in Dask like this:
from dask import dataframe as dd
dd.read_parquet(file_path).isnull().sum().compute()
Answering the comment question:
Use .loc to slice your dataframe; in the code below I select all rows (:) and two columns (['col1', 'col2']).
df.loc[:, ['col1', 'col2']].isnull().sum(axis=1).value_counts()

Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to iterate linear-regression code over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series; here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression
series = np.array([]) #blank list to append result
df2 = df1[~np.isnan(df1['A1'])] #removes NaN values for each column to apply sklearn function
df3 = df2[['Time','A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:,0], npMatrix[:,1]
slope = LinearRegression().fit(X,Y) # either this or the next line
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am reusing this block of code, replacing "A1" with a new column name all the way up to "Z3", and this is extremely inefficient. I know there are many easy ways to do this with some modules, but having all these intermediate NaN values in the time series seems to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1' with col in the code, for example, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the slope estimates are (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and its one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do them all together? You have to deal with the NaNs somehow. How would you imagine dealing with them? Only fitting over the times where you had data? That is equivalent to placing zeroes in the NaN spots. So, I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
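For reference, here is a runnable sketch of that closed-form computation. The pinv expression is the one from the answer; the toy frame is reconstructed from the question's snippet (using only a few of its columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    'A1':   [6.64, 6.70, np.nan, 7.15, np.nan, 7.44, 7.62],
    'A2':   [6.82, 6.86, np.nan, 7.26, np.nan, 7.63, 7.86],
    'B1':   [6.70, np.nan, 7.07, 7.19, np.nan, 7.54, np.nan],
})

# Closed-form OLS, (X^T X)^-1 X^T Y, applied to every column at once;
# NaNs are treated as zeroes, exactly as in the answer above.
time = df[['Time']]
slopes = pd.DataFrame(
    np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
    ['Slope'], df.columns)
print(slopes)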
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the x-axis column itself
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
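If it is useful, the collected slopes can be labelled with the columns they came from afterwards; a small usage sketch, assuming the loop above has just run and that numpy and pandas are imported as np and pd:
# Pair each slope with its column name (Time itself was skipped in the loop).
slope_series = pd.Series(
    [float(np.ravel(s)[0]) for s in slopes],
    index=[c for c in df1.columns if c != 'Time'],
)
print(slope_series)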
