Pandas: downsample and sum without filling spaces (or leaving `NaN`s) - python

Having this time series:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0., 1., 2., 3.])
>>> series
2019-01-01 0.0
2019-01-05 1.0
2019-01-06 2.0
2019-01-09 3.0
dtype: float64
I can resample with Pandas to '2D' and get:
series.resample('2D').sum()
2019-01-01 0.0
2019-01-03 0.0
2019-01-05 3.0
2019-01-07 0.0
2019-01-09 3.0
Freq: 2D, dtype: float64
However, I would like to get:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
Freq: 2D, dtype: float64
Or at least (so that I can drop the NaNs):
2019-01-01 0.0
2019-01-03 NaN
2019-01-05 3.0
2019-01-07 NaN
2019-01-09 3.0
Freq: 2D, dtype: float64
Notes
Using latest Pandas version (0.24)
Would like to be able to use the '2D' syntax (or 'W' or '3H' or whatever...) and let Pandas care about the grouping/resampling
This looks dirty and inefficient. Hopefully someone comes up with a better alternative. :-D
>>> resampled = series.resample('2D')
>>> (resampled.mean() * resampled.count()).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64

It would be clearer to use resampled.count() as a condition after using sum, like this:
resampled = series.resample('2D')
resampled.sum()[resampled.count() != 0]
Out:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
On my computer this method is 22% faster (5.52 ms vs 7.15 ms).

You can use the named argument min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
Performance comparison with the other methods, from fastest to slowest (run your own tests, as it may depend on your architecture, platform, environment...):
In [38]: %timeit resampled.sum(min_count=1).dropna()
588 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit (resampled.mean() * resampled.count()).dropna()
622 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit resampled.sum()[resampled.count() != 0].copy()
960 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
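The question also asks for the same pattern to work with other frequency strings ('W', '3H', ...); as far as I can tell, the min_count approach carries over unchanged. A minimal sketch with weekly bins, using the series from the question:
>>> series.resample('W').sum(min_count=1).dropna()
With the example data, only the weeks ending 2019-01-06 and 2019-01-13 should remain, each summing to 3.0.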

Related

In dataframe, how to speed up recognizing rows that have more than 5 consecutive previous values with same sign?

I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is 1 when the current row, together with the rows before it, forms a run of at least 5 consecutive values with the same sign (either all positive or all negative, the current row included).
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for a large dataset.
Do you have any idea how to speed it up?
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
sign: a boolean Series indicating whether the value is positive (True) or negative (False)
id: groups identical consecutive occurrences together
Finally, we can group by the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows with 5 consecutive same-sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus, in total, we get an average runtime of 4.51 ms.
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
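If this flag is needed for other window lengths or columns, the same vectorized idea can be wrapped in a small helper. The sketch below is mine (the function name and the window parameter are not from the answer above), and it treats zero as positive, matching the sign = df['val'] >= 0 line:
import numpy as np
import pandas as pd

def flag_consecutive_sign(values, window=5):
    # boolean sign per value; zero counts as positive, as in `sign = df['val'] >= 0`
    sign = values >= 0
    # a new run id every time the sign flips
    run_id = sign.ne(sign.shift()).cumsum()
    # flag rows that are at least the `window`-th member of their run
    return (values.groupby(run_id).cumcount() >= window - 1).astype(float)

df = pd.DataFrame({'val': np.random.randn(20000)})
df['consecutive'] = flag_consecutive_sign(df['val'], window=5)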
Not sure this is fast enough for your data size, but using min and max seems faster.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fill NaN values from previous column with data

I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.
Use the function justify to improve performance, applying it to all columns except the first via DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
# 41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
# 198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
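To tie this back to the goal stated in the question (the mean of the last five data-filled cells per row), one option is to take the row mean of the last five columns of the right-justified frame, since mean() skips NaN by default. A minimal sketch, assuming df has already been justified as above (the column name last5_mean is mine):
# after right-justifying, the last five columns hold the last (up to) five values;
# rows with fewer than five values keep NaNs there, so mean() averages what exists
df['last5_mean'] = df.iloc[:, 1:].iloc[:, -5:].mean(axis=1)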
Assuming you need to move all NaNs to the first columns, I would define a function that takes all NaNs and places them first, leaving the rest as it is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()

Adding random number of days to a series of datetime values

I am trying to add a random number of days to a series of datetime values without iterating over each row of the dataframe, as that takes a lot of time (I have a large dataframe). I went through datetime's timedelta, pandas DateOffset, etc., but they do not have an option to supply the random numbers of days all at once, i.e. using a list as input (we have to give the random numbers one by one).
code:
df['date_columnA'] = df['date_columnB'] + datetime.timedelta(days = n)
The above code will add the same number of days, i.e. n, to all the rows, whereas I want random numbers to be added.
If performance is important, create all the random timedeltas at once with to_timedelta and numpy.random.randint, and add them to the column:
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=20)})
td = pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
df['date_columnA'] = df['date_columnB'] + td
print (df)
date_columnB date_columnA
0 2015-01-01 2015-04-08
1 2015-01-02 2015-01-11
2 2015-01-03 2015-03-12
3 2015-01-04 2015-03-13
4 2015-01-05 2015-04-07
5 2015-01-06 2015-01-10
6 2015-01-07 2015-03-20
7 2015-01-08 2015-03-06
8 2015-01-09 2015-02-08
9 2015-01-10 2015-02-28
10 2015-01-11 2015-02-13
11 2015-01-12 2015-02-06
12 2015-01-13 2015-03-29
13 2015-01-14 2015-01-24
14 2015-01-15 2015-03-08
15 2015-01-16 2015-01-28
16 2015-01-17 2015-03-14
17 2015-01-18 2015-03-22
18 2015-01-19 2015-03-28
19 2015-01-20 2015-03-31
Performance for 10k rows:
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=10000)})
In [357]: %timeit df['date_columnA'] = df['date_columnB'].apply(lambda x:x+timedelta(days=random.randint(0,100)))
158 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [358]: %timeit df['date_columnA1'] = df['date_columnB'] + pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
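As a minor variation, if reproducible draws are wanted, the newer NumPy Generator API can be used instead of the legacy np.random.seed; a minimal sketch under the same setup (the seed value is arbitrary):
import numpy as np
import pandas as pd

df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=10000)})
# seeded Generator instead of the global np.random state
rng = np.random.default_rng(2020)
df['date_columnA'] = df['date_columnB'] + pd.to_timedelta(
    rng.integers(1, 100, size=len(df)), unit='d')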
import random
from datetime import timedelta

df['date_columnA'] = df['date_columnB'].apply(lambda x: x + timedelta(days=random.randint(0, 100)))
import numpy as np
import pandas as pd

df['date_columnA'] = df['date_columnB'] + np.random.choice(
    pd.timedelta_range('0 days', '100 days'), len(df))

Is there a faster way to read following pandas dataframe?

I have a huge .csv file (2.3 GB) which I have to read into a pandas dataframe.
start_date,wind_90.0_0.0,wind_90.0_5.0,wind_87.5_2.5
1948-01-01,15030.64,15040.64,16526.35
1948-01-02,15050.14,15049.28,16526.28
1948-01-03,15076.71,15075.0,16525.28
I want to process above data into below structure:
start_date lat lon wind
0 1948-01-01 90.0 0.0 15030.64
1 1948-01-01 90.0 5.0 15040.64
2 1948-01-01 87.5 2.5 16526.35
3 1948-01-02 90.0 0.0 15050.14
4 1948-01-02 90.0 5.0 15049.28
5 1948-01-02 87.5 2.5 16526.28
6 1948-01-03 90.0 0.0 15076.71
7 1948-01-03 90.0 5.0 15075.0
8 1948-01-03 87.5 2.5 16525.28
Here is the code I have so far; it does what I want, but it is too slow and takes up a lot of memory.
def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    df = pd.melt(df, id_vars=df.columns[0])
    df['lat'] = df['variable'].str.split('_').str[-2]
    df['lon'] = df['variable'].str.split('_').str[-1]
    df = df.drop('variable', axis=1)
    df.columns = ['start_date', featureName, 'lat', 'lon']
    df = df.groupby(['start_date', 'lat', 'lon']).first()
    df = df.reset_index()
    df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d', errors='coerce')
    return df
This should speed up your code:
We can use melt to unpivot your data from wide to long. Then we use str.split on the 'variable' column (the former column names) with expand=True to get a new column for each split part. Finally, we join these newly created columns back to our original dataframe:
melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1:'lat', 2:'lon'})
final = melt.drop(columns='variable').join(newcols)
Output
start_date value lat lon
0 1948-01-01 15030.64 90.0 0.0
1 1948-01-01 15040.64 90.0 5.0
2 1948-01-01 16526.35 87.5 2.5
3 1948-01-02 15050.14 90.0 0.0
4 1948-01-02 15049.28 90.0 5.0
5 1948-01-02 16526.28 87.5 2.5
6 1948-01-03 15076.71 90.0 0.0
7 1948-01-03 15075.00 90.0 5.0
8 1948-01-03 16525.28 87.5 2.5
Timeit test on 800k rows:
3.55 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
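Put together, a drop-in replacement for the original load_data_as_pandas might look roughly like the sketch below; it keeps the question's featureName renaming and to_datetime conversion, and, as in the original, lat/lon are left as strings:
import pandas as pd

def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    # wide -> long
    melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
    # split e.g. 'wind_90.0_0.0' into lat/lon columns
    newcols = (melt['variable'].str.split('_', expand=True)
               .iloc[:, 1:]
               .rename(columns={1: 'lat', 2: 'lon'}))
    out = melt.drop(columns='variable').join(newcols)
    out = out.rename(columns={'value': featureName})
    out['start_date'] = pd.to_datetime(out['start_date'], format='%Y-%m-%d', errors='coerce')
    return out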

Nan values in columns in python

I have a data set which is created based on another data set. In my new data frame some columns have NaN values. I want to take the log of each column. However, I need all the rows, even though they have NaN values. What should I do with the NaN values before applying the log? For example, consider the following data set:
a b c
1 2 3
4 5 6
7 nan 8
9 nan nan
I do not want to drop the rows with NaN values. I need them for applying the log. I need to have the values of 7 and 8 in row 6, for example.
Thanks.
Having NaN won't affect log when it is calculated for each individual cell. What's more, np.log has the property that it operates on a pd.DataFrame and returns a pd.DataFrame:
np.log(df)
a b c
0 0.000000 0.693147 1.098612
1 1.386294 1.609438 1.791759
2 1.945910 NaN 2.079442
3 2.197225 NaN NaN
Notice the difference in timing
%timeit np.log(df)
%timeit pd.DataFrame(np.log(df.values), df.index, df.columns)
%timeit df.applymap(np.log)
134 µs ± 5.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
107 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
835 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Response to @IanS
Notice the subok=True parameter in the documentation. It controls whether the original type is preserved. If we set it to False:
np.log(df, subok=False)
array([[ 0. , 0.69314718, 1.09861229],
[ 1.38629436, 1.60943791, 1.79175947],
[ 1.94591015, nan, 2.07944154],
[ 2.19722458, nan, nan]])
