I have a data set which is created based on another data set. In my new data frame some columns have NaN values. I want to take the log of each column. However, I need all the rows, even those that contain NaN values. What should I do with the NaN values before applying the log? For example, consider the following data set:
a b c
1 2 3
4 5 6
7 nan 8
9 nan nan
I do not want to drop the rows with NaN values; I need them when applying the log.
For example, I need to keep the values 7 and 8 in the rows that also contain NaN.
Thanks.
Having NaN won't affect log when calculating each individual cell. What's more, np.log has the property that it will operate on a pd.DataFrame and return a pd.DataFrame:
np.log(df)
a b c
0 0.000000 0.693147 1.098612
1 1.386294 1.609438 1.791759
2 1.945910 NaN 2.079442
3 2.197225 NaN NaN
Notice the difference in timing
%timeit np.log(df)
%timeit pd.DataFrame(np.log(df.values), df.index, df.columns)
%timeit df.applymap(np.log)
134 µs ± 5.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
107 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
835 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Response to @IanS
Notice the subok=True parameter in the documentation.
It controls whether the original type is preserved. If we set it to False:
np.log(df, subok=False)
array([[ 0. , 0.69314718, 1.09861229],
[ 1.38629436, 1.60943791, 1.79175947],
[ 1.94591015, nan, 2.07944154],
[ 2.19722458, nan, nan]])
I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is 1 when the current row and at least the 4 previous rows all have the same sign (either all positive or all negative, the current row included).
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for a large dataset.
Do you have any idea how to speed it up?
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
sign: a boolean Series indicating whether the value is positive (True) or negative (False)
id: groups identical consecutive occurrences together
Finally, we can group by the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. all rows that are part of a run of at least 5 consecutive same-sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
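As a quick sanity check (a sketch; the two definitions only diverge when val is exactly 0, since >= 0 counts zero as positive), we can compare against the original rolling definition on the random test data:
# compare the groupby/cumcount flag with the rolling definition from the question
expected = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).fillna(0).astype(bool)
print((df['consecutive'] == expected).all())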
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus in total we get an average runtime of: 4.51 ms
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not sure this is fast enough for your data size, but using min/max seems faster.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
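If you prefer to keep the rolling form, another option worth trying (a sketch, assuming pandas >= 1.0 and numba is installed) is to let pandas JIT-compile the function with the numba engine:
# same rolling logic, compiled with numba instead of running as plain Python
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: float(arr.min() > 0 or arr.max() < 0), raw=True, engine='numba'
)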
I have a dataframe in pandas, and I am trying to shift the values within each row so that the NaN cells end up in the first columns and the data values end up in the last columns. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are fewer than 5 data-filled cells, then take the average of however many cells there are.
Use the function justify for improved performance, filtering all columns except the first with DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
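Once the values are right-justified, the end goal from the question (the mean of the last five data-filled cells, or of however many there are) reduces to a row-wise mean over the trailing columns; a sketch, with last5_mean as a made-up column name:
# after right-justification the data values sit in the rightmost columns,
# and mean() skips NaN, which covers rows with fewer than 5 values
df['last5_mean'] = df.iloc[:, -5:].mean(axis=1)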
Assuming you need to move all NaN values to the first columns, I would define a function that takes all the NaN values, places them first, and leaves the rest as is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
I want to add three additional columns using pandas and Python. I'm not sure how to add columns based on rows that have the same groupId value.
min_avg: the lowest avg value among rows with the same groupId
max_avg: the highest avg value among rows with the same groupId
group_avg: the average of each row's min_avg and max_avg values
I'm not entirely sure where to begin with this one.
I have this:
avg groupId
0 25.5 1016
1 26.7 1048
2 25.8 1016
3 53.5 1048
4 29.3 1064
5 27.7 1016
and my goal is this:
avg groupId min_avg max_avg group_average
0 25.5 1016 25.5 27.7 26.6
1 26.7 1048 26.3 53.5 39.9
2 25.8 1016 25.5 27.7 26.6
3 53.5 1048 26.3 53.5 39.9
4 29.3 1064 29.3 29.3 29.3
5 27.7 1016 25.5 27.7 26.6
We can do a merge with groupby + describe:
df=df.merge(df.groupby('groupId').avg.describe()[['mean','min','max']].reset_index(),how='left')
Out[25]:
avg groupId mean min max
0 25.5 1016 26.333333 25.5 27.7
1 26.7 1048 40.100000 26.7 53.5
2 25.8 1016 26.333333 25.5 27.7
3 53.5 1048 40.100000 26.7 53.5
4 29.3 1064 29.300000 29.3 29.3
5 27.7 1016 26.333333 25.5 27.7
The describe method, as used in YOBEN_S's solution, will compute more statistics than required, including count, std, and the quartiles.
We can get around this by using the agg method.
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
# output
avg groupId min max mean
0 25.5 1016 25.5 27.7 26.333333
1 26.7 1048 26.7 53.5 40.100000
2 25.8 1016 25.5 27.7 26.333333
3 53.5 1048 26.7 53.5 40.100000
4 29.3 1064 29.3 29.3 29.300000
5 27.7 1016 25.5 27.7 26.333333
Speed Comparison
Approach 1
%%timeit -n 1000
df.merge(df.groupby('groupId').avg.describe()[['mean','min','max']].reset_index(),how='left')
9.6 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 2
%%timeit -n 1000
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
3.42 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 3
Additionally, we can get a slight speedup by converting df.merge to df.join.
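The join code isn't shown above; it is presumably something like this sketch:
# join on df's 'groupId' column against the groupby result's index
df.join(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')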
2.96 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
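For completeness, if you want the exact column names from the question's expected output (where group_average looks like the midpoint of min_avg and max_avg rather than the group mean), a transform-based sketch:
g = df.groupby('groupId')['avg']
df['min_avg'] = g.transform('min')
df['max_avg'] = g.transform('max')
# the expected output's group_average appears to be (min_avg + max_avg) / 2
df['group_average'] = (df['min_avg'] + df['max_avg']) / 2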
I am trying to add a random number of days to a series of datetime values without iterating over each row of the dataframe, as that takes a lot of time (I have a large dataframe). I went through datetime's timedelta, pandas DateOffset, etc., but they do not have an option to supply the random number of days all at once, i.e. using a list as input (we have to give the random numbers one by one).
code:
df['date_columnA'] = df['date_columnB'] + datetime.timedelta(days = n)
The above code adds the same number of days, i.e. n, to all rows, whereas I want a different random number of days added to each row.
If performance is important, create all the random timedeltas at once with to_timedelta and numpy.random.randint, and add them to the column:
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=20)})
td = pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
df['date_columnA'] = df['date_columnB'] + td
print (df)
date_columnB date_columnA
0 2015-01-01 2015-04-08
1 2015-01-02 2015-01-11
2 2015-01-03 2015-03-12
3 2015-01-04 2015-03-13
4 2015-01-05 2015-04-07
5 2015-01-06 2015-01-10
6 2015-01-07 2015-03-20
7 2015-01-08 2015-03-06
8 2015-01-09 2015-02-08
9 2015-01-10 2015-02-28
10 2015-01-11 2015-02-13
11 2015-01-12 2015-02-06
12 2015-01-13 2015-03-29
13 2015-01-14 2015-01-24
14 2015-01-15 2015-03-08
15 2015-01-16 2015-01-28
16 2015-01-17 2015-03-14
17 2015-01-18 2015-03-22
18 2015-01-19 2015-03-28
19 2015-01-20 2015-03-31
Performance for 10k rows:
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=10000)})
In [357]: %timeit df['date_columnA'] = df['date_columnB'].apply(lambda x:x+timedelta(days=random.randint(0,100)))
158 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [358]: %timeit df['date_columnA1'] = df['date_columnB'] + pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import random
from datetime import timedelta

df['date_columnA'] = df['date_columnB'].apply(lambda x: x + timedelta(days=random.randint(0, 100)))
import numpy as np
import pandas as pd

# draw a random day offset for each row in one go and add it to the date column
df['date_columnA'] = df['date_columnB'] + np.random.choice(pd.timedelta_range('1 day', '100 days'), len(df))
Having this time series:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0., 1., 2., 3.])
>>> series
2019-01-01 0.0
2019-01-05 1.0
2019-01-06 2.0
2019-01-09 3.0
dtype: float64
I can resample with Pandas to '2D' and get:
series.resample('2D').sum()
2019-01-01 0.0
2019-01-03 0.0
2019-01-05 3.0
2019-01-07 0.0
2019-01-09 3.0
Freq: 2D, dtype: int64
However, I would like to get:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
Freq: 2D, dtype: int64
Or at least (so that I can drop the NaNs):
2019-01-01 0.0
2019-01-03 Nan
2019-01-05 3.0
2019-01-07 Nan
2019-01-09 3.0
Freq: 2D, dtype: int64
Notes
Using latest Pandas version (0.24)
Would like to be able to use the '2D' syntax (or 'W' or '3H' or whatever...) and let Pandas care about the grouping/resampling
This looks dirty and inefficient. Hopefully someone comes up with a better alternative. :-D
>>> resampled = series.resample('2D')
>>> (resampled.mean() * resampled.count()).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
It would be clearer to use resampled.count() as a condition after using sum, like this:
resampled = series.resample('2D')
resampled.sum()[resampled.count() != 0]
Out :
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
On my computer this method is 22% faster (5.52ms vs 7.15ms).
You can use the named argument min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
Performance comparison with other methods, from faster to slower (run your own tests, as it may depend on your architecture, platform, environment...):
In [38]: %timeit resampled.sum(min_count=1).dropna()
588 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit (resampled.mean() * resampled.count()).dropna()
622 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit resampled.sum()[resampled.count() != 0].copy()
960 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)