I am trying to add a random number of days to each value in a series of datetimes without iterating over the rows of the dataframe, because iteration takes a long time (I have a large dataframe). I looked at datetime's timedelta, pandas DateOffset, etc., but they do not seem to accept a list of random day counts in one call; the numbers have to be supplied one by one.
code:
df['date_columnA'] = df['date_columnB'] + datetime.timedelta(days = n)
The code above adds the same number of days, n, to every row, whereas I want a different random number added to each row.
If performance is important, create all the random timedeltas at once with to_timedelta and numpy.random.randint, then add them to the column:
import numpy as np
import pandas as pd
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=20)})
# randint's upper bound is exclusive, so this draws between 1 and 99 days
td = pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
df['date_columnA'] = df['date_columnB'] + td
print (df)
date_columnB date_columnA
0 2015-01-01 2015-04-08
1 2015-01-02 2015-01-11
2 2015-01-03 2015-03-12
3 2015-01-04 2015-03-13
4 2015-01-05 2015-04-07
5 2015-01-06 2015-01-10
6 2015-01-07 2015-03-20
7 2015-01-08 2015-03-06
8 2015-01-09 2015-02-08
9 2015-01-10 2015-02-28
10 2015-01-11 2015-02-13
11 2015-01-12 2015-02-06
12 2015-01-13 2015-03-29
13 2015-01-14 2015-01-24
14 2015-01-15 2015-03-08
15 2015-01-16 2015-01-28
16 2015-01-17 2015-03-14
17 2015-01-18 2015-03-22
18 2015-01-19 2015-03-28
19 2015-01-20 2015-03-31
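As a self-contained sketch of the same idea, the snippet below uses NumPy's Generator API (np.random.default_rng) instead of the legacy np.random.seed / randint pair. The seed, the 1 to 99 day range and the small frame are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)  # seed chosen only for reproducibility
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=20)})

# One random integer per row; `high` is exclusive, so offsets fall in 1..99 days
days = rng.integers(low=1, high=100, size=len(df))

# Convert the whole array to timedeltas in one call and add it to the column
df['date_columnA'] = df['date_columnB'] + pd.to_timedelta(days, unit='D')
print(df.head())
```

Because the offsets are generated as a single array, no per-row Python function call is involved.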
Performance for 10k rows:
np.random.seed(2020)
df = pd.DataFrame({'date_columnB': pd.date_range('2015-01-01', periods=10000)})
In [357]: %timeit df['date_columnA'] = df['date_columnB'].apply(lambda x:x+timedelta(days=random.randint(0,100)))
158 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [358]: %timeit df['date_columnA1'] = df['date_columnB'] + pd.to_timedelta(np.random.randint(1,100, size=len(df)), unit='d')
1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import random
from datetime import timedelta
df['date_columnA'] = df['date_columnB'].apply(lambda x: x + timedelta(days=random.randint(0, 100)))
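import numpy as np
import pandas as pd
# adding a datetime to a datetime is not defined, so draw random day offsets (timedeltas) rather than random dates
offsets = pd.to_timedelta(np.random.choice(np.arange(0, 101), size=len(df)), unit='d')
df['date_columnA'] = df['date_columnB'] + offsets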
I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is 1 when the current row and the 4 rows before it (5 consecutive values, including the current one) all have the same sign, either all positive or all negative.
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But this is too slow for a large dataset.
Do you have any idea how to speed it up?
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
sign: boolean Series indicating whether the value is non-negative (True) or negative (False); note that zero counts as positive here
id: a run identifier that groups identical consecutive occurrences of sign together
Finally, we can groupby the id and use cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows with 5 consecutive sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
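The comparison with >= 0 treats zeros as positive, whereas the expected output in the question resets the run at 0.0000. A minimal sketch of a zero-aware variant follows, assuming np.sign is an acceptable way to obtain the sign; the function name flag_same_sign_runs and its window parameter are made up for this illustration.

```python
import numpy as np
import pandas as pd

def flag_same_sign_runs(values, window=5):
    """Return 1.0 where a row ends a run of at least `window` same-sign, non-zero values."""
    sign = np.sign(values)                         # -1.0, 0.0 or 1.0 per row
    run_id = sign.ne(sign.shift()).cumsum()        # new id whenever the sign changes
    run_len = sign.groupby(run_id).cumcount() + 1  # position inside the current run
    return ((run_len >= window) & sign.ne(0)).astype(float)

# First 13 values of the question's sample data
vals = pd.Series([0.0001, 0.0008, -0.0001, 0.0005, 0.0008, 0.0002, 0.0012,
                  0.0012, 0.0007, 0.0004, 0.0002, 0.0000, 0.0015])
flags = flag_same_sign_runs(vals)
print(flags.tolist())
```

Zeros form their own run and are never flagged, so a zero breaks a streak in the same way as a sign change.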
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus in total we get an average runtime of: 4.51 ms
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am not sure this is fast enough for your data size, but using min and max instead of np.all seems faster.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
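A further variation on the min/max idea, not benchmarked here, is to drop apply entirely and rely on the built-in rolling min and max. This is only a sketch on the first 13 values of the question's sample data; the variable names are illustrative.

```python
import pandas as pd

vals = pd.Series([0.0001, 0.0008, -0.0001, 0.0005, 0.0008, 0.0002, 0.0012,
                  0.0012, 0.0007, 0.0004, 0.0002, 0.0000, 0.0015])

roll = vals.rolling(5)
# A window is all positive if its minimum is > 0, all negative if its maximum is < 0.
# Incomplete windows yield NaN, and comparisons with NaN are False, so they become 0.
consecutive = ((roll.min() > 0) | (roll.max() < 0)).astype(float)
print(consecutive.tolist())
```

Unlike the apply versions above, the first four rows come out as 0.0 rather than NaN, so no replace step is needed.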
I need to expand each record of a dataframe into one row per day, based on the record's start date and end date, so that every day becomes a row in the new dataframe.
Input Dataframe:
A   B   Start                End
A1  B1  2021-05-15 00:00:00  2021-05-17 00:00:00
A1  B2  2021-05-30 00:00:00  2021-06-02 00:00:00
A2  B3  2021-05-10 00:00:00  2021-05-12 00:00:00
A2  B4  2021-06-02 00:00:00  2021-06-04 00:00:00
Expected Output:
A   B   Start                End
A1  B1  2021-05-15 00:00:00  2021-05-16 00:00:00
A1  B1  2021-05-16 00:00:00  2021-05-17 00:00:00
A1  B2  2021-05-30 00:00:00  2021-05-31 00:00:00
A1  B2  2021-05-31 00:00:00  2021-06-01 00:00:00
A1  B2  2021-06-01 00:00:00  2021-06-02 00:00:00
A2  B3  2021-05-10 00:00:00  2021-05-11 00:00:00
A2  B3  2021-05-11 00:00:00  2021-05-12 00:00:00
A2  B4  2021-06-02 00:00:00  2021-06-03 00:00:00
A2  B4  2021-06-03 00:00:00  2021-06-04 00:00:00
Use:
#convert columns to datetimes
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
#subtract values and convert to days
s = df["End"].sub(df["Start"]).dt.days
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas, add 1 day for End column
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Start'] = df["Start"].add(add)
df['End'] = df["Start"] + pd.Timedelta(1, 'd')
#default index
df = df.reset_index(drop=True)
print (df)
A B Start End
0 A1 B1 2021-05-15 2021-05-16
1 A1 B1 2021-05-16 2021-05-17
2 A1 B2 2021-05-30 2021-05-31
3 A1 B2 2021-05-31 2021-06-01
4 A1 B2 2021-06-01 2021-06-02
5 A2 B3 2021-05-10 2021-05-11
6 A2 B3 2021-05-11 2021-05-12
7 A2 B4 2021-06-02 2021-06-03
8 A2 B4 2021-06-03 2021-06-04
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [136]: %timeit jez(df)
16.9 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [137]: %timeit andreas(df)
888 ms ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#800 rows
df = pd.concat([df] * 200, ignore_index=True)
In [139]: %timeit jez(df)
6.25 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [140]: %timeit andreas(df)
170 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def andreas(df):
    df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
    return df.explode('d_range')

def jez(df):
    df["Start"] = pd.to_datetime(df["Start"])
    df["End"] = pd.to_datetime(df["End"])
    #subtract values and convert to days
    s = df["End"].sub(df["Start"]).dt.days
    #repeat index
    df = df.loc[df.index.repeat(s)].copy()
    #add days by timedeltas, add 1 day for End column
    add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
    df['Start'] = df["Start"].add(add)
    df['End'] = df["Start"] + pd.Timedelta(1, 'd')
    #default index
    return df.reset_index(drop=True)
You can create a list of dates and explode it:
df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
df = df.explode('d_range')
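The explode approach leaves the dates in a separate d_range column that also contains the End date itself. A possible way to turn it into the Start/End layout of the expected output is sketched below, assuming Start and End are already datetimes and End is later than Start; the two sample rows and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["A1", "A1"],
    "B": ["B1", "B2"],
    "Start": pd.to_datetime(["2021-05-15", "2021-05-30"]),
    "End": pd.to_datetime(["2021-05-17", "2021-06-02"]),
})

one_day = pd.Timedelta(days=1)

# For each record, list the first instant of every day it covers (End excluded)
day_starts = pd.Series(
    [list(pd.date_range(s, e - one_day, freq="D")) for s, e in zip(df["Start"], df["End"])],
    index=df.index,
)

# One row per day: explode the lists, then rebuild End as Start + 1 day
out = df.assign(Start=day_starts).explode("Start").reset_index(drop=True)
out["Start"] = pd.to_datetime(out["Start"])
out["End"] = out["Start"] + one_day
print(out)
```

This still builds one date_range per record in Python, so the repeat-based solution above is likely to remain faster on large frames.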
I have a pandas dataframe that holds long-term data:
point_id issue_date latitude longitude rainfall
0 1.0 2020-01-01 6.5 66.50 NaN
1 2.0 2020-01-02 6.5 66.75 NaN
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN
6373889 17415.0 2020-12-31 38.5 100.00 NaN
6373890 rows × 5 columns
I want to extract the Standard Meteorological Week from its issue_date column, as given in this figure.
I have tried in 2 ways.
1st
lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')
2nd
lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week
The output in both cases is same
point_id issue_date latitude longitude rainfall smw iso
0 1.0 2020-01-01 6.5 66.50 NaN 01 1
1 2.0 2020-01-02 6.5 66.75 NaN 01 1
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN 53 53
6373889 17415.0 2020-12-31 38.5 100.00 NaN 53 53
6373890 rows × 7 columns
The issue is that both approaches count weeks from a fixed weekday (Sunday or Monday), regardless of the date on which the year starts.
For example, 1 January 2020 is a Wednesday (not a Monday),
so the 1st week has only 5 days (Wed, Thu, Fri, Sat and Sun).
year week day issue_date
0 2020 1 3 2020-01-01
1 2020 1 4 2020-01-02
2 2020 1 5 2020-01-03
3 2020 1 6 2020-01-04
... ... ... ...
6373889 2020 53 4 2020-12-31
But for Standard Meteorological Weeks, I want the following output for every year:
1st week: 1 January to 7 January
2nd week: 8 January to 14 January
3rd week: 15 January to 21 January
and so on,
irrespective of the weekday on which the year starts (Sunday, Monday, etc.).
How can I do this?
Use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2000-12-31')})
#inspired by https://stackoverflow.com/a/61592907/2901002
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = df['issue_date'].dt.dayofyear
df['smw'] = np.where(df['issue_date'].dt.is_leap_year,
leap_year[days - 1],
normal_year[days - 1])
print (df[df['smw'] == 9])
issue_date smw
56 2000-02-26 9
57 2000-02-27 9
58 2000-02-28 9
59 2000-02-29 9
60 2000-03-01 9
61 2000-03-02 9
62 2000-03-03 9
63 2000-03-04 9
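To sanity-check the lookup tables, the sketch below wraps the same logic in a function and counts the days per week for one leap year and one normal year. The function name smw_from_lookup and the choice of 2020 and 2021 are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def smw_from_lookup(dates):
    """Map a datetime Series to week numbers 1..52 via per-day lookup tables."""
    # Days 1..363 fall into 7-day weeks; the remaining days stay in week 52
    normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
    # In leap years 29 Feb is inserted into week 9, giving that week 8 days
    leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
    idx = dates.dt.dayofyear.to_numpy() - 1
    weeks = np.where(dates.dt.is_leap_year, leap_year[idx], normal_year[idx])
    return pd.Series(weeks, index=dates.index)

leap = pd.Series(pd.date_range('2020-01-01', '2020-12-31'))
normal = pd.Series(pd.date_range('2021-01-01', '2021-12-31'))

leap_counts = smw_from_lookup(leap).value_counts().sort_index()
normal_counts = smw_from_lookup(normal).value_counts().sort_index()
print(leap_counts[[9, 52]].tolist(), normal_counts[[9, 52]].tolist())
```

In the leap year, weeks 9 and 52 should each hold 8 days; in the normal year only week 52 does, and every other week holds 7.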
Performance:
#11323 rows
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})
In [6]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#51500 rows
df = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})
In [9]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
...:
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
...:
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can write a custom function to calculate the Standard Meteorological Week.
The basic calculation takes the number of days elapsed since 1 January of the same year, integer-divides it by 7 and adds 1.
Two special adjustments are needed: in leap years Week No. 9 has 8 days, and the last week of the year (Week No. 52) always has 8 days:
import numpy as np
import pandas as pd
# convert to datetime format if not already in datetime
df['issue_date'] = pd.to_datetime(df['issue_date'])
def get_smw(date_s):
    # get day-of-the-year minus 1 in range [0..364/365] for division by 7
    days_diff = date_s.dt.dayofyear - 1
    # adjust for leap year on Week No. 9 to have 8 days: (minus one day for 29 Feb onwards in the same year)
    leap_adj = date_s.dt.is_leap_year & (date_s > pd.to_datetime(date_s.dt.year.astype(str) + '-02-28'))
    days_diff = np.where(leap_adj, days_diff - 1, days_diff)
    # adjust for the last week of the year to have 8 days:
    # (cap the value for 31 Dec at 363 instead of 364 to keep it in the same week as 24 Dec)
    days_diff = np.clip(days_diff, 0, 363)
    smw = days_diff // 7 + 1
    return smw
df['smw'] = get_smw(df['issue_date'])
Result:
print(df)
point_id issue_date latitude longitude rainfall smw
0 1.0 2020-01-01 6.5 66.50 NaN 1
1 2.0 2020-01-02 6.5 66.75 NaN 1
2 3.0 2020-01-03 6.5 66.75 NaN 1
3 4.0 2020-01-04 6.5 66.75 NaN 1
4 5.0 2020-01-05 6.5 66.75 NaN 1
5 6.0 2020-01-06 6.5 66.75 NaN 1
6 7.0 2020-01-07 6.5 66.75 NaN 1
7 8.0 2020-01-08 6.5 66.75 NaN 2
8 9.0 2020-01-09 6.5 66.75 NaN 2
40 40.0 2020-02-26 6.5 66.75 NaN 9
41 41.0 2020-03-04 6.5 66.75 NaN 9
42 42.0 2020-03-05 6.5 66.75 NaN 10
43 43.0 2020-03-12 6.5 66.75 NaN 11
6373880 17414.0 2020-12-23 38.5 99.75 NaN 51
6373881 17414.0 2020-12-24 38.5 99.75 NaN 52
6373888 17414.0 2020-12-30 38.5 99.75 NaN 52
6373889 17415.0 2020-12-31 38.5 100.00 NaN 52
7000040 40.0 2021-02-26 6.5 66.75 NaN 9
7000041 41.0 2021-03-04 6.5 66.75 NaN 9
7000042 42.0 2021-03-05 6.5 66.75 NaN 10
7000042 43.0 2021-03-12 6.5 66.75 NaN 11
7373880 17414.0 2021-12-23 38.5 99.75 NaN 51
7373881 17414.0 2021-12-24 38.5 99.75 NaN 52
7373888 17414.0 2021-12-30 38.5 99.75 NaN 52
7373889 17415.0 2021-12-31 38.5 100.00 NaN 52
I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.
For better performance, use the justify function (defined below) and select all columns except the first with DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing and used to fill the vacated cells
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Assuming you need to move all NaN values to the first columns, I would define a function that places the NaN values first and leaves the remaining values in their original order:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
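The question's stated end goal, the mean of the last five non-NaN values in each row (or of fewer if fewer exist), can also be computed without rearranging the columns at all. The following is a sketch under that reading of the goal; the function name mean_last_n_valid and the second sample row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def mean_last_n_valid(frame, n=5):
    """Row-wise mean of the last `n` non-NaN values (fewer if a row has fewer)."""
    a = frame.to_numpy(dtype=float)
    valid = ~np.isnan(a)
    # For each cell, count the valid cells from that position to the right end
    rank_from_right = np.cumsum(valid[:, ::-1], axis=1)[:, ::-1]
    keep = valid & (rank_from_right <= n)
    totals = np.where(keep, a, 0.0).sum(axis=1)
    counts = keep.sum(axis=1)
    # Rows without any valid value get NaN instead of a division by zero
    means = np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)
    return pd.Series(means, index=frame.index)

row_83 = [27.0, 29.0, np.nan, 29.0, 30.0, np.nan, np.nan, 15.0,
          16.0, 17.0, np.nan, 28.0, 30.0, np.nan, 28.0, 18.0]
row_84 = [np.nan, 1.0, np.nan, 2.0, 3.0] + [np.nan] * 11  # only three valid values
df = pd.DataFrame([row_83, row_84], index=[83, 84], columns=range(1, 17))

result = mean_last_n_valid(df)
print(result)
```

For row 83 the last five valid values are 17, 28, 30, 28 and 18, so the mean is 24.2; row 84 has only three valid values and averages those.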
Having this time series:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0., 1., 2., 3.])
>>> series
2019-01-01 0.0
2019-01-05 1.0
2019-01-06 2.0
2019-01-09 3.0
dtype: float64
I can resample with Pandas to '2D' and get:
series.resample('2D').sum()
2019-01-01 0.0
2019-01-03 0.0
2019-01-05 3.0
2019-01-07 0.0
2019-01-09 3.0
Freq: 2D, dtype: float64
However, I would like to get:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
Freq: 2D, dtype: float64
Or at least (so that I can drop the NaNs):
2019-01-01 0.0
2019-01-03 NaN
2019-01-05 3.0
2019-01-07 NaN
2019-01-09 3.0
Freq: 2D, dtype: float64
Notes
Using latest Pandas version (0.24)
Would like to be able to use the '2D' syntax (or 'W' or '3H' or whatever...) and let Pandas care about the grouping/resampling
This looks dirty and inefficient. Hopefully someone comes up with a better alternative. :-D
>>> resampled = series.resample('2D')
>>> (resampled.mean() * resampled.count()).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
It would be clearer to call sum first and then use resampled.count() as a filter condition, like this:
resampled = series.resample('2D')
resampled.sum()[resampled.count() != 0]
Out :
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
On my computer this method is 22% faster (5.52ms vs 7.15ms).
You can use the named argument min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
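For completeness, here is a self-contained version of the min_count approach that rebuilds the series from the question. It is a sketch only; the variable name result is illustrative.

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'])
series = pd.Series([0., 1., 2., 3.], index=dates)

# min_count=1 makes bins without any observation sum to NaN instead of 0,
# so dropna() removes exactly the empty bins and keeps genuine zero sums
result = series.resample('2D').sum(min_count=1).dropna()
print(result)
```

The first bin sums to a real 0.0 and is kept, which is the case that a plain filter on non-zero sums would get wrong.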
Performance comparison with other methods, from faster to slower (run your own tests, as it may depend on your architecture, platform, environment...):
In [38]: %timeit resampled.sum(min_count=1).dropna()
588 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit (resampled.mean() * resampled.count()).dropna()
622 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit resampled.sum()[resampled.count() != 0].copy()
960 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)