I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are fewer than 5 data-filled cells, take the average of however many cells there are.
Use the justify function for better performance, applied to all columns except the first via DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value to treat as invalid (e.g. np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Assuming you need to move all NaN values to the first columns, I would define a function that takes all the NaN values, places them first and leaves the rest as it is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
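With the NaN values pushed to the front, the original goal (the mean of the last five data-filled cells, or of however many there are) becomes a simple column slice. A minimal sketch, assuming df has already been justified as above and keeps name as its first column:
# NaN cells are skipped by mean(), so rows with fewer than five values
# are simply averaged over however many values they have
df['last5_mean'] = df.iloc[:, -5:].mean(axis=1)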
Related
I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is 1 when the current row has at least 5 consecutive values with the same sign (either positive or negative), counting the row itself.
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for a large dataset.
Do you have any idea how to speed up?
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
sign: a boolean Series indicating whether the value is non-negative (True) or negative (False)
id: groups identical consecutive occurrences together
Finally, we group by the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows with at least 5 consecutive same-sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
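If you want the 0.0/1.0 floats shown in the question rather than booleans, and the helper column is no longer needed, a small finishing step (just a sketch) could be:
# match the 0.0 / 1.0 format from the question and drop the helper column
df['consecutive'] = df['consecutive'].astype(float)
df = df.drop(columns='id')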
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus in total we get an average runtime of: 4.51 ms
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not sure this is fast enough for your data size, but using min/max seems faster.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['consecutive'] = df['val'].rolling(5).apply(
lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a pandas dataframe with long-term data:
point_id issue_date latitude longitude rainfall
0 1.0 2020-01-01 6.5 66.50 NaN
1 2.0 2020-01-02 6.5 66.75 NaN
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN
6373889 17415.0 2020-12-31 38.5 100.00 NaN
6373890 rows × 5 columns
I want to extract the Standard Meteorological Week from its issue_date column, as
given in this figure.
I have tried in 2 ways.
1st
lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')
2nd
lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week
The output in both cases is the same:
point_id issue_date latitude longitude rainfall smw iso
0 1.0 2020-01-01 6.5 66.50 NaN 01 1
1 2.0 2020-01-02 6.5 66.75 NaN 01 1
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN 53 53
6373889 17415.0 2020-12-31 38.5 100.00 NaN 53 53
6373890 rows × 7 columns
The issue is that the week is computed here with Sunday or Monday as the starting day of the week, irrespective of the year.
For example, in 2020 the 1st of January falls on a Wednesday (not a Monday),
so the 1st week has only 5 days, i.e. (Wed, Thu, Fri, Sat & Sun).
year week day issue_date
0 2020 1 3 2020-01-01
1 2020 1 4 2020-01-02
2 2020 1 5 2020-01-03
3 2020 1 6 2020-01-04
... ... ... ...
6373889 2020 53 4 2020-12-31
But in the case of Standard Meteorological Weeks,
I want the output, for every year, to be:
1st week - always from 1st January to 7th January
2nd week - from 8th January to 14th January
3rd week - from 15th January to 21st January
------------------------------- and so on
irrespective of the starting day of the year (Sunday, Monday, etc.).
How to do so?
Use:
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2000-12-31')})
# inspired by https://stackoverflow.com/a/61592907/2901002
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = df['issue_date'].dt.dayofyear
df['smw'] = np.where(df['issue_date'].dt.is_leap_year,
                     leap_year[days - 1],
                     normal_year[days - 1])
print (df[df['smw'] == 9])
issue_date smw
56 2000-02-26 9
57 2000-02-27 9
58 2000-02-28 9
59 2000-02-29 9
60 2000-03-01 9
61 2000-03-02 9
62 2000-03-03 9
63 2000-03-04 9
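A couple of quick sanity checks on the frame built above (nothing beyond what the lookup arrays already encode):
# week 1 always covers 1-7 January
print(df.loc[df['smw'] == 1, 'issue_date'].dt.strftime('%Y-%m-%d').tolist())
# the last week absorbs the extra days, so there is never a week 53
print(df['smw'].max())            # 52
print((df['smw'] == 52).sum())    # 8 days in 2000 (a leap year)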
Performance:
#11323 rows
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})
In [6]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#51500 rows
df = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})
In [9]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
...:
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
...:
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can write a custom function to calculate the Standard Meteorological Weeks.
The normal calculation takes the difference in days from 1st January of the same year, divides by 7 and adds 1.
A special adjustment is made for leap years so that Week No. 9 has 8 days, and another so that the last week of the year has 8 days:
import numpy as np
# convert to datetime format if not already in datetime
df['issue_date'] = pd.to_datetime(df['issue_date'])
def get_smw(date_s):
    # get day-of-the-year minus 1, in range [0..364/365], for division by 7
    days_diff = date_s.dt.dayofyear - 1
    # adjust leap years so that Week No. 9 has 8 days
    # (subtract one day for 29 Feb onwards in the same year)
    leap_adj = date_s.dt.is_leap_year & (date_s > pd.to_datetime(date_s.dt.year.astype(str) + '-02-28'))
    days_diff = np.where(leap_adj, days_diff - 1, days_diff)
    # adjust the last week of the year to have 8 days:
    # cap the value for 31 Dec at 363 instead of 364 to keep it in the same week as 24 Dec
    days_diff = np.clip(days_diff, 0, 363)
    smw = days_diff // 7 + 1
    return smw
df['smw'] = get_smw(df['issue_date'])
Result:
print(df)
point_id issue_date latitude longitude rainfall smw
0 1.0 2020-01-01 6.5 66.50 NaN 1
1 2.0 2020-01-02 6.5 66.75 NaN 1
2 3.0 2020-01-03 6.5 66.75 NaN 1
3 4.0 2020-01-04 6.5 66.75 NaN 1
4 5.0 2020-01-05 6.5 66.75 NaN 1
5 6.0 2020-01-06 6.5 66.75 NaN 1
6 7.0 2020-01-07 6.5 66.75 NaN 1
7 8.0 2020-01-08 6.5 66.75 NaN 2
8 9.0 2020-01-09 6.5 66.75 NaN 2
40 40.0 2020-02-26 6.5 66.75 NaN 9
41 41.0 2020-03-04 6.5 66.75 NaN 9
42 42.0 2020-03-05 6.5 66.75 NaN 10
43 43.0 2020-03-12 6.5 66.75 NaN 11
6373880 17414.0 2020-12-23 38.5 99.75 NaN 51
6373881 17414.0 2020-12-24 38.5 99.75 NaN 52
6373888 17414.0 2020-12-30 38.5 99.75 NaN 52
6373889 17415.0 2020-12-31 38.5 100.00 NaN 52
7000040 40.0 2021-02-26 6.5 66.75 NaN 9
7000041 41.0 2021-03-04 6.5 66.75 NaN 9
7000042 42.0 2021-03-05 6.5 66.75 NaN 10
7000042 43.0 2021-03-12 6.5 66.75 NaN 11
7373880 17414.0 2021-12-23 38.5 99.75 NaN 51
7373881 17414.0 2021-12-24 38.5 99.75 NaN 52
7373888 17414.0 2021-12-30 38.5 99.75 NaN 52
7373889 17415.0 2021-12-31 38.5 100.00 NaN 52
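Both answers implement the same week definition, so they can be cross-checked against each other. A sketch, assuming the lookup arrays and get_smw from above are both in scope:
# compare the lookup-array approach with get_smw on one normal and one leap year
chk = pd.DataFrame({'issue_date': pd.date_range('2019-01-01', '2020-12-31')})
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = chk['issue_date'].dt.dayofyear
chk['smw_lookup'] = np.where(chk['issue_date'].dt.is_leap_year,
                             leap_year[days - 1], normal_year[days - 1])
chk['smw_func'] = get_smw(chk['issue_date'])
print((chk['smw_lookup'] == chk['smw_func']).all())   # expected: True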
I'm trying to run an ANOVA test on a dataframe that looks like this:
>>>code 2020-11-01 2020-11-02 2020-11-03 2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column, grouped by the code. I have used statsmodels and a for loop for that:
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error for the 4th line:
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried to place the Q outside the braces, but I got the same error.
My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can try to reshape it to long format and skip the iteration over columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code":[1,1,2,2,3,3],
"2020-11-01":[22.5,23.1,43.1,41.6,97.3,12.1],
"2020-11-02":[73.1,75.4,72.1,85.1,43.2,44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)
Gives this result:
df sum_sq mean_sq F PR(>F)
variable
2020-11-01 code 1.0 1017.6100 1017.610000 1.115768 0.350405
Residual 4.0 3648.1050 912.026250 NaN NaN
2020-11-02 code 1.0 927.2025 927.202500 6.194022 0.067573
Residual 4.0 598.7725 149.693125 NaN NaN
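As a side note, the Q() attempt from the question fails only because of how the format string is built; quoting the column name inside Q("...") keeps the wide layout and should also work (a sketch using the same statsmodels/patsy setup as the question):
keys = []
tables = []
for variable in df.columns[1:]:
    # Q("...") lets patsy handle column names that are not valid Python identifiers
    model = ols('Q("{}") ~ code'.format(variable), data=df).fit()
    keys.append(variable)
    tables.append(sm.stats.anova_lm(model))
df_anova = pd.concat(tables, keys=keys, axis=0)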
I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs whenever there is a row with all NaN values.
I explored the following links but could not figure out how to apply them to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would have 4 dataframes with 5, 5, 1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby.
First we check if all the values in a row are NaN, then use cumsum to create a group indicator and finally we save these dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [part.dropna() for _, part in df.groupby(grps)]
for part in dfs:
    print(part)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
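If the data blocks themselves may contain the odd NaN that you want to keep, dropping only the all-NaN separator rows is a bit safer; a small variation on the above:
# drop only the rows where every value is NaN, keeping partially filled rows
dfs = [part.dropna(how='all') for _, part in df.groupby(grps)]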
Something like this should do the trick:
import pandas as pd
import numpy as np
data_frame = pd.DataFrame({"a":[1,np.nan,3,np.nan,4,np.nan,5],
"b":[1,np.nan,3,np.nan,4,np.nan,5],
"c":[1,np.nan,3,np.nan,4,np.nan,5],
"d":[1,np.nan,3,np.nan,4,np.nan,5],
"e":[1,np.nan,3,np.nan,4,np.nan,5],
"f":[1,np.nan,3,np.nan,4,np.nan,5]})
all_nan = data_frame.index[data_frame.isnull().all(1)]
df_list = []
prev = 0
for i in all_nan:
df_list.append(data_frame[prev:i])
prev = i+1
for i in df_list:
print(i)
Just another flavor of doing the same thing:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [part.dropna() for part in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
I have a huge .csv file (2.3 GB) which I have to read into a pandas dataframe.
start_date,wind_90.0_0.0,wind_90.0_5.0,wind_87.5_2.5
1948-01-01,15030.64,15040.64,16526.35
1948-01-02,15050.14,15049.28,16526.28
1948-01-03,15076.71,15075.0,16525.28
I want to process the above data into the structure below:
start_date lat lon wind
0 1948-01-01 90.0 0.0 15030.64
1 1948-01-01 90.0 5.0 15040.64
2 1948-01-01 87.5 2.5 16526.35
3 1948-01-02 90.0 0.0 15050.14
4 1948-01-02 90.0 5.0 15049.28
5 1948-01-02 87.5 2.5 16526.28
6 1948-01-03 90.0 0.0 15076.71
7 1948-01-03 90.0 5.0 15075.0
8 1948-01-03 87.5 2.5 16525.28
The code I have so far does what I want, but it is too slow and takes up a lot of memory.
def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    df = pd.melt(df, id_vars=df.columns[0])
    df['lat'] = df['variable'].str.split('_').str[-2]
    df['lon'] = df['variable'].str.split('_').str[-1]
    df = df.drop('variable', axis=1)
    df.columns = ['start_date', featureName, 'lat', 'lon']
    df = df.groupby(['start_date', 'lat', 'lon']).first()
    df = df.reset_index()
    df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d', errors='coerce')
    return df
This should speed up your code:
We can use melt to unpivot the data from wide to long. Then we use str.split with expand=True on the former column names (now in the variable column) to get a new column for each split. Finally, we join these newly created columns back to the original dataframe:
melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1:'lat', 2:'lon'})
final = melt.drop(columns='variable').join(newcols)
Output
start_date value lat lon
0 1948-01-01 15030.64 90.0 0.0
1 1948-01-01 15040.64 90.0 5.0
2 1948-01-01 16526.35 87.5 2.5
3 1948-01-02 15050.14 90.0 0.0
4 1948-01-02 15049.28 90.0 5.0
5 1948-01-02 16526.28 87.5 2.5
6 1948-01-03 15076.71 90.0 0.0
7 1948-01-03 15075.00 90.0 5.0
8 1948-01-03 16525.28 87.5 2.5
Timeit test on 800k rows:
3.55 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
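One optional follow-up, assuming downstream code expects numeric coordinates and real datetimes (as the original load_data_as_pandas produced): after str.split the lat/lon columns are still strings, so they can be cast explicitly:
# cast the split-out coordinates to numeric and parse the dates
final['lat'] = final['lat'].astype('float32')
final['lon'] = final['lon'].astype('float32')
final['start_date'] = pd.to_datetime(final['start_date'], format='%Y-%m-%d', errors='coerce')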