I have an example of input data in input.xlsx: id and region columns followed by monthly columns (201801 through 201906).
I need to add 2 columns, "Begin_date" and "End_date", based on the data in each row:
Begin_date - the first month with a non-empty value (all previous cells are empty), formatted as yyyymm01
End_date - the last month with a non-empty value (all subsequent cells are empty):
if there are no empty subsequent cells (values run to the last column), add a "lifelong" date instead: "99991231"
otherwise use the last day of that month - yyyymm30, 31 or 28, depending on the month
Example of output:
I would appreciate any ideas :) Thank you
Use pd.melt(), then sort the data by id and date:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd

df = pd.read_excel("input.xlsx")
max_date = df.columns[-1]

# reshape to long format (one row per id/region/month) and drop empty months
res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res = res.dropna(subset=['value']).sort_values(by=['id', 'variable'])

# first and last non-empty month per id
minimum_date = (res.drop_duplicates(subset=['id'], keep='first')
                   .rename(columns={'variable': 'start_date'}))
maximum_date = (res.drop_duplicates(subset=['id'], keep='last')
                   .rename(columns={'variable': 'end_date'}))

df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')

# ids still active in the last available month get the "lifelong" date
df['end_date'] = np.where(df['end_date'] == max_date, "99991231", df['end_date'])

# begin is the first day of the month, end is the last day of the month;
# "99991231" does not parse with %Y%m, so it becomes NaT and is restored below
df['start_date'] = pd.to_datetime(df['start_date'], format="%Y%m", errors='coerce').astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'], format="%Y%m", errors='coerce') + MonthEnd(1)).astype(str)
df['end_date'] = np.where(df['end_date'] == 'NaT', "99991231", df['end_date'])
print(df)
       id  region  201801  201802  ...  201905  201906  start_date    end_date
0  100001     628     NaN     NaN  ...    26.0    23.0  2018-09-01    99991231
1  100002    1149    27.0    24.0  ...    26.0    24.0  2018-01-01    99991231
2  100003    1290    26.0    26.0  ...    27.0    25.0  2018-01-01    99991231
3  100004     955    25.0    26.0  ...     NaN     NaN  2018-01-01  2018-12-31
4  100005    1397    15.0    25.0  ...     NaN     NaN  2018-01-01  2018-11-30
5  100006    1397    15.0    25.0  ...     NaN     NaN  2018-01-01  2019-02-28
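If the compact yyyymmdd strings from the question are preferred over the ISO format above, one extra step (a small sketch, assuming the df produced by the code above) is to strip the dashes; "99991231" contains no dashes, so it passes through unchanged:
df['start_date'] = df['start_date'].str.replace('-', '', regex=False)
df['end_date'] = df['end_date'].str.replace('-', '', regex=False)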
The idea is to convert the non-datetimelike columns to a MultiIndex with DataFrame.set_index, and then convert the remaining columns to datetimes:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
Then create the new columns with DataFrame.assign. For the begin date, filter the January columns, compare for non-missing values and take the first hit with DataFrame.idxmax, then convert to the output format with Series.dt.strftime. For the end date, first reverse the column order with ::-1 indexing and get the last non-missing value the same way, convert it to the last day of the month, and finally substitute the default value where the last column is non-missing, using Series.where:
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
It is also possible to create valid datetimes in both new columns by using Timestamp.max with Timestamp.floor:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 2019-01-01 2262-04-11
1 24.0 2018-01-01 2262-04-11
2 25.0 2018-01-01 2262-04-11
3 NaN 2018-01-01 2018-12-31
4 NaN 2018-01-01 2018-11-30
5 NaN 2018-01-01 2019-02-28
[6 rows x 22 columns]
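For reference, pd.Timestamp.max is the largest timestamp representable at nanosecond precision, which is why the "lifelong" dates above appear as 2262-04-11; a quick check:
print (pd.Timestamp.max)            # 2262-04-11 23:47:16.854775807
print (pd.Timestamp.max.floor('d')) # 2262-04-11 00:00:00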
I have a dataframe in the following format:
A B C D
2020-11-18 64.0 74.0 34.0 57.0
2020-11-20 NaN 71.0 NaN 58.0
2020-11-23 NaN 11.0 NaN NaN
2020-11-25 69.0 NaN NaN 0.0
2020-11-27 NaN 37.0 19.0 NaN
2020-11-29 63.0 NaN NaN 85.0
2020-12-03 NaN 73.0 NaN 49.0
2020-12-10 NaN NaN 32.0 NaN
2020-12-22 52.0 90.0 33.0 24.0
2020-12-23 NaN 96.0 NaN NaN
2020-12-28 78.0 NaN NaN 68.0
2020-12-29 17.0 70.0 NaN 16.0
2021-01-03 51.0 43.0 NaN 66.0
I want to obtain a new dataframe that contains the last non-NaN values for each month in each column:
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
I tried grouping by month and applying a lambda that returns the in-group maximum index like so:
df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]
which yields:
A B C D
2020-11-29 63.0 NaN NaN 85.0
2020-12-29 17.0 70.0 NaN 16.0
This returns the values that appear on the last day in each month but not the last non-NaN value. In case the value for the last day in a particular month is a NaN, I will have a NaN appearing here. Instead, I'd only like to have NaN values present if there are absolutely no values for that particular month in that column.
Use GroupBy.last:
df = df.groupby(df.index.to_period('M')).last()
print (df)
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
2021-01 51.0 43.0 NaN 66.0
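This works because GroupBy.last skips NaN values within each group by default. A minimal sketch (a small frame shaped like the question's, with values assumed for illustration) to confirm:
import numpy as np
import pandas as pd

# the last row of 2020-11 holds NaN in column B, yet last() still finds 11.0
df = pd.DataFrame({'A': [64.0, np.nan, 63.0, 17.0],
                   'B': [74.0, 11.0, np.nan, 70.0]},
                  index=pd.to_datetime(['2020-11-18', '2020-11-23',
                                        '2020-11-29', '2020-12-29']))
print (df.groupby(df.index.to_period('M')).last())
            A     B
2020-11  63.0  11.0
2020-12  17.0  70.0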
I have this dataframe:
import pandas as pd

hour = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
visitor = [4,6,2,4,3,7,5,7,8,3,2,8,3,6,4,5,1,8,9,4,2,3,4,1]
df = pd.DataFrame({"Hour": hour, "Total_Visitor": visitor})
print(df)
I applied a rolling sum with a window of 6:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
The first 5 rows give NaN values.
The problem is that I want to know the total number of visitors from 9pm to 3am, so I have to sum the total visitors from hour 21, wrapping back around through hour 0 and up to hour 3.
How do you do that automatically with rolling?
I think you need to prepend the last N rows, then use rolling and keep only the last len(df) rows:
N = 6
# DataFrame.append was removed in pandas 2.0, so prepend via pd.concat
df_roll = pd.concat([df.iloc[-N:], df]).rolling(N).sum().iloc[-len(df):]
print (df_roll)
Hour Total_Visitor
0 105.0 18.0
1 87.0 20.0
2 69.0 20.0
3 51.0 21.0
4 33.0 20.0
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
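Note that rolling sums every numeric column, including Hour (the 105.0 at hour 0 is the sum of hours 19 through 23 plus hour 0). If only the visitor counts matter, one option (same assumptions as above) is to restrict to that column first:
N = 6
s = df['Total_Visitor']
s_roll = pd.concat([s.iloc[-N:], s]).rolling(N).sum().iloc[-len(s):]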
Compare with the original solution:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
Hour Total_Visitor
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
A NumPy alternative with strides is more complicated, but faster for a large Series:
import numpy as np
import pandas as pd

def rolling_window(a, window):
    # build a (len(a) - window + 1, window) view of overlapping windows
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

fv = pd.Series([1, 2, 3, 4, 5])
N = 3
# prepend the last N-1 values so the first windows wrap around cyclically
x = np.concatenate([fv.to_numpy()[-N + 1:], fv.to_numpy()])
cv = pd.Series(rolling_window(x, N).sum(axis=1), index=fv.index)
print (cv)
0    10
1     8
2     6
3     9
4    12
dtype: int64
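On NumPy 1.20+, sliding_window_view builds the same window matrix without manual stride arithmetic; a minimal equivalent sketch:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

fv = pd.Series([1, 2, 3, 4, 5])
N = 3
x = np.concatenate([fv.to_numpy()[-N + 1:], fv.to_numpy()])
# each row of the view is one length-N window of x
cv = pd.Series(sliding_window_view(x, N).sum(axis=1), index=fv.index)
print (cv)   # 10, 8, 6, 9, 12 as above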
Though you mentioned a Series, see if this is helpful:
import pandas as pd

def cyclic_roll(s, n):
    # append the first n-1 values so windows can wrap around the end
    s = pd.concat([s, s.iloc[:n - 1]])  # Series.append was removed in pandas 2.0
    result = s.rolling(n).sum()
    # move the wrap-around sums to the front, restoring the original order
    return pd.concat([result.iloc[-n + 1:], result.iloc[n - 1:-n + 1]])

fv = pd.DataFrame([1, 2, 3, 4, 5])
cv = fv.apply(cyclic_roll, n=3)
cv.reset_index(inplace=True, drop=True)
print(cv)
Output
0
0 10.0
1 8.0
2 6.0
3 9.0
4 12.0
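For the hourly question above, the same helper could be applied with n=6 (a sketch, reusing the df built in the question):
# each row then holds the sum of that hour and the 5 preceding hours,
# wrapping around midnight (e.g. hour 2 covers hours 21 through 2)
cv = df[['Total_Visitor']].apply(cyclic_roll, n=6).reset_index(drop=True)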
Given a dataframe as follows:
id value1 value2
0 3918703 62.0 64.705882
1 3919144 60.0 60.000000
2 3919534 62.5 30.000000
3 3919559 55.0 55.000000
4 3920438 82.0 82.031250
5 3920463 71.0 71.428571
6 3920502 70.0 69.230769
7 3920535 80.0 40.000000
8 3920674 62.0 62.222222
9 3920856 80.0 79.987176
I want to check whether value2 is within plus or minus 10% of value1, and return a new column results_review.
If it's not in the required range, indicate "no" as the results_review value.
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
How can I do that in Pandas? Thanks in advance for your help.
Use Series.between with DataFrame.loc:
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1))
df.loc[~m, 'results_review'] = 'no'
print(df)
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
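Note that Series.between is inclusive on both ends by default. If the plus/minus 10% bounds should be strict instead, the inclusive parameter can be changed (a sketch, assuming the same df; string values for inclusive require pandas 1.3+):
# inclusive accepts 'both' (default), 'neither', 'left' and 'right'
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1),
                         inclusive='neither')
df.loc[~m, 'results_review'] = 'no'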
I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row, to the last row of each ID (groupby),
and to fill the missing values with NaN.
For example: ID A has two rows on 16-07-19, and 21-07-19.
After the implementation, (s)he should have 6 rows on 16-21 of July, 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0
Use DataFrame.asfreq per group, working with a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea, with DataFrame.reindex per group:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()
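If the result should match the question's dd-mm-yy display format rather than ISO dates, a final formatting step could be added (a sketch, assuming the asfreq result above, where the dates live in the date column):
# back to the question's dd-mm-yy display, e.g. 16-07-19
df['date'] = df['date'].dt.strftime('%d-%m-%y')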
Here is my sort jitsu:
def Sort_by_date(dataf):
    # rule 1: sort by date
    dataf['Current'] = pd.to_datetime(dataf.Current)
    dataf = dataf.sort_values(by=['Current'], ascending=True)
    # rule 2: keep only rows inside the date window
    Mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
    dataf = dataf.loc[Mask]
    return dataf
You can modify this code to sort by date for your solution.
Next, let's group by ID:
Week1 = WeeklyDF.groupby('ID')
Week1_Report = Week1[['ID', 'date', 'X1', 'X2', 'Y']]
Week1_Report
Lastly, let's replace the NaNs:
Week1_Report['X1'].fillna("X1 is 0", inplace=True)
Week1_Report['X2'].fillna("X2 is 0", inplace=True)
Week1_Report['Y'].fillna("Y is 0", inplace=True)
I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A':[22,2,np.NaN,np.NaN],
'B':[23,4,np.NaN,np.NaN],
'C':[24,6,np.NaN,np.NaN],
'D':[25,8,np.NaN,np.NaN]})
df2 = pd.DataFrame({'A':[np.NaN,np.NaN,56,100],
'B':[np.NaN,np.NaN,58,101],
'C':[np.NaN,np.NaN,59,102],
'D':[np.NaN,np.NaN,60,103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update:
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
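The three options differ slightly: combine_first aligns on the union of both indexes and columns, fillna keeps df1's labels, and update modifies df1 in place and returns None. A quick check:
out1 = df1.combine_first(df2)   # new frame, union of row/column labels
out2 = df1.fillna(df2)          # new frame, df1's labels only
ret = df1.update(df2)           # in place; df1 itself is modified
print (ret is None)             # True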