Pandas: group some data - python

I have a dataframe:
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456
And I want to get
id date count
0 123 10-12-2015 0
1 123 11-12-2015 0
2 123 12-12-2015 1
3 123 13-12-2015 1
4 123 14-12-2015 0
5 123 15-12-2015 1
6 123 16-12-2015 1
7 123 17-12-2015 0
8 123 18-12-2015 1
9 456 10-12-2015 1
10 456 11-12-2015 0
11 456 12-12-2015 0
12 456 13-12-2015 1
13 456 14-12-2015 0
14 456 15-12-2015 1
I tried:
df = df.groupby('id').resample('D').size().reset_index(name='val')
But that only generates dates between each id's existing dates. How can I make it cover the same period for every id?

You can achieve what you want by reindexing in the aggregation of each group and filling NaNs with 0.
import io
import pandas as pd
data = io.StringIO("""\
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456""")
df = pd.read_csv(data, delim_whitespace=True)
df['date'] = pd.to_datetime(df['date'], format="%d-%m-%Y")
startdate = df['date'].min()
enddate = df['date'].max()
alldates = pd.date_range(startdate, enddate, freq='D', name='date')
def process_id(g):
    return g.resample('D').size().reindex(alldates).fillna(0)

output = (df.set_index('date')
            .groupby('id')
            .apply(process_id)
            .stack()
            .rename('val')
            .reset_index('id'))
print(output)
# id val
# date
# 2015-12-10 123 0.0
# 2015-12-11 123 0.0
# 2015-12-12 123 1.0
# 2015-12-13 123 1.0
# 2015-12-14 123 0.0
# 2015-12-15 123 1.0
# 2015-12-16 123 1.0
# 2015-12-17 123 0.0
# 2015-12-18 123 1.0
# 2015-12-10 456 1.0
# 2015-12-11 456 0.0
# 2015-12-12 456 0.0
# 2015-12-13 456 1.0
# 2015-12-14 456 0.0
# 2015-12-15 456 1.0
# 2015-12-16 456 0.0
# 2015-12-17 456 0.0
# 2015-12-18 456 0.0
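For reference, a more compact sketch of the same idea (my addition, reusing df and alldates from the answer above): count each (id, date) pair once, then reindex onto the full id × date grid.
counts = df.groupby(['id', 'date']).size()
grid = pd.MultiIndex.from_product([df['id'].unique(), alldates],
                                  names=['id', 'date'])
# fill_value=0 keeps the counts as integers instead of the floats above
out = counts.reindex(grid, fill_value=0).reset_index(name='val')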

Related

index-1 on iterrows makes a new row at the end

Based on this table:
pernr  plans  mnth  jum_mnth
123    000    1     NaN
123    001    3     NaN
123    001    6     NaN
789    002    10    NaN
789    003    2     NaN
789    003    2     NaN
789    002    2     NaN
I want to set 'jum_mnth' from 'mnth'. 'jum_mnth' should have a value only if:
it is the last row of a run of the same plans, or
it is the last row for the same pernr.
So I tried:
for index, row in que.iterrows():
    if row['pernr'] != nipp:
        que_cop.at[index-1, 'jum_mnth'] = mon
        nipp = row['pernr']
        plan = row['plans']
        mon = row['mnth']
    else:
        if row['plans'] == plan:
            mon = mon + row['mnth']
        else:
            que_cop.at[index-1, 'jum_mnth'] = mon
            print(str(nipp), plan, str(mon))
            plan = row['plans']
            mon = row['mnth']
    if index == que_cop.index[-2]:
        que_cop.at[index, 'jum_mnth'] = mon
But it results in a new row (index -1) at the end, like this:
pernr  plans  mnth  jum_mnth
123    000    1     1.0
123    001    3     NaN
123    001    6     9.0
789    002    10    10.0
789    003    2     NaN
789    003    2     4.0
789    002    2     NaN
NaN    NaN    NaN   0.0
and the last real row didn't get a jum_mnth (it should have one).
Expected:
pernr  plans  mnth  jum_mnth
123    000    1     1
123    001    3     NaN
123    001    6     9
789    002    10    10
789    003    2     NaN
789    003    2     4
789    002    2     2
So what happened? Any help would be appreciated.
You can use:
grp = (df[['pernr', 'plans']].ne(df[['pernr', 'plans']].shift())
         .any(axis=1).cumsum()
       )
g = df.groupby(grp)['mnth']
df['jum_mnth'] = g.transform('sum').where(g.cumcount(ascending=False).eq(0))
Output:
pernr plans mnth jum_mnth
0 123 000 1 1.0
1 123 001 3 NaN
2 123 001 6 9.0
3 789 002 10 10.0
4 789 003 2 NaN
5 789 003 2 4.0
6 789 002 2 2.0
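To see why this works: grp labels each consecutive run of identical (pernr, plans) pairs, and g.cumcount(ascending=False).eq(0) is True only on the last row of each run, so the run's total survives there and every other row becomes NaN. A quick way to inspect the grouping key (the values shown are what the sample data produces):
print(grp.tolist())
# [1, 2, 2, 3, 4, 4, 5]  -> one label per consecutive (pernr, plans) run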

Python - Date Diff On An Anchor Date

I am trying to find the date diff between my anchor date and the other dates grouping by ID.
Input
ID Date Anchor Date
123 1/5/2018 N
123 4/10/2018 N
123 5/8/2018 Y
123 10/12/2018 N
234 1/4/2018 N
234 1/4/2018 N
234 1/4/2018 Y
456 5/6/2018 N
456 5/6/2018 N
456 5/10/2018 N
456 6/1/2018 Y
567 3/2/2018 N
567 3/2/2018 N
567 3/2/2018 Y
Expected Output:
ID Date Anchor Date Diff
123 1/5/2018 N -123
123 4/10/2018 N -28
123 5/8/2018 Y 0
123 10/12/2018 N 157
234 1/4/2018 N 0
234 1/4/2018 N 0
234 1/4/2018 Y 0
456 5/6/2018 N -26
456 5/6/2018 N -26
456 5/10/2018 N -22
456 6/1/2018 Y 0
567 3/2/2018 N 0
567 3/2/2018 N 0
567 3/2/2018 Y 0
Code Attempt
import pandas as pd
df = pd.read_csv()
df['Date'] = df.groupby('ID')['Date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('ID')['Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
The error I am receiving is "incompatible index of inserted column with frame index."
And secondly, I am not sure how to incorporate the Anchor Date column to ensure that is used for time zero.
First you need to convert Date into datetime type:
df['Date'] = pd.to_datetime(df['Date'])
After that, you can extract the index of each group's anchor date with idxmax, then use loc to pull out the actual dates:
import numpy as np

idx = df['Anchor Date'].eq('Y').groupby(df['ID']).transform('idxmax')
df['Diff'] = (df['Date'] - df.loc[idx, 'Date'].values) / np.timedelta64(1, 'D')
Another way is to extract the anchor dates with boolean indexing and map them onto each row's ID:
anchor_dates = df.loc[df['Anchor Date']=='Y', ['ID','Date']].set_index('ID')['Date']
df['Diff'] = (df['Date'] - df['ID'].map(anchor_dates)) / np.timedelta64(1, 'D')
Output:
ID Date Anchor Date Diff
0 123 2018-01-05 N -123.0
1 123 2018-04-10 N -28.0
2 123 2018-05-08 Y 0.0
3 123 2018-10-12 N 157.0
4 234 2018-01-04 N 0.0
5 234 2018-01-04 N 0.0
6 234 2018-01-04 Y 0.0
7 456 2018-05-06 N -26.0
8 456 2018-05-06 N -26.0
9 456 2018-05-10 N -22.0
10 456 2018-06-01 Y 0.0
11 567 2018-03-02 N 0.0
12 567 2018-03-02 N 0.0
13 567 2018-03-02 Y 0.0
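A third sketch (my addition, not from the answers above): keep the anchor date only on 'Y' rows, broadcast it within each ID group with groupby.transform, and subtract.
# anchor holds the date on 'Y' rows and NaT elsewhere
anchor = df['Date'].where(df['Anchor Date'].eq('Y'))
df['Diff'] = (df['Date'] - anchor.groupby(df['ID']).transform('first')).dt.days
transform('first') skips the NaT values, so every row in a group receives that group's single anchor date.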

How to merge DataFrames based on one column while adding another

I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
so that the QUANTITY column of df1 would become 13, 4 and 10.
Thx in advance for any help provided!
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0
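A shorter sketch of the same operation (my addition, assuming the three key columns identify each row uniquely): index both frames by the keys and let add align them.
keys = ['ID', 'FILLER1', 'FILLER2']
# fill_value=0 plays the role of the fillna(0) above for rows missing from df2
result = (df1.set_index(keys)['QUANTITY']
             .add(df2.set_index(keys)['QUANTITY'], fill_value=0)
             .reset_index())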

Python dataframe: pivot on same column

I have two columns "ID" and "division" as shown below.
df = pd.DataFrame(np.array([['111', 'AAA'],['222','AAA'],['333','BBB'],['444','CCC'],['444','AAA'],['222','BBB'],['111','BBB']]),columns=['ID','division'])
ID division
0 111 AAA
1 222 AAA
2 333 BBB
3 444 CCC
4 444 AAA
5 222 BBB
6 111 BBB
The expected output is shown below: I need to pivot on the same column, with the count depending on "division". This will then be presented as a heatmap.
df = pd.DataFrame(np.array([['0','2','1','1'],['2','0','1','1'],['1','1','0','0'],['1','1','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 2 1 1
222 2 0 1 1
333 1 1 0 0
444 1 1 0 0
So, technically I am computing the overlap between IDs with respect to division.
Example:
The overlap between IDs 111 and 222 is 2 (they share AAA and BBB), whereas the overlap between 111 and 444 is 1 (they share only AAA).
I could do this in Excel in two steps (not sure if it helps):
Step 1: =SUM(COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,$G2),COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,H$1))-1
Step 2: =IF($G12=H$1,0,SUMIFS(H$2:H$8,$G$2:$G$8,$G12))
But is there any way to do it in Python using dataframes?
I'd appreciate your help.
Case 2
If the dataframe also has a count column:
df = pd.DataFrame(np.array([['111', 'AAA', '4'], ['222', 'AAA', '5'], ['333', 'BBB', '6'],
                            ['444', 'CCC', '3'], ['444', 'AAA', '2'], ['222', 'BBB', '2'],
                            ['111', 'BBB', '7']]), columns=['ID', 'division', 'count'])
ID division count
0 111 AAA 4
1 222 AAA 5
2 333 BBB 6
3 444 CCC 3
4 444 AAA 2
5 222 BBB 2
6 111 BBB 7
Expected output would be
df_result = pd.DataFrame(np.array([['0','18','13','6'],['18','0','8','7'],['13','8','0','0'],['6','7','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 18 13 6
222 18 0 8 7
333 13 8 0 0
444 6 7 0 0
Calculation: here 111 and 222 overlap on divisions AAA and BBB, hence the sum is 4+5+2+7=18.
Another way to do this is to use a self join with merge and pd.crosstab:
df_out = df.merge(df, on='division')
results = pd.crosstab(df_out.ID_x, df_out.ID_y)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 2.0 1.0 1.0
222 2.0 0.0 1.0 1.0
333 1.0 1.0 0.0 0.0
444 1.0 1.0 0.0 0.0
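For the plain count case there is also a linear-algebra sketch (my addition): build the ID × division incidence matrix with crosstab and multiply it by its transpose, then zero the diagonal as above.
m = pd.crosstab(df['ID'], df['division'])  # 0/1 incidence matrix
results = m.dot(m.T)                       # number of shared divisions per ID pair
np.fill_diagonal(results.values, 0)
Note this does not carry over to the weighted Case 2 below, which needs pair sums rather than products.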
Case 2
df = pd.DataFrame(np.array([['111', 'AAA', '4'], ['222', 'AAA', '5'], ['333', 'BBB', '6'],
                            ['444', 'CCC', '3'], ['444', 'AAA', '2'], ['222', 'BBB', '2'],
                            ['111', 'BBB', '7']]), columns=['ID', 'division', 'count'])
df['count'] = df['count'].astype(int)
df_out = df.merge(df, on='division')
df_out = df_out.assign(count = df_out.count_x + df_out.count_y)
results = pd.crosstab(df_out.ID_x, df_out.ID_y, df_out['count'], aggfunc='sum').fillna(0)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 18.0 13.0 6.0
222 18.0 0.0 8.0 7.0
333 13.0 8.0 0.0 0.0
444 6.0 7.0 0.0 0.0

Indexing/Binning Time Series

I have a dataframe like below:
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5 days intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you may first call assign for a copy-based assignment, and then do the groupby:
df['Bin'] = df.assign(
    Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range and use numpy.digitize.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')\
                .astype(np.int64).values
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
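A third sketch (my addition, assuming Date has already been converted to datetime): integer-divide each row's day offset from the earliest date by the bin width.
df['Bin'] = (df['Date'] - df['Date'].min()).dt.days // 5 + 1
Unlike ngroup, which numbers only non-empty bins, this also counts empty 5-day intervals, so the two can differ on sparser data (they agree on this sample).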
