If I have a dataframe like:
hour x
1 122
1 133
2 234
...
24 232
How can I get a dataframe like:
hour x1 x2 ... xn
1 122 133
2 234
...
24
You could create a cumulative count per group and pivot:
(df.assign(group=df.groupby('hour').cumcount().add(1))
.pivot(index='hour', columns='group', values='x')
.add_prefix('x')
.rename_axis(None)
)
Output:
group x1 x2
1 122.0 133.0
2 234.0 NaN
24 232.0 NaN
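For a self-contained run, here is a minimal sketch that rebuilds the sample frame from the values shown above (only the visible rows; the hours elided in the question are omitted):
import pandas as pd

df = pd.DataFrame({'hour': [1, 1, 2, 24], 'x': [122, 133, 234, 232]})

out = (df.assign(group=df.groupby('hour').cumcount().add(1))
         .pivot(index='hour', columns='group', values='x')
         .add_prefix('x')
         .rename_axis(None)
      )

# The NaN fill forces float columns; nullable integers restore the original dtype
out = out.astype('Int64')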
Related
I have a df which has object ID as the index, and then x values and y values in the columns, giving coordinates for where the object moved over time. For example:
id x y
1 100 400
1 110 390
1 115 385
2 110 380
2 115 380
3 200 570
3 210 580
I would like to calculate the change in x and the change in y for each object, so I can see the direction (e.g. north-east) and how linear or non-linear each route is. I can then filter out objects moving in a way I am not interested in.
How do I create a loop which loops over each object (aka ID) separately? For example, trying something like for i in range(len(df)) would loop over the entire number of rows; it would not discriminate based on ID.
Thank you
# if id is your index, fix that:
df = df.reset_index()
# groupby id, getting the difference row by row within each group:
df[['chngX', 'chngY']] = df.groupby('id')[['x', 'y']].diff()
print(df)
Output:
id x y chngX chngY
0 1 100 400 NaN NaN
1 1 110 390 10.0 -10.0
2 1 115 385 5.0 -5.0
3 2 110 380 NaN NaN
4 2 115 380 5.0 0.0
5 3 200 570 NaN NaN
6 3 210 580 10.0 10.0
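Since the stated goal is direction and linearity, one way to build on the chngX/chngY columns above is to aggregate them per ID. This is a sketch under my own assumptions: the compass convention (0 degrees = east, 90 = north) and the straightness ratio are illustrative choices, not from the question.
import numpy as np

# Net displacement per object
net = df.groupby('id')[['chngX', 'chngY']].sum()

# Bearing of the net movement in degrees (0 = east, 90 = north)
net['angle'] = np.degrees(np.arctan2(net['chngY'], net['chngX']))

# Straightness: net distance divided by total path length.
# 1.0 is a perfectly straight route; lower values mean more wandering.
step_len = np.hypot(df['chngX'], df['chngY'])
net['straightness'] = np.hypot(net['chngX'], net['chngY']) / step_len.groupby(df['id']).sum()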
I am trying to find the date difference between my anchor date and the other dates, grouping by ID.
Input
ID Date Anchor Date
123 1/5/2018 N
123 4/10/2018 N
123 5/8/2018 Y
123 10/12/2018 N
234 1/4/2018 N
234 1/4/2018 N
234 1/4/2018 Y
456 5/6/2018 N
456 5/6/2018 N
456 5/10/2018 N
456 6/1/2018 Y
567 3/2/2018 N
567 3/2/2018 N
567 3/2/2018 Y
Expected Output:
ID Date Anchor Date Diff
123 1/5/2018 N -123
123 4/10/2018 N -28
123 5/8/2018 Y 0
123 10/12/2018 N 157
234 1/4/2018 N 0
234 1/4/2018 N 0
234 1/4/2018 Y 0
456 5/6/2018 N -26
456 5/6/2018 N -26
456 5/10/2018 N -22
456 6/1/2018 Y 0
567 3/2/2018 N 0
567 3/2/2018 N 0
567 3/2/2018 Y 0
Code Attempt
import pandas as pd
df = pd.read_csv()
df['Date'] = df.groupby('ID')['Date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('ID')['Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
The error I am receiving is "incompatible index of inserted column with frame index."
And secondly, I am not sure how to incorporate the Anchor Date column to ensure it is used as time zero.
First you need to convert Date into datetime type (numpy is also imported here, since the day conversion below relies on it):
import numpy as np

df['Date'] = pd.to_datetime(df['Date'])
After that, you can extract the index of each ID's Anchor Date row with idxmax, then use loc to extract the actual anchor dates:
# index of the first 'Y' row within each ID, broadcast to every row of that ID
idx = df['Anchor Date'].eq('Y').groupby(df['ID']).transform('idxmax')
# subtract each row's anchor date and convert the timedelta to days
df['Diff'] = (df['Date'] - df.loc[idx, 'Date'].values) / np.timedelta64(1, 'D')
Another way is to extract the anchor Dates with boolean indexing and map them onto ID:
anchor_dates = df.loc[df['Anchor Date']=='Y', ['ID','Date']].set_index('ID')['Date']
df['Diff'] = (df['Date'] - df['ID'].map(anchor_dates)) / np.timedelta64(1, 'D')
Output:
ID Date Anchor Date Diff
0 123 2018-01-05 N -123.0
1 123 2018-04-10 N -28.0
2 123 2018-05-08 Y 0.0
3 123 2018-10-12 N 157.0
4 234 2018-01-04 N 0.0
5 234 2018-01-04 N 0.0
6 234 2018-01-04 Y 0.0
7 456 2018-05-06 N -26.0
8 456 2018-05-06 N -26.0
9 456 2018-05-10 N -22.0
10 456 2018-06-01 Y 0.0
11 567 2018-03-02 N 0.0
12 567 2018-03-02 N 0.0
13 567 2018-03-02 Y 0.0
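As a side note, dividing by np.timedelta64(1, 'D') gives float days; the timedelta accessor is an equivalent spelling that avoids numpy and returns whole days as integers:
df['Diff'] = (df['Date'] - df['ID'].map(anchor_dates)).dt.days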
I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
Merging these should result in the QUANTITY column of df1 becoming 13, 4 and 10.
Thanks in advance for any help!
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0
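A more compact variant of the same idea, assuming the three key columns uniquely identify rows, is to align on the keys and add with a fill value. Note that, unlike the left join above, this also keeps keys that appear only in df2:
keys = ['ID', 'FILLER1', 'FILLER2']
result = (df1.set_index(keys)['QUANTITY']
             .add(df2.set_index(keys)['QUANTITY'], fill_value=0)
             .astype(int)
             .reset_index())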
My goal is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1 and 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year or belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure how to get the value offset by a year within each group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create a copy that will hold the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
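For reference, a sketch that reconstructs the sample frame from the question, so the snippet above runs end to end (only the rows shown; the elided middle periods are omitted):
import pandas as pd

df = pd.DataFrame({
    'Period':   [20130101, 20130101, 20130101, 20130201, 20130201, 20130201, 20130201,
                 20140101, 20140101, 20140101, 20140201, 20140201, 20140201],
    'ID':       [1, 2, 3, 1, 2, 3, 5, 1, 2, 5, 1, 2, 5],
    'Amount':   [100, 150, 100, 90, 140, 95, 250, 40, 70, 160, 35, 65, 150],
    'Category': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
})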
I have two pandas dataframes, df1 and df2, where I need to derive df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() from df3 and passing the Series directly to map:
import numpy as np

s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is to cast the boolean mask to integer, converting True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print(df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot the quotes around sum_column:
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
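If the quoted version still prints only NaN, a common culprit (a guess, since the question doesn't show dtypes) is that df1['code'] and df3['c_id'] have different dtypes, e.g. strings on one side and integers on the other, so map finds no matches:
# Hypothetical fix: align the key dtypes before mapping
df1['code'] = df1['code'].astype(int)
df3 = df3.astype({'c_id': int})
print(df1['code'].map(df3.set_index('c_id')['sum_column']))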