"Rank" DataFrame columns per row - python

Given a time series DataFrame, is it possible to create a new DataFrame with the same dimensions whose values are the rank of each value within its row, compared with the other columns (smallest value first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help. However, it outputs the positions that would sort the row rather than the ranks, and I have not managed to get it to work.
Many thanks

Use a double argsort to rank within each row, then pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
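The two approaches agree on the sample data, but they can differ when a row contains ties. Here is a minimal sketch to illustrate, using the first two rows of the question's data plus a small made-up frame with a tie. The helper name rank_by_argsort and the frame tied are hypothetical, introduced only for this example, and kind="stable" is added so the tie order is deterministic:

```python
import pandas as pd

# First two rows of the question's data.
df = pd.DataFrame(
    {
        "ABC": [0.110541, 0.007012],
        "DEFG": [0.007615, 0.042854],
        "HIJK": [0.063217, 0.061271],
        "XYZ": [0.002543, 0.007988],
    },
    index=pd.DatetimeIndex(["2018-01-14", "2018-01-21"], name="date"),
)


def rank_by_argsort(frame):
    # Hypothetical helper: argsort twice along the columns, then shift to 1-based ranks.
    values = frame.to_numpy()
    ranks = values.argsort(axis=1, kind="stable").argsort(axis=1) + 1
    return pd.DataFrame(ranks, index=frame.index, columns=frame.columns)


by_argsort = rank_by_argsort(df)
by_rank = df.rank(axis=1, method="dense").astype(int)

# Made-up row with a tie between columns a and b.
tied = pd.DataFrame({"a": [1.0], "b": [1.0], "c": [2.0]})
tied_argsort = rank_by_argsort(tied)                         # ties get distinct ranks
tied_dense = tied.rank(axis=1, method="dense").astype(int)   # ties share a rank
```

With distinct values both give the same ranks. With ties, the double argsort always hands out 1..n once each, while rank(method='dense') gives tied values the same rank, so pick whichever matches what you need.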

Related

Replace NaN values with values from other table

Please help.
My first table looks like:
id val1 val2
0 4 30
1 5 NaN
2 3 10
3 2 8
4 3 NaN
My second table looks like
id val1 val2_estimate
0 1 8
1 2 12
2 3 13
3 4 16
4 5 22
I want to replace the NaN values in the 1st table with the estimated values from column val2_estimate of the 2nd table, matching rows where val1 is the same. The val1 values in the 2nd table are unique. The end result needs to look like this:
id val1 val2
0 4 30
1 5 22
2 3 10
3 2 8
4 3 13
I want to replace NaN values only.
Use merge to look up the matching estimate from df2 for each row of df, then use fillna:
df['val2'] = df['val2'].fillna(
df.merge(df2, on=['val1'], how='left')['val2_estimate'])
df
id val1 val2
0 0 4 30.0
1 1 5 22.0
2 2 3 10.0
3 3 2 8.0
4 4 3 13.0
Many ways to skin a cat, this is one of them.
Use fillna with map from a pd.Series created using set_index:
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
df
Output:
val1 val2
id
0 4 30.0
1 5 22.0
2 3 10.0
3 2 8.0
4 3 13.0
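For anyone who wants to run this end to end, here is a small self-contained sketch of the fillna + map approach using the tables from the question. The variable names estimates and filled are just for illustration:

```python
import numpy as np
import pandas as pd

# Tables from the question.
df = pd.DataFrame(
    {"id": [0, 1, 2, 3, 4],
     "val1": [4, 5, 3, 2, 3],
     "val2": [30, np.nan, 10, 8, np.nan]}
)
df2 = pd.DataFrame(
    {"id": [0, 1, 2, 3, 4],
     "val1": [1, 2, 3, 4, 5],
     "val2_estimate": [8, 12, 13, 16, 22]}
)

# Lookup Series: val1 -> val2_estimate (val1 is unique in df2).
estimates = df2.set_index("val1")["val2_estimate"]

# map() keeps df's own index, so fillna aligns row by row;
# only the NaN entries of val2 are replaced.
filled = df["val2"].fillna(df["val1"].map(estimates))
df["val2"] = filled
```

One design note: map keeps the original index of df, so this works whatever the index is. The merge variant returns a frame with a fresh 0..n-1 index, so it lines up with df only when df also has the default index.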

pandas number of items in one column per value in another column

I have two dataframes. say for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
frame 2 is each interaction the student has with a program
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database on, and the sum of all the clicks on those days, e.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1, or figure out how to sum the number of clicks. any ideas?
First we aggregate your df2 to the desired information using GroupBy.agg. Then we merge that information into df1:
agg = df2.groupby('student_id').agg(
no_days=('day', 'size'),
total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
Or, if you like one-liners, here is the same method as a single chained expression, in a more SQL-like style:
df1.merge(
df2.groupby('student_id').agg(
no_days=('day', 'size'),
total_clicks=('number_of_clicks', 'sum')
),
on='student_id',
how='left'
).fillna(0)
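A small variation, in case it helps: the question asks for the number of different days, and the output above comes back as floats. Below is a sketch that assumes 'nunique' is what is wanted for the day count (it only differs from 'size' if a student has several rows for the same day) and casts the filled columns back to integers. The names summary and result are made up for the example:

```python
import pandas as pd

# Frames from the question.
df1 = pd.DataFrame(
    {"student_id": [1, 2, 3, 4, 5, 6],
     "course": ["a", "b", "c", "a", "f", "f"]}
)
df2 = pd.DataFrame(
    {"student_id": [1, 1, 1, 2, 2, 2, 2, 3],
     "day": [4, 5, 7, 3, 4, 8, 9, 2],
     "number_of_clicks": [60, 34, 87, 33, 29, 213, 46, 103]}
)

# One row per student: distinct days and total clicks.
summary = df2.groupby("student_id").agg(
    no_days=("day", "nunique"),
    total_clicks=("number_of_clicks", "sum"),
)

# Left merge keeps students without any interaction; fill their NaNs with 0
# and cast back to int (the NaNs force a float dtype after the merge).
result = (
    df1.merge(summary, left_on="student_id", right_index=True, how="left")
       .fillna({"no_days": 0, "total_clicks": 0})
       .astype({"no_days": int, "total_clicks": int})
)
```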
Use merge, fill the missing values with fillna, then aggregate with groupby.agg:
df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
    .groupby(['student_id', 'course'], as_index=False)\
    .agg({'day':np.count_nonzero, 'number_of_clicks':np.sum})
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the
# previous one
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    # raw=True passes a NumPy array, so x[0] and x[1] are positional
    return ser.rolling(window=2).apply(lambda x: x[1]-x[0], raw=True)
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change']=df.groupby('product_id')['product_mrp'].transform(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
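If only the plain difference to the previous sale is needed, GroupBy.diff should give the same numbers without a custom function. A minimal sketch, using the first five rows of the example frame above and assuming the rows should be ordered by sale_date before taking differences:

```python
import pandas as pd

# First five rows of the example frame (only the columns needed here).
df = pd.DataFrame(
    {"product_id": [2, 0, 1, 0, 2],
     "product_mrp": [647.169280, 500.641188, 647.789399, 381.278167, 373.685000],
     "sale_date": pd.to_datetime(
         ["2019-08-23", "2019-08-24", "2019-08-25", "2019-08-26", "2019-08-27"])}
)

# Assumption: "previous sale" means the previous row in sale_date order.
df = df.sort_values("sale_date")

# Difference to the previous row within the same product_id;
# the first sale of each product has no predecessor and stays NaN.
df["mrp_change"] = df.groupby("product_id")["product_mrp"].diff()
```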

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
#anky_91
In this image an extract of first 10 rows is presented. First column [int32]: year; Second column[int32]: month; Third column[int32]: day; Fourth column[object]: hour; Fifth column[object]: minute. The length of objects is 2.
Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time):
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime sees only the names it expects (year, month, day, hour, minute):
df = pd.DataFrame({
'A':list('abcdef'),
'year':[2002,2002,2002,2002,2002,2002],
'month':[7,8,9,4,2,3],
'day':[1,3,5,7,1,5],
'hour_scheduled_departure':[5,3,6,9,2,4],
'minute_scheduled_departure':[7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
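The question mentions that the hour and minute columns are object dtype holding 2-character strings. Here is a sketch of the same rename approach for that case, with an explicit cast to int before assembling. The sample values are made up for illustration, and parts is a hypothetical intermediate name:

```python
import pandas as pd

# Made-up rows shaped like the question's data: int year/month/day,
# hour and minute stored as 2-character strings (object dtype).
df = pd.DataFrame(
    {"year": [2015, 2015],
     "month": [1, 1],
     "day": [1, 1],
     "hour_scheduled_departure": ["00", "13"],
     "minute_scheduled_departure": ["05", "45"]}
)

rename_map = {"hour_scheduled_departure": "hour",
              "minute_scheduled_departure": "minute"}

# Rename to the unit names to_datetime recognises and cast everything to int,
# so strings such as "05" become 5 before the datetime is assembled.
parts = (
    df.rename(columns=rename_map)[["year", "month", "day", "hour", "minute"]]
      .astype(int)
)
df["date_time"] = pd.to_datetime(parts)
```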

Pandas deleting rows in order

Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply the following condition: grouping by ID, if abc exists in a group, then delete the rows with xyz in that group. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by Id and apply np.where(...). However, I don't think this approach would work for this case since it's based on rows.
Many thanks!
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
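In case the one-liner is hard to read, here is the same idea split into named steps on the question's data. The names has_abc and is_xyz are just for illustration:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame(
    {"ID": [1, 1, 2, 2, 3, 3, 3, 4],
     "Text": ["abc", "xyz", "xyz", "abc", "xyz", "abc", "ijk", "xyz"]}
)

# True for every row whose ID group contains at least one 'abc'.
has_abc = df["Text"].eq("abc").groupby(df["ID"]).transform("any")

# True for the rows that are candidates for deletion.
is_xyz = df["Text"].eq("xyz")

# Drop 'xyz' rows only in groups that also contain 'abc'.
result = df[~(has_abc & is_xyz)]
```

ID 4 keeps its xyz row because that group has no abc, which matches the expected outcome.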
I am using crosstab
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0,np.nan).stack().reset_index().drop(columns=0)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz
