Top/bottom pairings based on column values in a pandas DataFrame - python

I would like to generate Sector/Group-wise pairs from a DataFrame based on the values in its Score column.
+---------+-------------------+---------+
| Ticker | Sector | Score |
+---------+-------------------+---------+
| ABC | Energy | 3.5 |
| XYZ | Energy | 4.5 |
| PQR | Tech | 5.5 |
| MNP | Tech | 1.5 |
| JKL | Energy | 10.5 |
| BCA | Energy | 8.5 |
| RDB | Tech | 6.5 |
| JMP | Tech | 2.5 |
+---------+-------------------+---------+
From the above example, in sector Energy, JKL/ABC would be one such pairing, as JKL is the highest and ABC the lowest scorer in that sector. Similarly, the next pairing within Energy would be BCA/XYZ, as BCA is the second highest and XYZ the second lowest within that sector.
As a next step I would like to retain those pairs within each sector where the pair-difference is greater than a certain threshold.
Thank you for your help.
Output can be
+---------+-------------------+---------+
| Ticker | Sector | Result |
+---------+-------------------+---------+
| ABC | Energy | 0 |
| XYZ | Energy | 0 |
| PQR | Tech | 1 |
| MNP | Tech | 0 |
| JKL | Energy | 1 |
| BCA | Energy | 1 |
| RDB | Tech | 1 |
| JMP | Tech | 0 |
+---------+-------------------+---------+

Is this what you are after?
(
    df.groupby('Sector')
      .apply(lambda x: [df.Ticker.iloc[x.Score.idxmin()],
                        df.Ticker.iloc[x.Score.idxmax()],
                        x.Score.idxmin(), x.Score.idxmax()])
      .apply(pd.Series)
      .set_axis(['Low Ticker', 'High Ticker', 'Low', 'High'], axis=1)
      .assign(Diff=lambda x: x.High - x.Low)  # Low/High are the index positions of the extremes
)
Out[653]:
       Low Ticker High Ticker  Low  High  Diff
Sector
Energy        ABC         JKL    0     4     4
Tech          MNP         RDB    3     6     3
Then you can retain those pairs within each sector where the pair-difference is greater than a certain threshold by filtering the Diff column.
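Note that Low and High above are index positions rather than scores, so if the threshold should apply to the score gap, you can pull the scores back in. A minimal sketch, assuming the chained result above is stored in a variable pairs and threshold is a value you pick (hypothetical here):
threshold = 5
pairs['ScoreDiff'] = df.Score.loc[pairs['High']].values - df.Score.loc[pairs['Low']].values
pairs = pairs[pairs['ScoreDiff'] > threshold]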

This is what I will do
df = df.sort_values('Score')
df = df.assign(New=df.groupby('Sector').cumcount() % 2)
df = df.assign(New2=df.groupby('Sector').New.apply(lambda x: x.cumsum().replace(0, len(x) / 2)))
df.groupby(['Sector', 'New2']).Ticker.apply(list)
Out[1464]:
Sector  New2
Energy  1       [XYZ, BCA]
        2       [ABC, JKL]
Tech    1       [JMP, PQR]
        2       [MNP, RDB]
Name: Ticker, dtype: object
Then
df['Result']=(df.Score==df.groupby(['Sector','New2']).Score.transform('max')).astype(int)
df.sort_index()
Out[1471]:
  Ticker  Sector  Score  New  New2  Result
0    ABC  Energy    3.5    0     2       0
1    XYZ  Energy    4.5    1     1       0
2    PQR    Tech    5.5    0     1       1
3    MNP    Tech    1.5    0     2       0
4    JKL  Energy   10.5    1     2       1
5    BCA  Energy    8.5    0     1       1
6    RDB    Tech    6.5    1     2       1
7    JMP    Tech    2.5    1     1       0
Edit: as per the OP's request, adding the pair difference
df['DIFF']=df.groupby(['Sector','New2']).Score.apply(lambda x : x.diff().bfill())
df.sort_index()
Out[1479]:
  Ticker  Sector  Score  New  New2  Result  DIFF
0    ABC  Energy    3.5    0     2       0   7.0
1    XYZ  Energy    4.5    1     1       0   4.0
2    PQR    Tech    5.5    0     1       1   3.0
3    MNP    Tech    1.5    0     2       0   5.0
4    JKL  Energy   10.5    1     2       1   7.0
5    BCA  Energy    8.5    0     1       1   4.0
6    RDB    Tech    6.5    1     2       1   5.0
7    JMP    Tech    2.5    1     1       0   3.0
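To finish the original request (keep only the pairs whose difference exceeds a threshold), a minimal sketch with a hypothetical cutoff:
threshold = 4
df['Keep'] = (df['DIFF'] > threshold).astype(int)
# or drop the other rows entirely:
# df = df[df['DIFF'] > threshold]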

Related

Dataframe: calculate difference in dates column by another column

I'm trying to calculate a running difference on the date column depending on the event column.
That is, I want to add another column with the date difference between consecutive 1s in the event column (it contains only 0s and 1s).
So far I have come up with this half-working, crappy solution.
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event'] == 1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
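If the trailing float bothers you and your pandas version has nullable integers, one hedged option is to cast the column afterwards:
df['duration'] = df['duration'].astype('Int64')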
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')

Sum of only certain columns in a pandas Dataframe

I have a dataframe similar to the one below. I need the sum of only certain columns: Jan-16, Feb-16, Mar-16, Apr-16 and May-16. I have these columns in a list called months_list.
--------------------------------------------------------------------------------------
| Id | Name | Jan-16 | Feb-16 | Mar-16 | Apr-16 | May-16 |
| 4674393 | John Miller | 0 | 1 | 1 | 1 | 1 |
| 4674395 | Joe Smith | 0 | 0 | 1 | 1 | 1 |
---------------------------------------------------------------------------------------
My output should look like the below:
--------------------------------------------------------------------------------------
| Id | Name | Jan-16 | Feb-16 | Mar-16 | Apr-16 | May-16 |
| 4674393 | John Miller | 0 | 1 | 1 | 1 | 1 |
| 4674395 | Joe Smith | 0 | 0 | 1 | 1 | 1 |
|Total | | 0 | 1 | 2 | 2 | 2 |
---------------------------------------------------------------------------------------
A new row called 'Total' should be introduced with a column wise sum for all the columns in my months_list: Jan-16, Feb-16, Mar-16, Apr-16 and May-16
I tried the below and it did not work. I got all NaN values
df.loc['Total',:]= df[months_list].sum(axis=1)
You are using the wrong value for the axis parameter.
`axis=0`: sums the column values (one total per column)
`axis=1`: sums the row values (one total per row)
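A quick toy frame (not from the question) makes the difference concrete:
import pandas as pd
t = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
t.sum(axis=0)   # column totals: a -> 3, b -> 30
t.sum(axis=1)   # row totals: 0 -> 11, 1 -> 22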
Assuming your df to be:
In [4]: df
Out[4]:
Id Name Jan-16 Feb-16 Mar-16 Apr-16 May-16
0 4674393 John Miller 0 1 1 1 1
1 4674395 Joe Smith 0 0 1 1 1
In [10]: months_list =['Jan-16', 'Feb-16', 'Mar-16', 'Apr-16', 'May-16']
Your code should be:
In [12]: df.loc['Total'] = df[months_list].sum()
In [13]: df
Out[13]:
Id Name Jan-16 Feb-16 Mar-16 Apr-16 May-16
0 4674393.0 John Miller 0.0 1.0 1.0 1.0 1.0
1 4674395.0 Joe Smith 0.0 0.0 1.0 1.0 1.0
Total NaN NaN 0.0 1.0 2.0 2.0 2.0
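If you also want the word 'Total' inside the Id column, as in your desired output, you can set those cells explicitly afterwards (note this turns Id into an object column):
df.loc['Total', 'Id'] = 'Total'
df.loc['Total', 'Name'] = ''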

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
I would also like to add prev_year_mean_sale and prev_year_id_sale.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use the DataFrame's merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need, so you should drop() some of them and/or rename() the others.
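A rough sketch of that cleanup; the exact suffixed names (id_x/id_y, month_x/month_y, ...) depend on the merge, so inspect result.columns first:
result = (result
          .drop(columns=['id_y', 'month_y', 'prev_month_y', 'prev_month_x'], errors='ignore')
          .rename(columns={'id_x': 'id', 'month_x': 'month'}))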

How to find average after sorting month column in python

I have a challenge in front of me in python.
| Growth_rate | Month |
| ------------ |-------|
| 0 | 1 |
| -2 | 1 |
| 1.2 | 1 |
| 0.3 | 2 |
| -0.1 | 2 |
| 7 | 2 |
| 9 | 3 |
| 4.1 | 3 |
Now I want to average the growth rate by month and put it in a new column. For example, for the 1st month the average would be -0.26, and the result should look like the table below.
| Growth_rate | Month | Mean |
| ----------- | ----- | ----- |
| 0 | 1 | -0.26 |
| -2 | 1 | -0.26 |
| 1.2 | 1 | -0.26 |
| 0.3 | 2 | 2.2 |
| -0.1 | 2 | 2.2 |
| 7 | 2 | 2.2 |
| 9 | 3 | 6.5 |
| 4.1 | 3 | 6.5 |
This should calculate the mean growth rate and put it into the Mean column.
Any help would be great.
df.groupby('Month').mean().reset_index().rename(columns={'Growth_rate': 'Mean'}).merge(df, on='Month')
Out[59]:
   Month      Mean  Growth_rate
0      1 -0.266667          0.0
1      1 -0.266667         -2.0
2      1 -0.266667          1.2
3      2  2.200000         -0.3
4      2  2.200000         -0.1
5      2  2.200000          7.0
6      3  6.550000          9.0
7      3  6.550000          4.1
Assuming you are using the pandas package and your table is in a DataFrame df:
In [91]: means = df.groupby('Month').mean().reset_index()
In [92]: means.columns = ['Month', 'Mean']
Then join via merge
In [93]: pd.merge(df, means, how='outer', on='Month')
Out[93]:
Growth_rate Month Mean
0 0.0 1 -0.266667
1 -2.0 1 -0.266667
2 1.2 1 -0.266667
3 0.3 2 2.400000
4 -0.1 2 2.400000
5 7.0 2 2.400000
6 9.0 3 6.550000
7 4.1 3 6.550000
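As a side note, the same per-month mean can be attached without any merge by using groupby().transform(); a one-line sketch, assuming the column names from the question:
df['Mean'] = df.groupby('Month')['Growth_rate'].transform('mean')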

Partition dataset by timestamp

I have a dataframe of millions of rows like so, with no duplicate time-ID stamps:
ID | Time | Activity
a | 1 | Bar
a | 3 | Bathroom
a | 2 | Bar
a | 4 | Bathroom
a | 5 | Outside
a | 6 | Bar
a | 7 | Bar
What's the most efficient way to convert it to this format?
ID | StartTime | EndTime | Location
a | 1 | 2 | Bar
a | 3 | 4 | Bathroom
a | 5 | N/A | Outside
a | 6 | 7 | Bar
I have to do this with a lot of data, so wondering how to speed up this process as much as possible.
I am using groupby
df.groupby(['ID','Activity']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[251]:
ID Activity starttime endtime
0 a Bar 1.0 2.0
1 a Bathroom 3.0 4.0
2 a Outside 5.0 NaN
Or using pivot_table
df.assign(I=df.groupby(['ID','Activity']).cumcount()).pivot_table(index=['ID','Activity'],columns='I',values='Time')
Out[258]:
I 0 1
ID Activity
a Bar 1.0 2.0
Bathroom 3.0 4.0
Outside 5.0 NaN
Update: pairing every two consecutive visits per ID/Activity, so a repeated location (like Bar at times 6 and 7) gets its own row:
df.assign(I=df.groupby(['ID','Activity']).cumcount()//2).groupby(['ID','Activity','I']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[282]:
ID Activity I starttime endtime
0 a Bar 0 1.0 2.0
1 a Bar 1 6.0 7.0
2 a Bathroom 0 3.0 4.0
3 a Outside 0 5.0 NaN
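If you want the exact layout asked for (ID | StartTime | EndTime | Location), a final hedged tidy-up of that last result (stored, say, in out) could be:
out = (out.rename(columns={'Activity': 'Location', 'starttime': 'StartTime', 'endtime': 'EndTime'})
          .drop(columns='I')
          [['ID', 'StartTime', 'EndTime', 'Location']])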
