Pandas: Find closest group from another dataframe - python

Below, I have two dataframes. I need to update df_mapped using df_original: for each x_time in df_mapped, find the 3 closest rows in df_original (closeness defined as the absolute difference in x_price) and add them to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
**Intermediate output: from this, I need to select the 3 rows with the smallest x_price_delta for each x_time.**
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
**Final step: keeping x_time fixed, flatten the dataframe so the 3 closest rows end up in a single row.**
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and the final step and can't see a way forward. What could be done to massage the dataframe?

This is not a complete solution, but it might help you get unstuck. At the end we get the correct data.
In [1]: df = df_int_out.groupby("x_time").apply(
   ...:     lambda x: x.sort_values(by="x_price_delta", ascending=False)
   ...: ).set_index(["x_time", "expiration_x"]).drop(["x_price_delta", "x_price_x"], axis=1)

In [2]: df1 = df.iloc[1:-1]

In [3]: df1.groupby(df1.index).apply(lambda x: pd.concat(
   ...:     [pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
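For completeness, here is one way to go all the way from the merge to the final wide layout. This is only a sketch, not a canonical solution: it assumes df_mapped still holds the original three columns (re-read it from to_mapped if it was overwritten by the merge above), it assumes a pandas version recent enough for DataFrame.pivot to accept a list of values, the generated column names (expiration_y_1, x_price_y_1, ...) differ slightly from those in final_out, and the 1/2/3 ordering is by increasing delta rather than by expiration.

merged = df_mapped.merge(df_original, on='x_time', how='left')
merged['x_price_delta'] = (merged['x_price_x'] - merged['x_price_y']).abs()

# Keep the 3 rows with the smallest delta per x_time; unmatched
# x_time values (50, 90) survive as a single all-NaN match.
closest = (merged.sort_values(['x_time', 'x_price_delta'])
                 .groupby('x_time').head(3)
                 .copy())

# Number the matches 1..3 within each x_time, then pivot them to columns.
closest['rank'] = closest.groupby('x_time').cumcount() + 1
wide = closest.pivot(index='x_time', columns='rank',
                     values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{name}_{rank}' for name, rank in wide.columns]

# Re-attach the original df_mapped columns on the left.
result = df_mapped.set_index('x_time').join(wide).reset_index()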

Related

Add last n rows of DataFrame into a new column

I would like to sum the last 4 numbers of a dataframe column, with the running result appended as a new column. I have used a for loop to do it, but it is slow when there are many rows. Is there a way to achieve this with df.apply(lambda x: ...)?
Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
Use pandas.DataFrame.rolling:
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Add it together:
>>> df.assign(results=df['values'].rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300
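For reference, a self-contained version on the exact sample input from the question (the column name values is as shown there):

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""
values
10
20
30
40
50
60
"""))

# Window of size 4; min_periods=1 lets the first rows sum over
# however many values are available so far instead of yielding NaN.
df['result'] = df['values'].rolling(4, min_periods=1).sum().astype(int)
print(df)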

Problem while merging a specific multi-level pivot table back to the original (single-level) dataframe

My question is about pivot tables and merging.
I have a main dataframe that I use to create a pivot table. Later, I perform some calculations on that pivot and add a new column. Finally, I want to merge this new column back into the main dataframe, but I am not getting the desired result.
The steps I performed are as follows:
Step 1.
df:
items cat section weight factor
0 1 7 abc 3 80
1 1 7 abc 3 80
2 2 7 xyz 5 60
3 2 7 xyz 5 60
4 2 7 xyz 5 60
5 2 7 xyz 5 60
6 3 7 abc 3 80
7 3 7 abc 3 80
8 3 7 abc 3 80
9 1 8 abc 2 80
10 1 8 abc 2 60
11 2 8 xyz 6 60
12 2 8 xyz 6 60
12 2 8 xyz 6 60
13 2 8 xyz 6 60
14 3 8 abc 2 80
15 1 9 abc 4 80
16 2 9 xyz 9 60
17 2 9 xyz 9 60
18 3 9 abc 4 80
The main dataframe (df) holds a number of items, each identified by a number. Each item belongs to a dedicated section and has a weight that varies by category (cat) and section. In addition, there is a column named 'factor' whose value is constant for a given section.
Step 2.
I need to create a pivot as follows from the above df.
pivot = df.pivot_table(index=['section'], values=['weight', 'factor', 'items'], columns=['cat'], aggfunc={'weight': np.max, 'factor': np.max, 'items': np.sum})
pivot:
weight factor items
cat 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2
xyz 5 6 9 60 60 60 4 4 2
Step 3:
Now I want to perform some calculations on that pivot and add the results as new columns:
pivot['w_n',7] = pivot['factor', 7]/pivot['items', 7]
pivot['w_n',8] = pivot['factor', 8]/pivot['items', 8]
pivot['w_n',9] = pivot['factor', 9]/pivot['items', 9]
pivot:
weight factor items w_n
cat 7 8 9 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2 16 27 40
xyz 5 6 9 60 60 60 4 4 2 15 15 30
Step 4:
Finally, I want to merge the new column back into the main df. The desired result is a single column 'w_n', but instead I am getting 3 columns, one for each cat.
Current result:
df:
items cat section weight factor w_n_7 w_n_8 w_n_9
0 1 7 abc 3 80 16 27 40
1 1 7 abc 3 80 16 27 40
2 2 7 xyz 5 60 15 15 30
3 2 7 xyz 5 60 15 15 30
4 2 7 xyz 5 60 15 15 30
5 2 7 xyz 5 60 15 15 30
6 3 7 abc 3 80 16 27 40
7 3 7 abc 3 80 16 27 40
8 3 7 abc 3 80 16 27 40
9 1 8 abc 2 80 16 27 40
10 1 8 abc 2 60 16 27 40
11 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
13 2 8 xyz 6 60 15 15 30
14 3 8 abc 2 80 16 27 40
15 1 9 abc 4 80 16 27 40
16 2 9 xyz 9 60 15 15 30
17 2 9 xyz 9 60 15 15 30
18 3 9 abc 4 80 16 27 40
Desired result:
------------------
df:
items cat section weight factor w_n
0 1 7 abc 3 80 16
1 1 7 abc 3 80 16
2 2 7 xyz 5 60 15
3 2 7 xyz 5 60 15
4 2 7 xyz 5 60 15
5 2 7 xyz 5 60 15
6 3 7 abc 3 80 16
7 3 7 abc 3 80 16
8 3 7 abc 3 80 16
9 1 8 abc 2 80 27
10 1 8 abc 2 60 27
11 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
13 2 8 xyz 6 60 15
14 3 8 abc 2 80 27
15 1 9 abc 4 80 40
16 2 9 xyz 9 60 30
17 2 9 xyz 9 60 30
18 3 9 abc 4 80 40
Use DataFrame.join with the MultiIndex Series produced by DataFrame.unstack:
df = df.join(pivot['w_n'].unstack().rename('W_n'), on=['cat','section'])
print (df)
items cat section weight factor W_n
0 1 7 abc 3 80 7.272727
1 1 7 abc 3 80 7.272727
2 2 7 xyz 5 60 7.500000
3 2 7 xyz 5 60 7.500000
4 2 7 xyz 5 60 7.500000
5 2 7 xyz 5 60 7.500000
6 3 7 abc 3 80 7.272727
7 3 7 abc 3 80 7.272727
8 3 7 abc 3 80 7.272727
9 1 8 abc 2 80 16.000000
10 1 8 abc 2 60 16.000000
11 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
13 2 8 xyz 6 60 7.500000
14 3 8 abc 2 80 16.000000
15 1 9 abc 4 80 20.000000
16 2 9 xyz 9 60 15.000000
17 2 9 xyz 9 60 15.000000
18 3 9 abc 4 80 20.000000
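The key point is that pivot['w_n'] keeps cat as a column level; unstacking turns (cat, section) into an index, so the join aligns on both keys instead of broadcasting all three cat columns onto every row. An equivalent sketch using stack and an ordinary merge (it assumes the index and column level names are section and cat, as produced by the pivot_table call above):

# Flatten the w_n block into long form: one row per (section, cat) pair.
w_n_long = (pivot['w_n']
            .stack()             # Series with MultiIndex (section, cat)
            .rename('w_n')
            .reset_index())      # columns: section, cat, w_n

# A plain merge on both keys then attaches one w_n value per row.
df = df.merge(w_n_long, on=['section', 'cat'], how='left')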

Create groups based on column values

I am attempting to create user groups based on a particular DataFrame column value. I would like to split the entire DataFrame's population into 10 user groups based on the total_usage metric. An example DataFrame df is shown below.
user_id total_usage
1 10
2 10
3 20
4 20
5 30
6 30
7 40
8 40
9 50
10 50
11 60
12 60
13 70
14 70
15 80
16 80
17 90
18 90
19 100
20 100
The df is just a snippet of the entire DataFrame, which is over 6000 records long; however, I would like to have only 10 user groups.
An example of my desired output is shown below.
user_id total_usage user_group
1 10 10th_group
2 10 10th_group
3 20 9th_group
4 20 9th_group
5 30 8th_group
6 30 8th_group
7 40 7th_group
8 40 7th_group
9 50 6th_group
10 50 6th_group
11 60 5th_group
12 60 5th_group
13 70 4th_group
14 70 4th_group
15 80 3th_group
16 80 3th_group
17 90 2nd_group
18 90 2nd_group
19 100 1st_group
20 100 1st_group
Any assistance that anyone could provide would be greatly appreciated.
Looks like you are looking for qcut, but in reverse order:
df['user_group'] = 10 - pd.qcut(df['total_usage'], np.arange(0,1.1, 0.1)).cat.codes
Output (it's not the ordinal labels, but I hope it will do):
0 10
1 10
2 9
3 9
4 8
5 8
6 7
7 7
8 6
9 6
10 5
11 5
12 4
13 4
14 3
15 3
16 2
17 2
18 1
19 1
dtype: int8
Use qcut with the order reversed by negation, and Series.map for the 'st' and 'nd' suffixes:
s = pd.qcut(-df['total_usage'], np.arange(0,1.1, 0.1), labels=False) + 1
d = {1:'st', 2:'nd'}
df['user_group'] = s.astype(str) + s.map(d).fillna('th') + '_group'
print (df)
user_id total_usage user_group
0 1 10 10th_group
1 2 10 10th_group
2 3 20 9th_group
3 4 20 9th_group
4 5 30 8th_group
5 6 30 8th_group
6 7 40 7th_group
7 8 40 7th_group
8 9 50 6th_group
9 10 50 6th_group
10 11 60 5th_group
11 12 60 5th_group
12 13 70 4th_group
13 14 70 4th_group
14 15 80 3th_group
15 16 80 3th_group
16 17 90 2nd_group
17 18 90 2nd_group
18 19 100 1st_group
19 20 100 1st_group
Try using pd.Series with np.repeat, np.arange, pd.DataFrame.groupby, pd.Series.astype, pd.Series.map and pd.Series.fillna:
x = df.groupby('total_usage')
s = pd.Series(np.repeat(np.arange(x.ngroups) + 1, [len(v) for v in x.groups.values()]))
df['user_group'] = (s.astype(str) + s.map({1: 'st', 2: 'nd'}).fillna('th') + '_Group').values[::-1]
And now print(df) gives:
user_id total_usage user_group
0 1 10 10th_Group
1 2 10 10th_Group
2 3 20 9th_Group
3 4 20 9th_Group
4 5 30 8th_Group
5 6 30 8th_Group
6 7 40 7th_Group
7 8 40 7th_Group
8 9 50 6th_Group
9 10 50 6th_Group
10 11 60 5th_Group
11 12 60 5th_Group
12 13 70 4th_Group
13 14 70 4th_Group
14 15 80 3th_Group
15 16 80 3th_Group
16 17 90 2nd_Group
17 18 90 2nd_Group
18 19 100 1st_Group
19 20 100 1st_Group
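For comparison, a more direct variant is to build all ten labels up front and hand them to qcut, so no suffix post-processing is needed. This is a sketch, assuming total_usage has enough distinct values for 10 quantile bins; it reproduces the question's '3th_group' spelling by only special-casing 'st' and 'nd':

import pandas as pd

suffix = {1: 'st', 2: 'nd'}
labels = [f"{i}{suffix.get(i, 'th')}_group" for i in range(10, 0, -1)]

# qcut assigns labels to bins in ascending order of total_usage,
# so listing them from 10th down to 1st puts the lowest users in the 10th group.
df['user_group'] = pd.qcut(df['total_usage'], 10, labels=labels)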

Pandas DataFrame RangeIndex

I have created a Pandas DataFrame and need to give it a RangeIndex of the form
RangeIndex(start=0, stop=x, step=y), where x and y relate to my DataFrame.
I've not seen an example of how to do this; is there a method or syntax specific to it?
Thanks
It seems you need the RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As @ZakS pointed out in the comments, it is better to pass the index directly to the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
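For the dynamic case, a small sketch that derives stop from the frame length, so the index stays the right size if the number of rows changes (the step value here is arbitrary):

import pandas as pd

step = 5
df = pd.DataFrame({'A': range(1, 21)})
# stop = len(df) * step yields exactly len(df) labels: 0, 5, ..., 95.
df.index = pd.RangeIndex(start=0, stop=len(df) * step, step=step)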

Calculate the delta between entries in Pandas using partitions

I'm using a DataFrame in Pandas, and I would like to calculate the delta between adjacent rows within a partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we call the result C, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that we cannot calculate a delta for the first (minimum) value in each partition.
You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
