Match unique group from another dataframe - one to many match - python

df1 contains the box information.
Each box has a different volume:
the volume of box A is 30, B is 25, etc.
df1 = pd.DataFrame({'boxID':['A', 'B', 'C', 'D'],'volume':[30,25,30,10]})
df1 = df1.set_index("boxID")
df1
       volume
boxID
A          30
B          25
C          30
D          10
df2 contains the product information.
Each product has a different amount.
df2 = pd.DataFrame({'Product No':['1', '2', '3', '4', '5', '6', '7'],'amount':[10, 5, 13, 15, 20, 10, 17]})
df2 = df2.set_index("Product No")
df2
            amount
Product No
1               10
2                5
3               13
4               15
5               20
6               10
7               17
Insert a "box ID" column into df2 that matches each product to the appropriate box ID from df1, like the dataframe at the bottom.
output_df2 = pd.DataFrame({'Product No':['1', '2', '3', '4', '5', '6', '7'],'amount':[10, 5, 13, 15, 20, 10, 17], 'box ID':['A', 'A', 'A', 'B', 'C', 'C', 'D']})
output_df2 = output_df2.set_index("Product No")
output_df2
            amount box ID
Product No
1               10      A
2                5      A
3               13      A
4               15      B
5               20      C
6               10      C
7               17      D
Add the amounts in df2 in order from the top, getting as close as possible to each box volume in df1 without exceeding it.
For example, the first box volume in df1 is 30,
so it can contain the first product of df2 (amount 10) together with the second (amount 5) and the third (amount 13),
because 10 + 5 + 13 = 28, which does not exceed 30.
(However, adding up to the 4th row gives 10 + 5 + 13 + 15 = 43, which exceeds 30.)
I'm still a beginner at Python, so I would appreciate advice from the experts here; this is a very important task for me.
In short: fill the "box ID" column of df2 with the appropriate box ID from df1.

One way, using pandas.cut:
s1 = df1["volume"].cumsum()
s2 = df2["amount"].cumsum()
df2["box ID"] = pd.cut(s2, [0, *s1], labels=s1.index)
print(df2)
Output:
            amount box ID
Product No
1               10      A
2                5      A
3               13      A
4               15      B
5               20      C
6               10      C
7               17      D
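
To see why this works: s1 holds the cumulative box capacities and s2 the cumulative product amounts, so pd.cut simply assigns each running total to the capacity interval it falls into, labelled with the matching box ID. A sketch of the intermediate values (same variables as in the answer):

s1   # boxID:      A 30, B 55, C 85, D 95  -> bin edges [0, 30, 55, 85, 95]
s2   # Product No: 1 10, 2 15, 3 28, 4 43, 5 63, 6 73, 7 90

For example, product 4's running total is 43, which falls into the interval (30, 55], so it is labelled B.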

Related

Need to get the values from a column which is a list of values into another dataframe as a new column

Product  Price  Quantity  Group
Tv          20         1      1
Car        300         1      4
Bike        40         2      2
Laptop      80         1      3
PS4         90         1      2
Example
We need a minimal, reproducible example as code, not an image, so let's make one.
df1 = pd.DataFrame([[1, [10, 20, 30]], [2, [40, 50]]], columns=['group', 'price list'])
df2 = pd.DataFrame([['A', 10], ['B', 40], ['C', 20]], columns=['product', 'price'])
df1
   group    price list
0      1  [10, 20, 30]
1      2      [40, 50]
df2
  product  price
0       A     10
1       B     40
2       C     20
Code
out = df2.merge(df1.rename(columns={'price list':'price'}).explode('price'), how='left')
out
  product price  group
0       A    10      1
1       B    40      2
2       C    20      1
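
The trick is that explode turns each list element of 'price list' into its own row, after which the merge is a plain equality join on the shared 'price' column. For reference, the intermediate exploded frame looks like:

df1.rename(columns={'price list': 'price'}).explode('price')
   group price
0      1    10
0      1    20
0      1    30
1      2    40
1      2    50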

How to efficiently reorder rows based on condition?

My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
   col_1 col_2
0     10     a
1     20     b
2     10     c
3     20     d
4     10     e
5     10     f
6     20     g
7     20     h
I don't want consecutive rows with col_1 = 10; instead, a row below a repeated 10 should jump up by one (in this case, index 6 should become index 5 and vice versa), so the order is always 10, 20, 10, 20, ...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1: idx + 2, idx + 2: idx + 1})
df = df.sort_index()
df
gives me:
   col_1 col_2
0     10     a
1     20     b
2     10     c
3     20     d
4     10     e
5     20     g
6     10     f
7     20     h
which is what I want, but it is very slow (2.34 s for a dataframe with just over 8000 rows).
Is there a way to avoid the loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:
   col_1 col_2
0     10     a
1     20     b
2     10     c
3     20     d
4     10     e
6     20     g
5     10     f
7     20     h
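
The key maps each row to its occurrence number within its col_1 group, and the stable sort then orders rows by that number, preserving the original order for ties. The intermediate key values for this dataframe are:

df.groupby(df['col_1']).cumcount()
0    0
1    0
2    1
3    1
4    2
5    3
6    2
7    3
dtype: int64

Sorting stably by these values yields the interleaved 10, 20, 10, 20, ... order.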

How to improve performance of dataframe slices matching?

I need to improve the performance of the following dataframe slice matching.
What I need to do is find the matching trips between 2 dataframes, according to the sequence column values, with order preserved.
My 2 dataframes:
>>> df1
   trips sequence
0     11        a
1     11        d
2     21        d
3     21        a
4     31        a
5     31        b
6     31        c
>>> df2
   trips sequence
0     12        a
1     12        d
2     22        c
3     22        b
4     22        a
5     32        a
6     32        d
Expected output:
['11 match 12']
This is the code I'm using:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values, route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue
Despite working, this is very time-costly and inefficient, as my real dataframes are longer.
Any suggestions?
You can aggregate each trip's sequence as a tuple with groupby.agg, then merge the two outputs to identify the identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence')
Output:
   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32
If you only want the first match, drop_duplicates the output of the df2 aggregation to prevent unnecessary merging:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')
Output:
   trips_x sequence  trips_y
0       11   (a, d)       12
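
If you need the list-of-strings format from the question, the merged frame can be converted afterwards, e.g. (a sketch using the out frame above):

route_match = (out['trips_x'].astype(str) + ' match ' + out['trips_y'].astype(str)).tolist()
print(route_match)  # ['11 match 12']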

Python - How to sort except one index [duplicate]

This question already has an answer here:
Pandas Python: sort dataframe but don't include given row
(1 answer)
Closed 4 years ago.
import pandas as pd

columns = ['NAME', 'AB', 'H']
df = pd.DataFrame([['Harper', '10', '5'], ['Trout', '10', '5'], ['Ohtani', '10', '5'], ['TOTAL', '30', '15']], columns=columns)
df1 = df.sort_values(by='NAME')
print(df1)
The result is:
     NAME  AB   H
0  Harper  10   5
2  Ohtani  10   5
3   TOTAL  30  15
1   Trout  10   5
I want to sort the dataframe except the 'TOTAL' row.
Try the following code to sort df by 'NAME' while excluding 'TOTAL':
df1 = df[df.NAME!='TOTAL'].sort_values(by='NAME')
Output:
     NAME  AB  H
0  Harper  10  5
2  Ohtani  10  5
1   Trout  10  5
You can append the 'TOTAL' row back after sorting (DataFrame.append was removed in pandas 2.0, so use pd.concat):
df1 = pd.concat([df1, df[df.NAME == 'TOTAL']])
Output:
     NAME  AB   H
0  Harper  10   5
2  Ohtani  10   5
1   Trout  10   5
3   TOTAL  30  15
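
The two steps can also be combined into a single expression (a sketch, equivalent to the above):

mask = df.NAME == 'TOTAL'
df1 = pd.concat([df[~mask].sort_values(by='NAME'), df[mask]])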

How to read a txt file under certain conditions

I have a text file like the one below.
A1 1234 56
B2 1234 56
C3 2345167
I have a table of start positions and lengths,
which gives where each element starts in the file above and the length of each field:
start  length
1      1
2      1
3      1
4      2
6      2
8      2
10     1
I would like to read the file as below, according to the start positions and lengths.
A 1 nan 12 34 5 6
B 2 nan 12 34 5 6
C 3 nan 23 45 16 7
First, I tried
pd.read_csv("file.txt", sep=" ")
but I couldn't figure out how to split the columns.
How can I read and split the dataframe?
As mentioned in the comments, this isn't a CSV format, and so I had to produce a work-around.
def get_row_format(length_file):
    with open(length_file, 'r') as fd_len:
        # Read in the file; it's not a CSV!
        # This double list comprehension produces a list of lists.
        rows = [[x.strip() for x in y.split()] for y in fd_len.readlines()]
    # Determine the row format from the rows list; index 1: skips the header.
    row_form = {int(x[0]): int(x[1]) for x in rows[1:]}
    return row_form

def read_with_row_format(data_file, rform):
    with open(data_file, 'r') as fd_data:
        for row in fd_data.readlines():
            # Slice out each field: start position k (1-based), length v.
            formatted_output = [row[k-1:k+v-1] for k, v in rform.items()]
            print(formatted_output)
The first function gets the 'row format', and the second function applies that row format to each line in the file.
Usage:
rform = get_row_format('lengths.csv')
read_with_row_format('data.csv', rform)
Output:
['A', '1', ' ', '12', '34', ' 5', '6']
['B', '2', ' ', '12', '34', ' 5', '6']
['C', '3', ' ', '23', '45', '16', '7']
This is a fixed-width file; you can use pandas.read_fwf:
import pandas as pd
from io import StringIO
s = StringIO("""A1 1234 56
B2 1234 56
C3 2345167""")
pd.read_fwf(s, widths=widths['length'], header=None)
#    0  1   2   3   4   5  6
# 0  A  1 NaN  12  34   5  6
# 1  B  2 NaN  12  34   5  6
# 2  C  3 NaN  23  45  16  7
The widths data frame:
widths = pd.read_csv(StringIO("""start length
1 1
2 1
3 1
4 2
6 2
8 2
10 1"""), sep=r"\s+")
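
Note that in a script, widths must be defined before the read_fwf call. Putting the two snippets together in that order (a sketch; the StringIO objects stand in for real files):

out = pd.read_fwf(s, widths=widths['length'], header=None)
print(out)  # same frame as shown above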
Since you have the starting position and length of each field, use them.
Here is code to carry that out. Each line is taken in turn, and each field is the slice from its start position to that position plus the field's length.
I leave the conversions to you.
data = [
    "A1 1234 56",
    "B2 1234 56",
    "C3 2345167"
]
table = [
    [1, 1],
    [2, 1],
    [3, 1],
    [4, 2],
    [6, 2],
    [8, 2],
    [10, 1]
]
for line in data:
    fields = [line[(table[col][0] - 1):(table[col][0] + table[col][1] - 1)] for col in range(len(table))]
    print(fields)
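
To go from printed lists to a dataframe (the conversions the answer leaves to you), the same slices can feed pd.DataFrame; a minimal sketch, assuming pandas is acceptable here:

import pandas as pd
rows = [[line[(start - 1):(start + length - 1)] for start, length in table] for line in data]
df = pd.DataFrame(rows)  # one column per field; numeric conversion still left to you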
