Pandas data frames merging by arguments - python

I have two data frames and need to combine them to get a new one where certain elements from the first (df1) are inserted into the second (df2).
For example:
df1=
event_id entity_type start_i end_i token_name doc_id
0 T1 Drug 10756 10766 amiodarone 114220
1 T2 Drug 14597 14614 Calcium Carbonate 114220
2 T3 Strength 14615 14621 500 mg 114220
3 T4 Form 14622 14638 Tablet 114220
and the second data frame:
df2 =
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug T3 T2 114220
236 R2 Form-Drug T4 T2 114220
and I need to get the combined data frame:
df3 =
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug 500 mg Calcium Carbonate 114220
236 R2 Form-Drug Tablet Calcium Carbonate 114220
Basically, what happens here is a substitution: the Ti/Tj identifiers in arg_1 and arg_2 of df2 are replaced by the token_name from df1 whose event_id matches. My current implementation:
df3 = df2.copy()
df3.loc[235,'arg_1'] = df1.loc[df1.event_id == df2.loc[235,'arg_1'], 'token_name'].iloc[0]
df3.loc[235,'arg_2'] = df1.loc[df1.event_id == df2.loc[235,'arg_2'], 'token_name'].iloc[0]
df3.loc[236,'arg_1'] = df1.loc[df1.event_id == df2.loc[236,'arg_1'], 'token_name'].iloc[0]
df3.loc[236,'arg_2'] = df1.loc[df1.event_id == df2.loc[236,'arg_2'], 'token_name'].iloc[0]
This quick-and-dirty implementation works fine, but it is very slow, and given the large number of documents it is infeasible.
Any ideas for a proper implementation with Pandas? I suspect it needs some combination of pd.join / pd.merge, but I'm still trying to figure out which one. Thanks.

Use map with a dictionary created by zip:
d = dict(zip(df1['event_id'], df1['token_name']))
# alternative:
# d = df1.set_index('event_id')['token_name']

cols = ['arg_1', 'arg_2']
# values with no match in d are set to NaN
df2[cols] = df2[cols].apply(lambda x: x.map(d))
# alternative - values with no match are left unchanged
# df2[cols] = df2[cols].replace(d)
print(df2)
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug 500 mg Calcium Carbonate 114220
236 R2 Form-Drug Tablet Calcium Carbonate 114220
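If you do want a merge-based version, here is one sketch (the lookup frame and the fillna fallback are my illustrative choices, not part of the answer above; note that merge resets df2's index):
# map each arg column through df1 via a left merge;
# IDs with no match in df1 keep their original value
lookup = df1[['event_id', 'token_name']].rename(columns={'event_id': 'key'})

df3 = df2.copy()
for col in ['arg_1', 'arg_2']:
    df3 = df3.merge(lookup, how='left', left_on=col, right_on='key').drop(columns='key')
    df3[col] = df3.pop('token_name').fillna(df3[col])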

Related

Process and return data from a group of a group

I have a pandas dataframe with an ID column and 4 variables, 2 categorical and 2 numeric.
ID  Trimester  State  Tax  rate
45  T1         NY     20   0.25
23  T3         FL     34   0.3
35  T2         TX     45   0.6
I would like to get a new table of the form:
ID  Trimester  State  Tax  rate  Tax_per_state_per_trimester
45  T1         NY     20   0.25  H
23  T3         FL     34   0.3   L
35  T2         TX     45   0.6   M
where the new variable 'Tax_per_state_per_trimester' is a categorical variable representing the tertiles of the corresponding subgroup, where L = first tertile, M = second tertile, H = last tertile.
I understand I can do a double grouping with:
df.groupby(['State', 'Trimester'])
but I don't know how to go from there.
I guess apply or transform with the quantile function should prove useful, but how?
Can you take a look and see if this gives you the results you want?
import pandas as pd

df = pd.read_excel('Tax.xlsx')

def mx(tri, state):
    # max Tax within the given (Trimester, State) subgroup
    return df[(df['Trimester'].eq(tri)) & (df['State'].eq(state))] \
        .groupby(['Trimester', 'State'])['Tax'].apply(max)[0]

for i, v in df.iterrows():
    t = v['Tax'] / mx(v['Trimester'], v['State'])
    df.loc[i, 'Tax_per_state_per_trimester'] = 'L' if t < 1/3 else 'M' if t < 2/3 else 'H'
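As the question guesses, transform also works and avoids the row-by-row loop. A vectorized sketch that mirrors the max-ratio logic of the loop above (the pd.cut bin edges are my assumption, chosen to match the t < 1/3 and t < 2/3 thresholds):
import pandas as pd

# ratio of each row's Tax to the max Tax of its (State, Trimester) subgroup
frac = df['Tax'] / df.groupby(['State', 'Trimester'])['Tax'].transform('max')
# bin the ratio into thirds: L (lowest), M, H (highest)
df['Tax_per_state_per_trimester'] = pd.cut(frac, bins=[0, 1/3, 2/3, 1],
                                           labels=['L', 'M', 'H'],
                                           include_lowest=True)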

How to merge pandas dataframes with different row and column sizes?

I want to merge dataframe 1 and dataframe 2 based on the 'Race' values in dataframe 2. I only want to include the 'Race' values from dataframe 2 and do not want to include any of the excess 'Race' values from dataframe 1.
My code:
cols1 = ['Race', 'Market ID']
df1 = pd.DataFrame(data=betfairevents, columns=cols1)
cols2 = ['Race']
df2 = pd.DataFrame(data=tabntgevents, columns=cols2)
print(df2)
dfmerge1 = pd.merge(df1,df2,on='Race',how='inner')
Output of dataframe1:
Race Market ID
0 Newcastle R1 1.171771969
1 Newcastle R2 1.171771971
2 Newcastle R3 1.171771973
3 Newcastle R4 1.171771975
4 Newcastle R5 1.171771977
.. ... ...
139 Launceston R6 1.171772509
140 Launceston R7 1.171772511
141 Launceston R8 1.171772513
142 Launceston R9 1.171772515
143 Launceston R10 1.171772517
Output of dataframe2:
Race
0 NEWCASTLE R1
1 BALLARAT R1
2 LISMORE R4
3 WARRAGUL R3
Desired output of merged dataframe:
Race Market ID
0 Newcastle R1 1.171771969
1 Ballarat R1 1.171771971
2 Lismore R4 1.171771973
3 Warragul R3 1.171771975
You can use the .isin function from pandas:
merged_df = df1[df1['Race'].isin(df2['Race'])]
The sample input data you're showing doesn't match the desired output. But here is one way to perform the analysis:
# create sample data
from io import StringIO
import pandas as pd

data1 = '''index  Race            Market ID
0      Newcastle R1    1.171771969
1      Newcastle R2    1.171771971
2      Newcastle R3    1.171771973
3      Newcastle R4    1.171771975
4      Newcastle R5    1.171771977
139    Launceston R6   1.171772509
140    Launceston R7   1.171772511
141    Launceston R8   1.171772513
142    Launceston R9   1.171772515
143    Launceston R10  1.171772517
'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s\s+', engine='python').set_index('index')

data2 = '''index  Race
0      NEWCASTLE R1
1      BALLARAT R1
2      LISMORE R4
3      WARRAGUL R3
'''
df2 = pd.read_csv(StringIO(data2), sep=r'\s\s+', engine='python').set_index('index')
Now find the 'Race' values that are in both df1 and df2 (with a boolean mask). The .str.lower() makes the comparison case-insensitive.
mask = df1['Race'].str.lower().isin(df2['Race'].str.lower().values)
df1[mask]
The merge() function would also work for this.
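For completeness, a sketch of that merge-based variant, again lowercasing the key so the match is case-insensitive (the temporary key column is illustrative, not part of the answer above):
# case-insensitive inner merge via a temporary lowercased key column
merged = (df1.assign(key=df1['Race'].str.lower())
             .merge(df2.assign(key=df2['Race'].str.lower())[['key']],
                    on='key', how='inner')
             .drop(columns='key'))
print(merged)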

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning experience. I was able to complete the assignment using the plain loops I know from traditional programming, and it ran okay over thousands of rows, but it brought my laptop to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
In the two dataframes above, the start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. For the sample data above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be the cars dataframe with the following new columns appended:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
   id  start  end  back_id  back_start  back_end  back_distance  across_id  across_start  across_end  front_id  front_start  front_end  front_distance
0  C1  200    215  T1       115         175       25             T2         200           260         NaN       NaN          NaN        NaN
1  C2  110    125  T4       25          85        25             NaN        NaN           NaN         T1        115          175        -10
2  C3  240    255  NaN      NaN         NaN       NaN            T2         200           260         T3        280          340        25
Is pandas even the best tool for this task? If there is a better-suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
With pandas you can use merge_asof. Here is one way, though it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# create back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}' for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward') \
            .query('end > back_end') \
            .assign(back_distance=lambda x: x['start'] - x['back_end'])

# create across condition: note that cars is the first of the two dataframes here
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}' for col in trucks.columns}),
                          on='across_start', direction='backward') \
              .query('end <= across_end')

# create front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}' for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward') \
             .query('start < front_start') \
             .assign(front_distance=lambda x: x['front_start'] - x['end'])

# merge everything back to cars
df_f = cars.merge(df_back, how='left') \
           .merge(df_across, how='left') \
           .merge(df_front, how='left')
and you get
print (df_f)
id start end back_id back_start back_end back_distance across_start \
0 C2 110 125 T4 25.0 85.0 25.0 NaN
1 C1 200 215 T1 115.0 175.0 25.0 200.0
2 C3 240 255 NaN NaN NaN NaN 240.0
across_id across_end front_id front_start front_end front_distance
0 NaN NaN T1 115.0 175.0 -10.0
1 T2 260.0 NaN NaN NaN NaN
2 T2 260.0 T3 280.0 340.0 25.0
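For reference, here is a minimal sketch of how the sample frames could be constructed to try the above (reconstructed from the question's tables):
import pandas as pd

cars = pd.DataFrame({'id': ['C1', 'C2', 'C3'],
                     'start': [200, 110, 240],
                     'end': [215, 125, 255]})
trucks = pd.DataFrame({'id': ['T1', 'T2', 'T3', 'T4'],
                       'start': [115, 200, 280, 25],
                       'end': [175, 260, 340, 85]})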

Pandas iterating in range. Same number twice?

I've written this code and my output is not quite as expected. It seems that the for loop runs through the first iteration twice, then skips the second and jumps straight to the third. I cannot see where I have gone wrong, however, so could someone point out the error? Thank you!
Code below:
i = 0
df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
df_Entry = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
df_Entry.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)

for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    df_Entry2 = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    df_Entry2.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)
    df_Entry = pd.concat([df_Entry, df_Entry2])
df_z is an Excel document with data like this:
Turn Number Entry Exit
0 1 321 441
1 2 893 1033
2 3 1071 1184
3 4 1234 1352
4 5 2354 2454
5 6 2464 2554
6 7 2574 2689
7 8 2955 3120..... and so on
Then df1 is a massive DataFrame with 30 columns and tens of thousands of rows (hence the mean and std).
My Output should be:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T2 14.515195 0.541967
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800.... and so on
However I get this:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T1 6.845490 0.591227
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800..... and so on
T2 has become a second T1, and the numbers are the same. What have I done wrong? Any help would be greatly appreciated!
Instead of range(len(df_z)), try using:
for i in range(1, len(df_z)):
    ...
since range starts at 0 and the i = 0 case is already handled before the for loop (which is why it appears twice).
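Alternatively, a sketch that removes the need for the duplicated block before the loop entirely: collect each turn's aggregate in a list and concatenate once at the end (same variable names as in the question):
# aggregate each turn once, then concatenate in a single pass
pieces = []
for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    piece = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    piece.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)
    pieces.append(piece)
df_Entry = pd.concat(pieces)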

Intersections between values in two dictionaries in python

I have a csv file that contains trade data for some countries. The data has the following format:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to build a new dictionary and use it to calculate the total value of commonly traded commodities between countries.
For example, consider trade between USA-GER: I intend to check whether GER-USA is in the data and, if it exists, sum the values of the common commodities, doing the same for all country pairs. The dictionary should look like:
Dic_c1c2_products =
{('USA','GER'): {('1','700'), ('2','100'), ('3','400'), ('5','100'), ('80','900')},
 ('GER','USA'): {('2','300'), ('4','500'), ('5','700'), ('97','450')},
 ('GER','UK'): {('50','300')},
 ('UK','USA'): {('4','1100'), ('80','200')},
 ('UK','GER'): {('50','200'), ('39','650')}}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common, and the total value of these goods is (100+300)+(100+700) = 1200.
For the pairs USA-UK and UK-USA there are no common commodities, so total trade will be 0. For GER-UK and UK-GER, commodity 50 is in common and total trade is 300+200 = 500.
At the end, I want to have something like:
Dic_c1c2_summation = {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}
Any help would be appreciated.
In addition to my post, here is what I have written so far:
import csv
from collections import defaultdict

rfile = csv.reader(open("filepath", 'r'))
next(rfile)  # skip the header row

dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()
for row in rfile:
    c1, c2, p = row[0], row[1], row[2]
    country.add(c1)
for i in country:
    dic_c_products[i] = set()

rfile = csv.reader(open("filepath"))
next(rfile)
for i in rfile:
    c1, c2, p, v = i[0], i[1], i[2], i[3]
    dic_c_products[c1].add((p, v))
    dic_c1c2_products[(c1, c2)].add((p, v))

c_list = list(dic_c_products.keys())
dic_c1c2_productsum = {}
for i in dic_c1c2_products:
    # only pairs that also appear in the reverse direction
    if (i[1], i[0]) in dic_c1c2_products:
        for p1, v1 in dic_c1c2_products[(i[0], i[1])]:
            for p2, v2 in dic_c1c2_products[(i[1], i[0])]:
                if p1 == p2:
                    summation = int(v1) + int(v2)
                    if (i[0], i[1]) not in dic_c1c2_productsum:
                        dic_c1c2_productsum[(i[0], i[1])] = {(p1, summation)}
                    else:
                        dic_c1c2_productsum[(i[0], i[1])].add((p1, summation))
    else:
        dic_c1c2_productsum[i] = set()
# save your data in a file called data
import pandas as pd

data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = (data.groupby(['par_rep', 'commodity'])
              .filter(lambda x: len(x) >= 2)
              .groupby('par_rep')['value'].sum()
              .to_dict())
At the end, result is {'GER_UK': 500, 'GER_USA': 1200}.
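If you also want the zero entries from the desired Dic_c1c2_summation (pairs with no common commodity), one possible addition is to default every observed pair key to 0 (an assumption based on the question's example output):
# add 0 for reporter/partner pairs that never match on a commodity,
# mirroring the ('UK','USA'): 0 entry in the question's example
for key in data['par_rep'].unique():
    result.setdefault(key, 0)
# result is now {'GER_UK': 500, 'GER_USA': 1200, 'UK_USA': 0}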
