Pandas iterating in range. Same number twice? - python

I've written this code and my output is not quite as expected. It seems that the for loop runs through the first iteration twice, then skips the second and jumps straight to the third. I cannot see where I have gone wrong, however, so could someone point out the error? Thank you!
Code below:
i = 0
df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
df_Entry = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
df_Entry.rename(index={1: 'T'+str(df_z['Turn Number'][i])}, inplace=True)
for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    df_Entry2 = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    df_Entry2.rename(index={1: 'T'+str(df_z['Turn Number'][i])}, inplace=True)
    df_Entry = pd.concat([df_Entry, df_Entry2])
df_z is an Excel document with data like this:
Turn Number Entry Exit
0 1 321 441
1 2 893 1033
2 3 1071 1184
3 4 1234 1352
4 5 2354 2454
5 6 2464 2554
6 7 2574 2689
7 8 2955 3120..... and so on
Then df1 is a massive DataFrame with 30 columns and tens of thousands of rows (hence the mean and std).
My Output should be:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T2 14.515195 0.541967
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800.... and so on
However I get this:
tLap
mean std
BCornerEntry
T1 6.845490 0.591227
T1 6.845490 0.591227
T3 19.598690 0.319181
T4 21.555500 0.246757
T5 34.980000 0.518170
T6 37.245000 0.209284
T7 40.220541 0.322800..... and so on
T2 is still T1 and the numbers are the same? What have I done wrong? Any help would be greatly appreciated!

Instead of range(len(df_z)), try using:
for i in range(1, len(df_z)):
...
since range starts at 0 and the i = 0 case has already been done before the for loop (which is why it appears twice).
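As a side note, the duplicated block before the loop can be dropped entirely by collecting each turn's result in a list and concatenating once at the end. A minimal sketch, reusing the names from the question (df1, df_z, Lap1) and not tested against the real data:

import numpy as np
import pandas as pd

parts = []
for i in range(len(df_z)):
    df_int = df1[(df1.sLap > df_z.Entry[i]) & (df1.sLap < df_z.Exit[i]) & (df1.NLap == Lap1)]
    part = df_int.groupby(df_int.BCornerEntry).aggregate([np.mean, np.std])
    part.rename(index={1: 'T' + str(df_z['Turn Number'][i])}, inplace=True)
    parts.append(part)

df_Entry = pd.concat(parts)  # one concat at the end instead of growing df_Entry inside the loop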

Related

Process and return data from a group of a group

I have a pandas dataframe with an ID column plus 2 categorical and 2 numeric variables.
ID  Trimester  State  Tax  rate
45  T1         NY      20  0.25
23  T3         FL      34  0.3
35  T2         TX      45  0.6
I would like to get a new table of the form:
ID  Trimester  State  Tax  rate  Tax_per_state_per_trimester
45  T1         NY      20  0.25  H
23  T3         FL      34  0.3   L
35  T2         TX      45  0.6   M
where the new variable 'Tax_per_state_per_trimester' is a categorical variable representing the tertiles of the corresponding subgroup, where L = first tertile, M = second tertile, H = last tertile
I understand I can do a double grouping with:
df.groupby(['State', 'Trimester'])
but I don't know how to go on from there.
I guess apply or transform with the quantile function should prove useful, but how?
Can you take a look and see if this gives you the results you want?
df = pd.read_excel('Tax.xlsx')

def mx(tri, state):
    return df[(df['Trimester'].eq(tri)) & (df['State'].eq(state))] \
        .groupby(['Trimester', 'State'])['Tax'].apply(max)[0]

for i, v in df.iterrows():
    t = v['Tax'] / mx(v['Trimester'], v['State'])
    df.loc[i, 'Tax_per_state_per_trimester'] = 'L' if t < 1/3 else 'M' if t < 2/3 else 'H'
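If iterrows turns out to be too slow, a vectorized alternative is to rank each Tax value within its (State, Trimester) group and cut the ranks into three bins. This is only a sketch, not the answer above: tertile_labels is a hypothetical helper, and it assumes "tertile" means thirds of the within-group percentile rank.

import pandas as pd

def tertile_labels(s):
    # percentile rank of Tax inside the group, then cut the ranks into thirds
    pct = s.rank(pct=True)
    return pd.cut(pct, bins=[0, 1/3, 2/3, 1],
                  labels=['L', 'M', 'H'], include_lowest=True)

# group_keys=False keeps the original row index so the result aligns back onto df
df['Tax_per_state_per_trimester'] = (
    df.groupby(['State', 'Trimester'], group_keys=False)['Tax']
      .apply(tertile_labels)
)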

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning experience. I was able to complete the assignment using the traditional loops I know from conventional programming, and it ran okay over thousands of rows, but it brought my laptop to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
The two dataframes above correspond to this:
start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output is the cars dataframe with the following new columns appended:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
id start end back_id back_start back_end back_distance across_id across_start across_end front_id front_start front_end front_distance
0 C1 200 215 T1 115 175 25 T2 200 260
1 C2 110 125 T4 25 85 25 T1 115 175 -10
2 C3 240 255 T2 200 260 T3 280 340 25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
With pandas you can use merge_asof. Here is one way, though it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# create back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}'
                                               for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward')\
            .query('end>back_end')\
            .assign(back_distance=lambda x: x['start']-x['back_end'])

# create across condition: here note that cars is the first of the 2 dataframes
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}'
                                                 for col in trucks.columns}),
                          on='across_start', direction='backward')\
              .query('end<=across_end')

# create front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}'
                                                for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward')\
             .query('start<front_start')\
             .assign(front_distance=lambda x: x['front_start']-x['end'])

# merge all back to cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')
and you get
print (df_f)
id start end back_id back_start back_end back_distance across_start \
0 C2 110 125 T4 25.0 85.0 25.0 NaN
1 C1 200 215 T1 115.0 175.0 25.0 200.0
2 C3 240 255 NaN NaN NaN NaN 240.0
across_id across_end front_id front_start front_end front_distance
0 NaN NaN T1 115.0 175.0 -10.0
1 T2 260.0 NaN NaN NaN NaN
2 T2 260.0 T3 280.0 340.0 25.0
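For reference, merge_asof requires both inputs to be sorted on the key and then performs a single pass over them, which is why it scales much better than nested loops even if, as noted, it is not guaranteed to be optimal at millions of rows. To try the answer out, the small frames from the question can be rebuilt (a sketch, values copied from the tables above):

import pandas as pd

cars = pd.DataFrame({'id': ['C1', 'C2', 'C3'],
                     'start': [200, 110, 240],
                     'end': [215, 125, 255]})

trucks = pd.DataFrame({'id': ['T1', 'T2', 'T3', 'T4'],
                       'start': [115, 200, 280, 25],
                       'end': [175, 260, 340, 85]})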

Fastest way to filter dataframe based on conditions multiple times

I have a data frame where I need to filter on conditions multiple times (more than 200k times) to account for unique results that may come out. I am curious if there is a faster way to search and filter for particular conditions.
My sample data and current implementation are below:
Description Ticker Start Stop
0 A B 220 100
1 Ab TEST 180 103
2 Bac RANDOM 205 32
3 Ba BLAH 100 2
4 Ca BLAH 92 40
5 Cd B 85 25
6 A B 221 71
7 A B 400 171
def filter_df(object):
    stock_source = 'A'
    ticker = 'B'
    target = 120
    my_df = object.maindf[(object.maindf['Description'].values == stock_source) &
                          (object.maindf['Ticker'].values == ticker)]
    condition = (my_df['Start'].values <= target) & (my_df['Stop'].values >= target)
    my_df = my_df[condition]
    return my_df
For the above example I should only get the rows at index 0 and 6, on which I then do some other things.
ncalls tottime percall cumtime percall filename:lineno(function)
31192 1.950 0.000 37.554 0.001 test.py:95(filter_df)
Thank you for the help
You can use something like:
stock_source = 'A'
ticker = 'B'
target = 120
m = df.Description.eq(stock_source) & df.Ticker.eq(ticker) \
    & (df.Start.ge(target) & df.Stop.le(target))
df[m]
Description Ticker Start Stop
0 A B 220 100
6 A B 221 71
P.S.: You can create separate boolean masks for each condition. :)
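Since the same kind of lookup runs 200k+ times, another option is to index the frame once by Description and Ticker and reuse that for every call, so each query does a label lookup instead of rescanning every row. Just a sketch, not part of the answer above; filter_df_fast is a hypothetical name, the columns are the ones from the question, and the Start/Stop condition matches the mask above:

# one-time setup: a sorted MultiIndex makes repeated (Description, Ticker) lookups cheap
indexed = df.set_index(['Description', 'Ticker']).sort_index()

def filter_df_fast(stock_source, ticker, target):
    # label-based slice of the (Description, Ticker) group, then the Start/Stop mask
    sub = indexed.loc[[(stock_source, ticker)]]
    return sub[(sub['Start'] >= target) & (sub['Stop'] <= target)]

print(filter_df_fast('A', 'B', 120))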

Pandas data frames margins by arguments

I have two data frames and I need to combine both to get a new one where certain elements from the first (df1) will be inserted into the second one (df2).
For example:
df1=
event_id entity_type start_i end_i token_name doc_id
0 T1 Drug 10756 10766 amiodarone 114220
1 T2 Drug 14597 14614 Calcium Carbonate 114220
2 T3 Strength 14615 14621 500 mg 114220
3 T4 Form 14622 14638 Tablet 114220
and the second data frame:
df2 =
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug T3 T2 114220
236 R2 Form-Drug T4 T2 114220
and I need to get the combined data frame:
df3 =
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug 500 mg Calcium Carbonate 114220
236 R2 Form-Drug Tablet Calcium Carbonate 114220
Basically, what happens here is that arg_1 and arg_2 in df2, which hold event IDs Ti and Tj, are replaced by the token_name of the row in df1 whose event_id matches.
df3 = df2.copy()
df3.loc[235,'arg_1'] = df1.loc[df1.event_id == df2.loc[235,'arg_1'], 'token_name'].iloc[0]
df3.loc[235,'arg_2'] = df1.loc[df1.event_id == df2.loc[235,'arg_2'], 'token_name'].iloc[0]
df3.loc[236,'arg_1'] = df1.loc[df1.event_id == df2.loc[236,'arg_1'], 'token_name'].iloc[0]
df3.loc[236,'arg_2'] = df1.loc[df1.event_id == df2.loc[236,'arg_2'], 'token_name'].iloc[0]
I have a 'quick-and-dirty' implementation, which works fine but is very slow, and given the large number of documents it is infeasible.
Any ideas for a proper implementation with Pandas? It should be a tricky combination of pd.join / pd.merge, but I'm still trying to figure out which one. Thanks.
Use map with a dictionary created by zip:
d = dict(zip(df1['event_id'], df1['token_name']))
# alternative
# d = df1.set_index('event_id')['token_name']
cols = ['arg_1', 'arg_2']
# values with no match in d are set to NaN
df2[cols] = df2[cols].apply(lambda x: x.map(d))
# alternative - values with no match are left unchanged
# df2[cols] = df2[cols].replace(d)
print (df2)
event_id relation_type arg_1 arg_2 doc_id
235 R1 Strength-Drug 500 mg Calcium Carbonate 114220
236 R2 Form-Drug Tablet Calcium Carbonate 114220
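Since the question mentions join/merge, an equivalent merge-based sketch (not part of the answer above) joins df2 against df1 once per argument column; note that merge resets the 235/236 index labels:

out = df2.copy()
for col in ['arg_1', 'arg_2']:
    out = (out.merge(df1[['event_id', 'token_name']],
                     left_on=col, right_on='event_id',
                     how='left', suffixes=('', '_lookup'))
              .drop(columns='event_id_lookup'))
    # overwrite the Ti/Tj reference with the looked-up token_name
    out[col] = out.pop('token_name')
print(out)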

operations in pandas DataFrame

I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first 2 parameters and get the maximum of the standard deviations over all 'seed's, calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding about DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramy way =)
import pandas as pd
import numpy as np
#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1, 4, size=total)
data['Tp'] = np.random.randint(5, 15, size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0, 3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1, 51, size=total)
data['max'] = np.random.randint(100, 250, size=total)
data['min'] = np.random.randint(10, 25, size=total)

# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns=['Hs', 'Tp', 'max', 'min'])
i = 0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd'] == wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd'] == wd]['min']) for wd in set(data_tp['wd'])])]
        i += 1
Thanks!
PS: if curious, these are statistics on variables that depend on sea waves. Hs is wave height, Tp wave period, wd wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])
(include reset_index() on the end if you want)
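One caveat for newer pandas versions: the level= keyword of .max() has been deprecated and later removed, so on recent releases the same idea is spelled with an explicit groupby on the index levels (a sketch; ddof=0 is kept to match np.std in the question):

stdev = (data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']]
             .std(ddof=0)
             .groupby(level=['Hs', 'Tp']).max()
             .reset_index())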
