Pandas SUMIFS from table 2, for column in table 1 - python

I have a big df with tariffs for aviation routes. You can look up data for a concrete route, for example by airport of origin, airport of destination, aircraft, and month.
Plain example of df:
import pandas as pd

data = {'orig': ['A', 'A', 'A', 'B', 'B', 'B'],
        'dest': ['C', 'C', 'C', 'D', 'D', 'D'],
        'currency': ['RUB', 'USD', 'RUB', 'USD', 'RUB', 'USD'],
        'tarif': [100, 10, 120, 20, 150, 30]}
df = pd.DataFrame(data)
df
orig dest currency tarif
0 A C RUB 100
1 A C USD 10
2 A C RUB 120
3 B D USD 20
4 B D RUB 150
5 B D USD 30
I have df2, which contains the flight plan for a specific company. It has some of the same columns, like month, orig, dest, aircraft.
Plain example of df2:
data2 = {'orig': ['A', 'B'],
         'dest': ['C', 'D']}
df2 = pd.DataFrame(data2)
df2
orig dest
0 A C
1 B D
Task: for each row in df2, sum up tarif per currency using these conditions (like SUMIFS in Excel).
What I expect:
orig dest RUB USD
0 A C 220 10
1 B D 150 50
Thanks.

Hmmm
df = df.groupby(["orig", "dest", "currency"]).agg('sum').unstack()
df.columns = ['_'.join(col).strip() for col in df.columns.values]
df
gives me
tarif_RUB tarif_USD
orig dest
A C 220 10
B D 150 50
which is your desired result, but I haven't used df2 yet, so I'm afraid you'll have to describe/extend your example so that df2 actually matters.
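If df2 really is just a subset of routes (possibly with extra keys such as month or aircraft), a minimal sketch, assuming the toy frames above, would be to pivot the tariffs per currency and left-merge the result onto df2:

import pandas as pd

data = {'orig': ['A', 'A', 'A', 'B', 'B', 'B'],
        'dest': ['C', 'C', 'C', 'D', 'D', 'D'],
        'currency': ['RUB', 'USD', 'RUB', 'USD', 'RUB', 'USD'],
        'tarif': [100, 10, 120, 20, 150, 30]}
df = pd.DataFrame(data)
df2 = pd.DataFrame({'orig': ['A', 'B'], 'dest': ['C', 'D']})

# one column per currency, summing tariffs per route
pivot = df.pivot_table(index=['orig', 'dest'], columns='currency',
                       values='tarif', aggfunc='sum').reset_index()
pivot.columns.name = None  # drop the leftover 'currency' axis label

# attach the totals to the plan rows; routes missing from df get NaN
result = df2.merge(pivot, on=['orig', 'dest'], how='left')
print(result)
#   orig dest  RUB  USD
# 0    A    C  220   10
# 1    B    D  150   50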

Related

Merge datasets with certain priority

I have 3 datasets
All the same shape
CustomerNumber, Name, Status
A customer can appear in 1, 2 or all 3.
Each dataset is a list of gold/silver/bronze.
example data:
Dataframe 1:
100,James,Gold
Dataframe 2:
100,James,Silver
101,Paul,Silver
Dataframe 3:
100,James,Bronze
101,Paul,Bronze
102,Fred,Bronze
Expected output/aggregated list:
100,James,Gold
101,Paul,Silver
102,Fred,Bronze
So for a customer that is captured in all 3, I want to keep the Status as Gold.
I have been playing with join and merge and just can't get it right.
Use concat, convert the column to an ordered categorical so that sorting by multiple columns respects the priority, and finally remove duplicates with DataFrame.drop_duplicates:
print (df1)
print (df2)
print (df3)
a b c
0 100 James Gold
a b c
0 100 James Silver
1 101 Paul Silver
a b c
0 101 Paul Bronze
1 102 Fred Bronze
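For reference, a minimal construction of the three frames matching those printouts (a sketch; the column names a, b, c are taken from the output above):

import pandas as pd

df1 = pd.DataFrame({'a': [100], 'b': ['James'], 'c': ['Gold']})
df2 = pd.DataFrame({'a': [100, 101], 'b': ['James', 'Paul'], 'c': ['Silver', 'Silver']})
df3 = pd.DataFrame({'a': [101, 102], 'b': ['Paul', 'Fred'], 'c': ['Bronze', 'Bronze']})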
df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold','Silver','Bronze'])
df = df.sort_values(['a','b','c']).drop_duplicates(['a','b'])
print (df)
a b c
0 100 James Gold
2 101 Paul Silver
4 102 Fred Bronze
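An equivalent variant after the same concat and categorical conversion, in case you prefer groupby over drop_duplicates, is to sort by the priority column and take the first row per customer (a sketch using the same frames):

df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold', 'Silver', 'Bronze'])

# first row per customer after sorting by priority = highest status
out = df.sort_values('c').groupby('a', as_index=False).first()
print(out)
#      a      b       c
# 0  100  James    Gold
# 1  101   Paul  Silver
# 2  102   Fred  Bronze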

Pandas vectorization for a multiple data frame operation

I am looking to increase the speed of an operation within pandas, and I have learned that it is generally best to do so via vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column and a city column
df2 = another (considerably larger) table with a date-time column and a city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need the rows in your df2 whose city appears in df1 and where the time difference is greater than 8 (hours).
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we check the difference of those times and keep only the rows where it is greater than 8, using numpy.where() to flag them so we can filter later:
>>> import numpy as np
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, 'Retain', 'Remove')
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag column from the final output:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
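If you actually want the boolean result column on df2 itself, as in the original loop, here is a hedged sketch reusing the toy frames above. It keeps the answer's direction of subtraction (df1 time minus df2 time); swap the operands if you need the loop's original df2-minus-df1 direction.

import pandas as pd

df1 = pd.DataFrame({'time': [24, 20, 15, 10, 5], 'city': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'time': [2, 4, 6, 8, 10, 12, 14], 'city': ['A', 'B', 'C', 'F', 'G', 'H', 'D']})

# keep df2's row index so the per-pair comparison can be mapped back to rows
merged = df2.reset_index().merge(df1, on='city', how='left', suffixes=('_df2', '_df1'))
merged['hit'] = merged['time_df1'] - merged['time_df2'] > 8

# a df2 row gets True if any matching df1 row satisfies the condition
df2['result'] = merged.groupby('index')['hit'].any()
print(df2)
# rows for cities A, B, C are True; the rest are False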

update one dataframe with data from another, for one specific column - Pandas and Python

I'm trying to update one dataframe with data from another, for one specific column called 'Data'. Both dataframes have a unique key column 'ID' and a 'Data' column. I want the 'Data' values from df2 to overwrite the 'Data' entries in df1, but only for the rows that exist in df1. Where there is no corresponding 'ID' in df2, the df1 entry should remain.
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
Expected output:
new df3 expected outcome.
ID Data Data1
1 AAB BB
2 AB BF
3 AAL BK
4 MNL BL
df2 is a master list of values which never changes and has thousands of entries, whereas df1 sometimes only has a few hundred entries.
I have looked at pd.merge and combine_first however can't seem to get the right combination.
df3 = pd.merge(df1, df2, on='ID', how='left')
Any help much appreciated.
Create new dataframe
Here is one way making use of update:
df3 = df1[:].set_index('ID')
df3['Data'].update(df2.set_index('ID')['Data'])
df3.reset_index(inplace=True)
Or we could use maps/dicts and reassign (Python >= 3.5)
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
Python < 3.5:
m = df1.set_index('ID')['Data']
m.update(df2.set_index('ID')['Data'])
df3 = df1[:].assign(Data=df1['ID'].map(m))
Update df1
Are you open to updating df1 in place? If ID is already the index:
df1.update(df2)
Or if ID is not the index:
m = df2.set_index('ID')['Data']
df1.loc[df1['ID'].isin(df2['ID']), 'Data'] = df1['ID'].map(m)
Or:
df1.set_index('ID',inplace=True)
df1.update(df2.set_index('ID'))
df1.reset_index(inplace=True)
Note: There might be something that makes more sense :)
Full example:
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
print(df3)
Returns:
ID Data Data1
0 1 AAB BB
1 2 AB BF
2 3 AAL BK
3 4 MNL BL
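Since the question mentions combine_first, a hedged sketch along those lines (assuming the same df1 and df2 as above; values from df2 win wherever the IDs overlap) would be:

s = df2.set_index('ID')['Data'].combine_first(df1.set_index('ID')['Data'])

# keep only df1's rows and put the merged Data values back
df3 = df1.drop(columns='Data').join(s, on='ID')[['ID', 'Data', 'Data1']]
print(df3)
#    ID Data Data1
# 0   1  AAB    BB
# 1   2   AB    BF
# 2   3  AAL    BK
# 3   4  MNL    BL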

isin pandas dataframe from 2 other dataframe

I have a pandas dataframe.
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China', 'India', 'Pakistan', 'lanka'],
                   'id': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
I also have two more dataframes, df2 and df3.
df2 = pd.DataFrame({'countries': ['Germany', 'China'],
                    'capital': ['c', 'd']})
df3 = pd.DataFrame({'countries': ['lanka', 'USA'],
                    'capital': ['g', 'a']})
I want to find the rows in df whose id appears in df2 or df3.
I had this code:
df[df.id.isin(df2.capital)]
but it only finds the rows that are in df2.
Is there any way I can do it for both df2 and df3 in a single expression,
i.e. rows from df whose id is in either df2 or df3?
I think you simply need to add both lists together:
print (df[df.id.isin(df2.capital.tolist() + df3.capital.tolist())])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
Another solution is to use numpy.setxor1d - set exclusive-or of two arrays (this works here because the capital lists of df2 and df3 don't overlap; values present in both would be excluded):
import numpy as np

print (df[df.id.isin(np.setxor1d(df2.capital, df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
Or the solution from the comments, using or (|):
print (df[(df.id.isin(df2.capital)) | (df.id.isin(df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
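A small variant of the first approach that scales to any number of lookup frames is to concatenate the capital columns and pass the result to isin (a sketch using the same df, df2 and df3):

capitals = pd.concat([df2['capital'], df3['capital']])  # stack the lookup columns
print(df[df['id'].isin(capitals)])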

pandas ordering columns misses values

Not sure of the right title for this. But I need to take a column from a dataframe and show the top five results. The column is a mix of integers and n/a values. As an example I create a basic dataframe:
regiona col1
a n/a
a 1
a 200
b 208
b 400
b 560
b 600
c 800
c 1120
c 1200
c 1680
d n/a
d n/a
And so run:
import pandas as pd
df = pd.read_csv('test_data.csv')
I then created a basic function so I could use this on different columns:
def max_search(indicator):
    displaced_count = df[df[indicator] != 'n/a']
    table = displaced_count.sort_values([indicator], ascending=[False])
    return table.head(5)
But when I run
max_search('col1')
It returns:
regiona col1
7 c 800
6 b 600
5 b 560
4 b 400
3 b 208
So it misses anything greater than 800. The steps I think the function should be doing are:
Filter out n/a values
Return the top five values.
However, it is not returning anything over 800? Am I missing something very obvious?
Check your dataframe's dtypes: col1 is currently object, so the values are sorted as strings ('800' sorts after '1680' lexicographically). First make sure col1's datatype is numeric.
Use na_values at pd.read_csv() and your function will work as expected:
df = pd.read_csv('test_data.csv', na_values='n/a')
# df.dtypes
You could also do:
df['col1'] = pd.to_numeric(df['col1'], errors='coerce')
df.dropna().sort_values(['col1'], ascending=False).head(5)
regiona col1
10 c 1680.0
9 c 1200.0
8 c 1120.0
7 c 800.0
6 b 600.0
