Replacing values in a pandas dataframe based on multiple conditions - python

I have a fairly simple question based on this sample code:
x1 = 10*np.random.randn(10,3)
df1 = pd.DataFrame(x1)
I am looking for a single DataFrame derived from df1 where positive values are replaced with "up", negative values are replaced with "down", and 0 values, if any, are replaced with "zero". I have tried using the .where() and .mask() methods but could not obtain the desired result.
I have seen other posts which filter according to multiple conditions at once, but they do not show how to replace values according to different conditions.

df1.apply(np.sign).replace({-1: 'down', 1: 'up', 0: 'zero'})
Output:
0 1 2
0 down up up
1 up down down
2 up down down
3 down down up
4 down down up
5 down up up
6 down up down
7 up down down
8 up up down
9 down up up
P.S. Getting exactly zero with randn is pretty unlikely, of course

For multiple conditions ie. (df['employrate'] <=55) & (df['employrate'] > 50)
use this:
df['employrate'] = np.where(
(df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
)
or you can do it this way as well,
gm.loc[(gm['employrate'] <55) & (gm['employrate'] > 50),'employrate']=11
here informal syntax can be:
<dataset>.loc[<filter1> & (<filter2>),'<variable>']='<value>'
out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
therefore syntax we used here is:
df['<column_name>'] = np.where((<filter 1> ) & (<filter 2>) , <new value>, df['column_name'])
for single condition, ie. ( 'employrate'] > 70 )
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
therefore syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]

In general, you could use np.select on the values and re-build the DataFrame
import pandas as pd
import numpy as np
df1 = pd.DataFrame(10*np.random.randn(10, 3))
df1.iloc[0, 0] = 0 # So we can check the == 0 condition
conds = [df1.values < 0 , df1.values > 0]
choices = ['down', 'up']
pd.DataFrame(np.select(conds, choices, default='zero'),
index=df1.index,
columns=df1.columns)
Output:
0 1 2
0 zero down up
1 up down up
2 up up up
3 down down down
4 up up up
5 up up up
6 up up down
7 up up down
8 down up down
9 up up down

IF condition with OR
from pandas import DataFrame
names = {'First_name': ['Jon','Bill','Maria','Emma']}
df = DataFrame(names,columns=['First_name'])
df.loc[(df['First_name'] == 'Bill') | (df['First_name'] == 'Emma'), 'name_match'] = 'Match'
df.loc[(df['First_name'] != 'Bill') & (df['First_name'] != 'Emma'), 'name_match'] = 'Mismatch'
print (df)
Output
First_name name_match
0 Jon Mismatch
1 Bill Match
2 Maria Mismatch
3 Emma Match

Related

Find the second largest date in df python [duplicate]

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb']
I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest() :
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from #Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
Here is an interesting approach. What if we replace the maximum value with the minimum value and calculate. Although it is a quick hack and, not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(),df(min)).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Here is a solution using nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col,n):
... largest = col.nlargest(n).reset_index(drop = True)
... data = [x for x in largest]
... index = [f'{i}_largest' for i in range(1,len(largest)+1)]
... return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
... '''
... Function to return the n-largest value of each
... column/row of the input DataFrame.
... '''
... return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
return (df.nlargest(2).min())
print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
tranpose and use nlargest in a for loop to get the results order by each line:
df1=df.T
results=list()
for col in df1.columns: results.append(df1[col].nlargest(len(df.columns))
the results var is a list of pandas objects, where the first item on the list will be the df's first row sorted in descending order and so on. Since each item on the list is a pandas object, it carries df's column as index (it was transposed), so you will get the values and the df's columns name of each row sorted
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....

How can I get a previous value with some condition in a dataframe in pandas

Hello I´m trying to get the previous value in an specific column if this valuecontains "-":
this is my code:
count=-1
for i, row in df1.iterrows():
count=count + 1
if row["SUBCAPITULO"]== " "and count>0 and "-" in df1.loc[count-1:"SUBCAPITULO"]:
row["SUBCAPITULO"]= df1.loc[count-1:"SUBCAPITULO"]
Use shift:
Sample:
>>> df
SUBCAPITULO
0 dash-
1
2 comma,
3
4 dot.
df.loc[(df['SUBCAPITULO'] == ' ') &
(df['SUBCAPITULO'].shift().str.contains('-'))] = df['SUBCAPITULO'].shift()
>>> df
SUBCAPITULO
0 dash-
1 dash-
2 comma,
3
4 dot.
Desired output has not been posted and the request is unclear.
Taking these comments into account:
trying to get the previous row value in an specific column if this value contains a dash
I think OP wants to fill the dash- value over multiple lines
Maybe this helps...
import pandas as pd
df = pd.read_csv('test.csv')
print(df, '\n\n')
'''
Shows:
Other_data SUBCAPITULO
0 qwer NaN
1 vfds NaN
2 sdfg 1.01 – TORRE – 1
3 hfgt NaN
4 jkiu capitulo
5 bvcd 2.01 – TORRE – 1
6 grnc NaN
7 sdfg capitulo
8 poij NaN
9 fghg 2.01 – TORRE – 1
'''
for i in reversed(df.index):
if i >= 1:
if '–' in str(df.loc[i, 'SUBCAPITULO']):
if str(df.loc[i-1, 'SUBCAPITULO']) == 'nan':
df.loc[i-1, 'SUBCAPITULO'] = df.loc[i, 'SUBCAPITULO']
print(df)
'''
Shows:
Other_data SUBCAPITULO
0 qwer 1.01 – TORRE – 1
1 vfds 1.01 – TORRE – 1
2 sdfg 1.01 – TORRE – 1
3 hfgt NaN
4 jkiu capitulo
5 bvcd 2.01 – TORRE – 1
6 grnc NaN
7 sdfg capitulo
8 poij 2.01 – TORRE – 1
9 fghg 2.01 – TORRE – 1
'''
print('\n')
df1.reset_index(drop=True, inplace=True)
for i in range(1, len(df1)):
if df1.loc[i, 'SUBCAPITULO'] == " " and "-" in df1.loc[i-1, 'SUBCAPITULO']:
df1.loc[i, 'SUBCAPITULO']=df1.loc[i-1, 'SUBCAPITULO']
df1.dropna(inplace=True)
The problem was that I had TO RESET INDEX before the loop.
enter image description here

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'
])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result that I get on my "SortOrder" column is correct but I wonder if there is some better approach pandas-wise?
Thank you!
The best way to do this would be to use ngroup and cumcount
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302

How can I merge these two datasets on 'Name' and 'Year'?

I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year', 'name'. But the problem is, these both data frames has different names like in the first dataset, it has name 'Glenn Davis' but in second dataset it has 'Glen Davis'.
Now, I want to know that How can I merge both of them using difflib library even it has different names?
Any help will be appreciated ...
Thanks in advance.
I have used this code which I got in a question asked at this platform but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest, If i can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
Let me get to your problem by assuming that you have to make a data set with 2 columns and the 2 columns being 1. 'year' and 2. 'name'
okay
1. we will 1st rename all the names which are wrong
I hope you know all the wrong names from all_batting_statistics_df using this
all_batting_statistics_df.replace(regex=r'^Glen.$', value='Glenn Davis')
once you have corrected all the spellings, choose the smaller one which has the names you know, so it doesn't take long
2. we need both data sets to have the same columns i.e. only 'year' and 'name'
use this to drop the columns we don't need
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'])
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns so I left them, you have to add to the above code
3. we need to change the column names to look the same i.e. 'year' and 'name' using python dataframe rename
df_new_1 = all_batting_statistics_df(colums={'Year': 'year', 'Name':'name'})
4. next, to merge them
we will use this
all_batsman_df.merge(df_new_1, left_on='year', right_on='name')
FINAL THOUGHTS:
If you don't want to do all this find a way to export the data set to google sheets or microsoft excel and use edit them with those advanced software, if you like pandas then its not that difficult you will find a way, all the best!

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column name are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the authors:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the authors output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No columns names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the columns names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied #AndyHayden numpy based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … Wow. What an enormous performance difference. I get the exact same
correct answer and I get it as soon as I click the mouse as compared to the pandas lambda solution also provided by #AndyHayden which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows. The numpy solution returns the sort instantly.

Categories