Sorting Pandas dataframe with variable columns - python

I have an arbitrary number of data frames (3 in this case). I am trying to pick out the trip with the highest speed between the starting destination (column A) and the final destination (the last destination column, which varies per frame). These trips need to be stored in a new dataframe.
import pandas as pd

d = {'A': ['London', 'London', 'London', 'London', 'Budapest'],
     'B': ['Beijing', 'Sydney', 'Warsaw', 'Budapest', 'Warsaw'],
     'Speed': [1000, 2000, 500, 499, 500]}
df = pd.DataFrame(data=d)
d1 = {'A': ['London', 'London', 'London', 'Budapest'],
      'B': ['Rio', 'Rio', 'Rio', 'Rio'],
      'C': ['Beijing', 'Sydney', 'Budapest', 'Warsaw'],
      'Speed': [2000, 1000, 500, 500]}
df1 = pd.DataFrame(data=d1)
d2 = {'A': ['London', 'London', 'London', 'London'],
      'B': ['Florence', 'Florence', 'Florence', 'Florence'],
      'C': ['Rio', 'Rio', 'Rio', 'Rio'],
      'D': ['Beijing', 'Sydney', 'Oslo', 'Warsaw'],
      'Speed': [500, 500, 500, 1000]}
df2 = pd.DataFrame(data=d2)
The desired output for this particular case would look like this:
A B C D Speed
London Rio Beijing NaN 2000
London Sydney NaN NaN 2000
London Florence Rio Warsaw 1000
London Florence Rio Oslo 500
London Rio Budapest NaN 500
Budapest Warsaw NaN NaN 500
I started by appending the dataframes with:
df.append(df1).append(df2)

First join all DataFrames together with concat and sort by column Speed, descending. Then forward-fill missing values along each row with ffill so the final destination lands in column D, and filter out rows whose (A, D) pair is duplicated, keeping the fastest trip:
df = pd.concat([df, df1, df2]).sort_values('Speed', ascending=False)
df = df[~df.ffill(axis=1).duplicated(['A','D'])].reset_index(drop=True)
print (df)
A B C D Speed
0 London Sydney NaN NaN 2000
1 London Rio Beijing NaN 2000
2 London Florence Rio Warsaw 1000
3 Budapest Warsaw NaN NaN 500
4 London Rio Budapest NaN 500
5 London Florence Rio Oslo 500
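The row-wise forward fill is the key step: it copies each trip's final destination into column D, so duplicated(['A','D']) compares trips by their start and final stop no matter how many intermediate columns a frame had. A minimal sketch of that intermediate state (made-up rows):

```python
import pandas as pd

# Two trips whose final destinations sit in different columns
trips = pd.DataFrame({'A': ['London', 'London'],
                      'B': ['Sydney', 'Florence'],
                      'C': [None, 'Rio'],
                      'D': [None, 'Oslo'],
                      'Speed': [2000, 500]})

# After a row-wise forward fill, the final stop always lands in column D
filled = trips.ffill(axis=1)
print(filled[['A', 'D']])  # row 0's D is now 'Sydney', row 1's is 'Oslo'
```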

You can sort a dataframe by values or by index. For example, to sort by column B:
For a single column:
`df.sort_values(by=['B'])`
For multiple columns:
`df.sort_values(by=['col1', 'col2'])`
You can also sort by the index values with `df.sort_index()`.
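A quick runnable illustration of these forms (the column names here are just placeholders):

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 2], 'C': [1, 1, 0]})

# Single column: B becomes 1, 2, 3
print(df.sort_values(by=['B']))

# Multiple columns: sort by C first, break ties with B
print(df.sort_values(by=['C', 'B']))

# By index
print(df.sort_index())
```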

Related

Is there a way to take all of my row values and make them col headers?

I currently have a long list of countries (234 values). For simplicity's sake, picture 1 displays only 10. This is what my df currently looks like:
I want to create a matrix of some sort, where the countries listed in each row are also the col headers. This is what I want my output dataframe to look like:
(current df: a single Country column)
Country
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City
(desired output: a matrix whose row index and column headers are both this same country list)
So to reiterate the question: how do I take the value in each row of column 1 and copy it to be my dataframe's column headers, creating a matrix? This dataframe is also being scraped from a website using requests and Beautiful Soup, so I can't just create a CSV file of a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just append all elements of `countryList` according to your use case. This yields an empty dataframe to insert data into.
      China India U.S.
China   NaN   NaN  NaN
India   NaN   NaN  NaN
U.S.    NaN   NaN  NaN
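Once the empty matrix exists, individual cells can be filled by label with .loc. A small sketch (the numbers are placeholders, not real data):

```python
import pandas as pd

countryList = ['China', 'India', 'U.S.']
matrix = pd.DataFrame(columns=countryList, index=countryList)

# Fill one cell at a time by (row, column) label; unfilled cells stay NaN
matrix.loc['China', 'India'] = 125
matrix.loc['India', 'China'] = 125
print(matrix)
```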
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index:
data = ["US", "China", "England", "Spain", "Brazil"]
df = pd.DataFrame({"Country": data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output
Country U.S. Canada India
U.S. NaN NaN NaN
Canada NaN NaN NaN
India NaN NaN NaN

Pandas iterate over rows and find the column names

I have two dataframes:
df = pd.DataFrame({'America': ["Ohio", "Utah", "New York"],
                   'Italy': ["Rome", "Milan", "Venice"],
                   'Germany': ["Berlin", "Munich", "Jena"]})
df2 = pd.DataFrame({'Cities': ["Rome", "New York", "Munich"],
                    'Country': ["na", "na", "na"]})
I want to iterate over the "Cities" column of df2, find each city in df, and append the country of that city (the df column name) to df2's Country column.
Use melt, then map the cities to countries with a dictionary built from the melted frame:
df1 = df.melt()
print (df1)
variable value
0 America Ohio
1 America Utah
2 America New York
3 Italy Rome
4 Italy Milan
5 Italy Venice
6 Germany Berlin
7 Germany Munich
8 Germany Jena
df2['Country'] = df2['Cities'].map(dict(zip(df1['value'], df1['variable'])))
#alternative, thanks #Sandeep Kadapa
#df2['Country'] = df2['Cities'].map(df1.set_index('value')['variable'])
print (df2)
Cities Country
0 Rome Italy
1 New York America
2 Munich Germany
After melting and renaming the first dataframe:
df1 = df.melt().rename(columns={'variable': 'Country', 'value': 'Cities'})
the solution is a simple merge:
df2 = df2[['Cities']].merge(df1, on='Cities')
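One caveat: merge defaults to an inner join, so any city in df2 that never appears in df would silently drop out; passing how='left' keeps such rows with NaN in Country. A sketch with a made-up unmatched city:

```python
import pandas as pd

df1 = pd.DataFrame({'Country': ['Italy', 'America'],
                    'Cities': ['Rome', 'New York']})
df2 = pd.DataFrame({'Cities': ['Rome', 'New York', 'Atlantis']})

# how='left' keeps every row of df2; 'Atlantis' gets Country = NaN
out = df2.merge(df1, on='Cities', how='left')
print(out)
```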

pandas fill missing country values based on city if it exists

I'm trying to fill in missing country names in my dataframe based on city, using the city/country pairs that do exist. For example, in the dataframe below I want to replace NaN for the city Bangalore with the country India, since that city already appears elsewhere in the dataframe with its country.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1['Country'] = df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward-filling the missing values within each group. (Note that this only helps when a non-null country appears before the NaN within its group.)
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
first use forward fill and then backward fill (for the case where the NaN occurs first in a group):
df = df.groupby('City')[['City','Country']].fillna(method = 'ffill').groupby('City')[['City','Country']].fillna(method = 'bfill')
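A slightly tidier variant of the same ffill-then-bfill idea uses groupby().transform, which keeps the fill inside each city group in one pass (this assumes each city has at least one non-null country somewhere):

```python
import pandas as pd

df1 = pd.DataFrame({'City': ['Bangalore', 'Delhi', 'Bangalore'],
                    'Country': [None, 'India', 'India']})

# Forward-fill then back-fill within each City group, so a NaN that
# comes first in its group is still filled
df1['Country'] = (df1.groupby('City')['Country']
                     .transform(lambda s: s.ffill().bfill()))
print(df1)  # row 0's Country is filled in as 'India'
```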

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one without any key column so that it looks like the above output.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order intact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index().drop(columns='index')
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven
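This works because Series.sort_values places missing values last by default (na_position='last'), so the non-null comments line up with the first rows:

```python
import pandas as pd

s = pd.Series(['X', None, 'Y', None, 'Z'])
# Non-null values are sorted first; missing values go to the end
print(list(s.sort_values()))
```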

Cells all becomes NaN after reordering alphabetically

After I tried to sort my Pandas dataframe by the country column with:
times_data2.reindex_axis(sorted(times_data2['country']), axis=1)
My dataframe became something like:
Argetina Argentina .... United States of America ...
NaN Nan .... NaN ....
If you want to set the index of the dataframe to sorted countries:
df = pd.DataFrame({'country': ['Brazil', 'USA', 'Argentina'], 'val': [1, 2, 3]})
>>> df
country val
0 Brazil 1
1 USA 2
2 Argentina 3
>>> df.set_index('country').sort_index()
val
country
Argentina 3
Brazil 1
USA 2
You may want to transpose these results:
>>> df.set_index('country').sort_index().T
country Argentina Brazil USA
val 3 1 2
If you want to sort by a column, use .sort_values():
times_data2.sort_values(by='country')
Then use .set_index('country') if necessary.
