All cells become NaN after reordering alphabetically - python

After I tried to sort my Pandas dataframe by the country column with:
times_data2.reindex_axis(sorted(times_data2['country']), axis=1)
my dataframe became something like:
Argetina  Argentina  ...  United States of America  ...
NaN       NaN        ...  NaN                       ...

The call above uses the sorted country values as column labels; since the dataframe has no columns with those names (and reindex_axis is deprecated in favor of reindex), every cell comes back NaN. If you want to set the index of the dataframe to sorted countries:
df = pd.DataFrame({'country': ['Brazil', 'USA', 'Argentina'], 'val': [1, 2, 3]})
>>> df
     country  val
0     Brazil    1
1        USA    2
2  Argentina    3
>>> df.set_index('country').sort_index()
           val
country
Argentina    3
Brazil       1
USA          2
You may want to transpose these results:
>>> df.set_index('country').sort_index().T
country  Argentina  Brazil  USA
val              3       1    2

If you want to sort by a column, use .sort_values():
times_data2.sort_values(by='country')
Then use .set_index('country') if necessary.
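For reference, a minimal sketch of the sort_values approach on a toy dataframe (times_data2 is the asker's data; the column names here are assumptions):

```python
import pandas as pd

# Toy stand-in for times_data2 (the real column names are assumptions)
df = pd.DataFrame({'country': ['Brazil', 'USA', 'Argentina'],
                   'val': [1, 2, 3]})

# sort_values reorders the rows; the column labels are untouched,
# so no NaNs are introduced
sorted_df = df.sort_values(by='country').reset_index(drop=True)
print(sorted_df)
```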


Is there a way to take all of my row values and make them col headers?

I currently have a long list of countries (234 values). For simplicity's sake, picture 1 displays only 10. This is what my df currently looks like:
I want to create a matrix of some sort, where the countries listed in each row are also the col headers. This is what I want my output dataframe to look like:
Country           China  India  U.S.  Indonesia  Pakistan  ...  Montserrat  Falkland Islands  Niue  Tokelau  Vatican City
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City
So to reiterate the question: how do I take the value in each row of column 1 and copy it to be my dataframe's column headers, creating a matrix? This dataframe is also being scraped from a website using requests and Beautiful Soup, so I can't just create a CSV file from a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows:
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just append all elements of `countryList` according to your use case. This yields an empty dataframe to insert data into:
       China India  U.S.
China    NaN   NaN   NaN
India    NaN   NaN   NaN
U.S.     NaN   NaN   NaN
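Cells of the empty matrix can then be filled by label with .loc; a minimal sketch (the values written here are placeholders, not part of the original answer):

```python
import pandas as pd

countryList = ['China', 'India', 'U.S.']
matrix = pd.DataFrame(columns=countryList, index=countryList)

# Write individual cells by (row label, column label);
# the numbers are placeholders for whatever the use case needs
matrix.loc['China', 'India'] = 1
matrix.loc['India', 'U.S.'] = 2
print(matrix)
```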
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index like:
data = ["US","China","England","Spain","Brazil"]
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output:
Country  U.S.  Canada  India
U.S.      NaN     NaN    NaN
Canada    NaN     NaN    NaN
India     NaN     NaN    NaN

argument of type "float" is not iterable when trying to use for loop

I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id  Country         release_year
s1  [US]            2020
s2  [South Africa]  2021
s3  NaN             2021
s4  NaN             2021
s5  [India]         2021
I want to make a new df which looks like this:
country_yeardf
Year  US   UK   Japan  India
1925  NaN  NaN  NaN    NaN
1926  NaN  NaN  NaN    NaN
1927  NaN  NaN  NaN    NaN
1928  NaN  NaN  NaN    NaN
It has the release year and the number of movies released in each country.
My solution: with a blank df like the second one, run a for loop to count the number of movies released and then update the cell values accordingly.
countrylist = ['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0, 8807)):
        if x in countrydf.country[j]:
            t = int(countrydf.release_year[j])
            country_yeardf.at[t, x] = country_yeardf.at[t, x] + 1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which value is of float type here; I checked the type of countrydf.country[j] and it returned int.
I am just getting started with pandas. Can anyone please explain the error and suggest a solution for the df I want to create?
P.S.: my English is not so good, so I hope you guys understand.
Here is a solution using groupby. (The error itself comes from the NaN rows: NaN is a float, and `x in NaN` is not a valid membership test, hence "argument of type 'float' is not iterable".)
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
   country  year
0       US  2015
1    India  2015
2       US  2015
3   Russia  2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country  India  Russia   US
year
2015       1.0     NaN  2.0
2016       NaN     1.0  NaN
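A plain unstack leaves float counts with NaN for missing year/country pairs; passing fill_value=0 keeps the counts as integers (same toy data as in this answer):

```python
import pandas as pd

df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]],
                  columns=['country', 'year'])

# fill_value=0 avoids the NaN/float output of a plain unstack
counts = df.groupby(['year', 'country']).size().unstack(fill_value=0)
print(counts)
```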
Some alternative ways to achieve this in pandas without loops.
If the Country column can have more than one value in the list in each row, you can try:
>>> df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
              India  South Africa  US
release_year
2020              0             0   1
2021              1             1   0
Else, if Country has just one value per row in the list, as in your example, you can use crosstab:
>>> pd.crosstab(df['release_year'], df['Country'].str[0])
Country       India  South Africa  US
release_year
2020              0             0   1
2021              1             1   0
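If rows can list several countries, another loop-free route is explode, which turns each list into one row per country (the NaN rows simply drop out of the crosstab). A sketch on a small stand-in for countrydf:

```python
import numpy as np
import pandas as pd

# Small stand-in for countrydf; one row holds a two-country list
df = pd.DataFrame({'Country': [['US'], ['South Africa'], np.nan, ['India', 'US']],
                   'release_year': [2020, 2021, 2021, 2021]})

# explode: one row per (movie, country) pair; NaN stays NaN and is
# ignored by crosstab
exploded = df.explode('Country')
table = pd.crosstab(exploded['release_year'], exploded['Country'])
print(table)
```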

Pandas iterate over rows and find the column names

I have two dataframes:
df = pd.DataFrame({'America': ["Ohio", "Utah", "New York"],
                   'Italy': ["Rome", "Milan", "Venice"],
                   'Germany': ["Berlin", "Munich", "Jena"]})
df2 = pd.DataFrame({'Cities': ["Rome", "New York", "Munich"],
                    'Country': ["na", "na", "na"]})
I want to iterate over df2's "Cities" column, find each city in df, and append the country of the city (the df column name) to the df2 Country column.
Use melt with map by dictionary:
df1 = df.melt()
print (df1)
  variable     value
0  America      Ohio
1  America      Utah
2  America  New York
3    Italy      Rome
4    Italy     Milan
5    Italy    Venice
6  Germany    Berlin
7  Germany    Munich
8  Germany      Jena
df2['Country'] = df2['Cities'].map(dict(zip(df1['value'], df1['variable'])))
#alternative, thanks #Sandeep Kadapa
#df2['Country'] = df2['Cities'].map(df1.set_index('value')['variable'])
print (df2)
     Cities  Country
0      Rome    Italy
1  New York  America
2    Munich  Germany
After melting and renaming the first dataframe:
df1 = df.melt().rename(columns={'variable': 'Country', 'value': 'Cities'})
the solution is a simple merge:
df2 = df2[['Cities']].merge(df1, on='Cities')
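Note that merge keeps only matching rows by default; with how='left' any city missing from df survives with NaN in Country. A sketch using the same frames (the unmatched city 'Atlantis' is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'America': ["Ohio", "Utah", "New York"],
                   'Italy': ["Rome", "Milan", "Venice"],
                   'Germany': ["Berlin", "Munich", "Jena"]})
df1 = df.melt().rename(columns={'variable': 'Country', 'value': 'Cities'})

# 'Atlantis' does not occur in df, so it gets NaN instead of being dropped
df2 = pd.DataFrame({'Cities': ["Rome", "New York", "Atlantis"]})
result = df2.merge(df1, on='Cities', how='left')
print(result)
```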

Sorting Pandas dataframe with variable columns

I have an arbitrary number of data frames (3 in this case). I am trying to pick out the trip with the highest speed between the starting destination (column A) and the final destination (the last column, which varies between dataframes). These trips need to be stored in a new dataframe.
d = {'A': ['London', 'London', 'London', 'London', 'Budapest'],
     'B': ['Beijing', 'Sydney', 'Warsaw', 'Budapest', 'Warsaw'],
     'Speed': [1000, 2000, 500, 499, 500]}
df = pd.DataFrame(data=d)
d1 = {'A': ['London', 'London', 'London', 'Budapest'],
      'B': ['Rio', 'Rio', 'Rio', 'Rio'],
      'C': ['Beijing', 'Sydney', 'Budapest', 'Warsaw'],
      'Speed': [2000, 1000, 500, 500]}
df1 = pd.DataFrame(data=d1)
d2 = {'A': ['London', 'London', 'London', 'London'],
      'B': ['Florence', 'Florence', 'Florence', 'Florence'],
      'C': ['Rio', 'Rio', 'Rio', 'Rio'],
      'D': ['Beijing', 'Sydney', 'Oslo', 'Warsaw'],
      'Speed': [500, 500, 500, 1000]}
df2 = pd.DataFrame(data=d2)
The desired output for this particular case would look like this:
A         B         C         D       Speed
London    Rio       Beijing   NaN     2000
London    Sydney    NaN       NaN     2000
London    Florence  Rio       Warsaw  1000
London    Florence  Rio       Oslo    500
London    Rio       Budapest  NaN     500
Budapest  Warsaw    NaN       NaN     500
I started by appending the dataframes with:
df.append(df1).append(df2)
First join all DataFrames together and sort by column Speed. Then forward-fill missing values across each row with ffill, so the final destination always lands in column D, and filter out repeated (A, D) trips with a duplicated boolean mask:
df = pd.concat([df, df1, df2]).sort_values('Speed', ascending=False)
df = df[~df.ffill(axis=1).duplicated(['A','D'])].reset_index(drop=True)
print (df)
          A         B         C       D  Speed
0    London    Sydney       NaN     NaN   2000
1    London       Rio   Beijing     NaN   2000
2    London  Florence       Rio  Warsaw   1000
3  Budapest    Warsaw       NaN     NaN    500
4    London       Rio  Budapest     NaN    500
5    London  Florence       Rio    Oslo    500
You can sort a data frame by values or by index. For example, to sort by column B:
df.sort_values(by=['B'])
To sort by multiple columns:
df.sort_values(by=['col1', 'col2'])
You can also sort by the index values with df.sort_index().
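sort_values also accepts a per-column sort direction (and a na_position argument for where NaNs land); a small illustration on toy data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['b', 'a', 'b'], 'col2': [1, 2, 3]})

# Ascending on col1, descending on col2 within equal col1 values
out = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(out)
```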

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another. My plan is to sort the second data frame and then merge it to the first one without any key column so that it looks like the output above.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
     Country comments
0        USA        X
1         UK        Y
2    Finland        Z
3      Spain      NaN
4  Australia      NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order intact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index().drop(columns='index')
     Country    Name Comments
0        USA     Sam        X
1         UK   Chris        Y
2    Finland    Jeff        Z
3      Spain  Kartik      NaN
4  Australia  Mavenn      NaN
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
  Comments    Country    Name
1        X        USA     Sam
2        Y         UK   Chris
3        Z    Finland    Jeff
4      NaN      Spain  Kartik
5      NaN  Australia   Maven
