I would like to rename the multi-level columns of a pandas DataFrame to single-level columns. My code so far does not raise any errors, but it does not rename anything either. Any suggestions for improving it?
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3][[('Country', 'Country'), ('GDP[10]', 'GDP[10]')]]\
    .rename(columns={('Country', 'Country'): 'Country', ('GDP[10]', 'GDP[10]'): 'GDP'})
df
I prefer to use the rename method. df.columns = ['Country', 'GDP'] works, but it is not what I am looking for.
For a rename-based solution, first flatten the MultiIndex values by joining them with '_', then build the dictionary by zipping the flattened names with the new column names. (rename does not match whole MultiIndex tuples, because the mapper is applied to the labels of each level separately, which is why your original code changed nothing.)
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3]
df.columns = df.columns.map('_'.join)
old = ['No._No.', 'Country_Country', 'GNI (Atlas method)[8]_value (a)',
       'GNI (Atlas method)[8]_a - GDP', 'GNI[9]_value (b)', 'GNI[9]_b - GDP',
       'GDP[10]_GDP[10]']
new = ['No.', 'Country', 'GNI a', 'GDP a', 'GNI b', 'GDP b', 'GDP']
df = df.rename(columns=dict(zip(old, new)))
If you want to write out the dictionary for rename explicitly:
d = {'No._No.': 'No.', 'Country_Country': 'Country', 'GNI (Atlas method)[8]_value (a)': 'GNI a', 'GNI (Atlas method)[8]_a - GDP': 'GDP a', 'GNI[9]_value (b)': 'GNI b', 'GNI[9]_b - GDP': 'GDP b', 'GDP[10]_GDP[10]': 'GDP'}
df = df.rename(columns=d)
print (df)
No. Country GNI a GDP a GNI b GDP b GDP
0 1 United States 20636317 91974 20837347 293004 20544343
1 2 China 13181372 -426779 13556853 -51298 13608151
2 3 Japan 5226599 255276 5155423 184100 4971323
3 4 Germany 3905321 -42299 4058030 110410 3947620
4 5 United Kingdom 2777405 -77891 2816805 -38491 2855296
5 6 France 2752034 -25501 2840071 62536 2777535
6 7 India 2727893 9161 2691040 -27692 2718732
7 8 Italy 2038376 -45488 2106525 22661 2083864
8 9 Brazil 1902286 16804 1832170 -53312 1885482
9 10 Canada 1665565 -47776 1694054 -19287 1713341
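For reference, here is the same flatten-then-rename pattern on a small self-contained frame, so it does not depend on the live Wikipedia page (the toy data below is made up for illustration):
import pandas as pd
# toy frame with two-level columns standing in for the scraped table
toy = pd.DataFrame(
    [[1, 'United States', 20544343]],
    columns=pd.MultiIndex.from_tuples(
        [('No.', 'No.'), ('Country', 'Country'), ('GDP[10]', 'GDP[10]')]
    ),
)
# flatten the MultiIndex, then rename with an ordinary dict
toy.columns = toy.columns.map('_'.join)
toy = toy.rename(columns={'No._No.': 'No.',
                          'Country_Country': 'Country',
                          'GDP[10]_GDP[10]': 'GDP'})
print(toy.columns.tolist())   # ['No.', 'Country', 'GDP']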
As an alternative to rename, you can use get_level_values(). See below:
df.columns = df.columns.get_level_values(0)
>>> print(df)
Country GDP[10]
0 United States 20544343
1 China 13608151
2 Japan 4971323
3 Germany 3947620
4 United Kingdom 2855296
5 France 2777535
6 India 2718732
7 Italy 2083864
8 Brazil 1885482
9 Canada 1713341
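One caveat: get_level_values(0) keeps the footnote markers, so the column comes out as GDP[10] rather than GDP. A possible refinement (a sketch, not tested against the live page) strips them with a regex while taking the first level:
# take the first level and drop footnote markers such as "[10]"
df.columns = df.columns.get_level_values(0).str.replace(r'\[\d+\]', '', regex=True)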
What I need to do is increment the ID based on the value in the Country column. I used this code:
i = 1
for row in new_cols5:
    new_cols5.loc[new_cols5.Country=='Germany','ID'] = 'GR'+str(i)
    new_cols5.loc[new_cols5.Country=='Italy', 'ID'] = 'IT'+str(i)
    new_cols5.loc[new_cols5.Country=='France','ID'] = 'FR'+str(i)
    i += 1
What I get is always the same number concatenated to the ID:
ID    Country
GR1   Germany
FR2   France
IT3   Italy
GR1   Germany
FR2   France
IT3   Italy
Desired output:
ID    Country
GR1   Germany
FR1   France
IT1   Italy
GR2   Germany
FR2   France
IT2   Italy
GR3   Germany
FR3   France
IT3   Italy
GR4   Germany
FR4   France
IT4   Italy
I would appreciate your help. The dataset is shown as an image in the original post.
First, you could use print() to see what new_cols5.loc[] gives you.
new_cols5.loc[] returns all matching rows, and the assignment writes the same value to all of them at once.
You would have to iterate over these rows to assign different values.
Or: get the number of matching rows, build the list ["GR1", "GR2", ...], and assign that list, which doesn't need a for-loop:
matching = new_cols5.loc[new_cols5.Country=='Germany']
count = len(matching)
ids = [f"GR{i}" for i in range(1, count+1)]
new_cols5.loc[new_cols5.Country=='Germany', 'ID'] = ids
or using only a True/False mask with sum(), which treats True as 1 and False as 0:
mask = (new_cols5.Country == 'Germany')
count = sum(mask)
ids = [f'GR{i}' for i in range(1, count+1)]
new_cols5.loc[new_cols5.Country == 'Germany', 'ID'] = ids
Minimal working code:
import pandas as pd
# --- example data ---
data = {
    'ID': ['A','B','C','D','E','F','G','H','I'],
    'Country': ['Germany','France','Italy','Germany','France','Italy','Germany','France','Italy'],
}
df = pd.DataFrame(data)
print(df)
# --- version 1 ---
matching = df.loc[df.Country=='Germany']
count = len(matching)
ids = [f"GR{i}" for i in range(1, count+1)]
df.loc[df.Country=='Germany', 'ID'] = ids
print(df)
# --- version 2 ---
count = sum(df.Country == 'France')
ids = [f'FR{i}' for i in range(1, count+1)]
df.loc[ df.Country == 'France', 'ID' ] = ids
print(df)
Result:
ID Country
0 A Germany
1 B France
2 C Italy
3 D Germany
4 E France
5 F Italy
6 G Germany
7 H France
8 I Italy
# version 1
ID Country
0 GR1 Germany
1 B France
2 C Italy
3 GR2 Germany
4 E France
5 F Italy
6 GR3 Germany
7 H France
8 I Italy
# version 2
ID Country
0 GR1 Germany
1 FR1 France
2 C Italy
3 GR2 Germany
4 FR2 France
5 F Italy
6 GR3 Germany
7 FR3 France
8 I Italy
EDIT:
A version which counts all the values in the column and then replaces the IDs for all countries at once. Note that it builds the prefix from the first two letters of the country name, so GE instead of GR:
for country, count in df.Country.value_counts().items():
    short = country[:2].upper()
    ids = [f'{short}{i}' for i in range(1, count+1)]
    df.loc[ df.Country == country, 'ID' ] = ids
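As a side note, the same numbering can be produced without any explicit loop using groupby().cumcount(), which numbers the occurrences of each country in row order (a sketch, assuming the same df as above; like the version just shown it derives the prefix from the first two letters, so GE rather than GR):
# 0-based occurrence counter per country, shifted to start at 1
counts = df.groupby('Country').cumcount() + 1
df['ID'] = df['Country'].str[:2].str.upper() + counts.astype(str)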
I have a list of dictionaries that also consist of lists and would like to create a dataframe using this list. For example, the data looks like this:
lst = [{'France': [[12548, 'ABC'], [45681, 'DFG'], [45684, 'HJK']]},
       {'USA': [[84921, 'HJK'], [28917, 'KLESA']]},
       {'Japan': [[38292, 'ASF'], [48902, 'DSJ']]}]
And this is the dataframe I'm trying to create
Country Amount Code
France 12548 ABC
France 45681 DFG
France 45684 HJK
USA 84921 HJK
USA 28917 KLESA
Japan 38292 ASF
Japan 48902 DSJ
As you can see, the keys became column values of the country column and the numbers and the strings became the amount and code columns. I thought I could use something like the following, but it's not working.
df = pd.DataFrame(lst)
You probably need to transform the data into a format that pandas can read: pd.DataFrame(lst) creates one column per country with the inner lists as single cell values, which is not what you want.
Original data
data = [
{"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
{"USA": [[84921, "HJK"], [28917, "KLESA"]]},
{"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
Transforming the data
new_data = []
for country_data in data:
    for country, values in country_data.items():
        new_data += [{"Country": country, "Amount": amt, "Code": code} for amt, code in values]
Create the dataframe
df = pd.DataFrame(new_data)
Output
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
df = pd.concat([pd.DataFrame(elem) for elem in lst])
df = df.apply(lambda x: pd.Series(x.dropna().values)).stack()
df = df.reset_index(level=[0], drop=True).to_frame(name='vals')
df = pd.DataFrame(df["vals"].to_list(), index=df.index, columns=['Amount', 'Code']).sort_index()
print(df)
Output:
Amount Code
France 12548 ABC
USA 84921 HJK
Japan 38292 ASF
France 45681 DFG
USA 28917 KLESA
Japan 48902 DSJ
France 45684 HJK
Use a nested list comprehension to flatten the data and pass the result to the DataFrame constructor:
lst = [
{"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
{"USA": [[84921, "HJK"], [28917, "KLESA"]]},
{"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
L = [(country, *x) for country_data in lst
     for country, values in country_data.items()
     for x in values]
df = pd.DataFrame(L, columns=['Country','Amount','Code'])
print (df)
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
Build a new dictionary that combines the individual dicts into one, before concatenating the dataframes:
new_dict = {}
for ent in lst:
    for key, value in ent.items():
        new_dict[key] = pd.DataFrame(value, columns=['Amount', 'Code'])
pd.concat(new_dict, names=['Country']).droplevel(1).reset_index()
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
I want to split one column of my dataframe into multiple columns, attach those columns back to the original dataframe, and then divide the dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
        'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
        'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
        'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
                        'Name':['Ann','Betty','David','George'],
                        'Gender':['F','F','M','M'],
                        'Country':['USA','USA','USA','USA'],
                        'State':['CA','MA','FL','AZ'],
                        'County':['Los Angeles','Suffolk','Charlotte','None'],
                        'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
                           'Name':['Carl','Frank'],
                           'Gender':['M','M'],
                           'Country':['Canada', 'Canada'],
                           'State':['ON','QC']})
I'm stuck here, though. How can I split my original dataframe into people whose Residence includes USA and those whose doesn't, and attach the split columns from Residence (USA and nonUSA) back to the original dataframe?
(Also, I just posted everything I have so far, but I'm curious whether there's a cleaner/smarter way to do this.)
The original data has a unique index, and the code below does not change it for either DataFrame, so you can use concat to join the two pieces back together and then add them to the original with DataFrame.join, or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# renaming moved after dropna to avoid a length-mismatch error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems this can be simplified:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
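To get the two frames the question asks for, one more step could look like this (a sketch based on the simplified df above; note that 'NA' in this data is a literal string, not a missing value):
usa_mask = df['Country'] == 'USA'
USAdata = df[usa_mask].drop(columns='Residence')
# non-USA rows, excluding the row with no country information
nonUSAdata = (df[~usa_mask & (df['Country'] != 'NA')]
              .drop(columns=['Residence', 'County', 'City']))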
I would like to replicate in pandas the following SQL statement: UPDATE dataframe1 LEFT JOIN dataframe2 SET dataframe1.column1 = dataframe2.column2 WHERE dataframe1.column3 > X.
I know it is possible to merge the dataframes and then work on the merged columns with .where, but that doesn't seem to be a straightforward solution:
df = pd.merge(df1, df2, suffixes=('_a', '_b'))
df['clmn1'] = df['clmn1_a'].where(df['clmn1_a'] > 0, df['clmn1_b'])
Is there a better way to reach the goal?
Thanks
To use your example from the comments:
In [21]: df
Out[21]:
Name Gender country
0 Jack M USA
1 Nick M UK
2 Alphio F RU
3 Jenny F USA
In [22]: country_map = {'USA': 'United States', 'UK': 'United Kingdom', 'RU': 'Russia'}
In [23]: df.country.map(country_map)
Out[23]:
0 United States
1 United Kingdom
2 Russia
3 United States
Name: country, dtype: object
To update just the M rows, you could filter with a boolean mask and use update:
In [24]: df.country.update(df[df.Gender == 'M'].country.map(country_map))
In [25]: df
Out[25]:
Name Gender country
0 Jack M United States
1 Nick M United Kingdom
2 Alphio F RU
3 Jenny F USA
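If you want a merge-based equivalent that mirrors the SQL more literally, a sketch could look like this (the join key key, the names clmn1/clmn2/clmn3, and the threshold X are placeholders taken from the question's pseudocode; it assumes key is unique in df2, so the left merge preserves df1's row order and length, and that clmn2 exists only in df2):
import numpy as np

X = 0  # placeholder threshold
merged = df1.merge(df2, on='key', how='left', suffixes=('', '_b'))
# UPDATE df1 LEFT JOIN df2 SET df1.clmn1 = df2.clmn2 WHERE df1.clmn3 > X
df1['clmn1'] = np.where(merged['clmn3'] > X, merged['clmn2'], df1['clmn1'])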
I have a large dataframe; below is a sample part of it. I want to swap the Muscat and Shanghai values.
df =
City Score
Istanbul 6.0749
2.23607 Muscat
Prague 4.38576
1.85958 Shanghai
Istanbul 6.0749
Singapore 5.17054
Output:
df =
City Score
Istanbul 6.0749
Muscat 2.23607
Prague 4.38576
Shanghai 1.85958
Istanbul 6.0749
Singapore 5.17054
I am confused about how to apply the condition while iterating through the dataframe; also, is there any other alternative?
Use to_numeric with notna to build a boolean mask, then swap via loc:
m = pd.to_numeric(df['City'], errors='coerce').notna()
# for older versions of pandas use notnull:
#m = pd.to_numeric(df['City'], errors='coerce').notnull()
df.loc[m,['City','Score']] = df.loc[m,['Score','City']].values
print (df)
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054
You can use:
In [39]: mask = pd.to_numeric(df.Score, errors='coerce').isna()
In [40]: s = df.Score.copy()
In [41]: df.loc[mask, 'Score'] = df.City
In [42]: df.loc[mask, 'City'] = s
In [43]: df
Out[43]:
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054
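A third option is a fully vectorized swap with numpy, reusing the same mask idea (a sketch):
import numpy as np

# rows where City holds a number need City and Score swapped
m = pd.to_numeric(df['City'], errors='coerce').notna()
df['City'], df['Score'] = np.where(m, [df['Score'], df['City']],
                                      [df['City'], df['Score']])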