I have a dataframe with a column:
data['countries']
"[{'iso_3166_1': 'KR', 'name': 'South Korea'}]"
"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"
How can I extract ONLY the country names: 'South Korea', 'United States of America', etc.?
import json
import pandas as pd

# Parse each cell (a stringified list of dicts), skipping missing values;
# pd.notna works on strings, unlike np.isnan
countries = [json.loads(c.replace("'", '"')) for c in data['countries'] if pd.notna(c)]
country_names = [c[0]['name'] for c in countries]
And the output will be:
>>> ['South Korea', 'United States of America']
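If any of the names contain an apostrophe, the quote replacement above produces invalid JSON. A sketch using ast.literal_eval from the standard library avoids that, assuming data['countries'] holds the same stringified lists as in the question:

import ast
import pandas as pd

# literal_eval parses the Python-style string directly, no quote swapping needed
parsed = [ast.literal_eval(c) for c in data['countries'] if pd.notna(c)]
country_names = [row[0]['name'] for row in parsed if row]  # skip empty lists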
If you don't want to change your DataFrame but just want to parse the content of the strings it contains, you could use split.
>>> a = "[{'iso_3166_1': 'KR', 'name': 'South Korea'}]"
>>> a.split("'name': ")[1].split("'")[1]
'South Korea'
or:
def f(a):
    return a.split("'name': ")[1].split("'")[1]

countries = [f(a) for a in data['countries']]
This should work:
# eval parses each string into a Python list of dicts
# (ast.literal_eval is a safer alternative for untrusted data)
data['countries'] = data['countries'].apply(lambda x: eval(x))
data['countries'].apply(lambda x: x[0]['name'])
Output
0                 South Korea
1    United States of America
Name: countries, dtype: object

list(data['countries'].apply(lambda x: x[0]['name']))
Output
['South Korea', 'United States of America']
Related
I have a list of dictionaries; the whole list represents different countries, and each dictionary includes basic data about each country, as follows:
example a
df.countries[3]
"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'US', 'name': 'United States of America'}, {'iso_3166_1': 'IN', 'name': 'India'}]"
Of course, there are other cells where the countries list has only one dictionary, like this:
example b
df.countries[0]
"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"
Or an empty list, like this:
example c
df.countries[505]
'[]'
What I want to do is:
Delete rows where the country name is United States of America, BUT
ONLY WHEN it's the only country in the list, not when there are
other countries, as in example a.
I tried to brainstorm and came up with something like this:
countryToRemove = "United States of America"

for index, row in df.iterrows():
    if countryToRemove in row['countries']:
        # row to be removed
But it deletes any row with the US in it even if other countries were there.
Edit: My dataframe is as follows:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
If you have a dataframe like this:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
Then you can use boolean indexing to filter out your dataframe:
mask = df.countries.apply(
    lambda x: len(s := set(d["name"] for d in x)) == 1
    and s.pop() == "United States of America"
)
print(df[~mask])
Prints:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
2 []
EDIT: Version without the := operator:
def fn(x):
    s = set(d["name"] for d in x)
    return len(s) == 1 and s.pop() == "United States of America"

mask = df.countries.apply(fn)
print(df[~mask])
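A slightly shorter equivalent, offered as a sketch assuming the cells are already parsed lists: compare the set of names directly, so the length check and pop() are not needed.

# True only when the row's set of country names is exactly {"United States of America"};
# an empty list gives an empty set, which compares unequal, so those rows are kept
mask = df.countries.apply(lambda x: {d["name"] for d in x} == {"United States of America"})
print(df[~mask])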
I have a situation where I need to transform a dataframe like the one below.
The input dataframe is:
import pandas as pd

input_data = [
    ['Asia', 'China', 'Beijing'],
    ['Asia', 'China', 'Shenzhen'],
    ['America', 'United States', 'New York'],
    ['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
  continents      countries    cities
0       Asia          China   Beijing
1       Asia          China  Shenzhen
2    America  United States  New York
3    America         Canada   Toronto
The output data I want to get is
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
For this case, the input data has three levels, Continents -> Countries -> Cities, but what I ultimately want is to take a multi-level hierarchical dataframe (no matter how many levels it has), get output like the example above, and then put it into a PyQt5 column view.
pandas.Series.tolist() can convert Series values to a list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first one that occurred to me is to loop through the columns of the df.
def list_wrapper(alist, times):
    for _ in range(times):
        alist = [alist]
    return alist

columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
    if i == 0:
        print(input_df[columns_name[i]].unique().tolist())
    else:
        print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]]
              .apply(lambda x: list_wrapper(x.unique().tolist(), i - 1)).tolist())
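Note that the third printed list groups cities per (continent, country) pair, so it comes out as three sibling lists rather than nested per continent as in the desired output. For the exact nesting shown in the question, a sketch hardcoded to these three levels (filtering per parent value) could look like this:

continents = input_df['continents'].unique().tolist()
countries = [input_df.loc[input_df['continents'] == cont, 'countries'].unique().tolist()
             for cont in continents]
cities = [[input_df.loc[(input_df['continents'] == cont) &
                        (input_df['countries'] == ctry), 'cities'].unique().tolist()
           for ctry in ctrys]
          for cont, ctrys in zip(continents, countries)]

print(continents)  # ['Asia', 'America']
print(countries)   # [['China'], ['United States', 'Canada']]
print(cities)      # [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]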
I have a dataframe of environmental data for each country in the world. I want to remove any entries that do not represent individual countries, e.g. 'Africa' or 'World'. I have made a list of those values. I am trying to loop through the df and drop each row where the country equals a value in my list. There aren't that many problem entries; I have removed them before with .loc, but I'm unsure why this function is not working. I get an error: KeyError: '[(bunch of numbers)] not found in axis'
not_country = ['Africa', 'Asia', 'Asia (excl. China & India)','EU-27','EU-28', 'Europe','Europe (excl. EU-27)',
'Europe (excl. EU-28)', 'International transport', 'Kuwaiti Oil Fires', 'North America',
'North America (excl. USA)', 'World', 'South America']
def clean_countries(df, lst):
    index_names = []
    for country_name in lst:
        index_names.append(df[df['country'] == country_name].index)
    for i in df:
        df.drop(index_names, inplace=True)

clean_co2_df = clean_countries(co2_df, not_country)
One of the advantages of a dataframe is that you seldom have to iterate through it to get the job done. There are usually more efficient ways. Here is a solution to your question using a sample dataframe with world population data.
not_country = ['Africa', 'Asia', 'Asia (excl. China & India)','EU-27','EU-28', 'Europe','Europe (excl. EU-27)',
'Europe (excl. EU-28)', 'International transport', 'Kuwaiti Oil Fires', 'North America',
'North America (excl. USA)', 'World', 'South America']
pop_data = {'Country': {0: 'China', 1: 'India', 2: 'USA', 3: 'Asia'}, 'Population': {0: 1439000000, 1: 1380004385, 2: 331002651, 3: 4641054775}, 'Yearly Change %': {0: 0.39, 1: 0.99, 2: 0.59, 3: 0.86}}
df = pd.DataFrame(pop_data)
print(f'BEFORE: \n {df}')
df = df.loc[df['Country'].apply(lambda x: x not in not_country)]
print(f'AFTER: \n {df}')
#output:
BEFORE:
Country Population Yearly Change %
0 China 1439000000 0.39
1 India 1380004385 0.99
2 USA 331002651 0.59
3 Asia 4641054775 0.86
AFTER:
Country Population Yearly Change %
0 China 1439000000 0.39
1 India 1380004385 0.99
2 USA 331002651 0.59
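An equivalent filter, offered as a small variation, uses Series.isin for the membership test:

# Keep only the rows whose Country is not in the exclusion list
df = df[~df['Country'].isin(not_country)]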
I have the following dictionary:
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
I have binned the countries in this dictionary (keys) into continents (values).
from collections import defaultdict
dictionary = defaultdict(list)
for key, value in ContinentDict.items():
dictionary[value].append(key)
This has given me:
dictionary
defaultdict(<class 'list'>, {'Asia': ['China', 'Japan', 'India', 'South Korea', 'Iran'], 'North America': ['United States', 'Canada'], 'Europe': ['United Kingdom', 'Russian Federation', 'Germany', 'France', 'Italy', 'Spain'], 'Australia': ['Australia'], 'South America': ['Brazil']})
I also have the Pandas series Reducedset['estimate']:
Country
China 1.36765e+09
United States 3.17615e+08
Japan 1.27409e+08
United Kingdom 6.3871e+07
Russian Federation 1.435e+08
Canada 3.52399e+07
Germany 8.03697e+07
India 1.27673e+09
France 6.38373e+07
South Korea 4.98054e+07
Italy 5.99083e+07
Spain 4.64434e+07
Iran 7.70756e+07
Australia 2.3316e+07
Brazil 2.05915e+08
Name: estimate, dtype: object
I would like to create a hierarchical index from this dictionary, with the continent as the top of the hierarchy followed by the country.
I have tried the following:
totuple = dictionary.items()
index = pd.MultiIndex.from_tuples(index)
hierarchy = pop.reindex(index)
However, this has not worked.
Would anybody be able to give me a helping hand?
Create a list of tuples and pass it to MultiIndex.from_tuples:
t = [(k, x) for k, v in dictionary.items() for x in v]
index = pd.MultiIndex.from_tuples(t)
print (index)
MultiIndex([( 'Asia', 'China'),
( 'Asia', 'Japan'),
( 'Asia', 'India'),
( 'Asia', 'South Korea'),
( 'Asia', 'Iran'),
('North America', 'United States'),
('North America', 'Canada'),
( 'Europe', 'United Kingdom'),
( 'Europe', 'Russian Federation'),
( 'Europe', 'Germany'),
( 'Europe', 'France'),
( 'Europe', 'Italy'),
( 'Europe', 'Spain'),
( 'Australia', 'Australia'),
('South America', 'Brazil')],
)
And then broadcast the existing values onto the new index by matching the country labels against its second level (level=1):
Reducedset = Reducedset.reindex(index, level=1)
print (Reducedset)
estimate
Asia China 1.367650e+09
Japan 1.274090e+08
India 1.276730e+09
South Korea 4.980540e+07
Iran 7.707560e+07
North America United States 3.176150e+08
Canada 3.523990e+07
Europe United Kingdom 6.387100e+07
Russian Federation 1.435000e+08
Germany 8.036970e+07
France 6.383730e+07
Italy 5.990830e+07
Spain 4.644340e+07
Australia Australia 2.331600e+07
South America Brazil 2.059150e+08
Another idea is to use map with the original dictionary:
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
d = {'estimate': {'China': 1367650000.0, 'United States': 317615000.0, 'Japan': 127409000.0, 'United Kingdom': 63871000.0, 'Russian Federation': 143500000.0, 'Canada': 35239900.0, 'Germany': 80369700.0, 'India': 1276730000.0, 'France': 63837300.0, 'South Korea': 49805400.0, 'Italy': 59908300.0, 'Spain': 46443400.0, 'Iran': 77075600.0, 'Australia': 23316000.0, 'Brazil': 205915000.0}}
Reducedset = pd.DataFrame(d)
idx = Reducedset.index.map(ContinentDict)
Reducedset.index = [idx, Reducedset.index]
Reducedset = Reducedset.sort_index()
print (Reducedset)
estimate
Asia China 1.367650e+09
India 1.276730e+09
Iran 7.707560e+07
Japan 1.274090e+08
South Korea 4.980540e+07
Australia Australia 2.331600e+07
Europe France 6.383730e+07
Germany 8.036970e+07
Italy 5.990830e+07
Russian Federation 1.435000e+08
Spain 4.644340e+07
United Kingdom 6.387100e+07
North America Canada 3.523990e+07
United States 3.176150e+08
South America Brazil 2.059150e+08
def replace_name(row):
    if row['Country Name'] == 'Korea, Rep.':
        row['Country Name'] = 'South Korea'
    if row['Country Name'] == 'Iran, Islamic Rep.':
        row['Country Name'] = 'Iran'
    if row['Country Name'] == 'Hong Kong SAR, China':
        row['Country Name'] = 'Hong Kong'
    return row

GDP.apply(replace_name, axis=1)
GDP is a pd.DataFrame.
This time, when I look for 'South Korea', it doesn't work; the name is still 'Korea, Rep.'.
But if I change the last line of the code to this:
GDP = GDP.apply(replace_name, axis=1)
it works.
At first, I thought the reason was that the apply function can't change GDP itself, but when I dealt with another dataframe, it actually worked. The code is below:
def change_name(row):
    if row['Country'] == 'Republic of Korea':
        row['Country'] = 'South Korea'
    if row['Country'] == 'United States of America':
        row['Country'] = 'United States'
    if row['Country'] == 'United Kingdom of Great Britain and Northern Ireland':
        row['Country'] = 'United Kingdom'
    if row['Country'] == 'China, Hong Kong Special Administrative Region':
        row['Country'] = 'Hong Kong'
    return row

energy.apply(change_name, axis=1)
energy is also a pd.DataFrame.
This time, when I search for 'United States', it works. The original name was 'United States of America', so the name is changed successfully.
The only difference between energy and GDP is that energy is read from an Excel file and GDP is read from a CSV file. So what causes the different results?
I think it is better to use replace:
d = {'Korea, Rep.':'South Korea', 'Iran, Islamic Rep.':'Iran',
'Hong Kong SAR, China':'Hong Kong'}
GDP['Country Name'] = GDP['Country Name'].replace(d, regex=True)
As for the difference: there may be some whitespace in the data, so this may help:
GDP['Country'] = GDP['Country'].str.strip()
Sample:
GDP = pd.DataFrame({'Country Name':[' Korea, Rep. ','a','Iran, Islamic Rep.','United States of America','s','United Kingdom of Great Britain and Northern Ireland'],
'Country': ['s','Hong Kong SAR, China','United States of America','Hong Kong SAR, China','s','f']})
#print (GDP)
d = {'Korea, Rep.':'South Korea', 'Iran, Islamic Rep.':'Iran',
'United Kingdom of Great Britain and Northern Ireland':'United Kingdom',
'Hong Kong SAR, China':'Hong Kong', 'United States of America':'United States'}
#replace by columns
#GDP['Country Name'] = GDP['Country Name'].replace(d, regex=True)
#GDP['Country'] = GDP['Country'].replace(d, regex=True)
#replace multiple columns
GDP[['Country Name','Country']] = GDP[['Country Name','Country']].replace(d, regex=True)
print (GDP)
         Country    Country Name
0              s     South Korea
1      Hong Kong               a
2  United States            Iran
3      Hong Kong   United States
4              s               s
5              f  United Kingdom
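A small sketch of the alternative hinted at above: without regex=True, Series.replace only swaps exact full-cell matches, so stripping stray whitespace first achieves the same result for whole-value names.

# Exact-match replacement after stripping whitespace:
# ' Korea, Rep. ' -> 'Korea, Rep.' -> 'South Korea'
GDP['Country Name'] = GDP['Country Name'].str.strip().replace(d)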