I'm trying to convert the data I pull in from an external API. So far, my dataframe looks like this:
Country Date Team Rating
United Kingdom 11/8/2019 Team A 95
United Kingdom 2/20/2019 Team B 90
United Kingdom 9/22/2017 Team A 90
United Kingdom 6/28/2016 Team B 90
United Kingdom 6/27/2016 Team C 90
United Kingdom 6/24/2016 Team A 95
United Kingdom 6/12/2015 Team C 100
United Kingdom 6/13/2014 Team C 100
United Kingdom 4/19/2013 Team B 95
United Kingdom 2/22/2013 Team A 95
United Kingdom 12/13/2012 Team C 100
United Kingdom 3/14/2012 Team B 100
United Kingdom 2/13/2012 Team A 100
United Kingdom 10/26/2010 Team C 100
United Kingdom 5/21/2009 Team C 100
United Kingdom 9/21/2000 Team B 100
United Kingdom 9/21/2000 Team B 100
United Kingdom 8/10/1994 Team B 100
United Kingdom 6/26/1989 Team C 100
United Kingdom 4/28/1978 Team C 100
United Kingdom 3/31/1978 Team A 100
I would like it to look like this but I'm struggling to figure out how (I'm still new to dataframes):
Country Date Team A Team B Team C
United Kingdom 11/8/2019 95 90 90
United Kingdom 2/20/2019 90 90 90
United Kingdom 9/22/2017 90 90 90
United Kingdom 6/28/2016 95 90 90
United Kingdom 6/27/2016 95 95 90
United Kingdom 6/24/2016 95 95 100
United Kingdom 6/12/2015 95 95 100
United Kingdom 6/13/2014 95 95 100
United Kingdom 4/19/2013 95 95 100
United Kingdom 2/22/2013 95 100 100
United Kingdom 12/13/2012 100 100 100
United Kingdom 3/14/2012 100 100 100
United Kingdom 2/13/2012 100 100 100
United Kingdom 10/26/2010 100 100 100
United Kingdom 5/21/2009 100 100 100
United Kingdom 9/21/2000 100 100 100
United Kingdom 9/21/2000 100 100 100
United Kingdom 8/10/1994 100 100 100
United Kingdom 6/26/1989 100 100 100
United Kingdom 4/28/1978 100 100 100
United Kingdom 3/31/1978 100 100 100
So essentially I want the country and date columns to remain the same, however as opposed to having just one team per row, I'd like all teams to appear as columns. Instead of having blank values, I would like their previous value used when not updated.
For example, for 11/8/2019, you can see in my original df that only Team A's rating changes. For the Team B and team C column, I'd like them to use their previous value if it isn't updated.
Does anyone have any suggestions?
First of all, if you need to sort over datetimes, I would suggest either using the YYYYMMDD string representation of dates (e.g. 20191108 for the first record) or using actual datetime data types. The American month/day/year notation is confusing and not easy to sort on.
In any case, to solve your issue I would advise using pandas' pivot function first, then filling the NaN values with a backfill (bfill).
EDIT: If you want to keep the Country column, it seems that using it as a multi-index with the Date column won't work with pivot. What you can do is to keep the original df and join it with the new one on the Date column.
import pandas as pd

# Create DataFrame similar to example
df = pd.DataFrame(data={'Date': ['11/8/2019','2/20/2019','9/22/2017','6/28/2016','6/27/2016','6/24/2016','6/12/2015','6/13/2014'],
                        'Team': ['Team A','Team B','Team A','Team B','Team C','Team A','Team C','Team C'],
                        'Rating': [95,90,90,90,90,95,100,100]})
# Convert strings to datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Country'] = 'United Kingdom'
# Pivot DataFrame
dfp = df.pivot(columns='Team', values='Rating')
# Join with Country from original df
dfp = df[['Date', 'Country']].join(dfp)
# sort descending on Date
dfp.sort_values(by='Date', ascending=False, inplace=True)
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 NaN NaN
# 2019-02-20 United Kingdom NaN 90.0 NaN
# 2017-09-22 United Kingdom 90.0 NaN NaN
# ...
# Fill NaN values using the "next" (chronologically older) row value
dfp = dfp.bfill()
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 90.0 90.0
# 2019-02-20 United Kingdom 90.0 90.0 90.0
# 2017-09-22 United Kingdom 90.0 90.0 90.0
# ...
Basically, what you need is:
data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating').reset_index()\
    .sort_values(['Country', 'Date'], ascending=False).bfill()
It creates a pivot table, sorts the values in the descending date order you have, and back-fills each missing cell with the next available (chronologically previous) value.
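As a minimal runnable sketch of that one-liner (a smaller frame with a subset of the rows from the question):

```python
import pandas as pd

# Toy version of the question's data (subset of rows)
data = pd.DataFrame({
    'Country': ['United Kingdom'] * 4,
    'Date': pd.to_datetime(['11/8/2019', '2/20/2019', '9/22/2017', '6/28/2016'],
                           format='%m/%d/%Y'),
    'Team': ['Team A', 'Team B', 'Team A', 'Team B'],
    'Rating': [95, 90, 90, 90],
})

wide = (data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating')
            .reset_index()
            .sort_values(['Country', 'Date'], ascending=False)
            .bfill())        # fill each gap with the next (older) rating
print(wide)
```

Note that the oldest rows can still hold NaN if a team has no earlier rating to pull from.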
Related
We have a data which has column name "birth_country"
I executed the following code:
import pandas as pd
df=pd.read_csv("data.csv")
df['birth_country'].value_counts()[:5]
output:
United States of America 259
United Kingdom 85
Germany 61
France 51
Sweden 29
I want my output to look like:
United States of America
United Kingdom
Germany
France
Sweden
How do I do it? For example,
df['birth_country'].value_counts().idxmax()
gives output:
United States of America
For a Series built from the index values, use:
pd.Series(df['birth_country'].value_counts()[:5].index)
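A quick sketch with made-up counts to show the idea:

```python
import pandas as pd

df = pd.DataFrame({'birth_country': ['United States of America'] * 3
                                    + ['United Kingdom'] * 2
                                    + ['Germany']})

# value_counts() puts the names in the index and the counts in the values;
# wrapping the index in a Series keeps only the names
top = pd.Series(df['birth_country'].value_counts()[:5].index)
print(top.tolist())
```

Equivalently, `df['birth_country'].value_counts()[:5].index.tolist()` gives a plain Python list of the names.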
This is my data frame:
City        sales
San Diego     500
Texas         400
Nebraska      300
Macau         200
Rome          100
London         50
Manchester     70
I want to add the country at the end, which will look like this:
City        sales  Country
San Diego     500  US
Texas         400  US
Nebraska      300  US
Macau         200  Hong Kong
Rome          100  Italy
London         50  England
Manchester     70  England
The countries are stored in below dictionary
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because the values are a mix of lists and strings, and strings are technically iterable, so telling them apart is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # If the value is a non-string iterable (e.g. a list), map each item
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
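For completeness, the whole thing end to end on a trimmed-down frame (subset of the cities and dict above):

```python
import pandas as pd

country = {'US': ['San Diego', 'Texas'], 'Hong Kong': 'Macau', 'Italy': 'Rome'}

def flatten_dict(d):
    # Invert {country: cities} to {city: country}, handling list or scalar values
    nd = {}
    for k, v in d.items():
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd

df = pd.DataFrame({'City': ['San Diego', 'Macau', 'Rome'], 'sales': [500, 200, 100]})
df['Country'] = df['City'].map(flatten_dict(country))
print(df)
```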
You can implement this using geopy
You can install geopy with pip install geopy
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to provide a user-agent name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
I have a data set of trade statistics. The data is in the following form:
reporter partner Time Period export import
0 Argentina United States 1990M2 1.304801e+08 5.984441e+07
1 Argentina United States 1990M3 1.237417e+08 5.092350e+07
2 Argentina United States 1990M4 1.020971e+08 4.884196e+07
3 Argentina United States 1990M5 1.569232e+08 5.583000e+07
4 Argentina United States 1990M6 1.539624e+08 6.869098e+07
5 Argentina United States 1990M7 1.491639e+08 6.207464e+07
6 Argentina United States 1990M8 1.675413e+08 8.482295e+07
7 Argentina United States 1990M9 1.459988e+08 7.731452e+07
8 Argentina United States 1990M10 1.613134e+08 1.061588e+08
9 Argentina United States 1990M11 1.392604e+08 9.931942e+07
10 Argentina United States 1990M12 1.266004e+08 1.003602e+08
11 Argentina United States 1991M1 1.183864e+08 8.458743e+07
12 Argentina United States 1991M2 1.107058e+08 7.544877e+07
13 Argentina United States 1991M3 1.034667e+08 7.632608e+07
14 Argentina United States 1991M4 1.078808e+08 9.906306e+07
and so on.
The "Time Period" variable is of dtype object. I want to change the format of the "Time Period" variable so that I get February 1990 instead of 1990M2, March 1990 instead of 1990M3, etc.
Convert to datetime64 dtype using pd.to_datetime with a specified format and extract the year using the dt accessor. Ex:
df['year'] = pd.to_datetime(df['Time Period'], format='%YM%m').dt.year
If you wish, you could also extract the month via dt.month. docs.
Alternatively, you could split the strings on 'M' and cast to two separate columns of dtype int, e.g.
df[['Y','M']] = df["Time Period"].str.split("M", expand=True).astype(int)
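Since the goal is a "February 1990" style label rather than separate year/month columns, `dt.strftime` can produce it directly. A sketch using the column name from the question (note that `%B`, the full month name, is locale-dependent):

```python
import pandas as pd

df = pd.DataFrame({'Time Period': ['1990M2', '1990M3', '1991M1']})

# Parse '1990M2' as a datetime, then format it as 'February 1990'
df['Time Period'] = (pd.to_datetime(df['Time Period'], format='%YM%m')
                       .dt.strftime('%B %Y'))
print(df['Time Period'].tolist())
```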
I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China 1
Russian Federation 1
Canada 1
Germany 1
Italy 1
Spain 1
Brazil 1
South Korea 2.27935
Iran 5.70772
Japan 10.2328
United Kingdom 10.6005
United States 11.571
Australia 11.8108
India 14.9691
France 17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China 1
United States 11.571
Japan 10.2328
United Kingdom 10.6005
Russian Federation 1
Canada 1
Germany 1
India 14.9691
France 17.0203
South Korea 2.27935
Italy 1
Spain 1
Iran 5.70772
Australia 11.8108
Brazil 1
Name: HighRenew, dtype: object
Why is this giving me a different output to the one above? I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
or
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).reset_index(drop=True)
When you call sort_values, the index labels travel with the values, so the assignment aligns per the index!
Thank you to anky for providing me with this fantastic solution!
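A three-row sketch of that alignment behaviour (made-up labels and values):

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': s})

sorted_s = s.sort_values()       # order is now b, c, a -- labels travel with values

# Assigning a Series aligns on the index, so the column comes back unchanged
df['x'] = sorted_s
aligned = df['x'].tolist()       # same order as before the sort

# Stripping the index (to_numpy) assigns positionally, so the sorted order sticks
df['x'] = sorted_s.to_numpy()
positional = df['x'].tolist()
print(aligned, positional)
```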
             country_name  country_code  val_code      y191      y192      y193      y194      y195
 United States of America           231         1  47052179  43361966  42736682  43196916  41751928
 United States of America           231         2   1187385   1201557   1172941   1176366   1192173
 United States of America           231         3  28211467  27668273  29742374  27543836  28104317
 United States of America           231         4    179000    193000    233338    276639    249688
 United States of America           231         5  12613922  12864425  13240395  14106139  15642337
In the data frame above, I would like to compute, for each row, the percentage of the total occupied by that val_code, resulting in the following data frame. I.e. sum up each row and divide by the total of all rows:
             country_name  country_code  val_code         perc
 United States of America           231         1  50.14947129
 United States of America           231         2  1.363631254
 United States of America           231         3  32.48344744
 United States of America           231         4  0.260213146
 United States of America           231         5  15.74323688
Right now, I am doing this, but it is not working:
grp_df = df.groupby(['country_name', 'val_code']).agg()
pct_df = grp_df.groupby(level=0).apply(lambda x: 100*x/float(x.sum()))
You can get the percentages of each column using a lambda function as follows:
>>> df.iloc[:, 3:].apply(lambda x: x / x.sum())
y191 y192 y193 y194 y195
0 0.527231 0.508411 0.490517 0.500544 0.480236
1 0.013305 0.014088 0.013463 0.013631 0.013713
2 0.316116 0.324405 0.341373 0.319164 0.323259
3 0.002006 0.002263 0.002678 0.003206 0.002872
4 0.141342 0.150833 0.151969 0.163455 0.179920
Your example does not have any duplicate values for val_code, so I'm unsure how you want your data to appear (i.e. percent of total per column vs. a total for each val_code group).
Get the total for all the columns of interest and then add the percentage column:
In [35]:
total = np.sum(df.loc[:, 'y191':].values)
df['percent'] = df.loc[:, 'y191':].sum(axis=1)/total * 100
df
Out[35]:
country_name country_code val_code y191 y192 \
0 United States of America 231 1 47052179 43361966
1 United States of America 231 2 1187385 1201557
2 United States of America 231 3 28211467 27668273
3 United States of America 231 4 179000 193000
4 United States of America 231 5 12613922 12864425
y193 y194 y195 percent
0 42736682 43196916 41751928 50.149471
1 1172941 1176366 1192173 1.363631
2 29742374 27543836 28104317 32.483447
3 233338 276639 249688 0.260213
4 13240395 14106139 15642337 15.743237
So np.sum will sum all the values:
In [32]:
total = np.sum(df.loc[:,'y191':].values)
total
Out[32]:
434899243
We then call .sum(axis=1)/total * 100 on the cols of interest to sum row-wise, divide by the total and multiply by 100 to get a percentage.
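The same computation as a compact runnable sketch (two of the year columns from the question, and `.loc` in place of the long-removed `.ix`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'val_code': [1, 2, 3, 4, 5],
    'y191': [47052179, 1187385, 28211467, 179000, 12613922],
    'y192': [43361966, 1201557, 27668273, 193000, 12864425],
})

# Grand total over all the year columns
total = np.sum(df.loc[:, 'y191':'y192'].values)

# Row-wise sum as a percentage of the grand total
df['percent'] = df.loc[:, 'y191':'y192'].sum(axis=1) / total * 100
print(df['percent'].round(2).tolist())
```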