How to convert Object YYYYM1 YYYYM2 to Month and Year - python

I have a data set of trade statistics. The data is in the following form:
reporter partner Time Period export import
0 Argentina United States 1990M2 1.304801e+08 5.984441e+07
1 Argentina United States 1990M3 1.237417e+08 5.092350e+07
2 Argentina United States 1990M4 1.020971e+08 4.884196e+07
3 Argentina United States 1990M5 1.569232e+08 5.583000e+07
4 Argentina United States 1990M6 1.539624e+08 6.869098e+07
5 Argentina United States 1990M7 1.491639e+08 6.207464e+07
6 Argentina United States 1990M8 1.675413e+08 8.482295e+07
7 Argentina United States 1990M9 1.459988e+08 7.731452e+07
8 Argentina United States 1990M10 1.613134e+08 1.061588e+08
9 Argentina United States 1990M11 1.392604e+08 9.931942e+07
10 Argentina United States 1990M12 1.266004e+08 1.003602e+08
11 Argentina United States 1991M1 1.183864e+08 8.458743e+07
12 Argentina United States 1991M2 1.107058e+08 7.544877e+07
13 Argentina United States 1991M3 1.034667e+08 7.632608e+07
14 Argentina United States 1991M4 1.078808e+08 9.906306e+07
and so on.
The "Time Period" variable is Dtype object. I want to change the format of the "Time Period" variable so that I get February 1990 instead of 1990M2, March 1990 instead of 1990M2, etc.

Convert to datetime64 dtype using pd.to_datetime with a specified format and extract the year using the dt accessor. Ex:
df['year'] = pd.to_datetime(df['Time Period'], format='%YM%m').dt.year
If you wish, you could also extract the month via dt.month (see the pandas to_datetime docs).
Alternatively, you could split the strings on 'M' and cast to two separate columns of dtype int, e.g.
df[['Y','M']] = df["Time Period"].str.split("M", expand=True).astype(int)
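Putting the pieces together, here is a minimal sketch (using a toy frame with the question's column name) that produces the "February 1990" style strings directly via dt.strftime:

```python
import pandas as pd

# Toy frame mimicking the question's "Time Period" column
df = pd.DataFrame({'Time Period': ['1990M2', '1990M10', '1991M1']})

# Parse the YYYYM<m> strings into datetime64, then format as "Month Year"
dates = pd.to_datetime(df['Time Period'], format='%YM%m')
df['Time Period'] = dates.dt.strftime('%B %Y')

print(df['Time Period'].tolist())
# ['February 1990', 'October 1990', 'January 1991']
```

Note that %m accepts both single- and double-digit months here, so 1990M2 and 1990M10 both parse correctly.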

Related

Finding the names of the 5 most frequent values in a column in pandas

We have data with a column named "birth_country".
I executed the following code:
import pandas as pd
df=pd.read_csv("data.csv")
df['birth_country'].value_counts()[:5]
output:
United States of America 259
United Kingdom 85
Germany 61
France 51
Sweden 29
I want my output to look like:
United States of America
United Kingdom
Germany
France
Sweden
How do I do that? For example,
df['birth_country'].value_counts().idxmax()
gives output:
United States of America
To get a Series of the top-5 index values, use:
pd.Series(df['birth_country'].value_counts()[:5].index)
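A quick runnable sketch of the idea, using a tiny made-up column in place of the question's data:

```python
import pandas as pd

# Hypothetical data standing in for the question's "birth_country" column
df = pd.DataFrame({'birth_country': ['US', 'US', 'US', 'UK', 'UK', 'Germany']})

# value_counts() gives counts; its .index holds just the names
top = df['birth_country'].value_counts()[:3]
print(top.index.tolist())
# ['US', 'UK', 'Germany']

# Wrapped back into a Series, as in the answer above
names = pd.Series(top.index)
print(names.tolist())
# ['US', 'UK', 'Germany']
```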

Sorting values in a pandas series in ascending order not working when re-assigned

I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China 1
Russian Federation 1
Canada 1
Germany 1
Italy 1
Spain 1
Brazil 1
South Korea 2.27935
Iran 5.70772
Japan 10.2328
United Kingdom 10.6005
United States 11.571
Australia 11.8108
India 14.9691
France 17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China 1
United States 11.571
Japan 10.2328
United Kingdom 10.6005
Russian Federation 1
Canada 1
Germany 1
India 14.9691
France 17.0203
South Korea 2.27935
Italy 1
Spain 1
Iran 5.70772
Australia 11.8108
Brazil 1
Name: HighRenew, dtype: object
Why is this giving me a different output to the one above?
I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
or
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).reset_index(drop=True)
When you call sort_values, the index labels travel with the sorted values, so assigning the result back into the dataframe aligns on the index and restores the original order!
Thank you to anky for providing me with this fantastic solution!
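The alignment pitfall is easy to reproduce in a small sketch: assigning a Series back into a DataFrame matches rows by index label, not by position, so the sorted order is silently undone unless you strip the index first.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0], index=['a', 'b', 'c'], name='x')
df = s.to_frame()

# Index-aligned assignment: each value goes back to its original label
df['x'] = df['x'].sort_values()
print(df['x'].tolist())
# [3.0, 1.0, 2.0]  -- unchanged!

# Dropping the index (to_numpy) assigns by position instead
df['x'] = df['x'].sort_values().to_numpy()
print(df['x'].tolist())
# [1.0, 2.0, 3.0]
```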

How to transform dataframe columns

I'm trying to convert the data I pull in from an external API. So far, my dataframe looks like this:
Country Date Team Rating
United Kingdom 11/8/2019 Team A 95
United Kingdom 2/20/2019 Team B 90
United Kingdom 9/22/2017 Team A 90
United Kingdom 6/28/2016 Team B 90
United Kingdom 6/27/2016 Team C 90
United Kingdom 6/24/2016 Team A 95
United Kingdom 6/12/2015 Team C 100
United Kingdom 6/13/2014 Team C 100
United Kingdom 4/19/2013 Team B 95
United Kingdom 2/22/2013 Team A 95
United Kingdom 12/13/2012 Team C 100
United Kingdom 3/14/2012 Team B 100
United Kingdom 2/13/2012 Team A 100
United Kingdom 10/26/2010 Team C 100
United Kingdom 5/21/2009 Team C 100
United Kingdom 9/21/2000 Team B 100
United Kingdom 9/21/2000 Team B 100
United Kingdom 8/10/1994 Team B 100
United Kingdom 6/26/1989 Team C 100
United Kingdom 4/28/1978 Team C 100
United Kingdom 3/31/1978 Team A 100
I would like it to look like this but I'm struggling to figure out how (I'm still new to dataframes):
Country Date Team A Team B Team C
United Kingdom 11/8/2019 95 90 90
United Kingdom 2/20/2019 90 90 90
United Kingdom 9/22/2017 90 90 90
United Kingdom 6/28/2016 95 90 90
United Kingdom 6/27/2016 95 95 90
United Kingdom 6/24/2016 95 95 100
United Kingdom 6/12/2015 95 95 100
United Kingdom 6/13/2014 95 95 100
United Kingdom 4/19/2013 95 95 100
United Kingdom 2/22/2013 95 100 100
United Kingdom 12/13/2012 100 100 100
United Kingdom 3/14/2012 100 100 100
United Kingdom 2/13/2012 100 100 100
United Kingdom 10/26/2010 100 100 100
United Kingdom 5/21/2009 100 100 100
United Kingdom 9/21/2000 100 100 100
United Kingdom 9/21/2000 100 100 100
United Kingdom 8/10/1994 100 100 100
United Kingdom 6/26/1989 100 100 100
United Kingdom 4/28/1978 100 100 100
United Kingdom 3/31/1978 100 100 100
So essentially I want the Country and Date columns to remain the same, but instead of having just one team per row, I'd like all teams to appear as columns. Instead of blank values, I would like each team's previous value carried over when it isn't updated.
For example, for 11/8/2019, you can see in my original df that only Team A's rating changes. For the Team B and Team C columns, I'd like them to use their previous value since it isn't updated.
Does anyone have any suggestions?
First of all, if you need to sort over datetimes, I would suggest either using the YYYYMMDD string representation of dates (e.g. 20191108 for the first record) or using actual datetime dtypes. The American month/day/year notation is confusing and does not sort correctly as text.
In any case, to solve your issue I would advise using pandas' pivot function first, followed by filling the NaN values with a backfill (bfill).
EDIT: If you want to keep the Country column, it seems that using it as a multi-index with the Date column won't work with pivot. What you can do is to keep the original df and join it with the new one on the Date column.
import pandas as pd
# Create a DataFrame similar to the example
df = pd.DataFrame(data={'Date': ['11/8/2019','2/20/2019','9/22/2017','6/28/2016','6/27/2016','6/24/2016','6/12/2015','6/13/2014'],
                        'Team': ['Team A','Team B','Team A','Team B','Team C','Team A','Team C','Team C'],
                        'Rating': [95,90,90,90,90,95,100,100]})
# Convert strings to datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Country'] = 'United Kingdom'
# Pivot DataFrame
dfp = df.pivot(columns='Team', values='Rating')
# Join with Country from original df
dfp = df[['Date', 'Country']].join(dfp)
# sort descending on Date
dfp.sort_values(by='Date', ascending=False, inplace=True)
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 NaN NaN
# 2019-02-20 United Kingdom NaN 90.0 NaN
# 2017-09-22 United Kingdom 90.0 NaN NaN
# ...
# Fill NaN values using the "next" row value
dfp = dfp.bfill()
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 90.0 90.0
# 2019-02-20 United Kingdom 90.0 90.0 90.0
# 2017-09-22 United Kingdom 90.0 90.0 90.0
# ...
Basically, what you need is:
data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating').reset_index()\
    .sort_values(['Country', 'Date'], ascending=False).bfill()
It creates a pivot table, sorts the rows into the descending order you have, and back-fills each missing cell from the next row down.
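A self-contained sketch of the pivot_table route on a few made-up rows (team names as in the question); note that the oldest rows may keep NaN for a team that has no earlier value to pull:

```python
import pandas as pd

data = pd.DataFrame({
    'Country': ['United Kingdom'] * 5,
    'Date': pd.to_datetime(['2019-11-08', '2019-02-20', '2017-09-22',
                            '2016-06-28', '2016-06-27']),
    'Team': ['Team A', 'Team B', 'Team A', 'Team B', 'Team C'],
    'Rating': [95, 90, 90, 90, 90],
})

# One column per team, newest dates first, gaps back-filled from the next row
wide = (data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating')
            .reset_index()
            .sort_values(['Country', 'Date'], ascending=False)
            .bfill())

print(wide.iloc[0][['Team A', 'Team B', 'Team C']].tolist())
# [95.0, 90.0, 90.0]
```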

Rounding and sorting dataframe with pandas

https://github.com/haosmark/jupyter_notebooks/blob/master/Coursera%20week%203%20assignment.ipynb
All the way at the bottom of the notebook, in question 3, I'm trying to average, round, and sort the data; however, for some reason the rounding and sorting aren't working at all:
i = df.columns.get_loc('2006')
avgGDP = df[df.columns[i:]].copy()
avgGDP = avgGDP.mean(axis=1).round(2).sort_values(ascending=False)
avgGDP
What am I doing wrong here?
This is what df looks like before I apply average, round, and sort.
Your series is actually sorted, the first line being 1.5e+13 and the last one 4.4e+11:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660648e+12
Russian Federation 1.565460e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106714e+12
Iran 4.441558e+11
Rounding doesn't do anything visible here because the smallest value is 4e+11, and rounding it to 2 decimal places doesn't show on this scale. If you want to keep only 2 decimal places in the scientific notation, you can use .map('{:0.2e}'.format), see my note below.
Note: just for fun, you could also calculate the same with a one-liner:
df.filter(regex='^2').mean(axis=1).sort_values(ascending=False).map('{:0.2e}'.format)
Output:
Country
United States 1.54e+13
China 6.35e+12
Japan 5.54e+12
Germany 3.49e+12
France 2.68e+12
United Kingdom 2.49e+12
Brazil 2.19e+12
Italy 2.12e+12
India 1.77e+12
Canada 1.66e+12
Russian Federation 1.57e+12
Spain 1.42e+12
Australia 1.16e+12
South Korea 1.11e+12
Iran 4.44e+11
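The point about scale is easy to demonstrate on two of the values above: rounding to 2 decimal places is a no-op at magnitude 1e+11, while string formatting controls the displayed precision:

```python
import pandas as pd

s = pd.Series([1.536434e+13, 4.441558e+11])

# round(2) operates on decimal places, invisible at this magnitude
print(s.round(2).equals(s))
# True -- nothing changed

# Formatting the strings controls the displayed precision instead
print(s.map('{:0.2e}'.format).tolist())
# ['1.54e+13', '4.44e+11']
```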

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can compute the proportions into a Series and filter on it, e.g.:
props = wine.country.value_counts(normalize=True)
props = props[props >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
# Boolean mask of countries above the 1% threshold
country_filter = wine.country.value_counts(normalize=True) > 0.01
# Index of the countries where the mask is True
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(country_index)]
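The same idea in a runnable sketch, with a toy 'country' column standing in for the wine data (country names and counts are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the wine dataframe: 'US' dominates, 'Hungary' is rare (1%)
wine = pd.DataFrame({'country': ['US'] * 95 + ['France'] * 4 + ['Hungary']})

# Keep only countries making up more than 1% of the rows
frequent = wine['country'].value_counts(normalize=True) > 0.01
keep = frequent[frequent].index          # a boolean mask can index itself
wine = wine[wine['country'].isin(keep)]

print(sorted(wine['country'].unique()))
# ['France', 'US']
```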
