Python/Pandas: convert columns into rows

My database from Excel has some information by country for a range of years. The problem is that each year is a different column header. For example:
Country    Indicator  1950  1951  1952
Australia  x            10    27    20
Australia  y             7    11     8
Australia  z            40    32    37
I want to turn each Indicator into a column header and add a year column, like this:
Country    year   x   y   z
Australia  1950  10   7  40
Australia  1951  27  11  32
Australia  1952  20   8  37
I don't know in advance how many countries are in the data, and the years run from 1950 to 2019.

We can reshape with stack and unstack:
df.set_index(['Country','Indicator']).stack().unstack(level=1).reset_index()
Indicator    Country level_1   x   y   z
0          Australia    1950  10   7  40
1          Australia    1951  27  11  32
2          Australia    1952  20   8  37
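The leftover level_1 column is the stacked year level; to match the headers asked for in the question, you can rename it afterwards. A minimal sketch on the same data:
out = (df.set_index(['Country', 'Indicator'])
         .stack()
         .unstack(level=1)
         .reset_index()
         .rename(columns={'level_1': 'year'}))
out.columns.name = None  # drop the leftover 'Indicator' axis name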

This is just an exploration. @Yoben's solution above is the proper way to do it via pandas; I'm just seeing what other possibilities there are:
# create a dictionary holding the year column labels
years = {'Year': df.filter(regex=r'\d').columns}
# pull the data from the year columns as a NumPy array
year_data = df.filter(regex=r'\d').to_numpy()
# pair each indicator with its row of yearly values
reshaped = dict(zip(df.Indicator, year_data))
reshaped.update(years)
# create a new dataframe
pd.DataFrame(reshaped, index=df.Country)
            x   y   z  Year
Country
Australia  10   7  40  1950
Australia  27  11  32  1951
Australia  20   8  37  1952
You should never have to do this, as you can easily work within the dataframe without creating a new one; the only time you might consider it is for speed. Note also that the dict(zip(...)) pairing assumes the indicator names are unique (effectively a single country), since duplicate keys would overwrite one another. Besides that, it's just something to explore.

It's not exactly what you are looking for, but if your dataframe is the variable df, you can use the transpose method to invert the dataframe.
In [7]: df
Out[7]:
   col1   col2  col3
0     1   True    10
1     2  False    10
2     3  False   100
3     4   True   100
Transpose
In [8]: df.T
Out[8]:
          0      1      2     3
col1      1      2      3     4
col2   True  False  False  True
col3     10     10    100   100
I think you have a multi-index dataframe, so you may want to check the pandas documentation on MultiIndex.

Related

Add new row of values to Pandas DataFrame in specific row

I'm trying to add, after the Gross Profit line in an income statement, a new line with some values from an array.
I tried just to append it at that location, but nothing changed.
income_statement.loc[["Gross Profit"]].append(gross)
The only way I succeeded in doing something similar was by making it another dataframe and concatenating it to the end of the income_statement.
I'm trying to make it look like this (the 'gross' line, shown in yellow in the original screenshot):
How can I do it?
I created a sample df that tries to look similar to yours (see below).
df
      Unnamed: 0  2010  2011  2012  2013  ...  2016  2017  2018  2019  TTM
0   gross profit    10    11    12    13  ...    16    17    18    19  300
1  total revenue     1     2     3     4  ...     7     8     9    10  400
The aim now would be to add a row between them ('gross'), with the values you have listed in the picture.
One way to add the row is with numpy.insert, which returns an array, so you have to convert the result back to a pd.DataFrame:
import numpy as np
import pandas as pd

# Store the columns of your df
cols = df.columns
# Insert the row (the second argument is the index position for the new row;
# 1 is the second row, as Python indexes start from 0)
new = pd.DataFrame(
    np.insert(df.values, 1,
              values=['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133],
              axis=0),
    columns=cols)
Which gets back:
new
      Unnamed: 0  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  TTM
0   gross profit    10    11    12    13    14    15    16    17    18    19  300
1          gross    22    45    65    87   108   130   151   152   156   135  133
2  total revenue     1     2     3     4     5     6     7     8     9    10  400
Hopefully this will work for you. Let me know if you run into issues.
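As an alternative, the concat route the question mentions works too; a hedged sketch using the same hypothetical row values as above:
import pandas as pd

# build the new row as a one-row frame (same assumed values as in the np.insert example)
gross = pd.DataFrame([['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133]],
                     columns=df.columns)
# stitch the original frame back together around index position 1
new = pd.concat([df.iloc[:1], gross, df.iloc[1:]]).reset_index(drop=True)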

Python: remove rows with max value in each group

I have a pandas data frame df like this.
In [1]: df
Out[1]:
   country  count
0    Japan     78
1    Japan     80
2      USA     45
3   France     34
4   France     90
5       UK     45
6       UK     34
7    China     32
8    China     87
9   Russia     20
10  Russia     67
I want to remove rows with the maximum value in each group. So the result should look like:
  country  count
0   Japan     78
3  France     34
6      UK     34
7   China     32
9  Russia     20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out rows where count equals that series. Note this will also remove duplicate maximums.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g gives the group maximum for each row. Where it equals df['count'] (aligned by index), that row holds the maximum for its group; ~ then negates the condition.
print(df.groupby(['country'])['count'].transform('max'))
0     80
1     80
2     45
3     90
4     90
5     45
6     45
7     87
8     87
9     67
10    67
Name: count, dtype: int64
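Putting it together on the question's data, a minimal end-to-end sketch:
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan', 'Japan', 'USA', 'France', 'France',
                'UK', 'UK', 'China', 'China', 'Russia', 'Russia'],
    'count': [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})
g = df.groupby('country')['count'].transform('max')
print(df[df['count'] != g])  # keeps rows 0, 3, 6, 7, 9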
sort + drop
Alternatively, you can sort by count and drop the last occurrence in each group:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
  country  count
9  Russia     20
7   China     32
3  France     34
6      UK     34
0   Japan     78
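If you want the rows back in their original order (as in the expected output in the question), a final sort of the index restores it:
print(res.sort_index())  # rows 0, 3, 6, 7, 9 in index order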

Why doesn't it replace columns in the new dataframe?

I have a df with nations as the index and years (1990-2015) as headers. I want to make a new df2 where every column is the sum of 5 years, e.g. 1995-1999, 2000-2004, etc.
I have done this:
df2 = pd.DataFrame(index=df.index[:], columns=['1995', '2000', '2005', '2010', '2015'])
df2['1995'] = df.iloc[0:4].sum(axis=1)
But it doesn't replace the NaN values.
What am I doing wrong? Thanks in advance.
Step 1
Transpose and reset index with df.T.reset_index
df2 = df.T.reset_index(drop=True)
Step 2
Using df.groupby, group by the index in sets of 5, then sum with DataFrameGroupBy.agg, passing np.nansum
df2 = df2.groupby(df2.index // 5).agg(np.nansum).T
Step 3
Assign the columns in place
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
df = ... # Borrowed from Bharath
df2 = df.T.reset_index(drop=True)
df2 = df2.groupby(df2.index // 5).sum().T
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
print(df2)
Output:
         1995  2000  2005  2010
Country
IN         72    29   100     2
EG         31    40    40    24
I think you are looking for the sum of every 5 columns starting from a specific column. One way of doing it is using a for loop to concatenate slices, i.e. if you have a dataframe:
df = pd.DataFrame({'Country':['IN','EG'],'1990':[2,4],'1991':[4,5],'1992':[2,4],'1993':[2,4],'1994':[62,14],'1995':[21,4],'1996':[2,14],'1997':[2,4],'1998':[2,14],'1999':[2,4],'2000':[2,4],'2001':[2,14],'2002':[92,4],'2003':[2,4],'2004':[2,14],'2005':[2,24]})
df.set_index('Country',drop=True,inplace=True)
         1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  \
Country
IN          2     4     2     2    62    21     2     2     2     2     2
EG          4     5     4     4    14     4    14     4    14     4     4

         2001  2002  2003  2004  2005
Country
IN          2    92     2     2     2
EG         14     4     4    14    24
Then
df2 = pd.DataFrame(index=df.index[:])
columns = ['1990', '1995', '2000', '2005']
for x in columns:
    # take the 5 columns starting at x and sum them row-wise
    df2 = pd.concat([df2, df[df.columns[df.columns.tolist().index(x):][0:5]].sum(axis=1)], axis=1)
df2.columns = columns
Output:
         1990  1995  2000  2005
Country
IN         72    29   100     2
EG         31    40    40    24
If you want to set different column labels, then:
df2.columns = ['1990-1994','1995-1999','2000-2004','2005-']
Hope it helps
You can use:
1. convert the columns to_datetime
2. resample along the columns (axis=1) with a 5-year frequency (5A) and aggregate with sum
3. finally, get the years from the columns via DatetimeIndex.year and subtract 4
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year - 4
print (df2)
         1990  1995  2000  2005
Country
IN         72    29   100     2
EG         31    40    40    24
If you need to shift the year labels, it is also possible to add 1:
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year + 1
print (df2)
         1995  2000  2005  2010
Country
IN         72    29   100     2
EG         31    40    40    24
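For completeness, a column-wise groupby on the original frame (with year strings as column labels) gives the same result without transposing or resampling. This is a sketch, not from the answers above, assuming the pandas versions these answers target, where groupby(axis=1) is available:
years = df.columns.astype(int)          # the column labels are year strings
df2 = df.groupby(years // 5 * 5, axis=1).sum()
df2.columns = df2.columns.astype(str)   # optional: back to string labels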

Pandas dataframe: turn date in column into value in row

I'm trying to turn the following dataframe (with values for county and year)
county  region  2012  2013  ...  2035
A          101    10    15  ...     7
B          101    13     8  ...    11
...
into a dataframe that looks like this:
county  region  year  sum
A          101  2012   10
A          101  2013   15
...        ...   ...  ...
A          101  2035    7
B          101  2012   13
B          101  2013    8
B          101  2035   11
My current dataframe has 400 rows (different counties) with values for the years 2012-2035.
My manual approach would be to slice the year columns off and put each of them below the last row of the preceding year. But of course there has to be a pythonic way.
I guess I'm missing a basic pandas concept here, probably I just couldn't find the right answer to this problem because I simply didn't know how to ask the right question. Please be gentle with the newcomer.
You can use melt from pandas:
In [26]: df
Out[26]:
  county  region  2012  2013
0      A     101    10    15
1      B     101    13     8
In [27]: pd.melt(df, id_vars=['county','region'], var_name='year', value_name='sum')
Out[27]:
  county  region  year  sum
0      A     101  2012   10
1      B     101  2012   13
2      A     101  2013   15
3      B     101  2013    8
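melt emits the rows year by year; if you want the county-grouped order shown in the question, a follow-up sort does it. A small sketch reusing the melted frame:
out = pd.melt(df, id_vars=['county', 'region'], var_name='year', value_name='sum')
out = out.sort_values(['county', 'year']).reset_index(drop=True)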

Pandas dataframe: how to find missing years in a timeseries?

I have a DataFrame with a timestamp index and some 100,000 rows. Via
df['year'] = df.index.year
it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,
grouped = df.groupby('year')
grouped.groups.keys()
will give me the years which are present in my dataset. I could now build a complete year vector with
pd.date_range(df.index.min(), df.index.max(), freq='AS')
and through reindex I should then be able to find the missing years as those years which have NaN values.
However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while, presumably because it doesn't only look for unique keys but also builds the index lists of rows that belong to each key, which is a feature I don't need here.
Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?
One method would be to construct a series of the years of interest and then use isin to see the missing values:
In [89]:
year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
6 1999
7 2000
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
20 2013
21 2014
dtype: int32
In [88]:
df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
year
0 1999
1 2000
2 2013
In [91]:
year_s[~year_s.isin(df['year'])]
Out[91]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
21 2014
dtype: int32
So in your case you can generate the year series as above, then for your df you can get the years using:
df.index.year.unique()
which will be much quicker than performing a groupby.
Take care that the last value passed to arange is not included in the range.
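Combining the pieces above into one sketch (the year bounds are placeholders for your own range):
import numpy as np
import pandas as pd

year_s = pd.Series(np.arange(1993, 2015))        # upper bound 2015 is excluded
present = pd.Series(df.index.year.unique())      # years actually in the timestamp index
missing = year_s[~year_s.isin(present)]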
If all you want is a list of missing years, you can take the unique years from the column and build the missing ones with a list comprehension:
years = df['year'].unique()
missing_years = [y for y in range(min(years), max(years)+1) if y not in years]
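Since unique() returns a NumPy array, each y not in years test is a linear scan; converting to a set first keeps the same logic but scales better over long ranges:
years = set(df['year'].unique())
missing_years = [y for y in range(min(years), max(years) + 1) if y not in years]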
