set_index equivalent for column headings - python

In Pandas, if I have a DataFrame that looks like:
0 1 2 3 4 5 6
0 2013 2012 2011 2010 2009 2008
1 January 3,925 3,463 3,289 3,184 3,488 4,568
2 February 3,632 2,983 2,902 3,053 3,347 4,527
3 March 3,909 3,166 3,217 3,175 3,636 4,594
4 April 3,903 3,258 3,146 3,023 3,709 4,574
5 May 4,075 3,234 3,266 3,033 3,603 4,511
6 June 4,038 3,272 3,316 2,909 3,057 4,081
7 July 3,661 3,359 3,062 3,354 4,215
8 August 3,942 3,417 3,077 3,395 4,139
9 September 3,703 3,169 3,095 3,100 3,752
10 October 3,727 3,469 3,179 3,375 3,874
11 November 3,722 3,145 3,159 3,213 3,567
12 December 3,866 3,251 3,199 3,324 3,362
13 Total 23,482 41,997 38,946 37,148 40,601 49,764
I can convert the first column to be the index using:
In [55]: df.set_index([0])
Out[55]:
1 2 3 4 5 6
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
My question is how to convert the first row to be the column headings?
The closest I can get is:
In [53]: df.set_index([0]).rename(columns=df.loc[0])
Out[53]:
2013 2012 2011 2010 2009 2008
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
but then I have to go in and remove the first row.

The best way to handle this is to avoid getting into this situation.
How was df created? For example, if you used read_csv or a variant, then header=0 will tell read_csv to parse the first line as the column names.
Given df as you have it, I don't think there is an easier way to fix it than what you've described. To remove the first row, you could use df.iloc:
df = df.iloc[1:]
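Putting the two steps together on a toy frame shaped like the one in the question (values abbreviated): set the index from column 0, promote row 0 to the column headings, then drop that row.

```python
import pandas as pd

df = pd.DataFrame([['', 2013, 2012],
                   ['January', 3925, 3463],
                   ['February', 3632, 2983]])
df = df.set_index(0)      # first column becomes the index
df.columns = df.iloc[0]   # first row becomes the column headings
df = df.iloc[1:]          # drop the now-redundant first row
```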

I'm not sure if this is more efficient, but you could try creating a data frame with the correct index and default column names out of your problematic data frame, and then also rename the columns using the problematic data frame. For example:
import pandas as pd
from pandas import DataFrame

data = {'0': [' ', 'Jan', 'Feb', 'Mar', 'April'],
        '1': ['2013', 3926, 3456, 3245, 1254],
        '2': ['2012', 3346, 4342, 1214, 4522],
        '3': ['2011', 3946, 4323, 1214, 8922]}
DF = DataFrame(data)
# .ix has been removed from pandas; use .iloc for positional slicing
DF2 = DataFrame(DF.iloc[1:, 1:]).set_index(DF.iloc[1:, 0])
DF2.columns = DF.iloc[0, 1:]
DF2

If there is a valid index, you can double-transpose like this:
If you know the name of the row (in this case: 0)
df.T.set_index(0).T
If you know the position of the row (in this case: 0)
df.T.set_index(df.index[0]).T
Or for multiple rows to MultiIndex:
df.T.set_index(list(df.index[0:2])).T
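On a tiny frame with the question's layout (values abbreviated), the name-based variant looks like this:

```python
import pandas as pd

df = pd.DataFrame([['', 2013, 2012],
                   ['January', 3925, 3463]])
# transpose, promote row 0 (now a column) to the index, transpose back
out = df.T.set_index(0).T
```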

Pandas subtract value from one dataframe from columns in another dataframe depending on index

I have two dataframes like the ones below, let's call them df1 and df2 respectively.
state year val1
ALABAMA 2012 22.186789
2016 27.725147
2020 25.461653
ALASKA 2012 13.988918
2016 14.730641
2020 10.061191
ARIZONA 2012 9.064766
2016 3.543962
2020 -0.308710
year val2
2000 -0.491702
2004 2.434132
2008 -7.399984
2012 -3.935184
2016 -2.181941
2020 -4.448889
For each row in df1, I want to subtract val2 in df2 from every corresponding year in df1. I.e., I want to find the difference between val1 and val2 for each year in every state.
The dataframe I am trying to obtain is
val1 difference
state year
ALABAMA 2012 22.186789 26.121973
2016 27.725147 29.907088
2020 25.461653 29.910542
ALASKA 2012 13.988918 17.924102
2016 14.730641 16.912582
2020 10.061191 14.510080
ARIZONA 2012 9.064766 12.999950
2016 3.543962 5.725903
2020 -0.308710 4.140180
I have been able to accomplish this using a for loop like the one below, but am wondering if there is a more efficient way.
for i in range(2012, 2024, 4):
    df1.loc[(slice(None), i), 'difference'] = df1.loc[(slice(None), i), 'val1'] - df2.loc[i]['val2']
This works for me, though you might need to tweak it based on how your indices are set up:
merged = df1.reset_index().merge(df2, left_on='year', right_index=True)
df_new = (merged
          .assign(difference=merged.val1 - merged.val2)
          .set_index(['state', 'year'])
          .filter(['val1', 'difference'])
          .sort_index())
df_new
val1 difference
state year
ALABAMA 2012 22.186789 26.121973
2016 27.725147 29.907088
2020 25.461653 29.910542
ALASKA 2012 13.988918 17.924102
2016 14.730641 16.912582
2020 10.061191 14.510080
ARIZONA 2012 9.064766 12.999950
2016 3.543962 5.725903
2020 -0.308710 4.140179
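If you'd rather not merge at all, a lookup on the year level also works. This is a sketch assuming, as in the question, that df1 carries a (state, year) MultiIndex and df2 is indexed by year (only the first few rows reproduced):

```python
import pandas as pd

df1 = pd.DataFrame(
    {'val1': [22.186789, 27.725147, 13.988918]},
    index=pd.MultiIndex.from_tuples(
        [('ALABAMA', 2012), ('ALABAMA', 2016), ('ALASKA', 2012)],
        names=['state', 'year']))
df2 = pd.DataFrame({'val2': [-3.935184, -2.181941]},
                   index=pd.Index([2012, 2016], name='year'))

# look up each row's year in df2 and subtract in one vectorized step
df1['difference'] = df1['val1'] - df1.index.get_level_values('year').map(df2['val2'])
```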

Move data from row 1 to row 0

I have this function written in Python. I want it to show the difference between rows of the production column.
Here's the code
def print_df():
    mycursor.execute("SELECT * FROM productions")
    myresult = mycursor.fetchall()
    myresult.sort(key=lambda x: x[0])
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Dif'] = abs(df['Production (Ton)'].diff())
    print(df)
And of course the output is this
Year Production (Ton) Dif
0 2010 339491 NaN
1 2011 366999 27508.0
2 2012 361986 5013.0
3 2013 329461 32525.0
4 2014 355464 26003.0
5 2015 344998 10466.0
6 2016 274317 70681.0
7 2017 200916 73401.0
8 2018 217246 16330.0
9 2019 119830 97416.0
10 2020 66640 53190.0
But I want the output like this
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0
What should I change or add to my code?
You can use a negative period input to diff to get the differences the way you want, and then fillna to fill the last value with the value from the Production column:
df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()
Output:
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0
Use shift(-1) to shift all rows one position up.
df['Dif'] = (df['Production (Ton)'] - df['Production (Ton)'].shift(-1).fillna(0)).abs()
Notice that by setting fillna(0), you avoid the NaNs.
You can also use diff:
df['Dif'] = df['Production (Ton)'].diff().shift(-1).fillna(0).abs()
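A minimal self-contained check of the diff(-1) variant, using the first three rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2010, 2011, 2012],
                   'Production (Ton)': [339491, 366999, 361986]})
# each row minus the next row; the last row falls back to its own value
df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()
# e.g. row 0: |339491 - 366999| = 27508.0
```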

How to get years without starting with df=df.set_index

I have this dataframe of yearly income figures.
I can obtain the values that is 15% greater than the mean by:
df[df['Interest']>(df["Interest"].mean()*1.15)].Interest.to_string()
I obtained all the values that are 15% greater than the mean in their respective categories. The question is: how do I get the year in which these values occurred without starting with
df = df.set_index('Year')
since that approach then requires me to look up the year values with df.iloc?
Use .loc:
>>> df
Year Dividends Interest Other Types Rent Royalties Trade Income
0 2007 7632014 4643033 206207 626668 89715 18654926
1 2008 6718487 4220161 379049 735494 58535 29677697
2 2009 1226858 5682198 482776 1015181 138083 22712088
3 2010 978925 2229315 565625 1260765 146791 15219378
4 2011 1500621 2452712 675770 1325025 244073 19697549
5 2012 308064 2346778 591180 1483543 378998 33030888
6 2013 275019 4274425 707344 1664747 296136 17503798
7 2014 226634 3124281 891466 1807172 443671 16023363
8 2015 2171559 3474825 1144862 1858838 585733 16778858
9 2016 767713 4646350 2616322 1942102 458543 13970498
10 2017 759016 4918320 1659303 2001220 796343 9730659
11 2018 687308 6057191 1524474 2127583 1224471 19570540
>>> df.loc[df['Interest']>(df["Interest"].mean()*1.15), ['Year', 'Interest']]
Year Interest
0 2007 4643033
2 2009 5682198
9 2016 4646350
10 2017 4918320
11 2018 6057191
This will return a DataFrame with Year and the Interest values that match your condition
df[df['Interest']>(df["Interest"].mean()*1.15)][['Year', 'Interest']]
This will return just the Year column:
df.loc[df["Interest"]>df["Interest"].mean()*1.15]["Year"]
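The same pattern in miniature, with sample values taken from the first three rows of the table above:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2007, 2008, 2009],
                   'Interest': [4643033, 4220161, 5682198]})
mask = df['Interest'] > df['Interest'].mean() * 1.15
years = df.loc[mask, 'Year'].tolist()  # only 2009 clears the 15% bar in this subset
```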

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of the Violent_type counts per year. For example, the total count of Violent_type for the year 2013 is 18728+121662+1035. But I don't know how to select the data when there are MultiIndexes. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
count
district year Violent_type
001 2013 Dangerous 0
Non-Violent 1
Violent 2
2014 Dangerous 3
Non-Violent 4
Violent 5
SST 2013 Dangerous 6
Non-Violent 7
Violent 8
2014 Dangerous 9
Non-Violent 10
Violent 11
'''
print(df.groupby(level=[0, 1]).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
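To get the single figure the question asks about (one year's total across districts), you can also group by the year level alone, or take a cross-section with xs. Reusing the same toy frame:

```python
import pandas as pd

iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame({'count': range(len(index))}, index=index)

per_year = df.groupby(level='year').sum()              # totals per year across districts
total_2013 = df.xs(2013, level='year')['count'].sum()  # same figure via a cross-section
```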

Calculate average for each quarter given month columns

Suppose I have a pandas DataFrame with 10 rows and 16 columns. Each row stands for one product. The first column is the product ID; the other 15 columns are selling prices for
2010/01, 2010/02, 2010/03, 2010/05, 2010/06, 2010/07, 2010/08, 2010/10, 2010/11, 2010/12, 2011/01, 2011/02, 2011/03, 2011/04, 2011/05.
(The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), and I don't know how to deal with it. (Note that the months 2010/04, 2010/09, and 2011/06 are missing.)
The description above is just an example, and a data set this small could be handled with a manual loop. However, the real data set I work on is 10730×202, so I cannot manually check which months are missing or map months to quarters by hand. I wonder what efficient way I can apply here.
Thanks for the help!
This should help:
import pandas as pd
import numpy as np
rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'), 'value': np.arange(72)})
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])[['value']].mean()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)
mean
quarter year
1 2011 1
2012 13
2013 25
2014 37
2015 49
2016 61
2 2011 4
2012 16
2013 28
2014 40
2015 52
2016 64
3 2011 7
2012 19
2013 31
2014 43
2015 55
2016 67
4 2011 10
2012 22
2013 34
2014 46
2015 58
2016 70
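For the wide layout the question actually describes (month columns as 'YYYY/MM' strings with gaps), one sketch is to parse the column labels into quarterly periods and group on them. The column names and numbers here are made up for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                  columns=['2010/01', '2010/02', '2010/03', '2010/05'])

# parse each column label into its quarter; missing months simply
# contribute nothing to their quarter's mean
quarters = pd.to_datetime(df.columns, format='%Y/%m').to_period('Q')
# transpose so months are rows, group them by quarter, average, transpose back
out = df.T.groupby(quarters).mean().T
```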
