set_index equivalent for column headings - python
In Pandas, if I have a DataFrame that looks like:
0 1 2 3 4 5 6
0 2013 2012 2011 2010 2009 2008
1 January 3,925 3,463 3,289 3,184 3,488 4,568
2 February 3,632 2,983 2,902 3,053 3,347 4,527
3 March 3,909 3,166 3,217 3,175 3,636 4,594
4 April 3,903 3,258 3,146 3,023 3,709 4,574
5 May 4,075 3,234 3,266 3,033 3,603 4,511
6 June 4,038 3,272 3,316 2,909 3,057 4,081
7 July 3,661 3,359 3,062 3,354 4,215
8 August 3,942 3,417 3,077 3,395 4,139
9 September 3,703 3,169 3,095 3,100 3,752
10 October 3,727 3,469 3,179 3,375 3,874
11 November 3,722 3,145 3,159 3,213 3,567
12 December 3,866 3,251 3,199 3,324 3,362
13 Total 23,482 41,997 38,946 37,148 40,601 49,764
I can convert the first column to be the index using:
In [55]: df.set_index([0])
Out[55]:
1 2 3 4 5 6
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
My question is how to convert the first row to be the column headings?
The closest I can get is:
In [53]: df.set_index([0]).rename(columns=df.loc[0])
Out[53]:
2013 2012 2011 2010 2009 2008
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
but then I have to go in and remove the first row.
The best way to handle this is to avoid getting into this situation.
How was df created? For example, if you used read_csv or a variant, then header=0 will tell read_csv to parse the first line as the column names.
Given df as you have it, I don't think there is an easier way to fix it than what you've described. To remove the first row, you could use df.iloc:
df = df.iloc[1:]
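Putting the two steps together, a minimal sketch (with made-up numbers in the shape of the question's data) of the set-index-then-drop-row fix might look like:

```python
import pandas as pd

# A small frame in the problematic shape: the year headings live in
# row 0 and the month labels live in column 0
df = pd.DataFrame([['', 2013, 2012],
                   ['January', 3925, 3463],
                   ['February', 3632, 2983]])

df = df.set_index(0)      # first column becomes the row index
df.columns = df.iloc[0]   # first row becomes the column headings
df = df.iloc[1:]          # drop the now-redundant first row

print(df.loc['January', 2013])  # → 3925
```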
I'm not sure whether this is more efficient, but you could try building a new data frame with the correct index and default column names out of your problematic data frame, and then renaming the columns, also using the problematic data frame. For example:
import pandas as pd
from pandas import DataFrame

data = {'0': [' ', 'Jan', 'Feb', 'Mar', 'April'],
        '1': ['2013', 3926, 3456, 3245, 1254],
        '2': ['2012', 3346, 4342, 1214, 4522],
        '3': ['2011', 3946, 4323, 1214, 8922]}
DF = DataFrame(data)
# .ix has been removed from pandas; .iloc does the same positional slicing
DF2 = DF.iloc[1:, 1:].set_index(DF.iloc[1:, 0])
DF2.columns = DF.iloc[0, 1:]
DF2
If there is a valid index, you can transpose, set the index, and transpose back, like this:
If you know the name of the row (in this case: 0)
df.T.set_index(0).T
If you know the position of the row (in this case: 0)
df.T.set_index(df.index[0]).T
Or for multiple rows to MultiIndex:
df.T.set_index(list(df.index[0:2])).T
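As a quick sketch of the first variant on a toy frame (made-up numbers), where row 0 holds the intended headings:

```python
import pandas as pd

# Row 0 holds the intended column headings
df = pd.DataFrame([[2013, 2012],
                   [3925, 3463],
                   [3632, 2983]])

# Transpose, promote what was row 0 to the index, transpose back
out = df.T.set_index(0).T

print(out.loc[1, 2013])  # → 3925
```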
Related
Pandas subtract value from one dataframe from columns in another dataframe depending on index
I have two dataframes like the ones below, let's call them df1 and df2 respectively.

                   val1
state   year
ALABAMA 2012  22.186789
        2016  27.725147
        2020  25.461653
ALASKA  2012  13.988918
        2016  14.730641
        2020  10.061191
ARIZONA 2012   9.064766
        2016   3.543962
        2020  -0.308710

          val2
year
2000  -0.491702
2004   2.434132
2008  -7.399984
2012  -3.935184
2016  -2.181941
2020  -4.448889

For each row in df1, I want to subtract val2 in df2 from every corresponding year in df1. I.e., I want to find the difference between val1 and val2 for each year in every state. The dataframe I am trying to obtain is:

party_simplified       val1  difference
state   year
ALABAMA 2012      22.186789   26.121973
        2016      27.725147   29.907088
        2020      25.461653   29.910542
ALASKA  2012      13.988918   17.924102
        2016      14.730641   16.912582
        2020      10.061191   14.510080
ARIZONA 2012       9.064766   12.999950
        2016       3.543962    5.725903
        2020      -0.308710    4.140180

I have been able to accomplish this using a for loop like the one below, but am wondering if there is a more efficient way.

for i in range(2012, 2024, 4):
    df1.loc[(slice(None), i), 'difference'] = df1.loc[(slice(None), i), 'val1'] - df2.loc[i]['val2']
This works for me, though you might need to tweak it based on how your indices are set up:

merged = df1.reset_index().merge(df2, left_on='year', right_index=True)
df_new = (merged
          .assign(**{'difference': merged.val1 - merged.val2})
          .set_index(['state', 'year'])
          .filter(['val1', 'difference'])
          .sort_index())
df_new

                   val1  difference
state   year
ALABAMA 2012  22.186789   26.121973
        2016  27.725147   29.907088
        2020  25.461653   29.910542
ALASKA  2012  13.988918   17.924102
        2016  14.730641   16.912582
        2020  10.061191   14.510080
ARIZONA 2012   9.064766   12.999950
        2016   3.543962    5.725903
        2020  -0.308710    4.140179
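An alternative that avoids the merge entirely (a sketch on a cut-down version of the question's data) is Series.sub with its level argument, which aligns df2's single-level year index against the 'year' level of df1's MultiIndex:

```python
import pandas as pd

# Cut-down versions of df1 and df2 from the question
df1 = pd.DataFrame(
    {'val1': [22.186789, 27.725147, 13.988918]},
    index=pd.MultiIndex.from_tuples(
        [('ALABAMA', 2012), ('ALABAMA', 2016), ('ALASKA', 2012)],
        names=['state', 'year']))
df2 = pd.DataFrame({'val2': [-3.935184, -2.181941]},
                   index=pd.Index([2012, 2016], name='year'))

# Subtract val2 from val1, matching rows on the 'year' level
df1['difference'] = df1['val1'].sub(df2['val2'], level='year')
```

This writes the difference column in one vectorized step instead of looping over years.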
Move data from row 1 to row 0
I have this function written in Python. I want it to show the difference between rows of the production column. Here's the code:

def print_df():
    mycursor.execute("SELECT * FROM productions")
    myresult = mycursor.fetchall()
    myresult.sort(key=lambda x: x[0])
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Dif'] = abs(df['Production (Ton)'].diff())
    print(abs(df))

And of course the output is this:

    Year  Production (Ton)      Dif
0   2010            339491      NaN
1   2011            366999  27508.0
2   2012            361986   5013.0
3   2013            329461  32525.0
4   2014            355464  26003.0
5   2015            344998  10466.0
6   2016            274317  70681.0
7   2017            200916  73401.0
8   2018            217246  16330.0
9   2019            119830  97416.0
10  2020             66640  53190.0

But I want the output like this:

    Year  Production (Ton)      Dif
0   2010            339491  27508.0
1   2011            366999   5013.0
2   2012            361986  32525.0
3   2013            329461  26003.0
4   2014            355464  10466.0
5   2015            344998  70681.0
6   2016            274317  73401.0
7   2017            200916  16330.0
8   2018            217246  97416.0
9   2019            119830  53190.0
10  2020             66640  66640.0

What should I change or add to my code?
You can use a negative period input to diff to get the differences the way you want, and then fillna to fill the last value with the value from the Production column:

df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()

Output:

    Year  Production (Ton)      Dif
0   2010            339491  27508.0
1   2011            366999   5013.0
2   2012            361986  32525.0
3   2013            329461  26003.0
4   2014            355464  10466.0
5   2015            344998  70681.0
6   2016            274317  73401.0
7   2017            200916  16330.0
8   2018            217246  97416.0
9   2019            119830  53190.0
10  2020             66640  66640.0
Use shift(-1) to shift all rows one position up:

df['Dif'] = (df['Production (Ton)'] - df['Production (Ton)'].shift(-1).fillna(0)).abs()

Notice that by setting fillna(0), you avoid the NaNs. You can also use diff:

df['Dif'] = df['Production (Ton)'].diff().shift(-1).fillna(0).abs()
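As a quick check of the diff(-1) approach on just the first three rows of the data (the fillna step supplies the final value from the production column):

```python
import pandas as pd

# First three production figures from the question
s = pd.Series([339491, 366999, 361986], name='Production (Ton)')

# Difference with the *next* row, absolute value, and fill the
# trailing NaN with the production value itself
dif = s.diff(-1).abs().fillna(s)

print(dif.tolist())  # → [27508.0, 5013.0, 361986.0]
```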
How to get years without starting with df=df.set_index
I have this set of data in a dataframe. I can obtain the values that are 15% greater than the mean with:

df[df['Interest'] > (df["Interest"].mean() * 1.15)].Interest.to_string()

This gives me all the values that are 15% greater than the mean interest in their respective categories. The question is how do I get the year in which these values occurred, without starting with

df = df.set_index('Year')

since the function above requires my year values to be accessible with df.iloc.
How do I get the year where these values occurred without starting with df.set_index('Year')?

Use .loc:

>>> df
    Year  Dividends  Interest  Other Types     Rent  Royalties  Trade Income
0   2007    7632014   4643033       206207   626668      89715      18654926
1   2008    6718487   4220161       379049   735494      58535      29677697
2   2009    1226858   5682198       482776  1015181     138083      22712088
3   2010     978925   2229315       565625  1260765     146791      15219378
4   2011    1500621   2452712       675770  1325025     244073      19697549
5   2012     308064   2346778       591180  1483543     378998      33030888
6   2013     275019   4274425       707344  1664747     296136      17503798
7   2014     226634   3124281       891466  1807172     443671      16023363
8   2015    2171559   3474825      1144862  1858838     585733      16778858
9   2016     767713   4646350      2616322  1942102     458543      13970498
10  2017     759016   4918320      1659303  2001220     796343       9730659
11  2018     687308   6057191      1524474  2127583    1224471      19570540

>>> df.loc[df['Interest'] > (df["Interest"].mean() * 1.15), ['Year', 'Interest']]
    Year  Interest
0   2007   4643033
2   2009   5682198
9   2016   4646350
10  2017   4918320
11  2018   6057191
This will return a DataFrame with Year and the Interest values that match your condition:

df[df['Interest'] > (df["Interest"].mean() * 1.15)][['Year', 'Interest']]
This will return the Year:

df.loc[df["Interest"] > df["Interest"].mean() * 1.15]["Year"]
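All three answers boil down to the same boolean mask; a tiny runnable sketch using just the first three rows of the data:

```python
import pandas as pd

# First three rows of the Year/Interest data from the answer above
df = pd.DataFrame({'Year': [2007, 2008, 2009],
                   'Interest': [4643033, 4220161, 5682198]})

# Boolean mask for values more than 15% above the mean,
# then pull the matching years with .loc
mask = df['Interest'] > df['Interest'].mean() * 1.15
years = df.loc[mask, 'Year']

print(years.tolist())  # → [2009]
```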
I have multiIndexes for my dataframe, how do I calculate the sum for one level?
Hi everyone, I want to calculate the sum of the Violent_type counts for each year. For example, the total count of Violent_type for year 2013 is 18728 + 121662 + 1035. But I don't know how to select the data when there are MultiIndexes. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.

level: int, level name, or sequence of such, default None
    If the axis is a MultiIndex (hierarchical), group by a particular level or levels.

To answer your question, you only need:

df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()

To see the effect:

import pandas as pd

iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])

'''
print(df)
                             count
district year Violent_type
001      2013 Dangerous          0
              Non-Violent        1
              Violent            2
         2014 Dangerous          3
              Non-Violent        4
              Violent            5
SST      2013 Dangerous          6
              Non-Violent        7
              Violent            8
         2014 Dangerous          9
              Non-Violent       10
              Violent           11
'''

print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''

print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
Calculate average for each quarter given month columns
Suppose I have a Python pandas dataframe with 10 rows and 16 columns. Each row stands for one product. The first column is the product ID. The other 15 columns hold the selling price for 2010/01, 2010/02, 2010/03, 2010/05, 2010/06, 2010/07, 2010/08, 2010/10, 2010/11, 2010/12, 2011/01, 2011/02, 2011/03, 2011/04, 2011/05. (The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), but I don't know how to deal with it. (Note that the months 2010/04, 2010/09 and 2011/06 are missing.) The description above is just an example; because that data set is quite small, it would be possible to loop manually. However, the real data set I work on is 10730*202, so I cannot manually check which months are missing or map quarters by hand. I wonder what efficient approach I can apply here. Thanks for the help!
This should help:

import pandas as pd
import numpy as np

rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'),
                    'value': np.arange(72)})
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])[['value']].mean()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)

              mean
quarter year
1       2011     1
        2012    13
        2013    25
        2014    37
        2015    49
        2016    61
2       2011     4
        2012    16
        2013    28
        2014    40
        2015    52
        2016    64
3       2011     7
        2012    19
        2013    31
        2014    43
        2015    55
        2016    67
4       2011    10
        2012    22
        2013    34
        2014    46
        2015    58
        2016    70
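Since the column names in the question are strings like '2010/01' rather than a date column, another way (a sketch on a hypothetical mini version of the data, with made-up prices) is to parse the column names into quarterly periods and group the columns by those, which handles missing months automatically:

```python
import pandas as pd

# Hypothetical mini frame: product ID plus monthly price columns
# named 'YYYY/MM', with 2010/03 and 2010/04 missing
df = pd.DataFrame({'id': [1, 2],
                   '2010/01': [10.0, 20.0],
                   '2010/02': [12.0, 22.0],
                   '2010/05': [14.0, 24.0]})

prices = df.set_index('id')

# Parse the string column names into quarterly periods; a missing
# month simply contributes nothing to its quarter's mean
quarters = pd.to_datetime(prices.columns, format='%Y/%m').to_period('Q')

# Group the columns by quarter via a double transpose
quarterly = prices.T.groupby(quarters).mean().T
```

For product 1 this yields a 2010Q1 mean of (10.0 + 12.0) / 2 = 11.0 and a 2010Q2 mean of 14.0 (May being the only second-quarter month present).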