Pandas dataframe.set_index() deletes previous index and column - python

I just came across a strange phenomenon with Pandas DataFrames: when setting the index using DataFrame.set_index('some_index'), the old column that was previously the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the old index. This is very irritating, because I just wanted to swap the DataFrame index.
Is there a way to keep the previous index column from being deleted? Maybe through something like: DataFrame.set_index('new_index', delete_previous_index=False)
Thanks for any advice

You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31

The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column', drop=False). This is not the perfect answer but it works!
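A minimal sketch of the drop=False behaviour, using the question's example frame:

```python
import pandas as pd

# Same example frame as in the question
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# drop=False makes 'month' the index while also keeping it as a regular column
df_mn = df.set_index('month', drop=False)

# Swapping the index to 'year' the same way leaves both columns available
df_yr = df_mn.set_index('year', drop=False)
print(df_yr.columns.tolist())  # ['month', 'year', 'sale']
```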

No, in such cases you have to save your previous column, as shown below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
df_mn['month'] = df_mn.index #Save it as another column, and then run set_index with year column as value.
df_mn.set_index('year')
Besides, you assigned the result to a separate DataFrame df_mn, so the original DataFrame df remains unchanged and you can use it again.
Also, as long as you don't set the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it.
And, like the other answer, you can always use reset_index().

Related

Pandas groupby use aggregate based on two columns

Imagine I have the following dataframe:
import numpy as np
import pandas as pd

np.random.seed(42)
t = pd.DataFrame({'year': 4*['2018'] + 3*['2019'] + 4*['2016'],
                  'pop': np.random.randint(10, 100, size=11),
                  'production': np.random.randint(2000, 40000, size=11)})
print(t)
year pop production
2018 61 3685
2018 24 2769
2018 81 4433
2018 70 7311
2019 30 39819
2019 92 19568
2019 96 21769
2016 84 30693
2016 84 8396
2016 97 29480
2016 33 27658
I want to find the sum of production divided by the sum of the pop by each year, my final dataframe would be something like:
tmp = t.groupby('year').sum()
tmp['production']/tmp['pop']
year
2016 322.909396
2018 77.110169
2019 372.275229
I was thinking if it could be done using groupby year and then using agg based on two columns, something like:
#doesn't work
t.groupby('year').agg(prod_per_pop = (['pop', 'production'],
lambda x: x['production'].sum()/x['pop'].sum()))
My question is basically if it is possible to use any pandas groupby method to achieve that in an easy way rather than having to create another dataframe and then having to divide.
You could aggregate the sums and then divide with a lambda and axis=1, all in a single line (note the double brackets when selecting the two columns; the single-bracket list form was deprecated and later removed):
t.groupby('year')[['pop', 'production']].agg('sum').apply(lambda x: x['production'] / x['pop'], axis=1)
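An alternative, sketched with the question's data, runs one lambda per group via apply, so the division happens inside the groupby instead of in a second pass:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
t = pd.DataFrame({'year': 4*['2018'] + 3*['2019'] + 4*['2016'],
                  'pop': np.random.randint(10, 100, size=11),
                  'production': np.random.randint(2000, 40000, size=11)})

# Select the two needed columns, then compute the ratio of sums per group
ratio = t.groupby('year')[['pop', 'production']].apply(
    lambda g: g['production'].sum() / g['pop'].sum())
print(ratio)
```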

Plot a DataFrame based on grouped by column in Python

Based on the code below, I'm trying to assign some columns to my DataFrame, which has been grouped by the month of the date, and it works well:
all_together = (df_clean.groupby(df_clean['ContractDate'].dt.strftime('%B'))
                .agg({'Amount': [np.sum, np.mean, np.min, np.max]})
                .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                                 'amin': 'min_amount', 'amax': 'max_amount'}))
But for some reason, when I try to plot the result (any kind of plot), it's not able to recognize "ContractDate" as a column, nor any of the renamed names such as 'sum_amount'.
Do you have any idea what the issue is and what I'm missing as a rule for plotting the data?
I have tried the code below for plotting, and it asks me what "ContractDate" and "sum_amount" are!
all_together.groupby(df_clean['ContractDate'].dt.strftime('%B'))['sum_amount'].nunique().plot(kind='bar')
#or
all_together.plot(kind='bar',x='ContractDate',y='sum_amount')
I really appreciate your time
Cheers,
z.A
When you apply a groupby on a DataFrame, it makes the groupby column the index (ContractDate in your case). So you need to reset the index first to make it a column again.
df = pd.DataFrame({'month':['jan','feb','jan','feb'],'v2':[23,56,12,59]})
t = df.groupby('month').agg('sum')
Output:
v2
month
feb 115
jan 35
So, as you can see, you get the months as the index. Then when you reset the index:
t.reset_index()
Output:
month v2
0 feb 115
1 jan 35
Next, when you apply multiple agg functions on a single column in the groupby, it creates a MultiIndex on the columns. So you need to flatten it to a single level:
t = (df.groupby('month')
       .agg({'v2': [np.sum, np.mean, np.min, np.max]})
       .rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount',
                        'amin': 'min_amount', 'amax': 'max_amount'}))
v2
sum_amount avg_amount min_amount max_amount
month
feb 115 57.5 56 59
jan 35 17.5 12 23
It created a MultiIndex. If you check t.columns, you get:
MultiIndex(levels=[['v2'], ['avg_amount', 'max_amount', 'min_amount', 'sum_amount']],
labels=[[0, 0, 0, 0], [3, 0, 2, 1]])
Now use this:
t.columns = t.columns.get_level_values(1)
t.reset_index(inplace=True)
You will get a clean dataframe:
month sum_amount avg_amount min_amount max_amount
0 feb 115 57.5 56 59
1 jan 35 17.5 12 23
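Putting all the steps together in one runnable sketch (using string aggregation names, which behave the same as the np functions here):

```python
import pandas as pd

df = pd.DataFrame({'month': ['jan', 'feb', 'jan', 'feb'],
                   'v2': [23, 56, 12, 59]})

# Aggregate, flatten the MultiIndex columns, and restore 'month' as a column
t = df.groupby('month').agg({'v2': ['sum', 'mean', 'min', 'max']})
t.columns = ['sum_amount', 'avg_amount', 'min_amount', 'max_amount']
t.reset_index(inplace=True)
print(t.columns.tolist())
# ['month', 'sum_amount', 'avg_amount', 'min_amount', 'max_amount']

# With plain columns, the plot call from the question should now work:
# t.plot(kind='bar', x='month', y='sum_amount')
```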
Hope this helps for your plotting.

If I have two dataframes of which one is a subset of other, how do I remove the common rows completely?

I have already looked for this type of question but none of them really answers my question.
Suppose I have two dataframes and the indices of these are NOT consistent. df2 is a subset of df1 and I want to remove all the rows in df1 that are present in df2.
I already tried the following, but it's not giving me the result I'm looking for.
df1[~df1.index.isin(df2.index)]
Unfortunately, I can't share the original data with you however, the number of columns in the two dataframes are 14.
Here's an example of what I'm looking for:
df1 =
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
df2 =
month year sale
0 1 2012 55
1 10 2014 31
and I'm looking for:
df =
month year sale
0 4 2014 40
1 7 2013 84
You could create a multi-index with all the columns in each dataframe. From that point you have just to drop the indices of the second from the first one:
df1.set_index(list(df1.columns)).drop(df2.set_index(list(df2.columns)).index).reset_index()
Result with your example data:
month year sale
0 4 2014 40
1 7 2013 84
Use left join by DataFrame.merge and indicator parameter, then compare new column for Series.eq (==) and filter by boolean indexing:
df = df1[df1.merge(df2, indicator=True, how='left')['_merge'].eq('left_only')]
print (df)
month year sale
1 4 2014 40
2 7 2013 84
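For completeness, a self-contained version of this merge/indicator approach, built with the example frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 4, 7, 10],
                    'year': [2012, 2014, 2013, 2014],
                    'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10],
                    'year': [2012, 2014],
                    'sale': [55, 31]})

# Left-join df2 onto df1; indicator=True adds a '_merge' column saying
# whether each row was found in both frames or only in the left one
merged = df1.merge(df2, indicator=True, how='left')
df = df1[merged['_merge'].eq('left_only')]
print(df)
```

Note this matches on all common columns, i.e. by values, so inconsistent indices between the two frames don't matter.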
So what you want is to remove rows by values, not by index.
Use concat and drop_duplicates:
comp = pd.concat([df1, df2]).drop_duplicates(keep=False)
Example:
df1 = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10], 'year': [2012, 2014], 'sale': [55, 31]})
pd.concat([df1, df2]).drop_duplicates(keep=False)
Result:
month sale year
1 4 40 2014
2 7 84 2013
Can you try the below:
df1[~df1.isin(df2)]

Python Pandas: Index is overlapped? How to fix

I have created a dataframe after importing weather data, now called "weather".
The end goal is to be able to view data for specific month and year.
It started like this:
Then I ran weather = weather.T to transpose the DataFrame, making it look like:
Then I ran weather.columns = weather.iloc[0] to make it look like:
But the "year" and "month" columns are now located in the index (I think?). How would I get it so it looks like:
Thanks for looking! Will appreciate any help :)
Please note I will remove the first row with the years in it. So don't worry about this part
This just means the pd.Index object underlying your pd.DataFrame object, unbeknownst to you, has a name:
import pandas as pd

df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
                   'JAN': [1, 2, 3],
                   'FEB': [4, 5, 6],
                   'MAR': [7, 8, 9]})
df.columns.name = 'month'
df = df.T
df.columns = df.iloc[0]
print(df)
YEAR 2016 2017 2018
month
YEAR 2016 2017 2018
JAN 1 2 3
FEB 4 5 6
MAR 7 8 9
If this really bothers you, you can use reset_index to elevate your index to a series and then drop the extra heading row. You can, at the same time, remove the column name:
df = df.reset_index().drop(0)
df.columns.name = ''
print(df)
month 2016 2017 2018
1 JAN 1 2 3
2 FEB 4 5 6
3 MAR 7 8 9

Reindexing only valid with uniquely valued Index objects: Pandas DataFrame Panel

I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problems creating the DataFrame itself, but when I try to turn it into a Panel (i.e. panel = pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file is from gridded data so they have the same number of rows and columns, as well as some no data values (given a value of -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
import glob

import numpy as np
import pandas as pd

june = []
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        june.append(csv1)
dfs = {i: pd.DataFrame.from_csv(i) for i in june}
panel = pd.Panel(dfs)
panels = panel.replace(-999.9, np.NaN)
dfs_mean = panels.mean(axis=0)
I have seen questions where the user is getting the same error, but the solutions for those questions doesn't seem to work with my issue. Any help fixing this, or ideas for a better approach would be greatly appreciated.
pd.Panel has been deprecated
Use pd.concat with a dictionary comprehension and take the mean over level 1.
from glob import glob

import pandas as pd

df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.groupby(level=1).mean()  # mean(level=1) was removed in pandas 2.0; this is equivalent
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
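To see the concat-and-group idea on toy data, here is a sketch with two small hypothetical frames standing in for two of the June CSV files:

```python
import numpy as np
import pandas as pd

# Two frames standing in for two June CSV files (hypothetical values);
# -999.9 marks missing data, as in the question
a = pd.DataFrame({'Value1': [1.0, -999.9], 'Value2': [3.5, 8.2]})
b = pd.DataFrame({'Value1': [3.0, 5.0], 'Value2': [-999.9, 4.2]})

combined = pd.concat({'a.csv': a, 'b.csv': b}).replace(-999.9, np.nan)
# Group on the inner (row) level to average each cell across files;
# NaN cells are skipped by mean()
cell_means = combined.groupby(level=1).mean()
print(cell_means)
```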
I have a suggestion to change the approach a bit. Instead of converting the DataFrames into a Panel, just concat them into one big DataFrame, giving each one a unique ID. Then you can group by that ID and use mean() to get the result.
It would look similar to this:
import glob

import numpy as np
import pandas as pd

df = pd.DataFrame()
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        temp_df = pd.read_csv(csv1)
        temp_df['df_id'] = csv1  # tag each row with the file it came from
        df = pd.concat([df, temp_df])
df = df.replace(-999.9, np.nan)  # assign back; replace() is not in-place
df = df.groupby('df_id').mean()
I hope it helps somehow; if you still have any issues with that, let me know.
