Python Pandas: Index is overlapped? How to fix

I have created a dataframe after importing weather data, now called "weather".
The end goal is to be able to view the data for a specific month and year.
It started like this:
Then I ran weather = weather.T to transpose the DataFrame, making it look like:
Then I ran weather.columns = weather.iloc[0] to make the DataFrame look like:
But the "year" and "month" labels are sitting in the index (I think?). How would I get it so it looks like:
Thanks for looking! Will appreciate any help :)
Please note I will remove the first row with the years in it, so don't worry about that part.

This just means that the pd.Index object holding your DataFrame's column labels, unbeknownst to you, has a name:
import pandas as pd

df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
                   'JAN': [1, 2, 3],
                   'FEB': [4, 5, 6],
                   'MAR': [7, 8, 9]})
df.columns.name = 'month'
df = df.T
df.columns = df.iloc[0]
print(df)
YEAR   2016  2017  2018
month
YEAR   2016  2017  2018
JAN       1     2     3
FEB       4     5     6
MAR       7     8     9
If this really bothers you, you can use reset_index to promote the index to a regular column and then drop the extra heading row. At the same time, you can remove the column name:
df = df.reset_index().drop(0)
df.columns.name = ''
print(df)
  month  2016  2017  2018
1   JAN     1     2     3
2   FEB     4     5     6
3   MAR     7     8     9
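If you prefer a single chain, here is a minimal sketch using rename_axis (assuming df is still the transposed frame printed above; passing None clears the columns' name):
out = df.reset_index().drop(0).rename_axis(None, axis=1)
print(out)
#   month  2016  2017  2018
# 1   JAN     1     2     3
# 2   FEB     4     5     6
# 3   MAR     7     8     9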

Related

Pandas Dataframe increased or decreased in a certain amount of time

I have a DataFrame with the following structure:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['R.04T', 1, 2013, 23456, 22, 1],
        ['R.04T', 15, 2014, 23456, 22, 1],
        ['F.04T', 9, 2010, 75920, 0, 3],
        ['F.04T', 4, 2012, 75920, 0, 3],
        ['R.04T', 7, 2013, 20054, 13, 1],
        ['R.04T', 12, 2014, 20058, 13, 1]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['product_code', 'sold', 'year', 'city_number',
                                 'district_number', 'number_of_the_department'])
print(df)
I want to know whether the locations ('city_number' + 'district_number' + 'number_of_the_department') have increased or decreased their amount of sales per year, per article. I'd thought about joining the columns into one location column like the following:
# join the locations
df['location'] = (df['city_number'].astype(str) + ',' +
                  df['district_number'].astype(str) + ',' +
                  df['number_of_the_department'].astype(str))
But I'm not sure how to group the df to answer my question.
I want to know whether the sales have increased or decreased (per year and item) by a certain percentage year over year (e.g. from 2013 to 2014, decreased by x%).
Maybe someone can help? :)
Try this:
df = df.assign(
    pct_change_sold=df.sort_values(by="year")
    .groupby(by=["city_number", "district_number", "number_of_the_department"])["sold"]
    .pct_change()
    .fillna(0)
)
  product_code  sold  year  city_number  district_number  number_of_the_department  pct_change_sold
0        R.04T     1  2013        23456               22                         1         0.000000
1        R.04T    15  2014        23456               22                         1        14.000000
2        F.04T     9  2010        75920                0                         3         0.000000
3        F.04T     4  2012        75920                0                         3        -0.555556
4        R.04T     7  2013        20054               13                         1         0.000000
5        R.04T    12  2014        20058               13                         1         0.000000

Mean of index values in Pandas dataframe following groupby()

I have the following dataframe:
import numpy as np
import pandas as pd

myDF = pd.DataFrame({'quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
                     'year': [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020]})
which looks like:
quarter year
0 Q1 2018
1 Q2 2018
2 Q3 2018
3 Q4 2018
4 Q1 2019
5 Q2 2019
6 Q3 2019
7 Q4 2019
8 Q1 2020
9 Q2 2020
10 Q3 2020
11 Q4 2020
I can calculate the mean of the index values:
print(np.mean(myDF.index))
5.5
...but I would like to produce a list of the mean index values for each year.
I can create a new variable based on index values and find the mean of those values as follows:
myDF['idx'] = myDF.index
print(myDF.groupby('year')['idx'].apply(list))
print(myDF.groupby('year')['idx'].apply(np.mean).tolist())
to produce:
year
2018 [0, 1, 2, 3]
2019 [4, 5, 6, 7]
2020 [8, 9, 10, 11]
Name: idx, dtype: object
[1.5, 5.5, 9.5]
However, I don't seem to be able to manipulate the index values directly. I've tried applying various versions of the above to DataFrameGroupBy objects but I get the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'index'
So, whilst I have a solution, creating a new variable based on the index seems a bit redundant. Can the required list of means be created without the need to alter the original dataframe?
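A minimal sketch that avoids the helper column entirely: turn the index into a Series on the fly and group it by the year column, leaving myDF untouched:
print(myDF.index.to_series().groupby(myDF['year']).mean().tolist())
# [1.5, 5.5, 9.5]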

If I have two dataframes of which one is a subset of other, how do I remove the common rows completely?

I have already looked for this type of question but none of them really answers my question.
Suppose I have two dataframes and the indices of these are NOT consistent. df2 is a subset of df1 and I want to remove all the rows in df1 that are present in df2.
I already tried the following, but it's not giving me the result I'm looking for.
df1[~df1.index.isin(df2.index)]
Unfortunately, I can't share the original data with you; however, both dataframes have 14 columns.
Here's an example of what I'm looking for:
df1 =
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31
df2 =
   month  year  sale
0      1  2012    55
1     10  2014    31
and I'm looking for:
df =
   month  year  sale
0      4  2014    40
1      7  2013    84
You could create a multi-index with all the columns in each dataframe. From that point you have just to drop the indices of the second from the first one:
(df1.set_index(list(df1.columns))
    .drop(df2.set_index(list(df2.columns)).index)
    .reset_index())
Result with your example data:
   month  year  sale
0      4  2014    40
1      7  2013    84
Use a left join via DataFrame.merge with the indicator parameter, then compare the new _merge column with Series.eq (==) and filter by boolean indexing:
df = df1[df1.merge(df2, indicator=True, how='left')['_merge'].eq('left_only')]
print(df)
   month  year  sale
1      4  2014    40
2      7  2013    84
So what you want is to remove rows by values, not by index.
Use concat and drop_duplicates:
comp = pd.concat([df1, df2]).drop_duplicates(keep=False)
Example:
df1 = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10], 'year': [2012, 2014], 'sale': [55, 31]})
pd.concat([df1, df2]).drop_duplicates(keep=False)
Result:
   month  year  sale
1      4  2014    40
2      7  2013    84
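Note that drop_duplicates(keep=False) removes the shared rows from both frames, so any row that exists only in df2 would also survive into the result. If that can happen with your data, here is a sketch that tags each row's origin first (the _src column is a hypothetical helper name) and keeps only rows that came from df1:
merged = pd.concat([df1.assign(_src='df1'), df2.assign(_src='df2')])
result = (merged.drop_duplicates(subset=['month', 'year', 'sale'], keep=False)
                .query("_src == 'df1'")
                .drop(columns='_src'))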
Can you try the below? Since df1.isin(df2) aligns on labels rather than comparing whole rows, compare row tuples instead:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

Pandas dataframe.set_index() deletes previous index and column

I just came across a strange phenomenon with Pandas DataFrames: when setting a new index via DataFrame.set_index('some_index'), the old column that served as the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
>>> df_mn
       sale  year
month
1        55  2012
4        40  2014
7        84  2013
10       31  2014
Now I change the index to year:
df_mn.set_index('year')
      sale
year
2012    55
2014    40
2013    84
2014    31
... and the month column was removed along with the index. This is very irritating, because I just wanted to swap the DataFrame index.
Is there a way to keep the previous index column from being deleted? Maybe through something like DataFrame.set_index('new_index', delete_previous_index=False)?
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
      month  sale
year
2012      1    55
2014      4    40
2013      7    84
2014     10    31
The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column', drop=False). This is not the perfect answer, but it works!
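A minimal sketch of that on the question's data (it assumes you pass drop=False up front, when first setting the index):
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# drop=False keeps 'month' as a regular column while it also serves as the index,
# so the later set_index('year') no longer loses it.
df_mn = df.set_index('month', drop=False)
print(df_mn.set_index('year'))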
No, in such cases you have to save your previous column, as shown below:
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
df_mn['month'] = df_mn.index  # save it as another column, then run set_index with the year column
df_mn.set_index('year')
Besides, you are working on a separate dataframe, df_mn, so the original dataframe df remains unchanged and you can use it again.
Also, if you don't set the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it. And, as in the other answer, you can always use reset_index().

How to efficiently add multiple columns to pandas data frame with values that depend on other dynamic columns

How can I improve on the following code? On a big data set with lots of columns it takes too much time:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})
#    Jan  Feb  Mar  Month  Year
# 0   10    3   30      3  2016
# 1   20    5    4      2  2016
df['Antal_1'] = np.nan
df['Antal_2'] = np.nan
for i in range(len(df)):
    if df['Year'][i] == 2016:
        df['Antal_1'][i] = df.iloc[i, df['Month'][i] - 1]
        df['Antal_2'][i] = df.iloc[i, df['Month'][i] - 2]
    else:
        df['Antal_1'][i] = df.iloc[i, -1]
        df['Antal_2'][i] = df.iloc[i, -2]
df
#    Jan  Feb  Mar  Month  Year  Antal_1  Antal_2
# 0   10    3   30      3  2016     30.0      3.0
# 1   20    5    4      2  2016      5.0     20.0
You should see a marginal speed-up by using df.apply instead of iterating rows:
import pandas as pd

df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})
df = df[['Jan', 'Feb', 'Mar', 'Month', 'Year']]

def calculator(row):
    m1 = row['Month']
    m2 = row.index.get_loc('Month')
    return (row[int(m1 - 1)], row[int(m1 - 2)]) if row['Year'] == 2016 \
        else (row[m2 - 1], row[m2 - 2])

df['Antal_1'], df['Antal_2'] = list(zip(*df.apply(calculator, axis=1)))
#    Jan  Feb  Mar  Month  Year  Antal_1  Antal_2
# 0   10    3   30      3  2016       30        3
# 1   20    5    4      2  2016        5       20
It's not clear to me what you want in the case of the year not being 2016, so I've used 100 as a placeholder. Show an example and I can finish it; if it's just NaNs, remove the first two lines below.
df['Antal_1'] = 100
df['Antal_2'] = 100
df.loc[df['Year'] == 2016, 'Antal_1'] = df[df.columns[df.columns.get_loc('Month') - 1]]
df.loc[df['Year'] == 2016, 'Antal_2'] = df[df.columns[df.columns.get_loc('Month') - 2]]
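For a fully vectorized per-row lookup, here is a sketch (assuming the month columns are exactly Jan, Feb, Mar and that 2016 is the only special-cased year) that uses NumPy fancy indexing and avoids both the loop and apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})

month_vals = df[['Jan', 'Feb', 'Mar']].to_numpy()
rows = np.arange(len(df))
m = df['Month'].to_numpy()
is_2016 = df['Year'].to_numpy() == 2016

# For each row, pick the value in column Month-1 (Antal_1) and Month-2 (Antal_2);
# non-2016 rows get NaN here -- adjust that fallback to whatever you need.
df['Antal_1'] = np.where(is_2016, month_vals[rows, m - 1], np.nan)
df['Antal_2'] = np.where(is_2016, month_vals[rows, m - 2], np.nan)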
