Reshaping a dataframe every nth column - python

I have two datasets. After merging them horizontally and sorting the columns with the following code, I get the dataset below:
df =
     X    Y
0  5.2  6.5
1  3.3  7.6

df_year =
      X     Y
0  2014  2014
1  2015  2015
df_all_cols = pd.concat([df, df_year], axis = 1)
sorted_columns = sorted(df_all_cols.columns)
df_all_cols_sort = df_all_cols[sorted_columns]
     X     X    Y     Y
0  5.2  2014  6.5  2014
1  3.3  2015  7.6  2015
I am trying to make my data look like this, by stacking the dataset every 2 columns.
name  year  Variable
 5.2  2014         X
 3.3  2015         X
 6.5  2014         Y
 7.6  2015         Y

One approach could be as follows:
Apply df.stack to both dfs before feeding them to pd.concat. The result at this stage is:
       0     1
0 X  5.2  2014
  Y  6.5  2014
1 X  3.3  2015
  Y  7.6  2015
Next, use df.sort_index to sort on the original column names (i.e. "X, Y", now appearing as index level 1), and get rid of index level 0 (df.droplevel).
Finally, use df.reset_index with drop=False to insert index as a column and rename all the columns with df.rename.
res = (pd.concat([df.stack(), df_year.stack()], axis=1)
         .sort_index(level=1)
         .droplevel(0)
         .reset_index(drop=False)
         .rename(columns={'index': 'Variable', 0: 'name', 1: 'year'})
      )
# change the order of cols
res = res.iloc[:, [1,2,0]]
print(res)
  name  year Variable
0  5.2  2014        X
1  3.3  2015        X
2  6.5  2014        Y
3  7.6  2015        Y
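For comparison, here is a melt-based sketch that produces the same result. The two DataFrame constructors are just a reconstruction of the sample data shown above, and the approach assumes both frames share the same shape and column order:
import pandas as pd

# Reconstruction of the question's sample frames (values taken from the tables above)
df = pd.DataFrame({"X": [5.2, 3.3], "Y": [6.5, 7.6]})
df_year = pd.DataFrame({"X": [2014, 2015], "Y": [2014, 2015]})

# Melt both frames the same way, then stitch the year column back on
res = pd.concat(
    [df.melt(var_name="Variable", value_name="name"),
     df_year.melt(value_name="year")["year"]],
    axis=1,
)[["name", "year", "Variable"]]
print(res)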

Related

Copy the contents of a set of coordinates into a new set, based on a condition

I am trying, based on the code seen below, to enhance it so that it still does what it does (moving each value from the same row onto the next cell if there's a NaN). The adjustment I am trying to make is: if Jan (the first month of each row) is NaN, then fill it with the last value from the previous year, which here would be Jun = 2.04.
This is what I am using so far:
df.loc[df['Jan'].isna(), 'Jan'] = df[df['Jan'].isna()].apply(lambda x: x[x.notna()][-1], axis=1)
df.loc[:, 'Jan':] = df.loc[:, 'Jan':].ffill(axis=1)
print(df)
Input sample data:
Region-INF  Series Name  Series ID      Jan   Feb  Mar   Apr   May       Jun
Pacific     All Items    CUUR0490SDD    2.9   2.8  NaN   NaN   2.52  **2.04**
Pacific     All Items    CUE07890SDF    NaN  2.64  NaN  2.44   2.59         3
Pacific     All Items    CUE073310SAF   2.1   2.4  NaN  2.21   3.45       NaN
Expected output:
Region-INF  Series Name  Series ID      Jan       Feb   Mar   Apr   May       Jun
Pacific     All Items    CUUR0490SDD    2.9       2.8   2.8   2.8   2.52  **2.04**
Pacific     All Items    CUE07890SDF    **2.04**  2.64  2.64  2.44  2.59         3
Pacific     All Items    CUE073310SAF   2.1       2.4   2.4   2.21  3.45      3.45
Any suggestions how I can modify the existing code?
You can keep the existing forward-fill, then add a helper column holding the last value of the previous row (the previous year's Jun) and use it to fill the remaining NaN in the 'Jan' column:
df.loc[:, "Jan":] = df.loc[:, "Jan":].ffill(axis=1)
# Shift the values in the last column by one row
df["last_col_shifted"] = df.iloc[:, -1].shift(1)
# Use the shifted column to fill in missing values in the 'Jan' column
df.loc[df["Jan"].isna(), "Jan"] = df["last_col_shifted"]
# remove the shifted column
df.drop(["last_col_shifted"], axis=1, inplace=True)
print(df)
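If you prefer to avoid the temporary column, the same fill can be written with fillna on a shifted column. This is a sketch that assumes Jun is the last month column, as in the sample data:
df.loc[:, "Jan":] = df.loc[:, "Jan":].ffill(axis=1)
# Fill a missing Jan with the previous row's (already forward-filled) Jun value
df["Jan"] = df["Jan"].fillna(df["Jun"].shift(1))
print(df)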

Subtract columns from two DFs based on matching condition

Suppose I have the following two DFs:
DF A: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Water 2021.Gas 2022.Electricity
may-04 500 470 473
may-05 520 490 493
may-06 540 510 513
DF B: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Amount 2022.Amount
may-04 100 95
may-05 110 105
may-06 120 115
The expected result is a DF with the columns from DF A, but with each value divided by the value for the matching year in DF B. Such as:
Date 2021.Water 2021.Gas 2022.Electricity
may-04 5.0 4.7 5.0
may-05 4.7 4.5 4.7
may-06 4.5 4.3 4.5
I am really struggling with this problem. Let me know if any clarifications are needed and I will be glad to help.
Try this:
dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)
dfbi = dfb.set_index('Date').rename(columns = lambda x: x.split('.')[0])
df_out = dfai.div(dfbi, level=0).round(1)
df_out.columns = df_out.columns.map('.'.join)
df_out.reset_index()
Output:
Date 2021.Water 2021.Gas 2022.Electricity
0 may-04 5.0 4.7 5.0
1 may-05 4.7 4.5 4.7
2 may-06 4.5 4.2 4.5
Details
First, move 'Date' into the index of both dataframes, then split the column names on '.' so the years become a level of each dataframe's columns.
Use pd.DataFrame.div with level=0 to align the division on the year level of the columns.
Finally, flatten the MultiIndex column header back to a single level and reset_index.
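For reference, here is a self-contained version of this approach, with the sample frames reconstructed from the question's tables (the constructor values below are assumptions read off those tables):
import pandas as pd

dfa = pd.DataFrame({
    "Date": ["may-04", "may-05", "may-06"],
    "2021.Water": [500, 520, 540],
    "2021.Gas": [470, 490, 510],
    "2022.Electricity": [473, 493, 513],
})
dfb = pd.DataFrame({
    "Date": ["may-04", "may-05", "may-06"],
    "2021.Amount": [100, 110, 120],
    "2022.Amount": [95, 105, 115],
})

# Split 'year.measure' headers into a two-level column index on DF A
dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)

# Keep only the year part of DF B's headers so both frames share the year level
dfbi = dfb.set_index('Date').rename(columns=lambda x: x.split('.')[0])

# level=0 aligns the division on the year level of the column index
df_out = dfai.div(dfbi, level=0).round(1)
df_out.columns = df_out.columns.map('.'.join)
print(df_out.reset_index())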

Python multiple lines by year

I have a dataset consisting of two columns: Date ds and volume y. I would like to see how the daily avg volume is trending across different months and years. I would like to have month names on x-axis and avg vol on y-axis. The lines should represent different years. Here is sample dataset and where I am stuck.
df = pd.DataFrame([
    {"ds": "2017-01-01", "y": 3},
    {"ds": "2017-01-18", "y": 4},
    {"ds": "2017-02-04", "y": 6},
    {"ds": "2018-01-06", "y": 2},
    {"ds": "2018-01-12", "y": 8},
    {"ds": "2018-02-08", "y": 2},
    {"ds": "2018-03-02", "y": 8},
    {"ds": "2018-03-15", "y": 2},
    {"ds": "2018-03-22", "y": 8},
])
df["ds"] = pd.to_datetime(df["ds"])
df.set_index("ds",inplace=True)
df.resample("M").mean().plot()
One solution: aggregate the mean by month name and year, reshape with Series.unstack, and finally plot:
df["ds"] = pd.to_datetime(df["ds"])
#if necessary sorting
#df = df.sort_values('ds')
df1 = (df.groupby([df["ds"].dt.strftime('%b'), df["ds"].dt.year], sort=False)['y']
         .mean()
         .unstack(fill_value=0))
print(df1)
ds 2017 2018
ds
Jan 3.5 5.0
Feb 6.0 2.0
Mar 0.0 6.0
df1.plot()
You must group by years and by months:
import calendar  # to use months' proper names

means = (df.groupby([df.index.month, df.index.year]).mean()
           .unstack()
           .reset_index(0, drop=True)
           .rename(dict(enumerate(calendar.month_abbr[1:]))))
#ds 2017 2018
#ds
#Jan 3.5 5.0
#Feb 6.0 2.0
#Mar NaN 6.0
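If you also want the plot with month names on the x-axis and one line per year, a short continuation of the snippet above could look like this (assumptions: pandas is imported as pd, matplotlib is available as the plotting backend, and the droplevel step is only needed if the 'y' column level survives the unstack):
import matplotlib.pyplot as plt

plot_df = means.copy()
if isinstance(plot_df.columns, pd.MultiIndex):
    plot_df.columns = plot_df.columns.droplevel(0)  # drop the residual 'y' level
ax = plot_df.plot(marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Average daily volume")
ax.legend(title="Year")
plt.show()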

How to create new columns by looping through columns in different dataframes?

I have two pd.DataFrames:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equations:
df3['column1'] = df1['Replaced'] - df1['Not_replaced'] + df2['HI']
df3['column2'] = df1['Replaced'] - df1['Not_replaced'] + df2['LO']
df3['column3'] = df1['Replaced'] - df1['Not_replaced'] + df2['RF']
I can merge the two dataframes and manually create the 3 new columns one by one, but I can't figure out how to use a loop to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns=['column1', 'column2', 'column3'])
col_names = ["HI", "LO", "RF"]
for incol, df3column in zip(col_names, df3.columns):
    df3[df3column] = df1['Replaced'] - df1['Not_replaced'] + df2[incol]
print(df3)
Output:
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
For the for-loop approach, I would first merge df1 and df2 to create a new df, called df3. Then I would create a list of the names of the columns you want to iterate through:
df3 = df1.merge(df2, on="Year")
col_names = ["HI", "LO", "RF"]
for col in col_names:
    df3[f"column_{col}"] = df3['Replaced'] - df3['Not_replaced'] + df3[col]

Reshaping Pandas dataframe by months

The task is to transform the below table
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])
df = df.groupby(by=[df.index.year, df.index.month]).sum()
In[1]: df
Out[1]:
           values
2000 1   1.181000
     2  -8.005783
     3   6.590623
     4  -6.266232
     5   1.266315
     6   0.384050
     7  -1.418357
     8  -3.132253
     9   0.005496
     10 -6.646101
     11  9.616482
     12  3.960872
2001 1  -0.989869
     2  -2.845278
     3  -1.518746
     4   2.984735
     5  -2.616795
     6   8.360319
     7   5.659576
     8   0.279863
     9  -5.220678
     10  5.077400
     11  1.332519
such that it looks like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3
Additionally I need to add an extra column which sums the yearly values like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9 4.7
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3 10.7
Is there a quick pandas pivot-style way to solve this?
use strftime('%b') in your groupby
df['values'].groupby([df.index.year, df.index.strftime('%b')]).sum().unstack()
To preserve order of months
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum().unstack()
With 'Year' at end
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum() \
.unstack().assign(Year=df.groupby(df.index.year).sum())
You can do something like this:
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])
l = [df.index.strftime("%Y"), df.index.strftime("%b"), df.index.strftime("%d")]
df.index = l
df=df.groupby(level=[-3,-2]).sum().unstack(-1)
df['Year'] = df.sum(axis=1)
df
Output: a wide table with years as rows, month abbreviations as columns, and a trailing Year total column.
The only change needed is to unstack the DF to convert it into a wide format. Once you have the integer month numbers, you can convert them to datetimes by specifying the %m directive as the format, and then retrieve their string representations with strftime.
The Year column is then the sum across columns, taken with axis=1.
np.random.seed(314)
fr = df.groupby([df.index.year, df.index.month]).sum().unstack(fill_value=0)
fr.columns = pd.to_datetime(fr.columns.droplevel(0), format='%m').strftime('%b')
fr['Year'] = fr.sum(axis=1)
The extra Year column can be added with
df['Year'] = df.sum(axis=1)
This sums the dataframe row-wise (due to axis=1) and stores the result in a new column.
