Rearrange Pandas Dataframe, split rows and transpose - python

I'd like to convert a DataFrame which has this format:
df = pd.DataFrame({"Date": ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
"A1": [1, 2, 2, 2],
"A2": [9, 2, 2, 3],
"A3": [1, 3, 2, 9],
"B1": [1, 8, 2, 3],
"B2": [3, 8, 9, 3],
"B3": [2, 4, 5, 5]})
Date        A1  A2  A3  B1  B2  B3
2021-01-01   1   9   1   1   3   2
2021-01-02   2   2   3   8   8   4
2021-01-03   2   2   2   2   9   5
2021-01-04   2   3   9   3   3   5
What I want is to create a table that just has the letters as rows.
My idea is the following:
Add 2 dummy rows after every row with a date
Copy the values from (X2) and (X3) into those dummy rows for the same date
Delete the columns (X2) and (X3)
Transpose the whole table
The target format looks like this:
Date 2021-01-01 (1) 2021-01-01 (2) 2021-01-02 (3) 2021-01-02 (4) 2021-01-02 (5) 2021-01-02 (6) 2021-01-03 (7) 2021-01-03 (8) 2021-01-03 (9)
A 1 9 1 2 3 8 2 2 2
B 1 3 2 8 8 4 2 9 5
I couldn't get it to work; I'll try to post the code later on.
Is there any cleaner, faster way to do so?
Thank you for any help!

Use melt to get the long format, then construct the corresponding date label for each category:
df = pd.melt(df, id_vars='Date') # in each row: 2021-01-01 | A1 | 1
df['idx'] = df['variable'].str[:-1] # A, B, ...
df['Date'] = df['Date'].astype(str) + ' (' + df['variable'].str[-1] + ')'
df = df[['Date', 'idx', 'value']].pivot(values='value', index='idx', columns='Date')
Set df.index.name = None if you don't want the 'idx' label shown in the output:
Date 2021-01-01 (1) 2021-01-01 (2) 2021-01-01 (3) 2021-01-02 (1) 2021-01-02 (2) 2021-01-02 (3) 2021-01-03 (1) 2021-01-03 (2) 2021-01-03 (3) 2021-01-04 (1) 2021-01-04 (2) 2021-01-04 (3)
A 1 9 1 2 2 3 2 2 2 2 3 9
B 1 3 2 8 8 4 2 9 5 3 3 5
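For completeness, here is a self-contained sketch of the same melt/pivot pipeline (the function name is just illustrative, and it assumes every column name ends in a single digit):
import pandas as pd

def letters_as_rows(df):
    # Long format: one row per (Date, variable) pair, e.g. 2021-01-01 | A1 | 1
    long = pd.melt(df, id_vars='Date')
    long['idx'] = long['variable'].str[:-1]  # letter part: A, B, ...
    long['Date'] = long['Date'].astype(str) + ' (' + long['variable'].str[-1] + ')'
    out = long.pivot(values='value', index='idx', columns='Date')
    out.index.name = None  # hide the 'idx' label
    return out

df = pd.DataFrame({"Date": ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
                   "A1": [1, 2, 2, 2], "A2": [9, 2, 2, 3], "A3": [1, 3, 2, 9],
                   "B1": [1, 8, 2, 3], "B2": [3, 8, 9, 3], "B3": [2, 4, 5, 5]})
print(letters_as_rows(df))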

Related

How to count the number of unique values per group over the last n days

I have the pandas dataframe below:
groupId  date        value
1        2023-01-01  A
1        2023-01-05  B
1        2023-01-17  C
2        2023-01-01  A
2        2023-01-20  B
3        2023-01-01  A
3        2023-01-10  B
3        2023-01-12  C
I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to the date of the row.
What I would like as a result is something like this:
groupId  date        value  newColumn
1        2023-01-01  A      1
1        2023-01-05  B      2
1        2023-01-17  C      2
2        2023-01-01  A      1
2        2023-01-20  B      1
3        2023-01-01  A      1
3        2023-01-10  B      2
3        2023-01-12  C      3
I did try groupby(...).rolling('14d').nunique(); while rolling works on numeric fields to count, compute the mean, etc., it doesn't work with nunique on string fields to count the number of unique string/object values.
You can use the code below to generate the dataframe.
pd.DataFrame(
    {
        'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
        'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'],  # YYYY-MM-DD
        'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
        'newColumn': [1, 2, 2, 1, 1, 1, 2, 3]
    }
)
Do you have an idea on how to solve this, even if not using the rolling function? That'd be much appreciated!
Instead of nunique, you can also use count:
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveat: it can be complicated to merge this output with your original dataframe unless (groupId, date) is a unique combination.
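If (groupId, date) is unique, merging back is a simple left join; a minimal sketch reusing the rolling count from above:
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01',
                            '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']})

counts = (df.groupby('groupId').rolling('14D', on='date')['value'].count()
            .astype(int).rename('newColumn').reset_index())

# Left join on the (groupId, date) key; duplicated keys would fan rows out.
print(df.merge(counts, on=['groupId', 'date'], how='left'))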
Update
If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert the value column to a numeric column:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Another possible solution, which does not use rolling:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum()+1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
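For readability, the lambda above can be unpacked into named steps; this sketch is equivalent (the helper name is just illustrative):
def count_recent(dates):
    elapsed = dates.diff().dt.days.cumsum()   # days since the group's first date (NaN on the first row)
    inside = elapsed.le(14)                   # True while still within 14 days of that first date
    fresh = inside.mul(~dates.duplicated())   # don't count a date already seen in the group
    return fresh.cumsum() + 1                 # running count, starting at 1

df['new2'] = df.groupby('groupId')['date'].transform(count_recent)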

Define start and end date of several DataFrames with pandas

I have many DataFrames with different period lengths. I am trying to create a for loop that sets a specific start and end day for all of those DataFrames.
Here is a simple example:
df1:
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2
4 2021-01-05 1 4
5 2021-01-06 -1 -2
df2:
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1
4 2021-01-05 4 2
I want to define a specific start and end day as:
start = pd.to_datetime('2021-01-02')
end = pd.to_datetime('2021-01-04')
So far, I only figured out how to define the period for one DataFrame:
df1.loc[(df1['Dates'] >= start) & (df1['Dates'] <= end)]
Is there an easy method to loop over all DataFrames at the same time to define the start and end dates?
For reproducibility:
import pandas as pd
df1 = pd.DataFrame({
    'Dates': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06'],
    'ID1': [0, 0, 1, 2, 1, -1],
    'ID2': [1, 0, 0, 2, 4, -2]})
df1['Dates'] = pd.to_datetime(df1['Dates'])
df2 = pd.DataFrame({
    'Dates': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'ID1': [0, 1, -1, 1, 4],
    'ID2': [1, 2, 3, -1, 2]})
df2['Dates'] = pd.to_datetime(df2['Dates'])
You can store your dataframes in a list, then apply your loc filter to all the dataframes in the list using a list comprehension, returning a new list of the filtered dataframes:
# Create a list with your dataframes
dfs = [df1 , df2]
# Thresholds
start = pd.to_datetime('2021-01-02')
end = pd.to_datetime('2021-01-04')
# Filter all of them and store back
filtered_dfs = [df.loc[(df['Dates'] >= start) & (df['Dates'] <= end)] for df in dfs]
Result:
>>> print(filtered_dfs)
[ Dates ID1 ID2
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2,
Dates ID1 ID2
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1]
>>> print(dfs)
[ Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2
4 2021-01-05 1 4
5 2021-01-06 -1 -2,
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1
4 2021-01-05 4 2]
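If you prefer to keep the frames addressable by name rather than by position, the same comprehension works over a dict (a small sketch, reusing start and end from above):
dfs = {'df1': df1, 'df2': df2}
filtered_dfs = {name: d.loc[(d['Dates'] >= start) & (d['Dates'] <= end)]
                for name, d in dfs.items()}
print(filtered_dfs['df2'])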

Pandas cumulative sum depending on other columns value

I have a Dataset like this
Date Runner Group distance [km]
2021-01-01 Joe 1 7
2021-01-02 Jack 1 6
2021-01-03 Jess 1 9
2021-01-01 Paul 2 11
2021-01-02 Peter 2 12
2021-01-02 Sara 3 15
2021-01-03 Sarah 3 10
and I want to calculate the cumulative sum for each group of runners.
Date Runner Group distance [km] cum sum [km]
2021-01-01 Joe 1 7 7
2021-01-02 Jack 1 6 13
2021-01-03 Jess 1 9 22
2021-01-01 Paul 2 11 11
2021-01-02 Peter 2 12 23
2021-01-02 Sara 3 15 15
2021-01-03 Sarah 3 10 25
Unfortunately, I have no idea how to do this and I didn't find the answer anywhere else. Could someone give me a hint?
import pandas as pd
import numpy as np
df = pd.DataFrame([['2021-01-01', 'Joe', 1, 7],
                   ['2021-01-02', 'Jack', 1, 6],
                   ['2021-01-03', 'Jess', 1, 9],
                   ['2021-01-01', 'Paul', 2, 11],
                   ['2021-01-02', 'Peter', 2, 12],
                   ['2021-01-02', 'Sara', 3, 15],
                   ['2021-01-03', 'Sarah', 3, 10]],
                  columns=['Date', 'Runner', 'Group', 'distance [km]'])
Try groupby cumsum:
>>> df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()
>>> df
Date Runner Group distance [km] cum sum [km]
0 2021-01-01 Joe 1 7 7
1 2021-01-02 Jack 1 6 13
2 2021-01-03 Jess 1 9 22
3 2021-01-01 Paul 2 11 11
4 2021-01-02 Peter 2 12 23
5 2021-01-02 Sara 3 15 15
6 2021-01-03 Sarah 3 10 25
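Note that cumsum accumulates in row order, so if the rows are not already sorted by date within each group, sort first (a small sketch):
df = df.sort_values(['Group', 'Date'])
df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()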

Difference many columns from a baseline column in pandas

I have a baseline column (base) in a pandas data frame and I want to difference all other columns x* from this column while preserving two groups group1, group2:
The easiest way is to simply compute the difference column by column:
df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [5, 6, 7, 8]})
df['diff_x1'] = df['x1'] - df['base']
df['diff_x2'] = df['x2'] - df['base']
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
But I have hundreds of columns I need to do this for, so I'm looking for a more efficient way.
You can subtract a Series from a dataframe column wise using the sub method with axis=0, which can save you from doing the subtraction for each column individually:
to_sub = df.filter(regex='x.*') # filter based on your actual logic
pd.concat([
df,
to_sub.sub(df.base, axis=0).add_prefix('diff_')
], axis=1)
# group1 group2 base x1 x2 diff_x1 diff_x2
#0 0 2 0 3 5 3 5
#1 0 2 1 4 6 3 5
#2 1 3 2 5 7 3 5
#3 1 4 3 6 8 3 5
Another way is to use df.drop(..., axis=1) and then pass the remaining columns into sub(..., axis=0). This guarantees you catch all columns and preserve their order, and you don't even need a regex.
df_diff = df.drop(['group1','group2','base'], axis=1).sub(df['base'], axis=0).add_prefix('diff_')
diff_x1 diff_x2
0 3 5
1 3 5
2 3 5
3 3 5
Hence your full solution is:
pd.concat([df, df_diff], axis=1)
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
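If you would rather overwrite the x* columns in place instead of adding diff_ columns, the same sub call can assign straight back (a sketch reusing to_sub from the first snippet):
df[to_sub.columns] = to_sub.sub(df['base'], axis=0)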

How to unpack a list in a dataframe

First time posting, newbie to python.
I have a data frame consisting of 3 columns: ['ID', 'date', 'profit_forecast']
'ID': is product ID
'date': start date
'profit_forecast': a list containing 367 items, each item is a profit forecast for date+n
I am looking to create a new data frame that maps each item in profit_forecast to the ID and corresponding date+n for its position in the list.
Not sure how to start.
Thanks in advance!
If I understand you correctly, the following example data captures the essence of your question:
df = pd.DataFrame({'ID': [1, 2, 3],
                   'date': pd.date_range('2019-01-01', freq='YS', periods=3),
                   'profit_forecast': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]})
df
ID date profit_forecast
0 1 2019-01-01 [1, 2, 3]
1 2 2020-01-01 [4, 5]
2 3 2021-01-01 [6, 7, 8, 9]
One solution is to make sure you've upgraded to pandas 0.25 (or later), and then explode the profit_forecast column:
res = df.explode('profit_forecast')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-01 2
0 1 2019-01-01 3
1 2 2020-01-01 4
1 2 2020-01-01 5
2 3 2021-01-01 6
2 3 2021-01-01 7
2 3 2021-01-01 8
2 3 2021-01-01 9
At this point, your question is not clear enough on how you need to increment the dates of each ID. If by "date + n" you mean to add one day to each consecutive date within each ID, then something like this should work:
res['date'] = res['date'] + pd.to_timedelta(res.groupby('ID').cumcount(), 'D')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-02 2
0 1 2019-01-03 3
1 2 2020-01-01 4
1 2 2020-01-02 5
2 3 2021-01-01 6
2 3 2021-01-02 7
2 3 2021-01-03 8
2 3 2021-01-04 9
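One follow-up worth noting: after explode, the profit_forecast values keep their original object dtype and the old index repeats per source row, so a small cleanup may help (a sketch, assuming numeric forecasts):
res = res.reset_index(drop=True)
res['profit_forecast'] = pd.to_numeric(res['profit_forecast'])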
