First time posting, newbie to python.
I have a data frame consisting of 3 columns: ['ID', 'date', 'profit_forecast']
'ID': is product ID
'date': start date
'profit_forecast': a list of 367 items, where the item at position n is the profit forecast for date + n
I am looking to create a new data frame that maps each item in profit_forecast to the ID and the corresponding date + n for its position in the list.
I'm not sure how to start.
Thanks in advance!
If I understand you correctly, the following example data captures the essence of your question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'date': pd.date_range('2019-01-01', freq='YS', periods=3),
                   'profit_forecast': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]})
df
ID date profit_forecast
0 1 2019-01-01 [1, 2, 3]
1 2 2020-01-01 [4, 5]
2 3 2021-01-01 [6, 7, 8, 9]
One solution is to make sure you've upgraded to pandas 0.25, and then to explode the profit_forecast column:
res = df.explode('profit_forecast')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-01 2
0 1 2019-01-01 3
1 2 2020-01-01 4
1 2 2020-01-01 5
2 3 2021-01-01 6
2 3 2021-01-01 7
2 3 2021-01-01 8
2 3 2021-01-01 9
At this point, your question is not clear enough on how you need to increment the dates of each ID. If by "date + n" you mean to add one day to each consecutive date within each ID, then something like this should work:
res['date'] = res['date'] + pd.to_timedelta(res.groupby('ID').cumcount(), 'D')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-02 2
0 1 2019-01-03 3
1 2 2020-01-01 4
1 2 2020-01-02 5
2 3 2021-01-01 6
2 3 2021-01-02 7
2 3 2021-01-03 8
2 3 2021-01-04 9
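If upgrading to pandas 0.25 isn't an option, a rough equivalent of explode can be built with NumPy by repeating each row once per list item. This is only a sketch against the example df defined above:
import numpy as np

lens = df['profit_forecast'].str.len()   # number of forecast items per row
res = pd.DataFrame({
    'ID': df['ID'].repeat(lens).to_numpy(),
    'date': df['date'].repeat(lens).to_numpy(),
    'profit_forecast': np.concatenate(df['profit_forecast'].tolist()),
})
# then shift each date by the item's position within its original list, exactly as above
res['date'] = res['date'] + pd.to_timedelta(res.groupby('ID').cumcount(), 'D')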
Related
I have a dataframe which has 3 columns: id, date and val. The ids are different. I want to put all rows for the same id next to each other: first all rows for one id with its different dates, then all rows for the next distinct id, and so on. Here is a simple example:
import pandas as pd

df = pd.DataFrame()
df['id'] = [10, 2, 3, 10, 10, 2, 2]
df['date'] = ['2020-01-01 12:00:00', '2020-01-01 12:00:00', '2020-01-01 12:00:00',
              '2020-01-01 13:00:00', '2020-01-01 14:00:00',
              '2020-01-01 13:00:00', '2020-01-01 14:00:00']
df['val'] = [0, 1, 2, -3, 4, 6, 7]
The output I want is:
id date val
0 10 2020-01-01 12:00:00 0
1 10 2020-01-01 13:00:00 -3
2 10 2020-01-01 14:00:00 4
3 2 2020-01-01 12:00:00 1
4 2 2020-01-01 13:00:00 6
5 2 2020-01-01 14:00:00 7
6 3 2020-01-01 12:00:00 2
Explanation: We have three ids: 10, 2 and 3. First I want the values, ordered by date, for id=10, then for the next distinct id, and so on. The ids do not need to be sorted.
Can you please help me with that? Thank you
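A minimal sketch of one way to get that ordering, assuming the ids should appear in order of first appearance (10, 2, 3) rather than sorted, starting from the df built above; the name out is just illustrative:
order = pd.unique(df['id'])   # ids in order of first appearance: 10, 2, 3
out = (df.assign(id=pd.Categorical(df['id'], categories=order, ordered=True))
         .sort_values(['id', 'date'])        # sort by date within each id
         .reset_index(drop=True))
out['id'] = out['id'].astype(int)            # back to plain integers
print(out)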
I have the pandas dataframe below:
groupId        date value
      1  2023-01-01     A
      1  2023-01-05     B
      1  2023-01-17     C
      2  2023-01-01     A
      2  2023-01-20     B
      3  2023-01-01     A
      3  2023-01-10     B
      3  2023-01-12     C
I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to the date of the row.
What I would like as a result is something like this:
groupId        date value  newColumn
      1  2023-01-01     A          1
      1  2023-01-05     B          2
      1  2023-01-17     C          2
      2  2023-01-01     A          1
      2  2023-01-20     B          1
      3  2023-01-01     A          1
      3  2023-01-10     B          2
      3  2023-01-12     C          3
I did try groupby(...).rolling('14d').nunique(), and while the rolling function works on numeric fields (to count, compute the mean, etc.), it doesn't work with nunique on string/object fields to count the number of unique values.
You can use the code below to generate the dataframe.
import pandas as pd

df = pd.DataFrame(
    {
        'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
        'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'],  # YYYY-MM-DD
        'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
        'newColumn': [1, 2, 2, 1, 1, 1, 2, 3]  # the expected result
    }
)
df['date'] = pd.to_datetime(df['date'])  # the rolling answers below expect a datetime column
Do you have an idea on how to solve this, even if not using the rolling function? That'd be much appreciated!
Instead of nunique, you can also use count (which happens to give the same result here, since no value repeats within any 14-day window):
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveat: it can be complicated to merge this output back into your original dataframe, unless (groupId, date) is a unique combination.
Update
If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert the value column to a numeric column:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Another possible solution, which does not use rolling:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum()+1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
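For comparison, here is a plain row-by-row sketch that states the definition directly: for each row, count the distinct values of the same groupId whose date falls in the 14-day window ending at that row's date. It is slower than the rolling-based answers and assumes the generated df above (with date already converted to datetime); the names window_nunique and new3 are just illustrative:
def window_nunique(g, days=14):
    # for each row of one group, distinct values with date in (row date - days, row date]
    return pd.Series(
        [g.loc[(g['date'] > d - pd.Timedelta(days=days)) & (g['date'] <= d), 'value'].nunique()
         for d in g['date']],
        index=g.index)

df['new3'] = pd.concat([window_nunique(g) for _, g in df.groupby('groupId')])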
I'd like to convert a DataFrame which has this format:
df = pd.DataFrame({"Date": ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
"A1": [1, 2, 2, 2],
"A2": [9, 2, 2, 3],
"A3": [1, 3, 2, 9],
"B1": [1, 8, 2, 3],
"B2": [3, 8, 9, 3],
"B3": [2, 4, 5, 5]})
      Date  A1  A2  A3  B1  B2  B3
2021-01-01   1   9   1   1   3   2
2021-01-02   2   2   3   8   8   4
2021-01-03   2   2   2   2   9   5
2021-01-04   2   3   9   3   3   5
What I want is to create a table that just has the letters in the rows.
My idea is the following:
Add 2 dummy rows after every row with a date
Copy the values from (X2) and (X3) into those dummy rows for the same date
Delete the columns (X2) and (X3)
Transpose the whole table
The target format looks like this:
Date  2021-01-01 (1)  2021-01-01 (2)  2021-01-02 (3)  2021-01-02 (4)  2021-01-02 (5)  2021-01-02 (6)  2021-01-03 (7)  2021-01-03 (8)  2021-01-03 (9)
A                  1               9               1               2               3               8               2               2               2
B                  1               3               2               8               8               4               2               9               5
I couldn't get it to work; I'll try to post the code later on.
Is there any cleaner, faster way to do so?
Thank you for any help!
Use melt to get the long format, then construct the corresponding date label for each category:
df = pd.melt(df, id_vars='Date') # in each row: 2021-01-01 | A1 | 1
df['idx'] = df['variable'].str[:-1] # A, B, ...
df['Date'] = df['Date'].astype(str) + ' (' + df['variable'].str[-1] + ')'
df = df[['Date', 'idx', 'value']].pivot(values='value', index='idx', columns='Date')
Set df.index.name = None if you don't want the index name (idx) shown:
Date 2021-01-01 (1) 2021-01-01 (2) 2021-01-01 (3) 2021-01-02 (1) 2021-01-02 (2) 2021-01-02 (3) 2021-01-03 (1) 2021-01-03 (2) 2021-01-03 (3) 2021-01-04 (1) 2021-01-04 (2) 2021-01-04 (3)
A 1 9 1 2 2 3 2 2 2 2 3 9
B 1 3 2 8 8 4 2 9 5 3 3 5
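An alternative sketch for the same reshape, starting again from the original wide df: split the column names into a letter/number MultiIndex and stack. This isn't the answer above, just another route to the same layout, with out as an illustrative name:
out = df.set_index('Date')
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns.str[1:]])
out = out.stack(level=1)                                 # rows: (Date, number); columns: A, B
out.index = out.index.map(lambda t: f'{t[0]} ({t[1]})')  # e.g. '2021-01-01 (1)'
out = out.T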
I have a df like:
and I have to filter my df so that, for each ID, only the values within two weeks are kept,
so for each ID, I have to look ahead two weeks from the first date and only keep those records.
Output:
I tried creating a min date for each ID and using the code below to filter:
df[df.date.between(df['min_date'],df['min_date']+pd.DateOffset(days=14))]
Is there a more efficient way than this? It is taking a lot of time since my dataframe is big.
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': np.repeat([2, 3, 4], [4, 3, 4]),
    'Date': ['12/31/2019', '1/1/2020', '1/5/2020', '1/20/2020',
             '1/5/2020', '1/10/2020', '1/30/2020', '2/2/2020',
             '2/4/2020', '2/10/2020', '2/25/2020'],
    'Value': [*'abcbdeefffg']
})
First, convert Date to Timestamp with to_datetime
df['Date'] = pd.to_datetime(df['Date'])
concat with groupby in a comprehension
pd.concat([
d[d.Date <= d.Date.min() + pd.offsets.Day(14)]
for _, d in df.groupby('Id')
])
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
boolean slice... also with groupby
df[df.Date <= df.Id.map(df.groupby('Id').Date.min() + pd.offsets.Day(14))]
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
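A similar one-liner sketch uses transform to broadcast each Id's minimum date directly, avoiding the map through a separate Series (same result on the setup above):
df[df.Date <= df.groupby('Id').Date.transform('min') + pd.offsets.Day(14)]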
I struggle with pandas.concat, so you can try using merge:
# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# Get min Date for each Id and add two weeks (14 days)
s = df.groupby('Id')['Date'].min() + pd.offsets.Day(14)
# Merge df and s
df = df.merge(s, left_on='Id', right_index=True)
# Keep records where Date is less than the allowed limit
df = df.loc[df['Date_x'] <= df['Date_y'], ['Id','Date_x','Value']]
# Rename Date_x to Date (optional)
df.rename(columns={'Date_x':'Date'}, inplace=True)
The result is:
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
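A small variation on the same merge idea, sketched against the setup frame above (with Date already converted to datetime): rename the Series before merging, so there are no _x/_y suffixes to clean up. Limit is just an illustrative column name:
# per-Id cutoff date, named so the merge produces a 'Limit' column
s = (df.groupby('Id')['Date'].min() + pd.offsets.Day(14)).rename('Limit')
out = df.merge(s, left_on='Id', right_index=True)
out = out.loc[out['Date'] <= out['Limit'], ['Id', 'Date', 'Value']]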
I know how to use the groupby method with ffill or bfill to impute missing values. But my problem here is different: for each null in the "score" column, I need to find the row (within the same group) whose "date" is closest to that row's date, and if that row's score is not null, impute with it. If it is null, I need to search for the next nearest date. I can iterate through the rows and do it, but it is very slow.
This is an example of the data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'cn': [1, 1, 1, 1, 2, 2, 2],
     'date': ['01/10/2017', '02/09/2016', '02/10/2016', '01/20/2017', '05/15/2019', '02/10/2016', '02/10/2017'],
     'score': [np.nan, np.nan, 6, 5, 4, np.nan, 8]})
cn date score
0 1 01/10/2017 NaN
1 1 02/09/2016 NaN
2 1 02/10/2016 6
3 1 01/20/2017 5
4 2 05/15/2019 4
5 2 02/10/2016 NaN
6 2 02/10/2017 8.0
output should be
cn date score
0 1 01/10/2017 5
1 1 02/09/2016 6
2 1 02/10/2016 6
3 1 01/20/2017 5
4 2 05/15/2019 4
5 2 02/10/2016 8
6 2 02/10/2017 8
How can I do it using the groupby method and an apply function?
Use pd.merge_asof to get the Series of the closest match and then just .fillna. There's some manipulation to make sure things align on index in the end.
import pandas as pd
df['date'] = pd.to_datetime(df.date)
s = (pd.merge_asof(
df.sort_values('date').reset_index(), # Full Data Frame
df.sort_values('date').dropna(subset=['score']), # Subset with valid scores
by='cn', # Only within `'cn'` group
on='date', direction='nearest' # Match closest date
)
.set_index('index')
.score_y)
df['score'] = df.score.fillna(s, downcast='infer')
Output: df
cn date score
0 1 2017-01-10 5
1 1 2016-02-09 6
2 1 2016-02-10 6
3 1 2017-01-20 5
4 2 2019-05-15 4
5 2 2016-02-10 8
6 2 2017-02-10 8
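Since the question asked about the groupby/apply route specifically, here is a hedged sketch of that approach as well. It scans the whole group for every missing score, so it is slower than merge_asof; the helper name fill_nearest is just illustrative, and it assumes date has already been converted with to_datetime as above:
def fill_nearest(g):
    # g: all rows of one 'cn' group
    valid = g.dropna(subset=['score'])          # rows that already have a score
    def nearest_score(d):
        # score of the valid row whose date is closest to d
        return valid.loc[(valid['date'] - d).abs().idxmin(), 'score']
    g = g.copy()
    mask = g['score'].isna()
    g.loc[mask, 'score'] = g.loc[mask, 'date'].map(nearest_score)
    return g

df = df.groupby('cn', group_keys=False).apply(fill_nearest)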