I have a DataFrame with a date column, and sometimes a date appears twice.
When I write to a certain date, I would like to write to the last row that has this date, not the first.
Right now I use:
df.loc[df['date'] == date, columnA] = value
Which, in the case of a df like this, will write at index 1, not 2:
        date  columnA
0  17.4.2022
1  17.5.2022
2  17.5.2022  value   # in the case of 17.5, write the data to this row
3  17.6.2022
How can I make sure I am always writing to the last row that has the date?
You can chain a mask for the last duplicated date value using Series.duplicated:
print (df)
        date  columnA
0  17.4.2022        8
1  17.5.2022        1
2  17.5.2022        1
2  17.5.2022        1
3  17.6.2022        3
date = '17.5.2022'
df.loc[(df['date'] == date) & ~df['date'].duplicated(keep='last'), 'columnA'] = 100
print (df)
        date  columnA
0  17.4.2022        8
1  17.5.2022        1
2  17.5.2022        1
2  17.5.2022      100
3  17.6.2022        3
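If you prefer not to build a second mask, a sketch of an equivalent approach is to grab the last matching index label directly and write to it. This assumes a unique row index, as in the question's original frame (with the duplicated index label in the sample above, a label-based write would hit both rows):

```python
import pandas as pd

df = pd.DataFrame({'date': ['17.4.2022', '17.5.2022', '17.5.2022', '17.6.2022'],
                   'columnA': [8, 1, 1, 3]})

date = '17.5.2022'
# Index labels of all rows matching the date; take the last one
matches = df.index[df['date'] == date]
if len(matches) > 0:
    df.loc[matches[-1], 'columnA'] = 100
```

This writes 100 only to index 2, leaving the first 17.5.2022 row untouched.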
I am trying to create a new column in pandas based on adding previous column values and a second column.
To clarify, the project is building a schedule that adds the duration and day of the predecessor(s) to create the start day of the current row. Essentially, ID 1 has no predecessor, so its Day becomes 1 (just a default). ID 2 has predecessor ID 1, so we take ID 1's Day (1) and add ID 1's Duration (1), making ID 2's Day 2.
I have a dataframe that looks like:
Templ  ID  Predecessors  Duration
aaa    1   0             1
aaa    2   1             2
aaa    3   1             3
aaa    3   2             3
aaa    4   3             8
Note that there are more "Templ" values than just one, and the "ID" can repeat within each "Templ" value.
Currently I'm using this function, which is failing:
def day_col(row):
    if row['Predecessors'] < 1:
        return 1
    elif row['Predecessors'] >= 1:
        pred = row['Predecessors']
        sche = row["Templ"]
        newRow = df.loc[(df['ID'] == pred) & (df['Templ'] == sche)]
        return newRow['Day'] + newRow['Duration']

df["Day"] = df.apply(lambda row: day_col(row), axis=1)
This results in KeyError: 'Day'.
The end result is supposed to look like this:
Templ  ID  Predecessors  Duration  Day
aaa    1   0             1         1
aaa    2   1             2         2
aaa    3   1             3         2
aaa    3   2             3         4
aaa    4   3             8         7
My function could be way off, so don't feel the need to fix that code if I am going about it the wrong way.
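One likely cause of the KeyError is that the Day column does not exist yet when apply reads it, and apply would not see values written earlier in the same pass anyway. A minimal iterative sketch (assuming predecessors always appear in earlier rows, and taking the latest finish, max of Day + Duration, when an ID is duplicated, which reproduces the expected output above):

```python
import pandas as pd

df = pd.DataFrame({'Templ': ['aaa'] * 5,
                   'ID': [1, 2, 3, 3, 4],
                   'Predecessors': [0, 1, 1, 2, 3],
                   'Duration': [1, 2, 3, 3, 8]})

df['Day'] = 1  # default; also ensures the column exists before we read it
for i in df.index:
    pred = df.at[i, 'Predecessors']
    if pred >= 1:
        # rows of the same Templ whose ID matches this row's predecessor
        mask = (df['ID'] == pred) & (df['Templ'] == df.at[i, 'Templ'])
        pred_rows = df.loc[mask]
        # latest finish over all matching predecessor rows
        df.at[i, 'Day'] = (pred_rows['Day'] + pred_rows['Duration']).max()
```

This yields Day values 1, 2, 2, 4, 7 on the sample above.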
I have a pandas dataframe that contains dates in the column Date. I need to add another column, Days, which contains the day difference from the previous cell, so the date in the ith cell is differenced against the (i-1)th. The first difference should be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried a few things, but nothing was useful.
First convert the Date column to pandas datetime, then calculate the difference, which is a timedelta object; from there, take the days via Series.dt and assign 0 to the first value:
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
You can use diff as well:
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1
I have a dataframe like in the pic below.
First, I want the top 2 products overall; second, I need the top 2 most frequent products per day, so I need to group by day and select the top 2 products from the products column. I tried this code, but it gives an error:
df.groupby("days", as_index=False)(["products"] == "Follow Up").count()
You need to groupby over both day and products and then use size. Once you have done this, you will have all the counts you require in the df.
You will then need to sort by both the day and the default 0 column, which now contains your counts and was created by resetting the index on the initial groupby.
We follow the instructions in Pandas get topmost n records within each group to give your desired result.
A full example:
Setup:
df = pd.DataFrame({'day':[1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3],
'value':['a','a','b','b','b','c','a','a','b','b','b','c','a','a','b','b','b','c']})
df.head(6)
day value
0 1 a
1 1 a
2 1 b
3 1 b
4 1 b
5 1 c
df_counts = df.groupby(['day','value']).size().reset_index().sort_values(['day', 0], ascending = [True, False])
df_top_2 = df_counts.groupby('day').head(2)
df_top_2
day value 0
1 1 b 3
0 1 a 2
4 2 b 3
3 2 a 2
7 3 b 3
6 3 a 2
Of course, you should rename the 0 column to something more reasonable but this is a minimal example.
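Following up on that renaming point, `reset_index` can name the size column directly, avoiding the anonymous 0 column altogether. A sketch on a reduced two-day sample:

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'value': ['a', 'a', 'b', 'b', 'b', 'c'] * 2})

# name=... gives the size column a real name instead of 0
df_counts = (df.groupby(['day', 'value']).size()
               .reset_index(name='count')
               .sort_values(['day', 'count'], ascending=[True, False]))
df_top_2 = df_counts.groupby('day').head(2)
```

On this sample, each day keeps its top two values, b (3) and a (2).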
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with aggregate size.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against the last date per group, adding or subtracting a year offset, you need a custom function with the condition, summing the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
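The same per-year counts can be pivoted into a wide year-by-id table with unstack. A sketch rebuilding the year data from above:

```python
import pandas as pd

x = pd.DataFrame({'date': [2016, 2016, 2017, 2014, 2014, 2017],
                  'id': ['a', 'a', 'a', 'b', 'b', 'b']})

# unstack turns the (date, id) counts into a year-by-id table, 0 for missing pairs
counts = x.groupby(['date', 'id']).size().unstack(fill_value=0)
```

This gives one row per year and one column per id, with 0 where a pair never occurs.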
Using resample takes care of missing in-between years (see year 2015):
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
How do I consolidate time periods data in Python pandas?
I want to manipulate data from
person start end
1 2001-1-8 2002-2-14
1 2002-2-14 2003-3-1
2 2001-1-5 2002-2-16
2 2002-2-17 2003-3-9
to
person start end
1 2001-1-8 2002-3-1
2 2001-1-5 2002-3-9
I want to check first whether the last end and the new start are within 1 day of each other. If not, keep the original data structure; if so, consolidate.
from datetime import timedelta

df.sort_values(["person", "start", "end"], inplace=True)

def condense(df):
    df['prev_end'] = df["end"].shift(1)
    df['dont_condense'] = (abs(df['prev_end'] - df['start']) > timedelta(days=1))
    df["group"] = df['dont_condense'].fillna(False).cumsum()
    return df.groupby("group").apply(lambda x: pd.Series({"person": x.iloc[0].person,
                                                          "start": x.iloc[0].start,
                                                          "end": x.iloc[-1].end}))

df.groupby("person").apply(condense).reset_index(drop=True)
You can use this if each group contains only 2 rows, the required difference is 1 or 0 days, and all data are sorted:
print (df)
person start end
0 1 2001-1-8 2002-2-14
1 1 2002-2-14 2003-3-1
2 2 2001-1-5 2002-2-16
3 2 2002-2-17 2003-3-9
4 3 2001-1-2 2002-2-14
5 3 2002-2-17 2003-3-10
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)
def f(x):
    # if you need a difference of only 0 days, use
    # a = (x['start'] - x['end'].shift()) == pd.Timedelta(days=0)
    a = (x['start'] - x['end'].shift()).isin([pd.Timedelta(days=1), pd.Timedelta(days=0)])
    if a.any():
        x.end = x['end'].shift(-1)
    return x
df1 = df.groupby('person').apply(f).dropna().reset_index(drop=True)
print (df1)
person start end
0 1 2001-01-08 2003-03-01
1 2 2001-01-05 2003-03-09
2 3 2001-01-02 2002-02-14
3 3 2002-02-17 2003-03-10
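For groups with an arbitrary number of rows, a sketch of a more general approach labels consecutive chains by cumulatively summing a "gap larger than one day" flag per person, then aggregates each chain:

```python
import pandas as pd

df = pd.DataFrame({'person': [1, 1, 2, 2, 3, 3],
                   'start': ['2001-1-8', '2002-2-14', '2001-1-5', '2002-2-17',
                             '2001-1-2', '2002-2-17'],
                   'end': ['2002-2-14', '2003-3-1', '2002-2-16', '2003-3-9',
                           '2002-2-14', '2003-3-10']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# A new chain starts when the gap to the previous end (within a person)
# exceeds 1 day; the cumulative sum of that flag labels each chain
gap = df['start'] - df.groupby('person')['end'].shift()
new_chain = gap.isna() | (gap > pd.Timedelta(days=1))
df['chain'] = new_chain.cumsum()

# One row per chain: first start, last end
out = (df.groupby(['person', 'chain'], as_index=False)
         .agg(start=('start', 'first'), end=('end', 'last'))
         .drop(columns='chain'))
```

On the sample above this merges persons 1 and 2 into one period each and keeps person 3's two periods separate, matching the output of the groupby/apply answer.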