Pandas change values based on previous value in same column - python

I have the following dataframe:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(year=2022, month=5, day=1),
                            datetime.date(year=2022, month=11, day=1),
                            datetime.date(year=2022, month=10, day=1),
                            datetime.date(year=2022, month=11, day=1)],
                   "Lifecycle ID": [5, 5, 5, 5]})
I need to change the Lifecycle ID based on the value the same ID had 6 months ago (if it was 5, it should always become 6, not +1).
I'm currently trying:
df.loc[(df["Date"] == (df["Date"] - pd.DateOffset(months=6))) & (df["Lifecycle ID"] == 5), "Lifecycle ID"] = 6
However, this doesn't take the ID into account and I don't know how to fix it.
The output should be this dataframe (only the last Lifecycle ID changed to 6):
   ID        Date  Lifecycle ID
0   1  2022-05-01             5
1   2  2022-11-01             5
2   1  2022-10-01             5
3   1  2022-11-01             6
Could you please help me here?

The logic is not fully clear, but if I guess correctly:
# ensure datetime type
df['Date'] = pd.to_datetime(df['Date'])
# add the time delta to form a helper DataFrame
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))
# merge on ID/Date, retrieve "Lifecycle ID"
# and check if the value is 5
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
# if it is, update the value
df.loc[m, 'Lifecycle ID'] = 6
If you want to increment the value automatically from the value 6 months ago:
s = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID']
df.loc[s.notna(), 'Lifecycle ID'] = s.add(1)
Output:
ID Date Lifecycle ID
0 1 2022-05-01 5
1 2 2022-11-01 5
2 1 2022-10-01 5
3 1 2022-11-01 6

Related

Groupby count per category per month (Current month vs Remaining past months) in separate columns in pandas

Let's say I have the following dataframe:
I am trying to get something like this.
I was thinking of maybe using the rolling function, keeping separate dataframes for each count type (current month and past 3 months), and then merging them based on ID.
I am new to Python and pandas, so please bear with me if it's a simple question. I am still learning :)
EDIT:
@furas so I started by calculating the cumulative sum for all the counts as separate columns:
df['f_count_cum'] = df.groupby(["ID"])['f_count'].transform(lambda x: x.expanding().sum())
df['t_count_cum'] = df.groupby(["ID"])['t_count'].transform(lambda x: x.expanding().sum())
and then just get the current month df by
df_current = df[df.index == max(df.index)]
df_past_month = df[df.index == (max(df.index) - 1)]
and then just merge the two dataframes based on the ID?
I am not sure if it's correct, but this is my first take on this.
A few assumptions, looking at the input sample:
The Month column is of datetime64[ns] type. If not, use the line below to cast it:
df['Month'] = pd.to_datetime(df.Month)
The Month column is the index. If not, set it as the index:
df = df.set_index('Month')
The last month of the df is treated as the current month and the first 3 months as the 'past 3 months'. If not, modify the last and first calls in df1 and df2 respectively.
Code
df1 = df.last('M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(current month)',
             't_count': 't_count(current month)'})
df2 = df.first('3M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 3 months)',
             't_count': 't_count(past 3 months)'})
df = pd.merge(df1, df2, on='ID', how='inner').reindex(columns=[
    'ID',
    'f_count(current month)', 'f_count(past 3 months)',
    't_count(current month)', 't_count(past 3 months)'])
Output
ID f_count(current month) f_count(past 3 months) t_count(current month) t_count(past 3 months)
0 A 3 13 8 14
1 B 3 5 7 5
2 C 1 3 2 4
Another version of the same code, if you prefer a function and a single statement:
def get_df(freq):
    if freq == 'M':
        return df.last('M').groupby('ID').sum().reset_index()
    return df.first('3M').groupby('ID').sum().reset_index()

df = pd.merge(get_df('M').rename(
                  columns={'f_count': 'f_count(current month)',
                           't_count': 't_count(current month)'}),
              get_df('3M').rename(
                  columns={'f_count': 'f_count(past 3 months)',
                           't_count': 't_count(past 3 months)'}),
              on='ID').reindex(columns=[
                  'ID',
                  'f_count(current month)', 'f_count(past 3 months)',
                  't_count(current month)', 't_count(past 3 months)'])
EDIT:
For the previous two months before the current month (different combinations of the first and last functions can be used as needed):
df2 = df.last('3M').first('2M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 2 months)',
             't_count': 't_count(past 2 months)'})
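Note: DataFrame.first and DataFrame.last are deprecated in recent pandas (2.1+). If you are on a newer version, a rough sketch of the 'current month' slice using the DatetimeIndex directly (it keys on calendar month, so it is not an exact drop-in for last('M')):
# keep only rows whose month equals the last calendar month present in the index
cur = df.index.to_period('M') == df.index.max().to_period('M')
df1 = df[cur].groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(current month)',
             't_count': 't_count(current month)'})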

how to drop rows based on another column, but only if it has multiple different values

What I have?
I have a dataframe like this:
id value
0 0 5
1 0 5
2 0 6
3 1 7
4 1 7
What I want to get?
I want to drop all the rows whose id has more than one distinct value. In the example above I want to drop all the rows with id = 0:
id value
3 1 7
4 1 7
What I have tried?
import pandas as pd
df = pd.DataFrame({'id':[0, 0, 0, 1, 1], 'value':[5,5,6,7,7]})
print(df)
id_list = df['id'].tolist()
id_set = set(id_list)
for id in id_set:
    temp_list = df.loc[df['id'] == id, 'value'].tolist()
    s = set(temp_list)
    if len(s) > 1:
        df = df.loc[df['id'] != id]
It works, but it's ugly and inefficient.
Is there a better pythonic way using pandas methods?
Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group as a Series aligned to the original index, then compare and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform('nunique').eq(1)]
print (df)
id value
3 1 7
4 1 7
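If you prefer to keep everything inside the groupby, an equivalent sketch using GroupBy.filter (usually slower than transform on large frames):
# keep only the groups whose 'value' column has exactly one distinct value
df = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)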
# Try this code
import pandas as pd

id1 = pd.Series([0, 0, 0, 1, 1])
value = pd.Series([5, 5, 6, 7, 7])
data = pd.DataFrame({'id': id1, 'value': value})
datag = data.groupby('id')

# collect the indices of rows whose id maps to more than one distinct value
datadel = []
for i in set(data.id):
    if len(set(datag.get_group(i)['value'])) != 1:
        datadel.extend(data.loc[data["id"] == i].index.tolist())

data.drop(datadel, inplace=True)
print(data)

Changing format of date in pandas dataframe

I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
OK, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is less than 5 (Monday-Friday --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': [
        '20180101',
        '20180102',
        '20180103',
        '20180104',
        '20180105',
        '20180106',
        '20180107'
    ],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
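If you also want to keep the parsed dates rather than only use them for the mask, a small sketch assuming every string is a valid yyyymmdd value:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')  # yyyymmdd strings -> datetime64
df2 = df[df['date'].dt.dayofweek < 5]                      # keep Monday-Friday only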

shifting pandas series for only some entries

I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you intend to work with this as time series data, I would recommend converting it to datetimes and then forward filling the blanks:
import pandas as pd

data = {'Time': ['0' + str(i) + ':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

# 'unknown' cannot be parsed, so errors='coerce' turns those entries into NaT
df.Time = pd.to_datetime(df.Time, errors='coerce')
df = df.fillna(method='ffill')
However, if you are getting this data from a CSV file (or anything else read with a pandas.read_* function), you should use the na_values argument to have 'unknown' treated as an NA value:
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
You can also pass a list instead of a single string; the values you pass are added to pandas' existing set of NA markers.
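For example, a small sketch where 'n/a' is just an illustrative extra token (not from the original answer):
df = pd.read_csv('example.csv', na_values=['unknown', 'n/a'])
df = df.fillna(method='ffill')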
However, if you want to keep the column as strings, I would recommend just doing a find-and-replace:
import numpy as np
df.Time = np.where(df.Time == 'unknown', df.Time.shift(), df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()

Getting a new series conditional on some rows being present in Python and Pandas

I did not know of a simpler name for what I am trying to do; edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and I did a horrible workaround of about 75 lines of code, so I appreciate any tips. This will be very useful in the future.
The data looks like this:
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner pivot-y way; see the sketch after the example):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
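If you do want a pivot-based approach, here is a sketch, assuming each (weeknum, Location_Id, Item_Id) combination appears at most once (pivot_table would silently average duplicates):
# reshape so each Item_Id becomes its own column; rows missing either product become NaN and are dropped
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id',
                      values='averageprice').dropna()
# ratio of product 1 price to product 2 price
wide['relprice'] = wide[1] / wide[2]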
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2),
             on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask marking, for each store and date, whether exactly two unique Item_Id values are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose with lots of lambdas, but it will get you started, and you can refactor it to be faster and/or more concise if you want; a sketch of one such refactor follows.
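For instance, a more concise sketch of the same idea (column names follow the example data above: loc_id, weeknum, item_id, avg_price):
# mark the (loc_id, weeknum) groups that contain exactly two distinct items
mask = df.groupby(['loc_id', 'weeknum'])['item_id'].transform('nunique').eq(2)
# concatenate the prices of item 1 and item 2 as "p1/p2"
out = (df[mask]
       .sort_values('item_id')
       .groupby(['loc_id', 'weeknum'])['avg_price']
       .agg(lambda x: '/'.join(str(v) for v in x)))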
Let's say your full dataset is called TILPS. Then you might try this:
import pandas as pd
from __future__ import division
# Get list of unique dates present in TILPS
datelist = list(TILPS.ix[:, 'datetime'].unique())
# Get list of unique stores present in TILPS
storelist = list(TILPS.ix[:, 'store'].unique())
# For a given date, extract relative price
def dateLevel(daterow):
    price1 = int(daterow.loc[(daterow['Item_Id'] == 1), 'averageprice'].unique())
    price2 = int(daterow.loc[(daterow['Item_Id'] == 2), 'averageprice'].unique())
    return pd.DataFrame(pd.Series({'relprice': price1 / price2}))

# For each store, extract relative price for each date
def storeLevel(group, datelist):
    exist = group.loc[group['datetime'].isin(datelist), ['datetime', 'weeknum', 'locid']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on the index
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist
# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)
# Peek at output
print output.head(30)
