Pandas change values based on previous value in same column - python

I have the following dataframe:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(year=2022, month=5, day=1),
                            datetime.date(year=2022, month=11, day=1),
                            datetime.date(year=2022, month=10, day=1),
                            datetime.date(year=2022, month=11, day=1)],
                   "Lifecycle ID": [5, 5, 5, 5]})
I need to change the Lifecycle ID based on the value the same ID had 6 months ago (if it was 5, it should always become 6, not +1).
I'm currently trying:
df.loc[(df["Date"] == (df["Date"] - pd.DateOffset(months=6))) & (df["Lifecycle ID"] == 5), "Lifecycle ID"] = 6
However, this doesn't take the ID into account and I don't know how to fix it.
The output should be this dataframe (only the last Lifecycle ID changed to 6):
   ID        Date  Lifecycle ID
0   1  2022-05-01             5
1   2  2022-11-01             5
2   1  2022-10-01             5
3   1  2022-11-01             6
Could you please help me here?

The logic is not fully clear, but if I guess correctly:
# ensure datetime type
df['Date'] = pd.to_datetime(df['Date'])
# add the time delta to form a helper DataFrame
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))
# merge on ID/Date, retrieve "Lifecycle ID"
# and check if the value is 5
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
# if it is, update the value
df.loc[m, 'Lifecycle ID'] = 6
If you want to increment the value automatically from the value 6 months ago:
s = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID']
df.loc[s.notna(), 'Lifecycle ID'] = s.add(1)
Output:
ID Date Lifecycle ID
0 1 2022-05-01 5
1 2 2022-11-01 5
2 1 2022-10-01 5
3 1 2022-11-01 6

Related

Groupby count per category per month (Current month vs Remaining past months) in separate columns in pandas

Let's say I have the following dataframe:
I am trying to get something like this.
I was thinking of maybe using the rolling function, keeping separate dataframes for each count type (current month and past 3 months), and then merging them based on ID.
I am new to Python and pandas, so please bear with me if it's a simple question. I am still learning :)
EDIT:
@furas so I started by calculating the cumulative sum for all the counts as separate columns:
df['f_count_cum'] = df.groupby(["ID"])['f_count'].transform(lambda x: x.expanding().sum())
df['t_count_cum'] = df.groupby(["ID"])['t_count'].transform(lambda x: x.expanding().sum())
and then just get the current month df by
df_current = df[df.index == max(df.index)]
df_past_month = df[df.index == (max(df.index) - 1)]
and then just merge the two dataframes based on the ID?
I am not sure if it's correct, but this is my first take on this.
A few assumptions, looking at the input sample:
The Month column is of datetime64[ns] type. If not, use the line below to cast it:
df['Month'] = pd.to_datetime(df.Month)
The Month column is the index. If not, set it as the index:
df = df.set_index('Month')
The last month of the df is treated as the current month and the first 3 months as the 'past 3 months'. If not, modify the last and first calls in df1 and df2 respectively.
Code
df1 = df.last('M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(current month)',
             't_count': 't_count(current month)'})
df2 = df.first('3M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 3 months)',
             't_count': 't_count(past 3 months)'})
df = pd.merge(df1, df2, on='ID', how='inner').reindex(columns=[
    'ID',
    'f_count(current month)', 'f_count(past 3 months)',
    't_count(current month)', 't_count(past 3 months)'])
Output
ID f_count(current month) f_count(past 3 months) t_count(current month) t_count(past 3 months)
0 A 3 13 8 14
1 B 3 5 7 5
2 C 1 3 2 4
Another version of the same code, if you prefer a function and a single statement:
def get_df(freq):
    if freq == 'M':
        return df.last('M').groupby('ID').sum().reset_index()
    return df.first('3M').groupby('ID').sum().reset_index()

df = pd.merge(get_df('M').rename(
                  columns={'f_count': 'f_count(current month)',
                           't_count': 't_count(current month)'}),
              get_df('3M').rename(
                  columns={'f_count': 'f_count(past 3 months)',
                           't_count': 't_count(past 3 months)'}),
              on='ID').reindex(columns=[
                  'ID',
                  'f_count(current month)', 'f_count(past 3 months)',
                  't_count(current month)', 't_count(past 3 months)'])
EDIT:
For the previous two months before the current month (different combinations of the first and last functions can be used as needed):
df2 = df.last('3M').first('2M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 2 months)',
             't_count': 't_count(past 2 months)'})
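Note: DataFrame.first and DataFrame.last are deprecated in recent pandas (2.1+). If you are on a newer version, a rough sketch of the 'current month' slice using the DatetimeIndex directly (it keys on calendar month, so it is not an exact drop-in for last('M')):
# keep only rows whose month equals the last calendar month present in the index
cur = df.index.to_period('M') == df.index.max().to_period('M')
df1 = df[cur].groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(current month)',
             't_count': 't_count(current month)'})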

how to drop rows based on another column, but only if it has multiple different values

What I have?
I have a dataframe like this:
id value
0 0 5
1 0 5
2 0 6
3 1 7
4 1 7
What I want to get?
I want to drop all the rows whose id has more than one distinct value. In the example above I want to drop all the rows with id = 0:
id value
3 1 7
4 1 7
What I have tried?
import pandas as pd
df = pd.DataFrame({'id':[0, 0, 0, 1, 1], 'value':[5,5,6,7,7]})
print(df)
id_list = df['id'].tolist()
id_set = set(id_list)
for id in id_set:
    temp_list = df.loc[df['id'] == id, 'value'].tolist()
    s = set(temp_list)
    if len(s) > 1:
        df = df.loc[df['id'] != id]
It works, but it's ugly and inefficient.
Is there a better pythonic way using pandas methods?
Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group as a Series aligned to the original index, then compare and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform('nunique').eq(1)]
print (df)
id value
3 1 7
4 1 7
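If you prefer to keep everything inside the groupby, an equivalent sketch using GroupBy.filter (usually slower than transform on large frames):
# keep only the groups whose 'value' column has exactly one distinct value
df = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)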
# Try this code
import pandas as pd

id1 = pd.Series([0, 0, 0, 1, 1])
value = pd.Series([5, 5, 6, 7, 7])
data = pd.DataFrame({'id': id1, 'value': value})
datag = data.groupby('id')

# collect the indices of rows whose id maps to more than one distinct value
datadel = []
for i in set(data.id):
    if len(set(datag.get_group(i)['value'])) != 1:
        datadel.extend(data.loc[data["id"] == i].index.tolist())

data.drop(datadel, inplace=True)
print(data)

Changing format of date in pandas dataframe

I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
OK, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is less than 5 (Monday-Friday --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': [
        '20180101',
        '20180102',
        '20180103',
        '20180104',
        '20180105',
        '20180106',
        '20180107'
    ],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
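If you also want to keep the parsed dates rather than only use them for the mask, a small sketch assuming every string is a valid yyyymmdd value:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')  # yyyymmdd strings -> datetime64
df2 = df[df['date'].dt.dayofweek < 5]                      # keep Monday-Friday only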

shifting pandas series for only some entries

I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you intend to work with this as time series data, I would recommend converting it to datetimes and then forward filling the blanks:
import pandas as pd

data = {'Time': ['0' + str(i) + ':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

# 'unknown' cannot be parsed, so errors='coerce' turns those entries into NaT
df.Time = pd.to_datetime(df.Time, errors='coerce')
df = df.fillna(method='ffill')
However, if you are getting this data from a CSV file (or anything else read with a pandas.read_* function), you should use the na_values argument to have 'unknown' treated as an NA value:
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
You can also pass a list instead of a single string; the values you pass are added to pandas' existing set of NA markers.
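For example, a small sketch where 'n/a' is just an illustrative extra token (not from the original answer):
df = pd.read_csv('example.csv', na_values=['unknown', 'n/a'])
df = df.fillna(method='ffill')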
However, if you want to keep the column as strings, I would recommend just doing a find-and-replace:
import numpy as np
df.Time = np.where(df.Time == 'unknown', df.Time.shift(), df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()

Getting a new series conditional on some rows being present in Python and Pandas

I did not know of a simpler name for what I am trying to do; edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and I did a horrible workaround of about 75 lines of code, so I appreciate any tips. This will be very useful in the future.
The data looks like this:
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner pivot-y way; see the sketch after the example):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
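If you do want a pivot-based approach, here is a sketch, assuming each (weeknum, Location_Id, Item_Id) combination appears at most once (pivot_table would silently average duplicates):
# reshape so each Item_Id becomes its own column; rows missing either product become NaN and are dropped
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id',
                      values='averageprice').dropna()
# ratio of product 1 price to product 2 price
wide['relprice'] = wide[1] / wide[2]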
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2),
             on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask marking, for each store and date, whether exactly two unique Item_Id values are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose with lots of lambdas, but it will get you started, and you can refactor it to be faster and/or more concise if you want; a sketch of one such refactor follows.
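For instance, a more concise sketch of the same idea (column names follow the example data above: loc_id, weeknum, item_id, avg_price):
# mark the (loc_id, weeknum) groups that contain exactly two distinct items
mask = df.groupby(['loc_id', 'weeknum'])['item_id'].transform('nunique').eq(2)
# concatenate the prices of item 1 and item 2 as "p1/p2"
out = (df[mask]
       .sort_values('item_id')
       .groupby(['loc_id', 'weeknum'])['avg_price']
       .agg(lambda x: '/'.join(str(v) for v in x)))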
Let's say your full dataset is called TILPS. Then you might try this:
import pandas as pd
from __future__ import division
# Get list of unique dates present in TILPS
datelist = list(TILPS.ix[:, 'datetime'].unique())
# Get list of unique stores present in TILPS
storelist = list(TILPS.ix[:, 'store'].unique())
# For a given date, extract relative price
def dateLevel(daterow):
    price1 = int(daterow.loc[(daterow['Item_Id'] == 1), 'averageprice'].unique())
    price2 = int(daterow.loc[(daterow['Item_Id'] == 2), 'averageprice'].unique())
    return pd.DataFrame(pd.Series({'relprice': price1 / price2}))

# For each store, extract relative price for each date
def storeLevel(group, datelist):
    exist = group.loc[group['datetime'].isin(datelist), ['datetime', 'weeknum', 'locid']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on the index
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist
# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)
# Peek at output
print output.head(30)
