Python: creating categorical variable in the existing column - python

I have a DataFrame in Python with a column holding the difference between two dates. I would like to create a new column (or overwrite the existing one) that converts the numeric difference into a categorical variable based on the rules below:
difference 0 days Level 0
difference 2 days Level 1
difference 2-6 days Level 2
difference 6-15 days Level 3
difference 15-69 days Level 4
difference NaT Level 5
How could this be accomplished?

Say the column name is 'difference' (assuming it holds a numeric day count).
You can define a function like:
def get_difference_category(difference):
    if difference <= 0:
        return 0
    if difference <= 2:
        return 1
    # ... and so on
# Series.apply takes no axis argument; pass the function directly
df['difference'] = df['difference'].apply(get_difference_category)
reference links:
https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb
https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb
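For illustration, here is one way the function-based approach might look once every rule is filled in. This is only a sketch: it assumes the column holds day counts as numbers, and it assumes that anything above 69 days is treated like a missing value (Level 5), which the question does not actually specify.

```python
import numpy as np
import pandas as pd


def get_difference_category(difference):
    """Map a day count to a level number (thresholds taken from the question)."""
    if pd.isna(difference):
        return 5  # missing difference (NaT/NaN)
    if difference <= 0:
        return 0
    if difference <= 2:
        return 1
    if difference <= 6:
        return 2
    if difference <= 15:
        return 3
    if difference <= 69:
        return 4
    return 5  # assumption: values above 69 days share the "missing" level


# Hypothetical sample data
differences = pd.Series([0, 2, 6, 10, 69, 70, np.nan])
levels = differences.apply(get_difference_category)
print(levels.tolist())
```

A plain Python function applied row by row is easy to read but slow on large frames; the vectorised answers below scale better.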

You can use np.searchsorted to find where each time delta falls into an array of break points. I replace any NaT differences with Level 6
td = pd.to_timedelta(['0 days', '2 days', '6 days', '15 days', '69 days'])
difs = df.End.values - df.Start.values
vals = np.searchsorted(td.values, difs)
vals[pd.isnull(difs)] = 6
df = df.assign(
    # np.char.add is the public name for np.core.defchararray.add
    Level=np.char.add(
        'Level ', vals.astype(str)
    )
)
df
Start End Level
0 2017-01-01 2017-01-11 Level 3
1 2017-01-02 2017-03-09 Level 4
2 2017-01-03 2017-03-16 Level 5
3 2017-01-04 2017-01-10 Level 2
4 2017-01-05 2017-01-05 Level 0
5 2017-01-06 2017-01-08 Level 1
6 2017-01-07 2017-01-26 Level 4
7 2017-01-08 2017-01-15 Level 3
8 2017-01-09 2017-02-16 Level 4
9 2017-01-10 2017-01-24 Level 3
Setup
import numpy as np
import pandas as pd
from io import StringIO
txt = """ Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-05
5 2017-01-06 2017-01-08
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24"""
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in newer pandas)
df = pd.read_csv(StringIO(txt), sep=r'\s+').apply(pd.to_datetime)

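Edit: To handle NaT, you can use pd.cut and fill the missing results with 'Level 5'. Because 'Level 5' appears twice in labels, recent pandas versions require ordered=False (and pd.np is no longer available, so use np directly):
data['Severity'] = pd.cut((data['End'] - data['Start']).dt.days,[-np.inf,-1,0,2,6,15,69],labels=['Level 5', 'Level 0','Level 1','Level 2','Level 3','Level 4'],ordered=False).fillna('Level 5')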
Example:
df.head(10)
Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-25
5 2017-01-06 2017-01-25
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24
df['Severity'] = pd.cut((df['End'] - df['Start']).dt.days,[-np.inf,0,2,6,15,69,np.inf],labels=['Level 0','Level 1','Level 2','Level 3','Level 4','Level 5'])
Output:
End Start Severity
0 2017-01-11 2017-01-01 Level 3
1 2017-03-09 2017-01-02 Level 4
2 2017-03-16 2017-01-03 Level 5
3 2017-01-10 2017-01-04 Level 2
4 2017-01-25 2017-01-05 Level 4
5 2017-01-25 2017-01-06 Level 4
6 2017-01-26 2017-01-07 Level 4
7 2017-01-15 2017-01-08 Level 3
8 2017-02-16 2017-01-09 Level 4
9 2017-01-24 2017-01-10 Level 3
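As a self-contained illustration of the pd.cut idea with a missing date, the sketch below uses made-up sample rows and a new column name, Level. It assumes that both NaT differences and differences above 69 days should land in 'Level 5'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has a missing start date
df = pd.DataFrame({
    "Start": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-06", "2017-01-03", None]),
    "End": pd.to_datetime(["2017-01-11", "2017-01-05", "2017-01-08", "2017-03-16", "2017-01-10"]),
})

# Day difference; NaN wherever either date is NaT
days = (df["End"] - df["Start"]).dt.days

# Right-closed bins: (-inf, 0], (0, 2], (2, 6], (6, 15], (15, 69]
bins = [-np.inf, 0, 2, 6, 15, 69]
labels = ["Level 0", "Level 1", "Level 2", "Level 3", "Level 4"]
level = pd.cut(days, bins=bins, labels=labels)

# Anything pd.cut could not place (NaN or > 69 days) becomes 'Level 5'
df["Level"] = level.cat.add_categories("Level 5").fillna("Level 5")
print(df)
```

Adding 'Level 5' as an explicit category keeps the labels unique, so the result stays an ordered categorical.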

I added a bar plot to analyze the distribution. I also used a DataFrame and a lambda function to compute the day differences. Visuals help you understand the data: the histogram gives insight into how the classifications are distributed, and the distribution plot shows how the day intervals are spread.
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
column1=['2017-01-01','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-09','2017-01-10']
column2=['2017-01-01','2017-01-11','2017-03-09','2017-03-16','2017-01-10','2017-01-25','2017-01-25','2017-01-26','2017-01-15','2017-02-16','2017-01-24' ]
index=range(0,len(column1))
data={'column1':column1,'column2':column2}
df=pd.DataFrame(data, columns=['column1','column2'],index=index)
print(df.head())
differences=df.apply(lambda x: datetime.strptime(x['column2'],'%Y-%m-%d')- datetime.strptime(x['column1'],'%Y-%m-%d'),axis=1)
differences=differences.dt.days.astype('int')
years_bins=[-1,0,2,6,15,69,np.inf]
output_labels=['level 0','level 1','level 2','level 3','level 4','level 5']
out=pd.cut(differences,bins=years_bins,labels=output_labels)
df['differences']=differences
df['classification']=out
print(df.head())
fig, ax = plt.subplots(figsize=(6, 4))
# draw on the axes created above; value_counts(sort=False) keeps category order
out.value_counts(sort=False).plot.barh(ax=ax, rot=0, color="b")
ax.set_yticklabels(output_labels)
plt.show()
plt.hist(df['classification'], bins=6)
plt.show()
sns.distplot(df['differences'])
plt.show()

Related

Pandas get minimum total in moving window

I have something like the following dataframe:
df=pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
data=[5,5,2,1,3,4,5,6,7,8])
df
Out[216]:
2017-01-01 5
2017-01-02 5
2017-01-03 2
2017-01-04 1
2017-01-05 3
2017-01-06 4
2017-01-07 5
2017-01-08 6
2017-01-09 7
2017-01-10 8
Freq: D, dtype: int64
I want to identify the start date of the 3 day period that has the minimum total value. So in this example, 2017-01-03 through 2017-01-05, has the minimum value with a sum of 6 across those 3 days.
Is there a way to do this without looping through each 3 day window?
The result would be:
2017-01-03 6
And if there were multiple windows that have the same minimum sum, the result could have a record for each.
IIUC, you can use rolling:
df=pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
data=[5,5,2,1,3,4,5,6,7,8])
df=df.to_frame()
df['New']=df.rolling(3).sum().shift(-2).values
df.loc[df.New==df.New.min(),:].drop(columns=0)
Out[685]:
New
2017-01-03 6.0
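The same idea can be sketched directly on the Series, without the helper column. The names s, window and totals are illustrative; the shift re-labels each window total by its first day rather than its last.

```python
import pandas as pd

s = pd.Series(
    [5, 5, 2, 1, 3, 4, 5, 6, 7, 8],
    index=pd.date_range(start="2017-01-01", periods=10, freq="D"),
)

window = 3
# rolling().sum() labels each total by the window's LAST day;
# shifting back by window - 1 rows labels it by the window's FIRST day
totals = s.rolling(window).sum().shift(-(window - 1))

# Keep every window start whose total equals the minimum (ties included)
result = totals[totals == totals.min()]
print(result)
```

The comparison against the minimum returns one row per tied window, which covers the case of several windows sharing the same minimum sum.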

Pandas fill forward and sum as you go

I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the other dates. I know you can use resample, but I am highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to fill forward over the NaNs, except that whenever a new value is encountered it gets added to the current value from the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop based approach as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs.
Thanks
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum
df.fillna(0).cumsum()
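A minimal end-to-end sketch of this, using the two-column example from the question. It assumes the Date column is already datetime and uses asfreq('D') to insert the missing days; the variable names are illustrative.

```python
import pandas as pd

# Sparse transaction log modelled on the question's two-column example
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-01-01", "2017-01-03", "2017-01-04", "2017-01-05",
        "2017-01-06", "2017-01-07", "2017-01-09",
    ]),
    "InventoryA": [10, None, None, -5, None, 15, -20],
    "InventoryB": [None, 5, 5, None, -10, None, None],
})

# Step 1: one row per calendar day; days without a transaction become NaN
daily = df.set_index("Date").asfreq("D")

# Step 2: a missing transaction means "no change", so treat it as 0 and
# accumulate to get the running stock level for each column
totals = daily.fillna(0).cumsum()
print(totals)
```

Since fillna and cumsum both operate column by column, the same two lines handle any number of inventory columns.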
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').ffill()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
Part 1: Assuming df is your DataFrame:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day) for day in range((df.Date.max() - df.Date.min()).days+1)])
df_new = df_new.merge(df, left_on=0, right_on='Date',how="left").drop("Date",axis=1)
df_new.columns = df.columns
Gives you :
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2:
From fillna method descriptions:
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series.
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
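You should be able to generalise this to more than one column once you understand how it works with one. Note that a plain forward fill repeats each transaction value (for example -5.0 on 2017-01-05) rather than the running total of 5 that the question asks for; calling cumsum() on the Inventory column before the fill gives the running total.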

Replace values in Pandas DataFrame with a DateTime Index based on a list of dates

I have a dataframe of values and a list of dates.
E.g.,
data = pd.DataFrame([1,3,5,7,2,3,9,1,3,8,4,5],index=pd.date_range(start='2017-01-01',periods=12),columns=['values'])
I want to replace the value of a date in the date list with a zero value. E.g.,
date_list = ['2017-01-04', '2017-01-07', '2017-01-10']
I have tried:
data[date_list] == 0
but this yields an error:
KeyError: "None of [['2017-01-04', '2017-01-07', '2017-01-10']] are in the [index]"
Does anyone have an idea of how to solve this? I have a very large dataframe and date list...
Another way,
In [11]: data[data.index.isin(date_list)] = 0
In [12]: data
Out[12]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
You need to convert that list to datetime and use the loc indexer:
data.loc[pd.to_datetime(date_list)] = 0
data
Out[19]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
This works here because the DataFrame has only one column: the assignment sets every column to zero for those dates. As jezrael pointed out, if you only want to set the values column to zero, you need to specify that:
data.loc[pd.to_datetime(date_list), 'values'] = 0
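The two answers can also be combined. The sketch below uses a boolean mask from isin together with an explicit column label; the extra date '2017-02-01' is a made-up entry added to show that dates missing from the index are simply ignored by the mask.

```python
import pandas as pd

data = pd.DataFrame(
    [1, 3, 5, 7, 2, 3, 9, 1, 3, 8, 4, 5],
    index=pd.date_range(start="2017-01-01", periods=12),
    columns=["values"],
)

# '2017-02-01' is hypothetical and not present in the index
date_list = ["2017-01-04", "2017-01-07", "2017-01-10", "2017-02-01"]

# True for every row whose date appears in the list
mask = data.index.isin(pd.to_datetime(date_list))

# Zero out only the 'values' column on the matching rows
data.loc[mask, "values"] = 0
print(data)
```

Because isin is a vectorised membership test, this should stay reasonably fast for a large frame and a long date list.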

Efficiently serialising timestamps

I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but it takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is somehow convert the datetime type in my_df['date'] to an integer (say a timestamp in seconds), and likewise the integer hours in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case I then convert that integer back to a datetime in one go or just use seconds as my data type going forwards.
Thanks for any help.
IIUC, you can convert the hours to timedeltas and add them to your datetimes:
In [112]:
# sample data
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame({'date':pd.date_range(dt.datetime(2017,1,1), periods=10), 'hour':np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
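# pd.to_timedelta replaces pd.TimedeltaIndex(..., unit='h'), whose unit keyword is deprecated in newer pandas
df['timestamp'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')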
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.to_timedelta(my_df['hour'], unit='h')
This assumes that my_df['date'] is already a datetime column; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date']).
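For completeness, here is a small sketch of the vectorised combination, plus the integer-seconds form mentioned in the question. The column name epoch_seconds is made up for this example, and the dates are assumed to be timezone-naive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10),
    "hour": np.arange(10),
})

# Vectorised: turn the integer hours into timedeltas and add them to the dates
df["timestamp"] = df["date"] + pd.to_timedelta(df["hour"], unit="h")

# Optional: whole seconds since 1970-01-01 as plain integers
epoch = pd.Timestamp("1970-01-01")
df["epoch_seconds"] = (df["timestamp"] - epoch) // pd.Timedelta(seconds=1)
print(df.head())
```

Dividing by a one-second Timedelta avoids depending on the internal resolution of the datetime column.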

merge and sample two pandas time series

I have two time series. I would like to merge them and asfreq(*, method='pad') the result, restricted to the time range they share in common.
To illustrate, suppose I define A and B like this:
import datetime as dt
import numpy as np
import pandas as pd
A = pd.Series(np.arange(4), index=pd.date_range(dt.datetime(2017,1,4,10,0,0),
periods=4, freq=dt.timedelta(seconds=10)))
B = pd.Series(np.arange(6), index=pd.date_range(dt.datetime(2017,1,4,10,0,7),
periods=6, freq=dt.timedelta(seconds=3)))
So they look like:
# A
2017-01-04 10:00:00 0
2017-01-04 10:00:10 1
2017-01-04 10:00:20 2
2017-01-04 10:00:30 3
# B
2017-01-04 10:00:07 0
2017-01-04 10:00:10 1
2017-01-04 10:00:13 2
2017-01-04 10:00:16 3
2017-01-04 10:00:19 4
2017-01-04 10:00:22 5
I would like to compute something like:
# combine_and_asfreq(A, B, dt.timedelta(seconds=5))
# timestamp A B
2017-01-04 10:00:07 0 0
2017-01-04 10:00:12 1 1
2017-01-04 10:00:17 1 3
2017-01-04 10:00:22 2 5
How can I do this?
I am not exactly sure what you are asking, but here is a somewhat convoluted method that first finds the overlapping time range and creates a single-column dataframe as the 'base' dataframe with the 5s timedelta.
Get started by setting up the dataframes properly:
start = max(A.index.min(), B.index.min())
end = min(A.index.max(), B.index.max())
df_time = pd.DataFrame({'time': pd.date_range(start,end,freq='5s')})
df_A = A.reset_index()
df_B = B.reset_index()
df_A.columns = ['time', 'value']
df_B.columns = ['time', 'value']
Now we have the following three dataframes.
df_A
time value
0 2017-01-04 10:00:00 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:20 2
3 2017-01-04 10:00:30 3
df_B
time value
0 2017-01-04 10:00:07 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:13 2
3 2017-01-04 10:00:16 3
4 2017-01-04 10:00:19 4
5 2017-01-04 10:00:22 5
df_time
time
0 2017-01-04 10:00:07
1 2017-01-04 10:00:12
2 2017-01-04 10:00:17
3 2017-01-04 10:00:22
Use merge_asof to join all three
pd.merge_asof(pd.merge_asof(df_time, df_A, on='time'), df_B, on='time', suffixes=('_A', '_B'))
time value_A value_B
0 2017-01-04 10:00:07 0 0
1 2017-01-04 10:00:12 1 1
2 2017-01-04 10:00:17 1 3
3 2017-01-04 10:00:22 2 5
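An alternative sketch that avoids the intermediate dataframes: build the shared 5-second grid and forward-fill each series onto it with reindex. Note that this swaps merge_asof for reindex(method='ffill'). The function name combine_and_asfreq is borrowed from the question's pseudocode, and the column names A and B are chosen for illustration.

```python
import numpy as np
import pandas as pd


def combine_and_asfreq(a, b, freq):
    """Forward-fill two time series onto a common grid over their shared range."""
    start = max(a.index.min(), b.index.min())
    end = min(a.index.max(), b.index.max())
    grid = pd.date_range(start, end, freq=freq)
    # method="ffill" takes the last observation at or before each grid point
    return pd.DataFrame({
        "A": a.reindex(grid, method="ffill"),
        "B": b.reindex(grid, method="ffill"),
    })


A = pd.Series(np.arange(4), index=pd.date_range("2017-01-04 10:00:00", periods=4, freq="10s"))
B = pd.Series(np.arange(6), index=pd.date_range("2017-01-04 10:00:07", periods=6, freq="3s"))

result = combine_and_asfreq(A, B, "5s")
print(result)
```

Both inputs need a sorted index for the forward fill to work, which holds for series built with date_range.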
