I want to increase the date in one dataframe column by an integer value in another.
I receive TypeError: unsupported type for timedelta days component: numpy.int64
My dataframes look like this:
import pandas as pd
import numpy as np
import datetime as dt
dfa = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
I format the 'Start' column to datetime with this code:
dfa['Start'] = dfa['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
dfb['Start'] = dfb['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
I try to change the values in the dfa dataframe. The dfb dataframe reference works for 'Days' but not for 'Start':
for i, row in dfb.iterrows():
    for j, row2 in dfa.iterrows():
        new = pd.DataFrame({"Start": dfa.loc[j, "Start"] + dt.timedelta(days=dfb.loc[i, "Days"]),
                            "Days": dfa.loc[j, "Days"] - dfb.loc[i, "Days"]}, index=[j + 1])
        dfa = pd.concat([dfa.loc[:j], new, dfa.loc[j + 1:]]).reset_index(drop=True)
This is the key component that raises the error:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=dfb.loc[i,"Days"])
It works fine if I use:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=1)
but I need it to be taking that value from dfb, not a static integer.
IIUC (I changed the input values a bit to clarify what is going on):
import pandas as pd
dfa = pd.DataFrame([
['5/15/17',1],
['5/16/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',3],
['5/16/17',4]],
columns = ['Start','Days'])
dfa['Start'] = pd.to_datetime(dfa['Start'])
dfb['Start'] = pd.to_datetime(dfb['Start'])
dfa['Start'] = dfa['Start'] + dfb['Days'].apply(pd.Timedelta,unit='D')
print(dfa)
Output:
Start Days
0 2017-05-18 1
1 2017-05-20 1
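A vectorized alternative worth noting: pd.to_timedelta converts the whole integer column in one call, with no per-element apply. A minimal sketch with the same sample data:

```python
import pandas as pd

dfa = pd.DataFrame({'Start': ['5/15/17', '5/16/17'], 'Days': [1, 1]})
dfb = pd.DataFrame({'Start': ['5/15/17', '5/16/17'], 'Days': [3, 4]})

dfa['Start'] = pd.to_datetime(dfa['Start'])

# pd.to_timedelta handles numpy.int64 values natively,
# so no element-by-element conversion is needed
dfa['Start'] = dfa['Start'] + pd.to_timedelta(dfb['Days'], unit='D')
print(dfa)
```

This prints the same result as the apply version above.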
Related
Essentially I have a csv file which has an OFFENCE_CODE column and a column with some dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occurring offence codes within the OFFENCE_CODE column, however I need to be able to do this between 2 dates from the OFFENCE_MONTH column.
import numpy as np
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))
You can use pandas.Series.between:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
m = df['OFFENCE_MONTH'].between(input_date1, input_date2)
df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)
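A self-contained sketch of the same idea, with invented rows standing in for penalty_data_set.csv (the codes and dates here are made up):

```python
import pandas as pd

# invented sample rows standing in for penalty_data_set.csv
df = pd.DataFrame({
    'OFFENCE_CODE':  ['A1', 'A1', 'B2', 'C3'],
    'OFFENCE_MONTH': ['2012/12/01', '2013/02/01', '2013/01/01', '2014/05/01'],
})

df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
m = df['OFFENCE_MONTH'].between(pd.to_datetime('2012/11/1'),
                                pd.to_datetime('2013/11/1'))

# rows outside the window (here the 2014 one) are excluded before counting
print(df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10))
```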
You can do this if it is per month:
import pandas as pd
# input_date1 and input_date2 would first need parsing to dates;
# this example filters directly on month numbers instead
# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH':[1,1,1,2,3,4,4,5,6,12],
'OFFENCE_CODE':['a','a','b','d','r','e','f','g','h','a']}
df = pd.DataFrame(d)
print(df)
# make a filter (example here)
df_filter = df.loc[(df['OFFENCE_MONTH']>=1) & (df['OFFENCE_MONTH']<5)]
print(df_filter)
# arrange the filter
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))
example result:
a 2
b 1
d 1
r 1
e 1
f 1
First you need to convert the OFFENCE_MONTH column (and the input dates) to datetime:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
Then select the rows matching your conditions (use & with parentheses, not and, when comparing Series):
rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) & (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))
I want to write an if condition for concatenating strings.
i.e. if cell A1 contains a specific format of text, only then concatenate, else leave as is.
example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month - year), else leave the bill number as it is.
You can create function which does what you need and use df.apply() to execute it on all rows.
I use example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems bill_date holds datetimes, whereas I used strings. I had to convert the strings to datetime to show how to work with this, so it now needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'). Pandas displays datetimes in different formats for different locales, so I couldn't use str(x) here: it gives 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'][3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
bill_number bill_date
0 CM2/0000/09-19 2019-09-15
1 CM2/0000 2019-09-15
2 CM3/0000/09-19 2019-09-15
3 CM3/0000 2019-09-15
Second idea is to create mask
mask = df['bill_number'].str.endswith('/')
and later use it for all values
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to correctly assign values, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
Third idea is to use numpy.where()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
df['bill_number'] = np.where(
df['bill_number'].str.endswith('/'),
#df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample like @Mike67 was suggesting, but based on your information this is what I came up with. Bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
from pandas import DataFrame, Series
dat = {'num': ['CM2/0000/','CM2/0000', 'CM3/0000/', 'CM3/0000',],
'date': ['15/09/19','15/09/19','15/09/19','15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/','-')
# assign only to the 'num' column; looping over all columns
# would also overwrite 'date' for the masked rows
df.loc[df['num'].str.endswith('/'), 'num'] = df['num'] + df['date']
print(df)
Results:
num date
0 CM2/0000/09-19 09-19
1 CM2/0000 09-19
2 CM3/0000/09-19 09-19
3 CM3/0000 09-19
I have a dataframe with datetime (df1). I want to know if the datetime in a 'col1' in df1 is between any of the pair of the datetime of two columns ('lowerbound' and 'upperbound') in another dataframe (df2).
For example:
df1 = pd.to_datetime(['2014-04-09 07:37:00','2015-04-09 07:00:00',
'2014-02-02 08:31:00','2014-03-02 08:22:00'])
df1 = pd.DataFrame(df1,columns = ['col1'])
lowerbound = pd.to_datetime(['2014-04-09 07:25:00','2014-02-02 08:30:00',
'2015-04-09 06:00:00','2014-03-02 08:12:00'])
upperbound = pd.to_datetime(['2014-04-09 07:38:00','2014-04-09 07:48:00',
'2015-04-09 08:00:00','2014-02-02 08:33:00'])
df2 = pd.DataFrame(lowerbound,columns = ['lowerbound'])
df2['upperbound'] = upperbound
The result should be [1,1,0,0] since:
df1['col1'][0] is between df2['lowerbound'][0] & df2['upperbound'][0]
df1['col1'][1] is between df2['lowerbound'][2] & df2['upperbound'][2]
Although df1['col1'][2] is between df2['lowerbound'][1] & df2['upperbound'][3], the indexes for df2['lowerbound'] and df2['upperbound'] are not the same.
Thanks!
you can use np.greater_equal.outer and np.less_equal.outer combined with any over axis=1, such as:
import numpy as np
print ((np.greater_equal.outer(df1.col1, df2.lowerbound)
& np.less_equal.outer(df1.col1, df2.upperbound))
.any(1).astype(int))
with your data this gives [1 1 1 1]
I believe you need apply in this case
df1.col1.apply(lambda dat: ((dat>= df2.lowerbound) & (dat <= df2.upperbound)).any())
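Note that both answers above match any (lowerbound, upperbound) pair. If a strictly row-by-row check were wanted instead (row i of df1 against row i of df2 only), an aligned Series comparison suffices; a sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': pd.to_datetime(
    ['2014-04-09 07:37:00', '2015-04-09 07:00:00',
     '2014-02-02 08:31:00', '2014-03-02 08:22:00'])})
df2 = pd.DataFrame({
    'lowerbound': pd.to_datetime(
        ['2014-04-09 07:25:00', '2014-02-02 08:30:00',
         '2015-04-09 06:00:00', '2014-03-02 08:12:00']),
    'upperbound': pd.to_datetime(
        ['2014-04-09 07:38:00', '2014-04-09 07:48:00',
         '2015-04-09 08:00:00', '2014-02-02 08:33:00'])})

# Series with matching indexes align element-wise, so this checks
# row i of df1 only against row i of df2
same_idx = ((df1['col1'] >= df2['lowerbound'])
            & (df1['col1'] <= df2['upperbound'])).astype(int)
print(same_idx.tolist())
```

On this data it yields [1, 0, 0, 0] rather than [1, 1, 0, 0], precisely because it ignores pairs at other indexes.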
I would like to loop through a list of dates, in order to obtain the difference between a date and the date before (e.g. the difference between 4/13 and 3/13).
I'm looking for a for-loop able to scan through pairs of dates
import pandas as pd
import numpy as np
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),
'age': [10,np.nan,50,30]}
df1 = pd.DataFrame(raw_data, columns = ['date','age'])
df1
input data
df2=df1.T
df2['new']=df2.iloc[:,3]-df2.iloc[:,2]
desired result:
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),'age': [10,np.nan,50,30],
'diff': [10,-10,50,-20]}
output = pd.DataFrame(raw_data, columns = ['date','age','diff'])
output
You don't need a loop if you want to know the difference between two dates.
Maybe you should do this:
df1['difference'] = (df1['date'] - df1['date'].shift(1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
or this:
df1['difference'] = (df1['date'] - df1['date'].shift(-1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
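Series.diff expresses the same shift-and-subtract in one step, and .dt.days pulls out the day counts (NaN for the first row, which has no predecessor). A sketch on the question's data:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-30', '2017-03-31',
                            '2017-04-03', '2017-04-04']),
    'age': [10, np.nan, 50, 30]})

# diff() subtracts each date from the one before; dt.days gives numbers of days
df1['difference'] = df1['date'].diff().dt.days
print(df1)
```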
I have 2 DataFrames indexed by Time.
import datetime as dt
import pandas as pd
rng1 = pd.date_range("11:00:00","11:00:30",freq="500ms")
df1 = pd.DataFrame({'A':range(1,62), 'B':range(1000,62000,1000)},index = rng1)
rng2 = pd.date_range("11:00:03","11:01:03",freq="700ms")
df2 = pd.DataFrame({'Z':range(10,870,10)},index = rng2)
I am trying to assign 'C' in df1 the last element of 'Z' in df2 closest to time index of df1. The following code seems to work now (returns a list).
df1['C'] = None
for tidx, a, b, c in df1.itertuples():
    df1['C'].loc[tidx] = df2[:tidx].tail(1).Z.values
    #df1['C'].loc[tidx] = df2[:tidx].Z  --> was trying this, which didn't work
df1
Is it possible to avoid iterating?
TIL: Pandas Index instances have a map method.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.map.html
def fn(df):
    def inner(dt):
        # value of 'Z' at the row whose index is nearest to dt
        return df['Z'].iloc[abs(df.index - dt).argmin()]
    return inner
df1['C'] = df1.index.map(fn(df2))
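One caveat: the map/argmin approach picks the nearest df2 timestamp in either direction, whereas the original tail(1) loop takes the last value at or before each timestamp. If the latter is the intent, pandas.merge_asof does it without iteration; a sketch (the Z values here are generated from rng2's length so the two columns always line up):

```python
import pandas as pd

rng1 = pd.date_range('11:00:00', '11:00:30', freq='500ms')
df1 = pd.DataFrame({'A': range(1, 62), 'B': range(1000, 62000, 1000)}, index=rng1)

rng2 = pd.date_range('11:00:03', '11:01:03', freq='700ms')
df2 = pd.DataFrame({'Z': [10 * (i + 1) for i in range(len(rng2))]}, index=rng2)

# merge_asof (default direction='backward') takes, for each df1 timestamp,
# the most recent df2 row at or before it -- same as the tail(1) loop,
# but vectorized; timestamps before 11:00:03 get NaN
df1['C'] = pd.merge_asof(df1, df2, left_index=True, right_index=True)['Z'].to_numpy()
print(df1.head(10))
```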