I want to increase the date in one dataframe column by an integer value in another.
I receive TypeError: unsupported type for timedelta days component: numpy.int64
My dataframes look like this:
import pandas as pd
import numpy as np
import datetime as dt
dfa = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
I format the 'Start' column to datetime with this code:
dfa['Start'] = dfa['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
dfb['Start'] = dfb['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
I try to change the values in the dfa dataframe. The dfb dataframe reference works for 'Days' but not for 'Start':
for i, row in dfb.iterrows():
    for j, row2 in dfa.iterrows():
        new = pd.DataFrame({"Start": dfa.loc[j, "Start"] + dt.timedelta(days=dfb.loc[i, "Days"]),
                            "Days": dfa.loc[j, "Days"] - dfb.loc[i, "Days"]}, index=[j + 1])
        dfa = pd.concat([dfa.loc[:j], new, dfa.loc[j + 1:]]).reset_index(drop=True)
This is the key component that raises the error:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=dfb.loc[i,"Days"])
It works fine if I use:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=1)
but I need it to be taking that value from dfb, not a static integer.
IIUC (I changed the input values a bit to clarify what is going on):
import pandas as pd
dfa = pd.DataFrame([
['5/15/17',1],
['5/16/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',3],
['5/16/17',4]],
columns = ['Start','Days'])
dfa['Start'] = pd.to_datetime(dfa['Start'])
dfb['Start'] = pd.to_datetime(dfb['Start'])
dfa['Start'] = dfa['Start'] + dfb['Days'].apply(pd.Timedelta,unit='D')
print(dfa)
Output:
Start Days
0 2017-05-18 1
1 2017-05-20 1
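A vectorized alternative worth noting: pd.to_timedelta converts the whole integer column in one call, with no per-element apply. A minimal sketch with the same sample data:

```python
import pandas as pd

dfa = pd.DataFrame({'Start': ['5/15/17', '5/16/17'], 'Days': [1, 1]})
dfb = pd.DataFrame({'Start': ['5/15/17', '5/16/17'], 'Days': [3, 4]})

dfa['Start'] = pd.to_datetime(dfa['Start'])

# pd.to_timedelta handles numpy.int64 values natively,
# so no element-by-element conversion is needed
dfa['Start'] = dfa['Start'] + pd.to_timedelta(dfb['Days'], unit='D')
print(dfa)
```

This prints the same result as the apply version above.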
Related
Essentially I have a csv file which has an OFFENCE_CODE column and a column with some dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occurring offence codes within the OFFENCE_CODE column, however I need to be able to do this between 2 dates from the OFFENCE_MONTH column.
import numpy as np
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))
You can use pandas.Series.between:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
m = df['OFFENCE_MONTH'].between(input_date1, input_date2)
df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)
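A self-contained sketch of the same idea, with invented rows standing in for penalty_data_set.csv (the codes and dates here are made up):

```python
import pandas as pd

# invented sample rows standing in for penalty_data_set.csv
df = pd.DataFrame({
    'OFFENCE_CODE':  ['A1', 'A1', 'B2', 'C3'],
    'OFFENCE_MONTH': ['2012/12/01', '2013/02/01', '2013/01/01', '2014/05/01'],
})

df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
m = df['OFFENCE_MONTH'].between(pd.to_datetime('2012/11/1'),
                                pd.to_datetime('2013/11/1'))

# rows outside the window (here the 2014 one) are excluded before counting
print(df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10))
```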
You can do this if it is per month:
import pandas as pd
# input_date1 and input_date2 would first need parsing to dates;
# this example filters directly on month numbers instead
# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH':[1,1,1,2,3,4,4,5,6,12],
'OFFENCE_CODE':['a','a','b','d','r','e','f','g','h','a']}
df = pd.DataFrame(d)
print(df)
# make a filter (example here)
df_filter = df.loc[(df['OFFENCE_MONTH']>=1) & (df['OFFENCE_MONTH']<5)]
print(df_filter)
# arrange the filter
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))
example result:
a 2
b 1
d 1
r 1
e 1
f 1
First you need to convert the OFFENCE_MONTH column (and the input dates) to datetime:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
Then select the rows matching your conditions (use & with parentheses, not and, when comparing Series):
rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) & (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))
I want to write an if condition for concatenating strings.
i.e. if cell A1 contains a specific format of text, only then concatenate, else leave as is.
example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month - year), else leave the bill number as it is.
You can create function which does what you need and use df.apply() to execute it on all rows.
I use example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems bill_date holds datetimes, whereas I used strings. I had to convert the strings to datetime to show how to work with this, so it now needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'). Pandas displays datetimes in different formats for different locales, so I couldn't use str(x) here: it gives 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'][3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
bill_number bill_date
0 CM2/0000/09-19 2019-09-15
1 CM2/0000 2019-09-15
2 CM3/0000/09-19 2019-09-15
3 CM3/0000 2019-09-15
Second idea is to create mask
mask = df['bill_number'].str.endswith('/')
and later use it for all values
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to correctly assign values, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
Third idea is to use numpy.where()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
df['bill_number'] = np.where(
df['bill_number'].str.endswith('/'),
#df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample like @Mike67 was suggesting, but based on your information this is what I came up with. Bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
from pandas import DataFrame, Series
dat = {'num': ['CM2/0000/','CM2/0000', 'CM3/0000/', 'CM3/0000',],
'date': ['15/09/19','15/09/19','15/09/19','15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/','-')
# assign only to the 'num' column; looping over all columns
# would also overwrite 'date' for the masked rows
df.loc[df['num'].str.endswith('/'), 'num'] = df['num'] + df['date']
print(df)
Results:
num date
0 CM2/0000/09-19 09-19
1 CM2/0000 09-19
2 CM3/0000/09-19 09-19
3 CM3/0000 09-19
I have a dataframe with datetime (df1). I want to know if the datetime in a 'col1' in df1 is between any of the pair of the datetime of two columns ('lowerbound' and 'upperbound') in another dataframe (df2).
For example:
df1 = pd.to_datetime(['2014-04-09 07:37:00','2015-04-09 07:00:00',
'2014-02-02 08:31:00','2014-03-02 08:22:00'])
df1 = pd.DataFrame(df1,columns = ['col1'])
lowerbound = pd.to_datetime(['2014-04-09 07:25:00','2014-02-02 08:30:00',
'2015-04-09 06:00:00','2014-03-02 08:12:00'])
upperbound = pd.to_datetime(['2014-04-09 07:38:00','2014-04-09 07:48:00',
'2015-04-09 08:00:00','2014-02-02 08:33:00'])
df2 = pd.DataFrame(lowerbound,columns = ['lowerbound'])
df2['upperbound'] = upperbound
The result should be [1,1,0,0] since:
df1['col1'][0] is between df2['lowerbound'][0] & df2['upperbound'][0]
df1['col1'][1] is between df2['lowerbound'][2] & df2['upperbound'][2]
Although df1['col1'][2] is between df2['lowerbound'][1] & df2['upperbound'][3], the indexes for df2['lowerbound'] and df2['upperbound'] are not the same.
Thanks!
you can use np.greater_equal.outer and np.less_equal.outer combined with any over axis=1, such as:
import numpy as np
print ((np.greater_equal.outer(df1.col1, df2.lowerbound)
& np.less_equal.outer(df1.col1, df2.upperbound))
.any(1).astype(int))
with your data this gives [1 1 1 1]
I believe you need apply in this case
df1.col1.apply(lambda dat: ((dat>= df2.lowerbound) & (dat <= df2.upperbound)).any())
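Note that both answers above match any (lowerbound, upperbound) pair. If a strictly row-by-row check were wanted instead (row i of df1 against row i of df2 only), an aligned Series comparison suffices; a sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': pd.to_datetime(
    ['2014-04-09 07:37:00', '2015-04-09 07:00:00',
     '2014-02-02 08:31:00', '2014-03-02 08:22:00'])})
df2 = pd.DataFrame({
    'lowerbound': pd.to_datetime(
        ['2014-04-09 07:25:00', '2014-02-02 08:30:00',
         '2015-04-09 06:00:00', '2014-03-02 08:12:00']),
    'upperbound': pd.to_datetime(
        ['2014-04-09 07:38:00', '2014-04-09 07:48:00',
         '2015-04-09 08:00:00', '2014-02-02 08:33:00'])})

# Series with matching indexes align element-wise, so this checks
# row i of df1 only against row i of df2
same_idx = ((df1['col1'] >= df2['lowerbound'])
            & (df1['col1'] <= df2['upperbound'])).astype(int)
print(same_idx.tolist())
```

On this data it yields [1, 0, 0, 0] rather than [1, 1, 0, 0], precisely because it ignores pairs at other indexes.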
I would like to loop through a list of dates, in order to obtain the difference between a date and the date before (e.g. the difference between 4/13 and 3/13).
I'm looking for a for-loop able to scan through pairs of dates
import pandas as pd
import numpy as np
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),
'age': [10,np.nan,50,30]}
df1 = pd.DataFrame(raw_data, columns = ['date','age'])
df1
input data
df2=df1.T
df2['new']=df2.iloc[:,3]-df2.iloc[:,2]
desired result:
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),'age': [10,np.nan,50,30],
'diff': [10,-10,50,-20]}
output = pd.DataFrame(raw_data, columns = ['date','age','diff'])
output
You don't need a loop if you want to know the difference between two dates.
Maybe you should do this:
df1['difference'] = (df1['date'] - df1['date'].shift(1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
or this:
df1['difference'] = (df1['date'] - df1['date'].shift(-1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
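Series.diff expresses the same shift-and-subtract in one step, and .dt.days pulls out the day counts (NaN for the first row, which has no predecessor). A sketch on the question's data:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-30', '2017-03-31',
                            '2017-04-03', '2017-04-04']),
    'age': [10, np.nan, 50, 30]})

# diff() subtracts each date from the one before; dt.days gives numbers of days
df1['difference'] = df1['date'].diff().dt.days
print(df1)
```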
I have 2 DataFrames indexed by Time.
import datetime as dt
import pandas as pd
rng1 = pd.date_range("11:00:00","11:00:30",freq="500ms")
df1 = pd.DataFrame({'A':range(1,62), 'B':range(1000,62000,1000)},index = rng1)
rng2 = pd.date_range("11:00:03","11:01:03",freq="700ms")
df2 = pd.DataFrame({'Z':range(10,870,10)},index = rng2)
I am trying to assign 'C' in df1 the last element of 'Z' in df2 closest to time index of df1. The following code seems to work now (returns a list).
df1['C'] = None
for tidx, a, b, c in df1.itertuples():
    df1['C'].loc[tidx] = df2[:tidx].tail(1).Z.values
    #df1['C'].loc[tidx] = df2[:tidx].Z  --> was trying this, which didn't work
df1
Is it possible to avoid iterating?
TIL: Pandas Index instances have a map method.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.map.html
def fn(df):
    def inner(dt):
        # value of 'Z' at the row whose index is nearest to dt
        return df['Z'].iloc[abs(df.index - dt).argmin()]
    return inner
df1['C'] = df1.index.map(fn(df2))
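One caveat: the map/argmin approach picks the nearest df2 timestamp in either direction, whereas the original tail(1) loop takes the last value at or before each timestamp. If the latter is the intent, pandas.merge_asof does it without iteration; a sketch (the Z values here are generated from rng2's length so the two columns always line up):

```python
import pandas as pd

rng1 = pd.date_range('11:00:00', '11:00:30', freq='500ms')
df1 = pd.DataFrame({'A': range(1, 62), 'B': range(1000, 62000, 1000)}, index=rng1)

rng2 = pd.date_range('11:00:03', '11:01:03', freq='700ms')
df2 = pd.DataFrame({'Z': [10 * (i + 1) for i in range(len(rng2))]}, index=rng2)

# merge_asof (default direction='backward') takes, for each df1 timestamp,
# the most recent df2 row at or before it -- same as the tail(1) loop,
# but vectorized; timestamps before 11:00:03 get NaN
df1['C'] = pd.merge_asof(df1, df2, left_index=True, right_index=True)['Z'].to_numpy()
print(df1.head(10))
```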