I would like to loop through a list of dates to obtain the difference between each date and the date before it (e.g. the difference between 4/13 and 3/13).
I'm looking for a for-loop that can scan through pairs of consecutive dates.
import pandas as pd
import numpy as np
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),
'age': [10,np.nan,50,30]}
df1 = pd.DataFrame(raw_data, columns = ['date','age'])
df1
input data
df2=df1.T
df2['new']=df2.iloc[:,3]-df2.iloc[:,2]
Desired result:
raw_data = {'date' : pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),'age': [10,np.nan,50,30],
'diff': [10,-10,50,-20]}
output = pd.DataFrame(raw_data, columns = ['date','age','diff'])
output
You don't need a loop if you want to know the difference between two dates.
Maybe you should do this:
df1['difference'] = (df1['date'] - df1['date'].shift(1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
or this, which compares each date with the next one instead of the previous one:
df1['difference'] = (df1['date'] - df1['date'].shift(-1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
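As a possible shorthand for the shift-based expression above, Series.diff() subtracts the previous row from each row, and .dt.days turns the resulting timedeltas into day counts. This is only a minimal sketch on the question's dates; the column name difference is taken from the answer, and the first row has no predecessor, so it comes out as NaN.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"]
    ),
})

# diff() computes row minus previous row; .dt.days keeps only the day count.
df1["difference"] = df1["date"].diff().dt.days
print(df1)
```

For the difference to the next date instead, df1["date"].diff(-1) plays the role of shift(-1).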
I have this dataframe looking like the below dataframe.
import pandas as pd
data = [['yellow', '800test' ], ['red','900ui'], ['blue','900test'], ['indigo','700test'], ['black','400ui']]
df = pd.DataFrame(data, columns = ['Donor', 'value'])
In the value field, if a string contains say 'test', I'd like to divide these numbers by 1000. What would be the best way to do this?
Check the code below, which uses a lambda function:
df['value_2'] = df.apply(lambda x: str(int(x.value.replace('test',''))/1000)+'test' if x.value.find('test') > -1 else x.value, axis=1)
df
Output:
# Split each value into its numeric part and its text suffix
df["value_num"] = df.value.str.findall(r"\d+").str[0].astype(float)
df["value_suffix"] = df.value.str.findall(r"[^\d]+").str[0]
df.loc[df["value_suffix"] == "test", "value_num"] = df["value_num"]/1000
df["value"] = df["value_num"].astype(str) + df["value_suffix"]
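Another way to sketch the same idea is with Series.str.extract, which splits the digits and the suffix in one step, followed by Series.where to keep the original string for rows whose suffix is not 'test'. The names parts, is_test and scaled are illustrative only, and the pattern assumes every value is digits followed by non-digits, as in the sample data.

```python
import pandas as pd

data = [['yellow', '800test'], ['red', '900ui'], ['blue', '900test'],
        ['indigo', '700test'], ['black', '400ui']]
df = pd.DataFrame(data, columns=['Donor', 'value'])

# Named groups become the columns 'num' and 'suffix'.
parts = df['value'].str.extract(r'^(?P<num>\d+)(?P<suffix>\D+)$')
is_test = parts['suffix'] == 'test'

# Divide by 1000 and glue the suffix back on.
scaled = (parts['num'].astype(int) / 1000).astype(str) + parts['suffix']

# Use the scaled string only where the suffix is 'test'.
df['value_2'] = scaled.where(is_test, df['value'])
print(df)
```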
EDITED
I want to write an if condition that concatenates strings only in certain cases,
i.e. if cell A1 contains text in a specific format, then concatenate; otherwise leave it as is.
Example:
If the bill number looks like CM2/0000/, concatenate it with the date column (month-year); otherwise leave the bill number as it is.
Sample Data
You can create a function that does what you need and use df.apply() to run it on every row.
I use the example data from @Boomer's answer.
EDIT: You didn't show what you really have in the dataframe, and it seems bill_date holds datetimes, whereas I used strings. So I convert the strings to datetime to show how to work with that. It now needs .strftime('%m-%y') (or .dt.strftime('%m-%y') on a whole column) instead of .str[3:].str.replace('/','-'). I couldn't simply use str(x), because a datetime is rendered as 2019-09-15 00:00:00 rather than your 15/09/19.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')  # explicit day-first format
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
      bill_number bill_date
0  CM2/0000/09-19  15/09/19
1        CM2/0000  15/09/19
2  CM3/0000/09-19  15/09/19
3        CM3/0000  15/09/19
A second idea is to create a mask
mask = df['bill_number'].str.endswith('/')
and later use it for all values
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to assign the values correctly, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')  # explicit day-first format
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
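To illustrate the note about the left side: boolean indexing with df[mask] produces a new object, so assigning into df[mask]['bill_number'] changes that temporary object and not df itself. The following is a small sketch of that behaviour, not part of the answer's code; the variables after_chained and after_loc are illustrative, and the warning pandas emits for the chained form is silenced only to keep the output clean.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'bill_number': ['CM2/0000/', 'CM2/0000']})
mask = df['bill_number'].str.endswith('/')

# Chained indexing: the assignment lands on a temporary copy and is lost.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about this pattern
    df[mask]['bill_number'] = 'changed'
after_chained = df['bill_number'].tolist()

# .loc selects rows and column in one step, so df itself is updated.
df.loc[mask, 'bill_number'] = 'changed'
after_loc = df['bill_number'].tolist()

print(after_chained)
print(after_loc)
```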
A third idea is to use numpy.where()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')  # explicit day-first format
df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample, as @Mike67 suggested, but based on your information this is what I came up with. It's bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
from pandas import DataFrame, Series
dat = {'num': ['CM2/0000/','CM2/0000', 'CM3/0000/', 'CM3/0000',],
'date': ['15/09/19','15/09/19','15/09/19','15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/','-')
# Only the rows whose number ends with '/' get the date appended
mask = df['num'].str.endswith('/')
df.loc[mask, 'num'] = df['num'] + df['date']
print(df)
Results:
              num   date
0  CM2/0000/09-19  09-19
1        CM2/0000  09-19
2  CM3/0000/09-19  09-19
3        CM3/0000  09-19
Apologies if something similar has been asked before.
I have a task where I need a function that is fed a list of unix times, and a pandas df.
The pandas df has a column for unix time, a column for latitude, and a column for longitude.
I need to extract the latitude from the df where the df unix time matches the unix time in my list I pass to the function.
So far I have:
def nl_lat_array(pandas_df, unixtime_list):
    lat = dict()
    data = pandas_df
    for x, row in data.iterrows():
        if data[data['DateTime_Unix']] == i in unixtime_list:
            lat[i] = data[data['Latitude']]
    v=list(lat.values())
    nl_lat_array = np.array(v)
    return nl_lat_array
This results in the following error:
KeyError: "None of [Float64Index([1585403852.468, 1585403852.518, 1585403852.568, 1585403852.618,\n 1585403852.668, 1585403852.718, 1585403852.768, 1585403852.818,\n 1585403852.868, 1585403852.918,\n ...\n 1585508348.524, 1585508348.574, 1585508348.624, 1585508348.674,\n 1585508348.724, 1585508348.774, 1585508348.824, 1585508348.874,\n 1585508348.924, 1585508348.974],\n dtype='float64', length=2089945)] are in the [columns]"
However these values in the pandas array do exist in the list I am passing.
Any help would be greatly appreciated.
import pandas as pd
data = pd.DataFrame([[1,4,7],[2,5,8],[3,6,9]])
data.columns = ['time', 'lat', 'long']
time_list = [1,2]
d = data[data['time'].isin(time_list)]['lat'].values
# [4, 5]
You can do something like this (note that .values is an attribute, not a method):
filtered_data = data[data['DateTime_Unix'].isin(unixtime_list)]
filtered_data['Latitude'].values
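The KeyError in the question comes from data[data['DateTime_Unix']], which passes the column's values as if they were column labels. Putting the isin() approach from the answers into the function shape the question asked for might look like the sketch below. The sample rows and the Longitude column name are made up for illustration, and the match relies on exact float equality, which is an assumption that may not hold if the list and the column were produced by different calculations.

```python
import numpy as np
import pandas as pd

def nl_lat_array(pandas_df, unixtime_list):
    """Return the latitudes whose unix time appears in unixtime_list."""
    mask = pandas_df['DateTime_Unix'].isin(unixtime_list)
    return pandas_df.loc[mask, 'Latitude'].to_numpy()

# Hypothetical sample data in the layout described in the question.
df = pd.DataFrame({
    'DateTime_Unix': [1585403852.468, 1585403852.518, 1585403852.568],
    'Latitude': [52.1, 52.2, 52.3],
    'Longitude': [4.1, 4.2, 4.3],
})

result = nl_lat_array(df, [1585403852.468, 1585403852.568])
print(result)
```

If the timestamps only agree approximately, rounding both sides to a fixed number of decimals before calling isin() is one possible workaround.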
I have a pandas dataframe with two columns, the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill the df new columns with the number of dates in verification_date that have a difference of either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i]-x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row as it is calculated. You can do this by replacing the above lines with these:
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)
This sets a single value in row i and column over_360 or under_360. df.at replaces the older df.set_value, which was removed in pandas 1.0.
If you don't like using .at you can also use .loc (the replacement for .ix, which was also removed in pandas 1.0):
df.loc[i, 'over_360'] = len(over_360)
df.loc[i, 'under_360'] = len(under_360)
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
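For reference, here is a self-contained sketch of the per-row counting, assuming the grouped layout from the question (one action_date per row and a list of verification dates). The frame is built by hand rather than through the groupby, the helper name count_days is illustrative, and a difference of exactly 360 days is counted as "under", following the else branch in the question's loop.

```python
import pandas as pd

# Hand-built stand-in for the grouped frame from the question.
df = pd.DataFrame({
    'action_date': pd.to_datetime(['2017-01-01', '2017-01-03']),
    'verification_date': [
        list(pd.to_datetime(['2016-01-01', '2015-01-08'])),
        list(pd.to_datetime(['2017-01-01'])),
    ],
})

def count_days(row):
    # Day differences between the action date and each verification date.
    days = [(row['action_date'] - v).days for v in row['verification_date']]
    over = sum(d > 360 for d in days)
    return pd.Series({'over_360': over, 'under_360': len(days) - over})

# One result row per input row, appended as two new columns.
df = pd.concat([df, df.apply(count_days, axis=1)], axis=1)
print(df[['action_date', 'over_360', 'under_360']])
```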
I want to increase the date in one dataframe column by an integer value in another.
I receive TypeError: unsupported type for timedelta days component: numpy.int64
My dataframes look like this:
import pandas as pd
import numpy as np
import datetime as dt
dfa = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',1],
['5/15/17',1]],
columns = ['Start','Days'])
I format the 'Start' column to datetime with this code:
dfa['Start'] = dfa['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
dfb['Start'] = dfb['Start'].apply(lambda x:
dt.datetime.strptime(x,'%m/%d/%y'))
I try to change the values in the dfa dataframe. The dfb dataframe reference works for 'Days' but not for 'Start':
for i, row in dfb.iterrows():
    for j, row in dfa.iterrows():
        new = pd.DataFrame({"Start": dfa.loc[j,"Start"] + dt.timedelta(days=dfb.loc[i,"Days"]), "Days": dfa.loc[j,"Days"] - dfb.loc[i,"Days"]}, index = [j+1])
        dfa = pd.concat([dfa.loc[:j], new, dfa.loc[j+1:]]).reset_index(drop=True)
This is the key component that raises the error:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=dfb.loc[i,"Days"])
It works fine if I use:
"Start": dfa.loc[j,"Start"] + dt.timedelta(days=1)
but I need it to be taking that value from dfb, not a static integer.
IIUC (I changed the input values a bit to clarify what is going on):
import pandas as pd
dfa = pd.DataFrame([
['5/15/17',1],
['5/16/17',1]],
columns = ['Start','Days'])
dfb = pd.DataFrame([
['5/15/17',3],
['5/16/17',4]],
columns = ['Start','Days'])
dfa['Start'] = pd.to_datetime(dfa['Start'])
dfb['Start'] = pd.to_datetime(dfb['Start'])
dfa['Start'] = dfa['Start'] + dfb['Days'].apply(pd.Timedelta,unit='D')
print(dfa)
Output:
       Start  Days
0 2017-05-18     1
1 2017-05-20     1
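The TypeError in the question arises because the value read from the Days column is a numpy.int64, which datetime.timedelta rejects as reported there. Two possible ways around it are sketched below on the answer's sample data: casting a single value with int(), or converting the whole column with pd.to_timedelta as a vectorized alternative to .apply(pd.Timedelta, unit='D'). The variable first_shifted is illustrative only.

```python
import datetime as dt
import pandas as pd

dfa = pd.DataFrame([['5/15/17', 1], ['5/16/17', 1]], columns=['Start', 'Days'])
dfb = pd.DataFrame([['5/15/17', 3], ['5/16/17', 4]], columns=['Start', 'Days'])
dfa['Start'] = pd.to_datetime(dfa['Start'], format='%m/%d/%y')

# Scalar case: cast the numpy integer to a plain int first.
first_shifted = dfa.loc[0, 'Start'] + dt.timedelta(days=int(dfb.loc[0, 'Days']))

# Column case: convert all day counts to timedeltas at once.
dfa['Start'] = dfa['Start'] + pd.to_timedelta(dfb['Days'], unit='D')
print(dfa)
```

The column form aligns dfa and dfb on their index, so it assumes both frames share the same row labels.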