How to declare arguments as datetime64 inside a function? - python

I am trying to apply the following function, which takes values from two datetime64 pandas dataframe columns as arguments:
from datetime import datetime
import pandas as pd

def set_dif_months_na(start_date, end_date):
    if (pd.isnull(start_date) and pd.notnull(end_date)):
        return None
    elif (pd.notnull(start_date) and pd.isnull(end_date)):
        return None
    elif (pd.isnull(start_date) and pd.isnull(end_date)):
        return None
    else:
        start_date = datetime.strptime(start_date, "%d/%m/%Y")
        end_date = datetime.strptime(end_date, "%d/%m/%Y")
        return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))
This function is intended to return the month difference as an integer given two dates; if either date is missing, it has to return None.
When I apply it to create a new pandas dataframe column like this:
df['new_col'] = [set_dif_months_na(date1, date2)
                 for date1, date2 in
                 zip(df['date1'], df['date2'])]
The following error arises:
TypeError: strptime() argument 1 must be str, not Timestamp
How could I adjust the function in order to properly apply it over a new pandas dataframe column?

You see, pandas stores dates as numpy.datetime64 values and hands them to your function as Timestamp objects that are already parsed, whereas datetime.strptime only accepts strings; that is why the call fails.
There are a couple of different solutions, but if you want to work with datetime.datetime, which is more readable in my opinion, you can do something like this. First, define a function that converts a numpy.datetime64 value to datetime.datetime:
import datetime
import numpy as np

def numpy2datetime(date):
    # Seconds elapsed since the Unix epoch, as a float
    seconds = (date - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
    return datetime.datetime.utcfromtimestamp(seconds)
Then you should be able to do what you want by changing these lines of your function from:
        start_date = datetime.strptime(start_date, "%d/%m/%Y")
        end_date = datetime.strptime(end_date, "%d/%m/%Y")
to
        start_date = numpy2datetime(start_date)
        end_date = numpy2datetime(end_date)
This should work. However, I have some additional suggestions for you. First, you can collapse all your if and elif branches into a single one by using the or logical operator:
    if pd.isnull(start_date) or pd.isnull(end_date):
        return None
    else:
        start_date = numpy2datetime(start_date)
        end_date = numpy2datetime(end_date)
        return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))
The last one concerns your list comprehension. You don't need zip at all, since both columns are in the same dataframe. You can simply do:
df['new_col'] = [set_dif_months_na(date1, date2)
                 for date1, date2 in
                 df[['date1', 'date2']].values]
I don't know if it's faster, but at least it's clearer.
Hope it's useful. And let us know if you have any further issues.
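A note on the conversion step: when the columns are iterated with zip, as in the question, each item arrives as a pandas Timestamp, which is a subclass of datetime.datetime and already exposes .year and .month. Under that assumption, a minimal sketch of the function needs no conversion at all (the sample dates below are made up for illustration):

```python
import pandas as pd

def set_dif_months_na(start_date, end_date):
    # Return None when either date is missing (NaT, None or NaN)
    if pd.isnull(start_date) or pd.isnull(end_date):
        return None
    # Timestamp exposes .year and .month directly, so no parsing is needed
    return abs((end_date.year - start_date.year) * 12
               + (end_date.month - start_date.month))

# Illustrative data, not taken from the question
df = pd.DataFrame({
    "date1": pd.to_datetime(["2020-01-15", None]),
    "date2": pd.to_datetime(["2020-04-01", "2021-01-01"]),
})
new_col = [set_dif_months_na(d1, d2) for d1, d2 in zip(df["date1"], df["date2"])]
print(new_col)  # [3, None]
```

If the values are taken from .values instead, they may arrive as raw numpy.datetime64 items, in which case a conversion such as numpy2datetime is still needed.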

By changing the start_date and end_date conversion from strptime to pd.to_datetime, the function worked without any error:
def set_dif_months_na(start_date, end_date):
    if (pd.isnull(start_date) and pd.notnull(end_date)):
        return None
    elif (pd.notnull(start_date) and pd.isnull(end_date)):
        return None
    elif (pd.isnull(start_date) and pd.isnull(end_date)):
        return None
    else:
        start_date = pd.to_datetime(start_date, format="%d/%m/%Y")
        end_date = pd.to_datetime(end_date, format="%d/%m/%Y")
        return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))
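Since both columns are already datetime64, the same month difference can probably be computed without a Python-level loop by using the .dt accessor. This is only a sketch under that assumption; the sample frame is made up, and the nullable Int64 dtype is one possible way to keep integers alongside missing values (it yields pd.NA rather than None):

```python
import pandas as pd

# Illustrative data, not taken from the question
df = pd.DataFrame({
    "date1": pd.to_datetime(["01/01/2020", None, "15/03/2021"], format="%d/%m/%Y"),
    "date2": pd.to_datetime(["20/04/2020", "01/01/2021", None], format="%d/%m/%Y"),
})

# Year and month parts become NaN wherever the date is NaT,
# so rows with a missing date propagate to a missing result
months = ((df["date2"].dt.year - df["date1"].dt.year) * 12
          + (df["date2"].dt.month - df["date1"].dt.month)).abs()

# Nullable integer dtype keeps whole numbers next to missing values
df["new_col"] = months.astype("Int64")
print(df)
```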

Related

Create New Column in Pandas Using Values from Two Columns

I hope everyone is well. I am trying to create a new column ("Task Start Date") that requires the values from two different columns ("Task Due Date" and "Days"). I am trying to use the apply function, but I can't figure out the correct syntax.
from datetime import datetime, timedelta

def get_start_date(end_date, day):
    date_format = '%d-%b-%y'
    date = datetime.strptime(end_date, date_format)
    start_date = date - timedelta(days = day)
    return start_date
asana_filtered["Task Start Date"] = asana_filtered.apply(get_start_date(["Task Due Date"], ["Days"]))
Found it! The solution comes from the question "python pandas- apply function with two arguments to columns":
asana_filtered["Task Start Date"] = asana_filtered.apply(lambda x: get_start_date(x["Task Due Date"], x["Days"]), axis=1)
You can use the native pandas time converters:
df["Task Start Date"] = pd.to_datetime(df["Task Due Date"]) - pd.to_timedelta(df["Days"], unit="D")
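As a small illustration of the vectorized approach, here is a sketch that assumes the due dates are strings in the '%d-%b-%y' format used in the question and that "Days" holds whole numbers. The sample rows are invented, and %b relies on English month abbreviations under the default locale:

```python
import pandas as pd

# Illustrative data, not taken from the question
asana_filtered = pd.DataFrame({
    "Task Due Date": ["15-Mar-21", "01-Jan-22"],
    "Days": [5, 1],
})

# Parse the strings once, then subtract a per-row number of days
due = pd.to_datetime(asana_filtered["Task Due Date"], format="%d-%b-%y")
asana_filtered["Task Start Date"] = due - pd.to_timedelta(asana_filtered["Days"], unit="D")
print(asana_filtered)
```

Passing an explicit format avoids relying on pandas guessing the layout of the date strings.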

How to check the date in between the tuple, dates as start and end

I have a function that creates a dictionary like the one below.
x = {'filename': {'filetype': ('5/6/2019', '12/31/2019')}, 'filename2': {'filetype': ('3/24/2018', '5/6/2019')}}
I need to create a new function that takes a date and a file type and returns the filename whose date tuple contains that date.
def fn(date, filetype):
    # I'm trying to pass a date as the first argument.
    # The function should check whether the date falls between the start and end
    # dates of the tuple in the dictionary values above.
    # If it does, I need to return the file name.
    return filename
Question:
Is it possible to check whether a date falls between the dates in a tuple?
You should convert to datetime objects:
from datetime import datetime

x = {'filename': {'filetype': ('5/6/2019', '12/31/2019')}, 'filename2': {'filetype': ('3/24/2018', '5/6/2019')}}

def fn(dateobj, filetype):
    dateobj = datetime.strptime(dateobj, '%m/%d/%Y')
    startdate = datetime.strptime(filetype[0], '%m/%d/%Y')
    enddate = datetime.strptime(filetype[1], '%m/%d/%Y')
    return startdate <= dateobj <= enddate

print(fn('6/6/2019', x['filename']['filetype']))
print(fn('4/6/2019', x['filename']['filetype']))
This will print:
True
False
As people mentioned in the comments, transforming the string dates to datetime objects is recommended. One way to do it is:
from datetime import datetime
new_date = datetime.strptime('12/31/2019', '%m/%d/%Y')
Assuming all date strings have been converted to datetime objects, your function becomes:
def fn(date, filetype):
    # "ranges" maps each file type to its (start, end) tuple
    for filename, ranges in x.items():
        if filetype in ranges:
            start, end = ranges[filetype]
            if start <= date <= end:
                return filename
This will return the filename if the date lies within the range, and None otherwise.
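Putting the two steps together, a self-contained sketch might look like the following. It assumes every date is a string in month/day/year form, as in the question's dictionary; the name find_filename and the files parameter are hypothetical:

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"

x = {
    "filename": {"filetype": ("5/6/2019", "12/31/2019")},
    "filename2": {"filetype": ("3/24/2018", "5/6/2019")},
}

def find_filename(date_str, filetype, files=x):
    """Return the first filename whose range for filetype contains date_str."""
    date = datetime.strptime(date_str, DATE_FORMAT)
    for filename, ranges in files.items():
        if filetype not in ranges:
            continue
        # Convert both ends of the tuple before comparing
        start, end = (datetime.strptime(s, DATE_FORMAT) for s in ranges[filetype])
        if start <= date <= end:
            return filename
    return None

print(find_filename("6/6/2019", "filetype"))   # filename
print(find_filename("4/6/2018", "filetype"))   # filename2
print(find_filename("1/1/2017", "filetype"))   # None
```

Because both bounds are inclusive, a date such as 5/6/2019 matches both entries, and the first one in dictionary order is returned.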
Use split to convert dates to 3 numeric values in this order: year, month, day. Then you can compare the dates as tuples.
def convert(datestr):
    m, d, y = datestr.split('/')
    return (int(y), int(m), int(d))

date1 = convert('12/31/2018')
date2 = convert('1/1/2019')
print(date1 < date2)
The same approach works with lists, but the two types must not be mixed: either all dates in a comparison are tuples, or all of them are lists.
For date intervals, simply test (e.g. in an if statement):
begin <= date <= end
where all 3 values are as described above.

Add one day to date (string format) when the input string format is not known

I have dates in the format
day/month/year
My task is to define a function that takes a date and returns the date increased by one day.
Example:
next_day("13/1/2018") returns 14/1/2018
next_day("31/3/2018") returns 1/4/2018
How can I do that? I don't know how to do this when the function takes a whole date string rather than separate day, month, and year values.
This is one way using the third-party dateutil library and datetime from the standard library.
import datetime
from dateutil import parser

def add_day(x):
    try:
        new = parser.parse(x) + datetime.timedelta(days=1)
    except ValueError:
        new = parser.parse(x, dayfirst=True) + datetime.timedelta(days=1)
    return new.strftime('%d/%m/%Y').lstrip('0').replace('/0', '/')

add_day('13/1/2018')  # '14/1/2018'
add_day('31/3/2018')  # '1/4/2018'
Trying to perform the same logic with datetime will be more restrictive, which is probably not what you want since it's not obvious you can guarantee the format of your input dates.
Explanation
Try parsing sequentially with month first (default), then day first.
Add a day using datetime.timedelta.
Use string formatting to remove leading zeros.
Pure datetime solution
import datetime

def add_day(x):
    try:
        new = datetime.datetime.strptime(x, '%m/%d/%Y') + datetime.timedelta(days=1)
    except ValueError:
        new = datetime.datetime.strptime(x, '%d/%m/%Y') + datetime.timedelta(days=1)
    return new.strftime('%d/%m/%Y').lstrip('0').replace('/0', '/')

add_day('13/1/2018')  # '14/1/2018'
add_day('31/3/2018')  # '1/4/2018'
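One caveat with trying month-first before day-first: an ambiguous input such as 1/2/2018 is read as January 2 rather than 1 February. If the input can be trusted to be day/month/year, as the question states, a stricter sketch that parses day-first only may be safer. The name next_day comes from the question; building the output from the integer fields is just one way to avoid leading zeros:

```python
from datetime import datetime, timedelta

def next_day(date_str):
    # Assumes the input is always day/month/year, as stated in the question
    day = datetime.strptime(date_str, "%d/%m/%Y") + timedelta(days=1)
    # Format from integer fields so that no leading zeros appear
    return f"{day.day}/{day.month}/{day.year}"

print(next_day("13/1/2018"))   # 14/1/2018
print(next_day("31/3/2018"))   # 1/4/2018
print(next_day("1/2/2018"))    # 2/2/2018
```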
You can try this extension (written in Swift) to return the current date at least.
extension Date {
    var withWeekDayMonthDayAndYear: String {
        let formatter = DateFormatter()
        formatter.timeZone = TimeZone(abbreviation: "EST")
        formatter.dateFormat = "EEEE, MMMM dd, yyyy"
        return formatter.string(from: self)
    }
}
Then use the extension:
((Date().withWeekDayMonthDayAndYear))
It's a start.

PySpark: filtering a DataFrame by date field in range where date is string

My DataFrame contains one field which is a date, stored as a string, for example:
'2015-07-02T11:22:21.050Z'
I need to filter the DataFrame on the date to get only the records from the last week.
So, I was trying a map approach where I transformed the string dates to datetime objects with strptime:
def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)

df = df.map(map_to_datetime)
and then I would apply a filter:
df.filter(lambda row:
          row.date >= (datetime.today() - timedelta(days=7)))
I managed to get the mapping working, but the filter fails with
TypeError: condition should be string or Column
Is there a way to make this filter work, or should I change the approach, and how?
I figured out a way to solve my problem by using the SparkSQL API with dates in String format.
Here is an example:
last_week = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = df.where(df.date >= last_week)
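The comparison above presumably works because ISO 8601 timestamps in a consistent format sort lexicographically in chronological order, so a string cutoff can be compared with the string column directly. A plain-Python sketch of the same idea follows; the helper in_last_week is hypothetical, and today is passed in explicitly so that the result is reproducible:

```python
from datetime import datetime, timedelta

def in_last_week(date_str, today):
    # Build the cutoff as a 'YYYY-MM-DD' string, as in the answer above
    cutoff = (today - timedelta(days=7)).strftime("%Y-%m-%d")
    # Plain string comparison: valid because the most significant
    # fields (year, month, day) come first in ISO 8601
    return date_str >= cutoff

today = datetime(2015, 7, 5)
print(in_last_week("2015-07-02T11:22:21.050Z", today))  # True
print(in_last_week("2015-06-20T08:00:00.000Z", today))  # False
```

This shortcut stops being reliable if the strings mix formats or time zone offsets; casting the column to a date or timestamp type avoids that.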
Spark >= 1.5
You can use INTERVAL:
from pyspark.sql.functions import col, current_date, expr
# df_casted is built in the "Spark < 1.5" section below
df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
Spark < 1.5
You can solve this without using worker-side Python code and switching to RDDs. First of all, since you use ISO 8601 strings, your data can be directly cast to date or timestamp:
from pyspark.sql.functions import col

df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

df_casted = df.select("*",
    col("d_str").cast("date").alias("dt"),
    col("d_str").cast("timestamp").alias("ts"))
This will save one roundtrip between JVM and Python. There are also a few ways you can approach the second part. Date only:
from pyspark.sql.functions import current_date, datediff, unix_timestamp
df_casted.where(datediff(current_date(), col("dt")) < 7)
Timestamps:
def days(i: int) -> int:
    return 60 * 60 * 24 * i

df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))
You can also take a look at current_timestamp and date_sub.
Note: I would avoid using DataFrame.map. It is better to use DataFrame.rdd.map instead. It will save you some work when switching to 2.0+.
from datetime import datetime, timedelta
last_7_days = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = signal1.where(signal1.publication_date >= last_7_days)
You need to import datetime and timedelta for this.

Difference between two dates in Python

I have two different dates and I want to know the difference in days between them. The format of the date is YYYY-MM-DD.
I have a function that can add or subtract a given number of days to or from a date:
import time

def addonDays(a, x):
    ret = time.strftime("%Y-%m-%d",time.localtime(time.mktime(time.strptime(a,"%Y-%m-%d"))+x*3600*24+3600))
    return ret
where a is the date and x is the number of days I want to add. The result is another date.
I need a function where I can give two dates and the result would be an int with the date difference in days.
Use - to get the difference between two datetime objects and take the days member.
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
Another short solution:
from datetime import date

def diff_dates(date1, date2):
    return abs(date2-date1).days

def main():
    d1 = date(2013,1,1)
    d2 = date(2013,9,13)
    result1 = diff_dates(d2, d1)
    print('{} days between {} and {}'.format(result1, d1, d2))
    print("Happy programmer's day!")

main()
You can use the third-party library dateutil, which is an extension for the built-in datetime.
Parsing dates with the parser module is very straightforward:
from dateutil import parser
date1 = parser.parse('2019-08-01')
date2 = parser.parse('2019-08-20')
diff = date2 - date1
print(diff)
print(diff.days)
Answer based on the one from this deleted duplicate
I tried the code posted by larsmans above, but there are a couple of problems:
1) The code as is will throw the error mentioned by mauguerra.
2) If you change the code to the following:
...
    d1 = d1.strftime("%Y-%m-%d")
    d2 = d2.strftime("%Y-%m-%d")
    return abs((d2 - d1).days)
this will convert your datetime objects to strings, which causes two problems:
1) Trying to do d2 - d1 will fail, as you cannot use the minus operator on strings, and
2) the first line of the above answer says to use the - operator on two datetime objects, but you have just converted them to strings.
What I found is that you literally only need the following:
import datetime
end_date = datetime.datetime.utcnow()
start_date = end_date - datetime.timedelta(days=8)
difference_in_days = abs((end_date - start_date).days)
print(difference_in_days)
Try this:
import numpy as np
import pandas as pd

# Raw string so that the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'C:\Users\Desktop\Data Exploration.csv')
data.head(5)
first = data['1st Gift']
last = data['Last Gift']
maxi = data['Largest Gift']
l_1 = np.mean(first) - 3 * np.std(first)
u_1 = np.mean(first) + 3 * np.std(first)
# Flag values more than 3 standard deviations away from the mean
m = np.abs(data['1st Gift'] - np.mean(data['1st Gift'])) > 3 * np.std(data['1st Gift'])
m.value_counts()
l = first[m]
# Replace the flagged values with mean + 3 standard deviations
data.loc[m, '1st Gift'] = np.mean(data['1st Gift']) + 3 * np.std(data['1st Gift'])
data['1st Gift'].head()
m = np.abs(data['Last Gift'] - np.mean(data['Last Gift'])) > 3 * np.std(data['Last Gift'])
m.value_counts()
l = last[m]
data.loc[m, 'Last Gift'] = np.mean(data['Last Gift']) + 3 * np.std(data['Last Gift'])
data['Last Gift'].head()
I tried a couple of approaches, but ended up using something as simple as this (in Python 3):
from datetime import datetime
df['difference_in_datetime'] = abs(df['end_datetime'] - df['start_datetime'])
If your start_datetime and end_datetime columns are in datetime64[ns] format, pandas subtracts them directly and returns the difference in days plus time, in timedelta64[ns] format.
If you want to see only the difference in days, you can first extract the date portion of start_datetime and end_datetime (the time portion can be extracted the same way with .dt.time):
df['start_date'] = df['start_datetime'].dt.date
df['end_date'] = df['end_datetime'].dt.date
And then run:
df['difference_in_days'] = abs(df['end_date'] - df['start_date'])
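If an integer number of days is wanted rather than a timedelta, one possible variant is sketched below. It assumes both columns are datetime64[ns]; the sample values are invented. Here .dt.normalize() drops the time of day while keeping the datetime64 dtype, and .dt.days extracts the whole-day count:

```python
import pandas as pd

# Illustrative data, not taken from the answer
df = pd.DataFrame({
    "start_datetime": pd.to_datetime(["2019-08-01 10:00", "2019-08-20 23:00"]),
    "end_datetime": pd.to_datetime(["2019-08-20 09:00", "2019-08-01 01:00"]),
})

# Compare calendar dates only, then take the absolute whole-day difference
delta = df["end_datetime"].dt.normalize() - df["start_datetime"].dt.normalize()
df["difference_in_days"] = delta.abs().dt.days
print(df["difference_in_days"].tolist())  # [19, 19]
```

Without the normalize step, the first row would give 18, since less than 19 full days separate the two timestamps.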
pd.date_range('2019-01-01', '2019-02-01').shape[0]  # counts both endpoints, so this is the day difference plus one (32 here)
