Python pandas only reads days when comparing dates from csv - python

so let's say this is my code:
df = pd.read_table('file_name', sep=';')
today = pd.Timestamp("today").strftime("%d.%m.%y")
df = df[(df['column1'] < today)]
df
Here's the table from the csv file:
Column 1
27.02.2018
05.11.2018
22.05.2018
01.11.2018
01.08.2018
01.08.2018
16.10.2018
22.08.2018
21.11.2018
As you can see, I imported a table from a csv file. I only need to see dates before today (16.10.2018), but when I run the code this is what I get:
Column 1
05.11.2018
01.11.2018
01.08.2018
01.08.2018
This means Python is only looking at the days and ignoring the months, which is wrong. I need it to understand that these are dates, not just numbers. How do I achieve that?
PS I'm new to Python

You should convert your column to the date type, not strings, since strings are compared lexicographically.
You can thus convert it with:
# convert the strings to date(time) objects
df['column1'] = pd.to_datetime(df['column1'], format='%d.%m.%Y')
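Then you can compare it with a date. Wrapping the date in pd.Timestamp keeps the comparison valid on pandas versions that refuse to compare a datetime column with a plain datetime.date. Run on 16.10.2018, this gives:
>>> from datetime import date
>>> df[df['column1'] < pd.Timestamp(date.today())]
column1
0 2018-02-27
2 2018-05-22
4 2018-08-01
5 2018-08-01
7 2018-08-22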
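To see why the string comparison misbehaves, here is a minimal sketch that runs both comparisons on a few of the dates from the question. The fixed reference date and the variable names are assumptions made for illustration, so that the result does not depend on when the code is run.

```python
import pandas as pd

raw = pd.Series(["27.02.2018", "05.11.2018", "22.05.2018", "01.11.2018", "21.11.2018"])
today = pd.Timestamp("2018-10-16")  # assumed fixed "today" for a reproducible result

# String comparison works character by character, so the day digits decide first
as_text = raw[raw < today.strftime("%d.%m.%Y")]

# Datetime comparison is chronological
parsed = pd.to_datetime(raw, format="%d.%m.%Y")
as_dates = raw[parsed < today]

print(as_text.tolist())   # dates whose text sorts before "16.10.2018"
print(as_dates.tolist())  # dates that really are before 16 October 2018
```

The first list contains the November dates because "05" and "01" sort before "16", and the second list contains only the February and May dates.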

Related

Converting column type 'datetime64[ns]' to datetime in Python3

I would like to compare two dates in Python 3: one comes from a pandas dataframe and the other is calculated. I want to keep the rows of the dataframe whose 'Publication_date' is less than or equal to today's date and greater than the date 10 years ago.
The pandas df looks like this:
PMID Publication_date
0 31611796 2019-09-27
1 33348808 2020-12-17
2 12089324 2002-06-27
3 31028872 2019-04-25
4 26805781 2016-01-21
I am doing the comparison as shown below.
df[(df['Publication_date']> datetime.date.today() - datetime.timedelta(days=3650)) &
(df['Publication_date']<= datetime.date.today())]
The date filter above, when applied to the df, should exclude the third row (index 2, published in 2002).
The 'Publication_date' column has type 'string'. I converted it to a date using the line below in my script.
df_phenotype['publication_date']= pd.to_datetime(df_phenotype['publication_date'])
But this changes the column type to 'datetime64[ns]', which makes the comparison between 'datetime64[ns]' and datetime.date incompatible.
How can I perform this comparison?
Any help is highly appreciated.
You can work with datetimes using pandas alone. Timestamp.floor removes the time part of a datetime (sets it to 00:00:00):
df['Publication_date']= pd.to_datetime(df['Publication_date'])
today = pd.to_datetime('now').floor('d')
df1 = df[(df['Publication_date']> today - pd.Timedelta(days=3650)) &
(df['Publication_date']<= today)]
You can also use a 10-year offset, which follows the calendar instead of assuming 365-day years:
today = pd.to_datetime('now').floor('d')
df1 = df[(df['Publication_date']> today - pd.offsets.DateOffset(years=10)) &
(df['Publication_date']<= today)]
print (df1)
PMID Publication_date
0 31611796 2019-09-27
1 33348808 2020-12-17
3 31028872 2019-04-25
4 26805781 2016-01-21
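The same window filter can be wrapped in a small helper, sketched here with the sample data from the question. The function name last_ten_years and the fixed reference date are assumptions for illustration; in real use you would pass pd.Timestamp.now() instead.

```python
import pandas as pd

df = pd.DataFrame({
    "PMID": [31611796, 33348808, 12089324, 31028872, 26805781],
    "Publication_date": ["2019-09-27", "2020-12-17", "2002-06-27",
                         "2019-04-25", "2016-01-21"],
})
df["Publication_date"] = pd.to_datetime(df["Publication_date"])

def last_ten_years(frame, reference):
    """Keep rows dated after (reference - 10 years) and up to reference."""
    reference = pd.Timestamp(reference).normalize()  # drop the time of day
    start = reference - pd.DateOffset(years=10)
    dates = frame["Publication_date"]
    return frame[(dates > start) & (dates <= reference)]

# hypothetical reference date, fixed so the result is reproducible
recent = last_ten_years(df, "2021-01-15")
print(recent)
```

Both sides of each comparison are pandas timestamps, so no datetime.date objects are involved and the dtype mismatch from the question does not arise.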

Converting Integer column to Date Column

In my import file one of the columns holds a date, but when I view the same column in the dataframe it has been converted into an integer. How do I convert it back to the date format?
In the data file the column looks like 'Oct-17', but in the dataframe it looks like '43009'. How do I change it in Python from integer to date so my data looks like 'Oct-17'?
I appreciate your help.
Use xlrd once you have read the data into pandas:
import pandas as pd
import xlrd
df = pd.DataFrame({'Date_String':[43009,43000,42345,43134,43917]})
df['Date'] = df['Date_String'].apply(lambda x: xlrd.xldate.xldate_as_datetime(x, 0))
df['MMMYY'] =df['Date'].apply(lambda x: x.strftime('%b-%y'))
print(df)
Date_String Date MMMYY
0 43009 2017-10-01 Oct-17
1 43000 2017-09-22 Sep-17
2 42345 2015-12-07 Dec-15
3 43134 2018-02-03 Feb-18
4 43917 2020-03-27 Mar-20
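If you would rather not depend on xlrd, a pandas-only sketch is possible, assuming the integers are Excel serial dates from the default 1900 date system (which is what the xlrd call with datemode 0 assumes as well). The origin '1899-12-30' is the usual offset for such serials after February 1900; the variable names are illustrative.

```python
import pandas as pd

serials = pd.Series([43009, 43000, 42345, 43134, 43917])

# Treat each integer as a number of days counted from the Excel epoch
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Format as abbreviated month and two-digit year, e.g. 'Oct-17' in an English locale
labels = dates.dt.strftime("%b-%y")

print(pd.DataFrame({"Date_String": serials, "Date": dates, "MMMYY": labels}))
```

This stays vectorised, so it avoids calling a Python function once per row.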

calculate date difference between today's date and pandas date series

Want to calculate the difference of days between pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python and run into a lot of syntax errors whenever I apply a function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns (when run around the start of 2019):
0 True
1 False
2 True
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try as follows:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime with to_datetime:
>>> df['col1'] = pd.to_datetime(df['col1'])
Set the current datetime in order to get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
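If you want whole days as plain integers rather than timedeltas, the .dt.days accessor can be applied to the difference. This is a small sketch with a fixed, hypothetical "today" so the numbers are reproducible; swap in pd.Timestamp.now().normalize() for real use.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2013-02-16", "2013-01-29", "2013-03-21"]))
today = pd.Timestamp("2013-04-01")  # assumed reference date for illustration

# Subtracting the series from today gives positive values for past dates
elapsed = (today - dates).dt.days

print(elapsed.tolist())  # [44, 62, 11]
```

Subtracting in this order (today minus date) avoids the negative values shown above.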
With numpy you can solve it as described in "difference-two-dates-days-weeks-months-years-pandas-python-2". The bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# use 'D' for days or 'W' for weeks; months ('M') and years ('Y') have no fixed length, so recent pandas versions may reject those units
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as int and not as float, use
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before, but this suggests you can pass your pandas date series (a list-like object) to pd.to_datetime and iterate over the result, where each element is a Timestamp.
Assuming that is correct, the following function takes such a series and returns a list of timedeltas (objects representing the difference between two datetimes).
from datetime import datetime
import pandas as pd

def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Convert with pd.to_datetime, then use a list comprehension to calculate timedeltas.
    return [now - element for element in pd.to_datetime(pandas_series)]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)

Datetime comparisons in python

I have a file with two different dates: one has a timestamp and one does not. I need to read the file, disregard the timestamp, and compare the two dates. If the two dates are the same then I need to spit it to the output file and disregard any other rows.
I'm having trouble knowing if I should be using a datetime function on the input and formatting the date there and then simply seeing if the two are equivalent? Or should I be using a timedelta?
I've tried a couple different ways but haven't had success.
df = pd.read_csv("File.csv", dtype={'DATETIMESTAMP': np.datetime64, 'DATE':np.datetime64})
Gives me: TypeError: the dtype <M8 is not supported for parsing, pass this column using parse_dates instead
I've also tried to just remove the timestamp and then compare, but the strings end up with different date formats and that doesn't work either.
df['RemoveTimestamp'] = df['DATETIMESTAMP'].apply(lambda x: x[:10])
df = df[df['RemoveTimestamp'] == df['DATE']]
Any guidance appreciated.
Here is my sample input CSV file:
"DATE", "DATETIMESTAMP"
"8/6/2014","2014-08-06T10:18:38.000Z"
"1/15/2013","2013-01-15T08:57:38.000Z"
"3/7/2013","2013-03-07T16:57:18.000Z"
"12/4/2012","2012-12-04T10:59:37.000Z"
"5/6/2014","2014-05-06T11:07:46.000Z"
"2/13/2013","2013-02-13T15:51:42.000Z"
import pandas as pd
import numpy as np
# your data, both columns are in string
# ================================================
df = pd.read_csv('sample_data.csv')
df
DATE DATETIMESTAMP
0 8/6/2014 2014-08-06T10:18:38.000Z
1 1/15/2013 2013-01-15T08:57:38.000Z
2 3/7/2013 2013-03-07T16:57:18.000Z
3 12/4/2012 2012-12-04T10:59:37.000Z
4 5/6/2014 2014-05-06T11:07:46.000Z
5 2/13/2013 2013-02-13T15:51:42.000Z
# processing
# =================================================
# convert string to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['DATETIMESTAMP'] = pd.to_datetime(df['DATETIMESTAMP'])
# cast timestamp to date
df['DATETIMESTAMP'] = df['DATETIMESTAMP'].values.astype('<M8[D]')
DATE DATETIMESTAMP
0 2014-08-06 2014-08-06
1 2013-01-15 2013-01-15
2 2013-03-07 2013-03-07
3 2012-12-04 2012-12-04
4 2014-05-06 2014-05-06
5 2013-02-13 2013-02-13
# compare
df['DATE'] == df['DATETIMESTAMP']
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
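The same comparison can also be sketched with the .dt accessor instead of a numpy cast. The data below is a shortened copy of the sample file, with the last timestamp changed by one day (an assumption made here so that the filter has something to drop); the column name STAMP_DATE is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["8/6/2014", "1/15/2013", "3/7/2013"],
    "DATETIMESTAMP": ["2014-08-06T10:18:38.000Z",
                      "2013-01-15T08:57:38.000Z",
                      "2013-03-08T16:57:18.000Z"],  # altered to create a mismatch
})

# Parse both columns; the trailing 'Z' marks the timestamps as UTC
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%Y")
stamps = pd.to_datetime(df["DATETIMESTAMP"], utc=True)

# Drop the time zone, then set the time of day to midnight
df["STAMP_DATE"] = stamps.dt.tz_localize(None).dt.normalize()

# Keep only the rows where the two calendar dates agree
matches = df[df["DATE"] == df["STAMP_DATE"]]
print(matches)
```

The filtered frame can then be written out with matches.to_csv(...), which covers the "output file" step from the question.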
How about:
import time

filename = 'dates.csv'
with open(filename) as f:
    contents = f.readlines()

# skip the header row, then compare the two dates on each line
for line in contents[1:]:
    date1, date2 = line.split(',')
    date1 = date1.strip('"')
    # keep only the date part in front of the 'T'
    date2 = date2.split('T')[0].strip('"')
    date1a = time.strftime("%Y-%m-%d", time.strptime(date1, "%m/%d/%Y"))
    if date1a == date2:
        print(line, end='')

get subset dataframe by date

I have the following subset with a starting date (DD/MM/YYYY) and Amount
Start Date Amount
1 01/01/2013 20
2 02/05/2007 10
3 01/05/2004 15
4 01/06/2014 20
5 17/08/2008 21
I'd like to create a subset of this dataframe containing only the rows where the Start Date day is 01:
Start Date Amount
1 01/01/2013 20
3 01/05/2004 15
4 01/06/2014 20
I've tried to loop through the table and use the index but couldn't find a suitable way to iterate through a dataframe rows.
Assuming your dates are already datetime, the following should work. If they are strings, you can convert them using to_datetime: df['Start Date'] = pd.to_datetime(df['Start Date']). Your dates are day-first, so you may also need to pass dayfirst=True. If you imported the data using read_csv, you could have done this at the point of import: df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True), where n is the 0-based index of the column, so for the first column pass parse_dates=[0].
One method is to apply a lambda to the column and use the boolean series it returns to index the dataframe:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
Not sure if there is a built in method that doesn't involve setting this to be the index which will convert it into a timeseries index.
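A built-in route does exist for datetime columns: the .dt accessor exposes the day of the month without touching the index. Here is a minimal sketch on the question's data, assuming the strings are day-first as stated; the variable name first_of_month is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Start Date": ["01/01/2013", "02/05/2007", "01/05/2004",
                       "01/06/2014", "17/08/2008"],
        "Amount": [20, 10, 15, 20, 21],
    },
    index=[1, 2, 3, 4, 5],
)

# Parse the day-first strings explicitly so 02/05 is read as 2 May
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%d/%m/%Y")

# Vectorised day-of-month test, no lambda and no index change needed
first_of_month = df[df["Start Date"].dt.day == 1]
print(first_of_month)
```

This keeps rows 1, 3 and 4, matching the desired subset in the question.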
