Date structure changing in Python when I extract data from CSV

I am pulling data from a CSV file. The Date column is in the format "dd/mm/YYYY", but when I convert the dates from object to datetime format some of them change unexpectedly.
In the example below, the last three entries are for 7th March, 6th March and 6th March. After converting to datetime, those dates changed to 3rd July, 3rd June and 3rd June.
How do I fix the issue?

Try the pandas.to_datetime argument "dayfirst".
The code will look like:
pd.to_datetime(df.index, dayfirst=True)
And it should fix it. The documentation describes this argument as:
"Specify a date parse order if arg is str or is list-like. If True, parses dates with the day first, e.g. 10/11/12 is parsed as 2012-11-10."
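A small sketch of the effect, using dates like those in the question (the sample values here are illustrative, not the asker's actual data):

```python
import pandas as pd

s = pd.Series(['07/03/2021', '06/03/2021'])  # dd/mm/yyyy: 7 Mar and 6 Mar

# Without dayfirst, pandas guesses month-first and reads these as 3 Jul / 3 Jun.
print(pd.to_datetime(s).dt.strftime('%d %b %Y').tolist())

# With dayfirst=True the day is parsed first, as the data intends.
print(pd.to_datetime(s, dayfirst=True).dt.strftime('%d %b %Y').tolist())
```

Note that dayfirst is only a hint; when the column layout is fixed, an explicit format string is stricter.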

Related

Date/Time formatting issue while webscraping to CSV file

I'm learning Python and software development...
I've been scraping data (date/time, interest rate) from a website every minute since July and appending it to a CSV file. Today I went to chart the data in a Jupyter notebook with pandas, etc.
I sliced off the 'AM/PM' characters and used the pandas.to_datetime method on the date/time column to format it properly:
data['date/time'] = data['date/time'].str[0:14].map(pandas.to_datetime)
However, it appears that the date/time data was at first interpreted by python/jupyter/pandas following the ddmmyy convention, but at the start of a new month the interpretation switched to mmddyy, and on the 13th of the month it switched back to ddmmyy.
For example:
The CSV file shows the following string value within the respective cell:
31/07/22 23:59PM
01/08/22 00:00AM
...
12/08/22 23:59PM
13/08/22 00:00AM
However, the pandas dataframe, after using the 'to_datetime' method shows:
2022-07-31 23:59:00
2022-01-08 00:00:00
...
2022-12-08 23:59:00
2022-08-13 00:00:00
I've been trying to figure out:
Why did this happen?
How can I avoid this going forward?
How can I fix this so that I can chart/plot the time series data properly?
Update: It looks like the issue occurs while filtering from a larger CSV file into the CSV file I'm working with.
You should use the pandas.to_datetime dayfirst parameter set to True so the format is assumed to be day/month/year. Otherwise, it assumes month/day/year whenever the first field is between 1 and 12.
It's going to be like the following:
data['date/time'] = data['date/time'].str[0:14].map(lambda x: pd.to_datetime(x, dayfirst=True))
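Since the column layout here is fixed, an explicit format string is even safer than dayfirst: a row that doesn't match raises an error instead of being silently reinterpreted. A sketch, using sample strings shaped like those in the question:

```python
import pandas as pd

# Strings as they look after slicing off the AM/PM suffix.
raw = pd.Series(['31/07/22 23:59', '01/08/22 00:00', '13/08/22 00:00'])

# An explicit format removes all guessing between dd/mm and mm/dd.
parsed = pd.to_datetime(raw, format='%d/%m/%y %H:%M')
print(parsed.tolist())
```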

Pyspark - Convert to Timestamp

Spark version : 2.1
I'm trying to convert a string datetime column to utc timestamp with the format yyyy-mm-ddThh:mm:ss
I first start by changing the format of the string column to yyyy-mm-ddThh:mm:ss
and then convert it to timestamp type. Later I would convert the timestamp to UTC using to_utc_timestamp function.
df.select(
    f.to_timestamp(
        f.date_format(f.col("time"), "yyyy-MM-dd'T'HH:mm:ss"),
        "yyyy-MM-dd'T'HH:mm:ss"
    )
).show(5, False)
The date_format works fine by giving me the correct format. But, when I do to_timestamp on top of that result, the format changes to yyyy-MM-dd HH:mm:ss, when it should instead be yyyy-MM-dd'T'HH:mm:ss. Why does this happen?
Could someone tell me how I could retain the format given by date_format? What should I do?
The function to_timestamp converts a string to a timestamp. The second argument defines the format of the datetime in the string you are trying to parse; it does not control how the result is displayed. A timestamp column has no string format of its own, and Spark renders it as yyyy-MM-dd HH:mm:ss by default. To get the 'T' back, apply date_format to the timestamp when you need a string.
You can see a couple of examples in the official documentation.
The code should look like this; note the single 'd' here, which accepts one- or two-digit days and is tricky in many cases:
data = data.withColumn('date', to_timestamp(col('date'), 'yyyy/MM/d'))
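The same point can be illustrated in plain Python, independent of Spark: parsing produces a value with no intrinsic format, and the 'T' only reappears when you render the value back to a string (in Spark, that rendering step is date_format applied after to_timestamp):

```python
from datetime import datetime

# The 'T' lives in the string, not in the parsed value: a timestamp
# carries no display format of its own.
ts = datetime.strptime("2021-03-07T14:30:00", "%Y-%m-%dT%H:%M:%S")

print(ts.strftime("%Y-%m-%d %H:%M:%S"))   # default-style rendering
print(ts.strftime("%Y-%m-%dT%H:%M:%S"))   # ISO-style rendering with the 'T'
```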

Converting UTC Timestamp in CSV to Local time (PST)

I am looking for code to convert timestamps from some GPS data in a CSV file to local time (in this case PST). I also have some other files I would have to convert to CDT and EDT.
This is what the output looks like:
2019-09-18T07:07:48.000Z
I would like to create separate columns to the right in Excel for the date and the time, e.g.:
TIME_UTC DATE TIME_PST
2019-09-18T07:07:48.000Z 09-18-2019 12:07:48 AM
I only know basic Python and nothing about Excel in python so it would be super helpful!
Thank you!!!
By calling localize you declare which timezone your naive datetime is in. So in your example you localize the date as UTC, then call astimezone to convert it to the Pacific zone. For example:
import pytz
from datetime import datetime

utc_dt = pytz.utc.localize(datetime.utcnow())
pst_tz = pytz.timezone('US/Pacific')
pst_dt = pst_tz.normalize(utc_dt.astimezone(pst_tz))
print(pst_dt.strftime('%Y-%m-%d %H:%M:%S %Z'))
For more examples, see the pytz documentation.
If you want to use Excel Formula:
For the date:
=INT(SUBSTITUTE(LEFT(A2,LEN(A2)-1),"T"," ")-TIME(7,0,0))
For the Time:
=MOD(SUBSTITUTE(LEFT(A2,LEN(A2)-1),"T"," ")-TIME(7,0,0),1)
And format the output with the desired formats: mm-dd-yyyy and hh:mm:ss AM/PM respectively.
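Since the data lives in a CSV, a pandas approach may be simpler than Excel formulas: parse the ISO strings as UTC-aware timestamps, then tz_convert to US/Pacific (which also handles daylight saving, unlike a fixed 7-hour offset). A sketch; the column name 'TIME_UTC' is assumed from the example layout, adjust it to your file:

```python
import pandas as pd

# In a real workflow this would be pd.read_csv('gps_data.csv').
df = pd.DataFrame({'TIME_UTC': ['2019-09-18T07:07:48.000Z']})

# Parse as UTC-aware timestamps, then convert to Pacific time.
local = pd.to_datetime(df['TIME_UTC'], utc=True).dt.tz_convert('US/Pacific')

df['DATE'] = local.dt.strftime('%m-%d-%Y')
df['TIME_PST'] = local.dt.strftime('%I:%M:%S %p')
print(df[['DATE', 'TIME_PST']])
```

For the CDT and EDT files, swap in 'US/Central' or 'US/Eastern'.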

pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy!
Insane doesn't even come close to describing it! Is it a known bug?
To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.
I then use pandas.DatetimeIndex to extract the day.
Well, the extracted day is 1 for the first 12 rows (when month and day were both < 13), and 13, 14, etc. afterwards. How on earth is this possible?
The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain because it means I must declare the date format even when I import csv with the best-formatted dates ever! Is there a simpler way?
This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['day'] = np.arange(1, 32)
df['day'] = df['day'].apply(lambda x: "{:0>2d}".format(x))
df['month'] = '01'
df['year'] = '2018'
df['date'] = df['day'] + '/' + df['month'] + '/' + df['year']
df.to_csv('mydates.csv', index=False)
# same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv', parse_dates=['date'])
imp['day extracted'] = pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])
By default it assumes the American date format, and if that fails it indeed switches mid-column without raising an error. Letting this pass silently breaks the Zen of Python, but "explicit is better than implicit" still applies: if you know your data has an international format, you can use dayfirst
imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
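For the files you write yourself, pandas can emit unambiguous dates directly via the date_format argument of to_csv. A sketch (writing to an in-memory buffer here instead of a file, just for illustration):

```python
import io
import pandas as pd

# Dates written as ISO 8601 parse the same way for every reader.
df = pd.DataFrame({'date': pd.to_datetime(['01/12/2018', '13/01/2018'],
                                          dayfirst=True)})
buf = io.StringIO()
df.to_csv(buf, index=False, date_format='%Y-%m-%d')
print(buf.getvalue())
```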

Detect missing date in string using dateutil in python

I have a string which contains a timestamp. This timestamp may or may not contain the date it was recorded. If it does not I need to retrieve it from another source. For example:
if the string is
s = '11:42:27.619'  # It does not contain a date, just the time
when I use dateutil.parser.parse(s), it returns a datetime object with today's date. How can I detect when there is no date, so I can get it from somewhere else?
I can not just test if it is today's date because the timestamp may be from today and I should use the date provided in the timestamp if it exists.
Thank you
What I would do is first check the string's length: if it contains a date it should be longer. Then I would proceed as you mention.
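A more robust trick than checking the length is to parse the string twice with two different sentinel defaults: if the string carries its own date, both parses agree; if not, each parse inherits its default and the dates differ. A sketch (the helper name is my own):

```python
from datetime import datetime
from dateutil import parser

def has_own_date(s):
    # dateutil fills missing fields from the 'default' argument, so a
    # time-only string takes its date from whichever default we pass.
    d1 = parser.parse(s, default=datetime(2000, 1, 1))
    d2 = parser.parse(s, default=datetime(2001, 2, 2))
    return d1.date() == d2.date()

print(has_own_date('11:42:27.619'))         # time only
print(has_own_date('2021-03-07 11:42:27'))  # full timestamp
```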
