Date/Time formatting issue while webscraping to CSV file - python

I'm learning python and software development...
I've been scraping data (date/time, interest rate) every minute since July from a website and appending it to a CSV file. Today I went to chart the data using Jupyter Notebook, pandas, etc.
I sliced off the 'AM/PM' string characters and used the pandas.to_datetime method on the date/time column to format it properly.
data['date/time'] = data['date/time'].str[0:14].map(pandas.to_datetime)
However, it appears that pandas first interpreted the date/time data using the ddmmyy convention, then switched to mmddyy at the start of a new month. On the 13th of the month the interpretation changed back to ddmmyy.
For example:
The CSV file shows the following string value within the respective cell:
31/07/22 23:59PM
01/08/22 00:00AM
...
12/08/22 23:59PM
13/08/22 00:00AM
However, the pandas dataframe, after using the 'to_datetime' method shows:
2022-07-31 23:59:00
2022-01-08 00:00:00
...
2022-12-08 23:59:00
2022-08-13 00:00:00
I've been trying to figure out:
Why this happened?
How can I avoid this moving forward?
How can I fix this so that I may chart/plot the time series data properly?
Update: It looks like the issue occurs while filtering from a larger CSV file into the CSV file I'm working with.

You should set the dayfirst parameter of pandas.to_datetime to True so that the strings are read as day/month/year. Otherwise pandas assumes month/day/year whenever the first field is a valid month (1 <= month <= 12), and only reads the string as day/month/year when it is not. That is why the interpretation flipped on the 1st and flipped back on the 13th.
It would look like this:
data['date/time'] = pandas.to_datetime(data['date/time'].str[0:14], dayfirst=True)
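Since the layout of the strings is known here, an explicit format string is another option that removes the guessing altogether. The following is a minimal sketch using the four sample values from the question; the variable names (raw, trimmed, parsed) are made up for illustration, and the format "%d/%m/%y %H:%M" is an assumption based on those samples.

```python
import pandas as pd

# Sample values copied from the question's CSV.
raw = pd.Series([
    "31/07/22 23:59PM",
    "01/08/22 00:00AM",
    "12/08/22 23:59PM",
    "13/08/22 00:00AM",
])

# Drop the trailing AM/PM marker; the clock is already 24-hour.
trimmed = raw.str[:14]

# An explicit format leaves no room for a month-first interpretation.
parsed = pd.to_datetime(trimmed, format="%d/%m/%y %H:%M")
print(parsed)
```

With an explicit format, a string that does not match raises an error instead of being silently reinterpreted.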

Related

Date structure changing in python when I extract data from csv

I am pulling data from a CSV file and the Date column is in the format "dd/mm/YYYY", but it looks like some of the dates change unexpectedly when I convert the column from object to datetime.
I have shown an example below: the last three entries are for 7th March, 6th March and 6th March. After converting to datetime, those dates changed to 3rd July, 3rd June and 3rd June.
How do I fix the issue?
You can try the "dayfirst" argument of pandas.to_datetime.
The code would look like this:
pd.to_datetime(df.index, dayfirst=True)
That should fix it. The documentation describes this argument as follows:
"Specify a date parse order if arg is str or its list-likes. If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10"
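As a small illustration of the call above, here is a sketch with a made-up frame whose index holds the three dates mentioned in the question. The year 2021 and the "value" column are assumptions, since the original example is not shown.

```python
import pandas as pd

# Hypothetical data: 7th March, 6th March and 6th March as dd/mm/YYYY strings.
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=["07/03/2021", "06/03/2021", "06/03/2021"],
)

# dayfirst=True makes pandas read the first field as the day.
df.index = pd.to_datetime(df.index, dayfirst=True)
print(df.index)
```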

How do I compare times using an if statement in python from a csv file

I am trying to build a program which reads dates from a csv file. I then wish to compare them using an if statement.
However, every approach I have tried results in a lot of errors.
I get the time by slicing from the csv file, where one column, [Device Timestamp], holds both the time and the date.
dates_and_times = pd.Series(readings["Device Timestamp"])
reading_times = dates_and_times.str.slice(0, 10)
I then attempt the if statement to sort the reading times into categories
If 11:00:00 < reading_times <=16:00:00
Afternoon_reading+=1
Then I get hit with the following errors
Invalid syntax (:)
And a lot more at runtime, including "cannot convert between instances of str and int".
Can someone please tell me how to put these times in a suitable format to be used in an if statement?
Sample date in the csv: "05/11/2019 12:20:00"
A possible solution, assuming the format is day/month/year:
datetime.datetime.strptime('05/11/2019 12:20:00', '%d/%m/%Y %H:%M:%S')
The idea is to describe what your date looks like using format codes that the datetime module understands.
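Once the string is parsed, the time part can be compared against datetime.time objects in an ordinary if statement. The sketch below assumes the day/month/year format of the sample value; the helper name is_afternoon and the two extra timestamps are made up for illustration.

```python
from datetime import datetime, time

def is_afternoon(timestamp: str) -> bool:
    """Return True if the time of day is after 11:00 and up to 16:00 inclusive."""
    parsed = datetime.strptime(timestamp, "%d/%m/%Y %H:%M:%S")
    # time objects support chained comparisons, unlike bare 11:00:00 literals.
    return time(11, 0, 0) < parsed.time() <= time(16, 0, 0)

# The first value is the sample from the question; the others are illustrative.
readings = ["05/11/2019 12:20:00", "05/11/2019 09:05:00", "05/11/2019 16:00:00"]

afternoon_readings = 0
for reading in readings:
    if is_afternoon(reading):
        afternoon_readings += 1

print(afternoon_readings)
```

The original attempt failed because 11:00:00 is not a valid Python literal; the bounds have to be built as time objects (or parsed from strings) before they can be compared.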

How to read a date column from excel in a desired language with Pandas?

I'm using pandas.read_excel() to turn Excel tables into dataframes to work with in Python. These tables contain date columns in the following format: 01Jun2018.
When I run the instruction, the tables are turned into dataframes just fine. The issue comes from the fact that I'm currently working in Mexico, where month abbreviations are spelled in Spanish. Because of this, the date columns show some cells with the correct datetime-type info, but the cells whose months do not match the Spanish month names (for example: April != Abril, January != Enero) show the original strings. I need to do some operations with dates, so these columns must be entirely datetime-type.
I've tried switching the locale to en_US but nothing happened.
You need to set the locale using the locale module. If you already have a dataframe like this:
dates
0 01Ene2018
1 20Feb2018
2 01Jun2018
Then you need to change the type of that column using pd.to_datetime after setting the locale:
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL, locale.locale_alias["es_mx"])
df.dates = pd.to_datetime(df.dates, format="%d%b%Y")
print(df.dates)
Output:
0 2018-01-01
1 2018-02-20
2 2018-06-01
Name: dates, dtype: datetime64[ns]
This assumes you have the es_MX locale installed on your system; otherwise you will need to install it.
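If installing the locale is not an option, a locale-independent alternative (a different technique from the answer above) is to translate the month abbreviation yourself before parsing. The following is only a sketch: the SPANISH_MONTHS table and the parse_spanish_dates helper are hypothetical, and the three-letter abbreviations are an assumption about how the data is spelled.

```python
import pandas as pd

# Assumed three-letter Spanish month abbreviations, keyed in lowercase.
SPANISH_MONTHS = {
    "ene": "01", "feb": "02", "mar": "03", "abr": "04",
    "may": "05", "jun": "06", "jul": "07", "ago": "08",
    "sep": "09", "oct": "10", "nov": "11", "dic": "12",
}

def parse_spanish_dates(series: pd.Series) -> pd.Series:
    """Parse strings such as '01Ene2018' without relying on the system locale."""
    # Split into day, month abbreviation and year (columns 0, 1 and 2).
    parts = series.str.extract(r"^(\d{2})([A-Za-z]{3})(\d{4})$")
    month = parts[1].str.lower().map(SPANISH_MONTHS)
    # Rebuild as an unambiguous YYYY-MM-DD string, then parse that.
    iso = parts[2] + "-" + month + "-" + parts[0]
    return pd.to_datetime(iso, format="%Y-%m-%d")

df = pd.DataFrame({"dates": ["01Ene2018", "20Feb2018", "01Jun2018"]})
df["dates"] = parse_spanish_dates(df["dates"])
print(df["dates"])
```

A column that mixes English and Spanish abbreviations could be handled by adding the English spellings to the same table.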

pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy!
Insane doesn't even come close to describing it! Is it a known bug?
To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.
I then use pandas.DatetimeIndex to extract the day.
Well, the day is 1 for the first 12 days (when month and day were both < 13), and 13 14 etc afterwards. How on earth is this possible?
The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain because it means I must declare the date format even when I import csv with the best-formatted dates ever! Is there a simpler way?
This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['day'] =np.arange(1,32)
df['day']=df['day'].apply(lambda x: "{:0>2d}".format(x) )
df['month']='01'
df['year']='2018'
df['date']=df['day']+'/'+df['month']+'/'+df['year']
df.to_csv('mydates.csv', index=False)
#same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv',parse_dates=['date'])
imp['day extracted']=pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])
By default it assumes the American date format (month first), and when that fails for a value it silently switches to day first mid-column without raising an error. That breaks the Zen of Python ("Errors should never pass silently"), but the same text also says "Explicit is better than implicit". So if you know your data uses an international format, you can use dayfirst:
imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
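To illustrate the point about unambiguous formats, the sketch below writes the same 31 January dates as ISO 8601 strings (YYYY-MM-DD) and reads them back. It uses an in-memory buffer instead of mydates.csv so that it is self-contained, and it leaves out the timezone designator for brevity; both are choices made for this example only.

```python
import io

import pandas as pd

# Real datetime values are written by to_csv in ISO 8601 form (2018-01-01, ...).
df = pd.DataFrame({"date": pd.date_range("2018-01-01", periods=31, freq="D")})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Year-first strings cannot be confused with month-first or day-first layouts.
imp = pd.read_csv(buffer, parse_dates=["date"])
days = imp["date"].dt.day.tolist()
print(days)
```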

Python: Reading Excel and automatically turning a string into a Date object?

I'm using the openpyxl library in Python and I'm trying to read in the value of a cell. The cell's value is a date in the format MM/DD/YYYY. I would like the value to be read into my script simply as a string (i.e. "8/6/2014"), but instead Python is somehow automatically reading it as a date object (the result is "2014-08-06 00:00:00"). I don't know if this is something I need to fix in Excel or Python, but how do I get the string I'm looking for?
I would suggest changing it in your Excel if you want to preserve what is being read in by openpyxl. That said, when a cell has been formatted to a date in Excel, it becomes altered to fit a specified format so you've lost the initial string format in either case.
For example, let's say that the user enters the date 1/1/2018 into a cell that is formatted MM/DD/YYYY, Excel will change the data to 01/01/2018 and you will lose the original string that was entered.
If you only care to see data of the form MM/DD/YYYY, an alternate solution would be to cast the date with date_cell.strftime("%m/%d/%Y")
I found out how to fix it with these lines of code:
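from datetime import datetime
dateString = str(ws.cell(row=row, column=column).value.date())
newDate = datetime.strptime(dateString, "%Y-%m-%d").strftime("%m/%d/%Y")
The string "newDate" then has the MM/DD/YYYY form, for example "08/06/2014" for the cell in the question (note that the month and day are zero-padded).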
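If the string is wanted without leading zeros, as in the "8/6/2014" from the question, one portable option is to build it from the datetime attributes instead of strftime. This sketch assumes the cell value is a datetime.datetime, which is what the question reports for this cell; format_unpadded is a hypothetical helper name, and cell_value stands in for ws.cell(...).value.

```python
from datetime import datetime

def format_unpadded(value: datetime) -> str:
    """Format a date as M/D/YYYY without zero padding."""
    return f"{value.month}/{value.day}/{value.year}"

# Stand-in for the value read from the worksheet cell.
cell_value = datetime(2014, 8, 6)

padded = cell_value.strftime("%m/%d/%Y")
unpadded = format_unpadded(cell_value)
print(padded)
print(unpadded)
```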
