How do I properly define a date column?
I'm reading a csv file which contains a date column, and writing that out to an excel file.
Using pd.to_datetime I can get the values represented as datetime. However as my input is date only, the time values come out as zeroes.
I've tried using format='%d/%m/%Y', however that fails as my input is not zero padded for the day and month values –
ValueError: time data '07/18/2016' does not match format '%d/%m/%Y' (match)
import pandas as pd
from datetime import datetime, date
jobs_file = "test01.csv"
output_file = jobs_file.replace('csv', 'xlsx')
jobs_path = "C:\\Users\\Downloads\\" + jobs_file
output_path = "C:\\Users\\Downloads\\" + output_file
jobs = pd.read_csv(jobs_path,)
jobs['LAST RUN'] = pd.to_datetime(jobs['LAST RUN'])
#jobs['LAST RUN'] = pd.to_datetime(jobs['LAST RUN'], format='%d/%m/%Y')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter(output_path, engine='xlsxwriter',
                        date_format='mm/dd/yyyy')
# Convert the dataframe to an XlsxWriter Excel object.
jobs.to_excel(writer, sheet_name='Jobs', index=False)
# Close the Pandas Excel writer and output the Excel file.
writer.close()
My input values look like this -
7/18/2016
Output is looking like this -
7/18/2016 12:00:00 AM
I'd like it to be just the date -
7/18/2016
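A note on the ValueError quoted above: the sample value 07/18/2016 is month/day/year, so the likely cause is the order of %d and %m in the format rather than the zero padding. Here is a minimal sketch using only the standard library's datetime.strptime, which uses the same directives as the format argument; try_parse is a helper name made up for this illustration:

```python
from datetime import datetime

def try_parse(text, fmt):
    """Return the parsed date, or None when text does not match fmt."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

# Day-first format: "18" lands in the month slot, which is invalid, so parsing fails.
day_first = try_parse("07/18/2016", "%d/%m/%Y")

# Month-first format matches, with or without a leading zero.
padded = try_parse("07/18/2016", "%m/%d/%Y")
unpadded = try_parse("7/18/2016", "%m/%d/%Y")

print(day_first, padded, unpadded)
```

If this holds for the real data too, format='%m/%d/%Y' would be the thing to try in pd.to_datetime.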
I have a yfinance download that is working fine, but I want the Date column to be in YYYY/MM/DD format when I write to disk.
The Date column is the Index, so I first remove the index. Then I have tried using Pandas' "to_datetime" and also ".str.replace" to get the column data to be formatted in YYYY/MM/DD.
Here is the code:
import pandas
import yfinance as yf
StartDate_T = '2021-12-20'
EndDate_T = '2022-05-14'
df = yf.download('CSCO', start=StartDate_T, end=EndDate_T, rounding=True)
df.sort_values(by=['Date'], inplace=True, ascending=False)
df.reset_index(inplace=True) # Make it no longer an Index
df['Date'] = pandas.to_datetime(df['Date'], format="%Y/%m/%d") # Tried this, but it fails
#df['Date'] = df['Date'].str.replace('-', '/') # Tried this also - but error re str
file1 = open('test.txt', 'w')
df.to_csv(file1, index=True)
file1.close()
How can I fix this?
Change the format of the date after resetting the index:
df.reset_index(inplace=True)
df['Date'] = df['Date'].dt.strftime('%Y/%m/%d')
As noted in "Convert datetime to another format without changing dtype", you cannot change the display format and keep the datetime dtype, because of how datetime values are stored internally. So I would use the line above just before writing to the file (it changes the column to strings) and convert the column back to datetime afterwards to regain the datetime properties:
df['Date'] = pd.to_datetime(df['Date'])
You can pass a date format to the to_csv function:
df.to_csv(file1, date_format='%Y/%m/%d')
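A small sketch of the date_format approach, assuming a made-up two-row frame in place of the yfinance download (the values are invented for illustration). It shows that only the written text changes while the column keeps its datetime dtype:

```python
import pandas as pd

# Hypothetical sample data standing in for the yfinance download.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-05-13", "2022-05-12"]),
    "Close": [49.5, 50.1],
})

# date_format only affects the text that is written; the column is untouched.
# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False, date_format="%Y/%m/%d")

print(csv_text)
print(df["Date"].dtype)
```

This avoids the round trip through strings when the only goal is the layout on disk.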
I'm trying to compare 2 columns of timestamps in a CSV file, and I only want to keep the rows where the date/time in column 1 is before the date/time in column 2. I'm not too sure where to start, since we're looking at comparing many different numbers (e.g. month, year, hour, minute, etc.) separately in relation with one another, including the AM/PM comparison.
Example: (date is [mm/dd/yyyy] format)
11/20/2018 3:00:13 PM
11/23/2017 6:45:00 AM
12/22/2019 4:00:12 PM
1/10/2020 4:50:11 AM
10/10/2018 2:02:19 PM
10/07/2018 1:04:15 PM
Here I would want to keep row 2 because the date in column 2 comes after the date in column 1, and I would not want to keep rows 1 & 3. Is there a neat way to do this in command line? (If not, any pointers to write a Python script would be very helpful)
Thanks in advance!
I will try to keep this as clear and detailed as possible so everyone can follow:
1) First, import Python's datetime library:
import datetime as dt
2) Next, import the CSV file to work on. Here I use dates.csv, which holds the same data as in the question above:
from csv import reader
dataset = list(reader(open("dates.csv", encoding = "utf-8")))
2.1) Print the dataset to check that it loaded:
dataset
Then print a single date from the dataset to check the pattern.
Keep in mind that indexing in Python starts at zero:
dataset[1][0] # dataset[row][column]
2.2) Pattern is month/day/year hour:min:sec AM/PM
pattern = "%m/%d/%Y %I:%M:%S %p"
You can check the list of legal format codes to build a different pattern in future.
3) Now convert the dates in the dataset into datetime objects using the library we imported:
for i in dataset[1:]:
    # [1:] because the first row holds the headings and we don't need it
    i[0] = dt.datetime.strptime(i[0], pattern)
    i[1] = dt.datetime.strptime(i[1], pattern)
print(dataset[1][0])
The value was converted successfully.
4) Now compare two dates manually to understand the concept.
Once the values are datetime objects, the ordinary comparison operators work on them:
print(dataset[2][0] , "and" , dataset[2][1])
print(dataset[2][0] > dataset[2][1])
5) Now creating a separate list in which only those rows will be added where column 2's date is greater than column 1's date :
col2_greatorthan_col1 = []
adding heading in our new list :
col2_greatorthan_col1.append(["column 1" , "column 2"])
comparing each and every date and appending our desired row in our new list :
for i in dataset[1:]:
    if i[1] > i[0]:  # means if column 2's date is greater than column 1's date
        col2_greatorthan_col1.append(i)  # appending the filtered rows in our new list
col2_greatorthan_col1
6) Now simply write a real CSV file holding the same data as col2_greatorthan_col1:
import csv
with open("new_dates.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(col2_greatorthan_col1)
Result :
A new csv file by the name of new_dates.csv will be created in the same directory as your python code file.
This file will only contain those rows where column 2's date is greater than column 1's date.
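The pattern from step 2.2 already takes care of the AM/PM comparison the question worries about. A quick standard-library check, using one timestamp from the question and an AM variant invented here for contrast:

```python
from datetime import datetime

pattern = "%m/%d/%Y %I:%M:%S %p"

# %I reads the 12-hour clock value and %p decides which half of the day it is.
afternoon = datetime.strptime("11/20/2018 3:00:13 PM", pattern)
morning = datetime.strptime("11/20/2018 3:00:13 AM", pattern)

print(afternoon.hour, morning.hour)
print(morning < afternoon)
```

Once parsed, both values are ordinary datetime objects, so no separate comparison of month, hour or AM/PM is needed.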
In Python, you just need to convert each of the date values into datetime objects. They can then be easily compared with a simple < operator. For example:
from datetime import datetime
import csv
with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    #header = next(csv_input)
    csv_output = csv.writer(f_output)
    #csv_output.writerow(header)
    for row in csv_input:
        date_col1 = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
        date_col2 = datetime.strptime(row[1], '%m/%d/%Y %I:%M:%S %p')
        if date_col1 < date_col2:
            csv_output.writerow(row)
If your CSV file contains a header, uncomment the two commented lines. The documentation for .strptime() has more information on how the format string works.
This approach uses built in Python functionality and so does not need further modules to be installed.
Using Pandas
import pandas as pd
# Read tab delimited CSV file without header
# Names columns date1, date2
df = pd.read_csv("dates.csv",
                 header = None,
                 sep='\t',
                 parse_dates = [0, 1], # use default date parser i.e. parser.parser
                 names=["date1", "date2"])
# Filter (keep) row when date2 > date1
df = df[df.date2 > df.date1]
# Output to filtered CSV file using the original month/day/year layout (zero-padded)
df.to_csv('filtered_dates.csv', index = False, header = False, sep = '\t', date_format = "%m/%d/%Y %I:%M:%S %p")
With command-line tools you can use awk.
To convert the first date to epoch format:
echo "11/20/2018 3:00:13 PM" | gawk -F'[/: ]' '{h = $4 % 12; if ($7 == "PM") h += 12; print mktime($3" "$1" "$2" "h" "$5" "$6)}'
Do the same for the second field, then subtract column 2 from column 1. If the result is positive, column 1 is after column 2.
The mktime function does the "magic" here: it takes a "YYYY MM DD HH MM SS" string, so the field separator includes the space and the 12-hour clock is converted to 24-hour first. Be aware that mktime is a gawk extension and is not available in some UNIX awk versions.
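For anyone without gawk, the same subtract-and-check-the-sign idea can be sketched in Python with the standard library. The rows are the three sample rows from the question; seconds_between is a helper name invented here:

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

def seconds_between(col1, col2):
    """Return col1 minus col2 in seconds (positive when col1 is later)."""
    delta = datetime.strptime(col1, FMT) - datetime.strptime(col2, FMT)
    return delta.total_seconds()

rows = [
    ("11/20/2018 3:00:13 PM", "11/23/2017 6:45:00 AM"),
    ("12/22/2019 4:00:12 PM", "1/10/2020 4:50:11 AM"),
    ("10/10/2018 2:02:19 PM", "10/07/2018 1:04:15 PM"),
]

# Keep a row only when column 1 is before column 2, i.e. the difference is negative.
kept = [row for row in rows if seconds_between(*row) < 0]
print(kept)
```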
I have saved the sample you provided in a tab separated file - with no headers. I have imported it as a DataFrame using (note that I specified your date format in date_parser):
import pandas as pd
import datetime as dt
df = pd.read_csv(PATH_TO_YOUR_FILE, sep="\t", names=["col1", "col2"], parse_dates=[0, 1],
                 date_parser=lambda x: dt.datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p"))
To select the rows you need:
df.loc[df.loc[:,"col2"]>df.loc[:,"col1"],:]
You can use pd.to_datetime to parse the date-time strings and then use their comparison as a condition to filter the required rows.
Demo:
import pandas as pd
df = pd.DataFrame({
    'start': ['11/20/2018 3:00:13 PM', '12/22/2019 4:00:12 PM', '10/10/2018 2:02:19 PM'],
    'end': ['11/23/2017 6:45:00 AM', '1/10/2020 4:50:11 AM', '10/07/2018 1:04:15 PM']
})
result = pd.DataFrame(df[
    pd.to_datetime(df['start'], format='%m/%d/%Y %I:%M:%S %p') <
    pd.to_datetime(df['end'], format='%m/%d/%Y %I:%M:%S %p')
])
print(result)
Output:
start end
1 12/22/2019 4:00:12 PM 1/10/2020 4:50:11 AM
Question about formatting data when writing to an Excel doc as I am a bit newer to using Openpyxl.
I have an excel sheet that I am writing data to where one of the columns is a column that holds the current date in 'mm/dd/yyyy' format. Currently when writing to the Excel doc, my code reformats the date to 'yyyy-mm-dd' format, and the excel doc does not recognize the data as 'Date' type, but as 'General' data type.
Here is my Python code to write the date to the sheet.
from openpyxl import Workbook
from openpyxl import load_workbook
import datetime
from datetime import date
workbookName = "Excel workbook.xlsm"
wb = Workbook()
wb = load_workbook(workbookName, data_only=True, keep_vba=True)
ws = wb["Sheet1"]
rowCount = 2000
insertRow = rowCount + 7
origDate = date.today()
dateString = datetime.datetime.strftime(origDate, '%m/%d/%Y')
insertDate = datetime.datetime.strptime(dateString, '%m/%d/%Y').date()
dateCell = ws.cell(row = insertRow, column = 1)
dateCell.value = insertDate
wb.save("Excel workbook.xlsm")
So for example, if I ran this code using today's date of 03/19/2021, the cell would look like 2021-03-18 with General type.
Not sure what I am missing, but I want the inserted cell to have 'Date' type in 'mm/dd/yyyy' format. Any pointers?
I think this can be done simply with:
dateCell.value = origDate
dateCell.number_format = 'mm/dd/yyyy'
Note that there's no such thing as a date data type in Excel. A date is just a formatted number. "Date" (and "General" for that matter) are just formatting; the underlying value of the cell is separate and unchanged.
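To make the "formatted number" point concrete, here is a rough sketch of the number Excel stores for a date, assuming the workbook uses the default 1900 date system. EXCEL_EPOCH and excel_serial are names chosen for this illustration, and the arithmetic is only meant for dates from 1 March 1900 onward:

```python
from datetime import date

# Assumed day zero for the 1900 date system, as commonly used for conversions.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial(d):
    """Return the whole-day serial number Excel would store for date d."""
    return (d - EXCEL_EPOCH).days

# The date from the question; the cell holds this number, and
# number_format only controls how it is displayed.
print(excel_serial(date(2021, 3, 19)))
```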
Though if you really want the cell to show as having a "Date" format and not "Custom", perhaps:
dateCell.number_format = 'mm/dd/yyyy;@'
This question differs from the existing questions and answers on Stack Overflow because I do not want to convert my data to strings to get the desired output.
I find this very confusing and cannot find a proper solution to my problem.
I read an excel file which have one column as following-
Date
9/20/2017 7:27:30 PM
9/20/2017 7:27:30 PM
11/21/2018 8:28:30 AM
7/18/2019 9:30:08 PM
.
.
.
I am taking this data from excel sheet with the help of dataframe
df = pd.read_excel("data.xlsx")
Firstly I want to remove time from this column. I am doing it as -
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], errors='ignore', format='%d/%b/%Y').dt.date
It produces following output and datatype as datetime.date
Date
20/9/2017
20/9/2017
21/11/2018
18/7/2019
.
.
.
But I want it in the following form without converting it to a string, because I want to store this data in another Excel file and this column must behave as a date column when filtering is applied in the Excel sheet.
Date
20/Sep/2017
20/Sep/2017
21/Nov/2018
18/Jul/2019
.
.
.
I can produce the above output with:
df['Date'] = df['Date'].apply(lambda x: x.strftime('%d/%b/%Y'))
But then the date column is changed into strings again, which I do not want. I want a datetime type with the time values excluded from each cell.
One possible fix after that is to convert the strings back to datetime, but this adds the time values again:
df['Date'] = pd.to_datetime(df['Date'])
After executing the above two steps, each value again includes a time of 12:00:00 AM (00:00:00) along with the date.
I hope that is clear.
How can I obtain the desired result with the final column values as a date type?
"But I want it as following type without changing it into string"
No, that is not possible. If you want datetimes without times, the only pattern available in Python/pandas is YYYY-MM-DD.
#datetimes with no times
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p').dt.floor('d')
#python dates
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p').dt.date
For any custom format, the datetimes have to be converted to strings, like:
df['Date'] = df['Date'].dt.strftime('%d/%b/%Y')
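The same trade-off can be seen with plain Python objects, without pandas. In this standard-library sketch, one timestamp from the question becomes a date that prints as YYYY-MM-DD, while the custom layout only exists as a string:

```python
from datetime import datetime, date

stamp = datetime.strptime("9/20/2017 7:27:30 PM", "%m/%d/%Y %I:%M:%S %p")

as_date = stamp.date()                  # still a date object, time part dropped
as_text = as_date.strftime("%d/%b/%Y")  # custom layout, but now a plain str

print(type(as_date).__name__, as_date)
print(type(as_text).__name__, as_text)
```

So the custom layout has to come from whatever displays the value (for example the Excel cell format), not from the value itself.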
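You can set date_format (and datetime_format, which is the one applied to datetime64 columns) in the ExcelWriter. Both take Excel number-format codes rather than strftime directives:
writer = pd.ExcelWriter("pandas_datetime.xlsx",
                        engine='xlsxwriter',
                        date_format='dd/mmm/yyyy',
                        datetime_format='dd/mmm/yyyy')
df.to_excel(writer)
writer.close()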
I think I am a bit late here, but as a workaround:
Do not format the date column; let it be a regular DataFrame date column. Save the Excel workbook, then open it again and format that column range using the openpyxl module:
import openpyxl
workbook = openpyxl.load_workbook(file_path)
sheet = workbook['Sheet1']  # get the sheet by name
# assuming that the column is M and the data starts from M2
last_line_end = 'M' + str(len(df) + 1)
for row in sheet['M2:' + last_line_end]:
    for cell in row:
        cell.number_format = "DD/MM/YY"
workbook.save(file_path)  # save the workbook back to the path it was loaded from
workbook.close()
I am making a generic tool which can take any CSV file. I have a CSV file which looks something like this. The first row holds the column names and the second row holds the type of each variable.
sam.csv
Time,M1,M2,M3,CityName
temp,num,num,num,city
20-06-13,19,20,0,aligarh
20-02-13,25,42,7,agra
20-03-13,23,35,4,aligarh
20-03-13,21,32,3,allahabad
20-03-13,17,27,1,aligarh
20-02-13,16,40,5,aligarh
Other CSV file looks like:
Time,M1,M2,M3,CityName
temp,num,num,num,city
20/8/16,789,300,10,new york
12/6/17,464,67,23,delhi
12/6/17,904,98,78,delhi
So the column could hold any date format, or it could be a timestamp. I want to convert it to a string like "20-May-13" (format "%d-%b-%y") every time and sort the column from the oldest date to the newest. I have been able to find the column whose type is "temp" and tried to convert it to the required format, but all the methods require me to specify the original format, which is not possible in my case.
Code--
import csv
import time
from datetime import datetime,date
import pandas as pd
import dateutil
from dateutil.parser import parse
filename = 'sam.csv'
data_date = pd.read_csv(filename)
column_name = data_date.ix[:, data_date.loc[0] == "temp"]
column_work = column_name.iloc[1:]
column_some = column_work.iloc[:,0]
default_date = datetime.combine(date.today(), datetime.min.time()).replace(day=1)
for line in column_some:
print(parse(line[0], default=default_date).strftime("%d-%b-%y"))
In "sam.csv", the dates are in 2013. But in my output it gives the correct format but all the 6 dates as 2-Mar-2018
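One likely cause of the identical dates: in the loop, line is already a single cell's string, so line[0] is only its first character, "2". The parser would then see just a day number and take the month and year from default_date, which would explain every row coming out as the 2nd of the current month. A tiny illustration with a plain list standing in for the "temp" column (the list is a stand-in, not the real DataFrame column):

```python
# Stand-in for the "temp" column after the type row has been dropped.
column_some = ["20-06-13", "20-02-13", "20-03-13"]

# What the loop in the question passes to parse(): one character per row.
first_chars = [line[0] for line in column_some]

# What was presumably intended: the whole cell value.
whole_values = [line for line in column_some]

print(first_chars)
print(whole_values)
```

Passing line instead of line[0] to parse would be the first thing to try.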
You can use the dateutil library for converting any date format to your required format.
Ex:
import csv
from dateutil.parser import parse
p = "PATH_TO_YOUR_CSV.csv" #I have used your sample data to test.
with open(p, "r") as infile:
reader = csv.reader(infile)
next(reader) #Skip Header
next(reader) #Skip Header
for line in reader:
print(parse(line[0]).strftime("%d-%B-%y")) #Parse Date and convert it to date-month-year
Output:
20-June-13
20-February-13
20-March-13
20-March-13
20-March-13
20-February-13
20-August-16
06-December-17
06-December-17
More info is available in the dateutil documentation.
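The question also asks for the column to be sorted from oldest to newest, which works once the values are parsed. The sketch below swaps dateutil for the standard library's strptime with a hypothetical list of candidate formats (CANDIDATE_FORMATS and parse_any are names invented here), so it is an alternative under that assumption rather than the method above. Both candidate formats are day-first, so 12/6/17 is read as 12 June here, whereas the dateutil output above reads it as 6 December; ambiguous inputs like that need a decision either way.

```python
from datetime import datetime

# Hypothetical list of layouts to try, in order; extend it for other inputs.
CANDIDATE_FORMATS = ["%d-%m-%y", "%d/%m/%y"]

def parse_any(text):
    """Parse text with the first candidate format that fits."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {text!r}")

# A few values taken from the two sample files in the question.
raw = ["20-06-13", "20-02-13", "20-03-13", "20/8/16", "12/6/17"]

# Sort on the parsed datetimes, then format for output.
ordered = sorted(parse_any(value) for value in raw)
formatted = [d.strftime("%d-%b-%y") for d in ordered]
print(formatted)
```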