How to find earliest and latest dates from a CSV File [Python]


My CSV file is arranged so that there's a header row named "Dates", and below it is a gigantic column of a million dates in the traditional format, like "4/22/2015" and "3/27/2014".
How can I write a program that identifies the earliest and latest dates in the CSV file, while maintaining the original format (month/day/year)?
I've tried
for line in count_dates:
    dates = line.strip().split(sep="/")
    all_dates.append(dates)
print(all_dates)
I've tried to take away the "/" and replace it with a blank space, but it does not print anything.

import pandas as pd
import datetime
df = pd.read_csv('file_name.csv')
df['Dates'] = df['Dates'].apply(lambda v: datetime.datetime.strptime(v, '%m/%d/%Y'))
print(df['Dates'].min(), df['Dates'].max())

Considering you have a large file, reading it in its entirety into memory is a bad idea.
Read the file line by line, manually keeping track of the earliest and latest dates. Use datetime.datetime.strptime to convert the strings to dates (it takes the string format as a parameter).
import datetime
with open("input.csv") as f:
    f.readline()  # get the "Dates" header out of the way
    first = f.readline().strip()
    earliest = datetime.datetime.strptime(first, "%m/%d/%Y")
    latest = datetime.datetime.strptime(first, "%m/%d/%Y")
    for line in f:
        date = datetime.datetime.strptime(line.strip(), "%m/%d/%Y")
        if date < earliest: earliest = date
        if date > latest: latest = date
print("Earliest date:", earliest)
print("Latest date:", latest)

Let's open the CSV file and read out all the dates, then use strptime to turn them into comparable datetime objects (now we can use max). Lastly, let's print out the biggest (latest) date:
import csv
from datetime import datetime as dt
with open('path/to/file', newline='') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    latest = max(dt.strptime(row[0], "%m/%d/%Y") for row in reader)
print(dt.strftime(latest, "%m/%d/%Y"))
Naturally, you can use min to get the earliest date. However, this takes two linear runs, and you can do this with just one, if you are willing to do some heavy lifting yourself:
import csv
from datetime import datetime as dt
with open('path/to/file', newline='') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    date, *_rest = next(reader)  # the first data row seeds both extremes
    earliest = latest = dt.strptime(date, "%m/%d/%Y")
    for date, *_rest in reader:
        date = dt.strptime(date, "%m/%d/%Y")
        earliest = min(date, earliest)
        latest = max(date, latest)
print("earliest:", dt.strftime(earliest, "%m/%d/%Y"))
print("latest:", dt.strftime(latest, "%m/%d/%Y"))

A bit of an RTFM answer: open the file with the csv library, then iterate line by line, converting the date field into a date object (see the docs for converting a string to a date object). If it is less than the minimum so far, store it as the minimum, and do the same for the maximum. The first line is a special case: its date becomes both the minimum and the maximum.
Or, for some overkill, you could use pandas to read it into a data frame, specifying that the column holds dates, and then just use max and min.
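The single-pass approach described above can be sketched with only the standard library. The function name `date_range` and the default column name are illustrative assumptions; the format string follows the month/day/year layout from the question. The sketch remembers the original text next to each parsed date, so the result keeps the original format.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # month/day/year, as in the question


def date_range(lines, column="Dates"):
    """Return (earliest, latest) as the original strings, in one pass."""
    earliest = latest = None  # each holds a (parsed datetime, original text) pair
    for row in csv.DictReader(lines):
        text = row[column].strip()
        parsed = datetime.strptime(text, DATE_FORMAT)
        if earliest is None or parsed < earliest[0]:
            earliest = (parsed, text)
        if latest is None or parsed > latest[0]:
            latest = (parsed, text)
    if earliest is None:
        raise ValueError("no dates found")
    return earliest[1], latest[1]


# In-memory sample standing in for the real file
sample = io.StringIO("Dates\n4/22/2015\n3/27/2014\n12/1/2014\n")
print(date_range(sample))
```

For a real file, the same function can be given an open file object, for example one opened with `open("input.csv", newline="")`.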

I think it is more convenient to use pandas for this purpose.
import pandas as pd
df = pd.read_csv('file_name.csv')
df['name_of_column_with_date'] = pd.to_datetime(df['name_of_column_with_date'], format='%m/%d/%Y')
print('min_date: {}'.format(min(df['name_of_column_with_date'])))
print('max_date: {}'.format(max(df['name_of_column_with_date'])))
The built-in functions work well with pandas DataFrames.
For more detail on the format argument of pd.to_datetime, see a Python strftime cheat sheet.
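On the format codes themselves, here is a small standard-library sketch of how `%m` and `%d` behave. When parsing, they accept values with or without a leading zero. When formatting, they always zero-pad, so getting the unpadded style from the question back requires building the string from the date's fields. The variable names are illustrative only.

```python
from datetime import datetime

# %m and %d accept "4" as well as "04" when parsing
parsed = datetime.strptime("4/22/2015", "%m/%d/%Y")

# strftime zero-pads the month and day
padded = parsed.strftime("%m/%d/%Y")

# Building the string from the fields reproduces the unpadded original
unpadded = f"{parsed.month}/{parsed.day}/{parsed.year}"

print(padded)
print(unpadded)
```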

Related

How to compare 2 dates in 2 different columns of a csv to tell if the date in column 1 comes before column 2

I'm trying to compare 2 columns of timestamps in a CSV file, and I only want to keep the rows where the date/time in column 1 is before the date/time in column 2. I'm not too sure where to start, since we're looking at comparing many different numbers (e.g. month, year, hour, minute, etc.) separately in relation with one another, including the AM/PM comparison.
Example: (date is [mm/dd/yyyy] format)
11/20/2018 3:00:13 PM    11/23/2017 6:45:00 AM
12/22/2019 4:00:12 PM    1/10/2020 4:50:11 AM
10/10/2018 2:02:19 PM    10/07/2018 1:04:15 PM
Here I would want to keep row 2 because the date in column 2 comes after the date in column 1, and I would not want to keep rows 1 & 3. Is there a neat way to do this in command line? (If not, any pointers to write a Python script would be very helpful)
Thanks in advance!

I will try to keep this as clear and detailed as possible so everyone can understand:
1) First, import Python's datetime library:
import datetime as dt
2) Next, import the CSV file to work on. Here I have used dates.csv, which has the same data as in the question above:
from csv import reader
dataset = list(reader(open("dates.csv", encoding = "utf-8")))
2.1) Print dataset to check that it is working:
dataset
Print a single date from the dataset to check the pattern.
Keep in mind that indexing in Python starts at zero:
dataset[1][0] # dataset[row][column]
2.2) Pattern is month/day/year hour:min:sec AM/PM
pattern = "%m/%d/%Y %I:%M:%S %p"
You can check the legal format codes to create a different pattern in future.
3) Now convert the dataset's dates into datetime objects using the library we imported:
for i in dataset[1:]:
    # [1:] because the first row holds the headings and we don't need it
    i[0] = dt.datetime.strptime(i[0],pattern)
    i[1] = dt.datetime.strptime(i[1],pattern)
print(dataset[1][0])
Successfully converted ^
4) Now compare dates manually to understand the concept.
Once the strings are datetime objects, the ordinary comparison operators
work on them:
print(dataset[2][0] , "and" , dataset[2][1])
print(dataset[2][0] > dataset[2][1])
5) Now create a separate list that holds only the rows where column 2's date is greater than column 1's date:
col2_greatorthan_col1 = []
Add the heading to the new list:
col2_greatorthan_col1.append(["column 1" , "column 2"])
Compare every pair of dates and append the desired rows to the new list:
for i in dataset[1:]:
    if i[1] > i[0]: # column 2's date is greater than column 1's date
        col2_greatorthan_col1.append(i) # append the filtered row to the new list
col2_greatorthan_col1
6) Finally, write a CSV file with the same data as col2_greatorthan_col1:
import csv
with open("new_dates.csv" , "w" , newline = "") as file :
    writer = csv.writer(file)
    writer.writerows(col2_greatorthan_col1)
Result:
A new CSV file named new_dates.csv will be created in the current working directory (usually the same directory as your Python script).
This file will contain only those rows where column 2's date is greater than column 1's date.
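One thing to be aware of with the steps above: the rows now hold datetime objects, so csv.writer writes them via str(), which gives a form like 2019-12-22 16:00:12 rather than the original text. A minimal sketch that parses only for the comparison and passes the original strings through might look like this. The function name `rows_where_first_is_earlier` and the comma-separated sample are assumptions for illustration.

```python
import csv
import io
from datetime import datetime

PATTERN = "%m/%d/%Y %I:%M:%S %p"


def rows_where_first_is_earlier(lines, has_header=True):
    """Yield rows whose column 1 timestamp is before column 2, unchanged."""
    reader = csv.reader(lines)
    if has_header:
        header = next(reader, None)
        if header is not None:
            yield header  # pass the header row through as-is
    for row in reader:
        start = datetime.strptime(row[0], PATTERN)
        end = datetime.strptime(row[1], PATTERN)
        if start < end:
            yield row  # the original strings, not datetime objects


# In-memory sample using the data from the question
sample = io.StringIO(
    "column 1,column 2\n"
    "11/20/2018 3:00:13 PM,11/23/2017 6:45:00 AM\n"
    "12/22/2019 4:00:12 PM,1/10/2020 4:50:11 AM\n"
    "10/10/2018 2:02:19 PM,10/07/2018 1:04:15 PM\n"
)
kept = list(rows_where_first_is_earlier(sample))
print(kept)
```

The yielded rows can be handed directly to `csv.writer(...).writerows(...)` to produce the filtered file.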

In Python, you just need to convert each of the date values into datetime objects. They can then be easily compared with a simple < operator. For example:
from datetime import datetime
import csv
with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    #header = next(csv_input)
    csv_output = csv.writer(f_output)
    #csv_output.writerow(header)
    for row in csv_input:
        date_col1 = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
        date_col2 = datetime.strptime(row[1], '%m/%d/%Y %I:%M:%S %p')
        if date_col1 < date_col2:
            csv_output.writerow(row)
If your CSV file contains a header, uncomment the two commented lines. You can find more information on how the format string works in the .strptime() documentation.
This approach uses built in Python functionality and so does not need further modules to be installed.

Using Pandas
import pandas as pd
# Read tab delimited CSV file without header
# Names columns date1, date2
df = pd.read_csv("dates.csv",
                 header = None,
                 sep='\t',
                 parse_dates = [0, 1],   # use the default date parser
                 names=["date1", "date2"])
# Filter (keep) row when date2 > date1
df = df[df.date2 > df.date1]
# Output to filtered CSV file using the original month/day/year order
df.to_csv('filtered_dates.csv', index = False, header = False, sep = '\t', date_format = "%m/%d/%Y %I:%M:%S %p")

With command-line tools you can use awk.
To convert the first date to epoch format (splitting on "/", ":" and space, and turning the 12-hour clock into a 24-hour value):
echo "11/20/2018 3:00:13 PM" | gawk -F'[/: ]' '{h = $4 % 12; if ($7 == "PM") h += 12; print mktime($3" "$1" "$2" "h" "$5" "$6)}'
Do the same for the second field, then subtract column 2 from column 1. If the result is positive, column 1 is after column 2.
This uses gawk's mktime function, which does the "magic". Be aware that this function is not available in some UNIX awk versions.
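For comparison, the same "convert, then subtract" idea can be sketched in Python. This is not part of the awk answer; the helper name `seconds_between` is hypothetical. Subtracting two datetime objects gives a timedelta, so no explicit epoch conversion is needed.

```python
from datetime import datetime

PATTERN = "%m/%d/%Y %I:%M:%S %p"


def seconds_between(first, second):
    """Return first minus second in seconds; positive means first is later."""
    delta = datetime.strptime(first, PATTERN) - datetime.strptime(second, PATTERN)
    return delta.total_seconds()


# Row 1 from the question: column 1 is later than column 2
print(seconds_between("11/20/2018 3:00:13 PM", "11/23/2017 6:45:00 AM") > 0)
```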

I have saved the sample you provided in a tab separated file - with no headers. I have imported it as a DataFrame using (note that I specified your date format in date_parser):
import pandas as pd
import datetime as dt
df = pd.read_csv(PATH_TO_YOUR_FILE, sep="\t", names=["col1", "col2"], parse_dates=[0,1], date_parser=lambda x:dt.datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p"))
To select the rows you need:
df.loc[df.loc[:,"col2"]>df.loc[:,"col1"],:]

You can use pd.to_datetime to parse the date-time strings and then use their comparison as a condition to filter the required rows.
Demo:
import pandas as pd
df = pd.DataFrame({
    'start': ['11/20/2018 3:00:13 PM', '12/22/2019 4:00:12 PM', '10/10/2018 2:02:19 PM'],
    'end': ['11/23/2017 6:45:00 AM', '1/10/2020 4:50:11 AM', '10/07/2018 1:04:15 PM']
})
result = pd.DataFrame(df[
    pd.to_datetime(df['start'], format='%m/%d/%Y %I:%M:%S %p') <
    pd.to_datetime(df['end'], format='%m/%d/%Y %I:%M:%S %p')
])
print(result)
Output:
                   start                   end
1  12/22/2019 4:00:12 PM  1/10/2020 4:50:11 AM

Splitting a CSV column into two

Column 2 within my csv file looks like the following:
20150926T104044Z
20150926T104131Z
and so on.
I have written a function that converts a date into a Julian date, but how can I apply it to this specific column of data?
Is there a way to make Python change the dates within the CSV to their Julian date equivalent? Can I split the column into two CSVs and translate the Julian date from there?

You might be overthinking it. Try this.
from dateutil.parser import parse
import csv
def get_julian(_date):
    # _date is holding 20150926T104044Z
    the_date = parse(_date)
    julian_start = parse('19000101T000000Z')
    julian_days = (the_date - julian_start).days
    return julian_days
with open('filename.csv') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        # Column 2, right?
        row[1] = get_julian(row[1])
        # Do things and stuff with your corrected data.

I observed that there are several interpretations of "Julian day". One is the ordinal date (day of the year), and another is the number of days since Monday, January 1, 4713 BC.
import pandas as pd
import datetime
import jdcal
df = pd.read_csv("path/to/your/csv")
def tojulianDate(date):
    return datetime.datetime.strptime(date, '%Y%m%dT%H%M%SZ').strftime('%y%j')
def tojulianDate2(date):
    curr_date = datetime.datetime.strptime(date, '%Y%m%dT%H%M%SZ')
    curr_date_tuple = curr_date.timetuple()
    return int(sum(jdcal.gcal2jd(curr_date_tuple.tm_year, curr_date_tuple.tm_mon, curr_date_tuple.tm_mday)))
df['Calendar_Dates'] = df['Calendar_Dates'].apply(tojulianDate2)
df.to_csv('path/to/modified/csv')
The function tojulianDate returns the two-digit year followed by the day of the year (the ordinal date).
For the second format, the jdcal library converts a Gregorian date to a Julian day and vice versa, which is what tojulianDate2 does. This can also be done directly by opening the CSV, without loading it into a DataFrame.
A similar question was answered here: Extract day of year and Julian day from a string date in python
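Both interpretations can also be sketched with the standard library alone, without jdcal. The function names below are hypothetical. The second function relies on the fact that `date.toordinal()` counts days from 0001-01-01 (which is 1) in the proleptic Gregorian calendar, and that midnight at the start of that day corresponds to Julian Date 1721425.5.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # e.g. 20150926T104044Z


def day_of_year(stamp):
    """Ordinal date: 1 for January 1, up to 365 or 366."""
    return datetime.strptime(stamp, STAMP_FORMAT).timetuple().tm_yday


def julian_date_at_midnight(stamp):
    """Julian Date at 00:00 of the stamp's calendar day (time of day ignored)."""
    ordinal = datetime.strptime(stamp, STAMP_FORMAT).date().toordinal()
    return ordinal + 1721424.5


print(day_of_year("20150926T104044Z"))
print(julian_date_at_midnight("20150926T104044Z"))
```

For the sample value this gives 269 and 2457291.5. Taking `int()` of the second value would truncate it to 2457291.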

Convert any Date String Format to a specific date format string

I am making a generic tool that can take any CSV file. I have a CSV file that looks something like this. The first row holds the column names and the second row holds the variable types.
sam.csv
Time,M1,M2,M3,CityName
temp,num,num,num,city
20-06-13,19,20,0,aligarh
20-02-13,25,42,7,agra
20-03-13,23,35,4,aligarh
20-03-13,21,32,3,allahabad
20-03-13,17,27,1,aligarh
20-02-13,16,40,5,aligarh
Other CSV file looks like:
Time,M1,M2,M3,CityName
temp,num,num,num,city
20/8/16,789,300,10,new york
12/6/17,464,67,23,delhi
12/6/17,904,98,78,delhi
So there could be any date format, or it could be a timestamp. I want to convert it to a "%d-%b-%y" string such as "20-May-13" every time, and sort the column from the oldest date to the newest. I have been able to find the column whose type is "temp" and tried to convert it to the required format, but all the methods require me to specify the original format, which is not possible in my case.
Code--
import csv
import time
from datetime import datetime,date
import pandas as pd
import dateutil
from dateutil.parser import parse
filename = 'sam.csv'
data_date = pd.read_csv(filename)
column_name = data_date.ix[:, data_date.loc[0] == "temp"]
column_work = column_name.iloc[1:]
column_some = column_work.iloc[:,0]
default_date = datetime.combine(date.today(), datetime.min.time()).replace(day=1)
for line in column_some:
    print(parse(line[0], default=default_date).strftime("%d-%b-%y"))
In "sam.csv" the dates are in 2013. My output has the correct format, but all six dates come out as 2-Mar-2018.

You can use the dateutil library for converting any date format to your required format.
Ex:
import csv
from dateutil.parser import parse
p = "PATH_TO_YOUR_CSV.csv" #I have used your sample data to test.
with open(p, "r") as infile:
    reader = csv.reader(infile)
    next(reader)  # Skip the column-name row
    next(reader)  # Skip the variable-type row
    for line in reader:
        print(parse(line[0]).strftime("%d-%B-%y"))  # Parse the date and convert it to day-month-year
Output:
20-June-13
20-February-13
20-March-13
20-March-13
20-March-13
20-February-13
20-August-16
06-December-17
06-December-17
MoreInfo on Dateutil
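If installing dateutil is not an option, one common fallback is to try a list of candidate formats with `datetime.strptime` and keep the first that matches. The sketch below is an assumption-laden illustration, not a general solution: the candidate list and the function names are hypothetical, and the order of the list decides how ambiguous values are read. Here "12/6/17" is treated as day first (12 June), whereas the dateutil output above reads it as month first (6 December).

```python
from datetime import datetime

# Hypothetical candidate list; extend it with the formats you expect to see.
CANDIDATE_FORMATS = ["%d-%m-%y", "%d/%m/%y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S"]


def parse_any(text, formats=CANDIDATE_FORMATS):
    """Return a datetime using the first candidate format that matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"no known format matches {text!r}")


def normalise_and_sort(values, out_format="%d-%b-%y"):
    """Parse every value, sort oldest first, and format uniformly."""
    return [d.strftime(out_format) for d in sorted(parse_any(v) for v in values)]


print(normalise_and_sort(["20-06-13", "20-02-13", "20/8/16", "12/6/17"]))
```

Note that the whole string is passed to the parser. Passing only its first character, as `line[0]` does when `line` is already a string, leaves the parser with a single digit and forces it to fill in the rest from the default date.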

csv file to date format

I would like to read in a csv file of dates (shown below) and loop through it using solar.GetAltitude on each date to calculate a list of sun altitudes. (I'm using Python 2.7.2 on Windows 7 Enterprise.)
CSV file:
TimeStamp
01/01/2014 00:10
01/01/2014 00:20
01/01/2014 00:30
01/01/2014 00:40
My code gives the following error ValueError: unconverted data remains:. This suggests the wrong date format, but it works fine on a single date, rather than a string of dates.
I've researched this topic carefully on Stack Overflow. I've also tried the map function, np.datetime64 and reading to a list rather than a string but get a different error referring to no attribute 'year'.
I'd really appreciate any help because I'm running out of ideas.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
# Create a csv reader object.
reader = csv.reader(f)
# Ignore the header row.
next(reader)
# Store the dates in the appropriate list.
for row in reader:
dates.append(row)
print row
# Change list to string so can use a function on it
lines = []
for date in dates:
lines.append('\t'.join(map(str, date)))
result = '\n'.join(lines)
print result
minutes = []
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in minutes:
utc_datetime = i
altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
altitude_list.append(altitude)
print altitude_list

First of all, the code is not indented properly making it harder to guess.
I think the input to datetime.datetime.strptime is not correct. You create result by using a '\n'.join(...) but the format string does not contain the '\n'. Creating a string from the list of dates seems unnecessary to me.
I think what you want is this:
for date in dates:
    # each row from csv.reader is a list, so take its first field
    minutes.append(datetime.datetime.strptime(date[0], '%d/%m/%Y %H:%M'))
Note that the names you use for the lists are misleading as minutes holds datetime.datetime objects rather than minute values!

Many thanks to Vikramis and Lutz Horn for their help and comments. After experimenting with Vikramis' code, I achieved a working version which I have copied below.
My error occurred at line 40:
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
I found that I needed to create a string from the list to avoid the following error: "TypeError: must be string, not list". I have now tidied this up by using str(date) to replace the for loop, and hopefully used more sensible names.
My problem was with the formatting. It needs to be
"['%d/%m/%Y %H:%M']" because I'm accessing items in a list, rather than "'%d/%m/%Y %H:%M'" which works in the shell for a single date.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
datetimeObj = []
altitude_list = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
    # Create a csv reader object.
    reader = csv.reader(f)
    # Ignore the header row.
    next(reader)
    # Store the dates in the appropriate list.
    for row in reader:
        dates.append(row)
        print row
# Change format to datetime
# str(date) used to avoid TypeError: must be string, not list
for date in dates:
    datetimeObj.append(datetime.datetime.strptime(str(date),"['%d/%m/%Y %H:%M']"))
for j in datetimeObj:
    print j
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in datetimeObj:
    utc_datetime = i
    altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
    print altitude
    altitude_list.append(altitude)
# print altitude_list
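An alternative to wrapping the format string in "['...']" is to take the first field of each row, since csv.reader yields every row as a list of strings. The following is a minimal sketch in Python 3 syntax (unlike the Python 2.7 code above); the function name `read_timestamps` and the in-memory sample are illustrative assumptions, and the solar altitude step is left out.

```python
import csv
import io
from datetime import datetime


def read_timestamps(lines, fmt="%d/%m/%Y %H:%M"):
    """Parse the first field of every row after the header."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    # Each row is a list such as ['01/01/2014 00:10']; row[0] is the string.
    return [datetime.strptime(row[0], fmt) for row in reader if row]


# In-memory sample standing in for the real file
sample = io.StringIO("TimeStamp\n01/01/2014 00:10\n01/01/2014 00:20\n")
for stamp in read_timestamps(sample):
    print(stamp)
```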

pythonian comparison: date.time from csv file to date.time from timestamp

In Python I import a CSV file with one datetime value in each row (2013-03-14 07:37:33)
and I want to compare it with the datetime values I obtain from a timestamp.
I assume that reading the CSV gives me strings, but when I compare them in a loop with the values from the timestamp, the comparison never matches and no error is raised.
Any suggestions?
csv_in = open('FakeOBData.csv', 'rb')
reader = csv.reader(csv_in)
for row in reader:
    date = row
    OBD.append(date)
.
.
.
for x in OBD:
    print x
    sightings = db.edge.find ( { "tag" : int(participant_tag)},{"_id":0}).sort("time")
    for sighting in sightings:
        time2 = datetime.datetime.fromtimestamp(time)
        if x == time2:

Use datetime.datetime.strptime to parse the strings into datetime objects. You may also have to work out what time zone the date strings in your CSV are from and adjust for that.
%Y-%m-%d %H:%M:%S should work as your format string:
x_datetime = datetime.datetime.strptime(x[0], '%Y-%m-%d %H:%M:%S')  # x is a CSV row (a list), so take its first field
if x_datetime == time2:
Or parse it when reading:
for row in reader:
    date = datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
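The time zone caveat can be made concrete with a short sketch. It assumes the CSV strings are in UTC, which the question does not state, so adjust the zone if yours differ. The helper name `matches_timestamp` is hypothetical. Making both values timezone-aware avoids silently comparing a UTC value with a local-time one.

```python
from datetime import datetime, timezone

CSV_FORMAT = "%Y-%m-%d %H:%M:%S"


def matches_timestamp(csv_text, epoch_seconds):
    """Compare a CSV datetime string (assumed UTC) with a Unix timestamp."""
    from_csv = datetime.strptime(csv_text, CSV_FORMAT).replace(tzinfo=timezone.utc)
    from_epoch = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return from_csv == from_epoch


# Build a timestamp for the sample value from the question
epoch = datetime(2013, 3, 14, 7, 37, 33, tzinfo=timezone.utc).timestamp()
print(matches_timestamp("2013-03-14 07:37:33", epoch))
```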

You could parse it yourself with datetime.datetime.strptime, which should be fine if you know the format the date is in. If you do not know the format, or want to be more robust, I would advise you to use the parser from the python-dateutil library, which handles a wide range of formats.
pip install python-dateutil
Then
import dateutil.parser
d = dateutil.parser.parse('1 Jan 2012 12pm UTC') # it's that robust!
