Splitting a CSV column into two - python

Column 2 within my csv file looks like the following:
20150926T104044Z
20150926T104131Z
and so on.
I have a function that converts the listed date into a Julian date, but was wondering how I can go about altering this specific column of data?
Is there a way I can make Python change the dates within the CSV to their Julian date equivalent? Can I split the column into two CSVs and translate the Julian date from there?

You might be overthinking it. Try this.
from dateutil.parser import parse
import csv

def get_julian(_date):
    # _date is holding 20150926T104044Z
    the_date = parse(_date)
    julian_start = parse('19000101T000000Z')
    julian_days = (the_date - julian_start).days
    return julian_days

with open('filename.csv') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        # Column 2, right?
        row[1] = get_julian(row[1])
        # Do things and stuff with your corrected data.
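The loop above converts each value in memory but never writes it anywhere. A minimal sketch of saving the converted column to a new CSV follows, using only the standard library. The names `days_since_epoch` and `convert_column` are hypothetical, the 1900-01-01 reference date is taken from the answer above, and the sketch assumes the file has no header row.

```python
import csv
import io
from datetime import datetime

# Same reference date as the answer above (an assumption, not a true Julian epoch).
EPOCH = datetime(1900, 1, 1)

def days_since_epoch(stamp):
    """Convert a stamp like '20150926T104044Z' to whole days since EPOCH."""
    parsed = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
    return (parsed - EPOCH).days

def convert_column(infile, outfile, column=1):
    """Copy every row from infile to outfile, converting one column."""
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        row[column] = days_since_epoch(row[column])
        writer.writerow(row)

# Demo with in-memory files; real files should be opened with newline="".
source = io.StringIO("a,20150926T104044Z\nb,20150926T104131Z\n")
target = io.StringIO()
convert_column(source, target)
print(target.getvalue())
```

Writing to a second file rather than editing in place keeps the original data intact if a row fails to parse.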

I observed that there are several interpretations of "Julian date". One is the ordinal date (day of the year), and another is the number of days since Monday, January 1, 4713 BC.
import pandas as pd
import datetime
import jdcal

df = pd.read_csv("path/to/your/csv")

def tojulianDate(date):
    return datetime.datetime.strptime(date, '%Y%m%dT%H%M%SZ').strftime('%y%j')

def tojulianDate2(date):
    curr_date = datetime.datetime.strptime(date, '%Y%m%dT%H%M%SZ')
    curr_date_tuple = curr_date.timetuple()
    return int(sum(jdcal.gcal2jd(curr_date_tuple.tm_year, curr_date_tuple.tm_mon, curr_date_tuple.tm_mday)))

df['Calendar_Dates'] = df['Calendar_Dates'].apply(tojulianDate2)
df.to_csv('path/to/modified/csv')
The function tojulianDate returns the ordinal date (two-digit year followed by the day of the year).
For the second format, the jdcal library converts a Gregorian date to a Julian day and vice versa, which is what tojulianDate2 does. This can also be done by opening the CSV directly, without loading it into a dataframe.
A similar question was answered here: Extract day of year and Julian day from a string date in python
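Both interpretations can also be computed with the standard library alone. The sketch below is an assumption-laden alternative to jdcal, not its API: `ordinal_date` and `julian_day` are hypothetical names, and the constant relies on Python's proleptic Gregorian ordinal, where ordinal 1 (0001-01-01) corresponds to Julian Day 1721425.5 at midnight.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"
# Julian Day at 00:00 equals date.toordinal() plus this offset.
JD_OFFSET = 1721424.5

def ordinal_date(stamp):
    """Return four-digit year followed by day of year, e.g. '2015269'."""
    return datetime.strptime(stamp, STAMP_FORMAT).strftime("%Y%j")

def julian_day(stamp):
    """Return the Julian Day, including the fraction of the day elapsed."""
    parsed = datetime.strptime(stamp, STAMP_FORMAT)
    seconds = parsed.hour * 3600 + parsed.minute * 60 + parsed.second
    return parsed.toordinal() + JD_OFFSET + seconds / 86400

print(ordinal_date("20150926T104044Z"))
print(julian_day("20150926T000000Z"))
```

Unlike tojulianDate2, this keeps the half-day offset and the time of day instead of truncating to an integer.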

Related

How can I take list of Dates from csv (as strings) and return only the dates/data between a start date and end date?

I have a csv file with dates in format M/D/YYYY from 1948 to 2017. I'm able to plot other columns/lists associated with each date by list index. I want to be able to ask the user for a start date, and an end date, then return/plot the data from only within that period.
The problem is that the dates read in from the csv are strings, so I cannot use if date[x] >= startDate && date[x] <= endDate, because there's no way for me to turn dates in this format into integers.
Here is my csv file
I am already able to read in the dates from the csv to its own list.
How can I take the dates in my list and only return the ones within the user specified date range?
Here is my function for plotting the entire dataset right now:
#CSV Plotting function
def CSV_Plot (data,header,column1,column2):
    #pyplot.plot([item[column1] for item in data] , [item[column2] for item in data])
    pyplot.scatter([item[column1] for item in data] , [item[column2] for item in data])
    pyplot.xlabel(header[column1])
    pyplot.ylabel(header[column2])
    pyplot.show()
    return True
CSV_Plot(mycsvdata,data_header,dateIndex,rainIndex)
This is how I am asking the user to input the start and end dates:
#Ask user for start date in M/D/YYYY format
startDate = input('Please provide the start date (M/D/YYYY) of the period for the data you would like to plot: ')
endDate = input('Please provide the end date (M/D/YYYY) of the period for the data you would like to plot: ')
You need to compare the dates.
I would suggest parsing the dates from your CSV into a datetime object, and also turning the user input value into a datetime object.
How to create a datetime object from a string? You need to specify the format string and the strptime() will parse it for you. Details here:
Converting string into datetime
In your case, it could be something like
from datetime import datetime
# Considering date is in M/D/YYYY format
datetime_object1 = datetime.strptime(date_string, "%m/%d/%Y")
Then you can compare them with a > or < operator. Here you can find details of how to compare the dates.
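Putting the pieces together, a minimal sketch of the range filter might look like this. The function name `rows_in_range` and the sample rows are made up for illustration; the only assumption about the data is the M/D/YYYY format stated in the question.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # strptime also accepts months and days without leading zeros

def rows_in_range(rows, date_index, start_text, end_text):
    """Return the rows whose date column falls within [start, end], inclusive."""
    start = datetime.strptime(start_text, DATE_FORMAT)
    end = datetime.strptime(end_text, DATE_FORMAT)
    selected = []
    for row in rows:
        when = datetime.strptime(row[date_index], DATE_FORMAT)
        if start <= when <= end:
            selected.append(row)
    return selected

# Hypothetical sample data: date in column 0, rainfall in column 1.
sample = [["1/1/1948", "0.47"], ["6/15/1990", "0.00"], ["12/14/2017", "0.10"]]
print(rows_in_range(sample, 0, "1/1/1980", "12/31/1999"))
```

The filtered list can then be passed to the existing plotting function in place of the full dataset.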

How to find missing dates in an excel file by python

I'm a beginner in python. I have an excel file. This file shows the rainfall amount between 2016-1-1 and 2020-6-30. It has 2 columns. The first column is the date, the other column is the rainfall. Some dates are missing from the file (the rainfall was not estimated). For example, there isn't a row for 2016-05-05 in my file. This is a sample of my excel file.
Date rainfall (mm)
1/1/2016 10
1/2/2016 5
.
.
.
12/30/2020 0
I want to find the missing dates but my code doesn't work correctly!
import pandas as pd
from datetime import datetime, timedelta
from matplotlib import dates as mpl_dates
from matplotlib.dates import date2num
df=pd.read_excel ('rainfall.xlsx')
a= pd.date_range(start = '2016-01-01', end = '2020-06-30' ).difference(df.index)
print(a)
Here's a beginner-friendly way of doing it.
First you need to make sure that the Date in your dataframe is really a date and not a string or object.
Type (or print) df.info().
The date column should show up as datetime64[ns].
If not, df['Date'] = pd.to_datetime(df['Date'], dayfirst=False) fixes that. (Use dayfirst to tell pandas whether the month or the day comes first in your date string, because pandas cannot know. Month first is the default, so it would also work without it.)
For the task of finding missing days, there are many ways to solve it. Here's one.
Turn all dates into a series
all_dates = pd.Series(pd.date_range(start = '2016-01-01', end = '2020-06-30' ))
Then print all dates from that series which are not in your dataframe "Date" column. The ~ sign means "not".
print(all_dates[~all_dates.isin(df['Date'])])
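As a cross-check that does not depend on pandas, the same idea can be sketched with the standard library: build every date in the range and keep the ones that were not recorded. The function name `missing_dates` and the three sample dates are hypothetical.

```python
from datetime import date, datetime, timedelta

def missing_dates(present, start, end):
    """Return every date from start to end (inclusive) that is not in present."""
    seen = set(present)
    span = (end - start).days + 1
    candidates = (start + timedelta(days=n) for n in range(span))
    return [day for day in candidates if day not in seen]

# Hypothetical sample in the month-first format used by the spreadsheet.
recorded = [datetime.strptime(text, "%m/%d/%Y").date()
            for text in ["1/1/2016", "1/2/2016", "1/4/2016"]]
print(missing_dates(recorded, date(2016, 1, 1), date(2016, 1, 5)))
```

Converting the recorded dates to a set first keeps each membership test cheap even for several years of daily data.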
Try:
df = pd.read_excel('rainfall.xlsx', usecols=[0])
a = pd.date_range(start = '2016-01-01', end = '2020-06-30').difference([l[0] for l in df.values])
print(a)
The dates in the file must look like 2016/1/1.
To find the missing dates from a list, you can apply the Conditional Formatting function in Excel. 4. Click OK > OK, then the positions of the missing dates are highlighted. Note: The last date in the date list will be highlighted.
This trick does not use Python; it is a plain Excel trick.

How would I normalize dates in a csv file? python

I have a CSV file with a field named start_date that contains data in a variety of formats.
Some of the formats include e.g., June 23, 1912 or 5/11/1930 (month, day, year). But not all values are valid dates.
I want to add a start_date_description field adjacent to the start_date column to filter invalid date values into. Lastly, normalize all valid date values in start_date to ISO 8601 (i.e., YYYY-MM-DD).
So far I was only able to load the start_date column from my file. I am stuck and would appreciate any help. Please, any solution, especially one without using a library, would be great!
import csv
date_column = ("start_date")
f = open("test.csv","r")
csv_reader = csv.reader(f)
headers = None
results = []
for row in csv_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in date_column:
                headers.append(i)
    else:
        results.append(([row[i] for i in headers]))
print results
One way is to use the dateutil module. You can parse data as follows:
from dateutil import parser
parser.parse('3/16/78')
parser.parse('4-Apr') # this will give current year i.e. 2017
Then converting to your format can be done by
dt = parser.parse('3/16/78')
dt.strftime('%Y-%m-%d')
Suppose you have the table in dataframe format. You can now define a parsing function and apply it to the column as follows:
def parse_date(start_time):
    try:
        return parser.parse(start_time).strftime('%Y-%m-%d')
    except (ValueError, OverflowError, TypeError):
        # the value is not a recognizable date
        return ''
df['parse_date'] = df.start_date.map(parse_date)
Question: ... add a start_date_description ... normalize ... to ISO 8601
This reads the file test.csv, validates the date string in the column start_date against a list of date directive patterns, and returns a dict{description, ISO}. The returned dict is used to update the current row dict, and the updated row dict is written to the file test_update.csv.
Put this in a NEW Python file and run it!
A missing valid date directive pattern can simply be added to the list.
Python » 3.6 Documentation: 8.1.8. strftime() and strptime() Behavior
import csv
import re
from datetime import datetime as dt

def validate(date):
    def _dict(desc, date):
        return {'start_date_description':desc, 'ISO':date}

    # Try each known pattern in turn; the first one that parses wins.
    for format in [('%m/%d/%y','Valid'), ('%b-%y','Short, missing Day'), ('%d-%b-%y','Valid'),
                   ('%d-%b','Short, missing Year')]: #, ('%B %d. %Y','Valid')]:
        try:
            _dt = dt.strptime(date, format[0])
            return _dict(format[1], _dt.strftime('%Y-%m-%d'))
        except ValueError:
            continue

    if not re.search(r'\d+', date):
        return _dict('No Digit', None)
    return _dict('Unknown Pattern', None)

with open('test.csv') as fh_in, open('test_update.csv', 'w', newline='') as fh_out:
    csv_reader = csv.DictReader(fh_in)
    csv_writer = csv.DictWriter(fh_out,
                                fieldnames=csv_reader.fieldnames +
                                           ['start_date_description', 'ISO'] )
    csv_writer.writeheader()

    for row, values in enumerate(csv_reader,2):
        values.update(validate(values['start_date']))

        # Show only Invalid Dates
        if any(w in values['start_date_description']
               for w in ['Unknown', 'No Digit', 'missing']):
            print('{:>3}: {v[start_date]:13.13} {v[start_date_description]:<22} {v[ISO]}'.
                  format(row, v=values))

        csv_writer.writerow(values)
Output:
start_date     start_date_description  ISO
June 23. 1912  Valid                   1912-06-23
12/31/91       Valid                   1991-12-31
Oct-84         Short, missing Day      1984-10-01
Feb-09         Short, missing Day      2009-02-01
10-Dec-80      Valid                   1980-12-10
10/7/81        Valid                   1981-10-07
Facere volupt  No Digit                None
... (omitted for brevity)
Tested with Python: 3.4.2
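For the two formats named in the question (June 23, 1912 and 5/11/1930), a smaller standard-library sketch could look like the following. The function name `normalize` and the choice to copy the rejected value into start_date_description are assumptions about what the asker wants; `%B` matches English month names only under an English or C locale.

```python
from datetime import datetime

# Formats taken from the question; extend the list as new patterns turn up.
KNOWN_FORMATS = ["%B %d, %Y", "%m/%d/%Y"]

def normalize(text):
    """Return (start_date, start_date_description) for one raw value.

    A valid date comes back as ISO 8601 with an empty description;
    anything else is moved into the description and the date is left empty.
    """
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d"), ""
        except ValueError:
            continue
    return "", cleaned

for raw in ["June 23, 1912", "5/11/1930", "circa 1900"]:
    print(raw, "->", normalize(raw))
```

Each row of the CSV would call normalize on its start_date value and write both returned fields back out.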

How to find earliest and latest dates from a CSV File [Python]

My CSV file is arranged so that there's a row named "Dates," and below that row is a gigantic column of a million dates, in the traditional format like "4/22/2015" and "3/27/2014".
How can I write a program that identifies the earliest and latest dates in the CSV file, while maintaining the original format (month/day/year)?
I've tried
for line in count_dates:
    dates = line.strip().split(sep="/")
    all_dates.append(dates)
print (all_dates)
I've tried to take away the "/" and replace it with a blank space, but it does not print anything.
import pandas as pd
import datetime
df = pd.read_csv('file_name.csv')
df['Dates'] = df['Dates'].apply(lambda v: datetime.datetime.strptime(v, '%m/%d/%Y'))
print(df['Dates'].min(), df['Dates'].max())
Considering you have a large file, reading it in its entirety into memory is a bad idea.
Read the file line by line, manually keeping track of the earliest and latest dates. Use datetime.datetime.strptime to convert the strings to dates (it takes the string format as a parameter).
import datetime
with open("input.csv") as f:
    f.readline() # get the "Dates" header out of the way
    first = f.readline().strip()
    earliest = datetime.datetime.strptime(first, "%m/%d/%Y")
    latest = datetime.datetime.strptime(first, "%m/%d/%Y")
    for line in f:
        date = datetime.datetime.strptime(line.strip(), "%m/%d/%Y")
        if date < earliest: earliest = date
        if date > latest: latest = date
print("Earliest date:", earliest)
print("Latest date:", latest)
Let's open the csv file and read out all the dates. Then use strptime to turn them into comparable datetime objects (now we can use max). Lastly, let's print out the biggest (latest) date.
import csv
from datetime import datetime as dt
with open('path/to/file') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    latest = max(dt.strptime(row[0], "%m/%d/%Y") for row in reader)
print(dt.strftime(latest, "%m/%d/%Y"))
Naturally, you can use min to get the earliest date. However, this takes two linear runs, and you can do this with just one, if you are willing to do some heavy lifting yourself:
import csv
from datetime import datetime as dt
with open('path/to/file') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    # The first data row initializes both extremes.
    date, *_rest = next(reader)
    earliest = latest = dt.strptime(date, "%m/%d/%Y")
    for date, *_rest in reader:
        date = dt.strptime(date, "%m/%d/%Y")
        earliest = min(date, earliest)
        latest = max(date, latest)
print("earliest:", dt.strftime(earliest, "%m/%d/%Y"))
print("latest:", dt.strftime(latest, "%m/%d/%Y"))
A bit of an RTFM answer: Open the file in csv format (see the csv library), and then iterate line by line converting the field that is a date into a date object (see the docs for converting a string to a date object), and if it is less than minimum so far store it as minimum, similar for max, with a special condition on the first line that the date becomes both min and max dates.
Or for some overkill you could just use Pandas to read it into a data frame specifying the specific column as date format then just use max & min.
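The line-by-line approach described above can be sketched so that it also keeps the original text of each date, since strftime with %m/%d/%Y would zero-pad "4/22/2015" into "04/22/2015". The function name `earliest_and_latest` and the in-memory sample are hypothetical; the sketch assumes a single header row and the date in the first column.

```python
import csv
import io
from datetime import datetime

def earliest_and_latest(lines):
    """Return the earliest and latest dates as they were written in the file."""
    reader = csv.reader(lines)
    next(reader)  # skip the "Dates" header row
    earliest = latest = None  # each holds a (parsed, raw_text) pair
    for row in reader:
        if not row:
            continue  # ignore blank lines
        raw = row[0]
        parsed = datetime.strptime(raw, "%m/%d/%Y")
        if earliest is None or parsed < earliest[0]:
            earliest = (parsed, raw)
        if latest is None or parsed > latest[0]:
            latest = (parsed, raw)
    if earliest is None:
        return None, None  # the file had no data rows
    return earliest[1], latest[1]

sample = io.StringIO("Dates\n4/22/2015\n3/27/2014\n12/1/2014\n")
print(earliest_and_latest(sample))
```

Only two dates are held in memory at any time, so the size of the file does not matter.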
I think it is more convenient to use pandas for this purpose.
import pandas as pd
df = pd.read_csv('file_name.csv')
df['name_of_column_with_date'] = pd.to_datetime(df['name_of_column_with_date'], format='%m/%d/%Y')
print('min_date{}'.format(min(df['name_of_column_with_date'])))
print('max_date{}'.format(max(df['name_of_column_with_date'])))
The built-in min and max functions work well with pandas Series.
For a better understanding of the format argument of pd.to_datetime, you can use a Python strftime cheat sheet.

csv file to date format

I would like to read in a csv file of dates (shown below) and loop through it using solar.GetAltitude on each date to calculate a list of sun altitudes. (I'm using Python 2.7.2 on Windows 7 Enterprise.)
CSV file:
TimeStamp
01/01/2014 00:10
01/01/2014 00:20
01/01/2014 00:30
01/01/2014 00:40
My code gives the following error ValueError: unconverted data remains:. This suggests the wrong date format, but it works fine on a single date, rather than a string of dates.
I've researched this topic carefully on Stack Overflow. I've also tried the map function, np.datetime64 and reading to a list rather than a string but get a different error referring to no attribute 'year'.
I'd really appreciate any help because I'm running out of ideas.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
# Create a csv reader object.
reader = csv.reader(f)
# Ignore the header row.
next(reader)
# Store the dates in the appropriate list.
for row in reader:
dates.append(row)
print row
# Change list to string so can use a function on it
lines = []
for date in dates:
lines.append('\t'.join(map(str, date)))
result = '\n'.join(lines)
print result
minutes = []
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in minutes:
utc_datetime = i
altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
altitude_list.append(altitude)
print altitude_list
First of all, the code is not indented properly making it harder to guess.
I think the input to datetime.datetime.strptime is not correct. You create result by using a '\n'.join(...) but the format string does not contain the '\n'. Creating a string from the list of dates seems unnecessary to me.
I think what you want is this:
for date in dates:
    minutes.append(datetime.datetime.strptime(date, '%d/%m/%Y %H:%M'))
Note that the names you use for the lists are misleading as minutes holds datetime.datetime objects rather than minute values!
Many thanks to Vikramis and Lutz Horn for their help and comments. After experimenting with Vikramis' code, I achieved a working version which I have copied below.
My error occurred at line 40:
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
I found that I needed to create a string from the list to avoid the following error: "TypeError: must be string, not list". I have now tidied this up by using str(date) to replace the for loop, and have hopefully used more sensible names.
My problem was with the formatting. It needs to be
"['%d/%m/%Y %H:%M']" because I'm accessing items in a list, rather than "'%d/%m/%Y %H:%M'" which works in the shell for a single date.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
datetimeObj = []
altitude_list = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
    # Create a csv reader object.
    reader = csv.reader(f)
    # Ignore the header row.
    next(reader)
    # Store the dates in the appropriate list.
    for row in reader:
        dates.append(row)
        print row
# Change format to datetime
# str(date) used to avoid TypeError: must be string, not list
for date in dates:
    datetimeObj.append(datetime.datetime.strptime(str(date),"['%d/%m/%Y %H:%M']"))
for j in datetimeObj:
    print j
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in datetimeObj:
    utc_datetime = i
    altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
    print altitude
    altitude_list.append(altitude)
# print altitude_list
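The str(date) workaround is needed only because each row returned by csv.reader is a list. A possible alternative, sketched here in Python 3 rather than the 2.7 used in the thread, is to index the first field of each row so that the plain format string works. The function name `read_timestamps` and the in-memory sample are hypothetical, and the solar.GetAltitude call is left out because it depends on a third-party module.

```python
import csv
import io
from datetime import datetime

def read_timestamps(lines):
    """Parse the first column of each data row into a datetime object."""
    reader = csv.reader(lines)
    next(reader)  # skip the "TimeStamp" header row
    # row is a list such as ['01/01/2014 00:10'], so row[0] is the string.
    return [datetime.strptime(row[0], "%d/%m/%Y %H:%M") for row in reader if row]

sample = io.StringIO("TimeStamp\n01/01/2014 00:10\n01/01/2014 00:20\n")
stamps = read_timestamps(sample)
for stamp in stamps:
    print(stamp)
```

Each resulting datetime could then be passed to the altitude calculation in the final loop.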
