Checking for missing values in CSV

I have a CSV file that records data with a timestamp every 15 minutes.
I have tried to figure out how to check whether any of the 15-minute intervals are missing from the file, but I can't get the code to work 100%.
Hope you can help!
First I gathered the data from the CSV and turned it into timestamps. The format is yyyy-mm-dd-hh:mm.
lst = list()
with open("CHFJPY15.csv", "r") as f:
    f_r = f.read()
    sline = f_r.split()
    for line in sline:
        parts = line.split(',')
        date = parts[0]
        time = parts[1]
        closeingtime = parts[5]
        timestamp = date + ' ' + time + ' ' + closeingtime
        lst.append(timestamp)
print(lst, "liste")
(All credits to BillBell for the code below)
Here I try creating a consistently formatted list of data items.
from datetime import timedelta
interval = timedelta(minutes=15)
from datetime import datetime
current_time = datetime(2015,12,9,19,30)
data = []
omits = [3,5,9,11,17]
for i in range(20):
    current_time += interval
    if i in omits:
        continue
    data.append(current_time.strftime('%y.%m.%d.%H:%M')+' 123.456')
Now I read through the dates, subtracting each from its predecessor. I set the first 'predecessor', which I call previous, to now because that's bound to differ from the other dates.
I split each datum from the list into two, ignoring the second piece. Using strptime I turn strings into dates. Dates can be subtracted and the differences compared.
previous = datetime.now().strftime('%y.%m.%d.%H:%M')
first = True
for d in data:
    date_part, other = d.split(' ')
    if datetime.strptime(date_part, '%y.%m.%d.%H:%M') - datetime.strptime(previous, '%y.%m.%d.%H:%M') != interval:
        if not first:
            print('unacceptable gap prior to ', date_part)
        else:
            first = False
    previous = date_part
Hope you can see the problem.
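One way to sketch the gap check is to parse every timestamp once and compare neighbouring pairs. This is only a minimal sketch: the function name find_gaps, the sample list and the format string "%Y.%m.%d %H:%M" are assumptions for illustration, since the real layout of CHFJPY15.csv is not shown.

```python
from datetime import datetime, timedelta


def find_gaps(stamps, interval=timedelta(minutes=15), fmt="%Y.%m.%d %H:%M"):
    """Return (previous, current) pairs whose spacing differs from interval."""
    # fmt is an assumed layout; adjust it to match the real file.
    times = [datetime.strptime(s, fmt) for s in stamps]
    gaps = []
    # zip pairs each timestamp with the one that follows it.
    for previous, current in zip(times, times[1:]):
        if current - previous != interval:
            gaps.append((previous, current))
    return gaps


# Hypothetical sample data: the 20:15 entry is missing.
sample = [
    "2015.12.09 19:45",
    "2015.12.09 20:00",
    "2015.12.09 20:30",
    "2015.12.09 20:45",
]
gaps = find_gaps(sample)
for previous, current in gaps:
    print(f"gap between {previous} and {current}")
```

Comparing neighbours directly avoids the need for a dummy "previous" value and a first-iteration flag.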

Related

Time and date strings to DateTime Objects from csv file in python

So I'm working on a function which checks the dates in each row of a csv file against a standard date made up of two cells in the header row. What I need to do is take the date from A2 and the time from A3 and concatenate them into one object which can be compared against the rest of the rows of the file and then from there expel the rows which fail the test.
The only problem I'm having is in running the comparison with the time objects and getting the strings out of the csv. My current code gives me a ValueError because the value of date_time does not match the format %m/%d/%Y %H:%M:%S. That is correct, because the value of date_time is the entire line.
Right now I'm simply trying to get the comparison to run on an arbitrary static time.
But if I want to take the date from cell A2 and concatenate it with the time in cell A3, then compare that new object with the rest of the rows in the file whose time and date do not need concatenation, what is the best way to go about running this comparison when you don't know what the dates are going to be?
def CheckDates(f):
    with open(f, newline='', encoding='utf-8') as g:
        r = csv.reader(g)
        date_time = str(next(r))
        for line in r:
            if datetime.strptime(date_time, '%m/%d/%Y %H:%M:%S') >= datetime.strptime('01/11/2022 13:19:00', '%m/%d/%Y %H:%M:%S'):
                # Dates pass
                pass
            else:
                # Dates fail
                pass
edited typos and added an example csv
TD,08/24/2021,14:14:08,21012,223,0,1098,0,031,810,12,01,092,048,0008,02
Date/Time,G120010,M129000,G110100,M119030,G112070,G112080,G111030,G127020,G127030,G120020,G120030,G121020,G111040,G112010,P102000,G112020,G112040,G112090,G110050,G110060,G110070,T111100
06/27/2022 00:00:01,40,133.2,0,0,7.284853,0,0.6030464,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:03,40,133.2,0,0,7.284853,0,0.5898247,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:05,40,133.2,0,0,7.284853,0,0.6135368,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:07,40,133.2,0,0,7.284853,0,0.6087456,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:09,40,133.2,0,0,7.284853,0,0.5903625,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:11,40,133.2,0,0,7.284853,0,0.5799789,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:13,40,133.2,0,0,7.284853,0,0.5821953,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:15,40,133.2,0,0,7.284853,0,0.6024017,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
06/27/2022 00:00:17,40,133.2,0,0,7.284853,0,0.5984001,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5

This should do the trick. I modified a couple rows of your data file for "dramatic effect"...
# time compare
from datetime import datetime, timedelta
f = 'data.csv'
with open(f, 'r') as src:
    row_0 = src.readline()
    tokens = row_0.strip().split(',') # split (tokenize) the line
    orig_time = tokens[1] + ' ' + tokens[2] # concatenate the strings
    base_time = datetime.strptime(orig_time, '%m/%d/%Y %H:%M:%S')
    print(f'recovered this base time: {base_time}')
    src.readline() # burn row 2
    # process the remainder
    for line in src:
        tokens = line.strip().split(',')
        row_time = datetime.strptime(tokens[0], '%m/%d/%Y %H:%M:%S')
        # calculate the difference. The result of subtracting datetimes
        # is a "timedelta" object that can be queried.
        td = row_time - base_time
        # make a comparison to see if it is pos/neg
        if td < timedelta(0):
            print('this line is before the base time:')
            print(f' {line}')
Output:
recovered this base time: 2021-08-24 14:14:08
this line is before the base time:
06/27/2019 00:55:05,40,133.2,0,0,7.284853,0,0.6135368,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
this line is before the base time:
06/27/2021 10:00:11,40,133.2,0,0,7.284853,0,0.5799789,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
this line is before the base time:
06/27/2020 00:00:17,40,133.2,0,0,7.284853,0,0.5984001,0,0,1,0,5,11,5,0,0,414,344,0,154,0,5
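The same idea can also be sketched with csv.reader, which is what the question started from. The in-memory sample and the function name rows_on_or_after_base are assumptions for illustration; the sample rows are shortened versions of the file layout shown in the question.

```python
import csv
import io
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

# Hypothetical in-memory sample shaped like the file in the question.
sample = io.StringIO(
    "TD,08/24/2021,14:14:08,21012\n"
    "Date/Time,G120010,M129000\n"
    "06/27/2022 00:00:01,40,133.2\n"
    "06/27/2020 00:00:17,40,133.2\n"
)


def rows_on_or_after_base(handle):
    """Return the base time from row 1 and the data rows that are not before it."""
    reader = csv.reader(handle)
    first = next(reader)
    # The date and the time sit in the second and third cells of the first row.
    base_time = datetime.strptime(f"{first[1]} {first[2]}", FMT)
    next(reader)  # skip the column-name row
    kept = [row for row in reader if datetime.strptime(row[0], FMT) >= base_time]
    return base_time, kept


base_time, kept = rows_on_or_after_base(sample)
print(base_time, len(kept))
```

Because datetime objects support >= directly, no timedelta is needed when only the ordering matters.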

How to create hourly list with datetime_range without year

I'm trying to create a list of the hours contained within each specified interval, which would be quite complicated with a loop. Therefore, I wanted to ask for datetime recommendations.
# input in format DDHH/ddhh:
validity = ['2712/2812','2723/2805','2800/2812']
# demanded output:
val_hours = ['2712', '2713', '2714'..., '2717', '2723', '2800',...'2804',]
It would be great if the last hour of validity were treated as non-valid, because the interval ends at that hour, or more precisely at the 59th minute of the previous one.
I've tried a quite complicated approach with if conditions and loops, but I am convinced that there is a better one, as always.
It is something like:
#input in format DDHH/ddhh:
validity = ['2712/2812','2723/2805','2800/2812']
output = []
#upbound = previously defined function giving the list of lengths of each group
upbound = [24, 6, 12]
#For only first 24-hour group:
for hour in range(0,upbound[0]):
    item = int(validity[0][-7:-5]) + hour
    if (hour >= 24):
        hour = hour - 24
    output = output + hour
Further, I would have to prefix numbers with a day smaller than 10, like 112 (1st, 12:00 Zulu), with a zero and ensure the correct day.
Loops and ifs seem just too complicated to me. Not to mention error handling, it looks like two or three conditions.
Thank you for your help!

For each validity string, I use datetime.strptime to parse it; then, depending on whether the start date is less than or equal to the end date or greater than it, I calculate the hours.
If the start date is less than or equal to the end date, I use the original validity string; otherwise I create two strings, start_date/3023 and 0100/end_date.
import datetime
validity = ['2712/2812','2723/2805','2800/2812','3012/0112','3023/0105','0110/0112']
def get_valid_hours(valid):
    hours_li = []
    #Parse the start date and end date as datetime
    start_date_str, end_date_str = valid.split('/')
    start_date = datetime.datetime.strptime(start_date_str,'%d%H')
    end_date = datetime.datetime.strptime(end_date_str, '%d%H')
    #If start date less than equal to end date
    if start_date <= end_date:
        dt = start_date
        i=0
        #Keep creating new dates until we hit end date
        while dt < end_date:
            #Append the dates to a list
            dt = start_date+datetime.timedelta(hours=i)
            hours_li.append(dt.strftime('%d%H'))
            i+=1
    #Else split the validity into two and calculate them separately
    else:
        start_date_str, end_date_str = valid.split('/')
        return get_valid_hours('{}/3023'.format(start_date_str)) + get_valid_hours('0100/{}'.format(end_date_str))
    #Append sublist to a bigger list
    return hours_li
for valid in validity:
    print(get_valid_hours(valid))
The output then looks like this (not sure if this was the format needed!):
['2712', '2713', '2714', '2715', '2716', '2717', '2718', '2719', '2720', '2721', '2722', '2723', '2800', '2801', '2802', '2803', '2804', '2805', '2806', '2807', '2808', '2809', '2810', '2811', '2812']
['2723', '2800', '2801', '2802', '2803', '2804', '2805']
['2800', '2801', '2802', '2803', '2804', '2805', '2806', '2807', '2808', '2809', '2810', '2811', '2812']
['3012', '3013', '3014', '3015', '3016', '3017', '3018', '3019', '3020', '3021', '3022', '3023', '0100', '0101', '0102', '0103', '0104', '0105', '0106', '0107', '0108', '0109', '0110', '0111', '0112']
['0100', '0101', '0102', '0103', '0104', '0105']
['0110', '0111', '0112']
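The output above includes the final hour of each interval, while the question asked for it to be excluded. A variant that leaves it out can be sketched as follows. The function name valid_hours and the year and month parameters are assumptions for illustration: the DDHH input carries no month, so a reference month has to be supplied to know where the month ends.

```python
from datetime import datetime, timedelta


def valid_hours(valid, year=2019, month=4):
    """Expand 'DDHH/ddhh' into hourly 'DDHH' strings, end hour excluded.

    year and month are assumed reference values, not part of the input.
    """
    start_str, end_str = valid.split("/")
    start = datetime(year, month, int(start_str[:2]), int(start_str[2:]))
    end_day, end_hour = int(end_str[:2]), int(end_str[2:])
    # An end day smaller than the start day means the interval
    # wraps into the following month.
    end_year, end_month = year, month
    if end_day < start.day:
        end_month = month % 12 + 1
        end_year = year + (1 if month == 12 else 0)
    end = datetime(end_year, end_month, end_day, end_hour)

    hours = []
    current = start
    while current < end:  # strict comparison leaves out the end hour
        hours.append(current.strftime("%d%H"))
        current += timedelta(hours=1)
    return hours


for valid in ["2712/2812", "2723/2805", "3023/0105"]:
    print(valid_hours(valid))
```

Using real calendar dates lets timedelta handle the day and month rollover, so the hard-coded 3023 split is not needed.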

Finally, I created something easy like this:
validity = ['3012/0112','3023/0105','0110/0112']
upbound = [24, 6, 12]
hours_list = []
for idx, val in enumerate(validity):
    hours_li = []
    DD = val[:2]
    HH = val[2:4]
    dd = val[5:7]
    hh = val[7:9]
    if DD == dd:
        for i in range(int(HH),upbound[idx]):
            hours_li.append(DD + str(i).zfill(2))
    if DD != dd:
        for i in range(int(HH),24):
            hours_li.append(DD + str(i).zfill(2))
        for j in range(0,int(hh)):
            hours_li.append(dd + str(j).zfill(2))
    hours_list.append(hours_li)
This works for 24h validity (it could be solved with one if condition and a similar concatenation block) and does not use datetime, just numbers and str. It is neither Pythonic nor fast, but it works.

How to search for a substring, find the beginning and ending, and then check if that data is a weekday?

I've come up with the following which should be fairly close, but it's not quite right. I am getting the following error when I try to test if the data is a weekday. AttributeError: 'str' object has no attribute 'isoweekday'
Here is my feeble code:
offset = str(link).find('Run:')
amount = offset + 15
pos = str(link)[offset:amount]
if pos.isoweekday() in range(1, 6):
    outF.write(str(link))
    outF.write('\n')
I'm looking for the string 'Run:  ' (it always has 2 blanks after the colon) and then I want to move 15 spaces to the right, to capture the date. So, n-number of spaces to find 'Run:  ' and then get the date, like '2018-12-23' and test if this date is a weekday. If this substring is a weekday, I want to write the entire string to a line in a CSV file (the writing to a CSV file works fine). I'm just not sure how to find that one date (there are several dates in the string; I need the one immediately following 'Run:  ').

You've only forgotten to load it into a datetime object:
from datetime import datetime
# ...
pos_date = datetime.strptime(pos, "%Y-%m-%d")
if pos_date.isoweekday() in range(1, 6):
    # ...
Also, as you are using .isoweekday() and Monday is represented as 1, you don't really need to check the lower boundary:
if pos_date.isoweekday() <= 5: # Monday..Friday
    # ...

Maybe convert it to a datetime first:
from datetime import datetime
offset = str(link).find('Run:')
start = offset + 6 # skip 'Run:' and the two blanks after it
pos = str(link)[start:start + 10] # the 10-character date, e.g. '2018-12-23'
if datetime.strptime(pos,'%Y-%m-%d').isoweekday() in range(1, 6):
    outF.write(str(link))
    outF.write('\n')
Then it should work as expected.

Let's suppose your link is
link = "Your Link String is Run:  2018-12-21 21:15:48"
Your following code will work well to find the offset starting from Run
offset = str(link).find('Run:')
amount = offset + 16
Since there are two spaces after Run:, 16 needs to be added to the offset.
Now, to extract exactly the date string 2018-12-21, we need to add 6 to the offset, as 'Run:  ' has 6 characters before the date string starts.
pos = str(link)[offset + 6:amount]
Now parse the date string into a datetime object with
pos_date = datetime.strptime(pos, "%Y-%m-%d")
Remember to import datetime at the top of your program file as
from datetime import datetime
Now check and display whether the date is a weekday
if pos_date.isoweekday() in range(1, 6):
    print("It's a Week Day!")
This will print It's a Week Day!.

link = "something something Run: 2018-12-24 ..."
offset = str(link).find('Run:')
amount = offset + 15 # should be 16
pos = str(link)[offset:amount] # this is a string
The pos of the example above will be Run: 2018-12-24, so it does not capture the date exactly.
A string object does not have an isoweekday method, so pos.isoweekday() will raise an error. But a datetime.datetime object does have that method.
A solution:
import datetime
link = "something something Run: 2018-12-24 ..."
offset = str(link).find('Run:') # will only give the index of 'R', so offset will be 20
amount = offset + 16
pos = str(link)[offset:amount] # pos is 'Run: 2018-12-24'
datestring = pos.split()[1] # split and capture only the date string
#now convert the string into datetime object
datelist = datestring.split('-')
date = datetime.datetime(int(datelist[0]), int(datelist[1]), int(datelist[2]))
if date.isoweekday() in range(1, 6):
    ...
This okay..?

Another alternative to this would be to use dateutil.parser
from dateutil.parser import parse
try:
    if parse(pos).isoweekday() <= 5:
        ...
except ValueError:
    ...
The advantage here is that parse will accept a wide variety of date formats that datetime might error out for.
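Since the string contains several dates and only the one right after 'Run:' matters, a regular expression can do the search and the slicing in one step. This is a minimal sketch using only the standard library; the names RUN_DATE and run_date_is_weekday and the sample strings are assumptions for illustration.

```python
import re
from datetime import datetime

# 'Run:' followed by any amount of whitespace, then a YYYY-MM-DD date.
RUN_DATE = re.compile(r"Run:\s+(\d{4}-\d{2}-\d{2})")


def run_date_is_weekday(text):
    """Return True/False for the date after 'Run:', or None if there is none."""
    match = RUN_DATE.search(text)
    if match is None:
        return None
    run_date = datetime.strptime(match.group(1), "%Y-%m-%d")
    return run_date.isoweekday() <= 5  # Monday is 1, Friday is 5


print(run_date_is_weekday("job A  Run:  2018-12-21 21:15:48"))  # a Friday
print(run_date_is_weekday("job B  Run:  2018-12-23 08:00:00"))  # a Sunday
```

The pattern does not depend on the exact number of blanks after the colon, so the 15-versus-16 offset question goes away.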

Reading data in a loop error - "must be str, not numpy.int32"

Python 3
I need to plot a time series of ozone from August 3rd to August 10th using the data on this website. I need to "stitch" the data together.
http://skywatch.colorado.edu/data/ozone_18_09_03.dat
So right now I have
pre= 'http://skywatch.colorado.edu/data/ozone_18_09_0'
ozone = []
utc = []
dates = np.arange(3,10,1)
for date in dates:
    url = pre + dates[i] + ".dat"
    lines = urllib.request.urlopen(url).readlines()
    for line in lines: #for x number of times (however many lines appear in the dataset)
        entries = line.decode("utf-8").split("\t")
        if entries[0][0] != ';': #if there are entries that do not have a semicolon
            utc.append(float(entries[0][0:2]) + \
                       float(entries[0][3:5])/60. + \
                       float(entries[0][6:8])/3600.)
            #converts the UTC time variable into a float and adds it to the list 'utc'
            ozone.append(float(entries[1]))
When I try to run this I get an error
----> 9 url = pre + dates[i] + ".dat"
TypeError: must be str, not numpy.int32
Not sure how to deal with this.

I think you may need to explicitly convert the numpy.int32 objects to strings, as numpy most likely did not define __add__(self, other) for other: str.
Also, you're iterating through dates with the variable date, so you would use something like this:
url = pre + str(date) + ".dat"
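An alternative sketch is to let string formatting do both the conversion and the zero padding, which also covers two-digit days. The prefix without the trailing 0 and the assumption that every day follows the same file-name pattern are mine, not something the question confirms; nothing is downloaded here.

```python
# Prefix without the hard-coded leading zero of the day (an assumption).
pre = "http://skywatch.colorado.edu/data/ozone_18_09_"

# {day:02d} converts the integer to text and pads it to two digits,
# so day 3 becomes "03" and day 10 stays "10".
urls = [f"{pre}{day:02d}.dat" for day in range(3, 11)]

for url in urls:
    print(url)
```

A plain range is used instead of np.arange because only small integers are needed for building the names.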

Sorting by month-year groups by month instead

I have a curious python problem.
The script takes two CSV files: one with a column of dates and a column of text snippets, and the other with a bunch of names (substrings).
All that the code does is step through both lists building up a name-mentioned-per-month matrix.
FILE with dates and text: (Date, Snippet first column)
ENTRY 1 : Sun 21 nov 2014 etc, The release of the iphone 7 was...
-strings file
iphone 7
apple
apples
innovation etc.
The problem is that when I try to order it so that the columns follow in ascending order, e.g. oct-2014, nov-2014, dec-2014 and so on, it just groups the months together instead, which isn't what I want.
import csv
from datetime import datetime
file_1 = input('Enter first CSV name (one with the date and snippet): ')
file_2 = input('Enter second CSV name (one with the strings): ')
outp = input('Enter the output CSV name: ')
file_1_list = []
head = True
for row in csv.reader(open(file_1, encoding='utf-8', errors='ignore')):
    if head:
        head = False
        continue
    date = datetime.strptime(row[0].strip(), '%a %b %d %H:%M:%S %Z %Y')
    date_str = date.strftime('%b %Y')
    file_1_list.append([date_str, row[1].strip()])
file_2_dict = {}
for line in csv.reader(open(file_2, encoding='utf-8', errors='ignore')):
    s = line[0].strip()
    for d in file_1_list:
        if s.lower() in d[1].lower():
            if s in file_2_dict.keys():
                if d[0] in file_2_dict[s].keys():
                    file_2_dict[s][d[0]] += 1
                else:
                    file_2_dict[s][d[0]] = 1
            else:
                file_2_dict[s] = {
                    d[0]: 1
                }
months = []
for v in file_2_dict.values():
    for k in v.keys():
        if k not in months:
            months.append(k)
months.sort()
rows = [[''] + months]
for k in file_2_dict.keys():
    tmp = [k]
    for m in months:
        try:
            tmp.append(file_2_dict[k][m])
        except:
            tmp.append(0)
    rows.append(tmp)
print("still working on it be patient")
writer = csv.writer(open(outp, "w", encoding='utf-8', newline=''))
for r in rows:
    writer.writerow(r)
print('Done...')
From my understanding, months.sort() isn't doing what I expect it to?
I have looked here, where they apply some other function to sort the data, using attrgetter,
from operator import attrgetter
>>> l = [date(2014, 4, 11), date(2014, 4, 2), date(2014, 4, 3), date(2014, 4, 8)]
and then
sorted(l, key=attrgetter('month'))
But I am not sure whether that would work for me?
From my understanding I parse the dates on lines 12-13; am I missing a step that orders the data first, like
data = sorted(data, key = lambda row: datetime.strptime(row[0], "%b-%y"))
I have only just started learning Python and so many things are new to me that I don't know what is right and what isn't.
What I want(of course with the correctly sorted data):

This took a while because you had so much unrelated stuff about reading csv files and finding and counting tags. But you already have all that, and it should have been completely excluded from the question to avoid confusing people.
It looks like your actual question is "How do I sort dates?"
Of course "Apr-16" comes before "Oct-14", didn't they teach you the alphabet in school? A is the first letter! I'm just being silly to emphasize a point -- it's because they are simple strings, not dates.
You need to convert the string to a date with the datetime class method strptime, as you already noticed. Because the class has the same name as the module, you need to pay attention to how it is imported. You then go back to a string later with the member method strftime on the actual datetime (or date) instance.
Here's an example:
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
unsorted_dates = [datetime.strptime(value, '%b-%y') for value in unsorted_strings]
sorted_dates = sorted(unsorted_dates)
sorted_strings = [value.strftime('%b-%y') for value in sorted_dates]
print(sorted_strings)
['Oct-14', 'Dec-15', 'Apr-16']
or skipping to the end
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
print (sorted(unsorted_strings, key = lambda x: datetime.strptime(x, '%b-%y')))
['Oct-14', 'Dec-15', 'Apr-16']
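Applied to the script in the question, the month labels are built with strftime('%b %Y'), so the same key can be used with that format in place of the plain months.sort(). A minimal sketch, with a made-up months list standing in for the one the script collects:

```python
from datetime import datetime

# Hypothetical labels in the '%b %Y' format the question's script produces.
months = ["Nov 2014", "Apr 2016", "Oct 2014", "Dec 2015"]

# Sort by the parsed date instead of by the text of the label.
months.sort(key=lambda label: datetime.strptime(label, "%b %Y"))

print(months)
```

Note that %b parses month names according to the current locale, so this assumes English month abbreviations.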
