Read xls, convert all dates into a proper format, then write to csv - Python

I'm reading Excel files and writing them out as CSV. A couple of columns contain dates, which Excel stores as floating-point numbers. All of those fields need to be converted to a proper datetime (dd/mm/YY) before I write to CSV.
I found some good articles on how that works in general, but I'm struggling to get it working for all rows of an opened sheet at once. (Newbie in Python.)
The code looks like this for now:
wb = xlrd.open_workbook(args.inname)
xl_sheet = wb.sheet_by_index(0)
print args.inname
print ('Retrieved worksheet: %s' % xl_sheet.name)
print outname
# TODO: convert the date fields to a proper datetime via xldate
output = open(outname, 'wb')
wr = csv.writer(output, quoting=csv.QUOTE_ALL)
for rownum in xrange(wb.sheet_by_index(0).nrows):
    wr.writerow(wb.sheet_by_index(0).row_values(rownum))
output.close()
I'm sure I have to change the "for rownum ..." line, but I'm struggling to do it. I tried several options, which all failed.
Thanks.

You need to go through each row and convert its values before you write it out to the file. You are right to identify that the change belongs near the for rownum line:
# You need to know which columns are dates beforehand --
# you can't get this from the "type" of the cell, as date
# cells look just like any other number
date_cols = [5, 16, 23]
... # Your existing setup code here #
# write the header row (in response to OP comment)
headerrow = wb.sheet_by_index(0).row_values(0)
wr.writerow(headerrow)
# convert and write the data rows (note the range now starts from 1, not 0)
for rownum in xrange(1, wb.sheet_by_index(0).nrows):
    # Get the cell values, then convert the relevant ones before writing
    cell_values = wb.sheet_by_index(0).row_values(rownum)
    for col in date_cols:
        cell_values[col] = excel_time_to_string(cell_values[col])
    wr.writerow(cell_values)
Exactly what you put in your excel_time_to_string() function is up to you - the answer by Mark Ransom takes a reasonable approach - or you could use xlrd's own xldate helpers outlined in this answer.
For instance:
def excel_time_to_string(xltimeinput):
    return str(xlrd.xldate.xldate_as_datetime(xltimeinput, wb.datemode))
* EDIT *
In response to the request for help in the comments after trying it out, here's a more error-proof version of excel_time_to_string():
def excel_time_to_string(xltimeinput):
    try:
        retVal = xlrd.xldate.xldate_as_datetime(xltimeinput, wb.datemode)
    except ValueError:
        print('You passed in an argument that cannot be translated to a datetime.')
        print('Will return the original value and carry on.')
        retVal = xltimeinput
    return retVal

The conversion from Excel to Python is quite simple:
>>> excel_time = 42054.441953
>>> datetime.datetime(1899,12,30) + datetime.timedelta(days=excel_time)
datetime.datetime(2015, 2, 19, 10, 36, 24, 739200)
Or to do the complete conversion to a string:
def excel_time_to_string(excel_time, fmt='%Y-%m-%d %H:%M:%S'):
    dt = datetime.datetime(1899, 12, 30) + datetime.timedelta(days=excel_time)
    return dt.strftime(fmt)
>>> excel_time_to_string(42054.441953)
'2015-02-19 10:36:24'
>>> excel_time_to_string(42054.441953, '%d/%m/%y')
'19/02/15'
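Putting the two answers together, a minimal end-to-end sketch (Python 3, assuming xlrd and that the date columns are known up front; the file names and column indices below are placeholders, not from the question) could look like this:
import csv
import xlrd

date_cols = [5, 16, 23]                          # hypothetical date column indices

wb = xlrd.open_workbook('input.xls')             # placeholder file name
sheet = wb.sheet_by_index(0)

with open('output.csv', 'w', newline='') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow(sheet.row_values(0))             # header row written unchanged
    for rownum in range(1, sheet.nrows):
        values = sheet.row_values(rownum)
        for col in date_cols:
            # xldate_as_datetime turns Excel's serial float into a datetime
            dt = xlrd.xldate.xldate_as_datetime(values[col], wb.datemode)
            values[col] = dt.strftime('%d/%m/%Y')
        wr.writerow(values)
Any cell in a date column that is not actually a serial number (an empty cell, say) would raise an error here, so in practice you may want to wrap the conversion in the try/except shown in the error-proof version above.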

Related

Difficulty creating annotation script in python

I'm trying to make an annotation tool in Python to edit a large CSV file and generate a JSON output, but as a newer programmer I have been facing a lot of difficulties.
I have two CSV files that I generated. From one file I got a list of the rows and columns that I want to match:
filtered list of columns and rows
The other file has the list of outputs that I want to match against in order to gather the specified data entries: raw outputs
For example, I would want to print the data in row 6 from column Q5.3 alongside the original question and then specify if it is good or bad. If it is bad I want to be able to add a comment.
I would like to generate a JSON file that compiles all of this at the end. I tried to write the code, but it turned out a mess; I was hoping to understand how to implement it properly and just became really confused.
Any help would be really appreciated!
The output should go through all the specified data and print:
Question Number,
Question,
Response,
Annotate as Good or Bad,
If Bad then able to add comment,
Continue to next data piece,
When done generate a json for data
Thank you :)
My attempt:
import csv
from csv import reader
import json

csv_results_path = 'results.csv'
categories = {'1': 'Unacceptable', '2': 'Edit'}

# Checking the outputs (comment out)
'''
print(rows)
print(columns)
'''
'''
# Acquire data from a specified row, column in the responses
# Example: row 6, column '12.2'
'''

def get_annotation_input():
    while True:
        try:
            annotation = int(input("Annotation: "))
            if annotation not in range(1, 3):
                raise ValueError
            return annotation, edited_comment
        except ValueError:
            print("Enter 1 or 2")

def annotate():
    annotator = input("What is your name? ")
    print(''.join(["-"] * 50))
    print("Annotate the following answer as 1-Unacceptable, 2-Edit")
    with open("annotations.json", mode='a') as annotation_file:
        annotation_data = {'annotator': annotator, 'row_number':, 'question_number': , 'annotation': categories[str(annotation)], 'edited_response': }
        json.dump(annotation_data, annotation_file)
        annotation_file.write('\n')

if __name__ == "__main__":
    annotate()
    with open('annotations_full.csv', 'rU') as infile:
        response_reader = csv.DictReader(infile)
        responses = {}
        for row in response_reader:
            for header, response in row.items():
                try:
                    responses[header].append(response)
                except KeyError:
                    responses[header] = [response]
        rows = responses['row_number']
        columns = responses['question_number']
        print(rows)
        print(columns)
I was successfully able to get the list of rows and columns that I wanted; however, I am having difficulty accessing the data in the other csv file using the row and corresponding column in order to display and annotate it. Also, when I attempted to write code to allow a field for an edited response if '2' is specified, I ran into many output errors.
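As a rough, untested sketch of the lookup step that is causing trouble (assuming results.csv has a header row with one column per question, such as Q5.3, and that the row numbers refer to 1-based data rows; both the file name and the indexing convention here are assumptions), csv.DictReader can be indexed by row and then by column name:
import csv

def get_response(csv_path, row_number, question_number):
    with open(csv_path, newline='') as f:
        data_rows = list(csv.DictReader(f))
    # DictReader skips the header, so data row 1 sits at index 0
    return data_rows[int(row_number) - 1][question_number]

# e.g. the value in row 6 of column Q5.3
print(get_response('results.csv', 6, 'Q5.3'))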

Transfer data from excel worksheet (openpyxl) to database table (dbf)

I have a simple problem: read an Excel worksheet, treat every row (containing about 83 columns) as a unique database record, build it up as a local datum record, and ultimately append it to and write out a DBF file.
I can extract all the values from Excel and add them to a list, but the list is not in the right form and I don't know how to prepare/convert it into a database record. I am using openpyxl, dbf and Python 3.7.
At the moment I am only testing and trying to prepare the data for row 3 (hence min_row = max_row = 3).
I understand that the data should be in the format
(('','','', ... 83 entries), \
('','','', ... 83 entries) \
)
But I do not know how to convert the list data into a record,
or, alternatively, how to read the Excel data directly into a DBF-appendable format.
tbl_tst.open(mode=dbf.READ_WRITE)  # all fields are character strings
for everyrow in ws_IntMstDBF.iter_rows(min_row=3, max_row=3, max_col=ws_IntMstDBF.max_column - 1):
    datum = []  # set([83]), will defining datum as () help solve the problem?
    for idx, cells in enumerate(everyrow):
        if cells.value is None:  # for None entries, enter an empty string
            datum.append("")
            continue
        datum.append(cells.value)  # else enter the cell value
    tbl_tst.append(datum)  # append that record to the table !!! "list is not record" error here
tbl_tst.close()
The error complains that I'm appending a list to the table when it should be a record etc. Please advise how I can convert Excel rows into appendable DBF table data.
raise TypeError("data to append must be a tuple, dict, record, or template; not a %r" % type(data))
TypeError: data to append must be a tuple, dict, record, or template; not a <class 'list'>
Change
tbl_tst.append(datum)
to
tbl_tst.append(tuple(datum))
and that will get rid of that error. As long as all your cell data has the appropriate type then the append should work.
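If relying on column order feels fragile, the same error message says a dict is also accepted, so (as an untested sketch, assuming the dbf table object exposes its field names via field_names as described in the dbf package's documentation) you could build the record keyed by field name instead:
# sketch: build a record keyed by field name rather than by position
record = {name: value for name, value in zip(tbl_tst.field_names, datum)}
tbl_tst.append(record)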
Thank you for the responses, I went on a bit of a tangent since last night while trying different solutions.
One solution that worked for me is as follows:
I made sure that the worksheet data I am using is all string/text, converting any null entries to string type with a space as the value. The following code does this task:
# housekeeping
for eachrow in ws_IntMstDBF.iter_rows(min_row=2, max_row=ws_IntMstDBF.max_row, max_col=ws_IntMstDBF.max_column):
    for idx, cells in enumerate(eachrow):
        if cells.value is None:  # change every null cell type to string and put 0x20 (space)
            cells.data_type = 's'
            cells.value = " "
After writing the worksheet, I reopened it with a pandas dataframe and verified that the contents were all of string type and that no "nan" values remained in the dataframe.
Then I used the df2dbf function from Dani Arribas-Bel, modified it to suit the data I am working with, and converted to dbf.
The code which imports dataframe and converts to dbf format is as follows:
abspath = Path(__file__).resolve()  # resolve the relative path to an absolute one
rootpath = abspath.parents[3]       # root (my source file is 3 sub-directories deep)
xlspath = rootpath / 'sub-dir1' / 'sub-dir2' / 'sub-dir3' / 'test.xlsx'
# the code above only resolves the file location, ignore
pd_Mst_df = pd.read_excel(xlspath)
#print(pd_Mst_df)  # for debug
print("... Writing Master DBF file ")
df2dbf(pd_Mst_df, dbfpath)  # dbfpath is defined similarly to the path above
The df2dbf function uses pysal to write the dataframe in dbf format.
I made some modifications to the code to detect the row length and character types, as follows:
import pandas as pd
import pysal as ps
import numpy as np

# code from the df2dbf function
else:
    type2spec = {int: ('N', 20, 0),
                 np.int64: ('N', 20, 0),
                 float: ('N', 36, 15),
                 np.float64: ('N', 36, 15),
                 str: ('C', 200, 0)
                 }
    #types = [type(df[i].iloc[0]) for i in df.columns]
    types = [type('C') for i in range(0, len(df.columns))]  # treat every column as a character ('C') field
specs = [type2spec[t] for t in types]
db = ps.open(dbf_path, 'w')
# code continues from the df2dbf function
The pandas dataframe didn't require further modification, since all source data was formatted correctly before being committed to the Excel file.
I will provide the link to pysal and df2dbf as soon as I find it on stackoverflow.
Check out the Python Pandas library...
To read the data from Excel into a Pandas dataframe, you could use pandas.read_excel.
Once the data is read into a Pandas dataframe, you can manipulate it and afterwards write it to a database using pandas.DataFrame.to_sql.
See also this explanation for dealing with database IO.
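As a minimal sketch of that pandas route (the file name, SQLite database and table name below are placeholders, not part of the question):
import sqlite3
import pandas as pd

df = pd.read_excel('test.xlsx')            # one worksheet row per record
with sqlite3.connect('records.db') as conn:
    # writes the whole dataframe as a table; replace/append as needed
    df.to_sql('master', conn, if_exists='replace', index=False)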

How to split a log file into several csv files with python

I'm pretty new to Python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several *.csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. If the keyword appears, it should start copying the two desired columns into the new file until the keyword appears again. When finished, there need to be as many csv files as there are keyword occurrences.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When it finishes, all I get are empty csv files. If I shift the counter check to == 1, it creates just one big file with the desired columns.
Thanks again!
import csv

QUERY = 'MYLOG'

with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter=' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
                    writer.writerow(content_A)
                    writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handle, not a single line.
the reader delimiter should not be a single space character, as the actual delimiter in your logs looks like a variable number of space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'

def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (e.g. a list of lists).
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)

current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if NEW_LOG_DELIMITER not in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop;
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set Flag where third column == 'MYLOG'
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max() + 1):
    df.loc[df.flag == i, 'event'].to_csv('run{0}.csv'.format(i))
EDIT:
Looks like your format is different than I originally assumed, so I changed the code to use pd.read_fwf. My localizer.log file was a copy and paste of your original data; hope this works for you. I assumed from the original post that the file has no headers. If it does have headers, then remove header=None and the df.columns = ['onset', 'type', 'event'] line.

Python returns 5 digit timestamp, what is this?

I'm a PHP programmer doing a bit of Python (3.4) just because it's way easier to do it in Python.
My script converts a .xlsx file, into many .csv files (one .csv per sheet).
Here is the code:
wb = xlrd.open_workbook(filepath)
for i in range(0, wb.nsheets):
    sh = wb.sheet_by_index(i)
    sheet_name = sh.name
    sheet_name = sheet_name.replace(" ", "_")
    fp = open(sheet_name + '.csv', 'at', encoding='utf8')
    wr = csv.writer(fp, quoting=csv.QUOTE_ALL)
    for row_num in range(sh.nrows):
        wr.writerow(sh.row_values(row_num))
    fp.close()
Full code here: https://github.com/xtrimsky/xlsx_to_csv
This works well, except that I have a field that is a date; in Excel it shows, for example, 01/01/2009.
But the final csv contains a number, 39814. What is this, and what can I do with it? 01/02/2009 is 39815.
Is it a number I can use to find the Unix timestamp? Or is it an issue and should I change my script? I would feel safer if it just returned the string "01/01/2009".
Can someone please help me understand what I am dealing with?
If 39814 maps to 2009-1-1 and 39815 maps to 2009-1-2, then it looks like the ordinal is counting the number of days since 1899-12-30:
In [57]: DT.date(1899,12,30) + DT.timedelta(days=39814)
Out[57]: datetime.date(2009, 1, 1)
See Why 1899-12-30!?
To convert the Excel number to a Unix timestamp, you could use the timetuple method to convert the datetime.date object to a timetuple, and then time.mktime to convert it to a timestamp (seconds since the Epoch):
In [80]: import time
In [81]: time.mktime((DT.datetime(1899,12,30) + DT.timedelta(days=39814)).timetuple())
Out[81]: 1230786000.0
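To fold this into the conversion loop of the original script, a sketch could use xlrd's own helper instead of hand-rolling the 1899-12-30 arithmetic. This assumes the date always sits in one known column (DATE_COL = 0 is just an example), and reuses wb, sh and wr from the script above:
import xlrd

DATE_COL = 0                                   # hypothetical index of the date column

for row_num in range(sh.nrows):
    values = sh.row_values(row_num)
    try:
        # converts Excel's day-count serial (e.g. 39814) into a datetime
        dt = xlrd.xldate.xldate_as_datetime(values[DATE_COL], wb.datemode)
        values[DATE_COL] = dt.strftime('%d/%m/%Y')
    except (TypeError, ValueError):
        pass                                   # header row or non-date cell: leave as is
    wr.writerow(values)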

Reading numeric Excel data as text using xlrd in Python

I am trying to read an Excel file using xlrd, and I am wondering if there is a way to ignore the cell formatting used in the Excel file and just import all data as text?
Here is the code I am using so far:
import xlrd

xls_file = 'xltest.xls'
xls_workbook = xlrd.open_workbook(xls_file)
xls_sheet = xls_workbook.sheet_by_index(0)

raw_data = [[''] * xls_sheet.ncols for _ in range(xls_sheet.nrows)]
raw_str = ''
field_delim = ','
text_delim = '"'

for rnum in range(xls_sheet.nrows):
    for cnum in range(xls_sheet.ncols):
        raw_data[rnum][cnum] = str(xls_sheet.cell(rnum, cnum).value)

for rnum in range(len(raw_data)):
    for cnum in range(len(raw_data[rnum])):
        if cnum == len(raw_data[rnum]) - 1:
            field_delim = '\n'
        else:
            field_delim = ','
        raw_str += text_delim + raw_data[rnum][cnum] + text_delim + field_delim

final_csv = open('FINAL.csv', 'w')
final_csv.write(raw_str)
final_csv.close()
This code is functional, but certain fields, such as zip codes, are imported as numbers, so they end up with a trailing '.0'. For example, if there is a zip code of '79854' in the Excel file, it will be imported as '79854.0'.
I have tried finding a solution in this xlrd spec, but was unsuccessful.
That's because integer values in Excel are imported as floats in Python, so sheet.cell(r, c).value returns a float. Try converting the values to integers, but first make sure those values were integers in Excel to begin with:
cell = sheet.cell(r, c)
cell_value = cell.value
if cell.ctype in (2, 3) and int(cell_value) == cell_value:
    cell_value = int(cell_value)
It is all in the xlrd spec.
I know this isn't part of the question, but I would get rid of raw_str and write directly to your csv. For a large file (10,000 rows) this will save loads of time.
You can also get rid of raw_data and just use one for loop.
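Combining both suggestions into one hedged sketch (stream rows straight to csv.writer and convert whole-number cells back to int before stringifying; xltest.xls and FINAL.csv are the names from the question):
import csv
import xlrd

wb = xlrd.open_workbook('xltest.xls')
sheet = wb.sheet_by_index(0)

def as_text(cell):
    value = cell.value
    # ctype 2 is XL_CELL_NUMBER, 3 is XL_CELL_DATE; drop the spurious ".0"
    if cell.ctype in (2, 3) and int(value) == value:
        value = int(value)
    return str(value)

with open('FINAL.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for rnum in range(sheet.nrows):
        writer.writerow([as_text(sheet.cell(rnum, c)) for c in range(sheet.ncols)])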
