Reading numeric Excel data as text using xlrd in Python - python

I am trying to read in an Excel file using xlrd, and I am wondering if there is a way to ignore the cell formatting used in the Excel file and just import all data as text.
Here is the code I am using so far:
import xlrd
xls_file = 'xltest.xls'
xls_workbook = xlrd.open_workbook(xls_file)
xls_sheet = xls_workbook.sheet_by_index(0)
raw_data = [['']*xls_sheet.ncols for _ in range(xls_sheet.nrows)]
raw_str = ''
feild_delim = ','
text_delim = '"'
for rnum in range(xls_sheet.nrows):
    for cnum in range(xls_sheet.ncols):
        raw_data[rnum][cnum] = str(xls_sheet.cell(rnum,cnum).value)
for rnum in range(len(raw_data)):
    for cnum in range(len(raw_data[rnum])):
        if (cnum == len(raw_data[rnum]) - 1):
            feild_delim = '\n'
        else:
            feild_delim = ','
        raw_str += text_delim + raw_data[rnum][cnum] + text_delim + feild_delim
final_csv = open('FINAL.csv', 'w')
final_csv.write(raw_str)
final_csv.close()
This code is functional, but certain fields, such as zip codes, are imported as numbers, so they end up with a trailing '.0'. For example, if there is a zip code of '79854' in the Excel file, it will be imported as '79854.0'.
I have tried finding a solution in this xlrd spec, but was unsuccessful.

That's because integer values in Excel are imported as floats in Python. Thus, sheet.cell(r,c).value returns a float. Try converting the values to integers but first make sure those values were integers in Excel to begin with:
cell = sheet.cell(r,c)
cell_value = cell.value
# ctype 2 is a number and ctype 3 is a date; xlrd returns both as floats
if cell.ctype in (2,3) and int(cell_value) == cell_value:
    cell_value = int(cell_value)
It is all in the xlrd spec.

I know this isn't part of the question, but I would get rid of raw_str and write directly to your csv. For a large file (10,000 rows) this will save loads of time.
You can also get rid of raw_data and just use one for loop.
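The two suggestions above (convert whole-number floats to integers, and write straight to the CSV in a single loop) can be combined roughly as follows. This is only a sketch: cell_to_text, write_rows and sheet_rows are hypothetical helper names, the constant mirrors xlrd.XL_CELL_NUMBER, and the stand-in row replaces a real sheet so the snippet runs without an .xls file.

```python
import csv
import io

XL_CELL_NUMBER = 2  # same value as xlrd.XL_CELL_NUMBER


def cell_to_text(value, ctype):
    """Render a cell as text, dropping the '.0' from whole-number floats."""
    if ctype == XL_CELL_NUMBER and value == int(value):
        return str(int(value))
    return str(value)


def write_rows(rows, out):
    """Write rows of (value, ctype) pairs to a text file object, one CSV line per row."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([cell_to_text(value, ctype) for value, ctype in row])


def sheet_rows(sheet):
    """Yield (value, ctype) pairs for each row of an xlrd sheet."""
    for rnum in range(sheet.nrows):
        yield [(cell.value, cell.ctype) for cell in sheet.row(rnum)]


# Stand-in data in place of a real sheet: a text cell and two numeric cells
buffer = io.StringIO()
write_rows([[("Marfa", 1), (79854.0, XL_CELL_NUMBER), (1.5, XL_CELL_NUMBER)]], buffer)
print(buffer.getvalue())
```

With a real workbook, the call would look like write_rows(sheet_rows(xls_sheet), final_csv), where final_csv is opened with open('FINAL.csv', 'w', newline=''). Note that a zip code stored as a number in Excel has already lost any leading zeros by the time xlrd reads it.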

Related

Python Xlsx writing format advice

I've created a list and a for loop that iterates over each item and writes it to a cell in Excel. I'm using openpyxl. When I first started, a simple statement like:
sheet["A1"] = "hello"
resulted in cell A1 showing the value hello, without quotation marks.
I have this code:
workbook = Workbook()
sheet = workbook.active
text = ["Whistle", "Groot", "Numbers", "Mulan", "Buddy Holly"]
other = [5, 8, 100, 120]
for i in range(1,len(text)+1):
    cell_letter = "A"
    cell_number = str(i)
    sheet[str((cell_letter + cell_number))] = str(text[i-1:i])
and it writes to the corresponding cell locations as it iterates over the variable "text". But when I open the file, the cells contain ['Whistle'] and ['Groot'].
What am I missing? Should I be passing each iteration to another variable to convert it from a list to a tuple for it to be written in then?
Sorry if my code seems a bit messy, I've literally just learned this over the past few hours and it's (kind of) doing what I need it to do, with the exception of the writing format.
Openpyxl lets you write a list of lists, where each inner list represents one row in the xlsx file.
So, you can store what you want as:
data_to_write = [["Whistle", "Groot", "Numbers", "Mulan", "Buddy Holly"]]
or, if you want some data in the next line:
data_to_write = [["Whistle", "Groot", "Numbers"], ["Mulan", "Buddy Holly"]]
then, add it to your WorkSheet:
for line in data_to_write:
    sheet.append(line)
and, finally, save it:
workbook.save("filename.xlsx")
The full code could be something like:
from openpyxl import Workbook
workbook = Workbook()
sheet = workbook.active
data_to_write = [["Whistle", "Groot", "Numbers", "Mulan", "Buddy Holly"]]
for line in data_to_write:
    sheet.append(line)
workbook.save('example.xlsx')
Give it a try and let me know how it goes.
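As for why the original loop wrote ['Whistle'] rather than Whistle: text[i-1:i] is a slice, and a slice of a list is itself a list, so str() produces the list's printed form. Indexing with text[i-1] returns the string itself. The snippet below is a small illustration in plain Python (no workbook involved); the variable names are made up for the example.

```python
text = ["Whistle", "Groot", "Numbers", "Mulan", "Buddy Holly"]

# A slice is a one-item list, so str() includes the brackets and quotes
sliced = str(text[0:1])

# Indexing returns the element itself
indexed = text[0]

# Pair each value with its column-A cell reference, starting at row 1
cells = [("A" + str(row), value) for row, value in enumerate(text, start=1)]

print(sliced)
print(indexed)
print(cells[:2])
```

In the original loop, the assignment would then become sheet[cell_letter + cell_number] = text[i-1].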

export data to xls file format

I have a text file with 6000 records in this format:
{"id":"1001","user":"AB1001","first_name":"David ","name":"Shai","amount":"100","email":"me#no.mail","phone":"9999444"}
{"id":"1002","user":"AB1002","first_name":"jone ","name":"Miraai","amount":"500","email":"some1#no.mail","phone":"98894004"}
I want to export all the data to an Excel file, as shown in the example below.
I would recommend reading in the text file, then converting to a dictionary with json, and using pandas to save a .csv file that can be opened with excel.
In the example below, I copied your text into a text file, called "myfile.txt", and I saved the data as "myfile2.csv".
import pandas as pd
import json
# read lines of text file
with open('myfile.txt') as f:
    lines=f.readlines()
# remove empty lines
lines2 = [line for line in lines if not(line == "\n")]
# convert to dictionaries
dicts = [json.loads(line) for line in lines2]
# save to .csv
pd.DataFrame(dicts ).to_csv("myfile2.csv", index = False)
You can use VBA and a json-parser
Taken together, your two lines are not valid JSON. However, it is easy to convert them to valid JSON, as shown in the code below. Then it is a relatively simple matter to parse the result and write it to a worksheet.
The code assumes there are no blank lines in your text file, but it is easy to fix if that is not the case.
I used your data on two separate lines in a Windows text file. If the file does not come from Windows, you may have to change how the newline token is replaced with a comma, depending on what the generating system uses for newlines.
I used the JSON Converter by Tim Hall
'Set reference to Microsoft Scripting Runtime or
' use late binding
Option Explicit
Sub parseData()
    Dim JSON As Object
    Dim strJSON As String
    Dim FSO As FileSystemObject, TS As TextStream
    Dim I As Long, J As Long
    Dim vRes As Variant, v As Variant, O As Object
    Dim wsRes As Worksheet, rRes As Range
    Set FSO = New FileSystemObject
    Set TS = FSO.OpenTextFile("D:\Users\Ron\Desktop\New Text Document.txt", ForReading, False, TristateUseDefault)
    'Convert to valid JSON
    strJSON = "[" & TS.ReadAll & "]"
    strJSON = Replace(strJSON, vbLf, ",")
    Set JSON = parsejson(strJSON)
    ReDim vRes(0 To JSON.Count, 1 To JSON(1).Count)
    'Header row
    J = 0
    For Each v In JSON(1).Keys
        J = J + 1
        vRes(0, J) = v
    Next v
    'populate the data
    I = 0
    For Each O In JSON
        I = I + 1
        J = 0
        For Each v In O.Keys
            J = J + 1
            vRes(I, J) = O(v)
        Next v
    Next O
    'write to a worksheet
    Set wsRes = Worksheets("sheet6")
    Set rRes = wsRes.Cells(1, 1)
    Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
    Application.ScreenUpdating = False
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .Style = "Output"
        .EntireColumn.AutoFit
    End With
End Sub
Results from your posted data
Try using the pandas module in conjunction with the eval() function:
import pandas as pd
with open('textfile.txt', 'r') as f:
    data = f.readlines()
df = pd.DataFrame(data=[eval(i) for i in data])
df.to_excel('filename.xlsx', index=False)
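One caveat with eval(): it executes whatever Python expression each line contains, so it should only be used on files you fully trust. Since each line here is a JSON object, json.loads or ast.literal_eval can parse it without running any code. The sketch below uses shortened stand-in lines (fewer fields than the real records) instead of reading textfile.txt.

```python
import ast
import json

# Stand-in for f.readlines(); includes a blank line to show it is skipped
lines = [
    '{"id":"1001","user":"AB1001","amount":"100"}\n',
    '\n',
    '{"id":"1002","user":"AB1002","amount":"500"}\n',
]

# json.loads only parses data; it never executes code
records = [json.loads(line) for line in lines if line.strip()]

# ast.literal_eval accepts Python literals (dicts, lists, strings, numbers) and nothing else
same_records = [ast.literal_eval(line.strip()) for line in lines if line.strip()]

print(records)
```

Either list can be passed to pd.DataFrame(...) in place of the eval() result before calling to_excel.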

Reading rows in CSV file and appending a list creates a list of lists for each value

I am copying list output data from a DataCamp course so I can recreate the exercise in Visual Studio Code or Jupyter Notebook. From DataCamp Python Interactive window, I type the name of the list, highlight the output and paste it into a new file in VSCode. I use find and replace to delete all the commas and spaces and now have 142 numeric values, and I Save As life_exp.csv. Looks like this:
43.828
76.423
72.301
42.731
75.32
81.235
79.829
75.635
64.062
79.441
When I read the file into VSCode using either Pandas read_csv or csv.reader, and then use values.tolist() with Pandas or a for loop to append to an existing, empty list, both approaches give me a list of lists, which does not display correctly when I try to create matplotlib histograms.
I used NotePad to save the data as well as a .csv and both ways of saving the data produce the same issue.
import matplotlib.pyplot as plt
import csv
life_exp = []
with open (r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter = '\n')
    for row in exp_read:
        life_exp.append(row)
And
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
life_exp = life_exp_df.values.tolist()
When you print life_exp after importing using csv, you get:
[['43.828'],
['76.423'],
['72.301'],
['42.731'],
['75.32'],
['81.235'],
['79.829'],
['75.635'],
['64.062'],
['79.441'],
['56.728'],
….
And when you print life_exp after importing using pandas read_csv, you get the same thing, but at least now it's not a string:
[[43.828],
[76.423],
[72.301],
[42.731],
[75.32],
[81.235],
[79.829],
[75.635],
[64.062],
[79.441],
[56.728],
…
and when you call plt.hist(life_exp) on either version of the list, you get each value as bin of 1.
I just want to read each value in the csv file and put each value into a simple Python list.
I have spent days scouring Stack Overflow thinking someone must have done this, but I can't seem to find an answer. I am very new to Python, so your help is greatly appreciated.
Try:
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
# Select the values of your first column as a list
life_exp = life_exp_df.iloc[:, 0].tolist()
instead of:
life_exp = life_exp_df.values.tolist()
csv.reader parses each line into a list using the delimiter you provide. Here you pass \n as the delimiter, but each line still holds a single item, and that item is returned as a one-element list.
When you append each row, you are appending that list to another list. The simplest workaround is to index into row to extract the value:
with open (r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter = '\n')
    for row in exp_read:
        life_exp.append(row[0])
However, if your data is not guaranteed to be formatted the way you have provided, you will need to handle that a bit differently:
with open (r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter = '\n')
    for row in exp_read:
        for number in row:
            life_exp.append(number)
A bit cleaner with a list comprehension:
with open (r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter = '\n')
    life_exp = [number for row in exp_read for number in row]
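Because the file holds exactly one number per line, the csv module is not strictly needed: iterating over the file gives one line at a time, and float() converts each line to a number (the csv.reader versions above leave the values as strings). The following is a minimal sketch; read_floats is a hypothetical helper name, and an in-memory io.StringIO stands in for the real file.

```python
import io


def read_floats(lines):
    """Return one float per non-blank line of an iterable of text lines."""
    return [float(line) for line in lines if line.strip()]


# Stand-in for the opened file, including a blank line that gets skipped
sample = io.StringIO("43.828\n76.423\n\n72.301\n")
life_exp = read_floats(sample)
print(life_exp)
```

With the real file, this would be used as: with open(r'C:\data\life_exp.csv') as f: life_exp = read_floats(f). The result is a flat list of floats that plt.hist can bin directly.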

Problems with reading values as formula instead of data

When I want to read my Excel:
from openpyxl import load_workbook
import numpy as np
"read Excel"
wb = load_workbook('Libro1.xlsx')
hoja_1 = wb.get_sheet_by_name('1')
x = np.zeros(hoja_1.max_row)
y = np.zeros(hoja_1.max_row)
for i in range(0, hoja_1.max_row):
    x[i] = hoja_1.cell(row = i + 1, column = 1).value
    y[i] = hoja_1.cell(row = i + 1, column = 2).value
print(x)
print(y)
I get an error in:
x[i] = hoja_1.cell(row = i + 1, column = 1).value
ValueError: could not convert string to float: '=A1+1'
x and y are NumPy float arrays, so they cannot hold the formula string '=A1+1'. Convert them into lists before assigning values.
Try this:
x = list(np.zeros(hoja_1.max_row))
y = list(np.zeros(hoja_1.max_row))
One possibility, though a rather rare one, is that a cell in 'Libro1.xlsx' is marked with the text format, which prevents it from recalculating.
When you open the Excel file, do you see the value of =A1+1 recalculated, or shown as text?
In any case, apply the simplest numeric formatting to all cells in the Excel file (select the cells and choose a format), and also remove any conditional formatting.
Similar problem described here: https://bitbucket.org/openpyxl/openpyxl/issues/699/valueerror-could-not-convert-string-to. Make sure your openpyxl version is up to date.
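Another option worth trying: by default openpyxl returns the formula text for formula cells, and load_workbook accepts data_only=True to return the value Excel stored when the file was last saved instead. If the file has never been calculated and saved by Excel, those cells come back as None. The sketch below assumes that behaviour; to_float_or_nan and read_two_columns are hypothetical helper names, and openpyxl is imported inside the function so the helper can be tried without it.

```python
import math


def to_float_or_nan(value):
    """Return numeric cell values as float; None, formulas and other text become NaN."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return math.nan


def read_two_columns(path, sheet_name):
    """Read columns A and B of a sheet as two lists of floats, using cached formula results."""
    from openpyxl import load_workbook  # imported here so the helper above has no dependency

    wb = load_workbook(path, data_only=True)
    ws = wb[sheet_name]
    xs, ys = [], []
    for a, b in ws.iter_rows(min_col=1, max_col=2, values_only=True):
        xs.append(to_float_or_nan(a))
        ys.append(to_float_or_nan(b))
    return xs, ys


print(to_float_or_nan(3), to_float_or_nan("=A1+1"), to_float_or_nan(None))
```

For the file in the question, the call would be x, y = read_two_columns('Libro1.xlsx', '1'), and np.array(x) gives the NumPy array if one is needed.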

read xls, convert all dates into proper format, -> write to csv

I'm reading Excel files and writing them out as CSV. A couple of columns contain dates, which are stored as floating-point numbers in Excel. All those fields need to be converted to a proper datetime (dd/mm/YY) before I write to CSV.
I found some good articles on how that works in general, but I am struggling to get it working for all rows in an opened sheet at once. (Newbie in Python)
Code looks like below for now:
wb = xlrd.open_workbook(args.inname)
xl_sheet = wb.sheet_by_index(0)
print args.inname
print ('Retrieved worksheet: %s' % xl_sheet.name)
print outname
# TODO: Convert xldate.datetime from the date fields to proper datetime
output = open(outname, 'wb')
wr = csv.writer(output, quoting=csv.QUOTE_ALL)
for rownum in xrange(wb.sheet_by_index(0).nrows):
    wr.writerow(wb.sheet_by_index(0).row_values(rownum))
output.close()
I'm sure i have to change the "for rownum ...." line but I'm struggling doing it. I tried several options, which all failed.
thanks
You need to go through the row before you write it out to file, converting values. You are right to identify that it is near the for rownum line:
# You need to know which columns are dates before hand
# you can't get this from the "type" of the cell as they
# are just like any other number
date_cols = [5,16,23]
... # Your existing setup code here #
# write the header row (in response to OP comment)
headerrow = wb.sheet_by_index(0).row_values(0)
wr.writerow(headerrow)
# convert and write the data rows (note range now starts from 1, not 0)
for rownum in xrange(1,wb.sheet_by_index(0).nrows):
    # Get the cell values and then convert the relevant ones before writing
    cell_values = wb.sheet_by_index(0).row_values(rownum)
    for col in date_cols:
        cell_values[col] = excel_time_to_string(cell_values[col])
    wr.writerow(cell_values)
Exactly what you put in your excel_time_to_string() function is up to you. The answer by @MarkRansom has a reasonable approach, or you could use xlrd's own functions as outlined in this answer.
For instance:
def excel_time_to_string(xltimeinput):
    return str(xlrd.xldate.xldate_as_datetime(xltimeinput, wb.datemode))
* EDIT *
In response to a request for help in the comments after trying this, here's a more error-proof version of excel_time_to_string():
def excel_time_to_string(xltimeinput):
    try:
        retVal = xlrd.xldate.xldate_as_datetime(xltimeinput, wb.datemode)
    except ValueError:
        print('You passed in an argument that cannot be translated to a datetime.')
        print('Will return original value and carry on')
        retVal = xltimeinput
    return retVal
The conversion from Excel to Python is quite simple:
>>> import datetime
>>> excel_time = 42054.441953
>>> datetime.datetime(1899,12,30) + datetime.timedelta(days=excel_time)
datetime.datetime(2015, 2, 19, 10, 36, 24, 739200)
Or to do the complete conversion to a string:
def excel_time_to_string(excel_time, fmt='%Y-%m-%d %H:%M:%S'):
    dt = datetime.datetime(1899,12,30) + datetime.timedelta(days=excel_time)
    return dt.strftime(fmt)
>>> excel_time_to_string(42054.441953)
'2015-02-19 10:36:24'
>>> excel_time_to_string(42054.441953, '%d/%m/%y')
'19/02/15'
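As an alternative to hard-coding date_cols, xlrd reports a cell type for every cell, and cells that carry a date format in the workbook have the type xlrd.XL_CELL_DATE (value 3), so the date columns can often be detected per cell. The sketch below applies the epoch arithmetic shown above to a row of values and their types. format_dates is a hypothetical helper name, the constant mirrors xlrd's, and the fixed 1899-12-30 epoch assumes the 1900 date system (wb.datemode == 0). It is written for Python 3.

```python
import datetime

XL_CELL_DATE = 3  # same value as xlrd.XL_CELL_DATE


def format_dates(values, ctypes, fmt='%d/%m/%y'):
    """Return a copy of values with date-typed cells formatted as strings.

    Assumes the 1900 date system (datemode 0), hence the 1899-12-30 epoch.
    """
    epoch = datetime.datetime(1899, 12, 30)
    converted = []
    for value, ctype in zip(values, ctypes):
        if ctype == XL_CELL_DATE:
            value = (epoch + datetime.timedelta(days=value)).strftime(fmt)
        converted.append(value)
    return converted


# Stand-in row: a text cell, a date cell and a plain number
row = format_dates(['invoice', 42054.441953, 12.5], [1, XL_CELL_DATE, 2])
print(row)
```

With a real sheet, the inputs would come from xl_sheet.row_values(rownum) and xl_sheet.row_types(rownum), and the returned list can be passed to wr.writerow().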
