getting started with processing data with python - python

i have an excel spreadsheet of about 3 million cells. i asked the following question and i liked the answer about saving the spreadsheet as CSV and then processing it with python:
solution to perform lots of calculations on 3 million data points and make charts
is there a library that i can use that will read the csv into a matrix or should i write one myself?
does python speak with VBA at all?
after i am done processing the data, is it simple to put it back in the form of a CSV so that i can open it in excel for viewing?

is there a library that i can use that will read the csv into a matrix or should i write one myself?
The csv module handles just about everything you could want.
does python speak with VBA at all?
Iron Python might.
after i am done processing the data, is it simple to put it back in the form of a CSV so that i can open it in excel for viewing?
The csv module handles just about everything you could want.
Suggestion: Read this: http://docs.python.org/library/csv.html

I like NumPy's loadtxt for this sort of thing. Very configurable for reading CSVs. And savetxt for putting it back after manipulation. Or you could check out the built in csv module if you'd rather not install anything new.

If we speak pythonish, why not to use http://www.python-excel.org/ ?
Example of read file:
import xlrd
rb = xlrd.open_workbook('file.xls',formatting_info=True)
sheet = rb.sheet_by_index(0)
for rownum in range(sheet.nrows):
row = sheet.row_values(rownum)
for c_el in row:
print c_el
Writing the new file:
import xlwt
from datetime import datetime
font0 = xlwt.Font()
font0.name = 'Times New Roman'
font0.colour_index = 2
font0.bold = True
style0 = xlwt.XFStyle()
style0.font = font0
style1 = xlwt.XFStyle()
style1.num_format_str = 'D-MMM-YY'
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')
ws.write(0, 0, 'Test', style0)
ws.write(1, 0, datetime.now(), style1)
ws.write(2, 0, 1)
ws.write(2, 1, 1)
ws.write(2, 2, xlwt.Formula("A3+B3"))
wb.save('example.xls')
There are other examples on the page.

If you don't want to deal with changing back and forth from CSV you can use win32com, which can be downloaded here. http://python.net/crew/mhammond/win32/Downloads.html

Related

What is an efficient way to make excel report from python?

For the report I use this code:
from openpyxl import load_workbook
ReportName = "tets.xlsx"
new_row_data = [
["value", 'value2', "value3"]]
wb = load_workbook(ReportName)
# Select Second Worksheet
ws = wb.worksheets[1]
# Append 1 or 2(if multiple) new Rows - Columns A - D
for row_data in new_row_data:
# Append Row Values
ws.append(row_data)
wb.save(ReportName) #save the file
It has no errors, but I don't understand why sometimes it doesn't make the report inside the excel file, it saves the excel file, then when I open it, I don't see the values.
Do you know a better way or more solid way to make the auto report?
I would recommend using CSV files, that can be loaded into excel. Python is really great at working with CSVs.
Here is a link that might be useful.
[https://docs.python.org/3/library/csv.html][1]

Convert XLSX to CSV without losing values from formulas

I've tried a few methods, including pandas:
df = pd.read_excel('file.xlsx')
df.to_csv('file.csv')
But every time I convert my xlsx file over to csv format, I lose all data within columns that include a formula. I have a formula that concatenates values from two other cells + '#domain' to create user emails, but this entire column returns blank in the csv product.
The formula is basically this:
=CONCATENATE(B2,".",E2,"#domain")
The conversion is part of a larger code workflow, but it won't work if this column is left blank. The only thing I've tried that worked was this API, but I'd rather not pay a subscription if this can be done locally on the machine.
Any ideas? I'll try whatever you throw at me - bear in mind I'm new to this, but I will do my best!
You can try to open the excel file with the openpyxl library in the data-only mode. This will prevent the raw formulas - they are going to be calculated just the way you see them in excel itself.
import openpyxl
wb = openpyxl.load_workbook(filename, data_only=True)
Watch out when youre working with you original file and save it with the openpyxl-lib in the data-only-mode all your formulas will be lost. I had this once and it was horrible. So i recommend using a copy of your file to work with.
Since you have your xlsx-file with values only you can now use the internal csv library to generate a proper csv-file (idea from this post: How to save an Excel worksheet as CSV):
import csv
sheet = wb.active # was .get_active_sheet()
with open('test.csv', 'w', newline="") as f:
c = csv.writer(f)
for r in sheet.iter_rows(): # generator; was sh.rows
c.writerow([cell.value for cell in r])

how can I quickly convert in python an xlsx file into a csv file?

I have a 140MB Excel file I need to analyze using pandas. The problem is that if I open this file as xlsx it takes python 5 minutes simply to read it. I tried to manually save this file as csv and then it takes Python about a second to open and read it! There are different 2012-2014 solutions that why Python 3 don't really work on my end.
Can somebody suggest how to convert very quickly file 'C:\master_file.xlsx' to 'C:\master_file.csv'?
There is a project aiming to be very pythonic on dealing with data called "rows". It relies on "openpyxl" for xlsx, though. I don't know if this will be faster than Pandas, but anyway:
$ pip install rows openpyxl
And:
import rows
data = rows.import_from_xlsx("my_file.xlsx")
rows.export_to_csv(data, open("my_file.csv", "wb"))
I faced the same problem as you. Pandas and openpyxl didn't work for me.
I came across with this solution and that worked great for me:
import win32com.client
xl=win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False
xl.Workbooks.Open(Filename=your_file_path,ReadOnly=1)
wb = xl.Workbooks(1)
wb.SaveAs(Filename='new_file.csv', FileFormat='6') #6 means csv
wb.Close(False)
xl.Application.Quit()
wb=None
xl=None
Here you convert the file to csv by means of Excel. All the other ways that I tried refuse to work.
Use read-only mode in openpyxl. Something like the following should work.
import csv
import openpyxl
wb = load_workbook("myfile.xlsx", read_only=True)
ws = wb['sheetname']
with open("myfile.csv", "wb") as out:
writer = csv.writer(out)
for row in ws:
values = (cell.value for cell in row)
writer.writerow(values)
Fastest way that pops to mind:
pandas.read_excel
pandas.DataFrame.to_csv
As an added benefit, you'll be able to do cleanup of the data before saving it to csv.
import pandas as pd
df = pd.read_excel('C:\master_file.xlsx', header=0) #, sheetname='<your sheet>'
df.to_csv('C:\master_file.csv', index=False, quotechar="'")
At some point, dealing with lots of data will take lots of time. Just a fact of life. Good to look for options if it's a problem, though.

getting a row using xlwt

Anyone know how to reference a given row of newSheet shown below
import xlwt
outFile = xlwt.Workbook()
newSheet = outFile.add_sheet('Sheet 1', cell_overwrite_ok=True)
#Write a bunch of data to newSheet
For example I want to reference the first row so I can find which column has a certain header.
EDIT: I'd like to be to run this code somehow
newSheet.col(firstRow.index('some pattern')).width = 3000
xlwtis only for writing Excel files. Use xlrd for reading.
If you have written the file yourself should know what you wrote where.
Just remember in a dict or list where you wrote your header.

Reading .xlsx format in python

I've got to read .xlsx file every 10min in python.
What is the most efficient way to do this?
I've tried using xlrd, but it doesn't read .xlsx - according to documentation he does, but I can't do this - getting Unsupported format, or corrupt file exceptions.
What is the best way to read xlsx?
I need to read comments in cells too.
xlrd hasn't released the version yet to read xlsx. Until then, Eric Gazoni built a package called openpyxl - reads xlsx files, and does limited writing of them.
Use Openpyxl some basic examples:
import openpyxl
# Open Workbook
wb = openpyxl.load_workbook(filename='example.xlsx', data_only=True)
# Get All Sheets
a_sheet_names = wb.get_sheet_names()
print(a_sheet_names)
# Get Sheet Object by names
o_sheet = wb.get_sheet_by_name("Sheet1")
print(o_sheet)
# Get Cell Values
o_cell = o_sheet['A1']
print(o_cell.value)
o_cell = o_sheet.cell(row=2, column=1)
print(o_cell.value)
o_cell = o_sheet['H1']
print(o_cell.value)
# Sheet Maximum filled Rows and columns
print(o_sheet.max_row)
print(o_sheet.max_column)
There are multiple ways to read XLSX formatted files using python. Two are illustrated below and require that you install openpyxl at least and if you want to parse into pandas directly you want to install pandas, eg. pip install pandas openpyxl
Option 1: pandas direct
Primary use case: load just the data for further processing.
Using read_excel() function in pandas would be your best choice. Note that pandas should fall back to openpyxl automatically but in the event of format issues its best to specify the engine directly.
df_pd = pd.read_excel("path/file_name.xlsx", engine="openpyxl")
Option 2 - openpyxl direct
Primary use case: getting or editing specific Excel document elements such as comments (requested by OP), formatting properties or formulas.
Using load_workbook() followed by comment extraction using the comment attribute for each cell would be achieved by the following.
from openpyxl import load_workbook
wb = load_workbook(filename = "path/file_name.xlsx")
ws = wb.active
ws["A1"].comment # <- loop through row & columns to extract all comments

Categories