This code selects the wanted rows from a DataFrame. The original data is in an Excel file; I have put a sample of it in a DataFrame here.
I want to select all rows whose “Test Date” falls in “201506” or “201508” and write them to an Excel file. The lines below work fine on the sample.
import pandas as pd
data_short = {'Contract_type' : ["Other", "Other", "Type-I", "Type-I", "Type-I", "Type-II", "Type-II", "Type-III", "Type-III", "Part-time"],
'Test Date': ["20150816", "20150601", "20150204", "20150609", "20150204", "20150806", "20150201", "20150615", "20150822", "20150236" ],
'Test_time' : ["16:26", "07:39", "18:48", "22:32", "03:54", "03:30", "04:00", "22:02", "13:43", "10:29"],
}
df = pd.DataFrame(data_short)
data_201508 = df[df['Test Date'].astype(str).str.startswith('201508')]
data_201506 = df[df['Test Date'].astype(str).str.startswith('201506')]
data_68 = data_201506.append(data_201508)
writer = pd.ExcelWriter("C:\\test-output.xlsx", engine = 'openpyxl')
data_68.to_excel(writer, "Sheet1", index = False)
writer.save()
But when I apply the same code to a larger file (~600,000 rows with 25 columns, 65 MB in file size), it fails with the error below:
Traceback (most recent call last):
File "C:\Python27\Working Scripts\LL move pick wanted ATA in months.py", line 15, in <module>
writer.save()
File "C:\Python27\lib\site-packages\pandas\io\excel.py", line 732, in save
return self.book.save(self.path)
File "C:\Python27\lib\site-packages\openpyxl\workbook\workbook.py", line 263, in save
save_workbook(self, filename)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 239, in save_workbook
writer.save(filename, as_template=as_template)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 222, in save
self.write_data(archive, as_template=as_template)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 80, in write_data
self._write_worksheets(archive)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 163, in _write_worksheets
xml = sheet._write(self.workbook.shared_strings)
File "C:\Python27\lib\site-packages\openpyxl\worksheet\worksheet.py", line 776, in _write
return write_worksheet(self, shared_strings)
File "C:\Python27\lib\site-packages\openpyxl\writer\worksheet.py", line 263, in write_worksheet
xf.write(worksheet.page_breaks.to_tree())
File "src/lxml/serializer.pxi", line 1016, in lxml.etree._FileWriterElement.__exit__ (src\lxml\lxml.etree.c:142025)
File "src/lxml/serializer.pxi", line 904, in lxml.etree._IncrementalFileWriter._write_end_element (src\lxml\lxml.etree.c:140218)
File "src/lxml/serializer.pxi", line 999, in lxml.etree._IncrementalFileWriter._handle_error (src\lxml\lxml.etree.c:141711)
File "src/lxml/serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src\lxml\lxml.etree.c:131087)
lxml.etree.SerialisationError: IO_WRITE
Does this mean the computer is not powerful enough (8 GB RAM, Windows 10)? Is there a way to optimize the code, for example so that it consumes less memory? Thank you.
By the way, this is similar to the question "I/O Error while saving Excel file - Python", which has no solution.
I found a solution: write the output to CSV instead (it can be opened in Excel as well).
data_wanted_all.to_csv("C:\\test-output.csv", index=False)
I am posting it here in case someone encounters the same problem. Let me know if this question should be removed. :)
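The filtering and the CSV workaround can be combined in one pass. The following is only a sketch under assumptions: the sample frame is a cut-down stand-in for the real data, the names wanted_months, data_wanted and csv_text are made up for illustration, and the output goes to an in-memory buffer rather than to C:\test-output.csv. It builds a single boolean mask from the first six characters of "Test Date" instead of filtering each month separately and appending the results, and it keeps the rows in their original order.

```python
import io

import pandas as pd

# Small stand-in for the real data; only the "Test Date" column matters here.
df = pd.DataFrame({
    "Contract_type": ["Other", "Other", "Type-I", "Type-I", "Type-II"],
    "Test Date": ["20150816", "20150601", "20150204", "20150609", "20150806"],
})

# Build one boolean mask from the first six characters (YYYYMM) of the date,
# so no intermediate per-month DataFrames have to be created and appended.
wanted_months = ["201506", "201508"]
mask = df["Test Date"].astype(str).str[:6].isin(wanted_months)
data_wanted = df[mask]

# Write to an in-memory buffer for the demo; pass a file path instead
# of the buffer to write the CSV to disk.
buffer = io.StringIO()
data_wanted.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Whether this is enough for the 600,000-row file depends on the machine, but writing CSV avoids building the whole workbook in memory before saving.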
I have a file in Google Sheets that I want to read into a pandas DataFrame, but it gives me an error that I don't understand.
This is the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv")
print(df)
And this is the error:
File "c:\Users\blaghzao\Documents\Stage PFA(Laghzaoui Brahim)\google_sheet.py", line 3, in <module>
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv")
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1255, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 89 fields in line 3, saw 243
I found the answer: the problem was just the access permissions of the file.
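As far as I can tell, when a sheet is not shared publicly the export URL can return an HTML sign-in page instead of CSV, and read_csv then fails with a tokenizing error like the one in the question. A minimal sketch of a check that could be run on the downloaded bytes before parsing them; the function name and the sample payloads are made up for illustration and are not part of pandas or of any Google API:

```python
def looks_like_html(payload: bytes) -> bool:
    """Return True if the downloaded bytes look like an HTML page rather than CSV."""
    head = payload.lstrip()[:200].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Hypothetical payloads standing in for what the export URL might return.
csv_payload = b"team,year\nA,2015\n"
html_payload = b"<!DOCTYPE html><html><head><title>Sign in</title></head></html>"

csv_is_html = looks_like_html(csv_payload)
html_is_html = looks_like_html(html_payload)
```

If the check returns True, the sharing settings of the sheet are the first thing to look at, before touching the parser options.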
Remove gid from the URL in the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv")
print(df)
As far as I know, this error arises when a comma delimiter is used and a line contains more commas than expected.
Can you try the read_csv() call below to skip such lines?
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv", on_bad_lines='skip')
This skips the bad lines, so you can identify the problem from which lines were skipped. I believe the exported CSV does not match the format that pandas read_csv() expects.
I am trying to read a folder of .xlsm files, take columns A:J, remove any empty rows, and then combine the Excel files into a single CSV. The code below seems to work when I use just one specific Excel file but raises an error when I loop over the folder. Any help would be appreciated.
import pandas as pd
import os
import glob
# defines the folder to pull from and to save into
source = r"C:\Users\bwendt\QAR"
#defines list of files as dir and changes directory to source
os.chdir(source)
files = glob.glob(source + "/*.xlsm")
MultiRents = []
#loops through list of file paths, reads the file, removes blank cells, and adds to data frame
for f in files:
    data = pd.read_excel(f,"Page2",usecols = "A:J")
    #data.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
    MultiRents.append(data)
#create pandas DF by stacking the per-file frames (from_records does not accept a list of DataFrames)
df = pd.concat(MultiRents, ignore_index=True)
#Exports dataframe to a csv file
export_csv = df.to_csv("Multifamily_Rents.csv")
Traceback:
Traceback (most recent call last):
File "", line 1, in
runfile('C:/Users/bwendt/.spyder-py3/Print_rents.py', wdir='C:/Users/bwendt/.spyder-py3')
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/bwendt/.spyder-py3/Print_rents.py", line 21, in <module>
data = pd.read_excel(f,"Page2",usecols = "A:J")
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\excel.py", line 653, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\excel.py", line 424, in __init__
self.book = xlrd.open_workbook(filepath_or_buffer)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\xlrd\book.py", line 1278, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\bwendt\AppData\Local\Continuum\anaconda3\lib\site-packages\xlrd\book.py", line 1272, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\x0eWendt, '
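One possible cause, offered as a guess rather than a confirmed diagnosis: while a workbook is open, Excel keeps a small hidden owner file next to it whose name starts with ~$, and glob matches it along with the real .xlsm files. Such a file holds the user's name rather than workbook data, which would fit the bytes quoted in the error message. A minimal sketch of filtering those files out before the loop; the helper name and the file names are hypothetical:

```python
import os


def is_excel_lock_file(path: str) -> bool:
    """Return True for Excel's temporary owner files, whose names start with '~$'."""
    return os.path.basename(path).startswith("~$")


# Hypothetical glob result: one real workbook and the owner file that Excel
# creates next to it while the workbook is open.
files = [
    os.path.join("QAR", "Rents_2019.xlsm"),
    os.path.join("QAR", "~$Rents_2019.xlsm"),
]
workbooks = [f for f in files if not is_excel_lock_file(f)]
```

Closing the workbooks in Excel before running the script should have the same effect, since the owner files are removed when the workbook is closed.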
I am trying to run a CSV file through my code and work with the data. I am receiving an error message that I don't fully understand.
Here is the csv file
There is a lot more code but I will only include code that is relevant to the problem. Comment below if you need more info.
import pandas as pd
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
df_playoffs.fillna('None', inplace=True)
Here is the error message:
Traceback (most recent call last):
File "/Users/hannahbeegle/Desktop/Baseball.py", line 130, in <module>
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
It looks as though your CSV may be tab-delimited. In the line that reads the file, try something like:
pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', sep="\t", encoding='latin-1', index_col = 'team')
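Rather than guessing the separator, the standard library's csv.Sniffer can make an attempt at detecting it from the first few lines. This is only a sketch: the sample text is made up, and the candidate list passed as delimiters is an assumption about what the file might contain.

```python
import csv

# Made-up sample standing in for the first few lines of the file.
sample = (
    "team\tyear\twins\n"
    "NYY\t2009\t103\n"
    "BOS\t2013\t97\n"
    "CHC\t2016\t103\n"
)

# Restrict the candidates to the delimiters that are plausible here.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
detected_sep = dialect.delimiter
```

The detected value could then be passed as sep to pd.read_csv. The sniffer raises csv.Error when it cannot decide, so it is worth wrapping in a try block for messy files.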
[edited section after comments]
If the data is "ragged" (rows with different numbers of fields), you could try reading it into a dictionary first and then building the DataFrame from that. Here is an example I tried with a mocked-up sample dataset:
record_dict = {}
# the with block closes the file even if parsing fails
with open("variable_columns.csv", mode="r") as file:
    for line in file:
        split_line = line.split()
        if not split_line:
            continue  # skip blank lines
        # the first field is the key, the remaining fields are the row values
        record_dict[split_line[0]] = split_line[1:]
df_playoffs = pd.DataFrame.from_dict(record_dict, orient='index')
df_playoffs.sample(5)
You might need to adjust the line.split() call and pass "\t" as the separator (i.e. line.split("\t")), but you can experiment with this.
Also, notice that pandas forces the data to be rectangular, so some of the columns will contain None for the "short" rows.
I am trying to open an xlsx file that is created by another system (this is the format in which the data always comes, and it is not in my control). I tried both openpyxl (v2.3.2) and xlrd (v1.0.0), as well as pandas (v0.20.1) read_excel and pd.ExcelFile(), both of which use xlrd and so may be moot. I am running into errors and have not found answers in my searches. Any help is appreciated.
xlrd code:
import xlrd
workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')
Error:
Traceback (most recent call last):
File "<ipython-input-3-9e5d87f720d0>", line 2, in <module>
workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
ragged_rows=ragged_rows,
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
self_do_row(elem)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)
ValueError: could not convert string to float:
openpyxl code:
import openpyxl
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')
Error:
Traceback (most recent call last):
File "<ipython-input-2-6083ad2bc875>", line 1, in <module>
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 234, in load_workbook
parser.parse()
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 106, in parse
dispatcher[tag_name](element)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 243, in parse_row_dimensions
self.parse_cell(cell)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 188, in parse_cell
value = _cast_number(value)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
return long(value)
ValueError: invalid literal for int() with base 10: ' '
pandas code:
import pandas as pd
df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')
Error:
Traceback (most recent call last):
File "<ipython-input-5-b86ec98a4e9e>", line 2, in <module>
df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 200, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 257, in __init__
self.book = xlrd.open_workbook(io)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
ragged_rows=ragged_rows,
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
self_do_row(elem)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)
ValueError: could not convert string to float:
For what it's worth, here is an example snippet of the input file:
I am guessing that the errors come from the first row having blanks beyond the first column, because the errors vanish when I delete the first two rows. I cannot skip the first two rows, because I want to extract the value in cell A1. I would also like to force the values read to be string type, and will later convert them to float with error checking. Thanks!
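The "convert to float with error checking" step could look roughly like the sketch below once the values have been read as strings. The helper name and the sample values are made up for illustration; the blank ' ' value mirrors the one quoted in the openpyxl error message.

```python
def to_float_or_none(value):
    """Convert a cell value read as text to float; return None if it is blank or not numeric."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Made-up cell values, including a blank-looking ' ' string.
raw_cells = ["3.5", " ", "n/a", "42", None]
converted = [to_float_or_none(v) for v in raw_cells]
```

Returning None rather than raising keeps the blank and text cells visible in the result, so they can be reported or dropped in a later step.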
===========
Update (Aug 9, 10 AM EDT): Using Charlie's suggestion, I was able to open the Excel file in read-only mode and read most of the contents, but I am still running into an error somewhere.
new code (sorry it is not very pythonic - still a newbie):
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx', read_only=True)
ws = wb['Sheet1']
ws.max_row = ws.max_column = None
i=1
for row in ws.rows:
    for cell in row:
        if i<2000:
            i += 1
            try:
                print(i, cell.value)
            except:
                print("error")
Error:
Traceback (most recent call last):
File "<ipython-input-65-2e8f3cf2294a>", line 2, in <module>
for row in ws.rows:
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 125, in get_squared_range
yield tuple(self._get_row(element, min_col, max_col))
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 165, in _get_row
value, data_type, style_id)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 36, in __init__
self.value = value
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 132, in value
value = _cast_number(value)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
return long(value)
ValueError: invalid literal for int() with base 10: ' '
=========
Update 2 (10:35 AM): When I read the file without setting ws.max_row and ws.max_column to None, the code read just one column, without errors. The value in cell A66 is "Generated from:". But when I read the file with ws.max_row and ws.max_column set to None, this particular cell causes trouble. I can read all the other cells before it, though, and that works fine for me right now. Thanks, @Charlie.
It sounds like the source file is probably corrupt and contains cells with empty strings that are typed as numbers. You might be able to use openpyxl's read-only mode to skip the first two rows.
If your program works after you delete the first two rows, then let's skip them. Try using skiprows to ignore the first 2 rows, which are blank or headers. You can use the parse method of pandas' ExcelFile.
import pandas as pd
xls = pd.ExcelFile('C:/Temp/Data.xlsx')
df = xls.parse('Sheet1', skiprows=2) #assuming your data is on sheet1.
I am trying to append the values from one sheet row by row to a new workbook. My code works when I run it on a small test file, but when I run it on my target file it returns an error when saving.
Here is my code:
from openpyxl import load_workbook
from openpyxl import Workbook
wb = load_workbook(filename='RM Activity-Pricing Report - 2014-05-31.xlsm',keep_vba=False, data_only=True)
ws_Ottawa = wb.get_sheet_by_name('Ottawa')
wb2 = Workbook()
ws2 = wb2.create_sheet()
for row in ws_Ottawa.iter_rows():
    ws2.append(row)
wb2.save('new_big_file.xlsx')
The output error I get in Spyder (python 3.5) is:
Traceback (most recent call last):
File "<ipython-input-22-171ffbcd4891>", line 1, in <module>
runfile('Z:/Revenue Management Report/ExtractPromoData.py', wdir='Z:/Revenue Management Report')
File "C:\Anaconda3-64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Anaconda3-64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "Z:/Revenue Management Report/ExtractPromoData.py", line 35, in <module>
wb2.save('new_big_file4.xlsx')
File "C:\Anaconda3-64\lib\site-packages\openpyxl\workbook\workbook.py", line 298, in save
save_workbook(self, filename)
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\excel.py", line 198, in save_workbook
writer.save(filename, as_template=as_template)
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\excel.py", line 181, in save
self.write_data(archive, as_template=as_template)
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\excel.py", line 87, in write_data
self._write_worksheets(archive)
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\excel.py", line 114, in _write_worksheets
write_worksheet(sheet, self.workbook.shared_strings,
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\worksheet.py", line 233, in write_worksheet
write_rows(xf, worksheet)
File "C:\Anaconda3-64\lib\site-packages\openpyxl\writer\lxml_worksheet.py", line 59, in write_rows
if cell.value is None and not cell.has_style:
File "C:\Anaconda3-64\lib\site-packages\openpyxl\cell\cell.py", line 306, in value
if value is not None and self.is_date:
File "C:\Anaconda3-64\lib\site-packages\openpyxl\cell\cell.py", line 351, in is_date
if self.data_type == "n" and self.number_format != "General":
File "C:\Anaconda3-64\lib\site-packages\openpyxl\styles\styleable.py", line 49, in __get__
return coll[idx - 164]
IndexError: list index out of range
I do not get an error when I use my code on a smaller test .xlsx file.
Possible reasons for the problem that I suspect are:
1) the input file is .xlsm
2) the input file has columns from A to CI
3) the input file is password protected (but since the error occurs when saving, this does not seem like it should be an issue)
Taking into account what Charlie said, this is my work-around
from openpyxl import load_workbook
from openpyxl import Workbook
wb = load_workbook(filename='RM Activity-Pricing Report - 2014-5-31.xlsm',keep_vba=False, data_only=True)#,guess_types=True)
ws_Ottawa = wb.get_sheet_by_name('Ottawa')
wb2 = Workbook()
ws2 = wb2.create_sheet()
counter = 0
new_rows = []
for rrow in ws_Ottawa.iter_rows():
    new_rows.append([])
    for cell in rrow:
        new_rows[counter].append(cell.value)
    counter +=1
for wrow in new_rows:
    ws2.append(wrow)
wb2.save('new_big_file4.xlsx')
print("ALL DONE")
You quite simply cannot do what you are trying to do. Unfortunately, the way data is stored within the file format means that much relevant information, such as the number format, is not stored with the cell but by reference to the workbook object. These references obviously differ from workbook to workbook, which is why you see errors when saving: the number format you want to use doesn't exist in the new file.
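Copying only the values, as the work-around does, sidesteps this because plain values carry no reference to the source workbook. A shorter sketch of the same idea, assuming an openpyxl version that supports the values_only argument of iter_rows (it is not available in the older releases from the time of the question). The workbooks here are built in memory with made-up data instead of being loaded from the .xlsm file:

```python
from openpyxl import Workbook

# Stand-in source workbook built in memory; the real one would come from load_workbook().
source_wb = Workbook()
source_ws = source_wb.active
source_ws.append(["Hotel", "Rate"])
source_ws.append(["Ottawa", 129.5])

target_wb = Workbook()
target_ws = target_wb.active

# values_only=True yields tuples of plain values instead of Cell objects,
# so no style or number-format index from the source workbook is carried over.
for row_values in source_ws.iter_rows(values_only=True):
    target_ws.append(row_values)

copied = [list(r) for r in target_ws.iter_rows(values_only=True)]
```

Formatting is lost with this approach, so dates and currency cells arrive as raw values and would need their number formats set again in the new workbook.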