lxml.etree.XMLSyntaxError while trying to read_excel using pandas

lxml.etree.XMLSyntaxError while trying to read_excel using pandas - python

I've a spreadsheet (~50 mb) with multiple sheets, and I'm trying to read it using pandas.
import pandas as pd
df = pd.read_excel('compiled_output.xlsx', sheet_name='Sheet1')
I'm not sure why it's throwing lxml.etree.XMLSyntaxError; I've done this many times before. I also tried passing engine=openpyxl, downgraded to pandas==1.2.4 but I get the same error:
df = pd.read_excel('compiled_output.xlsx', sheet_name='Sheet1')
File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1131, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 475, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 391, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 486, in load_workbook
return load_workbook(
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 282, in read
self.read_worksheets()
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 216, in read_worksheets
rels = get_dependents(self.archive, rels_path)
File "/usr/local/lib/python3.9/site-packages/openpyxl/packaging/relationship.py", line 131, in get_dependents
node = fromstring(src)
File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 2, column 12753697

Related

Dynamically choose BigQuery tablename in Apache Beam pipeline

I'm building an Apache Beam pipeline using GCP Dataflow to process incoming events which need to be written in separate BigQuery tables depending on the content of the event. The decision of which table the data needs to be written to happens in one of the stages of the pipeline. My problem is how do I dynamicallys ett he name of the table that the data needs to go into. Also, in some cases, data needs to be written to two tables, after applying a transform.
I have gone through the solutions posted on these links, but it seems that they could be for old versions of google-cloud/apache-beam and are not working for me:
Dynamically set bigquery table id in dataflow pipeline
Writing different values to different BigQuery tables in Apache Beam
Attaching a sample pipeline using DirectRunner where I tried to follow the 2nd link mentioned above:
#Standard Python Imports
import argparse
import logging
import json
#3rd Party Imports
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.options.pipeline_options import PipelineOptions
def transform_entry(line):
return json.loads(line)
def getTableName(entry):
if (entry["tablename"] == "table1"):
return "table1"
else:
return "table2"
def getRow(entry):
return entry["dataRow"]
def run(argv=None):
parser = argparse.ArgumentParser()
parser.add_argument('--temp_location',
default='<<TEMPORARY LOCATION>>')
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(StandardOptions).streaming = True
with beam.Pipeline(options=pipeline_options) as p:
writeData = (p
| 'ReadInput' >> beam.io.ReadFromText('./sample_input.json')
| 'Parse' >> beam.Map(transform_entry))
eventRow = (writeData
| 'Get Data Row' >> beam.map(getRow)
| 'Write Event Row' >> beam.io.gcp.bigquery.WriteToBigQuery(
project='<<GCP PROJECT>>',
dataset='<<DATASET NAME>>',
table=getTableName,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
))
print(eventRow)
if __name__ == '__main__':
logging.getLogger().setLevel(logging.ERROR)
run()
Could someone please help me out with this?
Attaching the traceback here:
/home/animesh/.local/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery.py:1992: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
is_streaming_pipeline = p.options.view_as(StandardOptions).streaming
<apache_beam.io.gcp.bigquery.WriteResult object at 0x7f2421660100>
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1521, in process
yield (self.destination(element, *side_inputs), element)
File "/home/animesh/Documents/cliqmetrics/logger/dataflow-pipeline/stackques/stackpipe.py", line 85, in getTableName
KeyError: 'tablename'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/animesh/Documents/cliqmetrics/logger/dataflow-pipeline/stackques/stackpipe.py", line 56, in <module>
run()
File "/home/animesh/Documents/cliqmetrics/logger/dataflow-pipeline/stackques/stackpipe.py", line 37, in run
with beam.Pipeline(options=pipeline_options) as p:
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/pipeline.py", line 600, in __exit__
self.result = self.run()
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/pipeline.py", line 553, in run
self._options).run(False)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/pipeline.py", line 577, in run
return self.runner.run_pipeline(self, self._options)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/direct/direct_runner.py", line 131, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 201, in run_pipeline
self._latest_run_result = self.run_via_runner_api(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 222, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 453, in run_stages
bundle_results = self._execute_bundle(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 781, in _execute_bundle
self._run_bundle(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1010, in _run_bundle
result, splits = bundle_manager.process_bundle(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1346, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 379, in push
response = self.worker.do_instruction(request)
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 596, in do_instruction
return getattr(self, request_type)(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 634, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1003, in process_bundle
input_op_by_transform_id[element.transform_id].process_encoded(
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 227, in process_encoded
self.output(decoded_value)
File "apache_beam/runners/worker/operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 1021, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process
File "apache_beam/runners/worker/operations.py", line 1030, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process
File "apache_beam/runners/common.py", line 1432, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction
File "apache_beam/runners/common.py", line 817, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam/runners/common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam/runners/common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam/runners/common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam/runners/common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam/runners/common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "/home/animesh/.local/lib/python3.10/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1521, in process
yield (self.destination(element, *side_inputs), element)
File "/home/animesh/Documents/cliqmetrics/logger/dataflow-pipeline/stackques/stackpipe.py", line 85, in getTableName
KeyError: "tablename [while running 'Write Event Row/_StreamToBigQuery/AppendDestination']"

Your function and the way to apply dynamic table name based on current element in the PCollection and a function are corrects, but you have a problem in the current element in your PCollection.
You have a KeyError in the Dict inside your PCollection, the key tablename seems not to be present.
You can add a mock instead of ReadFromText in order to be sure the expected this key is present and your input PCollection of Dict is created as expected : you can use beam.Create([{'field_name':'field_value'}]) for example.
So you will test more easily the write to BQ part with dynamic table name.

"EOFError: Ran out of input" when packaging a Python script with PyInstaller

I'm developing an application for Windows operating systems written in Python 3.8 and which makes use of the nnunet library (https://pypi.org/project/nnunet/) which uses multiprocessing. I have tested the script and it works correctly.
Now I'm trying to package everything with pyinstaller v5.7.0. The creation of the .exe is successful but when I run it I get the following error:
Traceback (most recent call last):
File "main.py", line 344, in <module>
File "nnunet\inference\predict.py", line 694, in predict_from_folder
File "nnunet\inference\predict.py", line 496, in predict_cases_fastest
File "nnunet\inference\predict.py", line 123, in preprocess_multithreaded
File "multiprocess\process.py", line 121, in start
File "multiprocess\context.py", line 224, in _Popen
File "multiprocess\context.py", line 327, in _Popen
File "multiprocess\popen_spawn_win32.py", line 93, in __init__
File "multiprocess\reduction.py", line 70, in dump
File "dill\_dill.py", line 394, in dump
File "pickle.py", line 487, in dump
File "dill\_dill.py", line 388, in save
File "pickle.py", line 603, in save
File "pickle.py", line 717, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1186, in save_module_dict
File "pickle.py", line 971, in save_dict
Traceback (most recent call last):
File "main.py", line 341, in <module>
File "pickle.py", line 997, in _batch_setitems
File "D:\MyProject\venv\Lib\site-packages\PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 49, in _freeze_support
File "dill\_dill.py", line 388, in save
spawn.spawn_main(**kwds)
File "pickle.py", line 560, in save
File "pickle.py", line 901, in save_tuple
File "dill\_dill.py", line 388, in save
File "multiprocessing\spawn.py", line 116, in spawn_main
File "pickle.py", line 560, in save
File "multiprocessing\spawn.py", line 126, in _main
File "dill\_dill.py", line 1427, in save_instancemethod0
EOFError: Ran out of input
[588] Failed to ex File "pickle.py", line 692, in save_reduce
ecute script 'main' d File "dill\_dill.py", line 388, in save
ue to unhandled File "pickle.py", line 560, in save
exception!
File "pickle.py", line 886, in save_tuple
File "dill\_dill.py", line 388, in save
File "pickle.py", line 603, in save
File "pickle.py", line 717, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1186, in save_module_dict
File "pickle.py", line 971, in save_dict
File "pickle.py", line 997, in _batch_setitems
File "dill\_dill.py", line 388, in save
File "pickle.py", line 603, in save
File "pickle.py", line 687, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1698, in save_type
File "dill\_dill.py", line 1070, in _save_with_postproc
File "pickle.py", line 692, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "pickle.py", line 901, in save_tuple
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "pickle.py", line 886, in save_tuple
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1698, in save_type
File "dill\_dill.py", line 1084, in _save_with_postproc
File "pickle.py", line 997, in _batch_setitems
File "dill\_dill.py", line 388, in save
File "pickle.py", line 603, in save
File "pickle.py", line 717, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1186, in save_module_dict
File "pickle.py", line 971, in save_dict
File "pickle.py", line 997, in _batch_setitems
File "dill\_dill.py", line 388, in save
File "pickle.py", line 603, in save
File "pickle.py", line 717, in save_reduce
File "dill\_dill.py", line 388, in save
File "pickle.py", line 560, in save
File "dill\_dill.py", line 1186, in save_module_dict
File "pickle.py", line 971, in save_dict
File "pickle.py", line 997, in _batch_setitems
File "dill\_dill.py", line 388, in save
File "pickle.py", line 578, in save
File "PyInstaller\loader\pyimod01_archive.py", line 76, in __getattr__
AssertionError
[4392] Failed to execute script 'main' due to unhandled exception!
Below is the code of my python script:
#==============================
# main.py
#==============================
from multiprocessing import freeze_support
from nnunet.inference.predict import predict_from_folder
if __name__ == "__main__":
freeze_support()
...
predict_from_folder(...)
...
Below is the code of the nnunet library that triggers the error:
#==============================
# nnunet\inference\predict.py
#==============================
def preprocess_multithreaded(trainer, list_of_lists, output_files, num_processes=2, segs_from_prev_stage=None):
if segs_from_prev_stage is None:
segs_from_prev_stage = [None] * len(list_of_lists)
num_processes = min(len(list_of_lists), num_processes)
classes = list(range(1, trainer.num_classes))
assert isinstance(trainer, nnUNetTrainer)
q = Queue(1)
processes = []
for i in range(num_processes):
pr = Process(
target=preprocess_save_to_queue,
args=(
trainer.preprocess_patient,
q,
list_of_lists[i::num_processes],
output_files[i::num_processes],
segs_from_prev_stage[i::num_processes],
classes,
trainer.plans['transpose_forward']
)
)
pr.start() ## <------------ The error is generated here!!!!!!!!!!!!!
processes.append(pr)
try:
end_ctr = 0
while end_ctr != num_processes:
item = q.get()
if item == "end":
end_ctr += 1
continue
else:
yield item
finally:
for p in processes:
if p.is_alive():
p.terminate()
p.join()
q.close()
def predict_cases_fastest(...):
...
pool = Pool(num_threads_nifti_save)
...
preprocessing = preprocess_multithreaded(
trainer,
list_of_lists,
cleaned_output_files,
num_threads_preprocessing,
segs_from_prev_stage
)
...
pool.starmap_async(...)
...
pool.close()
pool.join()
def predict_from_folder(...):
...
return predict_cases_fastest(...)
if __name__ == "__main__":
...
Edit 03-02-2023
I have created a public project with which it is possible to reproduce the reported problem: https://gitlab.com/carlopoletto/nnunet_pyinstaller_problem
In the ./scripts folder there are some scripts to install everything and run the tests:
./scripts/install: dependency installation
./scripts/dist: creating the executable with pyinstaller
./scripts/run_py: running the python script (NB: this script automatically delete the ./temp folder and recreate it by copying the contents of ./data)
./scripts/run_exe: running the executable created with ./scripts/dist (NB: this script automatically delete the ./temp folder and recreate it by copying the contents of ./data)
The problem appears to be internal to the nnunet library. I don't know if this problem can be solved by properly configuring pyinstaller.

Empty SVG file errror while it's not

Working with Pycairo, I create an "SVGSurface" (which create for me an svg file) and I write on it using a "context". Now after I finish, I need to use the svg file, but it seems that the file is not closed so that it gave me an error telling that the document is empty.
ps = cairo.SVGSurface("header.svg", width, height)
cr = cairo.Context(ps)
drawRectangle (cr,
papersize.convert_length(int(lg[0]), "px","pt"),
papersize.convert_length(int(lg[2]), "px","pt"),
papersize.convert_length(int(lg[1])-int(lg[0])-1, "px","pt"),
papersize.convert_length(int(lg[3])-int(lg[2])-1, "px","pt"),
0, 0, 0.5
)
cr.show_page()
head = st.fromfile("header.svg")
It gives me this error :
File "/usr/local/lib/python2.7/dist-packages/svgutils/transform.py", line 249, in fromfile
svg_file = etree.parse(fid)
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81117)
File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118116)
File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:118399)
File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:117187)
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:111914)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1 (line 1)
I tried to close the file with os but it didn't work

How to download all transcripts from seeking alpha

Is there some way to automatically download all the transcripts from the SA website?
http://seekingalpha.com/earnings/earnings-call-transcripts
I tried using the http://newspaper.readthedocs.io/en/latest/ python code but I get the following error:
earnings_call_transcripts_2 = newspaper.build('http://seekingalpha.com/earnings/earnings-call-transcripts', memoize_articles=False)
Traceback (most recent call last):
File "/Users/name/anaconda/lib/python3.5/site-packages/newspaper/parsers.py", line 67, in fromstring
cls.doc = lxml.html.fromstring(html)
File "/Users/name/anaconda/lib/python3.5/site-packages/lxml/html/__init__.py", line 867, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/Users/name/anaconda/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104143)
File "<string>", line None
lxml.etree.XMLSyntaxError: line 295: b"htmlParseEntityRef: expecting ';'"
[Source parse ERR] http://seekingalpha.com/earnings/earnings-call-transcripts

Pandas inconsistent "Initializing from file failed" with multithreading

I have some code like the following:
df = pd.read_csv(filepath, header=None, parse_dates=[[0,1]])
This is called from a watchdog thread. Another process writes the file every minute, and
then the above code gets called through watchdog's "on_modified" and does some processing.
It works a few times and most of the times I get the following traceback:
Exception in thread Thread-1:
Traceback (most recent call last):
File "E:\Portable Python 2.7.6.1\App\Lib\threading.py", line 810, in __bootstr
ap_inner
self.run()
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\watchdog-0.8.1-py2.7.eg
g\watchdog\observers\api.py", line 236, in run
self.dispatch_events(self.event_queue, self.timeout)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\watchdog-0.8.1-py2.7.eg
g\watchdog\observers\api.py", line 406, in dispatch_events
self._dispatch_event(event, watch)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\watchdog-0.8.1-py2.7.eg
g\watchdog\observers\api.py", line 401, in _dispatch_event
handler.dispatch(event)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\watchdog-0.8.1-py2.7.eg
g\watchdog\events.py", line 343, in dispatch
_method_map[event_type](event)
File "tx_stats.py", line 272, in on_modified
dur, tail_skip_count)
File "tx_stats.py", line 209, in read_file
df = pd.read_csv(filepath, header=None, parse_dates=[[0,1]])
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\pandas-0.14.1-py2.7-win
32.egg\pandas\io\parsers.py", line 452, in parser_f
return _read(filepath_or_buffer, kwds)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\pandas-0.14.1-py2.7-win
32.egg\pandas\io\parsers.py", line 234, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\pandas-0.14.1-py2.7-win
32.egg\pandas\io\parsers.py", line 542, in __init__
self._make_engine(self.engine)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\pandas-0.14.1-py2.7-win
32.egg\pandas\io\parsers.py", line 679, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "E:\Portable Python 2.7.6.1\App\lib\site-packages\pandas-0.14.1-py2.7-win
32.egg\pandas\io\parsers.py", line 1041, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 332, in pandas.parser.TextReader.__cinit__ (pandas\par
ser.c:3218)
File "parser.pyx", line 560, in pandas.parser.TextReader._setup_parser_source
(pandas\parser.c:5608)
IOError: Initializing from file failed
What could be wrong? The most puzzling part is that it works some of the time, but not every time. Could it be because of a race condition between this process reading the file and the other process closing the file?
The file contains some statistics like:
2014.09.02,00:15,111.7,63,159,134.11261, and so on
2014.09.02,00:16,126.1,08,1235,154.11353,
2014.09.02,00:17,123.5,65,100,153.11313,
2014.09.02,01:18,114.7,59,101,152.11334,
2014.09.02,01:19,111.3,42,1229,153.11283,

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

lxml.etree.XMLSyntaxError while trying to read_excel using pandas - python

Related

Dynamically choose BigQuery tablename in Apache Beam pipeline

"EOFError: Ran out of input" when packaging a Python script with PyInstaller

Empty SVG file errror while it's not

How to download all transcripts from seeking alpha

Pandas inconsistent "Initializing from file failed" with multithreading

Categories

Resources