I have written an HTML text parser. When I use it on a large batch of files (5,000 or more), it produces this error in seemingly random files; when I re-run it, it produces the same error in the exact same files. I removed those files and parsed them individually, and the parser read them.
So I created a new folder with the "problematic" files and tried parsing them separately. It produced no error for most of them, and then reproduced the same error again.
This is the code
import pandas as pd
import shutil
import os
import glob

source_file = r'C:/Users/Ahmed_Abdelmuniem/Desktop/Mar/Problematic/'
file_names = glob.glob(os.path.join(source_file, "*.html"))

for file_name in file_names:
    table = pd.read_html(file_name)
    print(table)
This is the error:
Traceback (most recent call last):
File "C:\Users\Ahmed_Abdelmuniem\PycharmProjects\No Text Parsed Troubleshooting\main.py", line 11, in <module>
table = pd.read_html(file_name)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 1085, in read_html
return _parse(
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 893, in _parse
tables = p.parse_tables()
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 213, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 735, in _build_doc
raise XMLSyntaxError("no text parsed from document", 0, 0, 0)
File "<string>", line 0
lxml.etree.XMLSyntaxError: no text parsed from document
Process finished with exit code 1
I took the "unreadable" files outside of the folder and parsed them individually and the code read them, I can't seem to identify what is wrong.
I hope my explanation is clear and sufficient.
There is a hidden .DS_Store file in the folder. This is my code:
from lxml import etree
import pandas as pd
import os
from time import sleep

locations_folder = '/Users/jerryhu/Documents/Documents/Zenly/locations'
failed_files = []

def parse(file):
    tables = pd.read_html(file)
    dataframe = pd.DataFrame(tables[0])
    path, name = os.path.split(file)
    with open(f'/Users/jerryhu/Documents/Documents/Zenly/locations_csv/{name}'.replace('.html', '.csv'), 'w') as writeCSV:
        dataframe.to_csv(writeCSV)
        print(f"Writing {name.replace('.html', '.csv')} to disk")
    try:
        failed_files.remove(file)
    except:
        pass

for filename in os.listdir(locations_folder):
    file = os.path.join(locations_folder, filename)
    if os.path.exists(file):
        try:
            parse(file)
        except:
            failed_files.append(file)

print("\nFinished. These files failed to parse:")
for i in failed_files:
    print(i)

print("Retrying in 3 seconds.")
sleep(3)

for i in failed_files:
    try:
        parse(i)
    except:
        print(f'{i} couldn\'t be parsed.')
This is the error returned:
Writing 2022-10-09.csv to disk
Writing 2022-09-13.csv to disk
Writing 2022-09-05.csv to disk
Writing 2022-08-28.csv to disk
Writing 2022-12-22.csv to disk
Writing 2022-08-08.csv to disk
Writing 2023-01-01.csv to disk
Writing 2022-09-25.csv to disk
Writing 2022-12-02.csv to disk
Writing 2022-11-12.csv to disk
Writing 2022-12-14.csv to disk
Writing 2022-10-29.csv to disk
Writing 2022-11-04.csv to disk
Writing 2022-10-05.csv to disk
Writing 2022-11-28.csv to disk
Writing 2022-08-24.csv to disk
Writing 2022-07-17.csv to disk
Writing 2022-09-09.csv to disk
Writing 2022-10-13.csv to disk
Finished. These files failed to parse:
/Users/jerryhu/Documents/Documents/Zenly/locations/.DS_Store
Retrying in 3 seconds.
/Users/jerryhu/Documents/Documents/Zenly/locations/.DS_Store couldn't be parsed.
Just put in a try/except block to skip the .DS_Store file.
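A minimal sketch of that fix applied to the loop above — here filtering by extension instead, which skips .DS_Store before parse is ever called (the endswith check is my addition, not the original code):

for filename in os.listdir(locations_folder):
    if not filename.endswith('.html'):
        continue  # skips .DS_Store and any other non-HTML entries
    file = os.path.join(locations_folder, filename)
    try:
        parse(file)
    except Exception:
        failed_files.append(file)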
Related
I am writing a Python script on Windows that processes zip files from a folder. If a file has been copied completely before I start the program, it works fine, but I run into a problem when a file is still being copied into the folder while the program is running: it raises an error and exits.
How can I process only files that have been copied completely?
Or can I check the file size at a 30-second interval, and process a file only once its size stops increasing?
My code:
import zipfile
import os
import xml.etree.ElementTree as ET
import shutil
import configparser
import time

# Function used to read the zip and find the XML inside
def read_zipfile(file):
    with zipfile.ZipFile(file, 'r') as zf:
        for name in zf.namelist():
            if name.endswith('.xml'):
                # open the zip entry and return the XML file object
                xml_content = zf.open(name)
                # here you do your magic with [f]: parsing, etc.
                return xml_content

# Function used to parse the XML file
def XMl_Parsa(f):
    tree = ET.parse(f)
    root = tree.getroot()
    # iterate over 'attribute' nodes to get the value of the tag
    for node in tree.iter('attribute'):
        if node.attrib['name'] == 'ProductNameCont':
            zone = str(node.text)
            return zone

# Function used to move the file
def move_zipFile(zone, out_put, in_xml, file):
    # define the destination path
    Product_zone = os.path.join(out_put, zone)
    print(Product_zone)
    # move the file from the base folder to the product folder
    try:
        os.makedirs(Product_zone, exist_ok=True)
        print("Directory '%s' created successfully" % zone)
    except OSError as error:
        print("Directory '%s' Exist " % error)
    try:
        # unzip the zip file
        shutil.unpack_archive(os.path.join(in_xml, file), os.path.join(Product_zone, os.path.splitext(file)[0]))
        os.remove(os.path.join(in_xml, file))
        print("File '%s' moved successfully" % file)
    except OSError as error:
        print("File '%s' Exist " % error)

# Function used to read the config file
def config_read():
    config = configparser.ConfigParser()
    config.read('./Config.ini')
    xml_path = config.get('Path', 'xml_path')
    dest = config.get('Path', 'dest')
    return xml_path, dest

def main():
    in_xml = config_read()[0]
    out_put = config_read()[1]
    for file in os.listdir(in_xml):
        move_zipFile(XMl_Parsa(read_zipfile(os.path.join(in_xml, file))), out_put, in_xml, file)

if __name__ == "__main__":
    while 1:
        time.sleep(10)
        main()
Error
Traceback (most recent call last):
File "Clero_zipFile_transfer - Copy.py", line 65, in <module>
File "Clero_zipFile_transfer - Copy.py", line 60, in main
File "Clero_zipFile_transfer - Copy.py", line 9, in read_zipfile
File "zipfile.py", line 1268, in __init__
File "zipfile.py", line 1335, in _RealGetContents
zipfile.BadZipFile: File is not a zip file
[2916] Failed to execute script Clero_zipFile_transfer - Copy
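The size-stability check described in the question could look something like this (a sketch, not the asker's code; is_copy_finished is a hypothetical helper, and the 30-second interval comes from the question):

import os
import time

def is_copy_finished(path, interval=30):
    # Treat a file as fully copied once its size stops
    # changing across one polling interval.
    size_before = os.path.getsize(path)
    time.sleep(interval)
    return os.path.getsize(path) == size_before

Only files for which this returns True would be handed to read_zipfile; a file still being copied is skipped and picked up on the next cycle of the while loop.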
I am trying to halve the existing sampling rate of a folder full of .wav files. This is the only way I have found to do it, but it is not working. The read part works just fine up until f.close(); then the write part causes the error.
import wave
import contextlib
import os

for file_name in os.listdir(os.getcwd()):
    if file_name.endswith(".wav"):
        with contextlib.closing(wave.open(file_name, 'rb')) as f:
            rate = f.getframerate()
            new_rate = rate/2
            f.close()
        with contextlib.closing(wave.open(file_name, 'wb')) as f:
            rate = f.setframerate(new_rate)
This is the output when I run it.
Traceback (most recent call last):
File "C:\Users\hsash\OneDrive\Desktop\used AR1-20210513T223533Z-001 - Copy (2)\sounds\python code.py", line 36, in <module>
rate = f.setframerate(new_rate)
File "C:\Users\hsash\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 303, in __exit__
self.thing.close()
File "C:\Users\hsash\AppData\Local\Programs\Python\Python39\lib\wave.py", line 444, in close
self._ensure_header_written(0)
File "C:\Users\hsash\AppData\Local\Programs\Python\Python39\lib\wave.py", line 462, in _ensure_header_written
raise Error('# channels not specified')
wave.Error: # channels not specified
It says right there that the number of channels is not specified. When you open a wave file for writing, Python sets all of the header fields to zero, irrespective of the current state of the file.
In order to make sure the other fields are preserved, you need to copy them over from the old file when you read it the first time.
In the snippet below I'm using getparams and setparams to copy the header fields over, and readframes and writeframes to copy the wave data.
import wave
import contextlib
import os

for file_name in os.listdir(os.getcwd()):
    if file_name.endswith(".wav"):
        with contextlib.closing(wave.open(file_name, 'rb')) as f:
            rate = f.getframerate()
            params = f.getparams()
            frames = f.getnframes()
            data = f.readframes(frames)
            new_rate = rate/2
            f.close()
        with contextlib.closing(wave.open(file_name, 'wb')) as f:
            f.setparams(params)
            f.setframerate(new_rate)
            f.writeframes(data)
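Note that setparams copies over the original frame rate along with the other header fields, so the setframerate call has to come after it for the new rate to take effect.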
I am trying to read the tables of each individual HTML file in a folder using pandas, to find out the number of tables in each file.
However, this works when specifying a single file; when I try to run it over the folder, it says there are no tables.
This is the code for the single file
import pandas as pd
file = r'C:\Users\Ahmed_Abdelmuniem\Desktop\XXX.html'
table = pd.read_html(file)
print ('tables found:', len(table))
This is the output
C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\python.exe C:/Users/Ahmed_Abdelmuniem/PycharmProjects/PandaHTML/main.py
tables found: 72
Process finished with exit code 0
This is the code for each file in a folder
import pandas as pd
import shutil
import os

source_dir = r'C:\Users\Ahmed_Abdelmuniem\Desktop\TMorning'
target_dir = r'C:\Users\Ahmed_Abdelmuniem\Desktop\TAfternoon'
file_names = os.listdir(source_dir)

for file_name in file_names:
    table = pd.read_html(file_name)
    print('tables found:', len(table))
This is the error log:
C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\python.exe "C:/Users/Ahmed_Abdelmuniem/PycharmProjects/File mover V2.0/main.py"
Traceback (most recent call last):
File "C:\Users\Ahmed_Abdelmuniem\PycharmProjects\File mover V2.0\main.py", line 12, in <module>
table = pd.read_html(file_name)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 1085, in read_html
return _parse(
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 913, in _parse
raise retained
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 893, in _parse
tables = p.parse_tables()
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 213, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 543, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
Process finished with exit code 1
os.listdir returns a list containing the names of all entries in the directory, including subdirectories and any non-HTML files. If you want to keep only the HTML files, prefer glob.glob:
import glob
import os

file_names = glob.glob(os.path.join(source_dir, '*.html'))
Edit: if you want to use os.listdir, you have to get the actual path to the file:
for file_name in file_names:
    table = pd.read_html(os.path.join(source_dir, file_name))
    print('tables found:', len(table))
I have the following script to merge a couple of PDFs together:
import PyPDF2
import sys
import os

inputs = sys.argv[1]
list = os.listdir(inputs)
merger = PyPDF2.PdfFileMerger()

for pdf in list:
    merger.append(pdf)

merger.write('merged.pdf')
print('All done')
The folder with the files is in a different directory than the running script, so I passed the full path.
When I run it from the terminal as python3 pdf-merger.py /Users/user/Documents/pdf_list, I get the following error:
Traceback (most recent call last):
File "pdf-merger.py", line 11, in <module>
merger.append(pdf)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/PyPDF2/merger.py", line 203, in append
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/PyPDF2/merger.py", line 114, in merge
fileobj = file(fileobj, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'card.pdf'
I also tried with an alternative method:
import PyPDF2
import sys
import os

inputs = sys.argv[1]
list = [os.path.join(inputs, a) for a in os.listdir(inputs)]
merger = PyPDF2.PdfFileMerger()

for pdf in list:
    merger.append(pdf)

merger.write('merged.pdf')
print('All done')
This time I get a PyPDF2.utils.PdfReadError: Could not read malformed PDF file, no matter what file it is.
Any ideas?
Found the problem. There was a hidden .DS_Store file in the directory that tripped up the script.
Ignoring it with if pdf.endswith('.pdf') resolved the issue!
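Applied to the second snippet above, the filter could look like this (same variable names as the question; only the endswith check is new):

for pdf in list:
    if pdf.endswith('.pdf'):  # skip .DS_Store and any other non-PDF entries
        merger.append(pdf)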
My goal is to extract a file out of a .tar.gz archive without also extracting the subdirectories that precede the desired file. I am trying to model my method on this question. I already asked a question of my own, but the answer I thought would work didn't fully work.
In short, shutil.copyfileobj isn't copying the contents of my file.
My code is now:
import os
import shutil
import tarfile
import gzip

with tarfile.open('RTLog_20150425T152948.gz', 'r:*') as tar:
    for member in tar.getmembers():
        filename = os.path.basename(member.name)
        if not filename:
            continue
        source = tar.fileobj
        target = open('out', "wb")
        shutil.copyfileobj(source, target)
Upon running this code, the file out was successfully created; however, it was empty. I know that the file I want to extract does, in fact, contain lots of data (approximately 450 kb). A print(member.size) returns 1564197.
My attempts to solve this were unsuccessful. A print(type(tar.fileobj)) told me that tar.fileobj is a <gzip _io.BufferedReader name='RTLog_20150425T152948.gz' 0x3669710>.
Therefore I tried changing source to source = gzip.open(tar.fileobj), but this raised the following error:
Traceback (most recent call last):
File "C:\Users\dzhao\Desktop\123456\444444\blah.py", line 15, in <module>
shutil.copyfileobj(source, target)
File "C:\Python34\lib\shutil.py", line 67, in copyfileobj
buf = fsrc.read(length)
File "C:\Python34\lib\gzip.py", line 365, in read
if not self._read(readsize):
File "C:\Python34\lib\gzip.py", line 433, in _read
if not self._read_gzip_header():
File "C:\Python34\lib\gzip.py", line 297, in _read_gzip_header
raise OSError('Not a gzipped file')
OSError: Not a gzipped file
Why isn't shutil.copyfileobj actually copying the contents of the file in the .tar.gz?
fileobj isn't a documented property of TarFile. It's probably an internal object used to represent the whole tar file, not something specific to the current file.
Use TarFile.extractfile() to get a file-like object for a specific member:
…
source = tar.extractfile(member)
target = open("out", "wb")
shutil.copyfileobj(source, target)
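Putting it together with the loop from the question, a sketch (here I use member.isfile() in place of the basename test, and write each member under its own base name instead of a fixed 'out' so multiple members don't overwrite each other):

import os
import shutil
import tarfile

with tarfile.open('RTLog_20150425T152948.gz', 'r:*') as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories and other non-file entries
        filename = os.path.basename(member.name)
        source = tar.extractfile(member)  # file-like object for this member only
        with open(filename, 'wb') as target:
            shutil.copyfileobj(source, target)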