textract doesn´t work on pdf

textract doesn´t work on pdf - python

im new to python. Im using Pycharm 2018.2 and the latest version on Anaconda. Im working on windows 10.
After solving all the problems with installing textract on win 10. I got a positive installation result using anaconda prompt. Additional i have import the Project Interpreter from the \continuum\anaconda3\python.exe
My Target is that i want to extract pdf text from large files so save this text as a .txt
I have tried the test_pdf.py files from textract but they dont work.
Here is the conclusion code:
"textract" is wrong written or cant be found (self translate from
german :-/)
So I tried my own as on the textract page. But it doesnt work...:
Code:
import textract
text = textract.process('pfad/large.pdf')
Results:
C:\Users\raz\AppData\Local\Continuum\anaconda3\python.exe "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py"
Traceback (most recent call last):
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 85, in run
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 709, in init
restore_signals, start_new_session)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py", line 2, in
text = textract.process('pfad/large.pdf')
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers_init_.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 28, in extract
raise ex
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 20, in extract
return self.extract_pdftotext(filename, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 43, in extract_pdftotext
stdout, _ = self.run(args)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 92, in run
' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command pdftotext pfad/large.pdf - failed with exit code 127
------------- stdout -------------
------------- stderr -------------
Thanks for your help

Related

Textract: failed with exit code 127 // windows 10 // pdftotext

When I'm trying to run my (after deploying with pyinstaller) program for reading and converting a PDF file and entering it into a google sheet. I get the error shown in the image below. However I can not seem to figure out what the problem is:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 82, in run
pipe = subprocess.Popen(
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\tkinter\__init__.py", line 1883, in __call__
return self.func(*args)
File "EinkaufRGWindows.py", line 40, in InkoopRekeningen
text = textract.process(str(importfolder) + str(i))
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 28, in extract
raise ex
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 20, in extract
return self.extract_pdftotext(filename, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 43, in extract_pdftotext
stdout, _ = self.run(args)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 90, in run
raise exceptions.ShellError(
textract.exceptions.ShellError: The command `pdftotext //Mac/Home/Desktop/Wickey Einkauf Test/Rekeningen/Lekkerkerker_ - 20803471.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

You're getting a FileNotFoundError it seems. If you look at the error, the command being run is:
pdftotext //Mac/Home/Desktop/Wickey Einkauf Test/Rekeningen/Lekkerkerker_ -
0803471.pdf -
There are a couple of things here I would look at. Firstly, there is an extra slash at the start of your file path, which seems wrong. Secondly, you have spaces in the file path, but there are no quotations enclosing the path. This second part means pdftotext will read this as a few separate command arguments, rather than one. You can fix this by formatting you subprocess call to have the file wrapped in quotation marks, like so:
pdftotext "example file path.pdf" -

You need to install pdftotext using pip.
To install it you need to have Microsoft Visual C++ 14 or greater.

I had the same issue. It seems to be an OS issue. For me, switching to GIT bash worked. https://github.com/deanmalmgren/textract/issues/229
If you are using Pycharm, change default terminal to bash.

Error while performing OCR using pytesseract

I wanna to use pytesseract. This is my code.
import pytesseract
from pdf2image import convert_from_path
PDF_file = 'file.pdf'
text = ''
pages = convert_from_path(PDF_file, 500)
pageText = str(((pytesseract.image_to_string(pages[0]))))
and at result I get this error
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 409, in pdfinfo_from_path
proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\Desktop\projects\pdfparser\pdftest.py", line 13, in
pages = convert_from_path(PDF_file, 500)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 89, in convert_from_path
page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 430, in pdfinfo_from_path
raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

As a lot of comments already pointed out, the error message
PDFInfoNotInstalledError( pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Tells you precisely what went wrong: Poppler is not installed. Please refer to the README for help on that side.
You see, pdf2image is only a wrapper around the pdftoppm command-line utility. On Linux it is installed by default so you would not need to bother with it, but on Windows it is not.

OSError: [Errno 2] No such file or directory with python2.7/subprocess.py

I am trying to use a Python scrip that converts Latex to a png. The library is at https://github.com/cptdeadbones/pytex2png.
When I run the command python examples.py I get the following error:
Traceback (most recent call last):
File "./examples.py", line 14, in <module>
main()
File "./examples.py", line 11, in main
pytex2png.convert("examples/"+file,"output/"+file+".png")
File "/Users/kekearif/Desktop/pytex2png-master/pytex2png.py", line 44, in convert
make_transparent_bg(output,display)
File "/Users/kekearif/Desktop/pytex2png-master/pytex2png.py", line 31, in make_transparent_bg
exe_command(command_line)
File "/Users/kekearif/Desktop/pytex2png-master/pytex2png.py", line 11, in exe_command
p = subprocess.Popen(args,stdout=subprocess.PIPE)
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory'
What does this error mean? Is it an error in how the arguments are being passed?
From the source:
def exe_command(cmd, display=False):
args = shlex.split(cmd)
if(display):
p = subprocess.Popen(args)
else:
p = subprocess.Popen(args,stdout=subprocess.PIPE)
p.wait()
Is there a mistake in the function above?

I hope you have followed the Usage as describe here : Usage which create Makefile.
Have you check the Disclaimers ?? It says the following :
This code was devloped and tested on a Linux based system. I have no
idea if it will work on another system. If you have tested it on
another system, please let me know.

“git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified” error in Python 3.5

I am trying to lemmatize a Latin text using Python 3.5 in Pycharm 5.0.4 with the CLTK library, but there seems to be a problem with Git. I get the error git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified among other errors I believe are related—see below for the full output. I have tried adding a Git repository to the project folder and adding the git.exe path to version control but that seems to have done nothing. What can I do to get Git to work properly—please keep in mind that I am a complete neophyte when it comes to Python in particular and not very experienced with programming in general.
Code:
from cltk.stem.lemma import LemmaReplacer
from cltk.stem.latin.j_v import JVReplacer
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_text_latin_library')
corpus_importer.import_corpus('latin_models_cltk')
#corpus_importer.import_corpus('phi5', '~/PHI5/')
#t.convert_corpus(corpus='phi5')
j = JVReplacer()
lemmatizer = LemmaReplacer('latin')
In = open("CIC.txt","rt")
Out = open("CIC4.txt","wt")
text = In.read()
text = text.lower()
text = j.replace(text)
Out.write(str(lemmatizer.lemmatize(text)))
In.close()
Out.close()
Output:
C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\python.exe "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py" --multiproc --qt-support --client 127.0.0.1 --port 58508 --file C:/Users/Rune/PycharmProjects/untitled/Pucker.py
pydev debugger: process 14648 is connecting
Connected to pydev debugger (build 143.1919)
--- Logging error ---
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 604, in execute
**subprocess_kwargs
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\subprocess.py", line 1220, in _execute_child
startupinfo)
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydev_monkey.py", line 387, in new_CreateProcess
return getattr(_subprocess, original_name)(appName, patch_arg_str_win(commandLine), *args)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cltk\corpus\utils\importer.py", line 134, in import_corpus
Repo.clone_from(git_uri, target_dir, depth=1)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\repo\base.py", line 885, in clone_from
return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\repo\base.py", line 826, in _clone
v=True, **add_progress(kwargs, git, progress))
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 450, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 878, in _call_process
return self.execute(make_call(), **_kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 607, in execute
raise GitCommandNotFound(str(err))
git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 980, in emit
msg = self.format(record)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 830, in format
return fmt.format(record)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 567, in format
record.message = record.getMessage()
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 330, in getMessage
msg = msg % self.args
TypeError: not enough arguments for format string
Call stack:
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py", line 2411, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py", line 1802, in run
launch(file, globals, locals) # execute the script
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Rune/PycharmProjects/untitled/Pucker.py", line 5, in <module>
corpus_importer.import_corpus('latin_text_latin_library')
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cltk\corpus\utils\importer.py", line 136, in import_corpus
logger.error("Git clone of '%s' failed: '%s'", (git_uri, e))
Message: "Git clone of '%s' failed: '%s'"
Arguments: (('https://github.com/cltk/latin_text_latin_library.git', GitCommandNotFound('[WinError 2] The system cannot find the file specified',)),)
Process finished with exit code 0

You can manually download the corpus from https://github.com/cltk/latin_models_cltk and place it in the
~/cltk_data/latin/model/ folder (so ~/cltk_data/latin/model/latin_models_cltk/lemmata/ is an existing folder afterwards). Then you should be able to run the following just fine:
from cltk.stem.lemma import LemmaReplacer
LemmaReplacer('latin').lemmatize('some_latin_here')
For the same in Greek, just replace the 'latin' by 'greek' everywhere in these instructions. I imagine (but haven't tried) that it works the same for other languages too.

I am pretty sure that CLTK does not work with any Python version below 3.6, at least as of August 2017. I had a devil of a time getting 3.6 installed on Ubuntu (Ubuntu linus running dual boot from my PC laptop) but eventually got it to work. Also I specifically solved the problem that OP articulates above by following the instructions at https://disiectamembra.wordpress.com/2016/07/01/current-state-of-the-cltk-latin-lemmatizer/. Good luck!

Compiling and Executing Java file in python

how can I open an java file in python?, i've search over the net and found this:
import os.path, subprocess
from subprocess import STDOUT, PIPE
def compile_java (java_file):
subprocess.check_call(['javac', java_file])
def execute_java (java_file):
cmd=['java', java_file]
proc=subprocess.Popen(cmd, stdout = PIPE, stderr = STDOUT)
input = subprocess.Popen(cmd, stdin = PIPE)
print(proc.stdout.read())
compile_java("CsMain.java")
execute_java("CsMain")
but then I got this error:
Traceback (most recent call last):
File "C:\Python33\lib\subprocess.py", line 1106, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\casestudy\opener.py", line 13, in <module>
compile_java("CsMain.java")
File "C:\casestudy\opener.py", line 5, in compile_java
subprocess.check_call(['javac', java_file])
File "C:\Python33\lib\subprocess.py", line 539, in check_call
retcode = call(*popenargs, **kwargs)
File "C:\Python33\lib\subprocess.py", line 520, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Python33\lib\subprocess.py", line 820, in __init__
restore_signals, start_new_session)
File "C:\Python33\lib\subprocess.py", line 1112, in _execute_child
raise WindowsError(*e.args)
FileNotFoundError: [WinError 2] The system cannot find the file specified
>>>
the python file and java file is in the same folder, and I am using Python 3.3.2, how can I resolve this? or do you guys have another way on doing this?, any answer is appreciated thanks!

I think it isn't recognizing the javac command. Try manually running the command and if javac isn't a recognized command, register it in your PATH variable and try again.
Or you could just try typing the full pathname to the Java directory for javac and java.

you need to add path to your java file name. like this:
compile_java("C:\\path\to\this\CsMain.java")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

textract doesn´t work on pdf - python

Related

Textract: failed with exit code 127 // windows 10 // pdftotext

Error while performing OCR using pytesseract

OSError: [Errno 2] No such file or directory with python2.7/subprocess.py

“git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified” error in Python 3.5

Compiling and Executing Java file in python

Categories

Resources