pdf file with python - python

How can search a word or a line in a pdf file?
Is there an existing module to do that by being concise?
Thank you in advance,

There's something called as pyPDF.
It is a Pure-Python library built as a PDF toolkit.
You can extract ( using extractText() method ) & also perform search on the pdf file with using something like following code.
pdf = pyPdf.PdfFileReader(file(path, "rb"))
content = pdf.getPage(1).extractText()

Related

Reading from pdf file to text yields no results

So I'm trying something very simple: I just want to read text from a pdf file in to a variable - that's it. This is what I'm getting:
Does anyone know a reliable way to just read pdf in to a text file?
Try the following library - pdfplumber:
import pdfplumber
pdf_file = pdfplumber.open('anyfile.pdf')
page = pdf_file.pages[0]
text = page.extract_text()
print(text)
pdf_file.close()
I haven't used PyPDF2 before but pdfplumber seems to work well for me.

flask - python - markdown to html

I'm building my first website using flask and HTML. Some of my data that I want to migrate to this website resides in Markdown format. I am trying to convert Markdown into HTML using this however, I cannot get my hear around it:
https://github.com/Python-Markdown/markdown
I import it into my *.py file not sure what are the next steps after. This is what I got so far
from markdown import markdown
html = markdown.markdown(text)
not sure what should be put into the "text" variable. Also I have my markdown data residing in an html file how do I reference that from here? I have read through the installation guide but it's not very clear for me.
Thank you.
According to the docs located at https://python-markdown.github.io/reference/#using-markdown-as-a-python-library
text is supposed to contain your markdown text. In the below example found in the docs, some_file.txt would be the file containing your markdown.
input_file = codecs.open("some_file.txt", mode="r", encoding="utf-8")
text = input_file.read()
html = markdown.markdown(text)
To get your text, you would need to parse it out of the HTML. There are several ways of doing this but we would need more information about the file to proceed. Is your HTML file stored locally? Where in the file is the markdown? A MRE would be helpful

Python FileStorage: Read XML file

Below is a simple example to write an XML file and read it back. The writing works OK, but I am not sure how to read this file back? Below is some sample code. How do I get thse values from the XML file?
file1 = 'result1.xml'
fs = cv2.FileStorage(file1, cv2.FILE_STORAGE_WRITE)
fs.write('var1', 1)
fs.write('var2', 2)
fs = cv2.FileStorage(file1,cv2.FILE_STORAGE_READ)
fn = fs.real
Python in different versions has its own library to parse XML data.
Here is where you can find the documentation : XML Library
You have to be careful when using it, as said in title of the webpage, this library is not safe if XML files aren't built properly.
Here is another useful website : How to parse XML files using Python ?

Open a word document with python using windows

I am trying to open a word document with python in windows, but I am unfamiliar with windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure of why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The pathway is correct as I copied it directly from the file system.
Any help is appreciated thanks.
try this
import StringIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
with open(file) as f:
source_stream = StringIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, in regards to debugging the file not found error, simplify your directory names and files names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
If you want to open the document in Microsoft Word try using os.startfile().
In your example it would be:
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in word on your computer.

Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference with .doc files, and the reason why some (or all) of the above do not work with .docs.
In this case, you would likely have to convert doc -> docx first. antiword is an option.
python-docx can read as well as write.
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "How to Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPi. Then you can use the following
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx
def main():
try:
doc = docx.Document('test.docx') # Creating word reader object.
data = ""
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
data = '\n'.join(fullText)
print(data)
except IOError:
print('There was an error opening the file!')
return
if __name__ == '__main__':
main()
and dont forget to install python-docx using (pip install python-docx)

Categories