I'm trying to write a program for data extraction from PDFs in Python (an Excel macro could be an option).
First, I want to select some text or a position in a PDF file and generate a local path/link to that file at that position. This link will be copied into an Excel cell. When I click on the link, the PDF document should open at the specified coordinates of the previously selected text.
I know the question is very broad. I'm an enthusiastic beginner and need a nudge in the right direction, and to know whether it is possible at all.
How can I get the path of the active PDF file on the desktop, and the coordinates of the selected text? I could then pass these automatically as parameters to my program.
Thank you!
There are a lot of ways to do this. I would say look into Slate (https://pypi.python.org/pypi/slate) or PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html).
And yes, it's quite easy; also look into pyPdf:
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
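For comparison, the PDFMiner route can be just as short if you use the newer pdfminer.six fork (an assumption; the original package linked above exposes a lower-level API):

from pdfminer.high_level import extract_text

# Pull the text of every page out of the document in one call
text = extract_text("test.pdf")
print(text)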
import pdftotext

# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
    docs = pdftotext.PDF(f)

print(docs[0])
This code prints a blank for this specific file; if I change the file, it gives me a result.
I even tried Apache Tika, and Tika also returns None. How do I solve this problem?
One thing I would like to mention here is that the PDF is made up of multiple images.
Here is the file
This is a sample PDF, not the original one, but I want to extract text from the PDF, something like this.
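Since the pages are just images, I assume some kind of OCR pass is needed rather than plain text extraction. A minimal sketch of that idea, assuming the pdf2image and pytesseract packages (which also need the Poppler and Tesseract binaries installed) and a placeholder filename:

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then run Tesseract OCR on it
pages = convert_from_path("sample.pdf")  # placeholder path, not the real file
text = ""
for page in pages:
    text += pytesseract.image_to_string(page) + "\n"

print(text)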
I have an MS Word document that consists of several hundred pages.
Each page is identical apart from the name of a person, which is unique on each page (one page is one user).
I would like to take this Word document and automate the process of saving each page individually, so I end up with several hundred Word documents, one for each person, rather than one document containing everyone. I can then distribute them to the different people.
I have been using the module python-docx found here: https://python-docx.readthedocs.io/en/latest/
I am struggling with how to achieve this task.
As far as I have researched, it is not possible to loop over each page, as the pages are not defined in the .docx file itself but are generated by the program, i.e. Microsoft Word.
However, python-docx can read the text, and since each page is the same, can I not tell Python: when you see this text (the last piece of text on a given page), consider it the end of a page, and anything after that point is a new page?
Ideally, I could write a loop that finds such a point, creates a document up to that point, and repeats this over all pages. It would also need to carry over all formatting/pictures.
I am not against other methods, such as converting to PDF first if that is an option.
Any thoughts?
I had the exact same problem. Unfortunately, I could not find a way to split a .docx by page. My solution was to first use python-docx or docx2python (whichever you like) to iterate over the document, extract the unique (person) information for each page, and put it in a list, so you end up with:
people = ['person_A', 'person_B', 'person_C', ....]
Then save the .docx as a PDF, split the PDF up by page, and save each page as person_A.pdf etc., like this:
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("document.pdf", "rb"))

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open(f"{people[i]}.pdf", "wb") as outputStream:
        output.write(outputStream)
The result is a bunch of one-page PDFs saved as person_A.pdf, person_B.pdf, etc.
Hope that helps.
I would suggest another package, aspose-words-cloud, to split a Word document into separate pages. Currently, it works with cloud storage (Aspose cloud storage, Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage and FTP storage). However, in the near future, it will support processing files from the request body (streams).
P.S.: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id = 'xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret = 'xxxxxxxxxxxxxxxxxx'

words_api = asposewordscloud.WordsApi(client_id, client_secret)
words_api.api_client.configuration.host = 'https://api.aspose.cloud'

remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'

# Upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName, 'rb'), remoteFolder + '/' + remoteFileName))

# Split the DOCX into pages, returned as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(name=remoteFileName, format='docx', folder=remoteFolder, zip_output='true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))

# Download file
request_download = asposewordscloud.models.requests.DownloadFileRequest(result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/' + result.split_result.zipped_pages.href)
I'm having a frustrating issue outputting to text files from Python. The files appear perfectly normal when opened in a text editor, but I am uploading these files into QDA Miner, a data analysis suite, and once they are uploaded into QDA Miner, this is what the text looks like:

"This problem really needs to be focused in a way that is particular to its cultural dynamics and tending in the industry,"

As you can see, many of these weird (unprintable) symbols show up throughout the texts. The text that my Python script parses is originally an RTF file that I convert to plain text using OS X's built-in text editor.
Is there an easy way to remove these symbols? I am parsing individual 100+ MB text files and separating them into thousands of separate articles, so I have to have a way to batch convert them; otherwise it will be near impossible. I should also mention that the origin of these text files is text copied from web pages.
Here is some relevant code from the script I wrote:
test1 = filedialog.askopenfile()
newFolder = ((str(test1)[25:])[:-32])
folderCreate(newFolder)
masterFileName = newFolder + "/" + "MASTER_FILE"
masterOutput = open(masterFileName, "w")
edit = test1.readlines()

for i, line in enumerate(edit):
    for j in line.split():
        if j in ["Author", "Author:"]:
            try:
                outputFileName = "-".join(edit[i-2].lower().title().split()) + ".txt"
                # Create file named after the article; backslashes changed to forward slashes on Windows
                output = open(newFolder + "/" + outputFileName, "w")
                print("File created - ", "-".join(edit[i-2].lower().title().split()))
                counter2 = counter2 + 1
            except:
                print("Filename error.")
                counter = counter + 1
                pass
            # Count number of words in each article
            wordCount = 0
            for word in edit[i+1].split():
                wordCount += 1
            fileList.append((outputFileName, str(wordCount)))
            # Now write to file
            output.write(edit[i-2])  # write article title
            output.write("\n")
            author = line
            output.write(author)  # write article author
            output.write("\n")
            output.write("\n")
            content = edit[i+1]
            output.write(content)  # write article content
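I was thinking of something like filtering out control characters before writing, along these lines (just a sketch; clean_text is a helper name I made up):

import unicodedata

def clean_text(text):
    # Drop Unicode control/format characters (category "C...") but keep newlines and tabs
    return "".join(ch for ch in text
                   if ch in "\n\t" or not unicodedata.category(ch).startswith("C"))

# e.g. output.write(clean_text(edit[i+1]))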
Thanks
Hello, I am trying to make a Python function to save a list of URLs in a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing viewtopic.php?t= in a .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this code, but it does not save anything.
I am very new to Python; can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and can have quite a few corner cases (I suppose the page you're referring to is a web page).
To give you a few pointers, you need to:
Download the web page: you're already doing it (the result is in data).
Extract the URLs: this is the hard part. Most probably you'll want to use an HTML parser, extract the <a> tags, fetch the href attribute, and put those values into a list. Then filter that list to keep only the URLs formatted the way you like (say, containing viewtopic). Let's say you got them into urlList.
Then open a file for writing text (hence mode "wt", not "r").
Write the content: f.write('\n'.join(urlList)).
Close the file.
I advise you to try to follow these steps and ask specific questions when you're stuck on a particular issue. A rough sketch of these steps is below.
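A sketch in Python 3 using only the standard library (the forum URL and the viewtopic filter are just the ones from your question; any HTML parser, e.g. BeautifulSoup, would work equally well):

from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# 1. Download the web page
data = urlopen("http://forum.domain.com/").read().decode("utf-8", errors="ignore")

# 2. Extract the URLs and keep only the topic links
collector = LinkCollector()
collector.feed(data)
urlList = [url for url in collector.links if "viewtopic.php?t=" in url]

# 3-5. Open the file for writing text, write the content, close the file
with open("urllist.txt", "wt") as f:
    f.write("\n".join(urlList))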
So far, here is the code I have (it is working and extracting text as it should):
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")
I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.
Matt
Take a look at os.walk()
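For example, something like this (a sketch, reusing the directory from your question) collects every PDF path in the tree:

import os

# Walk the directory tree and gather every PDF path
pdf_paths = []
for dirpath, dirnames, filenames in os.walk("/home/nick/TAM_work/TAM_pdfs"):
    for name in filenames:
        if name.lower().endswith(".pdf"):
            pdf_paths.append(os.path.join(dirpath, name))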
A for loop to run it on all PDFs in a directory: look at the glob module.
Save the text as a CSV: look at the csv module.
Count the pictures: look at the pyPdf module :-)
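A rough sketch tying the first two together, in the same Python 2 style as your code and reusing getPDFContent from above (the CSV filename is made up; counting pictures is left out, since it depends on how you walk each page's resources):

import csv
import glob

# Extract the text of every PDF in the directory and write one CSV row per file
csv_file = open("tam_text.csv", "wb")  # made-up output name
writer = csv.writer(csv_file)
writer.writerow(["filename", "text"])
for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    text = getPDFContent(path).encode("ascii", "ignore")
    writer.writerow([path, text])
csv_file.close()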
Two comments on this statement:
content = " ".join(content.replace(u"\xa0", " ").strip().split())
(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split().
(2) Using strip() is redundant:
>>> u" foo bar ".split()
[u'foo', u'bar']
>>>
The glob module can help you find all files in a single directory that match a wildcard pattern.