I have a Word document in .docx format which has multiple blank pages. I need to delete all the blank pages in the document using Python. I have tried to delete the empty lines in the document, but it is not working. I tried the code below:
from docx import Document

document = Document("\\doc_path")
z = len(document.paragraphs)
for i in range(0, z):
    if document.paragraphs[i].text == "":
        document.paragraphs[i].clear()
The above code is not working as intended: it deletes all the intermediate empty lines on a page instead of removing only the blank pages.
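One direction to explore, as a sketch only and under the assumption that the blank pages come from manual page breaks sitting in otherwise empty paragraphs: Paragraph.clear() only empties a paragraph, it does not remove it, so the paragraph element itself has to be deleted from the underlying XML. The file names below are placeholders, and _element is a python-docx internal rather than public API.

# A minimal sketch, assuming the blank pages are caused by explicit page breaks
# in otherwise empty paragraphs. It removes only those paragraphs, leaving
# ordinary empty lines inside a page untouched.
from docx import Document
from docx.oxml.ns import qn

document = Document("input.docx")                      # hypothetical path

for paragraph in list(document.paragraphs):
    if paragraph.text.strip():
        continue                                       # paragraph has visible text, keep it
    # does this empty paragraph contain a manual page break (<w:br w:type="page"/>)?
    breaks = paragraph._element.findall(qn('w:r') + '/' + qn('w:br'))
    if any(br.get(qn('w:type')) == 'page' for br in breaks):
        element = paragraph._element
        element.getparent().remove(element)

document.save("output.docx")                           # hypothetical path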
I am still a newbie to Python. I am trying to develop a generic PDF scraper to CSV, organized by two columns: page number and paragraph.
I'm using the PyMuPDF library and I have managed to extract all the text, but I have no clue how to parse the text and write it into a CSV:
page number, paragraph
page number, paragraph
page number, paragraph
Luckily there is a structure. Each paragraph ends with a newline (\n). Each page ends with a page number followed by a newline (\n). I would like to include headers as well, but they are harder to delimit.
import fitz  # PyMuPDF
import csv

pdf = '/file/path.pdf'
doc = fitz.open(pdf)
for page in doc:
    text = page.get_text("text")   # extract the page's plain text
    print(text)
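Given that structure, one possible parsing step is sketched below. It assumes the last non-empty line of each page is the printed page number and that every other non-empty line is a paragraph, which will not hold for every PDF; the output path is a placeholder.

import csv

import fitz  # PyMuPDF

doc = fitz.open('/file/path.pdf')

with open('output.csv', 'w', newline='') as csv_file:      # hypothetical output path
    writer = csv.writer(csv_file)
    writer.writerow(['page number', 'paragraph'])
    for page_index, page in enumerate(doc, start=1):
        lines = [line.strip() for line in page.get_text("text").split("\n") if line.strip()]
        # assumption: the last non-empty line is the printed page number, so drop it
        if lines and lines[-1].isdigit():
            lines = lines[:-1]
        for paragraph in lines:
            writer.writerow([page_index, paragraph])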
I am reading the input docx file's sections/paragraphs and then copy-pasting their content into another docx file at a particular section. The content has images, tables, and bullet points in between the text. However, I'm getting only the text, not the images, tables, and bullet points.
The Tika module is able to read the whole content, but the whole docx comes back as a single string, so I'm unable to iterate over the sections, and I'm also unable to edit (copy-paste the content into) the output docx file.
I tried python-docx, but it reads only the text content and does not identify the images and tables that sit inside a paragraph between the text. python-docx identifies all the images and tables present in the whole document, not the ones attached to a particular paragraph.
I tried unzipping the Word file to XML, but the XML keeps the images in a separate folder. Also, that approach does not identify the bullets.
def tika_extract_data(input_file, output_file):
    import tika, collections
    from tika import unpack

    parsed = collections.OrderedDict()
    parsed = unpack.from_file(input_file)   # dict with 'metadata' and 'content' keys
    with open(output_file, 'w') as f:
        for line in parsed:
            if line == 'content':
                lines = parsed[line]
                # print(lines)
                for indx, j in enumerate(lines.split("\n")):
                    print(j)
I expected the output file to have all the sections replaced with the copied input section content (images, tables, SmartArt, and formatting).
The output file just has the text data.
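One direction that may help, as a sketch rather than a tested solution: python-docx does let you walk the document body in reading order if you drop down to its internals, which at least tells you which paragraphs contain inline images and where the tables sit. doc.element.body and building Paragraph/Table objects from raw elements are internals rather than public API, and the file name is a placeholder.

from docx import Document
from docx.oxml.ns import qn
from docx.table import Table
from docx.text.paragraph import Paragraph

doc = Document("input.docx")                           # hypothetical path

for child in doc.element.body.iterchildren():
    if child.tag == qn('w:p'):                         # a paragraph
        paragraph = Paragraph(child, doc)
        has_image = bool(child.findall('.//' + qn('w:drawing')))
        print("paragraph:", repr(paragraph.text), "contains image:", has_image)
    elif child.tag == qn('w:tbl'):                     # a table
        table = Table(child, doc)
        print("table with", len(table.rows), "rows")

Copying a block into another document is then usually done by deep-copying its XML element into the target body; inline images only survive that if their image parts and relationships are copied across as well.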
I basically have to make a program that takes a user-input web address and parses the HTML to find links, then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting links from this string and putting them into a string array? Also, would it be possible to identify links (such as an image link or an mp3 link) so I can put them into different arrays? (Then I could categorize them when I'm creating the output file.)
You can use the re module to parse the HTML text for links. In particular, the findall method can return every match.
As far as sorting by file type goes, that depends on whether the URL actually contains the extension (i.e. .mp3, .js, .jpeg, etc.).
You could do a simple for loop like this:
import re

html = getHTMLText()          # placeholder for however you fetched the page
mp3s = []
other = []
for match in re.findall('<reexpression>', html):   # '<reexpression>' is a placeholder pattern
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)
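For a concrete pattern to drop in place of '<reexpression>', one rough option is to grab quoted href and src attribute values. This is only a sketch that assumes well-formed, quoted attributes; the sample html string is just an illustration.

import re

# tiny sample page; in the real program this would be the string fetched with urllib.request
html = '<a href="song.mp3">song</a> <img src="photo.jpg"> <a href="page.html">page</a>'

# rough pattern: the value of any quoted href= or src= attribute
link_pattern = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

mp3s, images, other = [], [], []
for url in link_pattern.findall(html):
    lower = url.lower()
    if lower.endswith('.mp3'):
        mp3s.append(url)
    elif lower.endswith(('.png', '.jpg', '.jpeg', '.gif')):
        images.append(url)
    else:
        other.append(url)

print(mp3s, images, other)   # ['song.mp3'] ['photo.jpg'] ['page.html']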
Try to use the html.parser library or the re library; they will help you do that.
I think you can use a regex like this to do it:
r'http[s]?://[^\s<>"]+|www\.[^\s<>"]+'
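Since only built-in modules are available, a small html.parser subclass is usually sturdier than a regex. A minimal sketch (the sample input is just an illustration):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href/src attribute values from <a> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == 'a' and name == 'href':
                self.links.append(value)
            elif tag == 'img' and name == 'src':
                self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="http://example.com">x</a> <img src="pic.png">')  # sample input
print(collector.links)   # ['http://example.com', 'pic.png']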
I have the following code (doop.py), which strips a .html file of all the 'nonsense' HTML markup, outputting only the 'human-readable' text; e.g. it will take a file containing the following:
<html>
<body>
<a href="http://www.w3schools.com">
This is a link</a>
</body>
</html>
and give
$ ./doop.py
File name: htmlexample.html
This is a link
The next thing I need to do is add a function such that, if any of the HTML arguments within the file represents a URL (a web address), the program will read the content of the designated webpage instead of a disk file. (For present purposes, it is sufficient for doop.py to recognize an argument beginning with http:// (in any mixture of letter cases) as a URL.)
I'm not sure where to start with this - I'm sure it would involve telling python to open a URL, but how do I do that?
Thanks,
A
Apart from urllib2, which others have already mentioned, you can take a look at the Requests module by Kenneth Reitz. It has a more concise and expressive syntax than urllib2.
import requests

r = requests.get('https://api.github.com', auth=('user', 'pass'))
r.text   # the response body as a string
As with most things pythonic: there is a library for that.
Here you need the urllib2 library.
This allows you to open a URL like a file, and read and write from it like a file.
The code you would need would look something like this:
import urllib2

urlString = "http://www.my.url"
try:
    f = urllib2.urlopen(urlString)   # open url
    pageString = f.read()            # read content
    f.close()                        # close url
    readableText = getReadableText(pageString)
    # continue using the pageString as you wish
except urllib2.URLError:
    print("Bad URL")
Update:
(I don't have a Python interpreter to hand, so I can't test that this code will work, but it should!)
Opening the URL is the easy part, but first you need to extract the URLs from your HTML file. This is done using regular expressions (regexes), and unsurprisingly, Python has a library for that (re). I recommend that you read up on regexes; they are basically a pattern against which you can match text.
So what you need to do is write a regex that matches URLs:
(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
If you don't want to follow URLs to FTP resources, then remove "ftp|" from the pattern. Now you can scan your input file for all character sequences that match this pattern:
import re

# compile the pattern matcher
pattern = re.compile(r"(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?")

input_file_str = ...  # open your input file and read its contents into this string
# findall would return only the capture groups, so iterate over the match objects instead
for match in pattern.finditer(input_file_str):
    urlString = match.group(0)   # the full string that matched the pattern
    # use the code above to load the url using the matched string!
That should do it.
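To tie this back to doop.py's requirement of recognizing an argument that begins with http:// in any mixture of letter cases, a minimal sketch might look like this; get_readable_text stands in for the asker's existing HTML-stripping code and is hypothetical.

import urllib2

def get_readable_text(page_string):
    # hypothetical stand-in for doop.py's existing HTML-stripping code
    return page_string

def load_source(argument):
    if argument.lower().startswith("http://"):     # case-insensitive URL check
        f = urllib2.urlopen(argument)              # read from the web
        page_string = f.read()
        f.close()
    else:
        with open(argument) as disk_file:          # otherwise treat it as a disk file
            page_string = disk_file.read()
    return get_readable_text(page_string)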
You can use third-party libraries like Beautiful Soup or the standard HTMLParser. Here is a previous Stack Overflow question: html parser python
Other Links
http://unethicalblogger.com/2008/05/03/parsing-html-with-python.html
Standard Library
http://docs.python.org/library/htmlparser.html
Performance comparison
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
While parsing, one needs to look for "http" to pick out the links.
Rather than writing your own HTML parser/scraper, I would personally recommend Beautiful Soup, which you can use to load up your HTML, get the elements you want out of it, find all the links, and then use urllib to fetch the new links for you to parse and process further.
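A rough illustration of that workflow, assuming Beautiful Soup 4 is installed and using an example URL:

import urllib.request

from bs4 import BeautifulSoup

# fetch a page and parse it (example URL)
with urllib.request.urlopen("http://www.example.com") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# collect every href from the <a> tags
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)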