Python finding exact string in .html file

I have a .html file that gets filled dynamically depending on what actions are taken in the program. I am having an issue when searching it for an exact string: although I know the file is not blank, the loop doesn't return anything and behaves as if the file were empty.
I have read many other SO questions and tried several of their suggestions, including 'blah' in line, re.findall, and with open(), but they all return nothing. I'm thinking I need HTML parsing or similar?
Can anyone shed any light on this for me?
f = open(outApp + '_report.html', 'r+')
for line in f:
    # check the for loop works
    self.progressBox.AppendText(line)
    if 'mystring' in line:
        # do stuff
The string I wish to find is "My country", which is wrapped in h2 tags.

This definitely shouldn't be done without a dedicated HTML parser.
Search for any Python HTML parser you like; for basic usage they are all easy. lxml is one example. In pseudo-code your task would be:
from some_cool_lib import SomeCoolHTMLParser
parser = SomeCoolHTMLParser()
doc = parser.parse(path_to_my_html_file)
h2_elements = doc.findall('h2')
for h2 in h2_elements:
    if h2.text == 'My country':
        # do stuff
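As a concrete counterpart to the pseudo-code above, here is a minimal sketch that uses only the standard library's html.parser module rather than lxml (a swapped-in choice, not necessarily what the answerer had in mind). The class name H2TextCollector, the helper find_h2_texts and the sample markup are made up for illustration, and the sketch assumes h2 elements are not nested.

```python
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Collects the stripped text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def find_h2_texts(html_text):
    """Return the text of all h2 elements found in html_text."""
    parser = H2TextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headings


# Hypothetical report content, standing in for the generated .html file.
sample = "<html><body><h1>Report</h1><h2> My country </h2><p>text</p></body></html>"
if "My country" in find_h2_texts(sample):
    print("found it")
```

For a real file, the same helper could be fed the result of reading the report from disk; comparing against the stripped heading text avoids misses caused by whitespace around the string.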

Related

Parsing XML with ElementTree's iter() with no argument, does not return the first several tags in file

I am trying to extract all of the headers from an XML file and put them into a list in Python. However, every time I run my code, the first tag extracted from the file is not actually the first tag in the XML file. It instead begins with the 18th tag and then prints the remainder of the list from there. The really weird part is that when I originally wrote this code, it worked as expected, but as I added code to extract the element text and put it in a list, the header code stopped working, both in the original program and in the standalone code below. I should also mention that the complete program does not manipulate the XML file in any way. All manipulation is done exclusively on the Python lists after the extraction.
import xml.etree.ElementTree as ET
tree = ET.parse("Sample.xml")
root = tree.getroot()
headers = [elem.tag for elem in root.iter()]
print(headers)
Sample.XML is a sensitive file so I had to redact all the element text. It is also a very large file so I only included one account's worth of elements.
-<ExternalCollection xmlns="namespace.xsd">
-<Batch>
<BatchID>***</BatchID>
<ExternalCollectorName>***</ExternalCollectorName>
<PrintDate>***</PrintDate>
<ProviderOrganization>***</ProviderOrganization>
<ProvOrgID>***</ProvOrgID>
-<Account>
<AccountNum>***</AccountNum>
<Guarantor>***</Guarantor>
<GuarantorAddress1>***</GuarantorAddress1>
<GuarantorAddress2/>
<GuarantorCityStateZip>***</GuarantorCityStateZip>
<GuarantorEmail/>
<GuarantorPhone>***</GuarantorPhone>
<GuarantorMobile/>
<GuarantorDOB>***</GuarantorDOB>
<AccountID>***</AccountID>
<GuarantorID>***</GuarantorID>
-<Incident>
<Patient>***</Patient>
<PatientDOB>***</PatientDOB>
<FacilityName>***</FacilityName>
-<ServiceLine>
<DOS>***</DOS>
<Provider>***</Provider>
<Code>***</Code>
<Modifier>***</Modifier>
<Description>***</Description>
<Billed>***</Billed>
<Expected>***</Expected>
<Balance>***</Balance>
<SelfPay>***</SelfPay>
<IncidentID>***</IncidentID>
<ServiceLineID>***</ServiceLineID>
-<OtherActivity>
</OtherActivity>
</ServiceLine>
</Incident>
</Account>
</Batch>
</ExternalCollection>
The output is as follows:
'namespace.xsd}PatientDOB', '{namespace.xsd}FacilityName', '{namespace.xsd}ServiceLine', '{namespace.xsd}DOS', '{namespace.xsd}Provider', '{namespace.xsd}Code', '{namespace.xsd}Modifier', '{namespace.xsd}Description', '{namespace.xsd}Billed', '{namespace.xsd}Expected', '{namespace.xsd}Balance', '{namespace.xsd}SelfPay', '{namespace.xsd}IncidentID', '{namespace.xsd}ServiceLineID', '{namespace.xsd}OtherActivity'
As you can see, for some reason the first returned value is PatientDOB instead of the actual first tag.
Thank y'all in advance!

Your input file should not contain "-" chars in front of XML tags.
You should drop at least the first "-", in front of the root tag, otherwise
a parsing error occurs.
Note also that your first printed tag name has no initial "{", so apparently
something weird is going on with your list, presumably, after your loop.
I ran your code and got a proper list, containing all tags.
Try the following loop:
for elem in root.iter():
    print(elem.tag)
Maybe it will give you some clue about the real cause of your error.
Consider also upgrading your Python installation. Maybe you have
some outdated modules.
Yet another hint: Run your code on just the input that you included
in your post, with the content replaced with "***". Maybe the real cause
of your error is in the actual content of some source element
(which you replaced here with asterisks).
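The two checks suggested above can be sketched as follows: list the tags on a small input, and confirm that a leading "-" in front of the root tag stops the parser. The trimmed sample XML and the helper names local_name, list_tags and parses_cleanly are made up for illustration; only the ElementTree calls come from the standard library.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed sample in the same shape as the redacted file.
XML_TEXT = """<ExternalCollection xmlns="namespace.xsd">
  <Batch>
    <BatchID>***</BatchID>
    <Account>
      <AccountNum>***</AccountNum>
    </Account>
  </Batch>
</ExternalCollection>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on each tag."""
    return tag.rsplit("}", 1)[-1]


def list_tags(xml_text):
    root = ET.fromstring(xml_text)
    # iter() with no argument walks the whole tree in document order,
    # starting with the root element itself.
    return [local_name(elem.tag) for elem in root.iter()]


def parses_cleanly(xml_text):
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(list_tags(XML_TEXT))
print(parses_cleanly("-" + XML_TEXT))  # leading "-" as copied from a browser view
```

If the list produced this way starts with the root tag but the list in the full program does not, the difference is likely introduced somewhere after the comprehension, or when the output is displayed.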

Editing a DOCX file

I am working on a little project that should be quite simple. I know it's been done before, but for the life of me I cannot get it to work. I made a docx template using Microsoft Word that contains a header and some text in the body of the document. My goal is to have a program that can change this text. Using python-docx I have been able to write a program that modifies the body text easily. That being said, I am trying to learn how to do the same thing using XML parsing, which will allow the header to be changed. Long story short, XML parsing (I think that's what it is) will give me much more freedom down the road.
I know after the docx is unzipped, the word/document.xml contains the body text.
Here is my code so far.
from lxml import etree as ET
tree = ET.parse('document.xml')
root = tree.getroot()
for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    if i.text == 'Title':
        i.text = 'How to cook'
tree.write('document_output.xml', xml_declaration=True, encoding="UTF-8", method="xml",
           standalone="yes")
This program successfully changes the wanted text to the updated text.
Here is the original document.xml
https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0
Here is the output.
https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0
P.S. viewing the code from dropbox, it makes everything start at line 4 instead of line 1.
If you view them in an XML viewer you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. And I wouldn't think this would matter but the top line uses single quotes instead of double.
Hope someone can shed some light on why this is still not opening properly in Word.
Thanks for all the help!!

You're having the usual problems with ET.
As a starter, check out these Stack Overflow threads:
Namespace 1
Namespace 2
Namespace 3 with xml declaration
xml declaration
As you can see, you're not the first person with these problems.
What you could do for the namespaces is parse the xml twice:
first time in order to extract the namespaces and
a second time in order to do your actual work.
Besides, some people have already suggested switching from ElementTree to lxml.
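The two-pass idea can be sketched with the standard library's xml.etree.ElementTree: a first pass with iterparse collects the namespace declarations, they are registered so that the original prefixes are reused on output, and a second pass does the actual edit. The tiny document below is a made-up stand-in for word/document.xml. This only illustrates the namespace handling and is not claimed to be sufficient to make Word accept the file.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of a WordprocessingML document.
XML_BYTES = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b"<w:body><w:p><w:r><w:t>Title</w:t></w:r></w:p></w:body></w:document>"
)

# First pass: each "start-ns" event carries a (prefix, uri) pair.
namespaces = dict(
    pair for _event, pair in ET.iterparse(io.BytesIO(XML_BYTES), events=["start-ns"])
)

# Register the pairs so serialization writes "w:" instead of "ns0:".
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)

# Second pass: parse again and do the actual work.
root = ET.fromstring(XML_BYTES)
for t in root.iter("{%s}t" % namespaces["w"]):
    if t.text == "Title":
        t.text = "How to cook"

output = ET.tostring(root, encoding="unicode")
print(output)
```

Without the register_namespace step, ElementTree invents prefixes such as ns0 when writing, which is one of the differences the linked threads discuss.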

changing plaintext tags into HTML tags to display in browser in python

OK, so I'm writing a function in Python which takes a text document that is marked up with tags like ===, ==, ---, #text# etc. (a lot like Wikipedia). My program basically has to replace those with HTML tags and entities such as &ndash, &mdash, <>text etc. so that they can be displayed properly in a browser. This is what I've got so far:
def tag_change ():
    for () in range ()
        sub('--', '–')
        sub('---', '—')
        sub('''*''', '<i>*</i>')
        sub("'''*'''", '<b>*</b>')
        sub("==*==", "<h1>*</h1>")
        sub("#*#", "<li>*</li>")
Am I on the right track? Or is there something else I need to include? I'm fairly new to this

Your best bet (if you want to write your own function and avoid using an existing tool) is to use regex, which is simple enough
import re

def subst(text):
    # Capture whatever sits between the two '#' markers, e.g. '#text#'.
    capture = re.search(r'#(.+)#', text)
    return '<li>' + capture.group(1) + '</li>'
I hope you get the idea.
You could also use patterns like '==(.+)==' and so forth to capture what you want.
You can view this post to learn more about using re.search and re.match
https://stackoverflow.com/a/180993/2152321
You can also learn more about regex pattern construction here
http://www.tutorialspoint.com/python/python_reg_expressions.htm
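Building on the capture-group idea, a possible sketch of the whole conversion uses re.sub with an ordered list of rules. The mapping of '' to italics is an assumption borrowed from Wikipedia-style markup, and the name RULES is made up; the other pairs follow the question. Order matters: the longer markers (''' and ---) are handled before the shorter ones ('' and --), so the shorter patterns do not consume part of them.

```python
import re

# (pattern, replacement) pairs, applied in this order.
RULES = [
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),  # assumed Wikipedia-style italics
    (re.compile(r"==(.+?)=="), r"<h1>\1</h1>"),
    (re.compile(r"#(.+?)#"), r"<li>\1</li>"),
    (re.compile(r"---"), "&mdash;"),
    (re.compile(r"--"), "&ndash;"),
]


def tag_change(text):
    """Apply every rule in turn and return the converted text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(tag_change("==Title== and '''bold''' --- #item#"))
```

The non-greedy (.+?) keeps two tagged spans on the same line from being merged into one match.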

Extract template arguments in Python from MediaWiki's API wikitext

Is there a way to extract parts of text from MediaWiki's API? For example, this link dumps all the content into XML format:
http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content&format=xml
But there isn't much structure to it, even in the json format.
I'd like to get the text of Writer1_1, Penciler1_1, etc. Perhaps I'm not making my parameters right, so maybe there are other options I could output.
You can see the content in a more user-readable way here.

I'm sure the regex and final splitting could be more efficient, but this gets the job done for what you asked.
import urllib2
import re
data = urllib2.urlopen('http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content')
regex = re.compile('(Writer1_1|Penciler1_1)')
for line in data.read().split('|'):
    if regex.search(line):
        # assume everything after = is the full name
        print ' '.join(line.split()[2:])
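A slightly more general variant, sketched here for Python 3 and without any network access, pulls every "| name = value" line of the template into a dictionary. The wikitext sample, the template name and the people's names are invented for illustration, and the sketch assumes one argument per line with no nested templates.

```python
import re

# Invented wikitext in the usual "| name = value" template layout.
WIKITEXT = """{{Comic Template
| Writer1_1 = Jane Example
| Penciler1_1 = John Placeholder
| Inker1_1 =
}}"""

# One argument per line: "|", the name, "=", then the value up to end of line.
ARG_RE = re.compile(r"^\|\s*(\w+)\s*=[ \t]*(.*?)\s*$", re.MULTILINE)


def template_args(text):
    """Return a dict mapping argument names to their (possibly empty) values."""
    return {name: value for name, value in ARG_RE.findall(text)}


args = template_args(WIKITEXT)
print(args["Writer1_1"])
print(args["Penciler1_1"])
```

Values that span several lines or contain nested {{...}} templates would need a real wikitext parser rather than a single regular expression.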

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
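Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href="http://example.com">example.com</a>
A simple Python replace function which replaces example with <a href="http://example.com">example</a> would output:
Here is an <a href="http://example.com">example</a> link:<a href="http://<a href="http://example.com">example</a>.com"><a href="http://example.com">example</a>.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link:<a href="http://example.com">example.com</a>
Is there any Python plugin that is capable of this? Thanks a lot!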

This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)
# Protect existing link text by wrapping it in '|' markers.
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')
# Replace the keyword in every text node, skipping protected words.
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))
# Remove the '|' markers again.
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]
print soup
Basically I'm extracting all the text nodes from html_body and replacing the word example with the given link, without touching the link text, which is protected by the '|' characters during the pass.
This is not 100% perfect: for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.

This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
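A server-side version of that idea might look like the sketch below. It uses the standard library's html.parser instead of minidom (a swapped-in choice, since html.parser tolerates non-XML HTML) and re.sub with a function for the replacement. The names KEYWORD_LINKS, KeywordLinker and link_keywords are hypothetical. Text inside script or style elements is not treated specially here, which a real implementation would need to handle.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical keyword/link pairs.
KEYWORD_LINKS = {"example": "http://example.com"}
KEYWORD_RE = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, KEYWORD_LINKS)))


def _make_link(match):
    """Replacement function passed to re.sub."""
    word = match.group(1)
    return '<a href="%s">%s</a>' % (escape(KEYWORD_LINKS[word]), word)


class KeywordLinker(HTMLParser):
    """Re-emits the HTML, linking keywords only in text outside <a> elements."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.anchor_depth == 0:
            data = KEYWORD_RE.sub(_make_link, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def link_keywords(html_text):
    parser = KeywordLinker()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(link_keywords("Here is an example link:<a href='http://example.com'>example.com</a>"))
```

Because only text nodes outside anchors are rewritten, the existing link and its href attribute are passed through unchanged, and the \b word boundaries let a keyword followed by a period still match.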
