How to ignore some characters when parsing a string? - python

I'd like to parse a list of log strings in Python 3 using the parse module.
A string can look like:
[00:00:01.913,100] [0m<inf> main: starting v2.9.0(b2393) on Zephyr v2.6.99[0m
or
[00:00:02.789,700] [1;33m<wrn> BC127: codec config differs. set default one[0m
So far I defined my parser
myParser = parse.compile("[{}] <{}> {}: {}")
Of course my parser fails. I don't know how to tell the parse module to discard, for example, everything between ] and <, like "]*<".
my pseudo code is
myParser = parse.compile("[{}]*<{}> {}: {}*")

Use a regular expression:
>>> import re
>>> s = '<# """#$ FSDF > important information < not important >'
>>> re.sub('<[^>]+>', '', s)
' important information '
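Applied to the log lines in the question, a minimal sketch along the same lines (assuming the stray [0m fragments are ANSI color escapes, i.e. \x1b[0m with an invisible ESC byte) could be:
import re
line = "[00:00:01.913,100] \x1b[0m<inf> main: starting v2.9.0(b2393) on Zephyr v2.6.99\x1b[0m"
# Skip everything between "]" and "<" and drop an optional trailing reset code.
m = re.match(r"\[(.+?)\].*?<(\w+)> (\w+): (.*?)(?:\x1b\[0m)?$", line)
if m:
    timestamp, level, module, message = m.groups()
    print(timestamp, level, module, message)
    # 00:00:01.913,100 inf main starting v2.9.0(b2393) on Zephyr v2.6.99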

Related

Getting raw text from lxml

Trying to get the text from an HtmlElement in lxml. For example, I have the HTML read in by
thing = lxml.html.fromstring("<code>&lt;div&gt;</code>")
But when I call thing.text I get <div>, meaning that lxml is translating the escape characters. Is there a way to get the raw text, i.e., &lt;div&gt;? This is part of the output when I do lxml.html.tostring(thing), but that includes the opening and closing tags, which I don't want.
Tried calling tostring with a few different encoding options but no luck.
So I looked into it a bit closer:
cdef tostring(...) in src\lxml\etree.pyx - see https://github.com/lxml/lxml/blob/master/src/lxml/etree.pyx
cdef _tostring(...) in src\lxml\serializer.pxi - see https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi
and I couldn't find anything that would suggest you could get the escaped string by configuring the parameters of the tostring() function. It seems like it will always return the unescaped string, maybe due to security concerns...
The way I see it, you would have to use another function such as html.escape to get the escaped string:
import lxml.html
from html import escape as html_escape
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
raw_thing = lxml.html.tostring(thing, method="text", encoding="unicode") # <div>MY TEST DIV</div>
escaped_thing = html_escape(raw_thing) # &lt;div&gt;MY TEST DIV&lt;/div&gt;
print(escaped_thing)
Essentially what you are looking for is lxml.html.tostring(root, method="text", encoding="unicode"):
import lxml.html
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
output = lxml.html.tostring(thing, method="text", encoding="unicode")
print(output) # <div>MY TEST DIV</div>
The problem is that it cannot separate the text of the root element from that of its children in something like <code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div></code>
However with a different approach you can get the desired output:
import xml.etree.ElementTree as ET
thing = """
<code><div>MY TEST DIV</div><div><div>AAA</div></div><div><div>BBB</div></div></code>
"""
root = ET.fromstring(thing)
root_text = ET._escape_attrib(root.text)
print(root_text)
for child in root:
child_text = ET._escape_attrib(child.text)
print(child_text)
The code above prints out:
&lt;div&gt;MY TEST DIV&lt;/div&gt;
&lt;div&gt;AAA&lt;/div&gt;
&lt;div&gt;BBB&lt;/div&gt;
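If you'd rather not rely on the private ET._escape_attrib helper, a sketch using the public html.escape function should print the same thing:
import xml.etree.ElementTree as ET
from html import escape
root = ET.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div></code>")
print(escape(root.text))       # &lt;div&gt;MY TEST DIV&lt;/div&gt;
for child in root:
    print(escape(child.text))  # &lt;div&gt;AAA&lt;/div&gt;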

Delete only a specific query from an URL

So I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
Can urllib.parse or furl do this in an easy and clean way?
You should use urllib.parse, as it's designed exactly for this purpose; there's no need to reinvent the wheel here.
Basically 3 steps:
Use urlparse to parse the URL into its component parts
Use parse_qs to parse the query-string part, keeping blank values intact (if relevant)
Remove the unwanted query2, re-encode the query string, and rebuild the URL
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del query_string['query2']
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3
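The same three steps wrapped into a small reusable helper (the function name is my own):
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def remove_query_param(url, param):
    # Parse the URL, drop the parameter if present, then rebuild it.
    parts = list(urlparse(url))
    query = parse_qs(parts[4], keep_blank_values=True)
    query.pop(param, None)  # unlike del, no KeyError when param is absent
    parts[4] = urlencode(query, doseq=True)
    return urlunparse(parts)

print(remove_query_param("https://foo.bar?query1=value1&query2=value2&query3=value3", "query2"))
# https://foo.bar?query1=value1&query3=value3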
If you want it by position:
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1 = url.find("&")
findindex2 = url.find("&", findindex1 + 1)
url = url[0:findindex1] + url[findindex2:len(url)]
If you want it by name:
url = "https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1 = url.find("query2")
findindex2 = url.find("&", findindex1 + 1)
if findindex2 == -1:
    url = url[0:findindex1 - 1]
else:
    url = url[0:findindex1 - 1] + url[findindex2:len(url)]
You could try it with regular expressions:
re.sub("ThePatternOfTheURL", "ThePatternYouWantToHave", "TheInput")
So it could look something like this:
import re
pattern = r"(https\:\/\/)([a-zA-Z.?0-9=]+)([&]query2=value2)([&][a-zA-Z0-9=]+)"
# filters the third group (query2) out
filter = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL = re.sub(pattern, filter, yourUrl)
I think this should work for you.

Python url construction: escape characters other than regular letters

I am using the Wikipedia API with the following request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=$cammer&guiprop=groups|merged|unattached&format=json
The problem is that I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this reference on w3schools, but checking it for every single character would be painful. What would be the best way to escape these characters in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at using urlencode.
from urllib import urlencode  # Python 2
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
If you don't need to build a complete url you can just use the quote function for a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')
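Note that this is Python 2 code; in Python 3 both helpers live in urllib.parse, so an equivalent sketch would be:
from urllib.parse import urlencode, quote
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",  # urlencode turns "$" into %24
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
print(quote("$cammer"))  # %24cammer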

Removing extra spaces in Chinese HTML files using lxml

I have a bunch of improperly formatted Chinese html files. They contain unnecessary spaces and line breaks which will be displayed as extra spaces in the browser. I've written a script using lxml to modify the html files. It works fine on simple tags, but I'm stuck on nested ones. For example:
<p>祝你<span>19</span>岁
生日快乐。</p>
will be displayed in the browser as:
祝你19岁 生日快乐。
Notice the extra space. This is what needs to be deleted. The result html should be like this:
<p>祝你<span>19</span>岁生日快乐。</p>
How do I do this?
Note that the nesting (like the span tag) could be arbitrary, but I don't need to consider the content of the nested elements; it should be preserved as-is. Only the text in the outer element needs to be formatted.
This is what I've got:
# -*- coding: utf-8 -*-
import lxml.html
import re
s1 = u"""<p>祝你19岁
生日快乐。</p>"""
p1 = lxml.html.fragment_fromstring(s1)
print p1.text # I get the whole line.
p1.text = re.sub("\s+", "", p1.text)
print lxml.html.tostring(p1) # spaces are removed.
s2 = u"""<p>祝你<span>19</span>岁
生日快乐。</p>"""
p2 = lxml.html.fragment_fromstring(s2)
print p2.text # I get "祝你"
print p2.tail # I get None
i = p2.itertext()
print i.next() # I get "祝你"
print i.next() # I get "19" from <span>
print i.next() # I get the tailed text, but how do I assemble them back?
print p2.text_content() # The whole text, but how do I put <span> back?
>>> from lxml import etree
>>> root = etree.fromstring('<p>祝你<span>19</span>岁\n生日快乐。</p>')
>>> etree.tostring(root)
b'<p>祝你<span>19</span>岁\n生日快乐。</p>'
>>> for e in root.xpath('/p/*'):
... if e.tail:
... e.tail = e.tail.replace('\n', '')
...
>>> etree.tostring(root)
b'<p>祝你<span>19</span>岁生日快乐。</p>'
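This only cleans the tails of the child elements. If a line break can also occur before the first child, a sketch that additionally cleans root.text (while still leaving nested content alone, as the question requires) would be:
from lxml import etree
root = etree.fromstring('<p>祝你\n<span>19</span>岁\n生日快乐。</p>')
if root.text:
    root.text = root.text.replace('\n', '')
for child in root:
    if child.tail:
        child.tail = child.tail.replace('\n', '')
print(etree.tostring(root, encoding='unicode'))
# <p>祝你<span>19</span>岁生日快乐。</p>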
Controversially, I wonder whether this is possible without using an HTML/XML parser at all, given that the extra space appears to be caused by line wrapping.
I built a regular expression to look for whitespace between Chinese characters with the help of this answer: https://stackoverflow.com/a/2718268/267781
I don't know whether the catch-all (any whitespace between two characters) or the more specific [char]\n\s*[char] is better suited to your problem.
# -*- coding: utf-8 -*-
import re
# Han character class built from this answer: https://stackoverflow.com/a/2718268/267781
han = (u'[\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029'
       u'\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d'
       u'\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d]')
# Catch-all: any run of whitespace between two Han characters
fixwhitespace2 = re.compile(u'(%s)\\s+(%s)' % (han, han), re.M)
# More specific: a newline plus trailing indent between two Han characters
fixwhitespace = re.compile(u'(%s)\\n\\s*(%s)' % (han, han), re.M)
sample = u'<html><body><p>\u795d\u4f6019\u5c81\n \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'
fixwhitespace.sub(r'\1\2', sample)
Yielding:
<html><body><p>祝你19岁生日快乐。</p></body></html>
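An alternative sketch of my own uses lookaround, so the surrounding characters are never consumed and no backreferences are needed (for brevity it only covers the BMP Han range, which is narrower than the class above):
import re
han = u'[\u4e00-\u9fff]'  # CJK Unified Ideographs only
fixws = re.compile(u'(?<=%s)\\s+(?=%s)' % (han, han))
print(fixws.sub(u'', u'\u795d\u4f6019\u5c81\n \u751f\u65e5\u5feb\u4e50\u3002'))
# 祝你19岁生日快乐。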
However, here's how you might do it using the parser and xpath to find linefeeds:
# -*- coding: utf-8 -*-
from lxml import etree
import re
han = (u'[\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029'
       u'\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d'
       u'\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d]')
fixwhitespace = re.compile(u'(%s)\\n\\s*(%s)' % (han, han), re.M)
sample = u'<html><body><p>\u795d\u4f6019\u5c81\n \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'
doc = etree.HTML(sample)
for t in doc.xpath("//text()[contains(.,'\n')]"):
    if t.is_tail:
        t.getparent().tail = fixwhitespace.sub(r'\1\2', t)
    elif t.is_text:
        t.getparent().text = fixwhitespace.sub(r'\1\2', t)
print etree.tostring(doc)
Yields:
<html><body><p>祝你19岁生日快乐。</p></body></html>
I'm curious what the best match to your working data is.

Escaping '<' and '>' in xml when using xml.dom.minidom

I am stuck while escaping "<" and ">" in the xml file using xml.dom.minidom.
I tried to get the unicode hex value and use that instead
http://slayeroffice.com/tools/unicode_lookup/
Tried to use the standard "&lt;" and "&gt;" but still with no success.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = '<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
same result with writexml()
Also tried specifying the encodings 'UTF-8', 'utf-8', and 'utf' in the toxml() and writexml() calls, but with the same results.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = u'<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
u'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
Tried other ways but with same results.
The only way I could work around this is by overriding the writer:
import xml.dom.minidom as md
# XXX Hack to write '<' and '>' through unescaped
def wd(writer, data):
    data = data.replace("&lt;", "<").replace("&gt;", ">")
    writer.write(data)
md._write_data = wd
Edit - This is the code.
import xml.dom.minidom as md
doc = md.Document()
entity_descr = doc.createElement("EntityDescriptor")
doc.appendChild(entity_descr)
entity_descr.setAttribute('xmlns', 'urn:oasis:names:tc:SAML:2.0:metadata')
entity_descr.setAttribute('xmlns:saml', 'urn:oasis:names:tc:SAML:2.0:assertion')
entity_descr.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
# Get the entity_id from saml20_idp_settings
entity_descr.setAttribute('entityID', self.group['entity_id'])
idpssodescr = doc.createElement('IDPSSODescriptor')
idpssodescr.setAttribute('WantAuthnRequestsSigned', 'true')
idpssodescr.setAttribute('protocolSupportEnumeration',
                         'urn:oasis:names:tc:SAML:2.0:protocol')
entity_descr.appendChild(idpssodescr)
keydescr = doc.createElement('KeyDescriptor')
keydescr.setAttribute('use', 'signing')
idpssodescr.appendChild(keydescr)
keyinfo = doc.createElement('ds:KeyInfo')
keyinfo.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
keydescr.appendChild(keyinfo)
x509data = doc.createElement('ds:X509Data')
keyinfo.appendChild(x509data)
# check this part
s = "this is a cert blah blah"
x509cert = doc.createElement('ds:X509Certificate')
cert = doc.createTextNode(s)
x509cert.appendChild(cert)
x509data.appendChild(x509cert)
sso = doc.createElement('SingleSignOnService')
sso.setAttribute('Binding', 'urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect')
sso.setAttribute('Location', 'http://googleapps/singleSignOn')
idpssodescr.appendChild(sso)
# Write the metadata file.
fobj = open('metadata.xml', 'w')
doc.writexml(fobj, " ", "", "\n", "UTF-8")
fobj.close()
This produces
<?xml version="1.0" encoding="UTF-8"?>
<EntityDescriptor entityID="skar" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
<IDPSSODescriptor WantAuthnRequestsSigned="true"
protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
<KeyDescriptor use="signing">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:X509Data>
<ds:X509Certificate>
this is a cert blah blah
</ds:X509Certificate>
</ds:X509Data>
</ds:KeyInfo>
</KeyDescriptor>
<SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
Location="http:///singleSignOn"/>
</IDPSSODescriptor>
</EntityDescriptor>
Note the "This is a cert" comes seperately
Have broken my head over this but with the same results.
This is not a bug, it is a feature. To insert actual XML, insert DOM objects instead. Text inside an XML tag needs to be entity-escaped, though, to be valid XML.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
eh = doc.createElement("hello")
s1 = 'bhaskar'
text = doc.createTextNode(s1)
eh.appendChild(text)
e.appendChild(eh)
e.toxml()
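Since <hello> is now a real element node rather than text, e.toxml() should come out as <abc><hello>bhaskar</hello></abc>, with no entity escaping.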
EDIT: I don't know what Python's API is like, but it looks very similar to C#'s, so you might be able to do something like e.innerXml = s1 to do what you're trying to do... but that could be bad. The better thing to do is parse it and appendChild it as well.
EDIT 2: I just ran this via Python locally, and there's definitely something wrong on your end, not in the libraries. Make sure that your string doesn't have any newlines or whitespace at the start of it. For reference, the test code I used was:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom.minidom import Document
>>> cert = "---- START CERTIFICATE ----\n Hello world\n---- END CERTIFICATE ---"
>>> doc = Document()
>>> e = doc.createElement("cert")
>>> certEl = doc.createTextNode(cert)
>>> e.appendChild(certEl)
<DOM Text node "'---- START'...">
>>> print e.toxml()
<cert>---- START CERTIFICATE ----
Hello world
---- END CERTIFICATE ---</cert>
>>>
EDIT 3: The final edit. The problem is in your writexml call. Simply using the following fixes this:
doc.writexml(fobj)
# or
doc.writexml(fobj, "", " ", "")
Unfortunately, it seems that you won't be able to use the newline parameter to get pretty printing though... it seems that the Python library (or at least minidom) will modify TextNodes while printing them. Not so much a poor implementation as a naive one. A shame really...
If you use "<" as text in XML, you need to escape it, else it is considered markup. So xml.dom is right in escaping it, since you've asked for a text node.
Assuming you really want to insert a piece of XML, I recommend to use createElement("hello"). If you have a fragment of XML that you don't know the structure of, you should first parse it, and then move the nodes of that parse result into the other tree.
If you want to hack, you can inherit from xml.dom.minidom.Text, and overwrite the writexml method. See the source of minidom for details.
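A sketch of that hack (RawText is my own name; note that writing unescaped text means parsers will see it as markup, so the output may no longer round-trip):
import xml.dom.minidom as md

class RawText(md.Text):
    # Write the node's data verbatim, bypassing minidom's entity escaping.
    def writexml(self, writer, indent="", addindent="", newl=""):
        writer.write("%s%s%s" % (indent, self.data, newl))

doc = md.Document()
e = doc.createElement("abc")
t = RawText()
t.data = "<hello>bhaskar</hello>"
t.ownerDocument = doc
e.appendChild(t)
print(e.toxml())  # <abc><hello>bhaskar</hello></abc>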
