I am analyzing StackOverflow's dump file "Posts.Small.xml" using pySpark. I want to separate 'code block' from 'text' in a Row. A typical parsed row looks like:
['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
'", u\'This code has worked fine for me in VB.NET in the past.',
'\', u"</p>
When setting a form\'s opacity should I use a decimal or double?"]']
I've tried "itertools" and some python functions but couldn't get the result.
My initial code to extract the above row is:
postsXml = textFile.filter(lambda line: not line.startswith("<?xml version="))
postsRDD = postsXml.map(............)
tokensentRDD = postsRDD.map(lambda x:(x[0], nltk.sent_tokenize(x[3])))
new = tokensentRDD.map(lambda x: x[1]).take(1)
a = ''.join(map(str,new))
b = a.replace("&lt;", "<")
final = b.replace("&gt;", ">")
nltk.sent_tokenize(final)
Any ideas are appreciated!
You can extract the code contents by using XPath (the lxml library will help) and then extract the text content selecting everything else, for example:
import lxml.etree
data = '''<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
<p>When I try to build it, I get this error:</p>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
<p>I tried making <code>trans</code> a <code>double</code>.</p>'''
html = lxml.etree.HTML(data)
code_blocks = html.xpath('//code/text()')
text_blocks = html.xpath('//*[not(descendant-or-self::code)]/text()')
The easiest way will probably be to apply a regex to the text, matching the tags '<code>' and '</code>'. That would enable you to find the code blocks. You don't say what you would do with them afterwards, though. So ...
from itertools import zip_longest
sample_paras = [
"""<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
"""This code has worked fine for me in VB.NET in the past.""",
"""</p>
When setting a form\'s opacity should I use a decimal or double?""",
]
single_block = " ".join(sample_paras)
import re
separate_code = re.split(r"</?code>", single_block)
text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2))
print("Text:\n")
for t in text_blocks:
    print("--")
    print(t)
print("\n\nCode:\n")
for t in code_blocks:
    print("--")
    print(t)
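A variation on the regex approach, as a minimal sketch: it treats only <pre><code> blocks as code, leaves inline <code> spans in the text, and decodes the escaped body with html.unescape instead of chained replace() calls. The helper name split_code_and_text and the shortened sample are assumptions made for illustration; they are not taken from the dump or from the answers above.

```python
import html
import re

# Hypothetical helper, named for illustration only.
def split_code_and_text(body):
    """Return (code_blocks, text) for one escaped post body."""
    # The body is entity-escaped in the dump, so decode it first.
    body = html.unescape(body)
    # Only <pre><code>...</code></pre> counts as a code block here;
    # inline <code> spans stay part of the text.
    block = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)
    code_blocks = block.findall(body)
    text = block.sub(" ", body)
    # Drop the remaining tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", "", text)
    text = " ".join(text.split())
    return code_blocks, text

# Shortened sample, escaped the way the question describes the row.
sample = html.escape(
    "<p>This is my code:</p>\n"
    "<pre><code>decimal trans = trackBar1.Value / 5000;\n"
    "this.Opacity = trans;\n"
    "</code></pre>\n"
    "<p>I tried making <code>trans</code> a <code>double</code>.</p>"
)
code_blocks, text = split_code_and_text(sample)
print(code_blocks)
print(text)
```

In PySpark this could presumably be applied with something like postsRDD.map(lambda x: (x[0], split_code_and_text(x[3]))), assuming x[3] holds the escaped body as in the question.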
I am reading in an HTML document and want to store the HTML nested within a div tag of a certain name, while maintaining its structure (the spacing). This is so that I can convert an HTML doc into components for React. I am struggling with how to store the structure of the nested HTML, and how to locate the correct closing tag for the div that denotes that everything nested within it will become a React component (div class='rc-componentname' is the opening tag). Any help would be very appreciated. Thanks!
Edit: I assume regexes are the best way to go about this. I haven't used regex before, so if that is correct, it would be great if someone could point me in the right direction for the expression to use in this context.
import os
components = []
class react_template():
    def __init__(self, component_name): # add nested html as second element
        self.Import = "import React, { Component } from 'react';"
        self.Class = "Class " + component_name + ' extends Component {'
        self.Render = "render() {"
        self.Return = "return "
        self.Export = "Default export " + component_name + ";"
def react(component):
    r = react_template(component)
    if not os.path.exists('components'): # create components folder
        os.mkdir('components')
    os.chdir('components')
    if not os.path.exists(component): # create folder for component
        os.mkdir(component)
    os.chdir(component)
    with open(component + '.js', 'wb') as f: # create js component file
        for j_key, j_code in r.__dict__.items():
            f.write(j_code.encode('utf-8') + '\n'.encode('utf-8'))
        f.close()
def process_html():
    with open('file.html', 'r') as f:
        for line in f:
            if 'rc-' in line:
                char_soup = list(line)
                for index, char in enumerate(char_soup):
                    if char == 'r' and char_soup[index+1] == 'c' and char_soup[index+2] == '-':
                        sliced_soup = char_soup[int(index+3):]
                        c_slice_index = sliced_soup.index("\'")
                        component = "".join(sliced_soup[:c_slice_index])
                        components.append(component)
                        innerHTML(sliced_soup)
                        # react(component)
def innerHTML(sliced_soup): # work in progress
    first_closing = sliced_soup.index(">")
    sliced_soup = "".join(sliced_soup[first_closing:]).split(" ")
def generate_components(components):
    for c in components:
        react(c)
if __name__ == "__main__":
    process_html()
I see you've used the word soup in your code... maybe you've already tried and disliked BeautifulSoup? If you haven't tried it, I'd recommend you look at BeautifulSoup instead of attempting to parse HTML with regex. Although regex would be sufficient for a single tag or even a handful of tags, markup languages are deceptively simple. BeautifulSoup is a fine library and can make things easier for dealing with markup.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This will allow you to treat the entirety of your html as a single object and enable you to:
# create a list of specific elements as objects
soup.find_all('div')
# find a specific element by id
soup.find(id="custom-header")
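For the specific task in the question (keeping the nested HTML of each rc- div with its spacing), here is a rough sketch that uses the standard library's html.parser rather than BeautifulSoup, so it runs without extra packages. The class name ComponentExtractor and the sample page are assumptions for illustration. The sketch counts nested div tags to find the matching closing tag, drops HTML comments, and treats an rc- div nested inside another component as part of the outer one.

```python
from html.parser import HTMLParser

# Hypothetical class, named for illustration only.
class ComponentExtractor(HTMLParser):
    """Collect the inner HTML of every <div class="rc-...">."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.components = {}  # component name -> inner HTML
        self._name = None     # component currently being captured
        self._depth = 0       # open <div> tags inside that component
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._name is not None:
            if tag == "div":
                self._depth += 1
            self._parts.append(self.get_starttag_text())
            return
        if tag == "div":
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("rc-"):
                    self._name, self._depth, self._parts = cls[3:], 1, []
                    break

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the div depth.
        if self._name is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._name is None:
            return
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # This is the closing tag that matches the rc- div.
                self.components[self._name] = "".join(self._parts)
                self._name = None
                return
        self._parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self._name is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._name is not None:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self._name is not None:
            self._parts.append("&#%s;" % name)

page = """<body>
<div class='rc-header'>
  <h1>Title</h1>
  <div><p>nested</p></div>
</div>
<p>outside</p>
</body>"""
parser = ComponentExtractor()
parser.feed(page)
parser.close()
print(parser.components)
```

Each value in parser.components could then be handed to the template class as the nested HTML for that component.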
I can't select a form by its name using Mechanize in Rails. The source code of the page I am trying to take data from looks like this:
var strAccesBamPoppin = "";
if(!emailing){
strAccesBamPoppin = '<form name="bamaccess_' + idTCM + '" id="bamaccess_' + idTCM + '" class="bamaccessDecloi" autocomplete="off" method="post" action="'+chemin+'"';
if (typeConnexion == "True") {strAccesBamPoppinPoppin = strAccesBamPoppin +'>';
With python, I would use something like
XX.select_form('bamaccess')
What would be the equivalent with Ruby? Thanks.
XX.forms.select{|x| x[:name][/bamaccess/]}
The above should work for sure.
XX.forms_with(name: /bamaccess/)
This will return an array of all the forms whose name contains bamaccess.
Keep in mind that /bamaccess/ is a regexp; a regexp is needed here because the actual name has the form bamaccess_*****.
I have a program written in Python that prints code such as:
<script type="text/javascript" language="JavaScript">
ArtistName = "FUN.";
SongName = "We Are Young";
</script>
I have tried writing a script in awk to save the ArtistName and SongName as variables, but can't seem to figure it out. Is there a way to do this in Python?
If your data is consistent enough, you might be able to get away using a regex (as long as you don't have names with an " etc...), eg:
text="""<script type="text/javascript" language="JavaScript">
ArtistName = "FUN.";
SongName = "We Are Young";
</script> """
import re
print(dict(re.findall(r'((?:Artist|Song)Name)\s=\s"([^"]*)"', text)))
# {'ArtistName': 'FUN.', 'SongName': 'We Are Young'}
You can save ArtistName and SongName as keys in a dictionary using regular expressions if you want.
Here are some links to explanations of regular expressions: Python Docs and Tutorials Point
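import re
# The string you're parsing (here, the example from the question)
s = '''<script type="text/javascript" language="JavaScript">
ArtistName = "FUN.";
SongName = "We Are Young";
</script>'''
regex = re.compile(r'\w+ = \".*\";')
matches = regex.findall(s)
dict1 = {}
for m in matches:
    elems = m.split(" = ")
    # Drop the trailing semicolon; the surrounding double quotes are kept
    dict1[str(elems[0])] = elems[len(elems) - 1].strip(';')
print (dict1['ArtistName'])
print (dict1['SongName'])
Output (using your example string):
"FUN."
"We Are Young"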
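If the surrounding double quotes are not wanted in the values, a capture group can leave them out. A small sketch, assuming every value is double-quoted, contains no embedded double quote, and is followed by a semicolon; the function name js_string_vars is made up for this example.

```python
import re

# Hypothetical helper, named for illustration only.
def js_string_vars(source):
    """Map each  Name = "value";  assignment to {'Name': 'value'}."""
    # Group 1 is the variable name, group 2 the text between the quotes.
    # Requiring the trailing semicolon skips HTML attributes like type="...".
    pattern = re.compile(r'(\w+)\s*=\s*"([^"]*)"\s*;')
    return dict(pattern.findall(source))

text = '''<script type="text/javascript" language="JavaScript">
ArtistName = "FUN.";
SongName = "We Are Young";
</script>'''
info = js_string_vars(text)
artist, song = info["ArtistName"], info["SongName"]
print(artist)
print(song)
```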
I have a special XML file like the one below:
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
I know in python, I can get the "alarm code", "severity" in tag alarm by:
for alarm_tag in dom.getElementsByTagName('alarm'):
    if alarm_tag.hasAttribute('code'):
        alarmcode = str(alarm_tag.getAttribute('code'))
And I can get the text in tag message like below:
for messages_tag in dom.getElementsByTagName('message'):
    messages = ""
    for message_tag in messages_tag.childNodes:
        if message_tag.nodeType in (message_tag.TEXT_NODE, message_tag.CDATA_SECTION_NODE):
            messages += message_tag.data
But I also want to get values such as dnKinds (database), type (quality_of_service), perceived_severity (minor) and probable_cause (thresholdCrossed) from the description tag.
That is, I also want to parse the text content inside a tag in the XML.
Could anyone help me with this?
Thanks a lot!
Once you have the text from the description tag, the rest has nothing to do with XML parsing. You just need to do some simple string parsing to get the "type = quality_of_service" key/value strings into something nicer to use in Python, like a dictionary.
With some slightly simpler parsing thanks to ElementTree, it would look like this:
messages = """
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
"""
import xml.etree.ElementTree as ET
# Parse XML
tree = ET.fromstring(messages)
# findall() replaces getchildren(), which was removed in Python 3.9
for alarm in tree.findall("alarm"):
    # Get code and severity
    print(alarm.get("code"))
    print(alarm.get("severity"))
    # Grab description text
    descr = alarm.find("description").text
    # Parse "thing=other" into dict like {'thing': 'other'}
    info = {}
    for dl in descr.splitlines():
        if len(dl.strip()) > 0:
            key, _, value = dl.partition("=")
            info[key.strip()] = value.strip()
    print(info)
I'm not completely sure about Python, but after some quick research:
Seeing as you can already get all of the content from the description tag in the XML, can you not split it by line breaks, and then split each line on the equals sign using str.split() to get the name and value separately?
e.g.
for messages_tag in dom.getElementsByTagName('description'):
    messages = ""
    for message_tag in messages_tag.childNodes:
        if message_tag.nodeType in (message_tag.TEXT_NODE, message_tag.CDATA_SECTION_NODE):
            messages += message_tag.data
    # Split the collected text on '=' (not yet line by line)
    tag = messages.split('=')
    tagName = tag[0]
    tagValue = tag[1]
(I haven't taken into account splitting each line up and looping)
But that should get you on the right track :)
AFAIK there is no library to handle the text as DOM elements.
You can however (after you have the message in the message variable) do:
description = {}
messageParts = message.split("\n")
for part in messageParts:
    descInfo = part.split("=")
    description[descInfo[0].strip()] = descInfo[1].strip()
Then you'll have the information you need inside description, in the form of a key-value map.
You should also add error handling to my code...
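One possible shape for that error handling, as a sketch only: blank lines and lines without an equals sign are skipped instead of raising an IndexError, and a missing description tag yields an empty dictionary. The function name parse_description and the alarms dictionary are assumptions for this example; the XML is the sample from the question with the "..." line left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical helper, named for illustration only.
def parse_description(text):
    """Turn 'key = value' lines into a dict, ignoring malformed lines."""
    info = {}
    for line in (text or "").splitlines():
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            continue  # blank line, or no "key =" part: skip it
        info[key.strip()] = value.strip()
    return info

xml_doc = """<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
</alarm-dictionary>"""

root = ET.fromstring(xml_doc)
alarms = {}
for alarm in root.iter("alarm"):
    # findtext() returns None when there is no <description> child.
    alarms[alarm.get("code")] = parse_description(alarm.findtext("description"))
print(alarms["402"])
```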
The opposite may be achieved using pyparsing as follows:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
How could I keep the contents of the tag "table"?
UPDATE 0:
I tried:
# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print(data)
and I get something like this...
garbages
<input type="hidden" name="cassstx" value="en_US:frontend"></form></td></tr></table></span></td></tr></table>
{<"table"> SkipTo:(</"table">) </"table">}
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">}
</div>
even more garbages
UPDATE 2:
Thanks Martelli. What I need is:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
print(thetable)
You could first extract the table (similarly to the way you're now extracting the script but without the removal of course;-), obtaining a thetable string; then, you extract the script, replaceWith(thetable) instead of replaceWith(''). Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. E.g. (to preserve specifically the contents of the table, not the table tags):
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
print(data)
This prints beforebuhafter (what's outside the script tag, with the contents of the table tag sandwiched inside), hopefully "as desired".