Flatten HTML code, with tree structure delimiters - python

I have some raw HTML scraped from a random website, possibly messy, with some scripts, self-closing tags, etc. Example:
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
I want to return the HTML DOM without any strings, attributes or the like, only the tag structure, as a string showing the relations between parents, children and siblings. This would be my expected output (though the use of brackets is a personal choice):
'[html[head[meta, title], body[h1, p[span]]]]'
So far I have tried using BeautifulSoup (this answer was helpful). I figured out I should split the work into two steps:
- extract the tag "skeleton" of the HTML DOM, emptying everything like strings, attributes, and anything before the <html> tag.
- return the flat HTML DOM, structured with tree-like delimiters, such as brackets, marking children and siblings.
I posted the code as a self-answer.

You can use recursion. The name attribute gives the name of the tag. You can check whether an element's type is bs4.element.Tag to confirm that it is a tag rather than a string or comment.
import bs4

ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(ex, 'html.parser')
result = ''  # accumulates the flattened structure (named result to avoid shadowing the built-in str)

def recursive_child_search(tag):
    global result
    result += tag.name
    # keep only real tags; strings, comments and the doctype are skipped
    child_tag_list = [x for x in tag.children if isinstance(x, bs4.element.Tag)]
    if len(child_tag_list) > 0:
        result += '['
    for i, child in enumerate(child_tag_list):
        recursive_child_search(child)
        if i != len(child_tag_list) - 1:  # if not last child
            result += ', '
    if len(child_tag_list) > 0:
        result += ']'

recursive_child_search(soup.find())
print(result)
#html[head[meta, title], body[h1, p[span]]]
print('[' + result + ']')
#[html[head[meta, title], body[h1, p[span]]]]
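The same recursion can also return the string instead of appending to a global, which makes the function reusable on any tag. A minimal sketch, assuming BeautifulSoup 4 is installed; `flatten` and the simplified `html` sample are illustrative names and data, not part of the answer above:

```python
import bs4

def flatten(tag):
    """Return 'name[child, child, ...]' for a tag, recursing into child tags."""
    # Keep only real tags; strings, comments and doctypes are skipped.
    children = [c for c in tag.children if isinstance(c, bs4.element.Tag)]
    if not children:
        return tag.name
    return tag.name + '[' + ', '.join(flatten(c) for c in children) + ']'

# Simplified sample (hypothetical), same shape as the question's example.
html = ("<html><head><meta/><title>Some text</title></head>"
        "<body><h1>Some other text</h1>"
        "<p><span>My</span> first paragraph.</p></body></html>")
soup = bs4.BeautifulSoup(html, 'html.parser')
flat = '[' + flatten(soup.find()) + ']'
print(flat)  # [html[head[meta, title], body[h1, p[span]]]]
```

Because `flatten` has no side effects, it can be called repeatedly or on a subtree such as `soup.body`.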

I post here my first solution, which is still a bit messy and uses a lot of regex. The first function gets the emptied DOM structure and outputs it as a raw string, the second function modifies the string to add the delimiters.
import re
from bs4 import BeautifulSoup as soup  # assumed import: the snippet calls BeautifulSoup under the name soup

def clear_tags(htmlstring, remove_scripts=False):
    # drop everything before the opening <html tag (doctype, comments, ...)
    htmlstring = re.sub("^.*?(<html)", r"\1", htmlstring, flags=re.DOTALL)
    finishyoursoup = soup(htmlstring, 'html.parser')
    for tag in finishyoursoup.find_all():
        tag.attrs = {}  # remove all attributes
        for sub in tag.contents:
            if sub.string:
                sub.string.replace_with('')  # empty the text content
    if remove_scripts:
        for tag in finishyoursoup.find_all(['script', 'noscript']):
            tag.extract()
    return str(finishyoursoup)

clear_tags(ex)
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></b

def flattened_html(htmlstring):
    squeletton = clear_tags(htmlstring)
    step1 = re.sub(r"<([^/]*?)>", r"[\1", squeletton)  # replace beginning of tag
    step2 = re.sub(r"</(.*?)>", r"]", step1)  # replace end of tag
    step3 = re.sub(r"<(.*?)/>", r"[\1]", step2)  # deal with self-closing tag
    step4 = re.sub(r"\]\[", ", ", step3)  # gather sibling tags with comma
    return step4

flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'
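For comparison, the same skeleton can be built without regular expressions or BeautifulSoup by tracking open tags on a stack. This is a rough sketch that swaps in the standard library's html.parser; the names `SkeletonParser`, `render`, `flatten_html` and the `VOID` set are assumptions made for illustration, and very messy markup may still nest differently than a browser would:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset of the HTML void elements).
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
        'input', 'link', 'meta', 'source', 'track', 'wbr'}

class SkeletonParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ('root', [])      # each node is (name, children)
        self.stack = [self.root]      # currently open elements

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID:           # void elements are never pushed
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Explicit self-closing tag such as <meta/>: a leaf node.
        self.stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Close back to the matching open tag; stray closers are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def render(node):
    name, children = node
    if not children:
        return name
    return name + '[' + ', '.join(render(c) for c in children) + ']'

def flatten_html(html):
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return '[' + ', '.join(render(c) for c in parser.root[1]) + ']'

sample = ("<!DOCTYPE html><html lang='en-US'><head>"
          "<meta http-equiv='Content-Type'/><title>Some text</title></head>"
          "<body><h1>Some other text</h1>"
          "<p><span style='color:red'>My</span> first paragraph.</p></body></html>")
flat = flatten_html(sample)
print(flat)  # [html[head[meta, title], body[h1, p[span]]]]
```

Text, attributes and the doctype are simply never recorded, so no separate cleaning step is needed.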

Related

Trying to isolate URL suffixes from list of href tags

I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
<a href="/scp-2756">SCP-2756</a>,
<a href="/scp-002">SCP-002</a>,
<a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re

def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    # Searches HTML for text containing "SCP-" and href tags containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-"))
    param = "SCP-" + str(SkateP(x))  # SkateP takes int and inserts an appropriate number of 0's.
    # Sort out references to the article being searched.
    # Iterate over a copy so that removing items does not skip elements.
    for i in list(ref):
        if str(param) in i:
            ref.remove(i)
    if ref != []:
        print(ref)
The main idea I've tried to use is finding every item that contains items in quotations, but obviously that just returned the same list. What I want to be able to do is select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, change the initial code to only extract the href content in quotations to the list.
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
If I understand correctly, you want to extract the href attribute - for that, you can use i.get('href') (or probably even just i['href']).
With .select and list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
If you want the parent url attached:
root_url = 'https://PARENT-URL.com' ## replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return the following (the order may vary, since a set is used to drop duplicates)
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
If str(param) were 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
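Plain string concatenation works as long as every href starts with a slash. As a hedged alternative, the standard library's urllib.parse.urljoin resolves each href against the parent URL, and a small loop can drop duplicates while keeping the original order. A minimal sketch; `absolute_links`, its `skip` parameter and `https://example.com` are illustrative assumptions, not part of the answer above:

```python
from urllib.parse import urljoin

def absolute_links(root_url, hrefs, skip=None):
    """Resolve hrefs against root_url, dropping duplicates and one optional href."""
    seen = set()
    links = []
    for href in hrefs:
        if href == skip or href in seen:
            continue
        seen.add(href)
        links.append(urljoin(root_url, href))
    return links

# hrefs as they would come out of a.get('href'); '/scp-002' appears twice.
hrefs = ['/scp-1512', '/scp-2756', '/scp-002', '/scp-004', '/scp-002']
links = absolute_links('https://example.com', hrefs, skip='/scp-002')
print(links)
# ['https://example.com/scp-1512', 'https://example.com/scp-2756', 'https://example.com/scp-004']
```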

Skipping "nested tags" when parsing XML with Python

I currently have an XML file that I'd like to parse with Python. I'm using Python's Element Tree and it works fine except I had a question.
The file currently looks something like:
<Instance>
  <TextContent>
    <Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>
What I basically want to do is skip over the nested tags inside of the <Sentence> tag (i.e. <Thing>). One way that I've found to do that is to get the text content up until the tag, the text content of the tag, and concatenate them. The code that I'm using is:
import xml.etree.ElementTree as ET

xtree = ET.parse('some_file.xml')
xroot = xtree.getroot()
for node in xroot:
    text_before = node[0][0].text
    text_nested = node[0][0][0].text
How do I get the portion of text that comes after the nested tag?
Better yet, is there a way that I can completely disregard the nested tag?
Thanks in advance.
I slightly changed your source XML file, so that Sentence contains two child elements:
<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>
To find the Sentence element, run: st = xroot.find('.//Sentence').
Then define the following generator:
def allTextNodes(root):
    # text before the first child element
    if root.text is not None:
        yield root.text
    # text that follows each child element (its tail)
    for child in root:
        if child.tail is not None:
            yield child.tail
To see the list of all direct descendant text nodes, run:
lst = list(allTextNodes(st))
The result is:
['Hello, my ', ' is John and his ', ' is Tom.']
But to get the concatenated text, as a single variable, run:
txt = ''.join(allTextNodes(st))
getting: Hello, my  is John and his  is Tom. (note the double spaces where the two Thing elements were omitted).

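The two readings of "skipping the nested tag" can be put side by side: dropping the nested element together with its text, or ignoring only the tag and keeping its text. A small sketch using only xml.etree.ElementTree; `direct_text` is a hypothetical helper name, and collapsing whitespace with split/join is an assumption about the desired output:

```python
import xml.etree.ElementTree as ET

def direct_text(elem):
    """Text directly inside elem: its .text plus the .tail of each child."""
    parts = [elem.text or '']
    parts.extend(child.tail or '' for child in elem)
    return ''.join(parts)

xml_src = """<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>"""

root = ET.fromstring(xml_src)
sentence = root.find('.//Sentence')

raw = direct_text(sentence)            # nested elements and their text dropped
clean = ' '.join(raw.split())          # collapse the doubled spaces
full = ''.join(sentence.itertext())    # tags ignored, nested text kept

print(clean)  # Hello, my is John and his is Tom.
print(full)   # Hello, my name is John and his name is Tom.
```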
Remove the first and last node from a nodelist with minidom

I am working on XML parsing and I have been using minidom for my work. There are lots of custom defined entities used in the file, so using lxml has been a pain. DOM seems to ignore that and hence for my current work, I am using DOM.
I need to get all <para> tags from an xml and all inner text inside the tag. Then I need to remove the first occurrence and last occurrence of the tag and get all the text in the remaining tags and their inner text. Here is my code so far:
from xml.dom.minidom import parse

file = 'C:/My_Folders/something.xml'
doc = parse(file)
paras = doc.getElementsByTagName('para')

def getText(paras):
    rc = []
    for node in paras:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # recurse into element nodes to collect their inner text
            rc.append(getText(node.childNodes))
    return ','.join(rc)

print(getText(paras))
In this code, I get all the text from all <para> tags, which I don't want. I don't want the first and last occurrences of the <para> tags. Can someone help me?
Here is the sample XML.
<para
><reviewer-note >tlewis</reviewer-note
></para>
<para><user-typing>Resilient.</para>
<para>hashing.</para>
<para>"X" release.</para>
<para>[See <url
href="http://www.google.com"
>Trunk/ECMP Groups</url>.]</para>
I don't want the first tag text. i.e. tlewis, and also the last tag text. i.e. Trunk/ECMP Groups. I want the other <para> tag text such as Resilient, hashing, and "X" release and concatenate these 3.
Needed output:
Resilient,hashing,"X" release
You can use BeautifulSoup for parsing XML. In my example I selected all <para> tags with select() method and then concatenated them together (without first and last one):
data = """<para
><reviewer-note >tlewis</reviewer-note
></para>
<para><user-typing>Resilient.</para>
<para>hashing.</para>
<para>"X" release.</para>
<para>[See <url
href="http://www.google.com"
>Trunk/ECMP Groups</url>.]</para>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
all_params = soup.select('para')[1:-1]
string_output = ''
for param in all_params:
    string_output += param.text.strip() + ','
string_output = string_output.rstrip(',')
print(string_output)
Output:
Resilient.,hashing.,"X" release.
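Since the question uses minidom, the same slicing idea can be sketched there as well. This assumes a well-formed document: the sample below closes the `<user-typing>` tag and wraps everything in a `<doc>` root, both of which are assumptions added for the sketch, and `node_text` is a hypothetical helper:

```python
from xml.dom.minidom import parseString

def node_text(node):
    """Concatenate all text found in node and its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(node_text(child) for child in node.childNodes)

xml_src = """<doc>
<para><reviewer-note>tlewis</reviewer-note></para>
<para><user-typing>Resilient.</user-typing></para>
<para>hashing.</para>
<para>"X" release.</para>
<para>[See <url href="http://www.google.com">Trunk/ECMP Groups</url>.]</para>
</doc>"""

doc = parseString(xml_src)
# Slice off the first and last <para> before collecting any text.
paras = doc.getElementsByTagName('para')[1:-1]
result = ','.join(node_text(p).strip() for p in paras)
print(result)  # Resilient.,hashing.,"X" release.
```

Slicing the node list first, rather than filtering the joined string afterwards, keeps the first and last paragraphs out of the result regardless of what they contain.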

python + beautiful soup, navigating the parse tree without the aid of attributes?

i'm trying to scrape text from this page:
http://codingbat.com/prob/p187868
specifically, i want to scrape two strings from the page, to combine as the key in a dictionary with the problem statement as value. these are the two parts of the name of the problem (here: 'Warmup-1' and 'sleepin'). however, the strings are contained in different levels of the parse tree and this is creating problems.
abstractly, the problem is this:
i'm trying to scrape text from a parse tree of:
div-->{[a[span'h2'[string1]]], [span'h2'[string2]], some other tags}
since they are both contained in 'span' tags with the attribute class='h2', i can scrape a list of these and then select from the list.
div_nameparts = name_div.find_all('span', class_='h2')
name1 = div_nameparts[0].string
name2 = div_nameparts[1].string
problem_name = name1+' > '+name2
print(problem_name)
but what if those tags didn't share an attribute like they do here ('h2')?
if i try to navigate the parse tree using div.a.string, i can get the first tag - string1. but div.span.string does not return the second value (string2).
name1 = name_div.a.string
name2 = name_div.span.string
instead it again returns the first (string1), apparently navigating to div.a.span (the child of a child) and stopping, before finding its way to div.span (the next child).
and if i try div.a.next_sibling to try to navigate to div.span and get the string with div.span.string,
name1 = name_div.a.string
name2_div = name_div.a.next_sibling
name2 = name2_div.string
it returns None rather than the string.
is there a better way to navigate the parse tree to reach these span tags?
thanks in advance!
This'll work as long as the 'greater than' symbol (' > ') with leading and trailing space doesn't appear before the pair of strings you want:
gt = soup.find(text=' > ')
string1 = gt.findPrevious('span').text
string2 = gt.findNext('span').text
print(string1, gt, string2, sep='')
The output:
Warmup-1 > sleepIn
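Two other options avoid relying on the ' > ' text at all. `name_div.span` behaves like `name_div.find('span')`, which searches all descendants, so it finds the span inside the `a` first; and `next_sibling` can return the text node between the two tags rather than the next tag. A sketch, assuming BeautifulSoup 4 is installed; the markup below is a hypothetical reduction of the structure described in the question, not the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup with the shape div > {a > span, text, span, other tags}.
html = ('<div><a href="/java/Warmup-1"><span class="h2">Warmup-1</span></a>'
        ' &gt; <span class="h2">sleepIn</span><p>other tags</p></div>')
soup = BeautifulSoup(html, 'html.parser')
name_div = soup.find('div')

# .span searches descendants, so it lands on the span nested inside <a>.
first_found = name_div.span.get_text()

# .next_sibling is the text node between the tags, not the second span.
sibling = name_div.a.next_sibling

name1 = name_div.a.get_text()
# Option 1: only look at direct children of the div.
name2 = name_div.find_all('span', recursive=False)[0].get_text()
# Option 2: jump to the next sibling that is a <span>, skipping text nodes.
name2_alt = name_div.a.find_next_sibling('span').get_text()

print(name1 + ' > ' + name2)  # Warmup-1 > sleepIn
```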

python - sorting strings via xml attribute, .text malforms xml data

#!/usr/bin/env python
import os, sys, os.path
import string
import xml.etree.ElementTree as ET  # needed for the ET.* calls below

def sort_strings_file(xmlfile, typee):
    """sort all strings within given strings.xml file"""
    all_strings = {}
    orig_type = typee
    # read original file
    tree = ET.ElementTree()
    tree.parse(xmlfile)
    # iter over all strings, stick them into dictionary
    for element in list(tree.getroot()):
        all_strings[element.attrib['name']] = element.text
    # create new root element and add all strings sorted below
    newroot = ET.Element("resources")
    for key in sorted(all_strings.keys()):
        # Check for IDs
        if typee == "id":
            typee = "item"
        # set main node type
        newstring = ET.SubElement(newroot, typee)
        # add id attrib
        if orig_type == "id":
            newstring.attrib['type'] = "id"
        # continue on
        newstring.attrib['name'] = key
        newstring.text = all_strings[key]
    # write new root element back to xml file
    newtree = ET.ElementTree(newroot)
    newtree.write(xmlfile, encoding="UTF-8")
This works great and all, but if a string starts with a tag like <b>, it breaks badly.
Example:
<string name="uploading_to"><b>%s</b> Odovzdávanie do</string>
becomes
<string name="uploading_to" />
I've looked into the xml.etree Element class, but it seems to only have .text method. I just need a way to pull everything in between xml tags. No, I can't change the input data. It comes directly from an Android APK ready to be translated, I cannot predict how / what the data comes in besides the fact that it must be valid XML Android code.
I think you are looking for the itertext() method instead. .text only returns text directly contained at the start of the element:
>>> test = ET.fromstring('<elem>Sometext <subelem>more text</subelem> rest</elem>')
>>> test.text
'Sometext '
>>> ''.join(test.itertext())
'Sometext more text rest'
The .itertext() iterator, on the other hand, lets you find all text contained in the element, including text inside nested elements.
If, however, you only want text directly contained in an element, skipping the contained children, you want the combination of .text and the .tail values of each of the children:
>>> (test.text or '') + ''.join(child.tail or '' for child in test)
'Sometext  rest'
If you need to capture everything contained, then you need to do a little more work; capture the .text, and cast each child to text with ElementTree.tostring():
>>> (test.text or '') + ''.join(ET.tostring(child, encoding='unicode') for child in test)
'Sometext <subelem>more text</subelem> rest'
ET.tostring() takes the element tail into account. I use (test.text or '') because the .text attribute can be None as well.
You can capture that last method in a function:
def innerxml(elem):
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3
    return (elem.text or '') + ''.join(ET.tostring(child, encoding='unicode') for child in elem)
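For the sorting problem in the question, another option is to reorder the existing elements rather than copying their .text into new ones, so nested markup such as <b> is carried along untouched. A sketch under stated assumptions: `sort_children_by_name` is a hypothetical helper, the `xml_src` sample is made up around the string from the question, and the Python 3 form of `innerxml` is repeated only to keep the block self-contained:

```python
import xml.etree.ElementTree as ET

def innerxml(elem):
    """Everything between elem's tags: leading text plus serialized children."""
    return (elem.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in elem)

def sort_children_by_name(root):
    """Reorder root's children in place by their 'name' attribute."""
    # Slice assignment moves whole elements, so text, children and tails survive.
    root[:] = sorted(root, key=lambda el: el.get('name', ''))
    return root

xml_src = ('<resources>'
           '<string name="uploading_to"><b>%s</b> Odovzdávanie do</string>'
           '<string name="app_name">Demo</string>'
           '</resources>')

root = ET.fromstring(xml_src)
sort_children_by_name(root)

names = [el.get('name') for el in root]
inner = innerxml(root.find("string[@name='uploading_to']"))
print(names)  # ['app_name', 'uploading_to']
print(inner)  # <b>%s</b> Odovzdávanie do
```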
