python + beautiful soup, navigating the parse tree without the aid of attributes? - python

i'm trying to scrape text from this page:
http://codingbat.com/prob/p187868
specifically, i want to scrape two strings from the page, to combine as the key in a dictionary with the problem statement as value. these are the two parts of the name of the problem (here: 'Warmup-1' and 'sleepin'). however, the strings are contained in different levels of the parse tree and this is creating problems.
abstractly, the problem is this:
i'm trying to scrape text from a parse tree of:
div-->{[a[span'h2'[string1]]], [span'h2'[string2]], some other tags}
since they are both contained in 'span' tags with the attribute class='h2', i can scrape a list of these and then select from the list.
div_nameparts = name_div.find_all('span', class_='h2')
name1 = div_nameparts[0].string
name2 = div_nameparts[1].string
problem_name = name1+' > '+name2
print(problem_name)
but what if those tags didn't share an attribute like they do here ('h2')?
if i try to navigate the parse tree using div.a.string, i can get the first tag - string1. but div.span.string does not return the second value (string2).
name1 = name_div.a.string
name2 = name_div.span.string
instead it again returns the first (string1), apparently navigating to div.a.span (the child of a child) and stopping, before finding its way to div.span (the next child).
and if i try div.a.next_sibling to try to navigate to div.span and get the string with div.span.string,
name1 = name_div.a.string
name2_div = name_div.a.next_sibling
name2 = name2_div.string
it returns an empty string, a value of None.
is there a better/effective way to get to navigate the parse tree to get to these span tags?
thanks in advance!

This'll work as long as the 'greater than' symbol (' > ') with leading and trailing space doesn't appear before the pair of strings you want:
gt = soup.find(text=' > ')
string1 = gt.findPrevious('span').text
string2 = gt.findNext('span').text
print(string1, gt, string2, sep='')
The output:
Warmup-1 > sleepIn

Related

Skipping "nested tags" when parsing XML with Python

I currently have an XML file that I'd like to parse with Python. I'm using Python's Element Tree and it works fine except I had a question.
The file currently looks something like:
<Instance>
<TextContent>
<Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>
</TextContent>
<Instance>
What I basically want to do is skip over the nested tags inside of the <Sentence> tag (i.e. <Thing>). One way that I've found to do that is to get the text content up until the tag, the text content of the tag, and concatenate them. The code that I'm using is:
import xml.etree.ElementTree as ET
xtree = ET.parse('some_file.xml')
xroot = xtree.getroot()
for node in xroot:
text_before = node[0][0].text
text_nested = node[0][0][0].text
How do I get the portion of text that comes after the nested tag?
Better yet, is there a way that I can completely disregard the nested tag?
Thanks in advance.
I changed slightly your source XML file, so that Sentence contains two
child elements:
<Instance>
<TextContent>
<Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
</TextContent>
</Instance>
To find the Sentence element, run: st = xroot.find('.//Sentence').
Then define the following generator:
def allTextNodes(root):
if root.text is not None:
yield root.text
for child in root:
if child.tail is not None:
yield child.tail
To see the list of all direct descendant text nodes, run:
lst = list(allTextNodes(st))
The result is:
['Hello, my ', ' is John and his ', ' is Tom.']
But to get the concatenated text, as a single variable, run:
txt = ''.join(allTextNodes(st))
getting: Hello, my is John and his is Tom. (note double spaces,
"surrounding" both omitted Thing elements.

Flatten HTML code, with tree structure delimiters

I have some raw HTML scraped from a random website, possibly messy, with some scripts, self-closing tags, etc. Example:
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
I want to return the HTML DOM without any string, attributes or such stuff, only the tag structure, in the format of a string showing the relation between parents, children and siblings, this would be my expected output (though the use of brackets is a personnal choice):
'[html[head[meta, title], body[h1, p[span]]]]'
So far I tried using beautifulSoup (this answer was helpful). I figured out I should split the work in two steps:
- extract the tag "skeleton" of the html DOM, emptying everything like strings, attributes, and stuff before the <html>.
- return the flat HTML DOM, but structured with tree-like delimiters indicating each children and siblings, such as brackets.
I posted the code as an self-answer
You can use recursion. The name argument will give the name of the tag. You can check if the type is bs4.element.Tag to confirm if an element is a tag.
import bs4
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup=bs4.BeautifulSoup(ex,'html.parser')
str=''
def recursive_child_seach(tag):
global str
str+=tag.name
child_tag_list=[x for x in tag.children if type(x)==bs4.element.Tag]
if len(child_tag_list) > 0:
str+='['
for i,child in enumerate(child_tag_list):
recursive_child_seach(child)
if not i == len(child_tag_list) - 1: #if not last child
str+=', '
if len(child_tag_list) > 0:
str+=']'
return
recursive_child_seach(soup.find())
print(str)
#html[head[meta, title], body[h1, p[span]]]
print('['+str+']')
#[html[head[meta, title], body[h1, p[span]]]]
I post here my first solution, which is still a bit messy and uses a lot of regex. The first function gets the emptied DOM structure and outputs it as a raw string, the second function modifies the string to add the delimiters.
import re
def clear_tags(htmlstring, remove_scripts=False):
htmlstring = re.sub("^.*?(<html)",r"\1", htmlstring, flags=re.DOTALL)
finishyoursoup = soup(htmlstring, 'html.parser')
for tag in finishyoursoup.find_all():
tag.attrs = {}
for sub in tag.contents:
if sub.string:
sub.string.replace_with('')
if remove_scripts:
[tag.extract() for tag in finishyoursoup.find_all(['script', 'noscript'])]
return(str(finishyoursoup))
clear_tags(ex)
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></b
def flattened_html(htmlstring):
import re
squeletton = clear_tags(htmlstring)
step1 = re.sub("<([^/]*?)>", r"[\1", squeletton) # replace begining of tag
step2 = re.sub("</(.*?)>", r"]", step1) # replace end of tag
step3 = re.sub("<(.*?)/>", r"[\1]", step2) # deal with self-closing tag
step4 = re.sub("\]\[", ", ", step3) # gather sibling tags with coma
return(step4)
flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[#id="players_standard_batting"]/tbody/tr[contains(#class, "full_table")]'):
csk = batter_row.xpath('./td[#data-stat="player"]/#csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data :
if string in x:
substring = x.replace(string,'')
if substring != "":
result.append(substring)
else:
result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
You can do the same in different ways. Here is one such approach which doesn't require post processing. You get the names how you wanted to get:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(#class,"stats_table")]//tr[contains(#class,"full_table")]'):
csk = batter_row.xpath('.//td[#data-stat="player"]/a')[0].text
print(csk)
Output you may get like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data= []
for batter_row in blah:
csk = blah
bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data :
substring = x.replace(string,'')
if substring != "":
substring = substring.split(',')
substring.reverse()
substring = ' '.join(substring)
result.append(substring)
for x in result:
print(x)
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets are due to it being an array object. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = result[0].split(“ “).reverse()[::-1]

How can I get text content of multiple elements with Python Selenium?

Here is my code:
def textfinder():
try:
textfinder1 = driver.find_elements_by_class_name("m-b-none").text
except NoSuchElementException:
pass
print("no such element")
print(textfinder1)
It works only when I use find_element. When I use find_elements, it gives me error "list" object has no attribute "text". I understand that it returns a list, but I just don’t know how to "read" it. When I remove .text from the command, I don’t get any error, but some weird data, but I need the text content of the class.
Actually, when you do
text = driver.find_element_by_class_name("m-b-none").text
You will get the first element that is matched, and this element possesses, thanks to Selenium, an attribute whose name is text. A contrario, when you do
matched_elements = driver.find_elements_by_class_name("m-b-none")
^
it will match all corresponding elements. Given that matched_elements is a list, and that this list is a Python-native one, (it is not, for example, a Selenium-modified object which has text as attribute), you will have to iter over it, and iteratively get the text of each element. As follows
texts = []
for matched_element in matched_elements:
text = matched_element.text
texts.append(text)
print(text)
Or if you want to leave your code unchanged as possible, you can do it in one line:
texts = [el.text for el in driver.find_elements_by_class_name("m-b-none")]
You would need to reference each like an element of a list. Something like this:
textfinder1 = driver.find_elements_by_class_name("m-b-none")
for elements in textfinder1:
print elements.text
The method find_elements_by_class_name returns a list, so you should do:
text = ''
textfinder1 = driver.find_elements_by_class_name("m-b-none")
for i in textfinder1:
text += i.text + '\n' # Or whatever the way you want to concatenate your text
print(text)

python - sorting strings via xml attribute, .text malforms xml data

#!/usr/bin/env python
import os, sys, os.path
import string
def sort_strings_file(xmlfile,typee):
"""sort all strings within given strings.xml file"""
all_strings = {}
orig_type=typee
# read original file
tree = ET.ElementTree()
tree.parse(xmlfile)
# iter over all strings, stick them into dictionary
for element in list(tree.getroot()):
all_strings[element.attrib['name']] = element.text
# create new root element and add all strings sorted below
newroot = ET.Element("resources")
for key in sorted(all_strings.keys()):
# Check for IDs
if typee == "id":
typee="item"
# set main node type
newstring = ET.SubElement(newroot, typee)
#add id attrib
if orig_type == "id":
newstring.attrib['type']="id"
# continue on
newstring.attrib['name'] = key
newstring.text = all_strings[key]
# write new root element back to xml file
newtree = ET.ElementTree(newroot)
newtree.write(xmlfile, encoding="UTF-8")
This works great and all, but if a string start with like <b> it breaks badly.
EX
<string name="uploading_to"><b>%s</b> Odovzdávanie do</string>
becomes
<string name="uploading_to" />
I've looked into the xml.etree Element class, but it seems to only have .text method. I just need a way to pull everything in between xml tags. No, I can't change the input data. It comes directly from an Android APK ready to be translated, I cannot predict how / what the data comes in besides the fact that it must be valid XML Android code.
I think you are looking for the itertext() method instead. .text only returns text directly contained at the start of the element:
>>> test = ET.fromstring('<elem>Sometext <subelem>more text</subelem> rest</elem>')
>>> test.text
'Sometext '
>>> ''.join(test.itertext())
'Sometext more text rest'
The .itertext() iterator on the other hand let's you find all text contained in the element, including inside nested elements.
If, however, you only want text directly contained in an element, skipping the contained children, you want the combination of .text and the .tail values of each of the children:
>>> (test.text or '') + ''.join(child.tail for child in test.getchildren())
'Sometext middle rest'
If you need to capture everything contained, then you need to do a little more work; capture the .text, and cast each child to text with ElementTree.tostring():
>>> (test.text or '') + ''.join(ET.tostring(child) for child in test.getchildren())
'Sometext <subelem>more text</subelem> middle <subelem>other text</subelem> rest'
ET.tostring() takes the element tail into account. I use (test.text or '') because the .text attribute can be None as well.
You can capture that last method in a function:
def innerxml(elem):
return (elem.text or '') + ''.join(ET.tostring(child) for child in elem.getchildren())

Categories