I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.
The problem I'm running into arises for a page like this, where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case (to extract this link), you have to skip over the parentheses and then get to the very next anchor tag/href. In most articles my algorithm can skip over the parentheses, but the way it looks for links before the parentheses (or handles the case where there are none) makes it find the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates">Coordinates
The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see if it contains either an '(' or an '<a'.
Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?
Below is the function with this code for reference:
def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break
    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print "\nReached article with no main body!\n"
            return None
    try:
        return str(link.attrs['href'])
    except KeyError:
        print "\nReached article with no main body\n"
        return None
I think you are seriously overcomplicating the problem.
There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "html.parser")
In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]:
['West Africa',
'Guinea',
'Liberia',
...
'Allen Iverson',
'Magic Johnson',
'Victor Oladipo',
'Frances Tiafoe']
Here we've found the a elements that are located directly under the p elements directly under the element with id="mw-content-text" - from what I understand, this is where the main Wikipedia article content is located.
If you need a single element, use select_one() instead of select().
Also, if you want to solve it via find*(), pass the recursive=False argument.
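For example, a short sketch of the find*() route, reusing the soup from above:

first_paragraph = soup.find(id="mw-content-text").find("p")
# recursive=False keeps only the <a> tags that are direct children of the <p>,
# so anchors nested inside <span>, <b>, etc. are skipped
direct_links = first_paragraph.find_all("a", recursive=False)
if direct_links:
    print(direct_links[0].get("href"))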
Related
I'm building a real estate web scraper and I'm having problems when a certain index doesn't exist in the HTML.
How can I fix this? The code that is having this trouble is this:
info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
I'm new to web-scraping so I'm kinda lost.
Thanks!
One general way is to check the length before you attempt to access the index.
divs = container.find_all('div', class_="info-right text-xs-right")
if len(divs) > 0:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify this further by knowing that an empty list is false.
divs = container.find_all('div', class_="info-right text-xs-right")
if divs:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify even further by using the walrus operator :=
if (divs := container.find_all('div', class_="info-right text-xs-right")):
    info_extra = divs[0].text
else:
    info_extra = None
Or all in one line:
info_extra = divs[0].text if (divs := container.find_all('div', class_="info-right text-xs-right")) else None
I'm new to web scraping too, and most of my problems happen when I ask for an element on the page that doesn't exist.
Have you tried a try/except block?
try:
    info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
except IndexError:
    info_extra = None
https://docs.python.org/3/tutorial/errors.html
Good luck
First of all, you should always check data before doing anything with it.
Now, if there is just one result on the page for your selector:
info_extra_element = container.select_one('div.info-right.text-xs-right')
if info_extra_element:
    info_extra = info_extra_element.text
else:
    # On unexpected situation where selector couldn't be found
    # report it and do something to prevent your program from crashing.
    print("selector couldn't be found on the page")
    info_extra = ''
If there is a list of elements that match your selector:
info_extra_elements = container.select('div.info-right.text-xs-right')
info_extra_texts = []
for element in info_extra_elements:
    info_extra_texts.append(element.text)
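Or, equivalently, as a list comprehension:

info_extra_texts = [element.text for element in info_extra_elements]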
PS.
Based on this answer, it's good practice to use a CSS selector when you want to filter based on class.
The find method can be used when you just want to filter based on the element tag.
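For illustration, a small self-contained sketch of that distinction (the HTML here is made up):

from bs4 import BeautifulSoup

html = '<div class="info-right text-xs-right">extra info</div><div class="other">something else</div>'
container = BeautifulSoup(html, "html.parser")

# class-based filtering: the CSS selector handles the multi-valued class attribute cleanly
by_class = container.select_one("div.info-right.text-xs-right")

# tag-based filtering: find is enough when only the element name matters
by_tag = container.find("div")

print(by_class.text)  # extra info
print(by_tag.text)    # extra info (the first div in document order)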
<span class="cname">
<em class="multiple">2017</em> Ford
</span>
<span class="cname">
Toyota
</span>
I want to get only "Ford" and "Toyota" from the spans.
test.find_element_by_class_name('cname').text
return "2017 FORD" and "TOYOTA". So how can i get particular text of span?
Pure XPath solution:
//span[@class='cname']//text()[not(parent::em[@class='multiple'])]
And if you also want to filter out whitespace-only text nodes:
//span[@class='cname']//text()[not(parent::em[@class='multiple']) and not(normalize-space()='')]
Both return text nodes, not elements, so Selenium will probably fail.
Take a look here: https://sqa.stackexchange.com/a/33097 on how to get a text-node().
Otherwise use this answer: https://stackoverflow.com/a/67518169/3710053
EDIT:
Another way to go is this XPath:
//span[@class='cname']
And then use this python-example code to get only the direct text() nodes.
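For example, a small sketch of that second step with lxml (assuming lxml is installed and the page source comes from the Selenium driver):

from lxml import html as lxml_html

tree = lxml_html.fromstring(driver.page_source)
for span in tree.xpath("//span[@class='cname']"):
    # text() relative to the span returns only its direct text nodes,
    # so the contents of the nested <em> are excluded
    direct_text = "".join(span.xpath("text()")).strip()
    print(direct_text)  # "Ford", then "Toyota"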
EDIT 2
all_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//span[@class='cname']/em[@class='multiple']").text
parent_text = all_text.replace(child_text, '')
You can add a check for integers: if the text is an integer, don't print it (or do something else); otherwise print it, for //span[@class='cname'].
Code:
cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    if cname.text.isdigit():
        print("It is an integer")
    else:
        print(cname.text)
or
cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    if type(cname.text) is int:
        print("We don't like int for this use case")  # if you don't want this you can simply remove this line
    else:
        print(cname.text)
You can get the parent element text without the child element text as following:
total_text = driver.find_element_by_xpath(parent_div_element_xpath).text
child_text = driver.find_element_by_xpath(child_div_element_xpath).text
parent_only_text = total_text.replace(child_text, '')
So in your specific case try the following:
total_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
Or to be more precise
father = driver.find_element_by_xpath("//span[@class='cname']")
total_text = father.text
child_text = father.find_element_by_xpath(".//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
In a general case you can define and use the following method:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
        return jQuery(arguments[0]).contents().filter(function() {
            return this.nodeType == Node.TEXT_NODE;
        }).text();
    """, element)
The element argument passed here is the WebElement returned by driver.find_element.
In your particular case you can find the element with:
element = driver.find_element_by_xpath("//span[@class='cname']")
and then pass it to get_text_excluding_children, and it will return the required text.
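If jQuery is not loaded on the page, a similar sketch using only plain DOM APIs (same idea, no jQuery dependency):

def get_text_excluding_children_plain(driver, element):
    return driver.execute_script("""
        var result = '';
        var nodes = arguments[0].childNodes;
        for (var i = 0; i < nodes.length; i++) {
            if (nodes[i].nodeType === Node.TEXT_NODE) {
                result += nodes[i].textContent;
            }
        }
        return result;
    """, element)

element = driver.find_element_by_xpath("//span[@class='cname']")
print(get_text_excluding_children_plain(driver, element).strip())  # "Ford"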
I am using the following function to construct a CSS selector using BS4:
def nth_of_type(elem):
    count, curr = 0, 0
    for i, e in enumerate(elem.find_parent().find_all(recursive=False), 1):
        if e.name == elem.name:
            count += 1
        if e == elem:
            curr = i
    return '' if count == 1 else ':nth-child({})'.format(curr)

def getCssPath(elem):
    rv = [elem.name + nth_of_type(elem)]
    while True:
        elem = elem.find_parent()
        if not elem or elem.name == '[document]':
            return '>'.join(rv[::-1])
        rv.append(elem.name + nth_of_type(elem))
So if I scrape a page using:
page_r = requests.get('<my url>')
page_soup = BeautifulSoup(page_r.content, 'html.parser')
elements = page_soup.find_all('a')
print(getCssPath(elements[0]))
# html>body>div:nth-child(2)>div:nth-child(6)>div>div>main>article>div>div:nth-child(1)>div:nth-child(1)>div>div>div>div:nth-child(2)>div:nth-child(1)>div>div>div>div>div:nth-child(1)>div:nth-child(2)>div>div:nth-child(4)>a
But this is very long, so I want to get the shortest CSS selector, similar to the one you can get in Chrome by right-clicking the element and doing Copy > Selector. That can involve classes, ids, etc.
Is there any BS4 function that already does that, or how should this function be modified to get it?
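I don't know of a built-in BS4 helper for this, but one way to shorten the output is to stop climbing as soon as an ancestor has an id, since #some-id already pins that subtree down. A rough sketch reusing nth_of_type from above (not a full equivalent of Chrome's Copy > Selector):

def getShortCssPath(elem):
    rv = [elem.name + nth_of_type(elem)]
    while True:
        elem = elem.find_parent()
        if not elem or elem.name == '[document]':
            return '>'.join(rv[::-1])
        if elem.get('id'):
            # an id should be unique, so the path above this point is unnecessary
            rv.append('#{}'.format(elem['id']))
            return '>'.join(rv[::-1])
        rv.append(elem.name + nth_of_type(elem))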
I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...
def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True
article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the
elem == "span" and elem.text == re.compile("my text")
clause. However, this results in an
AttributeError: 'str' object has no attribute 'text'
error when I try and run the above. What's the proper way to write my strainer?
TLDR; No, this is currently not easily possible in BeautifulSoup (modification of BeautifulSoup and SoupStrainer objects would be needed).
Explanation:
The problem is that the function you pass to the Strainer gets called in the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (e.g. the element name and attrs).
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524
if (self.parse_only and len(self.tagStack) <= 1
        and (self.parse_only.text
             or not self.parse_only.search_tag(name, attrs))):
    return None
And as you can see, if your Strainer function returns False, the element gets discarded immediately, without any chance to take the inner text into consideration (unfortunately).
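You can see this with a tiny experiment (a throwaway sketch; show_args is just an illustrative name) - the strainer function receives only the tag name and its attribute dict, never the inner text:

from bs4 import BeautifulSoup, SoupStrainer

def show_args(name, attrs):
    print("name=%r attrs=%r" % (name, attrs))  # no access to the tag's text here
    return name == "span"

only_spans = SoupStrainer(show_args)
BeautifulSoup('<div class="score">1</div><span>my text</span>',
              "html.parser", parse_only=only_spans)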
On the other hand, if you add "text" to the search:
SoupStrainer(text="my text")
it will search inside the tag for text, but this has no context of the element or its attributes - you can see the irony :/
Combining the two will just find nothing. And you can't even access the parent, as shown here in the find function:
https://gist.github.com/RichardBronosky/4060082
So currently Strainers are only good for filtering on elements/attrs. You would need to change a lot of BeautifulSoup code to get this working.
If you really need this, I suggest inheriting BeautifulSoup and SoupStrainer objects and modifying their behavior.
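If subclassing is overkill, a pragmatic two-pass workaround (a sketch, assuming the html variable from the question) is to strain on element names only and filter on attributes/text after parsing:

import re
from bs4 import BeautifulSoup, SoupStrainer

# first pass: keep only the candidate elements while parsing
only_candidates = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, "html.parser", parse_only=only_candidates)

# second pass: filter on attributes and inner text with the parsed tree available
score_divs = soup.find_all("div", class_="score")
matching_spans = soup.find_all("span", string=re.compile("my text"))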
It seems you are trying to loop over soup elements in the my_custom_strainer method.
To do so, you could do something like the following:
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)
Then slightly modify my_custom_strainer to something like:
def my_custom_strainer(soup, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    for d in soup.findAll(['div', 'span']):
        if d.name == 'span' and 'class' in attr and attrs['class'] == "score":
            return d.text  # meet your needs here
        elif d.name == 'span' and re.search("my text", d.text):
            return d.text  # meet your needs here
This way you can access the soup objects iteratively.
I recently created a lxml / BeautifulSoup parser for html files, which also searches between specific tags.
The function I wrote opens your operating system's file manager and allows you to select the specific HTML file to parse.
def openFile(self):
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        file = open(fileName)
        data = file.read()
        soup = BeautifulSoup(data, "lxml")
        results = []
        for item in soup.find_all('strong'):
            results.append(float(item.text))
        print('Score =', results[1])
        print('Fps =', results[0])
You can see that the tag I specified was 'strong', and I was trying to find the text within that tag.
Hope I could help in some way.
So I'm interested in this theory that if you go to a random Wikipedia article and repeatedly click the first link not inside parentheses, in 95% of cases you will end up on the article about Philosophy.
I wanted to write a script in Python that does the link fetching for me and in the end, print a nice list of which articles were visited (linkA -> linkB -> linkC) etc.
I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that:
The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
The valid link elements all have a <p> element somewhere as their ancestor (most often the parent or grandparent if it's inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
Invalid links contain some special words followed by a colon, e.g. Wikipedia:
So far, so good. But it's the parentheses that get me. In the article about Human for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy" which is inside them.
I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?
My code can be seen below, but it's something I made up really quickly and I'm not very proud of it. It's commented, however, so you can see my line of thought (I hope :) ).
"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time
def validWikiArticleLinkString(href):
""" Takes a string and returns True if it contains the substring
'/wiki/' in the beginning and does not contain any of the
"special" wiki pages.
"""
return (href.find("/wiki/") == 0
and href.find("(disambiguation)") == -1
and href.find("File:") == -1
and href.find("Wikipedia:") == -1
and href.find("Portal:") == -1
and href.find("Special:") == -1
and href.find("Help:") == -1
and href.find("Template_talk:") == -1
and href.find("Template:") == -1
and href.find("Talk:") == -1
and href.find("Category:") == -1
and href.find("Bibcode") == -1
and href.find("Main_Page") == -1)
if __name__ == "__main__":
visited = [] # a list of visited links. used to avoid getting into loops
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api
currentPage = "Human" # the page to start with
while True:
infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
html = infile.read() # retrieve the contents of the wiki page we are at
htmlDOM = parseString(html) # get the DOM of the parsed HTML
aTags = htmlDOM.getElementsByTagName("a") # find all <a> tags
for tag in aTags:
if "href" in tag.attributes.keys(): # see if we have the href attribute in the tag
href = tag.attributes["href"].value # get the value of the href attribute
if validWikiArticleLinkString(href): # if we have one of the link types we are looking for
# Now come the tricky parts. We want to look for links in the main content area only,
# and we want the first link not in parentheses.
# assume the link is valid.
invalid = False
# tables which appear to the right on the site appear first in the DOM, so we need to make sure
# we are not looking at a <a> tag somewhere inside a <table>.
pn = tag.parentNode
while pn is not None:
if str(pn).find("table at") >= 0:
invalid = True
break
else:
pn = pn.parentNode
if invalid: # go to next link
continue
# Next we look at the descriptive texts above the article, if any; e.g
# This article is about .... or For other uses, see ... (disambiguation).
# These kinds of links will lead into loops so we classify them as invalid.
# We notice that this text does not appear to be inside a <p> block, so
# we dismiss <a> tags which aren't inside any <p>.
pnode = tag.parentNode
while pnode is not None:
if str(pnode).find("p at") >= 0:
break
pnode = pnode.parentNode
# If we have reached the root node, which has parentNode None, we classify the
# link as invalid.
if pnode is None:
invalid = True
if invalid:
continue
###### this is where I got stuck:
# now we need to look if the link is inside parentheses. below is some junk
# for elem in tag.parentNode.childNodes:
# while elem.firstChild is not None:
# elem = elem.firstChid
# print elem.nodeValue
print href # this will be the next link
newLink = href[6:] # except for the /wiki/ part
break
# if we have been to this link before, break the loop
if newLink in visited:
print "Stuck in loop."
break
# or if we have reached Philosophy
elif newLink == "Philosophy":
print "Ended up in Philosophy."
break
else:
visited.append(currentPage) # mark this currentPage as visited
currentPage = newLink # make the the currentPage we found the new page to fetch
time.sleep(5) # sleep some to see results as debug
I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) to play this game.
It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it just removes text between brackets before parsing links.
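The bracket-stripping idea is easy to reproduce; here is a minimal sketch of it (my own illustration, not the exact code from that repository):

import re
from bs4 import BeautifulSoup

def first_link_outside_parens(paragraph_html):
    # crude but effective: drop "(...)" spans from the paragraph's HTML before
    # looking for a link, so anchors inside parentheses are never considered
    cleaned = re.sub(r'\([^()]*\)', '', paragraph_html)
    cleaned = re.sub(r'\([^()]*\)', '', cleaned)  # second pass for one level of nesting
    soup = BeautifulSoup(cleaned, "html.parser")
    first_a = soup.find('a', href=True)
    return first_a['href'] if first_a else None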