I'm building a real estate web scraper and I'm having problems when a certain index doesn't exist in the HTML.
How can I fix this? The code that's causing the trouble is this:
info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
I'm new to web-scraping so I'm kinda lost.
Thanks!
One general way is to check the length before you attempt to access the index.
divs = container.find_all('div', class_="info-right text-xs-right")
if len(divs) > 0:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify this further, since an empty list is falsy:
divs = container.find_all('div', class_="info-right text-xs-right")
if divs:
    info_extra = divs[0].text
else:
    info_extra = None
You can simplify even further by using the walrus operator := (Python 3.8+):
if (divs := container.find_all('div', class_="info-right text-xs-right")):
    info_extra = divs[0].text
else:
    info_extra = None
Or all in one line:
info_extra = divs[0].text if (divs := container.find_all('div', class_="info-right text-xs-right")) else None
I'm new to web-scraping too, and most of my problems are when I ask for an element on the page that doesn't exist.
Have you tried a try/except block?
try:
    info_extra = container.find_all('div', class_="info-right text-xs-right")[0].text
except IndexError:
    info_extra = None  # fall back instead of crashing when the element is missing
https://docs.python.org/3/tutorial/errors.html
Good luck
First of all, you should always check data before doing anything with it.
Now, if there is just one result on the site for your selector:
info_extra_element = container.select_one('div.info-right.text-xs-right')
if info_extra_element:
    info_extra = info_extra_element.text
else:
    # In the unexpected situation where the selector couldn't be found,
    # report it and do something to prevent your program from crashing.
    print("selector couldn't be found on the page")
    info_extra = ''
If there is a list of elements matching your selector:
info_extra_elements = container.select('div.info-right.text-xs-right')
info_extra_texts = []
for element in info_extra_elements:
    info_extra_texts.append(element.text)
PS.
Based on this answer, it's good practice to use a CSS selector when you want to filter based on class.
The find method can be used when you just want to filter based on the element tag.
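A minimal illustration of the distinction (a made-up snippet, not the question's actual page):

from bs4 import BeautifulSoup

html = '<div class="info-right text-xs-right">42</div><span>other</span>'
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: filters on tag plus classes; the order of classes in the selector doesn't matter
info = soup.select_one('div.info-right.text-xs-right')

# find: handy when you filter on the tag name alone
span = soup.find('span')

print(info.text, span.text)  # 42 other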
Related
Hi, I am trying to scrape and am wondering whether there is a one-liner or simple way to handle None.
If it's None, do something; if it's not None, do something else. I mean, what would be the most Pythonic way of handling None that references the value itself?
Right now what I have is:
discount = soup.find_all('span', {"class":"jsx-30 discount"})
if len(discount) == 0:
    discount = ""
else:
    discount = soup.find_all('span', {"class":"jsx-3024393758 label discount"})[0].text
In case you only want to grab the text of the first element, I would recommend using find() instead of find_all().
To check whether the element exists, you can use an if statement:
discount = e.text if (e := soup.find('span', {"class":"jsx-3024393758 label discount"})) else ''
or try except:
try:
    discount = soup.find('span', {"class":"jsx-3024393758 label discount"}).text
except AttributeError:  # find() returns None when nothing matches
    discount = ''
<span class="cname">
<em class="multiple">2017</em> Ford
</span>
<span class="cname">
Toyota
</span>
I want to get only "Ford" and "Toyota" from the spans.
test.find_element_by_class_name('cname').text
returns "2017 Ford" and "Toyota". So how can I get just the particular text of the span?
Pure XPath solution:
//span[@class='cname']//text()[not(parent::em[@class='multiple'])]
And if you also want to filter out whitespace-only text nodes:
//span[@class='cname']//text()[not(parent::em[@class='multiple']) and not(normalize-space()='')]
Both return text nodes, not elements, so Selenium will probably fail.
Take a look here: https://sqa.stackexchange.com/a/33097 for how to get a text node.
Otherwise use this answer: https://stackoverflow.com/a/67518169/3710053
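If you are parsing the HTML yourself rather than driving a browser, lxml can evaluate the text-node XPath directly. A sketch, assuming the HTML from the question:

from lxml import html

tree = html.fromstring("""
<span class="cname"><em class="multiple">2017</em> Ford</span>
<span class="cname">Toyota</span>
""")
texts = tree.xpath("//span[@class='cname']//text()"
                   "[not(parent::em[@class='multiple'])"
                   " and not(normalize-space()='')]")
print([t.strip() for t in texts])  # ['Ford', 'Toyota']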
EDIT:
Another way to go is this XPath:
//span[@class='cname']
And then use Python code (like the linked python-example) to get only the direct text()-nodes.
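For instance, a sketch of my own (it assumes a Selenium driver is already open on the page, and may differ from the linked example):

span = driver.find_element_by_xpath("//span[@class='cname']")
# Keep only the direct text() children of the span, which skips the nested <em>.
direct_text = driver.execute_script(
    "return Array.from(arguments[0].childNodes)"
    ".filter(function(n) { return n.nodeType === Node.TEXT_NODE; })"
    ".map(function(n) { return n.textContent; }).join('');",
    span)
print(direct_text.strip())  # Ford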
EDIT 2
all_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//span[@class='cname']/em[@class='multiple']").text
parent_text = all_text.replace(child_text, '')
You can add a check for integers: if the text is an integer then don't print it (or do something else); otherwise print it, for //span[@class='cname'].
Code :
cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    if cname.text.isdigit():
        print("It is an integer")
    else:
        print(cname.text)
or
cname_list = driver.find_elements(By.XPATH, "//span[@class='cname']")
for cname in cname_list:
    try:
        int(cname.text)  # .text is always a str, so test for an integer by converting
        print("We don't like int for this use case")  # if you don't want this you can simply remove this line
    except ValueError:
        print(cname.text)
You can get the parent element's text without the child element's text as follows:
total_text = driver.find_element_by_xpath(parent_div_element_xpath).text
child_text = driver.find_element_by_xpath(child_div_element_xpath).text
parent_only_text = total_text.replace(child_text, '')
So in your specific case try the following:
total_text = driver.find_element_by_xpath("//span[@class='cname']").text
child_text = driver.find_element_by_xpath("//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
Or, to be more precise:
father = driver.find_element_by_xpath("//span[@class='cname']")
total_text = father.text
child_text = father.find_element_by_xpath(".//*[@class='multiple']").text
parent_only_text = total_text.replace(child_text, '')
In a general case you can define and use the following method:
def get_text_excluding_children(driver, element):
    # Note: this relies on jQuery being available on the page.
    return driver.execute_script("""
        return jQuery(arguments[0]).contents().filter(function() {
            return this.nodeType == Node.TEXT_NODE;
        }).text();
    """, element)
The element argument passed here is the WebElement returned by driver.find_element.
In your particular case you can find the element with:
element = driver.find_element_by_xpath("//span[@class='cname']")
and then pass it to get_text_excluding_children, and it will return the required text.
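For example (assuming jQuery is actually available on the target page, which the helper above depends on):

element = driver.find_element_by_xpath("//span[@class='cname']")
print(get_text_excluding_children(driver, element).strip())  # Ford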
I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.
The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case (to extract this link), you have to skip over the parentheses and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or if they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates">Coordinates
The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see if it contains either an '(' or an '<a'.
Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?
Below is the function with this code for reference:
def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break
    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print "\nReached article with no main body!\n"
            return None
    try:
        return str(link.attrs['href'])
    except KeyError:
        print "\nReached article with no main body\n"
        return None
I think you are seriously overcomplicating the problem.
There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "html.parser")
In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]:
['West Africa',
'Guinea',
'Liberia',
...
'Allen Iverson',
'Magic Johnson',
'Victor Oladipo',
'Frances Tiafoe']
Here we've found the a elements located directly under the p elements that are directly under the element with id="mw-content-text" - from what I understand, this is where the main Wikipedia article body is located.
If you need a single element, use select_one() instead of select().
Also, if you want to solve it via find*(), pass the recursive=False argument.
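A rough sketch of that route (assuming the same page structure the selector above relies on):

content = soup.find(id="mw-content-text")
first_links = []
for p in content.find_all("p", recursive=False):  # direct <p> children only
    a = p.find("a", recursive=False)  # first <a> that is a direct child of the <p>
    if a:
        first_links.append(a.get("href"))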
I am trying to crawl Agoda's daily hotel prices for multiple room types, along with additional information such as promotion details, breakfast conditions, and book-now-pay-later terms.
The code I have is below:
import requests
import math
import time
from bs4 import BeautifulSoup
url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
n = len(soup.select('.room-name'))
for i in range(0, n):
    en_room = soup.select('.room-name')[i].text.strip()
    currency = soup.select('.currency')[i].text
    price = soup.select('.sellprice')[i].text
    try:
        sp_info = soup.select('.left-room-info')[i].text.strip()
    except Exception as e:
        sp_info = "N/A"
    try:
        pay_later = soup.select('.book-now-paylater')[i].text.strip()
    except Exception as e:
        pay_later = "N/A"
    print en_room, i+1, currency, price, en_room, sp_info, pay_later
    time.sleep(1)
I have two questions:
(1) The "left-room-info" class seems to contain two sub-classes "breakfast" and "room-promo". These sub-classes only show up when the particular room type provides such services.
When only one of the sub-classes shows up, the output works out well. However, when neither of the sub-classes shows up, the output is empty where I expect it to show "N/A". Also, when both of the sub-classes show up, the output format has unnecessary empty lines which cannot be removed by .strip().
Is there any way to solve these problems?
(2) When I tried to extract information from the class '.book-now-paylater', the extracted data does not match each room type. For example, assuming there are 10 room types and only rooms 2, 4, 6, 8 allow travelers to book now and pay later, the code extracts exactly 4 pieces of book-now-pay-later information, but these 4 pieces are then assigned incorrectly to room types 1, 2, 3, 4.
Is there any way to fix this problem?
Thank you for your help!
Gary
(1) This is happening because even if there is no text in the '.left-room-info' selection, it won't throw an exception, so your except will never run. You should instead check whether the value is an empty string (''). You can do this with a simple if not string_var, like this:
sp_info = soup.select('.left-room-info')[i].text.strip()
if not sp_info:
    sp_info = "N/A"
When both subclasses show up, you should split the string on the carriage return ('\r') and then strip each of the resulting pieces. The code would look something like this: (note that now sp_info is a list, not just a string)
sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [info.strip() for info in sp_info]
Putting these pieces together, we'll get something like this
sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [info.strip() for info in sp_info]
elif not sp_info[0]:  # check for empty string
    sp_info = ["N/A"]  # keep sp_info a list for consistency
(2) is a little more complicated. You're going to have to change how you parse the page; namely, you're probably going to have to select on .room-type. The way you're selecting the book-now-pay-laters, it doesn't associate them with any other elements; it just selects the 8 instances of that class. Here is how I would go about it:
import requests
import math
from bs4 import BeautifulSoup
url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
rooms = soup.select('.room-type')[1:] # the first instance of the class isn't a room
room_list = []
for room in rooms:
    room_info = {}
    room_info['en_room'] = room.select('.room-name')[0].text.strip()
    room_info['currency'] = room.select('.currency')[0].text.strip()
    room_info['price'] = room.select('.sellprice')[0].text.strip()
    sp_info = room.select('.left-room-info')[0].text.strip().split('\r')
    if len(sp_info) > 1:
        sp_info = ", ".join([info.strip() for info in sp_info])
    elif not sp_info[0]:  # check for empty string
        sp_info = "N/A"
    room_info['sp_info'] = sp_info
    pay_later = room.select('.book-now-paylater')
    room_info['pay_later'] = pay_later[0].text.strip() if pay_later else "N/A"
    room_list.append(room_info)
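After the loop, room_list holds one dict per room, so the book-now-pay-later value stays attached to the right room type. For example:

for room_info in room_list:
    print room_info['en_room'], room_info['price'], room_info['pay_later']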
In your code, you are not traversing the DOM correctly, which will cause problems in scraping (e.g., your second problem). I'll give a suggestive code snippet (not an exact solution), hoping you can solve the first problem yourself.
# select all room types by the table's tr tags
room_types = soup.find_all('tr', class_="room-type")
# iterate over the list to scrape data from each td or div inside the tr
for room in room_types:
    en_room = room.find('div', class_='room-name').text.strip()
So I'm interested in this theory that if you go to a random Wikipedia article, click the first link not inside parentheses repeatedly, in 95% of the cases you will end up on the article about Philosophy.
I wanted to write a script in Python that does the link fetching for me and in the end, print a nice list of which articles were visited (linkA -> linkB -> linkC) etc.
I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that:
The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
The valid link elements all have a <p> element somewhere as their ancestor (most often a parent or grandparent if it's inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
Invalid links contain some special words followed by a colon, e.g. Wikipedia:
So far, so good. But it's the parentheses that get me. In the article about Human for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy" which is inside them.
I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?
My code can be seen below, but it's something I made up really quickly and I'm not very proud of. It's commented, however, so you can see my line of thought (I hope :) ).
"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time
def validWikiArticleLinkString(href):
    """ Takes a string and returns True if it contains the substring
        '/wiki/' in the beginning and does not contain any of the
        "special" wiki pages.
    """
    return (href.find("/wiki/") == 0
            and href.find("(disambiguation)") == -1
            and href.find("File:") == -1
            and href.find("Wikipedia:") == -1
            and href.find("Portal:") == -1
            and href.find("Special:") == -1
            and href.find("Help:") == -1
            and href.find("Template_talk:") == -1
            and href.find("Template:") == -1
            and href.find("Talk:") == -1
            and href.find("Category:") == -1
            and href.find("Bibcode") == -1
            and href.find("Main_Page") == -1)
if __name__ == "__main__":
    visited = []  # a list of visited links. used to avoid getting into loops
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # need headers for the api
    currentPage = "Human"  # the page to start with
    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()  # retrieve the contents of the wiki page we are at
        htmlDOM = parseString(html)  # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")  # find all <a> tags
        for tag in aTags:
            if "href" in tag.attributes.keys():  # see if we have the href attribute in the tag
                href = tag.attributes["href"].value  # get the value of the href attribute
                if validWikiArticleLinkString(href):  # if we have one of the link types we are looking for
                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.
                    # assume the link is valid.
                    invalid = False
                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at an <a> tag somewhere inside a <table>.
                    pn = tag.parentNode
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode
                    if invalid:  # go to next link
                        continue
                    # Next we look at the descriptive texts above the article, if any; e.g.
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.
                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True
                    if invalid:
                        continue
                    ###### this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk
                    # for elem in tag.parentNode.childNodes:
                    #     while elem.firstChild is not None:
                    #         elem = elem.firstChild
                    #         print elem.nodeValue
                    print href  # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break
        # if we have been to this link before, break the loop
        if newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)  # mark this currentPage as visited
            currentPage = newLink  # make the currentPage we found the new page to fetch
            time.sleep(5)  # sleep some to see results as debug
I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game.
It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing links.
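A minimal sketch of that trick (my own illustration, not the linked script): strip parenthesized spans from the paragraph's HTML before looking for links, so links inside parentheses are never considered.

import re

def remove_parenthesized(text):
    # Repeatedly drop innermost (...) groups so nested parentheses are handled too.
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r'\([^()]*\)', '', text)
    return text

print(remove_parenthesized("Human (Homo sapiens (Latin)) is a species ..."))
# -> "Human  is a species ..."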