Parsing HTML using BeautifulSoup and Selenium in Python

I wanted to practice scraping with a real-world example (Airbnb) using BeautifulSoup and Selenium in Python. Specifically, my goal is to get all the listing (home) IDs within LA. My strategy is to open Chrome and go to the Airbnb page where I have already manually searched for homes in LA, and start from there; for that part I decided to use Selenium. After that, I want to parse the HTML in the page source and find the listing IDs shown on the current page, then simply iterate through all the pages.
Here's my code:
from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", chrome_options=option)

first_url = "https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n = 3
for i in range(1, n + 1):
    if i == 1:
        driver.get(first_url)
        print(first_url)
        # HTML parse using BS
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listings = soup.findAll("div", {"class": "_f21qs6"})
        # print out all the listing_ids within the current page
        for i in range(len(listings)):
            only_id = listings[i]['id']
            print(only_id[8:])
    after_first_url = first_url + "&section_offset=%d" % i
    print(after_first_url)
    driver.get(after_first_url)
    # HTML parse using BS
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listings = soup.findAll("div", {"class": "_f21qs6"})
    # print out all the listing_ids within the current page
    for i in range(len(listings)):
        only_id = listings[i]['id']
        print(only_id[8:])
If you find any inefficient code, please bear with me, since I'm a beginner; I put this together from several articles and videos. Anyway, I believe the code is roughly right, but the issue is that every time I run it I get a different result. It loops over the pages, but only prints IDs for some of them: for example, it visits page 1 without any corresponding output, visits page 2 and prints results, then prints nothing for page 3. Which pages produce output seems random. On top of that, sometimes it visits pages 1, 2, 3, ... in order, but other times it visits page 1, jumps to the last page (17), and then comes back to page 2. I guess my code isn't quite right, since the output is so unstable. Has anyone had a similar experience, or could someone help me work out what the problem is? Thanks.

Try the method below.
Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:
In [8]: from bs4 import BeautifulSoup
In [9]: from selenium import webdriver
In [10]: driver = webdriver.Firefox()
In [11]: driver.get('http://news.ycombinator.com')
In [12]: html = driver.page_source
In [13]: soup = BeautifulSoup(html, 'html.parser')
In [14]: for tag in soup.find_all('title'):
....:     print(tag.text)
....:
....:
Hacker News
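
Applied to the Airbnb pagination from your question, a minimal sketch of the same page_source-into-BeautifulSoup pattern could look like the code below. Note that it deliberately uses two different loop variables: in your code, the inner for i in range(len(listings)) overwrites the outer page counter i before section_offset is computed, which would explain jumps like page 1 straight to page 17. The _f21qs6 class name and the section_offset parameter are taken from your question and may have changed since.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe")
first_url = "https://www.airbnb.com/s/..."  # the full search URL from the question

for page in range(1, 4):  # pages 1..3; this counter is never reused below
    driver.get(first_url if page == 1 else first_url + "&section_offset=%d" % page)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # listing containers, using the class name from the question
    for listing in soup.find_all("div", {"class": "_f21qs6"}):
        print(listing['id'][8:])  # drop the 8-character prefix, as in the question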

Related

Crawl a webpage which is generated by JavaScript

I want to crawl the data from this website
I only need the text "Pictograph - A spoon 勺 with something 一 in it"
I checked Network -> Doc and I think the information is hidden there, because I found this line:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
And I think this generates the part of the page that we can see at the link. However, I don't know what this code is - it contains HTML, but also many function() calls.
Update
If the code is JavaScript, I would like to know how I can crawl the website without using Selenium.
Thanks!
This page uses JavaScript to add this element. Using Selenium I can get the HTML after the element is added, and then I can search for the text in that HTML. The HTML has a strange structure - all the text sits in one tag, so the part we want has no special tag to find it by. But it is the last text in that tag and it starts with "Formation:", so I use BeautifulSoup's get_text() to get all the text (with all subtags), and then split('Formation:') to get the text after that label.
import selenium.webdriver
from bs4 import BeautifulSoup as BS

driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')

soup = BS(driver.page_source, 'html.parser')
# get all the text (including subtags) from the definition block
text = soup.find('div', {'id': "charDef"}).get_text()
# keep only the part after the "Formation:" label
text = text.split('Formation:')[-1]
print(text.strip())
Selenium may run slower, but it was faster to create a solution this way.
If I could find the URL that the JavaScript uses to load the data, I would use it without Selenium, but I didn't see this information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and maybe the text was there, but I didn't try to uncompress/decode them.
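For completeness, if the right XHR endpoint ever does turn up in the Network tab, a minimal sketch without Selenium could look like this - the endpoint URL here is purely hypothetical, and note that requests decompresses gzip responses automatically:

import requests

# hypothetical endpoint - replace with the real URL found under Network/XHR
url = 'https://www.archchinese.com/some_endpoint?find=%E4%B8%8E'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# requests transparently decompresses gzip/deflate content
text = r.text  # or r.json(), depending on what the endpoint returns
print(text)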

Data scraper: the contents of the div tag is empty (??)

I am scraping a website to get a number. The number changes dynamically every split second, but upon inspection it is shown. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, as I am quite new to Python and data scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper returns no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors; I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed it is easy with Selenium. I suspect there is a JS script running a counter that supplies the number, which is why you can't find it with your method (as mentioned in the comments).
from selenium import webdriver

d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
# the counter is filled in by JavaScript, so read it from the rendered page
print(d.find_element_by_id('contador_PDEH').text)
d.quit()
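Since the counter updates every split second, the element may still be empty right after the page loads. A variant with an explicit wait - a sketch, reusing the same ID and driver path as above - keeps polling until the text is non-empty:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')

# poll for up to 10 seconds until the counter text is non-empty, then return it
value = WebDriverWait(d, 10).until(
    lambda drv: drv.find_element_by_id('contador_PDEH').text.strip()
)
print(value)
d.quit()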

List links of xls files using BeautifulSoup

I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely has to be different, since the links on that particular site are tagged with an href attribute:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mouseover gives these further details:
I'm following the setup here with a few changes to produce the snippet below, which provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links
links1 = getLinks("https://SOMEWEBSITE")
A further inspection using Ctrl+Shift+I in Google Chrome reveals that those particular links do not have an href attribute, but rather an ng-href attribute:
So I tried changing that in the snippet above, but with no success.
I've also tried different combinations with re.compile("^https://"), attrs={'ng-href'} and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems to be a bit problematic to read these links directly.
When I use Ctrl+Shift+I and "Select an element in the page to inspect it" (Ctrl+Shift+C), this is what I see when I hover over one of the links listed above:
What I'm looking to extract is the information associated with the ng-href attribute. But if I right-click the page and select Show Source, the same tag only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update: using Selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://.....')

# wait max 15 seconds until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))

links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex matching URLs that contain .xls:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try using Google Chrome's network inspection as you already did (Ctrl+Shift+I) and see if you can find the URL that is queried (open the Network tab and reload the page). The query should typically return JSON with the links to the xls files.
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
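If such a queried URL does show up in the Network tab, a sketch of the Selenium-free route could look like this - the endpoint and the JSON field names are hypothetical stand-ins for whatever the site actually returns:

import requests

# hypothetical JSON endpoint spotted under the Network tab
r = requests.get('https://SOMEWEBSITE/api/documents')
links = []
for item in r.json():           # assumes the response is a JSON list of objects
    href = item.get('url', '')  # assumes each entry carries the file path
    if href.endswith('.xls'):
        links.append("https://SOMEWEBSITE" + href)
print(links)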

Why is the html in view-source different from what I see in the terminal when I call prettify()?

I decided to view a website's source code and chose a class named "expanded" (I found it using view-source; prettify() shows different code). I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all(class_='expanded'))
but it simply prints out:
[]
Please help me figure out what's wrong.
I already saw this thread and tried following what the answer said, but it did not help me, since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
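(As a side note, that bs4.FeatureNotFound error just means the lxml package isn't installed; a minimal fix, sketched below, is either installing it or falling back to the parser that ships with Python:)

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
# either run "pip install lxml" and keep BeautifulSoup(page.content, 'lxml'),
# or use the built-in parser, which needs no extra package:
soup = BeautifulSoup(page.content, 'html.parser')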
I had a look at the site in question, and the only similar class was actually named ui_qtext_expanded.
When you use findAll / find_all, the result is a list of items, so you have to iterate over it and call .text on each item - that is, if you want the text and not the actual HTML.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
res = soup.find_all(class_='ui_qtext_expanded')
for i in res:
    print(i.text)
The beginning of the output from your link is:
A combination of mechanize, Requests and BeautifulSoup works pretty good for the basic stuff.Learn about mechanize here.Mechanize is sufficient for basic form filling, form submission and that sort of stuff, but for real browser emulation (like dealing with Javascript rendered HTML) you should look into selenium.

Isolating data from dynamic table with beautifulSoup

I'm trying to extract data from a table (1), which has a couple of filter options. I'm using BeautifulSoup and got to this page with Requests. An extract of the code:
from bs4 import BeautifulSoup

tt = Contact_page.content  # webpage with the table
soup = BeautifulSoup(tt, 'html.parser')
R_tables = soup.find('div', {'class': 'responsive-table'})
Using find_all("tr") and find_all("th") returns empty sets. R_tables.findChildren only goes down to "formrow", which then has no children; from formrow down to my tr/th tags, I can't get access through BS4.
R_tables corresponds to table 3. The XPath for this element is:
//*[@id="kronos_body"]/div[3]/div[2]/div[3]/script/text()
How can I get the information in each row of my data? soup.find("r") and soup.find("f") also return empty sets.
Pardon me in advance if this post is sloppy; it's my first. I'll link my most similar thread in a comment, since I can't include more than two links.
EDIT 1: Apparently BS4 doesn't parse any JavaScript apart from variables (correct me if I'm wrong, I'm still relatively new). Are there any other modules that can help me out? Ghost and Selenium were suggested to me, but I won't be using Selenium.
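Since the XPath above ends in /script/text(), the table rows are apparently embedded in a script tag and rendered by JavaScript. One Selenium-free option is to pull that script's raw text out with BeautifulSoup and extract the data with a regular expression - only a sketch, since the variable name tableData and the JSON layout are hypothetical and would need to be matched to the actual script contents:

import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(Contact_page.content, 'html.parser')
# grab the raw JavaScript that the XPath above points at
container = soup.find('div', {'class': 'responsive-table'})
script = container.find_next('script') if container else None
script_text = script.string if script and script.string else ''

# hypothetical: rows serialized as JSON in a "var tableData = [...];" assignment
m = re.search(r'var\s+tableData\s*=\s*(\[.*?\]);', script_text, re.S)
if m:
    for row in json.loads(m.group(1)):
        print(row)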
