Isolating data from dynamic table with beautifulSoup - python

I'm trying to extract data from a table (1) which has a couple of filter options. I'm using BeautifulSoup and got to this page with Requests. An extract of the code:
from bs4 import BeautifulSoup

tt = Contact_page.content  # Contact_page is the Requests response for the page with the table
soup = BeautifulSoup(tt, 'html.parser')  # explicit parser avoids bs4's "no parser specified" warning
R_tables = soup.find('div', {'class': 'responsive-table'})
Using find_all("tr") and find_all("th") returns empty lists. Using R_tables.findChildren only goes down to "formrow", which then has no children. I can't get from formrow down to my tr/th tags through BS4.
R_tables results in table 3. The XPath for this element is
//*[@id="kronos_body"]/div[3]/div[2]/div[3]/script/text()
How can I get each row's information from my data? soup.find("r") and soup.find("f") also return nothing.
Pardon me in advance if this post is sloppy; it's my first. I'll link the most similar thread in a comment, since I can't include more than two links.
EDIT 1: Apparently BS doesn't recognize any JavaScript apart from variables (correct me if I'm wrong, I'm still relatively new). Are there any other modules that can help me out? I was pointed to Ghost and Selenium, but I won't be using Selenium.
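Since the XPath above ends in /script/text(), the table rows are probably embedded in a JavaScript variable rather than rendered as tr/th tags, which would explain why BS4 can't see them. A minimal sketch of pulling such data out without a browser, assuming the rows sit in a JSON-like literal inside that script (the variable name tableData and the 'rows' key are illustrative guesses, not the page's real names):

import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(Contact_page.content, 'html.parser')  # Contact_page as above
container = soup.find('div', {'class': 'responsive-table'})
script = container.find('script')  # the <script> element the XPath points at

# Hypothetical: if the script contains something like `var tableData = {...};`,
# a regex can capture the JSON literal so json.loads() can parse it.
match = re.search(r'tableData\s*=\s*(\{.*?\});', script.string or '', re.DOTALL)
if match:
    data = json.loads(match.group(1))
    for row in data.get('rows', []):  # 'rows' is a guess at the structure
        print(row)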

Related

Why is the html in view-source different from what I see in the terminal when I call prettify()?

I decided to view a website's source code and picked a class named "expanded" (I found it using view-source; prettify() shows different code). I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all(class_='expanded'))
but it simply prints out:
[]
Please help me figure out what's wrong.
I already saw this thread and tried following what the answer said, but it did not help, since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
I had a look at the site in question, and the only similar class was actually named ui_qtext_expanded.
find_all (a.k.a. findAll) returns a list of results, so you have to iterate over it and use .text on each item if you want the text rather than the actual HTML.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
res = soup.find_all(class_='ui_qtext_expanded')
for i in res:
    print(i.text)
The beginning of the output from your link is
A combination of mechanize, Requests and BeautifulSoup works pretty good for the basic stuff. Learn about mechanize here. Mechanize is sufficient for basic form filling, form submission and that sort of stuff, but for real browser emulation (like dealing with Javascript rendered HTML) you should look into selenium.
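As for the bs4.FeatureNotFound error mentioned in the question: it just means the lxml parser isn't installed (pip install lxml fixes it). A small sketch of falling back to the stdlib parser, using a made-up one-line document:

from bs4 import BeautifulSoup, FeatureNotFound

html = "<div class='ui_qtext_expanded'>text</div>"  # stand-in snippet
try:
    soup = BeautifulSoup(html, 'lxml')           # fast, needs `pip install lxml`
except FeatureNotFound:
    soup = BeautifulSoup(html, 'html.parser')    # stdlib fallback, nothing to install
print(soup.div['class'])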

parsing html by using beautiful soup and selenium in python

I wanted to practice scraping with a real-world example (Airbnb) using BeautifulSoup and Selenium in Python. Specifically, my goal is to get all the listing (home) IDs within LA. My strategy is to open Chrome, go to the Airbnb page where I've already manually searched for homes in LA, and start from there. Up to this point I decided to use Selenium. After that I wanted to parse the HTML source, find the listing IDs shown on the current page, and then basically iterate through all the pages.
Here's my code:
from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", chrome_options=option)
first_url = "https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n = 3
for i in range(1, n + 1):
    if i == 1:
        driver.get(first_url)
        print(first_url)
        # HTML parse using BS
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listings = soup.findAll("div", {"class": "_f21qs6"})
        # print out all the listing ids within the current page
        for i in range(len(listings)):
            only_id = listings[i]['id']
            print(only_id[8:])
    after_first_url = first_url + "&section_offset=%d" % i
    print(after_first_url)
    driver.get(after_first_url)
    # HTML parse using BS
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listings = soup.findAll("div", {"class": "_f21qs6"})
    # print out all the listing ids within the current page
    for i in range(len(listings)):
        only_id = listings[i]['id']
        print(only_id[8:])
If you find any inefficient code, please bear with me, since I'm a beginner; I put this together from reading and watching multiple sources. Anyway, I think the code is right, but the issue is that every time I run it I get a different result. It loops over the pages, but sometimes it only prints results for some of them: for example, it loops over page 1 without any corresponding output, loops over page 2 and prints results, then prints nothing for page 3. It's so random which pages produce results and which don't. On top of that, sometimes it visits pages 1, 2, 3, ... in order, but other times it visits page 1, jumps to the last page (17), and then comes back to page 2. I guess my code isn't quite right, since the output is unstable. Has anyone had a similar experience, or could someone help me figure out what the problem is? Thanks.
Try the method below.
Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load page_source into BeautifulSoup as follows:
In [8]: from bs4 import BeautifulSoup
In [9]: from selenium import webdriver
In [10]: driver = webdriver.Firefox()
In [11]: driver.get('http://news.ycombinator.com')
In [12]: html = driver.page_source
In [13]: soup = BeautifulSoup(html, 'html.parser')
In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
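Beyond that, one likely cause of the unstable output in the question's code is that the inner for loops reuse the outer loop variable i, so the section_offset computed afterwards depends on how many listings the previous page happened to have. A minimal sketch of the loop with a distinct inner name and a crude wait for the JavaScript-rendered listings (the _f21qs6 class, the section_offset parameter and the [8:] slice come from the question; the fixed sleep is a simplification, a WebDriverWait would be sturdier):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # adjust the driver path/options to your setup
first_url = "https://www.airbnb.com/s/..."  # the full search URL from the question
n = 3
for page in range(1, n + 1):
    # Page 1 is the base URL; later pages append a section_offset parameter.
    url = first_url if page == 1 else first_url + "&section_offset=%d" % page
    driver.get(url)
    time.sleep(5)  # give the JavaScript time to render the listings

    soup = BeautifulSoup(driver.page_source, "html.parser")
    for listing in soup.find_all("div", {"class": "_f21qs6"}):
        # ids look like "listing-XXXXXXX"; [8:] strips the prefix, per the question
        print(listing["id"][8:])

driver.quit()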

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my wheels trying to get the 'href'.
Inspect Element shows the anchor I'm after (screenshot omitted), but as you go down the tree, the lettered classes (e.g. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how far, or if, I need to drill down further to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
#authentication might not be necessary; it was a test, and I get the same results either way
site = requests.get(url, cookies={'X-Auth-Token': 'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})
#If I use this instead, I get an error (a list can't be indexed with 'href')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (that is the HTML you get when you do requests.get()); I've already checked, and it is indeed built by JavaScript, so the anchor isn't in the raw source. To resolve this, you have to use Selenium to render the JavaScript on the page; then you can get the page source from Selenium after it has constructed the elements in the DOM.
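A minimal sketch of that approach, reusing the data-test-id from the question (the fixed sleep is a placeholder; a WebDriverWait on the link would be sturdier). Note also that find_all(...)[('href')] in the question indexes the result list with a string, which raises a TypeError; read ['href'] from each tag instead:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.fanduel.com/contests/mlb/96")
time.sleep(5)  # let the JavaScript build the contest cards

soup = BeautifulSoup(driver.page_source, "html.parser")
for a in soup.find_all("a", {"data-test-id": "ContestCardEnterLink"}):
    print(a.get("href"))  # .get() avoids a KeyError if an anchor lacks href

driver.quit()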

Scrapy/Python web table missing closing TR / TD Tags

I'm redoing a data scraping project. There's a website with a table of data that is missing most or all of the closing TR and TD tags. When I first did the project in JS, I just copied the site and then split the data into arrays of rows whenever it encountered a new <tr> tag.
I want to try to rebuild this project using Python/Scrapy, and I'm just wondering if there is an easier way to access the data using selectors. Also, I'm a little confused about how to split the data when response.data.split('<tr>') doesn't work.
I understand your problem. You can use BeautifulSoup's select method to query it successfully. I made a demo for you; hope this helps.
import requests
from bs4 import BeautifulSoup

url = 'http://killedbypolice.net/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.select('table tr')  # the parser has already repaired the rows
print(soup.select('table')[0])
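To see why this works, here is a tiny self-contained sketch: a lenient parser implies the missing closing tags while building the tree, so the usual selectors work on the repaired rows (this uses lxml, which needs pip install lxml; html5lib is even more spec-faithful):

from bs4 import BeautifulSoup

# A table like the one described: no closing </tr> or </td> tags at all.
broken = "<table><tr><td>a<td>b<tr><td>c<td>d</table>"
soup = BeautifulSoup(broken, "lxml")

for row in soup.select("tr"):
    print([td.get_text() for td in row.select("td")])
# ['a', 'b']
# ['c', 'd']  <- the parser supplied the missing end tags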

Website scraping with python3 & beautifulsoup 4

I'm starting to make progress on a website scraper, but I've run into two snags. Here is the code first:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.nytimes.com")
soup = BeautifulSoup(r.text, "html.parser")
headlines = soup.find_all(class_="story-heading")
for headline in headlines:
    print(headline)
Questions
Why do you have to use find_all(class_="blahblahblah")
instead of just find_all("blahblahblah")? I realize that story-heading is a class of its own, but can't I just search all the HTML using find_all and get the same results? The BeautifulSoup notes show find_all("a") returning all the anchor tags in an HTML document; why won't find_all("story-heading") do the same?
Is it because, if I try to do that, it will just look for "story-heading" as a tag name within the HTML and return those instances? I am trying to get Python to return everything in that tag. That's my best guess.
Why do I get all this extra junk code? Shouldn't my request just show me everything within the story-heading tag? I'm getting a lot more text than what I'm trying to specify.
Beautiful Soup lets you use CSS selectors; look in the docs for "CSS selector". Note that find_all's first positional argument matches tag names, so find_all("story-heading") looks for a <story-heading> tag, not the class; CSS selectors go through select.
You can find all elements with class "story-heading" like so:
soup.select(".story-heading")
If instead you're looking for ids, just do
soup.select("#id-name")
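A short sketch tying both questions together: the class_ keyword and the CSS selector find the same elements, and get_text() strips out the extra markup the question calls "junk" (the URL and class come from the question):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.nytimes.com")
soup = BeautifulSoup(r.text, "html.parser")

by_keyword = soup.find_all(class_="story-heading")  # class keyword argument
by_selector = soup.select(".story-heading")         # equivalent CSS selector
assert len(by_keyword) == len(by_selector)

for heading in by_selector:
    print(heading.get_text(strip=True))  # headline text only, no nested markup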
