I've created a script in Python to parse the tabular content from a website. My script can already parse the content from its landing page. However, there is a NEXT PAGE button at the bottom of that page which unfolds 50 more results when clicked, and so on.
I've tried with this (it scrapes the first 50 results):
import requests
from bs4 import BeautifulSoup
site_link = 'https://indiarailinfo.com/trains/passenger/0/0/0/0'
res = requests.get(site_link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("div[style='line-height:20px;']"):
    tds = [elem.get_text(strip=True) for elem in items.select("div")]
    print(tds)
How can I get all of the tabular content from that page, exhausting the NEXT PAGE button, using requests?
PS: I know how to unfold the content using Selenium, so a solution involving a browser simulator is not what I'm after.
Clicking the next button actually fires an XHR to https://indiarailinfo.com/trains/passenger/0/1?i=1&&kkk=1571329558457
<button class="nextbtn" onclick="javascript:getNextTrainListPageBare($(this).parent(),'/trains/passenger/0/1?i=1&');"><div>NEXT PAGE<br>the next 50 Trains will appear below</div></button>
So all you have to do is take the URL from the 'onclick' attribute, compose a full URL, and do an HTTP GET with requests.
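If you'd rather not hard-code that path, you can pull it straight out of the onclick attribute. A minimal sketch, assuming the button keeps the class and handler shown above:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

site_link = 'https://indiarailinfo.com/trains/passenger/0/0/0/0'
soup = BeautifulSoup(requests.get(site_link).text, "lxml")

# the handler looks like: getNextTrainListPageBare($(this).parent(),'/trains/passenger/0/1?i=1&');
onclick = soup.select_one("button.nextbtn")["onclick"]
next_path = re.search(r"'([^']+)'", onclick).group(1)  # grab the quoted relative path
res = requests.get(urljoin(site_link, next_path))      # fetches the next 50 rows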
The returned data will look like this:
https://pastebin.com/Nk0E5vHH
Now just use BeautifulSoup and extract the data you need.
Code below (replace 10 with the number of pages you need):
import requests
from bs4 import BeautifulSoup
site_link = 'https://indiarailinfo.com/trains/passenger/0/{}'
for x in range(10):
    url = site_link.format(x)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    print('Data for url: {}'.format(url))
    for items in soup.select("div[style='line-height:20px;']"):
        tds = [elem.get_text(strip=True) for elem in items.select("div")]
        print(tds)
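If you don't know the page count up front, you can keep requesting pages until one comes back empty. A sketch of the same loop, assuming an out-of-range page simply returns no result rows:
import requests
from bs4 import BeautifulSoup

site_link = 'https://indiarailinfo.com/trains/passenger/0/{}'
page = 0
while True:
    soup = BeautifulSoup(requests.get(site_link.format(page)).text, "lxml")
    rows = soup.select("div[style='line-height:20px;']")
    if not rows:
        break  # no result rows left, so the pages are exhausted
    for items in rows:
        print([elem.get_text(strip=True) for elem in items.select("div")])
    page += 1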
I'm trying to scrape the number of visitors to my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?
It's there in JSON format in a <script> tag of the HTML. You just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# isolate the JavaScript object assigned to `var data` in the third <script> tag
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'", '"')
# the object literal ends with a trailing comma, which json.loads rejects: drop it and re-close the brace
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('\xa0', ' ')  # normalise non-breaking spaces
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)
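The split/replace chain above is fairly brittle; a regex can isolate the object literal in one pass instead. A hedged sketch, assuming the script still assigns the object to var data and only uses single quotes and trailing commas as shown:
import re
import json
import requests

url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
html = requests.get(url).text

# capture everything from the opening brace to the '};' that ends the statement
raw = re.search(r'var data = (\{.*?\});', html, re.DOTALL).group(1)
raw = raw.replace("'", '"')               # single -> double quotes for JSON
raw = re.sub(r',\s*([}\]])', r'\1', raw)  # strip trailing commas
data = json.loads(raw.replace('\xa0', ' '))
print(data['REA']['count'], 'of', data['REA']['capacity'])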
You're not doing anything wrong; the issue is that the website populates the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript, since it is a pure HTTP tool. I would recommend checking out something like Selenium, which is more robust and can wait for the JavaScript to load before scraping the HTML.
You can try the requests_html module to get dynamic values that are calculated by JavaScript. I tried the logic below and it worked for me on your site.
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
In the dev tools, under one of the <script> tags, you can see that many of those figures are generated after the page load by the JavaScript function showGym(). To allow those figures to be generated, you could use a browser-driver tool like webbot or Selenium, which can wait on pages long enough for the JavaScript to execute and populate those fields. It might be possible to get requests to do that, but I don't know, as I've only used webbot when hitting problems like these; it's very easy to use.
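For completeness, a minimal Selenium sketch of that wait, assuming the counter keeps the id "count" from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = ('https://portal.rockgympro.com/portal/public/'
       'c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644')
driver = webdriver.Chrome()
driver.get(url)
# block (up to 10s) until the JavaScript writes a non-empty value into the span
count = WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, 'count').text.strip())
print(count)
driver.quit()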
I am trying to create a bot that scrapes all the image links from a site and stores them somewhere else, so I can download the images afterwards.
from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
soup = bs(driver.page_source, 'html.parser')
gallery = soup.find_all(class_="image-src")
data = gallery[0]
for x in range(len(gallery)):
    print("TAG:", sep="\n")
    print(gallery[x], sep="\n")
if page.status_code == 200:
    print("Request OK")
This returns all the link tags I wanted, but I can't find a way to strip the HTML or copy only the links to a new list. Here is an example of a tag I get:
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>
So, how do I get only the links within the gallery[] list?
What I want to do afterwards is take these links and change the /smaller_square/ directory to /large/, which is the one that has the high-resolution image.
The page loads its data through AJAX, so through the network inspector we can see where the call is made. This snippet will obtain all the image links found on page 1, sorted by trending:
import requests
import json
url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)
for data in json_data['data']:
    print(data['cover']['medium_image_url'])
Prints:
https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229
... and so on.
If you print the variable json_data, you will see other information the page sends (like the icon image URL, total_count, data about the author, etc.).
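Since the question also wants the high-resolution variants, the links can be collected into a plain list and the size directory swapped; a sketch, assuming the CDN serves the same filename under /large/ (the question reports it does for /smaller_square/):
import requests

url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
json_data = requests.get(url).json()

# collect the cover links, then swap the size directory
links = [d['cover']['medium_image_url'] for d in json_data['data']]
large_links = [link.replace('/medium/', '/large/') for link in links]
print(large_links[:3])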
You can access the attributes using a key-value lookup.
Ex:
from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
#or
print(soup.find("div", class_="image-src").attrs['image-src'])
Output:
https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301
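And to get only the links into a new list, as the question asks, the same lookup works inside a comprehension over every matched tag (using driver.page_source from the question's code):
from bs4 import BeautifulSoup as bs

soup = bs(driver.page_source, 'html.parser')  # driver from the question's code
links = [div['image-src'] for div in soup.find_all('div', class_='image-src')]
# swap the size directory to get the high-resolution variants
large = [link.replace('/smaller_square/', '/large/') for link in links]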
I am creating a web crawler and I want to scrape the user reviews from IMDb. It's easy to directly get the first 10 reviews and ratings from the original page, for example http://www.imdb.com/title/tt1392170/reviews. The problem is that to get all the reviews, I need to press "load more" so that more reviews are shown, while the URL address doesn't change! So I don't know how I can get all the reviews in Python 3. What I use now are requests and bs4.
My code now:
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup
url_link='http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
html=urlopen(url_link)
content_bs = BeautifulSoup(html, 'html.parser')
for b in content_bs.find_all('div', class_='text'):
    print(b)
for rate_score in content_bs.find_all('span', class_='rating-other-user-rating'):
    print(rate_score)
You can't press the load more button without initiating a click event, and BeautifulSoup doesn't have that capability. However, what you can do to get the full content is something like what I've demonstrated below. It will fetch all the review titles along with the reviews:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
# extract the link leading to the page containing everything available here
main_content = urljoin(url, soup.select(".load-more-data")[0]['data-ajaxurl'])
response = requests.get(main_content)
broth = BeautifulSoup(response.text,"lxml")
for item in broth.select(".review-container"):
    title = item.select(".title")[0].text
    review = item.select(".text")[0].text
    print("Title: {}\n\nReview: {}\n\n".format(title, review))
I want to scrape information from this page.
Specifically, I want to scrape the table which appears when you click "View all" under the "TOP 10 HOLDINGS" (you have to scroll down on the page a bit).
I am new to web scraping and have tried using BeautifulSoup to do this. However, there seems to be an issue because of the "onclick" function I need to take into account. In other words: the HTML code I scrape directly from the page doesn't include the table I want to obtain.
I am a bit confused about my next step: should I use something like Selenium, or can I deal with the issue in an easier/more efficient way?
Thanks.
My current code:
from bs4 import BeautifulSoup as Soup
import requests
my_url = 'http://www.etf.com/SHE'
page = requests.get(my_url)
htmltxt = page.text
soup = Soup(htmltxt, "html.parser")
print(soup)
You can get a JSON response from the API: http://www.etf.com/view_all/holdings/SHE. The table you're looking for is located under the 'view_all' key.
import requests
from bs4 import BeautifulSoup as Soup
url = 'http://www.etf.com/SHE'
api = "http://www.etf.com/view_all/holdings/SHE"
headers = {'X-Requested-With':'XMLHttpRequest', 'Referer':url}
page = requests.get(api, headers=headers)
htmltxt = page.json()['view_all']
soup = Soup(htmltxt, "html.parser")
data = [[td.text for td in tr.find_all('td')] for tr in soup.find_all('tr')]
print('\n'.join(': '.join(row) for row in data))
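If you would rather have the table as labeled rows and columns, the same HTML fragment can go straight into pandas; a sketch, assuming the fragment contains a well-formed <table>:
from io import StringIO

import pandas as pd
import requests

api = "http://www.etf.com/view_all/holdings/SHE"
headers = {'X-Requested-With': 'XMLHttpRequest', 'Referer': 'http://www.etf.com/SHE'}
htmltxt = requests.get(api, headers=headers).json()['view_all']

# read_html returns one DataFrame per <table> found in the markup
df = pd.read_html(StringIO(htmltxt))[0]
print(df.head())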
I was happily scraping property data from www.century21.com with Python requests and BeautifulSoup. There is pagination on the site and I was able to scrape the results of the first page, but when I tried to do the same for the second page, I got the data of the first page as output.
Here is an example of first page results: http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=0
And here are the results of the second page for the same search term: http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=10
I noticed that when I manually click the second URL to open it in the browser, the results of the first URL show for a few seconds, and then the page seems to fully load and show the results of the second page.
As you can imagine, requests is grabbing the results of the first load of the second page, which happen to be the same as the results of the first page. The same happens if I request the third page's results, the fourth, and so on.
Below is my code. If you run it, it will print the address of the first property of the first page twice.
Any idea how to grab the correct page results?
from bs4 import BeautifulSoup
import requests
page1=requests.get("http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=0")
c1=page1.content
soup1=BeautifulSoup(c1,"html.parser").find_all("div",{"class":"propertyRow"})[0].find_all("span",{"class":"propAddressCollapse"})[0].text
page2=requests.get("http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=10")
c2=page2.content
soup2=BeautifulSoup(c2,"html.parser").find_all("div",{"class":"propertyRow"})[0].find_all("span",{"class":"propAddressCollapse"})[0].text
print(soup1)
print(soup2)
Make requests to the "search.c21" endpoint, get the HTML string from the "list" key, and parse it:
from bs4 import BeautifulSoup
import requests
page1 = requests.get("http://www.century21.com/search.c21?lid=COHADA&t=0&s=0&subView=searchView.AllSubView")
c1 = page1.json()["list"]
soup1 = BeautifulSoup(c1, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {
"class": "propAddressCollapse"})[0].text
page2 = requests.get("http://www.century21.com/search.c21?lid=COHADA&t=0&s=10&subView=searchView.AllSubView")
c2 = page2.json()["list"]
soup2 = BeautifulSoup(c2, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {
"class": "propAddressCollapse"})[0].text
print(soup1)
print(soup2)
Prints:
5489 Sr 235
202 W Highland Ave
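The same endpoint generalizes to a full pagination loop: bump s by 10 per page and stop once no propertyRow divs come back (a sketch built on the endpoint above):
import requests
from bs4 import BeautifulSoup

base = "http://www.century21.com/search.c21?lid=COHADA&t=0&s={}&subView=searchView.AllSubView"
s = 0
while True:
    html = requests.get(base.format(s)).json()["list"]
    rows = BeautifulSoup(html, "html.parser").find_all("div", {"class": "propertyRow"})
    if not rows:
        break  # past the last page of results
    for row in rows:
        print(row.find_all("span", {"class": "propAddressCollapse"})[0].text)
    s += 10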