I am trying to build a simple web scraper that extracts information from an HTML page. It's simple, but I have a problem I can't seem to solve:
When I download the HTML page myself and parse it with BeautifulSoup, it parses everything and gives me all the data. That works, but I don't want to download the page by hand; I want to fetch it from a link instead, and that doesn't seem to work. Whenever I open the link with the "urlopen" function and parse the page with BeautifulSoup, some lists and tables from the HTML are missing. These tables appear when I look at the page online using "Inspect Element", and they also appear when I download the HTML page myself, but they never appear when I use "urlopen". I even tried encoding POST data and passing it as an argument to the function, but that doesn't work either.
import bs4
from urllib.request import urlopen as uReq
from urllib.parse import urlencode as uEnc
from bs4 import BeautifulSoup as soup
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
#data = {'tid':'RB961555017SG'}
#sdata = uEnc(data)
#sdata = bytearray(sdata, 'utf-8')
uClient = uReq(my_url, timeout=2) # opening url or downloading the webpage
page_html = uClient.read() # saving html file in page_html
uClient.close() # closing url or connection idk properly
page_soup = soup(page_html, "html.parser") # parsing the html file and saving
updates = page_soup.findAll("div",{"class":"col-sm-12"})
#updates = page_soup.findAll("ol", {})
print(updates)
These tables contain the information I need. Is there any way I can fix this?
A request made with urlopen works a bit differently from a browser. For example, it does not actually run JavaScript.
In this case the table with the info is generated by a script rather than hardcoded in the HTML. You can see the actual source code by typing "view-source:" followed by the URL into your browser's address bar:
view-source:https://sp.com.sa/en/tracktrace/?tid=RB961555017SG
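If you would rather check this from Python than from the browser, you can search the raw response for something you expect to see. Here is a minimal sketch using only the standard library. The `diagnose` helper, the two sample byte strings, and the marker text are hypothetical stand-ins; in practice `raw` would be whatever `uClient.read()` returned.

```python
import re


def diagnose(raw_html: bytes, marker: str) -> dict:
    """Report whether `marker` occurs in the raw HTML and count its <script> tags."""
    text = raw_html.decode("utf-8", errors="replace")
    return {
        "marker_found": marker in text,
        "script_tags": len(re.findall(r"<script\b", text, flags=re.IGNORECASE)),
    }


# Hypothetical samples: what the server sends vs. what the browser shows after scripts run.
raw = b'<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered = b'<html><body><div id="app"><table class="table-shipment"></table></div></body></html>'

print(diagnose(raw, "table-shipment"))
print(diagnose(rendered, "table-shipment"))
```

If the marker is missing from the raw bytes but visible in "Inspect Element", the content is almost certainly being added by a script.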
So we need to run that script somehow. The easiest way is to use Selenium, which drives a real browser through its driver. Then simply take the loaded HTML and run it through BS4. I also noticed that there is a more specific tag you can target than "col-sm-12". Hope it helps :)
import bs4
from urllib.request import urlopen as uReq
from urllib.parse import urlencode as uEnc
from bs4 import BeautifulSoup as soup
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
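from selenium import webdriver
from selenium.webdriver.chrome.service import Service # Selenium 4+; older versions took the path positionally
import time
chrome_path = "path of chrome driver that fits your current browser version"
driver = webdriver.Chrome(service=Service(chrome_path))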
driver.get(my_url)
time.sleep(5) #to make sure it's loaded
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser") # parsing the html file and saving
updates = page_soup.findAll("table",{"class":"table-shipment table-striped"}) # this table is more specific to the data you want than col-sm-12, so I'd rather use that
print(updates)
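A fixed `time.sleep(5)` either wastes time or is too short on a slow connection. As an alternative, here is a rough sketch that uses Selenium's explicit waits to block until the element actually exists. The function name `fetch_rendered_html` and the default timeout are my own choices, and the selector in the usage note is only derived from the class names above; I have not verified it against the live page.

```python
def fetch_rendered_html(url: str, css_selector: str, timeout: float = 15.0) -> str:
    """Load `url` in Chrome and return the page source once `css_selector` is present."""
    # Imported here so the function can be defined even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes a matching chromedriver can be located
    try:
        driver.get(url)
        # Blocks until the element exists, or raises TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

You would call it as `page_html = fetch_rendered_html(my_url, "table.table-shipment")` and then pass `page_html` to BeautifulSoup exactly as before.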
Related
I'm trying to scrape the number of visitors at my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?
It's there in JSON format in a <script> tag of the HTML. You just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'",'"')
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace(' ', ' ')
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)
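The chain of `split` and `replace` calls above depends on the exact position of the script tag and on quote characters. A slightly sturdier sketch is to locate the assignment with a regular expression and parse the literal with `ast.literal_eval`. This only works when the JavaScript object happens to be a valid Python literal too (quoted keys, no `true`/`false`/`null`). The `extract_js_object` helper is hypothetical, and `sample_html` is an assumed layout that mirrors the fields used above, not a copy of the real page.

```python
import ast
import re


def extract_js_object(html: str, var_name: str) -> dict:
    """Return the object literal assigned to `var_name` in an inline script."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # literal_eval accepts single-quoted strings and trailing commas, unlike json.loads.
    return ast.literal_eval(match.group(1))


# Assumed page layout, for illustration only.
sample_html = """
<script>
var data = {
  'REA': {
    'capacity': 220,
    'count': 58,
    'lastUpdate': 'Last updated: now (5:20 PM)'
  },
};
</script>
"""

data = extract_js_object(sample_html, "data")
print(f"{data['REA']['count']} of {data['REA']['capacity']} Climbers")
```

The regular expression stops at the first `}` that is followed by `;`, so it would still break if a string value contained that sequence.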
You're not doing anything wrong. The issue is that the website populates the <span> element using JavaScript, which only runs in a browser after the page has loaded.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.
You can try the requests_html module to get dynamic values that are calculated by JavaScript. I tried the logic below and it worked for me on your site.
from bs4 import BeautifulSoup
import time
from requests_html import HTMLSession
url="Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
In the dev tools, under one of the <script> tags, you can see that many of those figures are generated after the page loads by the JavaScript function showGym(). To let those figures generate, you could use a browser driver tool like webbot or Selenium, which can wait on the page long enough for the JavaScript to execute and populate those fields. It might be possible to have requests do that, but I don't know; I've only used webbot for problems like these because it's very easy to use.
I am working on scraping a website; the link is "https://homeshopping.pk/search.php?q=samsung%20phones". I am having difficulty accessing one of the div classes. I think it is not formatted properly. I am asking this question to confirm whether it really is badly formatted or whether I am doing something wrong.
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
page_url = "https://homeshopping.pk/search.php?q=samsung%20phones"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
# finds each product from the store page
container1 = page_soup.find_all("div", {"class": "findify-container findify-search findify-widget-2"})
print(len(container1))
print(container1)
This is where the page loads the products from: https://api-v3.findify.io/v3/search?user[uid]=TW1bcavcZKWeb32z&user[sid]=6kn0FcKb4QjgMz60&user&t_client=1584424566753&key=cae15cfe-508b-41d1-a019-161c02ffd97c&q=samsung%20phones
Are those params fixed? I don't have the slightest idea. Can you parse the response? Absolutely, but parse it with json.loads, not BeautifulSoup.
import requests, json
source = requests.get('https://api-v3.findify.io/v3/search?user[uid]=TW1bcavcZKWeb32z&user[sid]=6kn0FcKb4QjgMz60&user&t_client=1584424566753&key=cae15cfe-508b-41d1-a019-161c02ffd97c&q=samsung%20phones')
j = json.loads(source.content.decode())
for item in j["items"]:
    print(item["title"])
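If you want to vary the search term, one option is to build the query string with `urlencode` instead of pasting the whole URL. This is only a sketch: I'm assuming the endpoint accepts just `key` and `q`, which I have not verified (the `user[...]` and `t_client` params may be required). The helper names and the sample payload are made up for illustration; only the `items`/`title` shape comes from the code above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

API_BASE = "https://api-v3.findify.io/v3/search"  # endpoint taken from the URL above


def build_search_url(query: str, key: str) -> str:
    """Build a search URL from `key` and `q` only; the other params are left out."""
    return API_BASE + "?" + urlencode({"key": key, "q": query})


def item_titles(payload: str) -> list:
    """Return the `title` of every entry under `items` in a JSON response body."""
    return [item["title"] for item in json.loads(payload).get("items", [])]


# Hypothetical response body with the same items/title shape as above.
sample_payload = '{"items": [{"title": "Phone A"}, {"title": "Phone B"}]}'

url = build_search_url("samsung phones", "YOUR-API-KEY")
print(url)
print(parse_qs(urlsplit(url).query))  # round-trip check of the encoded query
print(item_titles(sample_payload))
```

With a real response you would pass `source.text` from `requests.get(url)` to `item_titles`.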
I am new to web scraping and I want to scrape all the text content of the home page only.
This is my code, but it is not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print "full_text" it gives me a lot of HTML content, but not all of it: when I press Ctrl+F and search for "traiteurcheminfaisant#hotmail.com", the email address in the footer of the home page, it is not found in full_text.
Thank you for helping!
A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
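from selenium import webdriver
from selenium.webdriver.firefox.service import Service # Selenium 4+
import time
# Selenium 4 takes the driver path via a Service object instead of executable_path
driver = webdriver.Firefox(service=Service('/your/path/to/geckodriver'))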
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can search for the email.
from urllib.request import urlopen
try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding = "UTF8", errors='ignore')
    # str.find returns the index of the match, or -1 if the text is absent
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except Exception:
    print("Cannot open webpage")
I'm trying out Python's BeautifulSoup library to improve my skills, and I realized I need some help.
import requests
from bs4 import BeautifulSoup
url = "https://www.basketball-reference.com/players/j/jamesle01.html"
r = requests.get(url)
soup = BeautifulSoup(r.content,"html.parser")
data = soup.find_all("table",{"class":"row_summable sortable stats_table now_sortable"})
print(data)
The html you download is not exactly the same as the html displayed on the webpage. At a certain point whilst loading the webpage, javascript adds the now_sortable class to the table in your browser.
When you download the page using requests, this bit of javascript is never performed, and therefore you don't have the now_sortable class in your table, and that's why you can't find the element.
Try changing your code to:
data = soup.find_all("table",{"class":"row_summable sortable stats_table"})
A general tip: when downloading a file using requests, try saving the page you've requested locally so you can have a proper look into it:
with open('local_page.html', 'w', encoding='utf-8') as fout:
    fout.write(r.text)
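To avoid depending on classes that JavaScript may or may not have added, you can match on a subset of the classes instead of the full class string. In BeautifulSoup the CSS-selector form `soup.select("table.row_summable.stats_table")` does this. The sketch below shows the same idea with only the standard library so it runs anywhere. The `ClassMatcher` and `find_ids` names, the sample markup, and the `per_game` id are hypothetical.

```python
from html.parser import HTMLParser


class ClassMatcher(HTMLParser):
    """Collect the ids of `tag` elements whose class list includes all `required` classes."""

    def __init__(self, tag: str, required: set):
        super().__init__()
        self.tag = tag
        self.required = required
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        # Subset test: extra classes on the element do not prevent a match.
        if tag == self.tag and self.required <= classes:
            self.matches.append(attr_map.get("id"))


def find_ids(html: str, tag: str, required: set) -> list:
    """Return the ids of matching elements in `html`."""
    matcher = ClassMatcher(tag, required)
    matcher.feed(html)
    return matcher.matches


# Hypothetical markup: the table as served, and after JavaScript adds a class.
served = '<table id="per_game" class="row_summable sortable stats_table"></table>'
in_browser = '<table id="per_game" class="row_summable sortable stats_table now_sortable"></table>'

for html in (served, in_browser):
    print(find_ids(html, "table", {"row_summable", "stats_table"}))
```

Both versions of the table match, because the check ignores any classes beyond the ones you ask for.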
You could just use Selenium to render the page and then pull the html:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.basketball-reference.com/players/j/jamesle01.html"
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,"html.parser")
data = soup.find_all("table",{"class":"row_summable sortable stats_table now_sortable"})
print(data)
I want to scrape the Google translate website and get the translated text from it using Python 3.
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import Request as uReq
from urllib.request import urlopen as open
my_url = "https://translate.google.com/#en/es/I%20am%20Animikh%20Aich"
req = uReq(my_url, headers={'User-Agent':'Mozilla/5.0'})
uClient = open(req)
page_html = uClient.read()
uClient.close()
html = soup(page_html, 'html5lib')
print(html)
Unfortunately, I am unable to find the required information in the parsed web page.
In Chrome's "Inspect" view, the translated text is shown inside:
<span id="result_box" class="short_text" lang="es"><span class="">Yo soy Animikh Aich</span></span>
However, when I search for it in the parsed HTML, this is what I find:
<span class="short_text" id="result_box"></span>
I have tried parsing with each of html5lib, lxml, and html.parser, and I have not been able to find a solution.
Please help me with the issue.
You could use a dedicated Python API instead:
import goslate
gs = goslate.Goslate()
print(gs.translate('I am Animikh Aich', 'es'))
Yo soy Animikh Aich
Try the following to get the desired content:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://translate.google.com/#en/es/I%20am%20Animikh%20Aich")
soup = BeautifulSoup(driver.page_source, 'html5lib')
item = soup.select_one("#result_box span").text
print(item)
driver.quit()
Output:
Yo soy Animikh Aich
JavaScript is modifying the HTML after it loads. urllib can't run JavaScript, so you'll have to use something like Selenium to get the data that you want.