Using Python to scrape point information from embedded google map - python

I am trying to use the requests library + beautiful soup to pull information on the antenna points from the map shown on this website.
http://www.sites.bipt.be/
My original plan was to iterate through site numbers, pull out the lat/long information that appears in the left-hand panel when a point is clicked, and display that data in Arc. So far I have accessed the element where that information is located when the point is clicked ( id = selectedsite ... ). But the element turns up empty in Python, seemingly because nothing has been clicked?
This is my first time web scraping and I have limited HTML knowledge. If there is another approach that would be better, or any pointers you could offer, that would be greatly appreciated :)
import requests
from bs4 import BeautifulSoup
#api_key = 'MY_API_KEY'
#Prettify HTML
source = requests.get("http://www.sites.bipt.be/").text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
# Element (id='selectedsite') where information on the selected site is located - turns up empty
div = soup.find(id='selectedsite')
print(div.prettify())

This webpage loads its data with JavaScript. BeautifulSoup only parses the HTML that requests downloads, so it cannot click on an element in a webpage. You need a browser automation tool such as Selenium for that.

Related

Cannot find the text I want to scrape in the Page Source

Simple question. Why is it that when I inspect element I see the data I want embedded within the JS tags - but when I go directly to Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'  # requests needs the scheme (https://)
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay uses JavaScript to load content into the page. A workaround would be to use something like Playwright or Selenium; I personally prefer the former. It uses a Chromium browser to actually fetch the page contents, and hence runs the JavaScript in the process.
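A minimal sketch of the Playwright approach described above, assuming the `playwright` package and its Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The helper name `fetch_rendered_html` is made up for illustration, and nothing here guarantees that eBay serves the listing description to an automated browser.

```python
def fetch_rendered_html(url, timeout_ms=30000):
    """Return the page HTML after a real browser has run its JavaScript."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles, a rough
            # heuristic for "scripts have finished loading content".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


# Example call (needs network access and an installed browser):
# html = fetch_rendered_html("https://www.ebay.com/itm/272037717929")
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` exactly as in the question. Note that `page.content()` covers the main document only, so text that lives inside an iframe may still be missing.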

Problem while scraping twitter using beautiful soup

I have a problem scraping a heavy website like Facebook or Twitter, which has a lot of HTML tags, using Beautiful Soup and the requests library.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://twitter.com/elonmusk').text
soup = BeautifulSoup(html_text, 'lxml')
elon_tweet = soup.find_all('span', class_='css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')
print(elon_tweet)
[Screenshots: the tweet and its corresponding span; the full span]
When the code is executed, it returns an empty list.
I'm new to web scraping, so a detailed explanation would be welcome.
The problem is that Twitter loads its content dynamically. When you make a request, the server first returns only the initial HTML, which you can inspect by entering "view-source:https://twitter.com/elonmusk" in your browser's address bar.
Later, after the page has loaded, the JavaScript is executed and adds the full content of the page.
With requests from Python you can only scrape the content available at "view-source:https://twitter.com/elonmusk", and as you can see, the element that you're trying to scrape is not there.
To scrape this element you will need to use Selenium, which lets you drive a browser directly from Python and therefore wait the few extra seconds needed for the whole content to load. You can find a good guide on this here: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/
Alternatively, if you don't want all this trouble, you can use an API that supports JavaScript rendering.
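The Selenium route from this answer can be sketched roughly as follows, assuming Selenium 4 and a local Chrome installation. The function name `fetch_after_wait` and the idea of passing a CSS selector to wait for are illustrative choices, not something the answer specifies.

```python
def fetch_after_wait(url, css_selector, timeout_s=15):
    """Load url in headless Chrome; return the HTML once css_selector appears."""
    # Lazy imports: Selenium is only needed when the function is called.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element;
        # raises TimeoutException if nothing matches within timeout_s seconds.
        WebDriverWait(driver, timeout_s).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


# Example call (needs network access and Chrome; the selector is a placeholder):
# html = fetch_after_wait("https://twitter.com/elonmusk", "article span")
```

Waiting for a specific element is usually more reliable than sleeping for a fixed number of seconds. Generated class names such as `css-901oao` tend to change between deployments, so a structural selector is likely a safer thing to wait for.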

Scraping youtube to get dynamically loaded content

I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of the video titles on the page. But when I run it I just get an empty result back. Not even one title shows up in the result.
I've searched around and some results point out that it's due to the website loading content dynamically with javascript. Is this the case? How do I go about solving this? Is it possible to do without selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without selenium?
Services often have APIs that make automation easier than scraping their sites. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You need a key to use it. To get started I suggest following the YouTube Data API quickstart; note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
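As a rough sketch of what the API route might look like with google-api-python-client: the function names below are made up for illustration, and the API key and channel ID are values you would have to supply yourself (the channel ID behind the `schafer5` user page is not given here).

```python
def titles_from_response(response):
    """Pull video titles out of a YouTube Data API v3 search response dict."""
    return [item["snippet"]["title"] for item in response.get("items", [])]


def fetch_channel_video_titles(api_key, channel_id, max_results=25):
    """Ask the API for a channel's most recent videos (needs network and a key)."""
    # Lazy import so the parsing helper above works without the client library.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    request = youtube.search().list(
        part="snippet",
        channelId=channel_id,
        type="video",
        order="date",
        maxResults=max_results,
    )
    return titles_from_response(request.execute())


# Offline check with a hand-written response in the same shape.
sample = {"items": [{"snippet": {"title": "First video"}},
                    {"snippet": {"title": "Second video"}}]}
print(titles_from_response(sample))
```

Keeping the parsing in its own function means it can be checked without a key or network access; only `fetch_channel_video_titles` talks to the service.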
I totally agree with @Daweo, and that's the right way to get data from a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.

Webscraping: Table not included in BeautifulSoup Page

I am trying to scrape a table of company info from the table on this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/
I can see the table contents when using chrome's dev tool element inspector, but when I request the page in my script, the contents of the table are gone... just with no content.
Any idea how I can get that sweet, sweet content?
Thanks
Code is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
page
You can find the API in the network traffic tab: it's calling
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=
and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters, but it seems like only year affects the resulting data set, i.e.
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&year=2018&analysis=1
should give you the same result as the query above.
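A hedged sketch of calling such an endpoint directly with requests. The parameter names come from the shortened URL above; the function names are made up, and the `endpoint` argument is deliberately left to the caller, since the exact request URL should be copied from the browser's network tab.

```python
from urllib.parse import urlencode


def build_disclosure_params(year):
    """Query parameters taken from the shortened URL in the answer."""
    return {"isabstract": 0, "year": year, "analysis": 1}


def fetch_disclosures(endpoint, year):
    """GET the endpoint with the parameters above and decode the JSON body."""
    import requests  # lazy import: only needed for the network call

    response = requests.get(
        endpoint, params=build_disclosure_params(year), timeout=30
    )
    response.raise_for_status()
    return response.json()


# Show the query string that would be appended to the endpoint.
print(urlencode(build_disclosure_params(2018)))
```

Passing the parameters as a dict lets requests handle the URL encoding, and looping over `year` would be a one-line change if several years are needed.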
Based on the network traffic shown in the dev tools, the content isn't directly in the HTML but is loaded dynamically by the ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example, once the loading element has disappeared).

Beautiful Soup 4 finding text within table

I have been trying to use BS4 to scrape from this web page. I cannot find the data I want (player names in the table, ie, "Claiborne, Morris").
When I use:
soup = BeautifulSoup(r.content, "html.parser")
PlayerName = soup.find_all("table")
print (PlayerName)
None of the players' names are even in the output; it only shows a different table.
When I use:
soup = BeautifulSoup(r.content, 'html.parser')
texts = soup.findAll(text=True)
print(texts)
I can see them.
Any advice on how to dig in and get player names?
The table you're looking for is dynamically filled by JavaScript when the page is rendered. When you retrieve the page using e.g. requests, it only retrieves the original, unmodified page. This means that some elements that you see in your browser will be missing.
You can find the player names in your second snippet of code because they are contained in the page's JavaScript source, as JSON. However, you won't be able to retrieve them with BeautifulSoup's element search, as it won't parse the JavaScript.
The best option is to use something like Selenium, which mimics a browser as closely as possible and will execute JavaScript code, thereby rendering the same page content as you would see in your own browser.
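If the names really do sit in the page source as JSON, as described above, one lighter-weight alternative is to cut that JSON out of the script text and decode it with the standard library. This is only a sketch: the sample HTML, the variable name `players`, and the helper `extract_json_after` are all invented for illustration, and it only works when the JavaScript literal is also valid JSON.

```python
import json
import re

# Hypothetical page source; the variable name and data layout are assumptions.
SAMPLE_HTML = """
<html><body>
<table id="roster"></table>
<script>
var players = [{"name": "Claiborne, Morris"}];
renderTable(players);
</script>
</body></html>
"""


def extract_json_after(html, variable_name):
    """Decode the JSON literal assigned to variable_name in inline JavaScript."""
    match = re.search(r"\b" + re.escape(variable_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows it.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


players = extract_json_after(SAMPLE_HTML, "players")
print([p["name"] for p in players])
```

On a real page you would pass `r.text` instead of `SAMPLE_HTML` and look up the actual variable name in the page source first.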
