I am trying to get the section-facts-description-text in Google Maps.
I have tried this code already:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.google.co.id/maps/place/Semarang,+Kota+Semarang,+Jawa+Tengah/#-7.0247703,110.3488077,12z/data=!3m1!4b1!4m5!3m4!1s0x2e708b4d3f0d024d:0x1e0432b9da5cb9f2!8m2!3d-7.0051453!4d110.4381254"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()

for strong_tag in soup.find_all('span', {'class': 'section-facts-description-text'}):
    print(strong_tag.text, strong_tag.next_sibling)
What's wrong with my code? Is there anything I'm missing? Is there any option to do that action with library or API in python?
urllib downloads only the initial payload of a webpage and then stops. For many complex, non-static webpages, Google Maps included, that payload consists almost entirely of JavaScript, which then populates the page as you know it.
So instead of downloading the DOM elements you want, you're downloading the JavaScript that generates them.
To pull down the fully loaded Maps page instead, you'll need a web driver that can open the page, wait until it has loaded, and only then download the content. For that, look into Selenium.
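As a rough sketch of that approach, the rendering can be split off from the parsing. This assumes Selenium 4 with a chromedriver on PATH, and it reuses the "section-facts-description-text" class from the question, which Google may well have renamed since:

```python
from bs4 import BeautifulSoup

def render_page(url, wait_seconds=5):
    """Open the URL in a real browser so the JavaScript can run,
    then return the fully populated HTML."""
    import time
    from selenium import webdriver  # optional dependency, imported lazily
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude; an explicit wait is more robust
        return driver.page_source
    finally:
        driver.quit()

def extract_facts(html):
    """Parse the rendered HTML the same way the original code tried to."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in
            soup.find_all("span", {"class": "section-facts-description-text"})]
```

Calling `extract_facts(render_page(url))` then works on the browser-rendered DOM rather than the raw JavaScript payload.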
Simple question: why is it that when I inspect an element I see the data I want embedded within the JS tags, but when I go directly to Page Source I do not see it at all?
As an example, I am looking to get the description of an eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay is using JavaScript to load content into the page. A workaround for this problem would be to use something like Playwright or Selenium. I personally prefer the first option: it uses a Chromium browser to actually get the page contents, so JavaScript is executed in the process.
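A minimal sketch of the Playwright route, assuming the sync API is available ("pip install playwright" followed by "playwright install chromium"); the parsing helper just grabs the first paragraph, since the exact selector for eBay's description has to be checked against its current markup:

```python
def fetch_rendered(url):
    """Load the page in headless Chromium so the JavaScript runs,
    then return the resulting HTML."""
    from playwright.sync_api import sync_playwright  # lazy optional import
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

def first_paragraph(html):
    """Pull the first paragraph of text out of rendered HTML,
    e.g. the start of a listing description."""
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p")
    return p.get_text(strip=True) if p else None
```

With this, `first_paragraph(fetch_rendered(url))` searches the browser-rendered page instead of the bare HTTP response.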
I have a problem while scraping a heavy website like Facebook or Twitter, with lots of HTML tags, using the Beautiful Soup and requests libraries.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://twitter.com/elonmusk').text
soup = BeautifulSoup(html_text, 'lxml')
elon_tweet = soup.find_all('span', class_='css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')
print(elon_tweet)
When the code is executed, this returns an empty list.
I'm new to web scraping, so a detailed explanation would be welcome.
The problem is that Twitter loads its content dynamically. This means that when you make a request, the page first returns only the HTML you can see at "view-source:https://twitter.com/elonmusk" (paste that into your browser's address bar).
Later, after the page is loaded, the JavaScript is executed and adds the full content of the page.
With requests in Python you can only scrape the content available in that view-source output, and as you can see, the element you're trying to scrape isn't there.
To scrape this element you will need Selenium, which lets you drive a real browser directly from Python and therefore wait the few extra seconds needed for the whole content to load. You can find a good guide on this here: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/
Also, if you don't want all this trouble, you can instead use an API that supports JavaScript rendering.
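The waiting can be done precisely with Selenium's explicit waits rather than a fixed sleep. This sketch assumes Selenium 4 with a chromedriver on PATH; the class names come straight from the question, and since Twitter auto-generates them, they are likely stale by now:

```python
def to_css_selector(class_attr):
    """The inspector shows the class attribute space-separated
    ("css-901oao css-16my406 ..."); a CSS selector needs dots."""
    return "span." + ".".join(class_attr.split())

def wait_for_spans(url, class_attr, timeout=15):
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    selector = to_css_selector(class_attr)
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching span exists, instead of
        # sleeping for a fixed number of seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
        return [el.text for el in
                driver.find_elements(By.CSS_SELECTOR, selector)]
    finally:
        driver.quit()
```

For the question's case that would be `wait_for_spans('https://twitter.com/elonmusk', 'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')`.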
import requests
from bs4 import BeautifulSoup

url = "http://www.csgolounge.com/api/mathes"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
print(data)
I am trying to use this code to get the text from this page, but every time I try to scrape it, I am redirected to the home page, and my code outputs the homepage's HTML. The page I am trying to scrape is a .php file, not an HTML or text file. I would like to get the text from the page and then extract the data and do what I want with it.
I have tried changing the headers in my code so that the website would think I am a Chrome browser rather than a bot, but I still get redirected to the homepage. I have tried different Python HTML parsers with BeautifulSoup, including the built-in one, as well as many other popular parsers, but they all give the same result.
Is there a way to stop this and get the text from this link? Is it a mistake in my code, or what?
First of all, try it without the "www" part:
Rewrite http://www.csgolounge.com/api/mathes as https://csgolounge.com/api/mathes
If that doesn't work, try Selenium.
requests may be getting stuck because it can't process the JavaScript part.
Selenium can handle it better.
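The first suggestion is a plain string rewrite, which can be done safely with the standard library before making the request. This sketch only normalizes the URL; whether the site still redirects afterwards can then be checked via `requests.get(...).history`:

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite(url):
    """Drop a leading "www." from the host and force https,
    e.g. http://www.csgolounge.com/api/mathes
      -> https://csgolounge.com/api/mathes"""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))
```

After that, `r = requests.get(rewrite(url))` and inspecting `r.history` shows every redirect hop the server sent you through, which makes the "redirected to the homepage" behaviour visible.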
I am trying to scrape this page on Flipkart:
http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto+x+play&otracker=from-search
I am trying to find the div with class "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco", but it returns an empty result.
from bs4 import BeautifulSoup
import requests

url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
divs = soup.find_all("div", {"class": "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print(divs)
divs is empty. I copied the class name using inspect element.
I found the answer in this question: http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites
When you use requests.get(url), you load the HTML content of the URL without JavaScript enabled. Without JavaScript, the section of the page called 'customers who viewed this product also viewed' is never even rendered.
You can explore this behaviour by turning off JavaScript in your browser. If you scrape regularly, you might also want to install a JavaScript switcher plugin.
An alternative you might want to look into is a browser automation tool such as Selenium.
requests.get(..) returns the content of a plain HTTP GET on that URL. None of the JavaScript resources the page references are downloaded, and any inline JavaScript is not executed either.
If Flipkart uses JS to modify the DOM after it is loaded in the browser, those changes will not show up in page.content or page.text.
You could try a different parser instead of the default parser in Beautiful Soup. I tried html5lib and it worked for a different website; maybe it will for you too. It is slower than the default parser, but could be faster than Selenium or another full-fledged headless browser.
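The html5lib suggestion amounts to a one-argument change from the question's snippet; html5lib has to be installed separately ("pip install html5lib"). A small sketch, with the parser left swappable:

```python
from bs4 import BeautifulSoup

def parse_divs(html, class_name, parser="html5lib"):
    """Parse with the more forgiving (but slower) html5lib parser by
    default, and collect divs by class as the question's code does."""
    soup = BeautifulSoup(html, parser)
    return soup.find_all("div", {"class": class_name})
```

Note that html5lib only makes the parsing more tolerant of broken markup; it does not execute JavaScript, so it can only help if the div is actually present in the raw response.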
I want to be able to generate automatic alerts for certain types of matches to a web search. The first step is to read the URL in Python, so that I can then parse it using BeautifulSoup or other regex-based methods.
For a page like the one in the example below, though, the HTML doesn't contain the results that I see when I open the page with a browser.
Is there a way to actually get the HTML with search results themselves?
from urllib.request import urlopen

link = 'http://www.sas.com/jobs/USjobs/search.html'
f = urlopen(link)
myfile = f.read()
print(myfile)
You cannot get data that is generated dynamically by JavaScript using the traditional urllib, urllib2, or requests modules (or even mechanize, for that matter). You'll have to simulate a browser environment by using Selenium with Chrome or Firefox (or PhantomJS) to evaluate the JavaScript in the webpage.
Have a look at the Selenium bindings for Python.
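A sketch of what that browser environment looks like in code, assuming Selenium 4 with Chrome (PhantomJS itself is long deprecated); running headless keeps it usable on a server:

```python
def rendered_html(url):
    """Fetch a page with a real (headless) browser so its JavaScript
    runs, and return the resulting DOM as HTML."""
    from selenium import webdriver  # optional dependency, imported lazily
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be handed to BeautifulSoup exactly as in the question's code, replacing the `urlopen(link).read()` step.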