Load entire HTML page in Python

I need to store an entire HTML page in a str variable.
I'm doing this:
import requests
from bs4 import BeautifulSoup
url = my_url
response = requests.get(url)
page = str(BeautifulSoup(response.content, "html.parser"))
This works, but the page at my_url is not "complete". It is a website where, as you scroll towards the end, new content keeps loading, and I need the whole page, not only the part that is visible at first.
Is there a way to load the entire page and then store it?
I also tried loading the page manually and then looking at the source code, but the final part of the page is still not visible there.
Alternatively, all I want from the my_url page is all the links inside it, and all of them look like:
my_url/something/first-post
my_url/something/second-post
Is there a way to find all the links another way? That is, all the possible URLs that start with "my_url/something/"?
Thanks in advance

I think you should use Selenium and then scroll down with it to get the entire page.
As far as I know, requests can't handle dynamically loaded pages.
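A minimal sketch of that approach, assuming Chrome and that the page keeps loading content as you scroll (the two-second pause is a placeholder you may need to tune):
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get(my_url)  # my_url as in the question

# Scroll until the page height stops growing, i.e. no new content loads.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait for the new content; adjust for the site
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

page = driver.page_source  # the fully loaded page, ready for BeautifulSoup
driver.quit()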

For the alternative option, you can find the <a> tags via find_all:
links = soup.find_all('a', href=True)
To keep only the ones starting with your prefix, you can use the following:
result = [link['href'] for link in links if link['href'].startswith('my_url/something/')]
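Note that href attributes are often relative (e.g. /something/first-post), so a safer sketch normalizes them against the page URL first (urljoin is standard library; the prefix is the placeholder from the question):
from urllib.parse import urljoin

absolute = [urljoin(url, link['href']) for link in links]
result = [href for href in absolute if href.startswith('my_url/something/')]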

Related

Cannot find the text I want to scrape in the Page Source

Simple question. Why is it that when I inspect element I see the data I want embedded within the JS tags - but when I go directly to Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay is using JavaScript to load content into the page. A way around this problem would be to use something like Playwright or Selenium. I personally prefer the first option: it drives a Chromium browser to actually get the page contents, hence it runs the JavaScript in the process.
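A minimal sketch with Playwright's sync API (after pip install playwright and playwright install chromium):
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.ebay.com/itm/272037717929")
    html = page.content()  # the HTML after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, "html.parser")  # the description text should now be searchable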

Problem while scraping twitter using beautiful soup

I have a problem while scraping a heavy website like Facebook or Twitter, with a lot of HTML tags, using the Beautiful Soup and requests libraries.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://twitter.com/elonmusk').text
soup = BeautifulSoup(html_text, 'lxml')
elon_tweet = soup.find_all('span', class_='css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0')
print(elon_tweet)
(Screenshots of the tweet and its corresponding span element omitted.)
When the code is executed, this returns an empty list.
I'm new to web scraping; a detailed explanation would be welcome.
The problem is that Twitter loads its content dynamically. This means that when you make a request, the page first returns only the initial HTML (type this into your browser's address bar to see it: view-source:https://twitter.com/elonmusk).
Later, after the page is loaded, the JavaScript is executed and adds the full content of the page.
With requests from Python you can only scrape the content available at view-source:https://twitter.com/elonmusk, and as you can see, the element you're trying to scrape is not there.
To scrape this element you will need Selenium, which lets you drive a browser directly from Python and therefore wait the few extra seconds needed for the whole content to load. You can find a good guide on this here: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/
Also, if you don't want all this trouble, you can instead use an API that supports JavaScript rendering.
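A minimal Selenium sketch of that idea. Note that the long css-... class string from the question is auto-generated and changes often, so the selector below targets the tweet containers instead, and is still only a placeholder:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://twitter.com/elonmusk")

# Wait up to 15 seconds for the dynamically loaded tweets to appear.
tweets = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article"))
)
for tweet in tweets:
    print(tweet.text)
driver.quit()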

Navigate through all the search results pages with BeautifulSoup

I cannot seem to grasp this.
How can I make BeautifulSoup parse every page by following the "Next page" link until the last page, and stop parsing when there is no "Next page" left? On a site like this:
I tried looking for the Next button's element name and used find to locate it, but I do not know how to make this recur so it iterates until all pages are scraped.
Thank you
Beautiful Soup will only give you the tools; how to navigate the pages is something you need to work out yourself, in a flow-diagram sense.
Taking the page you mentioned and clicking through a few of the pages, it seems that when we are on page 1, nothing is shown in the URL:
htt...ru/moskva/transport
and we see in the source of the page:
<div class="pagination-pages clearfix">
<span class="pagination-page pagination-page_current">1</span>
<a class="pagination-page" href="/moskva/transport?p=2">2</a>
Let's check what happens when we go to page 2:
ht...ru/moskva/transport?p=2
<div class="pagination-pages clearfix">
<a class="pagination-page" href="/moskva/transport">1</a>
<span class="pagination-page pagination-page_current">2</span>
<a class="pagination-page" href="/moskva/transport?p=3">3</a>
Perfect, now we have the layout. One more thing to know before we make our soup: what happens when we go past the last available page, which at the time of this writing was 40161?
ht...ru/moskva/transport?p=40161
We change this to:
ht...ru/moskva/transport?p=40162
The page seems to go back to page 1 automatically. Great!
So now we have everything we need to make our soup loop.
Instead of clicking next each time, just construct the URL yourself. You know the elements required:
url = ht...ru/moskva/$searchterm?p=$pagenum
I'm assuming transport is the search term? I don't know, I can't read Russian. But you get the idea: construct the URL, then do a requests call.
request = requests.get(url)
mysoup = bs4.BeautifulSoup(request.text, 'html.parser')
Now you can wrap that whole thing in a while loop and, each time except the first, check:
mysoup.select('.pagination-page_current')[0].text == '1'
This says: each time we get the page, find the currently selected page using the class pagination-page_current; select returns a list, so we take the first element with [0], get its text with .text, and see if it equals '1' (the comparison is against the string '1', since .text returns a string).
This should only be true in two cases: on the first page you run, and when you have gone past the last page and been sent back to page 1. So you can use this to start and stop the script, or however you want.
This should be everything you need to do this properly. :)
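Putting it together, a sketch of the loop (the base URL stays masked as in the excerpts above; substitute the real site, and note the pagination markup may have changed since this was written):
import requests
import bs4

base_url = "ht...ru/moskva/transport"  # masked as above; use the real site here
page_num = 1
while True:
    url = f"{base_url}?p={page_num}"
    mysoup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
    current = mysoup.select('.pagination-page_current')[0].text
    if page_num > 1 and current == '1':
        break  # past the last page: the site sent us back to page 1
    # ... scrape the listings on this page here ...
    page_num += 1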
BeautifulSoup by itself does not load pages. You need to use something like requests: fetch the URL you want to follow, load it, and pass its content to another BS4 soup.
import requests
from bs4 import BeautifulSoup

# Scrape your url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # You can now scrape the new page
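For the "Next page" case specifically, a sketch that keeps following the pagination links quoted earlier until there are none left (the selectors assume that markup; the base URL stays masked as above):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "ht...ru/moskva/transport"  # masked as above; use the real site
while url:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # ... scrape this page ...
    current = soup.select_one('.pagination-page_current')
    next_page = current.find_next_sibling('a', class_='pagination-page') if current else None
    url = urljoin(url, next_page['href']) if next_page else None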

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all the divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with the class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
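From there you can hand the rendered HTML to BeautifulSoup and query the class from the question (assuming the class name is still 'product-list-item'):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='product-list-item'):
    for a in div.find_all('a', href=True):
        print(a['href'])  # the links under each product div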
You can get all the info by changing the URL; the link can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little about how that page is loaded in a browser. When you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). Then the browser starts to read/parse the content, which basically tells it where to find all the information it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded; the info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser, and that's only part of the whole page-loading process! So, by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job; you can get it here.
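A sketch of that last step with requests (the endpoint is a placeholder: the real one is whichever api.hm.com URL shows up in your traffic capture, and the JSON keys depend on that response):
import requests

api_url = "https://api.hm.com/..."  # placeholder; copy the real URL from your capture
data = requests.get(api_url).json()  # the response is already JSON, no HTML parsing needed

# The key below is an assumption; inspect `data` to find the real structure.
for product in data.get("products", []):
    print(product)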
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()

Redirected to main page when trying to parse html with python

import requests
from bs4 import BeautifulSoup
url = "http://www.csgolounge.com/api/mathes"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
print (data)
I am trying to use this code to get the text from this page, but every time I try to scrape it I am redirected to the home page, and my code outputs the HTML of the homepage. The page I am trying to scrape is a .php file, not an HTML or text file. I would like to get the text from the page and then extract the data and do what I want with it.
I have tried changing the headers in my code so that the website would think I am a Chrome browser rather than a bot, but I still get redirected to the homepage. I have tried different Python HTML parsers, like BeautifulSoup and the built-in html.parser, as well as many other popular parsers, but they all give the same result.
Is there a way to stop this and get the text from this link? Is it a mistake in my code, or what?
First of all, try it without the "www" part.
Rewrite http://www.csgolounge.com/api/mathes as https://csgolounge.com/api/mathes
If that doesn't work, try Selenium.
requests may be getting stuck since it can't process the JavaScript part.
Selenium can handle that better.
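A minimal sketch of the Selenium fallback (assuming Chrome; the URL is the one from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://csgolounge.com/api/mathes")
print(driver.find_element(By.TAG_NAME, "body").text)  # the page text after any JS has run
driver.quit()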
