I'm trying to scrape information for dresses from this website: https://www.libertylondon.com/uk/department/women/clothing/dresses/
Obviously, I'm not only interested in the first 60 results, but all of them. After clicking the 'Show More' button a couple of times, I arrive at this URL: https://www.libertylondon.com/uk/department/women/clothing/dresses/#sz=60&start=300
I would have expected the following code to give me a full download of the page mentioned above, but for some reason it still only yields the first 60 results.
import requests
import bs4
url = "https://www.libertylondon.com/uk/department/women/clothing/dresses/#sz=60&start=300"
res = requests.get(url)
res.encoding = 'utf-8'
res.raise_for_status()
html = res.text
soup = bs4.BeautifulSoup(html, "lxml")
elements = soup.find_all("div", attrs = {"class": "product product-tile"})
I can see that the issue lies within the request itself, since the soup variable does not contain the full HTML text I see when inspecting the page, but I cannot figure out why that is.
Try the URL below, which fetches all 331 elements:
url: https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=331&start=0&format=ajax
import requests
import bs4
url="https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=331&start=0&format=ajax"
res = requests.get(url)
res.encoding = 'utf-8'
res.raise_for_status()
html = res.text
soup = bs4.BeautifulSoup(html, "lxml")
elements = soup.find_all("div", attrs = {"class": "product product-tile"})
print(len(elements))
The link you get after clicking the "Show more" button uses a fragment (notice the # sign). A fragment is not sent to the server; it is used by JavaScript on the front end to load more items without reloading the full page.
However, you're in luck: if you look at the HTTP requests made in your browser's console, you'll see one to https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=60&start=60. Those are query params (and they exactly match the fragment!), which means the server will send the extra items.
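For example, here is a minimal sketch that walks through the whole catalogue with those query params until an empty page comes back (the stopping condition is an assumption about how the server responds past the last item):
import requests
import bs4

base_url = "https://www.libertylondon.com/uk/department/women/clothing/dresses/"
page_size = 60
start = 0
products = []
while True:
    # sz and start are sent as real query params here, so the server returns that slice
    res = requests.get(base_url, params={"sz": page_size, "start": start})
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")
    tiles = soup.find_all("div", attrs={"class": "product product-tile"})
    if not tiles:
        break  # assumed: past the last page, no more product tiles
    products.extend(tiles)
    start += page_size
print(len(products))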
I think that in this case the "Show more" button loads *sz* dresses starting from the *start*-th dress.
So when you make a request with sz=60&start=300, the server only fetches the dresses from index 300 to 360, which is why the response only contains 60 dresses.
There is another button on the page that points to a different URL: SHOW ALL, which gives https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=120
With just the ?sz=120 query parameter you get a response with *sz* dresses, but there seems to be a limit on how many dresses you can load at once.
An HTTP GET to https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=331&start=0 will return all items (331 is the current number of items and may change in the future).
import requests
from bs4 import BeautifulSoup
result = requests.get('https://www.indeed.com/?vjk=5bc59746be36d8d0')
source = result.content
soup = BeautifulSoup(source, "lxml")
job_titles = soup.find_all("a", {"class": "jcs-JobTitle"})
print(job_titles)
The problem here is that printing job_titles returns an empty list instead of the job titles on the website.
Please help me fix this problem; any help would be appreciated.
When I first went to the URL you're requesting, I was shown a search page with no jobs listed. It was only after I submitted a search that the page was populated with results. When I returned to the original URL again, the page was still populated (possibly with cached results). The blank page is probably what you're getting back when you get the page from requests.
Try using the full URL with parameters that the browser forwards you to after a search. For example, the URL https://www.indeed.com/jobs?q=data%20engineer&l=Raleigh%2C%20NC&vjk=b971ec43674ab50e gives me back 15 job title links.
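For example, a small sketch of that request, letting requests build the query string (the q/l values are just the example search above, the vjk parameter is dropped, and Indeed may still serve different HTML to non-browser clients):
import requests
from bs4 import BeautifulSoup

# search terms as query params instead of a hand-built URL
params = {"q": "data engineer", "l": "Raleigh, NC"}
result = requests.get("https://www.indeed.com/jobs", params=params)
result.raise_for_status()
soup = BeautifulSoup(result.content, "lxml")
job_titles = soup.find_all("a", {"class": "jcs-JobTitle"})
print([a.get_text(strip=True) for a in job_titles])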
I'm trying to scrape some information from web pages. On one page it works fine, but on the other it doesn't, because I only get None as the return value.
This code / webpage is working fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find_all("div", attrs={"class": "company"})
print(name_box)
But with this code / webpage I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)
Why is that?
(At first I thought that, because of the number in the class name on the second page, "companyName__99a4824b", the class name changes dynamically - but this is not the case: when I refresh the page it is still the same class name...)
The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.
requests simply returns the HTML of the page as it is first served - which does not contain the companyName__99a4824b class tag - and BeautifulSoup can only parse what it is given.
Only after the user has waited for the page to fully load does the HTML include the desired tag.
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.
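For instance, a minimal sketch of that approach (assuming Selenium 4 with Chrome; the class name comes from your snippet and will likely change whenever Bloomberg redeploys, and the site may still block automated browsers):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.bloomberg.com/quote/SPX:IND")
# wait up to 10 seconds for the JavaScript-rendered tag to appear
name_box = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "companyName__99a4824b"))
)
print(name_box.text)
driver.quit()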
The website blocks scrapers; check the title:
print(soup.find("title"))
To bypass this you must use a real browser that can run JavaScript.
A tool called Selenium can do that for you.
I am trying to web-scrape and am currently stuck on how to continue with the code. I am trying to write code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on how to build a loop that advances to the page with the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag's content based on its class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # return the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much more nicely if you visit the website once, parse it with BeautifulSoup, and then pass the resulting soup object to your functions - visit once, parse as many times as you want. You won't get blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you treat each review as an individual object and parse it, you can get all the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the page numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page-number links are contained within <a href> tags, so just write a for loop to iterate over the links, as in the sketch below.
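A rough sketch of that loop (the class strings are the auto-generated ones quoted above and will break whenever Yelp regenerates them; treat this as a starting point, not a stable scraper):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
soup = BeautifulSoup(requests.get(url).text, "lxml")

# the pagination container quoted above
pagination = soup.find("div", class_="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j")
# the review template class quoted above
review_class = "lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT"

if pagination:
    for link in pagination.find_all("a", href=True):
        # page links may be relative, so resolve them against the base url
        page_url = urljoin(url, link["href"])
        page_soup = BeautifulSoup(requests.get(page_url).text, "lxml")
        # each review sits in its own copy of the template
        for review in page_soup.find_all("li", class_=review_class):
            print(review.get_text(" ", strip=True)[:80])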
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same URL as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox] and switch to the Network tab, then click the Next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
The response looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending the same headers your browser sends; otherwise you get back different data that doesn't contain the comments.
You can see those headers in Chrome's Network tab under the request details. My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc.) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise, but this at least gives you a fighting chance.
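Putting it all together, a sketch of that flow (raw_headers below is an abbreviated placeholder - paste the headers you copied from Chrome; the review_feed URL, its params, and the JSON shape come from the request shown above):
import requests

# placeholder: paste your own headers copied from Chrome here,
# with the colon-prefixed pseudo-headers (:authority, etc.) removed
raw_headers = """user-agent: Mozilla/5.0 ...
accept: application/json
referer: https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city"""

headers = dict([[h.partition(':')[0], h.partition(':')[2].strip()]
                for h in raw_headers.split('\n')])

url = "https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed"
for start in range(0, 80, 20):  # 80 reviews, 20 per request
    params = {"rl": "en", "sort_by": "relevance_desc", "q": "", "start": start}
    data = requests.get(url, params=params, headers=headers).json()
    for review in data.get("reviews", []):
        print(review["comment"]["text"][:80])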
I've created a script in Python to parse the tabular content from a website. My script can currently parse the content from its landing page. However, there is a NEXT PAGE button at the bottom of that page which unfolds 50 more results when clicked, and so on.
Website address: https://indiarailinfo.com/trains/passenger/0/0/0/0
I've tried the following, which scrapes the first 50 results:
import requests
from bs4 import BeautifulSoup

site_link = 'https://indiarailinfo.com/trains/passenger/0/0/0/0'
res = requests.get(site_link)
soup = BeautifulSoup(res.text, "lxml")
for items in soup.select("div[style='line-height:20px;']"):
    tds = [elem.get_text(strip=True) for elem in items.select("div")]
    print(tds)
How can I get all the tabular content from that page by exhausting the NEXT PAGE button, using requests?
PS: I know how to unfold the content using Selenium, so solutions based on a browser simulator are not what I'm after.
Clicking the next button actually issues an XHR request to https://indiarailinfo.com/trains/passenger/0/1?i=1&&kkk=1571329558457
<button class="nextbtn" onclick="javascript:getNextTrainListPageBare($(this).parent(),'/trains/passenger/0/1?i=1&');"><div>NEXT PAGE<br>the next 50 Trains will appear below</div></button>
So all you have to do is take the path from the 'onclick' attribute, compose a URL from it, and do an HTTP GET using requests.
The returned data will look like this
https://pastebin.com/Nk0E5vHH
Now just use BeautifulSoup and extract the data you need.
Code below (replace 10 with the number of pages you need):
import requests
from bs4 import BeautifulSoup

site_link = 'https://indiarailinfo.com/trains/passenger/0/{}'
for x in range(10):
    url = site_link.format(x)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    print('Data for url: {}'.format(url))
    for items in soup.select("div[style='line-height:20px;']"):
        tds = [elem.get_text(strip=True) for elem in items.select("div")]
        print(tds)
I was happily scraping property data from www.century21.com with Python requests and BeautifulSoup. The site paginates its results, and I was able to scrape the first page, but when I tried to do the same for the second page, I got the data of the first page as output.
Here is an example of first page results: http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=0
And here are the results of the second page for the same search term: http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=10
I noticed that when I manually open the second URL in the browser, the results of the first URL show for a few seconds, and then the page finishes loading and shows the results of the second page.
As you can imagine, requests is grabbing the results of the initial load of the second page, which happen to be the same as the results of the first page. The same thing happens if I request the third page's results, the fourth, and so on.
Below is my code. If you run it, it will print the address of the first property of the first page twice.
Any idea how to grab the correct page results?
from bs4 import BeautifulSoup
import requests

page1 = requests.get("http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=0")
c1 = page1.content
soup1 = BeautifulSoup(c1, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {"class": "propAddressCollapse"})[0].text

page2 = requests.get("http://www.century21.com/real-estate/ada-oh/LCOHADA/#t=0&s=10")
c2 = page2.content
soup2 = BeautifulSoup(c2, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {"class": "propAddressCollapse"})[0].text

print(soup1)
print(soup2)
Make requests to the "search.c21" endpoint, get the HTML string from the "list" key of the JSON response, and parse it:
from bs4 import BeautifulSoup
import requests
page1 = requests.get("http://www.century21.com/search.c21?lid=COHADA&t=0&s=0&subView=searchView.AllSubView")
c1 = page1.json()["list"]
soup1 = BeautifulSoup(c1, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {
"class": "propAddressCollapse"})[0].text
page2 = requests.get("http://www.century21.com/search.c21?lid=COHADA&t=0&s=10&subView=searchView.AllSubView")
c2 = page2.json()["list"]
soup2 = BeautifulSoup(c2, "html.parser").find_all("div", {"class": "propertyRow"})[0].find_all("span", {
"class": "propAddressCollapse"})[0].text
print(soup1)
print(soup2)
Prints:
5489 Sr 235
202 W Highland Ave