Web scraping with BeautifulSoup only scrapes the first page - python

I am trying to scrape some data from the WebMD message board. Initially I constructed a loop to get the page numbers for each category and stored them in a dataframe. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?
lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = df1.page.iloc[j]
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)
The print call shows the proper page, but the loop then seems to iterate only over the first page and collect the links of its posts. Also, when I copy and paste the link for any page besides the first one, it momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1) but it does not work. Another thing I tried was adding {'Cookie': 'PHPSESSID=notimportant'} to the headers.

Replace this line:
pages = (df1.page.iloc[j])
With this:
pages = df1.iloc[j, 0]
You will now iterate through the values of your DataFrame.

If page_links is a list with URLs like
page_links = ["http://...", "http://...", "http://...", ]
then you could use it directly:
for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)
If you need it in a DataFrame, then:
for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)
But if your current code displays all the URLs and you still get results only for one page, then the problem is not in the DataFrame but in the HTML and find_all.
It seems only the first page has <div class="thread-detail">, so find_all can't find it on the other pages and nothing gets added to the list. You should check this again; for the other pages you may need different arguments in find_all. Without the URLs to those pages we can't check it and can't help more.
It could also be another common problem: the page may use JavaScript to add these elements, but BeautifulSoup can't run JavaScript, and then you would need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser that can run JavaScript. You could turn off JavaScript in your browser and open the URLs to check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
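If JavaScript does turn out to be the culprit, a minimal Selenium sketch along these lines could replace the urllib calls (this assumes a working chromedriver set-up and that the rendered pages really do contain the same div.thread-detail markup; page_links is the list of URLs from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
lists2 = []
for url in page_links:
    driver.get(url)  # the browser executes the page's JavaScript
    for link in driver.find_elements(By.CSS_SELECTOR, "div.thread-detail a"):
        lists2.append(link.get_attribute("href"))
driver.quit()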
As for PHPSESSID: with requests you could use a Session to get fresh cookies (including PHPSESSID) from the server and automatically send them with the other requests.
import requests

s = requests.Session()
# get any page to get fresh cookies from the server
r = s.get('http://your-domain/main-page.html')
# the session now sends those cookies automatically with later requests
for url in page_links:
    r = s.get(url)

Related

API - Web Scrape

How can I get access to this API? This is what I have:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, however I can't seem to get it right because I'm running into a 403 status code.
This is the website URL:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to get them.
Later I'll use these categories to iterate over the products API.
[Screenshot: API Category]
Obs: please be gentle it's my first post here =]
To get the data as shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here.
But if you want to see the content of the response and not just the 200/400 status code, you need to add '.content' to your print.
Eg.
import requests

# create a session
s = requests.Session()
# example connection variables, probably not required for your use case
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}
# GET request (otherUrl, otherBodyJson, otherHeadersJson stand in for your own values)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)  # print the response status (200 etc.)
#print(p.headers)
#print(p.content)  # print the content of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, you just continue from what you were doing to actually get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can pull out your data based on CSS selectors like this:
site_data = soup.select('selector')
site_data is a list of elements matching that selector, so a simple for loop and a list to add your items to would suffice (as an example, getting links for each book on a bookstore site).
For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items matching this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect' at the bottom), you can see the code for the page. Go to the HTML, find the data you want, right-click it and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
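As a rough sketch, a copied selector would be used like this (the selector below is a made-up example of what DevTools might copy, not this site's real markup):
# hypothetical selector pasted from DevTools' "Copy selector"
item = soup.select_one("#main > div.product-list > div:nth-child(1) > a")
if item is not None:
    print(item.get_text(strip=True), item.get("href"))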
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Repeated information when crawling the next page and missing information, using Beautiful Soup and webdriver

I am trying to crawl the job links from this website.
First, I will explain my code:
I crawl all the links of each job using a for loop.
After getting all the links on the first page, I move to the next page and repeat crawling the job links.
But the program returns a result like this:
As you can see, from the 11th link onwards, the links are repeats of the first 10.
My assumption is that it doesn't actually go to the next page but keeps crawling data from the old page. In that case, however many pages there are, the program will crawl the 1st page that many times.
And there should be more than 9 links on the 1st page.
I don't really know how to fix that.
How can I solve this problem?
Sincerely thanks!
import requests

def main(url):
    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser",
        "x-algolia-application-id": "JF8Q26WWUD",
        "x-algolia-api-key": "ecef10153e66bbd6d54f08ea005b60fc"
    }
    data = "{\"requests\":[{\"indexName\":\"vnw_job_v2\",\"params\":\"query=&hitsPerPage=1000&attributesToRetrieve=%5B%22*%22%2C%22-jobRequirement%22%2C%22-jobDescription%22%5D&attributesToHighlight=%5B%5D&query=&facetFilters=%5B%5D&filters=&numericFilters=%5B%5D&page=0&restrictSearchableAttributes=%5B%22jobTitle%22%2C%22skills%22%2C%22company%22%5D\"}]}"
    r = requests.post(url, params=params, data=data)
    for item in r.json()['results'][0]['hits']:
        print(item['jobTitle'])

if __name__ == "__main__":
    main('https://jf8q26wwud-dsn.algolia.net/1/indexes/*/queries')
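If you ever need to page through the results rather than pull everything at once with hitsPerPage=1000, one possible sketch (not tested against this index) is to build the same Algolia payload with json and vary the page value; url and params are the same as above:
import json
import requests

def fetch_jobs(url, params, page):
    # same Algolia multi-query payload as above, but with the page number as a variable
    payload = {"requests": [{
        "indexName": "vnw_job_v2",
        "params": f"query=&hitsPerPage=50&page={page}"
    }]}
    r = requests.post(url, params=params, data=json.dumps(payload))
    return r.json()['results'][0]['hits']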

How can we get response of next loading web page

I am writing a scraper to get the list of all the movies available on hungama.com.
I am requesting the url "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this url, it shows 12 movies on the screen, but as you scroll down the movie count keeps increasing through auto-reload.
I am parsing the html source page with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items, whereas there are more. How can I get all the movies available? Please help.
While scrolling down the content of that page, if you take a good look at the XHR tab in the Network category within dev tools, you can see that it produces URLs with a pagination feature attached, like http://www.hungama.com/all/hungama-picks-54/3632/2/. So, changing the code as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"

while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(res)
    container = sel.css(".leftbox")
    if len(container) <= 0:
        break
    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)
    next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
    URL = next_page.format(page)
Btw, the URL you provided above is not working; the one I've supplied is active now. However, I think you understand the logic.
There seems to be an AJAX request, used as a lazy-load feature, with the url http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which fetches the movies.
In the above url, change 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next movies.
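A small loop over that AJAX endpoint could look like the sketch below; the CSS selector is adapted from the one in the question, and the stop-when-empty condition is an assumption:
import requests
from bs4 import BeautifulSoup

page = 2  # the lazy-loaded pages appear to start at 2
while True:
    url = f"http://www.hungama.com/all/hungama-picks-54/4470/{page}/?ajax_call=1&_country=IN"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    titles = [a.get_text(strip=True) for a in soup.select("div.movie-block-artist a")]
    if not titles:  # assumption: an empty page means we are past the last one
        break
    print(page, titles)
    page += 1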

Unable to glean data from other pages

I've written a script in Python using POST requests to get data from a webpage. The webpage spans 57 pages, navigated with a next button or a dropdown. What I've written so far can fetch data only from the first page. I tried hard to find a way to capture the data on its next pages but failed. How can I get data from all 57 pages? Thanks in advance.
Here is what I've tried so far:
import requests
from lxml import html

with requests.session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post("http://registers.centralbank.ie/(X(1)S(cvjcqdbijraticyy2ssdyqav))/FundSearchResultsPage.aspx?searchEntity=FundServiceProvider&searchType=Name&searchText=&registers=6%2c29%2c44%2c45&AspxAutoDetectCookieSupport=1",
                        data={'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages': '2'},
                        headers={'Content-Type': 'application/x-www-form-urlencoded'})
    tree = html.fromstring(page.text)
    titles = tree.cssselect("table")[1]
    list_row = [[tab_d.text_content() for tab_d in item.cssselect('td.gvwColumn,td.entityNameColumn,td.entityTradingNameColumn')]
                for item in titles.cssselect('tr')]
    for data in list_row:
        print(' '.join(data))
This is the link to that page.
Btw, I didn't find any pagination links through which I could go to the next page, except for the "data" parameter in the request, where there is a page-number option that changes when the button is clicked. However, changing that number doesn't bring data from other pages.

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something
But when I try to crawl the 2nd page with "#2" in the url:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see r still has the same value, equivalent to the value for page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if #2 is causing any trouble while making the request to the second page.
I also googled the issue but could not find a fix.
What is the right way to make a request to a url with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
Looks like our url is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
From this we can easily see that all we need is to change the "pg" param value to get the other pages.
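For instance, a loop over the pg parameter could be sketched like this (the page range of 1 to 5 and the User-Agent header are assumptions; adjust them to the real size of the bestseller list):
import requests
from bs4 import BeautifulSoup

base = "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{0}?ie=UTF8&pg={0}"
for pg in range(1, 6):  # assumed number of pages; adjust as needed
    r = requests.get(base.format(pg), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    print(pg, r.status_code)  # placeholder: parse the book entries you need from soup here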
You need to request the url given in the href attribute of the anchor tags describing the pagination. It's at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's url is like:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's url is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The a tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request url.
