How can I get the response of the next lazily loaded web page - python

I am writing a scraper to get the list of all the movies available on hungama.com.
I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this URL, it shows 12 movies on the screen, but as you scroll down more movies keep loading automatically.
I am parsing the HTML source page with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items, whereas there are many more movies. How can I get all of the movies available?

While scrolling down the content of that page, if you take a good look at the XHR tab in the Network category within dev tools, you can see that it produces URLs with pagination attached to them, like http://www.hungama.com/all/hungama-picks-54/3632/2/. So, changing the script as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"
next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"

while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(text=res.text)  # build the selector from the response body
    container = sel.css(".leftbox")
    if not container:  # no more items: we have run past the last page
        break
    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)
    URL = next_page.format(page)
Btw, the URL you provided above is not working; the one I've supplied is active now. But I think you get the logic.

There seems to be an AJAX request acting as a lazy-load feature, with the URL http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which fetches the movies.
In that URL, change the 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next movies.
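A minimal sketch of that idea, assuming the endpoint keeps returning HTML fragments until the pages run out (the selector is the one from the question):
import requests
from scrapy import Selector

base = "http://www.hungama.com/all/hungama-picks-54/4470/{}/?ajax_call=1&_country=IN"
page = 1
while True:
    res = requests.get(base.format(page))
    sel = Selector(text=res.text)
    titles = sel.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
    if not titles:  # assume an empty fragment means no more pages
        break
    for title in titles:
        print(title)
    page += 1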

Related

Getting empty list after running scraping code in Python to get reviews of users from the Play Store for my app

import requests
from bs4 import BeautifulSoup

# Send a GET request to the web page
response = requests.get('https://play.google.com/store/apps/details?id=com.example.app&reviewSortOrder=4&reviewType=0')

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the review divs
review_divs = soup.find_all('div', class_='single-review')

# Extract the review data from each review div
reviews = []
for review_div in review_divs:
    # Extract the review text
    review_text = review_div.find('span', class_='review-body').text
    # Extract the reviewer's name
    reviewer_name = review_div.find('span', class_='author-name').text
    # Extract the review rating
    review_rating = review_div.find('div', class_='tiny-star').get('aria-label').split(' ')[1]
    # Store the review data in a dictionary
    review = {
        'review_text': review_text,
        'reviewer_name': reviewer_name,
        'review_rating': review_rating
    }
    # Add the review dictionary to the list of reviews
    reviews.append(review)

print(reviews)
I wrote this code to scrape reviews of my app from the Play Store in tabular format. However, I only get an empty list after running it, and I am not able to identify where the error in the code is. The result I am expecting is a list of all of my app users' reviews.
Print your status and see what it returns: print(response.status_code) – Samuel
You should always check the response first. I usually add a line after a requests.get call with something like
print(response.status_code, response.reason, 'from', response.url)
so that I can see whether the response is OK, and also whether the request was redirected to another page. (You can also proceed conditionally with if response.status_code == 200 or if response.ok, or interrupt the program by raising an error on a bad response with response.raise_for_status().)
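Put together, a minimal version of that pattern might look like this (the URL is a placeholder):
import requests

response = requests.get('https://example.com/some-page')  # placeholder URL
print(response.status_code, response.reason, 'from', response.url)
response.raise_for_status()  # stop with an HTTPError on any 4xx/5xx status
# only parse response.text once we know the request succeeded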
In this case, the link itself also appears to be problematic.

API - Web Scrape

How do I get access to this API?
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't seem to get it right because I keep running into a 403 code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to fetch them.
Later I'll use these categories to iterate over the products API.
(screenshot: API Category)
Note: please be gentle, it's my first post here =]
To get the data as shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}

r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
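The sm-token, User-Agent, and Referer headers appear to be what the server checks; without them it answers with the 403 you saw. A quick sanity check before using the payload (a generic requests pattern, not specific to this API):
r.raise_for_status()   # raises an HTTPError if the 403 is still coming back
print(r.status_code)   # expect 200 once the headers above are sent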
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 responses, you need to add .content to your print.
E.g.
import requests

# Create a session
s = requests.Session()

# Example connection variables; probably not required for your use case.
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # prints the response status (200 etc.)
# print(p.headers)
# print(p.content)  # prints the content of the response
# print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, just continue what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can pull your data out of it based on CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of elements matching that selector, so a simple for loop appending the items to a list would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(URL)  # note: this must match the URL variable defined above
soup = BeautifulSoup(response.text, "html.parser")

links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right click and press 'Inspect'), you can see the page's code. Go through the HTML, find the data you want, right click it, and select Copy -> Copy selector. This makes it really easy to get at the data you want on that site.
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Web scraping with BeautifulSoup only scrapes the first page

I am trying to scrape some data from the WebMD message board. Initially I constructed a loop to get the page numbers for each category and stored them in a dataframe. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?
lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = (df1.page.iloc[j])
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)
The print call shows the proper page, but then the loop seems to collect the links of the posts only from the first page. Also, when I copy and paste the link for any page besides the first one, it momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1) but that does not work. Another thing I tried was adding headers={'Cookie': 'PHPSESSID=notimportant'}.
Replace this line:
pages = (df1.page.iloc[j])
with this:
pages = df1.iloc[j, 0]
You will now iterate through the values of your DataFrame
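A quick self-contained illustration of the difference (with hypothetical URLs):
import pandas as pd

# hypothetical data standing in for page_links
df1 = pd.DataFrame(columns=['page'], data=[['http://a'], ['http://b']])
for j in range(len(df1)):
    print(df1.iloc[j, 0])  # the scalar value at row j, column 0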
If page_links is a list of URLs like
page_links = ["http://...", "http://...", "http://..."]
then you could use it directly:
for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)
If you need it in a DataFrame, then:
for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)
But if your current code displays all the URLs and yet you get results only for one page, then the problem is not in the DataFrame but in the HTML and find_all.
It seems only the first page has <div class="thread-detail">, so find_all can't locate it on the other pages and can't add anything to the list. You should check this again; for the other pages you may need different arguments in find_all. But without the URLs to these pages we can't check it, and we can't help more.
It can also be another common problem: the page may use JavaScript to add these elements, and BeautifulSoup can't run JavaScript; you would then need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser, which can run JavaScript. You could turn off JavaScript in your browser and open the URLs to check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
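If JavaScript does turn out to be the culprit, here is a minimal Selenium sketch of that idea (the class name comes from your code; the fixed sleep is a crude assumption, an explicit wait would be better):
import time
import bs4 as bs
from selenium import webdriver

driver = webdriver.Chrome()
for url in page_links:
    driver.get(url)
    time.sleep(3)  # crude wait for JavaScript to render the thread list
    soup1 = bs.BeautifulSoup(driver.page_source, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        print(body_links.a.get('href'))
driver.quit()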
As for PHPSESSID: with requests you could use a Session to get fresh cookies with PHPSESSID from the server and add them automatically to subsequent requests:
import requests

s = requests.Session()

# get any page to obtain fresh cookies from the server
r = s.get('http://your-domain/main-page.html')

# the session now sends those cookies automatically
for url in page_links:
    r = s.get(url)

Web scrape website with dropdown menu that changes website dynamically (onchange)

So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop-down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');">
  <option value="0500000US01001" selected="selected">Autauga County, Alabama</option>
</select>
From the research I've done, it sounds like I need to make some sort of GET request (Selenium?), but I am completely lost on how to do this. I know how to get the data I want once I've made the county selection, but I've never had to scrape something where the website changes dynamically (i.e. the URL doesn't change).
I understand that some may find this to be a simple question, but I've read numerous other similar questions and would greatly benefit from someone walking me through an example and/or directing me to a solid guide.
This is what I've been messing around with so far. I can see it kind of works at selecting the values, but it spits out this error: Message: stale element reference: element is not attached to the page document (Session info: chrome=74.0.3729.169)
for index, row in StateURLs.iterrows():
    url = row['URL']
    state = row['STATE']
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
    driver.get(url)
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    options = select_county.options
    for index in range(0, len(options) - 1):
        select_county.select_by_index(index)
I would also love help on how to then hand these pages to Beautiful Soup so I can scrape each page after the selection is made.
The main landing page makes GET requests with a query string that returns a JSON string containing the info first shown when you submit your query, including further URLs that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
Here is an example of that JSON.
I can generate the correct URL to use in the browser, which returns all that data as JSON, but I can't seem to configure requests so that it works. Perhaps someone else can pick this up and work it out; I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects=True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)
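As for the Selenium route from the question: the stale element error happens because the onchange handler rebuilds the page, detaching the previously found <select>. Re-finding the drop-down after every selection avoids that, and driver.page_source hands the rendered HTML to BeautifulSoup. A sketch under those assumptions (the fixed sleep is a crude placeholder for an explicit wait):
import time
import bs4
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
driver.get(url)  # url = row['URL'] from your existing loop
num_options = len(Select(driver.find_element_by_id('cat_id_select_GEO')).options)
for index in range(num_options):
    # re-locate the drop-down each time; the old reference goes stale
    # as soon as the onchange handler rebuilds the page
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    select_county.select_by_index(index)
    time.sleep(2)  # crude wait for the new data to render
    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    # ...scrape the county data from soup here...
driver.quit()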

Displaying all search results in Python web scraper

I am quite new to Python and am building a web scraper which will scrape the following page and the links in it: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is that the page defaults to displaying the first 10 search results; however, I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but it remains static no matter which display option is selected. I have also looked at the Network section of the developer tools in Chrome, but can't figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re

response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")

urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
    urls.append(a['href'])

pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]

for pages in pagesToCrawl:
    html = requests.get(pages)
    soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
    nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
    tbody = soupObjs.find('div', {"id": "collapse8"}).find('tbody')
    offers = tbody.find('td').next_sibling.next_sibling.next_element
    seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
    print('Firm name:', nameOfFirm)
    print('Offers:', offers)
    print('Seeking:', seeking)
    print('Hireback Rate:', int(offers) / int(seeking))
Replacing your requests.get call with the code below seems to work. The reason is that you weren't passing the cookie that controls how many results are displayed.
response = requests.get(
    'https://www.nalpcanada.com/Page.cfm',
    params={'PageID': 33},
    cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was a ValueError raised by the line below when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking":
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line, since you will have to decide what to do in those cases.
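A hedged way to handle those cases, if you would rather skip them than crash:
try:
    print('Hireback Rate:', int(offers) / int(seeking))
except (ValueError, ZeroDivisionError):
    # some profiles (like YLaw Group) have no usable "offers"/"seeking" values
    print('Hireback Rate: n/a')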
