requests_html render scrolldown, script not working - python

I need to crawl data from a website where items are loaded by scrolling down.
The page shows 5 items before scrolling, and I expect 80 items to be returned after scrolling is done.
I'm using the requests_html module and tried this:
from requests_html import HTML, HTMLSession
keyword = '유산균'
n = 1
url = f'https://search.shopping.naver.com/search/all?frm=NVSHATC&origQuery={keyword}&pagingIndex={n}&pagingSize=80&productSet=total&query={keyword}&sort=rel&timestamp=&viewType=list'
session = HTMLSession()
ses = session.get(url)
html = HTML(html=ses.text)
item_list = html.find('div.basicList_title__3P9Q7')
print(len(item_list))
ses.html.render(scrolldown=100, sleep=.1)
# also tried this, which did not work either:
# ses.html.render(script="window.scrollTo(0, 99999)", sleep=10)
print(len(item_list))
I expected the two prints to show 5 and then 80, but both returned the same result: 5 and 5.
What is wrong with my code?

When you monitor the network activity while the site loads, you'll see that it fetches the search results from an API. This means you can retrieve the data directly from the API without scraping the rendered page. Here is an example that loads the first page into a pandas DataFrame:
import requests
import pandas as pd
keyword = '유산균'
n = 1
r = requests.get(f'https://search.shopping.naver.com/api/search/all?sort=rel&pagingIndex={n}&pagingSize=80&viewType=list&productSet=total&deliveryFee=&deliveryTypeValue=&frm=NVSHATC&query={keyword}&origQuery={keyword}').json()
df = pd.DataFrame(r['shoppingResult']['products'])
You can add a loop to retrieve the next pages, etc., as sketched below.
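A minimal sketch of that loop, using the same API URL as above; the three-page range and the 0.5-second pause are arbitrary choices, not values required by the site:
import time

import pandas as pd
import requests

keyword = '유산균'
frames = []
for n in range(1, 4):  # first three pages; adjust as needed
    r = requests.get(f'https://search.shopping.naver.com/api/search/all?sort=rel&pagingIndex={n}&pagingSize=80&viewType=list&productSet=total&deliveryFee=&deliveryTypeValue=&frm=NVSHATC&query={keyword}&origQuery={keyword}').json()
    frames.append(pd.DataFrame(r['shoppingResult']['products']))
    time.sleep(0.5)  # small pause between requests

df = pd.concat(frames, ignore_index=True)
print(len(df))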

Related

Web scraping using pandas

I want to scrape multiple pages of a website using Python, but I'm getting a "Remote Connection closed" error.
Here is my code:
import pandas as pd

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    dframe = pd.read_html(url, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
Any idea how to solve it?
For me, just using requests to fetch the HTML before passing it to read_html gets the data. I edited your code to:
import pandas as pd
import requests

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    r = requests.get(url)  # fetch the page; the HTML is in r.text
    dframe = pd.read_html(r.text, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
I didn't even have to add headers, but if this isn't enough for you (i.e., if the program breaks or you don't end up with 53,770+ rows), try adding convincing headers or using something like HTMLSession instead of calling requests.get directly. A sketch of the headers option follows below.
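For reference, a hedged sketch of what adding headers could look like; the User-Agent string is just an example browser identity, not something this site is known to require:
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
}
r = requests.get(url, headers=headers)  # url built as in the loop above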

Web scraping of hyperlinks going so slow

I am using the following function to scrape the Twitter URLs from a list of websites.
import httplib2
import bs4 as bs
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urlparse
import pandas as pd
import swifter

def twitter_url(website):  # the website address is given to the function as a string
    try:
        http = httplib2.Http()
        status, response = http.request(str('https://') + website)
        url = 'https://twitter.com'
        search_domain = urlparse(url).hostname
        l = []
        for link in bs.BeautifulSoup(response, 'html.parser',
                                     parseOnlyThese=SoupStrainer('a')):
            if link.has_attr('href'):
                if search_domain in link['href']:
                    l.append(link['href'])
        return list(set(l))
    except:
        ConnectionRefusedError
I then apply the function to the dataframe that contains the website addresses:
df ['twitter_id'] = df.swifter.apply(lambda x:twitter_url(x['Website address']), axis=1)
The dataframe has about 100,000 website addresses. Even when I run the code on 10,000 samples, it is very slow. Is there any way to make this faster?
The issue is most likely the time it takes to retrieve the HTML for each website.
Since the URLs are processed one after the other, even if each one took only 100 ms, 10,000 URLs would still take about 1,000 s (~16 minutes) to finish.
If you instead process each URL in a separate thread, that should significantly cut down the total time.
You can check out the threading library to accomplish that; a sketch using the thread-pool wrapper from concurrent.futures follows below.
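A minimal sketch of the threaded version, using concurrent.futures.ThreadPoolExecutor from the standard library (built on top of threading); the worker count of 32 is an arbitrary choice, and df and twitter_url are assumed to be defined as in the question:
from concurrent.futures import ThreadPoolExecutor

websites = df['Website address'].tolist()

with ThreadPoolExecutor(max_workers=32) as pool:  # 32 threads is arbitrary; tune as needed
    results = list(pool.map(twitter_url, websites))  # preserves input order

df['twitter_id'] = results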

Scraping with requests_html randomly gives (no) result from JS site. Timing issue?

I want to scrape data from IMDb. Since beautifulsoup4 cannot execute JavaScript, I use requests_html.
However, my code only sometimes returns a result. When I run the same code 10 times, sometimes it works and sometimes it does not. time.sleep() does not help (I thought maybe the JavaScript needed longer to load).
Why is that, and how can I fix it?
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.imdb.com/title/tt4236770/')
# time.sleep(1)
rating_show = r.html.find('.AggregateRatingButton__RatingScore-sc-1il8omz-1')[0]  # either works or raises 'list index out of range'
rating_show = float(rating_show.text)
rating_show
It happens because the class names and structure of the page change to discourage scraping; it is not due to JavaScript rendering.
By the way, if you want to render the page, you need to call the render method, r.html.render(), after the get request.
Here you can simply bypass the class name and get the film's rating like this:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.imdb.com/title/tt4236770/')
body = r.html.text
indice = body.find('/10')
print(body[indice - 3: indice])
# output: always returns '8.6'
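If you do want to try the render route mentioned above, a minimal sketch would look like this; the selector is the one from the question and, as noted, may have changed, so treat it as an assumption rather than a working scraper:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.imdb.com/title/tt4236770/')
r.html.render(sleep=2)  # runs Chromium to execute the page's JavaScript
matches = r.html.find('.AggregateRatingButton__RatingScore-sc-1il8omz-1')
if matches:
    print(float(matches[0].text))
else:
    print('selector not found; the class name has probably changed')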

Python Requests module not refreshing new page contents

Hi, I want to crawl an XHR request URL that returns a JSON feed, but when I change the page query parameter to 2 (or any other value), it still retrieves the data from page 1. When I do the same in the browser, it shows the data for the requested page.
import json
import requests

url = 'https://www.daraz.pk/computer-graphic-cards/?'
params_dict = {}
params_dict['ajax'] = 'true'
params_dict['page'] = 1
params_dict['spm'] = 'a2a0e.home.cate_2_9.1.35e349378NoL6f'
res = requests.get(url, params=params_dict)
data = json.loads(res.text)
res.url  # url changes, but the content is the same as page 1
info = data.get('mods').get('listItems')
for i in info:
    print(i['name'])
I think the issue is in how the data is being returned by the site. I modified the call slightly by looping over the pages.
Looking at the data returned, it seems that some products are returned on multiple pages, even in the UI.
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    info = res.get('mods').get('listItems')
    for i in info:
        print('%s:%s:%s---------%s' % (i['itemId'], i['sellerName'], i['skuId'], i['name']))
    print('----------------------- PAGE %s ------------------------------------------' % (page_num))
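Since the same product can show up on more than one page, here is a small sketch of collecting the results while de-duplicating on itemId, assuming the response structure used above:
import requests

seen = {}
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    for item in res.get('mods').get('listItems'):
        seen[item['itemId']] = item['name']  # later pages simply overwrite duplicates

print(len(seen), 'unique products')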

Python request.get() after few seconds

I want to get the HTML text a few seconds after opening a URL.
Here's the code:
import requests
url = "http://XXXXX…"
html = request.get(url).text
Well, the webpage HTML stays the same right after you "get" the URL with Requests, so there's no need to wait a few seconds: the HTML will not change.
I assume the reason you would like to wait is for the page to load all the relevant resources (e.g. CSS/JS) that modify the HTML?
If so, I wouldn't recommend using the Requests module, as you would have to fetch and apply all of those resources yourself.
I suggest you have a look at Selenium for Python.
Selenium fully simulates a browser, so you can wait and it will load all the resources for your webpage; a minimal sketch follows below.
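A minimal sketch of that Selenium approach, assuming Chrome and a matching chromedriver are installed; the 5-second wait is an arbitrary choice:
import time

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("http://XXXXX")   # placeholder URL from the question
time.sleep(5)                # give the page's scripts a few seconds to run
html = driver.page_source    # HTML after JavaScript has modified the page
driver.quit()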
Try using time.sleep(t):
response = requests.get(url)
time.sleep(5)  # suspend execution for 5 secs
html = response.text
You want to change the last line to:
html = requests.get(url).text
I have found the library requests-html handy for this purpose, though mostly I use Selenium (as already proposed in Danny's answer).
from typing import cast

from requests_html import HTMLSession, HTMLResponse

session = HTMLSession()
req = cast(HTMLResponse, session.get("http://XXXXX"))
req.html.render(sleep=5, keep_page=True)
Now req.html is an HTML object. To get the raw text or the HTML as a string, you can use:
text = req.text
or:
text = req.html.html
Then you can parse your text string, e.g. with Beautiful Soup.
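For instance, a small sketch of that last step, assuming beautifulsoup4 is installed:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")  # 'text' as obtained above
print(soup.title.get_text() if soup.title else "no <title> found")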
Basically, you can give a sleep to the request as a parameter, as below:
import requests
import time

url = "http://XXXXX…"
seconds = 5
html = requests.get(url, time.sleep(seconds)).text  # for example, 5 seconds
