requests & urllib fail to get complete HTML - python

I was trying to get the comments and their authors. The authors are chained so that I know who was replying to whom, so it is important to capture every comment out there; otherwise, replies to a missing comment have nothing to be chained to. (I know it is kind of confusing, but on this website replies are also comments, just special ones that also indicate the author of the comment they reply to.)
The site is a Chinese website (https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%2215184366%22%7D), which I am fetching with requests.
import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
headers = {'User-Agent': user_agent}
# the URL-encoded params carry the answer_id, here 15184366
url = "https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%2215184366%22%7D"
r = requests.get(url, headers=headers, allow_redirects=True)
soup = BeautifulSoup(r.text, "lxml")

for comment in soup.find_all("div", "zm-item-comment"):
    p = comment.find("a", "zg-link author-link")
    print(p)
However, I found that the code above gets me most of the content I want, but with some "holes": most of the comments are listed nicely, yet some are missing. While debugging, I found that the response from requests itself was incomplete; it was missing some comments for unknown reasons.
The console output shows None where the missing comments should be.
I also tried a similar approach using urllib, with no luck.
Could you please help me get the complete HTML, the way the browser does?
Update:
I think the problem has to do with the response from the website: a simple requests.get cannot fetch the full page the way Chrome does. I am wondering whether a fundamental solution for getting the complete HTML exists.
I have tried @eLRuLL's code. It does recover the missing author names. However, the recovered authors all appear as "知乎用户", which means the generic "Zhihu user" of that website (I am expecting distinct, specific user names). The Chrome browser, by comparison, displays the specific user names fine.

Try this. You will have all the authors and comments.
import requests
from bs4 import BeautifulSoup

# the URL-encoded params carry the answer_id, here 15184366
url = "https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%2215184366%22%7D"
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select(".zm-item-comment"):
    try:
        author = item.select(".author-link")[0].text
        comment = item.select(".zm-comment-content")[0].text
        print(author, comment)
    except IndexError:  # comment without an author link or body
        pass

The problem seems to be that you assume every author's name sits inside an a tag. If you check, the comments you are missing are exactly the ones where the user's name carries no link, so you can't use an a tag to find them. To get the author's name in those cases you'd have to use:
p = comment.find("div", "zm-comment-hd").text
print(p)
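Putting the two cases together, a sketch of the full loop might look like this (reusing the soup from the question; the class names are carried over from the code above):

for comment in soup.find_all("div", "zm-item-comment"):
    link = comment.find("a", "zg-link author-link")
    if link is not None:  # author with a profile link
        print(link.get_text(strip=True))
    else:  # link-less author: fall back to the plain header div
        hd = comment.find("div", "zm-comment-hd")
        print(hd.get_text(strip=True) if hd else None)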

Related

Python, Scraping BS4

There are a lot of posts about this subject, but I still haven't managed to achieve what I want, so here is my problem:
I am trying to extract a stock price from this site:
https://bors.e24.no/#!/instrument/NHY.OSE
and I would like to extract the price, 57,12, from the "inspect" text:
<div class="number LAST" data-reactid=".g.1.2.0">
57,12</div>
Here is the code I tried, which raises an AttributeError: 'NoneType' object has no attribute 'text'.
I also tried removing .text from the PRICE line, and then the result is 'Price is: None'.
from bs4 import BeautifulSoup
import requests
url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
PRICE = soup.find('div', class_="number LAST").text
print('Price is:', PRICE)
Try this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
}
api_url = "https://bors.e24.no/server/components?columns=ITEM, LAST, BID, ASK, CHANGE, CHANGE_PCT, TURNOVER, LONG_NAME&itemSector=NHY.OSE&type=table"
data = requests.get(api_url, headers=headers).json()
print(data["rows"][0]["values"]["LAST"])
Output:
56.92
This happens because your
requests.get(url)
will not get all of the information on the page, including the price you are looking for: the webpage loads part of its content first and only then fetches more data. Because of that, trying to select the div with className="number LAST"
PRICE = soup.find('div', class_="number LAST").text
will throw an error, because that element doesn't exist yet.
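One way to make the failure visible is to guard the lookup instead of chaining .text (a small sketch using the same URL and selector as your code):

import requests
from bs4 import BeautifulSoup

url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# find() returns None when the tag is missing from the raw HTML,
# so guard before touching .text
price_tag = soup.find('div', class_='number LAST')
if price_tag is None:
    print('div.number.LAST is not in the static HTML; it is filled in later by JavaScript.')
else:
    print('Price is:', price_tag.text.strip())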
There are some ways to fix this problem:
You can try to use libraries like Selenium, which is often recommended for scraping more dynamic pages that rely on JavaScript and API calls to load content.
You can open your developer tools and inspect the Network tab, where you might find the request that fetches the price you are trying to scrape.
I believe that in your case, after taking a look at the Network tab myself, the right URL to request could be 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history', which seems to return a dictionary with the price you are looking for.
import requests
url = 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history'
page = requests.get(url)
print(page.json()["rows"][0]["values"]["PRICE"])
If you are looking to scrape various instruments, you will need a way to build that URL dynamically to match each stock you want to crawl, which I guess means changing "NHY" and "ose" to values that match the stock you are looking for.
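For example, a hedged sketch of parameterizing that URL; only "NHY" and "ose" are swapped in, the rest is copied verbatim from the URL above, and it is an assumption that other tickers and exchanges follow exactly this pattern:

import requests

def last_price(ticker, exchange='ose'):
    # Assumption: other instruments follow the same ITEM==s<ticker> /
    # feed.<exchange>.trades pattern as NHY.OSE; verify in the Network tab.
    url = ('https://bors.e24.no/server/components'
           '?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID'
           f'&filter=ITEM%3D%3Ds{ticker}&limit=5'
           f'&source=feed.{exchange}.trades.EQUITIES%2BPCC&type=history')
    return requests.get(url).json()["rows"][0]["values"]["PRICE"]

print(last_price('NHY'))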

Can't scrape information from a static webpage using requests module

I'm trying to fetch a product's title and description from a webpage using the requests module. The title and description appear to be static, as they are both present in the page source. However, I failed to grab them with the following attempt; the script throws an AttributeError at the moment.
import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape the title and description from pages like the one above using the requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
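If you need other products, the style id at the end of the product URL also appears in the API path, so it can presumably be parameterized. A hedged sketch (whether the customerId value is required, or stable across sessions, is an assumption):

import requests

def fetch_product(style_id):
    # customerId copied from the API call above; it may not be needed at all
    api = (f'https://www.nordstrom.com/api/ng-looks/styleId/{style_id}'
           '?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7')
    return requests.get(api).json()

jsonData = fetch_product('6638030')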
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar function now) to set up URLs, headers, methods, and parameters, and to see the full response with headers. When you have everything working right, just convert it to Python code; Postman even has some 'export to code' functions for common languages.
Anyways...
I tried your request in Postman and did not get a good response.
Requests made from Python and from a browser are the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is comparing your request with the request the browser makes.
So one or more of the headers included by the browser is what gets a good response from the server; just using User-Agent is not enough.
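A quick way to test that is to replay the browser's full header set and then remove headers one at a time (a sketch; every value below is a placeholder to be replaced with what your own browser actually sends, copied from the Network tab):

import requests

# Placeholder values: paste the real headers from your browser's Network tab.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    # plus any Cookie / sec-* headers the browser includes
}
res = requests.get('https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030',
                   headers=browser_headers)
print(res.status_code, len(res.text))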
I would try to identify which headers matter, but unfortunately, Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obvious handmade request. I think it's my IP that's blocked since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!

Real page content isn't what I get with Requests and BeautifulSoup

As sometimes happens to me, I can't access everything with requests that I can see on the page in the browser, and I would like to know why. On these pages, I am particularly interested in the comments. Does anyone have an idea how to access those comments, please? Thanks!
import requests
from bs4 import BeautifulSoup
import re
url='https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
searched = soup.find_all('td', class_='col1')
print(searched)
Worth knowing: you can get the scoring info for the individual as JSON using a POST request. Handle the JSON as you require.
import requests
import pandas as pd

headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
url = 'https://aukro.cz/backend/api/users/profile?username=paluska_2009'
response = requests.post(url, headers=headers, data="")
response.raise_for_status()

# flatten the nested JSON into a one-row table
df = pd.json_normalize(response.json())
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8', index=False)
Sample view of JSON:
I ran your code and analyzed the content you get back for that page.
It seems aukro.cz is built with Angular (it uses ng-app), so the content is rendered dynamically, and you apparently can't load it using requests alone. You could try Selenium in headless mode to scrape the part of the content you are looking for.
Let me know if you need instructions for it.
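For reference, a minimal headless sketch (assuming Selenium 4 with a Chrome driver available; the td.col1 selector is carried over from your code and may still need adjusting once the Angular app has rendered):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # plain "--headless" on older Chrome
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1")
    driver.implicitly_wait(10)  # give the Angular app time to render
    for cell in driver.find_elements(By.CSS_SELECTOR, "td.col1"):
        print(cell.text)
finally:
    driver.quit()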
To address your curiosity about QHarr's answer:
If you load the URL in the Chrome browser and trace the network calls, you will find a POST request to https://aukro.cz/backend/api/users/profile?username=paluska_2009 whose response is JSON containing your desired information.
This is a common way of scraping data: on most sites you'll find that parts of the page load through separate API calls, and Chrome's Network tool is handy for finding the URL and POST parameters of those requests.
Let me know if you need any further details.

Python scraper with POST request doesn't bring any results

I've written a script to scrape the "First Name" from a webpage using a POST request in Python. However, when running my script I get neither results nor errors. It seems to me that I'm doing things the right way, so I hope somebody will point me in the right direction and show me what I'm missing here:
import requests
from lxml import html
payload = {'ScriptManager1':'UpdatePanel1|btnProceed','__EVENTTARGET':'','__EVENTARGUMENT':'','__VIEWSTATE':'/wEPDwULLTE2NzQxNDczNTcPZBYCAgQPZBYCAgMPZBYCZg9kFgQCAQ9kFgQCAQ9kFgICAQ9kFg4CBQ8QZGQWAGQCFQ8QZGQWAWZkAiEPEGRkFgFmZAI3DxBkZBYAZAI7DxBkZBYAZAJvDw9kFgIeBXZhbHVlZWQCew8PZBYCHwBlZAICD2QWAgIBD2QWAgIBD2QWAmYPZBYSAgcPEGRkFgBkAi0PEGRkFgFmZAJFDxYCHgdFbmREYXRlBmYcik5ut9RIZAJNDxBkZBYBZmQCZQ8WAh8BBmYcik5ut9RIZAJ7DxBkZBYAZAKBAQ8QZGQWAGQCyAEPD2QWAh8AZWQC1AEPD2QWAh8AZWQCBw9kFgICAw88KwARAgEQFgAWABYADBQrAABkGAMFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYDBQxyZG9QZXJtYW5lbnQFDHJkb1Byb3Zpc2lvbgUMcmRvUHJvdmlzaW9uBQlHcmlkVmlldzEPZ2QFCk11bHRpVmlldzEPD2RmZFSgnfO4lYFs09JWdr2kB8ZwSO3808nJf+616Y8YJ3UF','__VIEWSTATEGENERATOR':'5629D98D','__EVENTVALIDATION':'/wEdAAekSVFWk+dy9X9XnzfYeR4NT1Z25jJdJ6rNAjXmHpbD+Q8ekkJ2enuXq0jY/CeUlod/njRPjRiZUniYWoSlesZ/+0XiOc/vwjI5jxqS0D5ang1Wtvp3KMocxPzInS3xjMbN+DvxnwFeFeJ9MIBWR693SSiBqUlIhPoALKQ2G08CpjEhrdvaa2JXqLbLG45vzvU=','r1':'rdoPermanent','txtRegistNo':'SRO0394294','__ASYNCPOST':'true','btnProceed':'Proceed'}
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
response = requests.post("https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx", params=payload, headers=headers).text
tree = html.fromstring(response)
item = tree.xpath('//div[@class="div_input_place"]/input[@id="txt_name"]/@value')
print(item)
The URL is given in my script, and the registration number to get the "First Name" is "SRO0394294". The xpath I've used above is the correct one.
The __VIEWSTATE input is always changing; this input may be there to protect the registration form from bots.
The problem is probably that the __EVENTTARGET field is empty; it may be needed in order to submit your request. In most cases you can find the value to set by inspecting the form's submit button.
Also, since the __VIEWSTATE is regenerated on every request, you'll need to grab a fresh one: first do a GET request and save the __VIEWSTATE input's value, then do the POST request with that value.
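A sketch of that flow, reusing the payload dict from the question (assuming the hidden fields carry the ids __VIEWSTATE and __EVENTVALIDATION on the live page, and guessing 'btnProceed' for __EVENTTARGET; note also that form fields for requests.post belong in data=, not params=):

import requests
from lxml import html

url = "https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx"
# payload: the form dict from the question, whose stale tokens we overwrite
with requests.Session() as s:
    tree = html.fromstring(s.get(url).text)
    payload['__VIEWSTATE'] = tree.xpath('//input[@id="__VIEWSTATE"]/@value')[0]
    payload['__EVENTVALIDATION'] = tree.xpath('//input[@id="__EVENTVALIDATION"]/@value')[0]
    payload['__EVENTTARGET'] = 'btnProceed'  # assumption: the submit button's id
    response = s.post(url, data=payload).text
print(html.fromstring(response).xpath('//input[@id="txt_name"]/@value'))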

Scraping Pantip Forum using BeautifulSoup

I'm trying to scrape some forum posts from http://pantip.com/tag/Isuzu
One such page is http://pantip.com/topic/35647305
I want to get each post's text along with its author and timestamp into a CSV file.
I'm using Beautiful Soup, but admittedly I'm a complete beginner at Python and web scraping. The code I have right now gets the required fields, but only for the first post. I need the information for all posts in the thread. I tried soup.find_all() and soup.select(), but I'm not getting the desired results.
Here's the code I'm using:
from bs4 import BeautifulSoup
import urllib2
print "Reading URL..."
url = urllib2.urlopen("http://pantip.com/topic/35647305")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
print "Finding desired HTML..."
table = soup.select("abbr.timeago")
print "\nScraped HTML is:"
print table
text = BeautifulSoup(str(table).strip(),"html.parser").get_text().encode("utf-8").replace("\n", "")
print "\nScraped text is:\n" + text
Any clues as to what I'm doing wrong would be deeply appreciated. Also, any suggestions as to how this could be done in a better, cleaner way are welcome.
As mentioned, I'm a beginner, so please don't mind any stupid mistakes. :-)
Thanks!
The comments are rendered using an Ajax request:
import requests

params = {"tid": "35647305",  # the number in the topic url
          "type": "3"}

with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
                      "X-Requested-With": "XMLHttpRequest"})
    r = s.get("http://pantip.com/forum/topic/render_comments", params=params)
    data = r.json()  # data["comments"] contains what you want
That will give you all the data, so all you need to do is take the tid from each URL and update it in the params dict.
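For instance, a small sketch (the tid is the trailing number of each topic URL; carrying type=3 over to every thread is an assumption):

import requests

def fetch_comments(topic_url):
    # e.g. http://pantip.com/topic/35647305 -> tid "35647305"
    tid = topic_url.rstrip("/").rsplit("/", 1)[-1]
    headers = {"User-Agent": "Mozilla/5.0",
               "X-Requested-With": "XMLHttpRequest"}
    r = requests.get("http://pantip.com/forum/topic/render_comments",
                     params={"tid": tid, "type": "3"}, headers=headers)
    return r.json()["comments"]

print(len(fetch_comments("http://pantip.com/topic/35647305")))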
