How to use the requests module (Python 2.7) to crawl a .js website? - python

As I am trying to crawl real-time stock info from the Taiwan Stock Exchange, I used their API to access the desired information. Something strange happened.
For example, I can access the information with the API link below, which returns a nice JSON for me in the browser. But it does not return JSON in my program.
My code is the following:
import requests

url = "http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=tse_t00.tw|otc_o00.tw|tse_1101.tw|tse_2330.tw&json=1&delay=0&_=1516681976742"
print url

def get_data(query_url):
    headers = {
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0",
        'Accept-Language': 'en-US'
        #'Accept-Language': 'zh-tw'
    }
    req = requests.session()
    req.get('http://mis.twse.com.tw/stock/index.jsp', headers=headers)
    #print req.cookies['JSESSIONID']
    #print req.cookies.get_dict()
    response = req.get(query_url, cookies=req.cookies.get_dict(), headers=headers)
    return response.text  # json.loads(response.text)

a = get_data(query_url=url)
And it simply returns u'\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'.
Is there something wrong with my code? Or is it simply not possible to access this kind of web page with the requests module?
Or any other suggestions? Thanks a lot!!
ps: Their API is of the format: http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=[tickers]&[time]
pps: I tried the selenium module and its WebDriver. It worked, but it is very slow. That is why I would like to use requests.

As seen in the attached image, the text is not in English, so it is probably being printed as its ASCII value. Use UTF-8 encoding on the result and you should see a different result.
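A minimal sketch of the encoding step this answer suggests, assuming a is the unicode text returned by get_data() above (Python 2.7):
import json

raw = a                       # unicode text returned by get_data(url) above
print raw.encode('utf-8')     # encode before printing so non-English text displays
# If the body really is JSON, it can then be parsed:
# data = json.loads(raw)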

Related

AWS Lambda - python webscraping - unable to bypass cloudfare anti-bots from AWS ip but working from local ip

I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda -- specifically and only for the website I would like to scrape. I've tested out just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.
I've combed through relevant SO and medium articles and tried:
adding the appropriate headers
specifying user agent
using different libraries (urllib, cloudscraper, selenium)
using a virtual display (pyvirtualdisplay with xvfb), as suggested in this post: How to bypass Cloudflare bot protection in selenium
Example code of the urllib version to illustrate the question:
import json
import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    return respData
The above code returns a 403 status + reCAPTCHA.
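For comparison, a minimal sketch of what the cloudscraper variant of the same request could look like (cloudscraper is one of the libraries listed above; this is an illustration, not the exact code that was used):
import cloudscraper

def lambda_handler(event, context):
    # cloudscraper wraps a requests session and tries to pass
    # Cloudflare's JavaScript challenge transparently
    scraper = cloudscraper.create_scraper()
    resp = scraper.get('https://disboard.org/servers/tag/python/15')
    return resp.text
Per the list of attempts above, this did not get past the check from the Lambda IP either.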
I understand that data center IP ranges get handled more carefully by antispam than residential IPs -- is there any workaround for this?
Thank you in advance.

Can't scrape information from a static webpage using requests module

I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as they are both present in the page source. However, I failed to grab them with the following attempt. The script currently throws an AttributeError.
import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape the title and description from pages like this using the requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
When testing requests like these you should output the response to see what you're getting back. Best to use something like Postman (I think VSCode has a similar function to it now) to set up URLs, headers, methods, and parameters, and to also see the full response with headers. When you have everything working right, just convert it to python code. Postman even has some 'export to code' functions for common languages.
Anyways...
I tried your request in Postman and got this response:
Requests sent from Python and from a browser are the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is comparing your request with the request the browser makes:
So one or more of the headers included by the browser gets a good response from the server, but just using User-Agent is not enough.
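As a rough illustration, sending a fuller set of browser-like headers with requests might look something like this (the extra header names and values are typical browser defaults and are only an example, not a verified fix for this site):
import requests

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'

# Copy the request headers shown in the browser's dev tools, not just User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.nordstrom.com/',
    'Connection': 'keep-alive',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    print(res.status_code)  # compare this with the 200 the browser gets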
I would try to identify which headers, but unfortunately, Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obvious handmade request. I think it's my IP that's blocked since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!

Crawling a site with iframe

I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in an incognito/private session it doesn't. The same happens when I try to do this via requests or wget.
I have tried using requests with a Session, but that doesn't work either. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
r.text
The last line only returns JavaScript text with an error saying the URL is invalid.
I know Selenium can solve this problem, but I am considering it as a last option.
Is it possible to crawl this URL with requests (or without using JavaScript)? If yes, any help would be appreciated. If not, is there any alternative lightweight JavaScript-rendering library in Python that can achieve this?
Your issue can easily be solved by adding custom headers to your requests. All in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers = headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers = headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
I hope this helps!

Python: What is returned when I use requests.get('url') and print r.text?

I'm trying to scrape this webpage. This code works:
import requests
header = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',
}
r = requests.get('http://www.machinefinder.com/ww/en-US/categories/used-drawn-planters', headers=header)
print r.text
but I'm not sure what the text it returns really is. I wish it were JSON so that I could copy other examples I've found that parse JSON.
Note: my work security blocks the webpage and says "Illegal Web Browser" when I use
header={
'Content-Type': 'application/json;charset=UTF-8',
}
which is why I'm using Firefox instead.
>>> type(r.text)
<type 'unicode'>
It looks to be the HTML for the page. You could use Beautiful Soup to parse it: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
You can't get an arbitrary website to return JSON-formatted data unless it provides a way to request (and get back) JSON-formatted data.
r.text will generally hold the website's source code, unless, again, it specifically returned JSON data.
So, you will have to resort to other means of parsing websites, such as BeautifulSoup.
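For example, a minimal sketch of parsing that HTML with BeautifulSoup (what gets pulled out below, the page title and the links, is just to show the idea; the real selectors depend on the page):
import requests
from bs4 import BeautifulSoup

header = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',
}
r = requests.get('http://www.machinefinder.com/ww/en-US/categories/used-drawn-planters', headers=header)

soup = BeautifulSoup(r.text, 'html.parser')
print soup.title.get_text()          # page title
for a in soup.find_all('a', href=True):
    print a['href']                  # every link on the page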

Reading walmart product page with urllib doesn't work when using "user-agent" string

I'm building a Django-based website where some data is dynamically loaded using Ajax from a user-specified URL. For this I'm using urllib2 and, later on, BeautifulSoup. I came across something strange with Walmart links. Take a look:
import urllib2
url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
# 1 - read the url without user-agent string
opened_url = urllib2.urlopen(url_to_parse)
print len(opened_url.read())
# prints 309316
# 2 - read the url with user-agent string
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }
req = urllib2.Request(url_to_parse, '', headers)
opened_url = urllib2.urlopen(req)
print len(opened_url.read())
# prints 0
My question is: why is a zero printed in case #2? I use the user-agent method to deal with other websites (like Amazon).
Wget is able to get the page content with no problems btw.
Your problem is not the User-Agent, it is your data parameter.
From the docs:
data may be a string specifying additional data to send to the server,
or None if no such data is needed.
Passing any non-None data argument makes urllib2 send a POST instead of a GET, and it seems WalMart does not like your empty POST body. Change your call to this:
req = urllib2.Request(url_to_parse, None, headers)
Now both ways print the same value.
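To see this directly, Request.get_method() shows how the data argument flips the HTTP verb (a small sketch):
import urllib2

url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }

# data='' is not None, so urllib2 treats the request as a POST with an empty body
print urllib2.Request(url_to_parse, '', headers).get_method()    # POST

# data=None keeps it a plain GET
print urllib2.Request(url_to_parse, None, headers).get_method()  # GET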
