I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in an incognito/private session it doesn't. The same happens when I try to do this via requests or wget.
I have tried using requests with a Session, but that doesn't work either. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
r.text
The last line only returns JavaScript text with an error saying the URL is invalid.
I know Selenium can solve this problem, but I am keeping it as a last resort.
Is it possible to crawl this URL with requests (i.e., without executing JavaScript)? If yes, any help would be appreciated. If not, is there any lightweight JavaScript-capable library in Python that can achieve this?
Your issue can be easily solved by adding custom headers to your requests. All in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers=headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers=headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
I hope this helps!
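If you need the headers on every request anyway, a small variation is to set them once on the Session; requests then merges them into each outgoing request automatically. A minimal sketch (the header values are just examples, and the request is only prepared, not sent):

```python
import requests

# Sketch: attach the headers to the Session once instead of passing
# headers= on every call; requests merges session headers into each request.
s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Language": "en-US,en;q=0.5",
})

# Prepare (without sending) a request to confirm the headers are merged in.
prepared = s.prepare_request(requests.Request("GET", "https://www.epc.shell.com/welcome.asp"))
print(prepared.headers["Accept-Language"])  # en-US,en;q=0.5
```

This keeps the two s.get(...) calls in the answer above unchanged while guaranteeing both carry the same headers.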
Related
I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, I failed to grab them using the following attempt; the script throws an AttributeError at the moment.
import requests
from bs4 import BeautifulSoup
link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape title and description from above pages using requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
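For reference, the pd.json_normalize call above just flattens a dict of product records into DataFrame rows. A minimal self-contained sketch of the same pattern (the sample records are made up, mirroring jsonData['products'].values()):

```python
import pandas as pd

# A dict keyed by color-coded product id, like the 'products' object
# returned by the Nordstrom API (values are invented for illustration).
products = {
    "6638030-400": {"id": "6638030-400", "name": "Plaid Shirt", "price": "$149.00"},
    "6638030-401": {"id": "6638030-401", "name": "Plaid Shirt", "price": "$149.00"},
}

# json_normalize turns the list of record dicts into one row per record.
df = pd.json_normalize(list(products.values()))
print(df.iloc[0]["name"])  # Plaid Shirt
```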
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response with headers. When you have everything working right, just convert it to Python code; Postman even has 'export to code' functions for common languages.
Anyways...
I tried your request in Postman and got this response:
Requests made from Python vs. a browser are the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is comparing the difference between your request and the request made by the browser:
So one or more of the headers included by the browser gets a good response from the server, but just using User-Agent is not enough.
I would try to identify which headers, but unfortunately, Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obvious handmade request. I think it's my IP that's blocked since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!
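The header-comparison idea above can be automated: drop one browser header at a time, re-request, and see which removals change the response. A rough sketch (find_required_headers and the injectable fetch parameter are hypothetical names, and fetch is injectable so the logic can be tried without hitting the network):

```python
def find_required_headers(url, browser_headers, fetch=None):
    """Drop one header at a time and re-request; headers whose removal
    changes the response are the ones the server actually checks."""
    if fetch is None:
        import requests  # only needed for the default network-based fetch
        fetch = lambda u, h: requests.get(u, headers=h).status_code

    baseline = fetch(url, browser_headers)  # response with the full header set
    required = []
    for name in browser_headers:
        trimmed = {k: v for k, v in browser_headers.items() if k != name}
        if fetch(url, trimmed) != baseline:  # removal changed the outcome
            required.append(name)
    return required
```

Mind the caveat from above, though: probing a server repeatedly like this is exactly the kind of 'unusual activity' that can get an IP blocked, so space the requests out.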
I'm trying to scrape a certain website with a single input.
Right now I have built it with Scrapy, and it's working great after all of the tweaks (including not obeying robots.txt), and it's running in a loop automatically to mine data.
Now I need to make something that will scrape a single page by input.
The problem is, the only page I'm able to access is the robots.txt page, and I'm not able to find any info online about getting around robots.txt.
Is there any tutorial on how to do it with BS or Requests?
Try passing these headers, and you will get the expected output.
import requests
headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
URL = "https://www.crunchbase.com/login"
response = requests.get(url=URL, headers=headers)
print(response.text)
Hope this helps!
requests is the module you use to actually get the HTML; beautifulsoup is the parser you use to move through the HTML (it lets you pick the elements you want). As for your question: requests doesn't actually care about the robots.txt file (whether a page is allowed or not). If your requests are getting blocked, I suggest sending request headers.
scrapy, on the other hand, actually reads and respects robots.txt, and you will have to set ROBOTSTXT_OBEY = False in order to scrape a "not-allowed" page.
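If you ever want to check what a site's robots.txt would allow before deciding (purely informational; requests and BeautifulSoup never consult it on their own), the standard library can parse it. A minimal sketch with a made-up robots.txt body:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body and query it. Only frameworks like Scrapy
# enforce these rules; requests ignores robots.txt entirely.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

In practice you would call rp.set_url(...) and rp.read() to fetch the live robots.txt instead of pasting the body in.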
As sometimes happens to me, I can't access everything with requests that I can see on the page in the browser, and I would like to know why. On these pages, I am particularly interested in the comments. Does anyone have an idea how to access those comments, please? Thanks!
import requests
from bs4 import BeautifulSoup
import re
url='https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
searched = soup.find_all('td', class_='col1')
print(searched)
Worth knowing you can get the scoring info for the individual as JSON using a POST request. Handle the JSON as you require.
import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
url = 'https://aukro.cz/backend/api/users/profile?username=paluska_2009'
response = requests.post(url, headers=headers, data="")
response.raise_for_status()
data = json_normalize(response.json())
df = pd.DataFrame(data)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8',index = False )
Sample view of JSON:
I ran your code and analyzed the content you get on the page.
It seems aukro.cz is built with Angular (it uses ng-app), so it is all dynamic content that you apparently can't load using requests. You could try selenium in headless mode to scrape the part of the content you are looking for.
Let me know if you need instructions for it.
To address your curiosity about QHarr's answer:
Upon loading the URL in the Chrome browser, if you trace the network calls, you will find a POST request to the URL https://aukro.cz/backend/api/users/profile?username=paluska_2009, whose response is a JSON that contains your desired information.
This is a typical way of scraping data. While web scraping, on most sites you'll find that part of the page is loaded through other API calls. To find the URL and POST params for such a request, Chrome's Network tab is a handy tool.
Let me know if you need any further details.
As I am trying to crawl real-time stock info from the Taiwan Stock Exchange, I used their API to access the desired information, and something strange happened.
For example, I can access the information through the following API link, which returns a nice JSON for me in the browser. But it does not return a JSON in my program.
My code is the following
import requests

url = "http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=tse_t00.tw|otc_o00.tw|tse_1101.tw|tse_2330.tw&json=1&delay=0&_=1516681976742"
print(url)

def get_data(query_url):
    headers = {
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0",
        'Accept-Language': 'en-US'
        # 'Accept-Language': 'zh-tw'
    }
    req = requests.session()
    req.get('http://mis.twse.com.tw/stock/index.jsp', headers=headers)
    # print(req.cookies['JSESSIONID'])
    # print(req.cookies.get_dict())
    response = req.get(query_url, cookies=req.cookies.get_dict(), headers=headers)
    return response.text  # json.loads(response.text)

a = get_data(query_url=url)
And it will simply return u'\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'.
Is there something wrong in my code? Or it is simply not possible to access this kind of web pages with request module?
Or any other suggestions? Thanks a lot!!
ps: Their API is of the format: http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=[tickers]&[time]
pps: I try the module selenium and its webDriver. It worked but is very slow. That is why I would like to use request.
As seen in the attached image, the text is non-English, so it is probably being printed as escaped ASCII values. Decode the result as UTF-8 and you will see a different result.
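Concretely, the decoding can be forced on the requests response before reading .text. A minimal sketch (the commented lines show where this would plug into the question's get_data function, and the byte string below just stands in for a raw non-ASCII response body):

```python
# requests guesses the charset from the Content-Type header; when that
# guess is wrong for a non-English page, set .encoding explicitly:
#
#   response = req.get(query_url, headers=headers)
#   response.encoding = 'utf-8'
#   print(response.text)
#
# The underlying operation is just decoding the raw bytes as UTF-8:
raw = '台積電'.encode('utf-8')   # e.g. a stock name as the API might return it
print(raw.decode('utf-8'))       # 台積電
```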
What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply do res = requests.get("https://www.google.com/#q=" + query), it doesn't come back with the same HTML as when I inspect the Google page in Safari. The same happens with urllib, and a similar thing happens when I use Bing. I am familiar with AJAX; however, it seems that that is now deprecated.
In Python, if you do not specify the User-Agent header in HTTP requests manually, Python adds one for you by default, which can be detected by Google and may be blocked by it.
Try the following and see if it helps.
import urllib.request

yourUrl = "post it here"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers=headers)
page = urllib.request.urlopen(req)
html = page.read().decode('utf-8')