What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply run res = requests.get("https://www.google.com/#q=" + query), it doesn't come back with the same HTML I see when I inspect the Google page in Safari. The same happens with urllib, and a similar thing happens when I use Bing. I am familiar with AJAX; however, it seems that that approach is now deprecated.
In Python, if you do not set the User-Agent header on HTTP requests yourself, Python adds a default one for you, which Google can detect and may block.
Try the following and see if it helps.
import urllib.request

yourUrl = "post it here"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers=headers)
page = urllib.request.urlopen(req)
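To check what came back, you can read and decode the body of that response (a small follow-up sketch, assuming the request above succeeded and the page is UTF-8 encoded):
print(page.read().decode('utf-8')[:500])   # first part of the HTML that was returned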
I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, I failed to grab them with the following attempt; the script currently throws an AttributeError.
import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape the title and description from the above page using the requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd

# the "looks" API endpoint behind the product page
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()

# flatten the products dict into a DataFrame
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
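If you just need the specific fields from the question, they can be pulled directly out of that first row (a small follow-up based only on the columns shown above; note that the product-page selling statement itself does not appear in this particular payload):
print(df.iloc[0]['name'])    # ANINE BING Women's Plaid Shirt
print(df.iloc[0]['price'])   # $149.00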
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response with its headers. When you have everything working, just convert it to Python code. Postman even has 'export to code' functions for common languages.
Anyways...
I tried your request in Postman and compared it with the request my browser makes. Requests made from Python and from a browser are fundamentally the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is to diff your request against the browser's request.
Evidently one or more of the headers included by the browser gets a good response from the server, while the User-Agent alone is not enough.
I would try to identify which headers matter, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obvious handmade request. I think it's my IP that's blocked since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
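If your IP is still fine, one thing worth trying is to replicate the browser's full header set in a requests session rather than only the User-Agent. A minimal sketch along those lines (the header values below are generic browser-style examples, not a verified working set for this site):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get('https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030')
    print(res.status_code)   # compare against what the browser gets
Swapping headers in and out one at a time is usually the quickest way to see which one the server actually cares about.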
Best of luck!
I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in another incognito/private session it doesn't. The same happens when I try to do this via requests or wget.
I have tried using requests along with a Session, but it still doesn't work. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
print(r.text)
The last line only returns JavaScript along with an error saying the URL is invalid.
I know Selenium can solve this problem, but I am considering it only as a last resort.
Is it possible to crawl this URL with requests (i.e. without executing JavaScript)? If yes, any help would be appreciated. If not, is there any lightweight alternative JavaScript-capable library in Python that can achieve this?
Your issue can be easily solved by adding custom headers to your requests; all in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers = headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers = headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
I hope this helps!
I am trying to log in to my university website using Python and the requests library with the following code; nonetheless, I am not able to.
import requests

payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"
            }

with requests.Session() as session:
    session.post('', data=payloads)
    get = session.get("")
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in, you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them along with the credentials. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests

s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"

# fetch the login page and collect the hidden form inputs
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}

# add the credentials and post the form with browser-like headers
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
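As a quick sanity check (a minimal follow-up; the exact indicator of success depends on the site), you can inspect where the POST ended up and what it returned:
print(result.status_code)
print(result.url)                 # a redirect back to the intranet page usually means the login worked
print("Logout" in result.text)    # hypothetical marker; look for text that only appears when logged in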
Hope this works
In order to log in to a website with Python, you will have to use a more involved approach than the requests library alone, because you need to simulate a browser in your code and have it make the login requests to the school's servers. The reason is that the school's server needs to think the request is coming from a browser; it then returns the contents of the resulting page, and those contents have to be rendered before you can scrape them. Luckily, a great way to do this is with the selenium module in Python.
I would recommend googling around to learn more about Selenium. This blog post is a good example of using Selenium to log into a web page, with detailed explanations of what each line of code is doing. This SO answer on using Selenium to log in to a website is also a good entry point.
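For orientation, here is a minimal Selenium sketch of that flow (the login URL is a placeholder since the question omitted it, and the element names are simply borrowed from the question's payload, so they may not match the page's actual markup):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # assumes a matching chromedriver is available
driver.get("https://example.com/login")          # placeholder: substitute the real login URL

# fill in the form fields; inspect the page to confirm the real name attributes
driver.find_element(By.NAME, "User_ID").send_keys("<username>")
driver.find_element(By.NAME, "Password").send_keys("<password>")
driver.find_element(By.NAME, "Password").submit()   # submits the surrounding form

# the rendered, logged-in page is now available to scrape
print(driver.page_source[:500])
driver.quit()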
I'm trying to scrape some information from Indeed.com using urllib. Occasionally, the job link redirects to the hiring company's webpage. When this happens, Indeed throws up some HTML about using an incompatible browser or device rather than continuing to the redirected page. After looking around, I found that in most cases spoofing urllib's User-Agent to look like a browser is enough to get around this, but that doesn't seem to be the case here.
Any suggestions on where to go beyond spoofing the User-Agent? Is it possible Indeed can tell the User-Agent is spoofed, and that there is no way around this?
Here's an example of the code:
import urllib.request
from fake_useragent import UserAgent
from http.cookiejar import CookieJar

ua = UserAgent()
website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', ua.chrome)]
response = opener.open(req)
print(response.read().decode('utf-8'))
Thanks for the help!
This header usually works:
HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
Another option is to use the requests package.
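For example, a minimal requests version using those same headers (a sketch, reusing the job URL from the question; requests follows redirects by default, so you should end up on the employer's page):
import requests

HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

url = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
response = requests.get(url, headers=HDR)
print(response.url)           # the final URL after any redirects
print(response.text[:500])    # start of the page you actually landed on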
I'm reading a website's content using the following 3-liner. I used an example domain that is for sale and doesn't have much content.
import requests

url = "http://localbusiness.com/"
response = requests.get(url)
html = response.text
It returns the following HTML, although the website contains much more HTML when you check it through View Source. Am I doing something wrong here?
Python version 2.7
<html><head></head><body><!-- vbe --></body></html>
Try setting a User-Agent:
import requests

url = "http://localbusiness.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
html = response.text
The default User-Agent header set by requests is something like python-requests/2.8.1 (depending on your version). Try to make the request look like it is coming from a browser rather than a script.
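You can confirm exactly what was sent in either case by inspecting the prepared request attached to the response (a small sketch; the exact default User-Agent string depends on the installed requests version):
import requests

url = "http://localbusiness.com/"
plain = requests.get(url)
print(plain.request.headers)      # shows the default python-requests User-Agent

spoofed = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'})
print(spoofed.request.headers)    # now shows the browser-style value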
@jason answered it correctly, so I am extending his answer with the reasons why.
Why it happens
Some DOM elements are created or changed by Ajax calls and JavaScript, so they will not be seen in the response to your request (although that is not the case here, since you are already comparing against the page source (Ctrl+U) rather than the rendered elements).
Some sites use the User-Agent to determine the nature of the user (desktop or mobile) and tailor the response accordingly (the probable case here).
Other alternatives
You can use the mechanize module in Python to mimic a browser and fool a website (it comes in handy when the site uses some sort of authentication cookies). A small tutorial. A minimal sketch also follows this list.
Use Selenium to drive an actual browser.
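Here is the mechanize sketch mentioned above, assuming mechanize is installed in the Python 2.7 environment from the question (the User-Agent string is just the one from the earlier answer and can be any browser-like value):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)    # don't consult robots.txt for this quick test
br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36')]
response = br.open('http://localbusiness.com/')
print(response.read())         # the full HTML as this simulated browser sees it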