MechanicalSoup: How to set cookies like requests.Session() in MechanicalSoup - python

Last time I had a question:
MechanicalSoup and Request
It was closed because it contained two questions and only one question is allowed. It's weird, but I'll take it, since I managed to answer question 2 with the code below using the re module.
import re

html = "<boop boop bap bap> </boop boop bap bap> <title form=ipooped> test </title>"
match_results = re.search("<title.*?>.*?</title.*?>", html, re.IGNORECASE)
content = match_results.group()  # strip the remaining tags with: content = re.sub("<.*?>", "", content)
print(content)
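As an aside, since BeautifulSoup shows up all over the related answers below: the same title extraction can be done without regular expressions. This is just a sketch of the alternative, assuming html.parser copes with the malformed boop tags (it is generally lenient about them):
from bs4 import BeautifulSoup

html = "<boop boop bap bap> </boop boop bap bap> <title form=ipooped> test </title>"
soup = BeautifulSoup(html, 'html.parser')

# soup.title finds the <title> tag regardless of the junk around it
print(soup.title.get_text(strip=True))  # expected: test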
So now I'm going to ask the question again: how can I set cookies in MechanicalSoup the way I do with requests.Session()? For example, this is what I currently do with requests:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'cookie': 'PHPSESSID=a9ej3sro77tkdoh7hdhj832m68; security=low ...'
}
response = session.get(url, headers=headers)
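In case it helps while the question is open: MechanicalSoup's StatefulBrowser wraps a requests.Session and exposes it as browser.session, so the same header and cookie machinery is available there. A minimal sketch, with example.com used as a placeholder domain/URL:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# browser.session is a regular requests.Session, so default headers
# and cookies set here are sent with every request the browser makes.
browser.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
})
browser.session.cookies.set('PHPSESSID', 'a9ej3sro77tkdoh7hdhj832m68',
                            domain='example.com')  # placeholder domain
browser.session.cookies.set('security', 'low', domain='example.com')

browser.open('https://example.com/')  # placeholder URL
If memory serves, the browser also has a set_user_agent() helper for the user agent specifically, but going through browser.session keeps it identical to the plain requests code above.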

Related

trouble logging in to LinkedIn with requests_html

I'm trying to scrape data from LinkedIn. To do that I'll need to log in; the issue is that I get an error in the returned HTML object. Here is the code:
from requests_html import AsyncHTMLSession

url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
s = AsyncHTMLSession()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}
payload = {
    'session_key': key,            # login e-mail, defined elsewhere
    'session_password': password   # login password, defined elsewhere
}
r = await s.post(url, data=payload, headers=headers)
print(r.html)
Output (not an actual traceback, just the printed response):
<html url='https://www.linkedin.com/checkpoint/lg/login?errorKey=unexpected_error'>
EDIT:
I realized that it was probably a rendered page or some other kind of trickery, so I switched to Selenium. Can't be stubborn with code.
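For reference, a rough Selenium sketch of that switch might look like the code below. The login URL and the username/password field IDs are my assumptions about LinkedIn's login form, not something taken from the original post:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/login')  # assumed login URL

# 'username' and 'password' are assumed element IDs on the login form
driver.find_element(By.ID, 'username').send_keys('your_email@example.com')
driver.find_element(By.ID, 'password').send_keys('your_password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

# The browser executes the page's JavaScript, so the rendered page is available
print(driver.title)
driver.quit()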

Can't parse a Google search result page using BeautifulSoup

I'm parsing web pages using BeautifulSoup from bs4 in Python. When I inspected the elements of a Google search page, the first division had class = 'r', so I wrote this code:
import requests
from bs4 import BeautifulSoup

site = requests.get('<url>')
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
But the command prompt returned just [].
What could've gone wrong, and how do I correct it?
EDIT 1: I edited my code accordingly by adding the dictionary for headers, yet the result is the same [].
Here's the new code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('<url>', headers=headers)
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
NOTE: When I tell it to print the entire page there's no problem, and taking list(page.children) works fine as well.
Some websites require the User-Agent header to be set so that they can reject requests that don't come from a browser. Fortunately, there's a way to pass headers to the request, as shown below:
# Define a dictionary of HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
# Pass in the headers as a keyword argument
requests.get(url, headers=headers)
Note: List of user agents can be found here
>>> give_me_everything = soup.find_all('div', class_='yuRUbf')
Prints a bunch of stuff.
>>> give_me_everything_v2 = soup.select('.yuRUbf')
Prints a bunch of stuff.
Note that you can't do something like this:
>>> give_me_everything = soup.find_all('div', class_='yuRUbf').text
AttributeError: You're probably treating a list of elements like a single element.
>>> for tag in soup.find_all('div', class_='yuRUbf'):
...     print(tag.text)
Prints a bunch of stuff.
Code:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q="narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

give_me_everything = soup.find_all('div', class_='yuRUbf')
print(give_me_everything)
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The main difference is that you don't have to come up with a different solution when something stops working, and thus don't have to maintain the parser.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": '"narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav',
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    print(f'{title}\n{link}\n{displayed_link}\n')
----------
Opposition Corners Modi Govt On Jay Shah Issue, Rafael ...
https://www.outlookindia.com/website/story/no-confidence-vote-opposition-corners-modi-govt-on-jay-shah-issue-rafael-deals-c/313790
https://www.outlookindia.com
Modi, Rahul and Kejriwal describe one another as frauds ...
https://www.business-standard.com/article/politics/modi-rahul-and-kejriwal-describe-one-another-as-frauds-114022400019_1.html
https://www.business-standard.com
...
Disclaimer, I work for SerpApi.

Workaround for blocked GET requests in Python

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.
I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the response is just the error message "We're sorry for any inconvenience, but the site is currently unavailable.", and it has been the same for the last several days. Searching within Firefox returns results without issue, however. I've run the code with a URL for the UK-based site and it works fine, so I wonder if the US site is set up to block attempts to scrape its web searches.
Are there ways to mask the fact that I'm retrieving search results from within Python (e.g. masquerading as a standard search within Firefox), or some other workaround that allows access to the search results?
Code included for reference below:
import pandas as pd
from requests import get
import bs4 as bs
import re

# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'

a = get(baseURL)
soup = bs.BeautifulSoup(a.content, 'html.parser')
info = soup.find_all('div', class_='information-container')
price = soup.find_all('div', class_='vehicle-price')

d = []
for idx, i in enumerate(info):
    ii = i.find_next('ul').find_all('li')
    year_ = ii[0].text
    miles = re.sub(r"[^0-9\.]", "", ii[2].text)
    engine = ii[3].text
    hp = re.sub(r"[^\d\.]", "", ii[4].text)
    p = re.sub(r"[^\d\.]", "", price[idx].text)
    d.append([year_, miles, engine, hp, p])

df = pd.DataFrame(d, columns=['year', 'miles', 'engine', 'hp', 'price'])
By default, Requests sends a unique user agent when making requests.
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.
To get around this, you can change your user agent when sending a request. Since it's working in your browser, simply copy your browser's user agent (you can Google it, or inspect a request in your browser's developer tools and copy it from there). For me, it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:
>>> headers = {
... 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }
and then send the request with the new headers (the new headers are added to the default headers, they don't replace them unless they have the same name):
>>> r = requests.get('https://google.com', headers=headers) # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.
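If you're making several requests, a requests.Session will keep those headers (and any cookies the site sets) across calls; this is also essentially what MechanicalSoup does under the hood in the first question above. A small sketch with example.com as a placeholder URL:
import requests

session = requests.Session()

# Headers set on the session are merged into every request it sends
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
})

r = session.get('https://example.com/')  # placeholder URL
# Cookies the server returned are stored and resent automatically
print(r.request.headers['User-Agent'])
print(session.cookies.get_dict())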

Python Scrape links from google result

Is there any way I can scrape certain links containing specific words from a Google result page, using BeautifulSoup or Selenium?
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
I want to extract the links that point to groups.
Not sure exactly what you want to do, but if you want to extract Facebook links from the returned content, you can just check whether facebook.com is within the URL:
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html5lib')

for link in soup.findAll('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
Update:
There is another workaround: set a legitimate user-agent, i.e. add headers that emulate a browser:
# This is a standard user-agent of a Chrome browser running on Windows 10
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
Example:
from bs4 import BeautifulSoup
import requests

URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get(URL, headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')

for link in soup.findAll('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
Additionally, you can add more headers to look even more like a legitimate browser:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
As I understand it, you need to get all the links from the Google search results that contain specific words. I assume you are talking about this query: site:facebook.com friends groups.
For site:facebook.com you don't need a special check to see whether that expression is present in the link, because you already wrote the advanced operator site: in the search query, so Google returns results only from that site.
But for friends groups a special check is needed, so let's see how this can be implemented.
To get these links, you need the selector that contains them. In this case it is the .yuRUbf a selector. Let's use the select() method, which returns a list of all matching elements.
To iterate over all links, loop over the list that select() returned and use get('href') or ['href'] to extract the href attribute, which is the URL in this case.
In each iteration of the loop, check for the presence of the specific words in the URL:
for result in soup.select(".yuRUbf a"):
    if any(word in result["href"].lower() for word in ("groups", "friends")):
        print(result["href"])
Also, make sure you're sending a user-agent request header so the request looks like a "real" user visit. The updated workaround in 0xInfection's answer worked because the default requests user-agent is python-requests, and websites can tell that it's most likely a script sending the request. Check what your user-agent is.
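A quick way to check which user-agent your script is actually sending is to hit an echo service; this sketch assumes the public httpbin.org is reachable:
import requests

# httpbin.org/headers echoes back the request headers it received
r = requests.get('https://httpbin.org/headers')
print(r.json()['headers']['User-Agent'])  # e.g. python-requests/2.x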
To minimize blocks from Google, I decided to add a basic example of using proxies via requests.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml

session = requests.Session()
# Route all requests made through this session via a proxy
session.proxies = {
    'http': 'http://10.10.10.10:8000',
    'https': 'http://10.10.10.10:8000',
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "site:facebook.com friends groups",
    "hl": "en",  # language
    "gl": "us"   # country of the search, US -> USA
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

# Use the session so the proxies defined above are actually applied
html = session.get("https://www.google.co.in/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".yuRUbf a"):
    if any(word in result["href"].lower() for word in ("groups", "friends")):
        print(result["href"])
Output:
https://www.facebook.com/groups/funwithfriendsknoxville/
https://www.facebook.com/FWFNYC/groups
https://www.facebook.com/groups/americansandfriendsPT/about/
https://www.facebook.com/funfriendsgroups/
https://www.facebook.com/groups/317688158367767/about/
https://m.facebook.com/funfriendsgroups/photos/
https://www.facebook.com/WordsWithFriends/groups
Or you can use the Google Organic Results API from SerpApi. It bypasses blocks from search engines, and you don't have to create the parser from scratch or maintain it.
Code example:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),         # your serpapi api key
    "engine": "google",                      # search engine
    "q": "site:facebook.com friends groups"  # search query
    # other parameters
}

search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
result_dict = search.get_dict()  # JSON -> Python dict

for result in result_dict['organic_results']:
    if any(word in result['link'].lower() for word in ("groups", "friends")):
        print(result['link'])
Output:
https://www.facebook.com/groups/126440730781222/
https://www.facebook.com/FWFNYC/groups
https://m.facebook.com/FS1786/groups
https://www.facebook.com/pages/category/AIDS-Resource-Center/The-Big-Groups-159912964020164/
https://www.facebook.com/groups/889671771094194
https://www.facebook.com/groups/480003906466800/about/
https://www.facebook.com/funfriendsgroups/

Login website with python requests

I'm trying to log in to a web page with Python 3 using requests and lxml. However, after sending a POST request to the login page, I can't access the pages that are only available after logging in. What am I missing?
import requests
from lxml import html

session_requests = requests.session()
login_URL = 'https://www.voetbal.nl/inloggen'
r = session_requests.get(login_URL)
tree = html.fromstring(r.text)
form_build_id = list(set(tree.xpath("//input[@name='form_build_id']/@value")))[0]

payload = {
    'email': 'mom.soccer@mail.com',
    'password': 'testaccount',
    'form_build_id': form_build_id
}

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundarymGk1EraI6yqTHktz',
    'Host': 'www.voetbal.nl',
    'Origin': 'https://www.voetbal.nl',
    'Referer': 'https://www.voetbal.nl/inloggen',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

result = session_requests.post(
    login_URL,
    data=payload,
    headers=headers
)

pvc_url = 'https://www.voetbal.nl/club/BBCB10Z/overzicht'
result_pvc = session_requests.get(
    pvc_url,
    headers=headers
)
print(result_pvc.text)
The account in this sample is activated, but it's just a test account that I created to put my question up here. Feel free to try it out.
Answer:
There were multiple problems:
Payload: 'form_id': 'voetbal_login_login_form' was missing. Thanks @t.m.adam
Cookies: the request cookies were missing. They seem to be static, so I added them manually, which worked. Thanks @match and @Patrick Doyle
Headers: removed the 'Content-Type' line, which contained a dynamic part.
Login works like a charm now! A sketch of the corrected request follows below.
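Pulling those three fixes together, the login part would look roughly like the sketch below. It is based only on the points above; the static cookie name and value are placeholders, not the real ones from the site:
import requests
from lxml import html

session_requests = requests.session()
login_URL = 'https://www.voetbal.nl/inloggen'

# Static request cookies, added manually (placeholder name/value)
session_requests.cookies.update({'some_static_cookie': 'value'})

r = session_requests.get(login_URL)
tree = html.fromstring(r.text)
form_build_id = list(set(tree.xpath("//input[@name='form_build_id']/@value")))[0]

payload = {
    'email': 'mom.soccer@mail.com',
    'password': 'testaccount',
    'form_build_id': form_build_id,
    'form_id': 'voetbal_login_login_form',  # the field that was missing
}

# No hand-written Content-Type header: requests builds the form encoding itself
result = session_requests.post(login_URL, data=payload)
print(result.status_code)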
