I'm trying to push a value into the search box of amazon.com.
I'm using requests rather than selenium (which has a send-keys option).
I've identified the XPath of the search box and now want to push a value into it, e.g. the char "a", "apple", or any other string, and then collect the results.
However when pushing the data with the post method for request I get an error.
Here below is my code:
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get('https://www.amazon.com/', headers=headers)
response_code = page.status_code

if response_code == 200:
    htmlText = page.text
    tree = html.fromstring(page.content)
    search_box = tree.xpath('//input[@id="twotabsearchtextbox"]')
    pushing_keys = requests.post(search_box, 'a')
    print(search_box)
However, I get this error:
requests.exceptions.MissingSchema: Invalid URL "[<InputElement 20b94374a98 name='field-keywords' type='text'>]": No schema supplied. Perhaps you meant http://[<InputElement 20b94374a98 name='field-keywords' type='text'>]?
How do I correctly push any char in the search box with requests?
Thanks
Try using this approach:
import requests

base_url = 'https://www.amazon.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get(base_url, headers=headers)
response_code = page.status_code

if response_code == 200:
    key_word_to_search = 'bean bags'
    pushing_keys = requests.get(f'{base_url}/s/ref=nb_sb_noss', headers=headers, params={'k': key_word_to_search})
    print(pushing_keys.content)
The search box submits its query with a GET request, so you can pass the search term as the k query parameter instead of trying to post to the input element.
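If you also want to collect the results from that response, here is a minimal sketch with lxml. It assumes Amazon's current result markup, where each result is a div with data-component-type="s-search-result" and the title sits inside an h2; that markup changes frequently, so adjust the selector as needed.

from lxml import html

# A sketch for pulling result titles out of the search response above.
# The data-component-type="s-search-result" selector is an assumption about
# Amazon's current markup and may need adjusting.
tree = html.fromstring(pushing_keys.content)
titles = tree.xpath('//div[@data-component-type="s-search-result"]//h2//text()')
for title in titles:
    print(title.strip())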
I would like to scrape the last odds from the archive on this page https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/ but I can't get them with requests. How can I get them without interacting with Selenium?
To trigger the archive odds request in the Developer Tools I need to hover over the odds.
Code
url = "https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/14/"
headers = {
"Referer": "https://www.betexplorer.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
Json = requests.get(url, headers=headers).json()
As the site is loaded by JavaScript, requests alone doesn't work. I have used selenium to load the page and extract the complete source code after everything is loaded.
Then I used beautifulsoup to create a soup object and get the required data.
From the source code you can see that the data-bid attributes of the <tr> elements are what get passed to retrieve the odds data.
I extracted all the data-bid values and passed them, one by one, to the URL you've provided at the very end of your question.
This code will get all the odds data in JSON format:
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

base_url = 'https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/'

driver = webdriver.Chrome()
driver.get(base_url)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'html.parser')
t = soup.find('table', attrs={'id': 'sortable-1'})
trs = t.find('tbody').findAll('tr')

for i in trs:
    data_bid = i['data-bid']
    url = f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{data_bid}/"
    headers = {"Referer": "https://www.betexplorer.com",
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
    Json = requests.get(url, headers=headers).json()
    # Do what you wish to do with the JSON data here....
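For example, instead of just fetching each payload, the same loop could collect every response into a dictionary keyed by its data-bid. This is only a sketch; the structure of the returned JSON isn't documented here, so inspect one payload before relying on specific keys.

# Variant of the loop above: store each odds payload keyed by its data-bid.
headers = {"Referer": "https://www.betexplorer.com",
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
all_odds = {}
for i in trs:
    data_bid = i['data-bid']
    url = f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{data_bid}/"
    all_odds[data_bid] = requests.get(url, headers=headers).json()

print(len(all_odds), "odds payloads collected")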
I am trying to replicate an AJAX request from a web page (https://droughtmonitor.unl.edu/Data/DataTables.aspx). The AJAX call is initiated when we select values from the dropdowns.
I am using the following request in Python, but I am not able to see the same response as in the Network tab of the browser.
import bs4
import requests
import lxml
ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = {'area':'00064', 'statstype':'1'}
resp = ses.post(url,data = req_data,headers = headers_dict)
soup = bs4.BeautifulSoup(resp.content,'lxml')
print(soup)
You need to add a couple of things to your request to get an answer from the server.
You need to convert your dict to JSON so it is passed as a string and not as a dict.
You also need to specify the type of the request data by setting the Content-Type header to application/json; charset=utf-8.
With those changes I was able to request the correct data.
import json

import bs4
import requests

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')

headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                'Content-Type': 'application/json; charset=utf-8'}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = json.dumps({'area': '00037', 'statstype': '1'})

resp = ses.post(url, data=req_data, headers=headers_dict)
soup = bs4.BeautifulSoup(resp.content, 'lxml')
print(soup)
Quite a tricky problem I must say.
From the requests documentation:
Instead of encoding the dict yourself, you can also pass it directly
using the json parameter (added in version 2.4.2) and it will be
encoded automatically:
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
Then, to get the output, call r.json() and you will get the data you are looking for.
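Applied to the Ajax endpoint above, that approach would look roughly like this (a sketch; when json= is used, requests serializes the dict and sets the Content-Type header for you):

import requests

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')

url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

# json= serializes the dict and sets Content-Type: application/json automatically.
resp = ses.post(url, json={'area': '00037', 'statstype': '1'}, headers=headers)

# Inspect the parsed structure; ASP.NET AJAX endpoints often wrap the payload in a 'd' key.
print(resp.json())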
I am new to Python scraping, so as part of my practice I tried a few other sites where often no data was returned at all. When I checked Groupon, I found that urllib only returns the first 8 results, while there are 36 results on the browser page.
I am using urllib and BS4. Below is the code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('https://www.groupon.com/browse/chicago?category=beauty-and-spas')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36')

try:
    with urlopen(req) as response:
        htmlcontent = response.read().decode('utf-8')
except:
    htmlcontent = None

soup = BeautifulSoup(htmlcontent, 'lxml')
all_links = soup.find('div', {'id': 'pull-results'}).select('figure > div > a')
Can somebody please tell me what I am missing in the code to be able to extract all the data?
If this doesn't or shouldn't work, is Selenium the only option?
Try the following to get all the items and their links traversing next pages:
import requests
from bs4 import BeautifulSoup

base_link = 'https://www.groupon.com/browse/chicago?category=beauty-and-spas'
url = 'https://www.groupon.com/partial/browse/get-paginated-cards?'

params = {
    'category': 'beauty-and-spas',
    'page': 1
}

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.get(base_link)
    params['_csrf'] = r.cookies['_csrf']
    while True:
        print("current page----------->", params['page'])
        res = s.get(url, params=params)
        soup = BeautifulSoup(res.json()[0]['cardSlotHtml'], "lxml")
        if not soup.select_one("figure[data-pingdom-info='purchasable-deal']"):
            break
        for item in soup.select("figure[data-pingdom-info='purchasable-deal']"):
            item_title = item.select_one(".cui-udc-details .cui-udc-title").get_text(strip=True)
            item_link = item.select_one(".cui-content > a[href]").get("href")
            print(item_title, item_link)
        params['page'] += 1
The other 28 links are loaded dynamically, so urllib alone won't see them. However, you can scrape them by sending a GET request to:
https://www.groupon.com/partial/browse/get-lazy-loaded-cards?category=beauty-and-spas&_csrf=P6rFPl1o-xDta8uOABKo_9LOiUajyK9bieMg
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request(
    "https://www.groupon.com/partial/browse/get-lazy-loaded-cards?category=beauty-and-spas&_csrf=P6rFPl1o-xDta8uOABKo_9LOiUajyK9bieMg"
)
req.add_header(
    "User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
)

with urlopen(req) as response:
    htmlcontent = response.read().decode("utf-8")

soup = BeautifulSoup(htmlcontent, "lxml")

for tag in soup.find_all(class_=r'\"cui-content\"'):
    try:
        link = tag.find("a")["href"]
    except TypeError:
        continue
    print(link.replace('\\"', "").replace("\\n", ""))
Output:
https://www.groupon.com/deals/yupo-health-2?deal_option=75076cab-9268-4c2d-9459-e79a559dfed6
https://www.groupon.com/deals/infinity-laser-spa-2-4?deal_option=4dc14f7d-29ac-45e1-a664-ea76ddc44718
https://www.groupon.com/deals/dr-laser-nyc-3?deal_option=232325a0-8b3f-4fa9-b42b-7dc668ff7474
...
...
I am trying to get the value marked in the picture extracted into a variable, but it seems that when it is inside Vue components, bs4 is not searching the way I expect. Can anyone point me in the general direction as to how I would be able to extract the value from this document in Python?
The code is found below the picture, thanks in advance.
import requests
from bs4 import BeautifulSoup
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
div_list = soup.findAll({"class":'value'})
print(div_list)
Since the page returns a JSON response, you don't need BeautifulSoup to parse it.
import requests
import json
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
response = requests.get(URL, headers = headers)
dict_of_response = json.loads(response.text)
obj_list = dict_of_response['data']['segments']
print(obj_list)
The obj_list variable now contains a list of dicts. Those dicts contain the data you want, and now you only need to loop through the list and do what you want with the data.
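For example, a minimal sketch of that loop. The 'metadata', 'stats', 'displayName' and 'value' keys are assumptions based on how tracker.gg segment objects are typically shaped, so inspect obj_list and adjust before relying on them.

# Walk each segment and print its stats; the key names below are assumptions.
for segment in obj_list:
    segment_name = segment.get('metadata', {}).get('name', 'unknown segment')
    for stat_key, stat in segment.get('stats', {}).items():
        print(segment_name, stat.get('displayName', stat_key), stat.get('value'))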
I am using beautifulsoup and urllib to extract a webpage. I have set the user agent and the cookie, and yet I fail to receive all the links from the webpage...
Here's my code:
import bs4 as bs
import urllib.request
import requests

#sauce = urllib.request.urlopen('https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93').read()
#soup = bs.BeautifulSoup(sauce,'lxml')
'''
session = requests.Session()
response = session.get(url)
print(session.cookies.get_dict())
'''

url = 'https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
           'Cookie': '_gh_sess=eyJzZXNzaW9uX2lkIjoiMDNhMGI2NjQxZjY4Mjc1YmQ3ZjAyNmJiODM2YzIzMTUiLCJfY3NyZl90b2tlbiI6IlJJOUtrd3E3WVFOYldVUzkwdmUxZ0Z4MHZLN3M2eE83SzhIdVJTUFVsVVU9In0%3D--4485d36d4c86aec01cde254e34db68005193546e; logged_in=no'}

response = requests.get(url, headers=headers)
print(response.cookies)

soup = bs.BeautifulSoup(response.content, 'lxml')
for url in soup.find_all('a'):
    print(url.get('href'))
Is there something I'm missing? In a browser I get links to all of the code, whereas in the script I get only a few of the links, none with the code...
The webpage opens perfectly in the browser.