I am creating an automated account creator in PyCharm. I am facing a problem to which I have yet to find a good solution. I want to get the site key so I can pass the captcha to a solving service I have bought. I have used the requests.get method, but it gives back "None" as a result. I am using Selenium in my program. After some thought I realized that requests.get, even if it worked, would bring me a different key than the one my Selenium driver is currently displaying. I googled a lot and found only a module named Selenium-Requests, which doesn't support Edge. I am using Edge because it is the only browser everyone has and it doesn't require the developer version like Chrome and Firefox do.
In general, I haven't found a fix that lets me retrieve the key from within my driver.
This is the retrieval code:
registerurl = requests.get(url)
registerurlstring = ''.join(str(e) for e in registerurl)
soup = BeautifulSoup(registerurlstring, features="html5lib")
hidden_tags = soup.find({"id": "recaptcha-token"})
sitekey = hidden_tags
try:
    print('Sitekey = ', sitekey)
except:
    print('Sitekey = Not Found')
I am not sure whether this is what you are after. The reCAPTCHA token is inside an iframe, so you have to target the src value of that iframe; then, using the Python requests module, you can get the value of that hidden input.
import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com/recaptcha/api2/anchor?ar=1&k=6Lc3HAsUAAAAACsN7CgY9MMVxo2M09n_e4heJEiZ&co=aHR0cHM6Ly9zaWdudXAuZXVuZS5sZWFndWVvZmxlZ2VuZHMuY29tOjQ0Mw..&hl=en&v=A1Aard-wURuGsXRGA7JMOqVO&theme=dark&size=invisible&badge=bottomright&cb=ezyy1frci5ms'
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input', attrs={"id": "recaptcha-token"})
print(hidden_tags['value'])
Output:
03AOLTBLQFd9hdHGmOesrT0xDcA8MkI6FGIiM3892Uws3aEWzPxUT8-U8IBEZHYzUEba2Jp9m3s9z_sz_fuij9OXZHABulFrI8YCD95kXV_H6xTO9vOubuZfzscleb6fdkkAE3IwUUSdTzPbXILy6SGLPI3LpPUptC1enZLIkQxQq9T8AEPPvCIsVgGe4jSE_l1jCWIRmBeBXsLgPLABZSq6ah6QWFfAngdC1rQaLMKWzLBmzh6ytEEGNYHmEG7P6UVtYcTI1IRIvq-ba-oGIUS1ELUb-1d3upQ29JWBtQ2t7_VNn237fguztf_FUDEHnAfHppUsrz-ZlkE00sMXFCuQ1XF6Qz7lH2j5g2z5KZQiODhRUBRRyd-ydjetz053bKRcgWpnNoZGNf1GBlW5inL9AtyYTkpruttw5sruAPuVgs5mrniQ5hrHNvfDIZKX905T2E21W2DsW1_07rItFYa-zkylMU83YXRQ
Hope this helps.
Updated code to get the iframe src value using webdriver.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://signup.eune.leagueoflegends.com/en/signup/index")
# grab the src of the reCAPTCHA iframe rendered in the current session
url = driver.find_element(By.CSS_SELECTOR, "iframe[role='presentation']").get_attribute('src')
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input', attrs={"id": "recaptcha-token"})
print(hidden_tags['value'])
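Since the question mentions Edge rather than Chrome, the same approach should carry over by swapping the driver. A minimal sketch, assuming Selenium 4+ (which locates msedgedriver automatically) and that the page uses the same iframe markup:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# assumption: Selenium 4+, which manages the Edge driver automatically
driver = webdriver.Edge()
driver.get("https://signup.eune.leagueoflegends.com/en/signup/index")
# read the reCAPTCHA iframe src rendered in this Edge session
url = driver.find_element(By.CSS_SELECTOR, "iframe[role='presentation']").get_attribute('src')
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input', attrs={"id": "recaptcha-token"})
print(hidden_tags['value'])
driver.quit()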
I am attempting to retrieve data from pro-football-reference.com, but due to their restrictions on accessing the site (they ban you if you access it too quickly), I downloaded all of the HTML pages to a data folder on my local machine.
The issue is that now, when I try to open a file using a Selenium webdriver, it takes about 95 seconds to open and retrieve the HTML. Below is the code that I am running. The input 'box_score_file' is one of the local files that I downloaded to my machine. On average the files are around 500 KB.
Is this issue just due to the size of my files, or is there a more efficient way to retrieve this?
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def Parse_File(box_score_file):
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options=options)
    driver.get(f'file://{box_score_file}')
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    [s.decompose() for s in soup.select('tr.over_header')]
    [s.decompose() for s in soup.select('tr.thead')]
    return soup
I've tried using the requests library, but since some of the tables in the HTML file are hidden by JavaScript, with requests I can't retrieve all of the tables that I want to build a DataFrame from.
I've also tried using Playwright, but whenever I attempt to use it I receive the error ModuleNotFoundError: No module named 'pyee.asyncio'. I've tried 'pip install pyee', but anaconda then reports that the requirement is already satisfied. I've also tried async Playwright, but that gives me the same error about there being no module named 'pyee.asyncio'.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def Parse_File(box_score_file):
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(box_score_file)
            print(page.title())
            html = page.content()  # full page HTML (inner_html() requires a selector)
            browser.close()
    except PlaywrightTimeout:
        print(f"Timeout error on {box_score_file}")
    soup = BeautifulSoup(html, 'html.parser')
    [s.decompose() for s in soup.select('tr.over_header')]
    [s.decompose() for s in soup.select('tr.thead')]
    return soup
This is my first time posting, so please let me know if I can provide any additional information or details and thank you for your time.
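As an aside on the first snippet above (this is not from the original post): much of the 95 seconds is likely spent launching a fresh headless Chrome for every file, so one option is to start the driver once and reuse it across files. A rough sketch, assuming the caller can own the driver's lifetime and that the file paths are absolute (the path below is a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def parse_file(driver, box_score_file):
    # reuse an already-running browser instead of starting a new one per file
    driver.get(f'file://{box_score_file}')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for s in soup.select('tr.over_header, tr.thead'):
        s.decompose()
    return soup

options = Options()
options.add_argument('--headless=new')  # assumes a recent Chrome that supports the new headless mode
driver = webdriver.Chrome(options=options)
try:
    for path in ['/absolute/path/to/box_score.html']:  # placeholder list of downloaded files
        soup = parse_file(driver, path)
finally:
    driver.quit()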
Looking to pass Python a list, then, using a combination of BeautifulSoup and requests, pull the corresponding piece of information from each web page.
So I have a list of around 7000 barcodes that I want to pass to this site 'https://www.barcodelookup.com/' (you just add the barcode after the slash), then pull back the manufacturer of that product, which is in the span "product-text". I'm currently trying to get it to run with the code below;
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.barcodelookup.com/194398882321')
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())
price = soup.find('span', {'class' : 'product-text'})
print(price.text)
This gives an error as below;
TypeError: object of type 'Response' has no len()
Any help would be greatly appreciated, thanks
If you inspect the source, you will see that the response status is 403 and the overall source.text reveals that the website is protected by Cloudflare. This means that using requests is not really helpful for you. You need the means to overcome the 'antibot' protection from Cloudflare. Here are two options:
1. Use a third party service
I am an engineer at WebScrapingAPI and I can recommend our web scraping API. We prevent detection by using various proxies, IP rotation, captcha solvers and other advanced features. A basic example of using our API for your scenario is:
import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.barcodelookup.com/194398882321'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
    "timeout": "20000",
    "proxy_type": "residential",
    "extract_rules": '{"elements":{"selector":"span.product-text","output":"text"}}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
Response:
{"elements":["\nUPC-A 194398882321, EAN-13 0194398882321\n","Media ","Sony Uk ","\n1-2-3: The 80s CD.\n"]}
2. Build an undetectable web scraper
You can also try building a more 'undetectable' web scraper on your end. For example, try using a real browser for your scraper, instead of requests. Selenium would be a good place to start. Here is an implementation example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

BASE_URL = 'https://www.barcodelookup.com/194398882321'

driver = webdriver.Chrome()
driver.get(BASE_URL)
# wait (up to 50 seconds) until the element we want is present before reading the page source
wait = WebDriverWait(driver, 50)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.product-text')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span', {'class': 'product-text'})
print(price)
driver.quit()
In time though, Cloudflare might flag your 'fingerprint' and block your requests. Some more things you could add to your project are:
Residential proxies
Advanced Selenium evasions (a rough sketch follows below)
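As an illustration of the 'Selenium evasions' idea, here is a minimal sketch (not a guarantee against Cloudflare; the proxy address is a placeholder you would replace with a real residential proxy):
from selenium import webdriver

options = webdriver.ChromeOptions()
# reduce the most obvious automation signals
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# route traffic through a proxy (placeholder address)
options.add_argument("--proxy-server=http://proxy.example.com:8000")

driver = webdriver.Chrome(options=options)
driver.get("https://www.barcodelookup.com/194398882321")
print(driver.page_source[:500])
driver.quit()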
I am currently trying to scrape live stock market data from the yahoo finance page.
I am using bs4. My current issue is that whenever I run my script, it does not update properly to reflect the current price of the stock.
If anybody has any advice on how to change that it would be appreciated.
import requests
from bs4 import BeautifulSoup

while True:
    page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
    soup = BeautifulSoup(page.text, "html.parser")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(price)
NOT POSSIBLE WITH BS4 ALONE
This website uses JavaScript to update the page; urllib, requests and the like only fetch the static HTML, and BeautifulSoup just parses that HTML, so JavaScript/AJAX content never shows up.
PhantomJS or Selenium drive a real browser that can run the JavaScript behind such dynamic websites. Try using one of these :)
Using Selenium, it can be done as follows:
from selenium import webdriver
import time
from bs4 import BeautifulSoup as soup

# use the Chrome browser, hidden (headless)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

# start Chrome with the options above
driver = webdriver.Chrome(options=chrome_options)

# the URL we need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'

# open the page, just as you would in Chrome
driver.get(url)

while 1:
    # grab the rendered HTML repeatedly, so you pick up the changing price
    html = driver.page_source
    page_soup = soup(html, features="lxml")
    price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
    print(price)
    time.sleep(5)
Not the best comments, but I hope you will understand it :) Otherwise watch a YouTube tutorial to get a proper idea of what a Selenium bot does.
Hope this helps. It's working perfectly for me :) Accept this answer if it helps you.
The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script that collects pictures from a website, in order to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first thing I got was this warning:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script finishes without doing anything.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript, using an XHR request to the API shown below, so you can reach the data directly via that API.
Note that you can increase the parameter rpp=50 to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to save the file directly:
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it; here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    result = []
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are either to see what APIs they're calling to get this data (check the network tab in your browser's inspector; if you're lucky they may not require authentication), or to use a tool that runs JS on pages. One tool I've seen recommended for this is Selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.
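For what it's worth, a minimal sketch of the Selenium route mentioned above (the photo_link class is carried over from the question and may not match what the JS framework actually renders):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://500px.com/editors")
# wait until at least one photo link has been rendered by the JS framework
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.photo_link"))
)
soup = BeautifulSoup(driver.page_source, features="lxml")
# the class name below is an assumption carried over from the question
for link in soup.find_all("a", {"class": "photo_link"}):
    print(link.get("href"))
driver.quit()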
Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
import urllib2
from bs4 import BeautifulSoup

link = 'https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                                     'AppleWebKit/535.1 (KHTML, like Gecko) '
                                     'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin
That page has much of its information loaded and displayed with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore either need to use a browser that runs JavaScript and control it remotely, or write the scraper itself in JavaScript, or deconstruct the site, figure out exactly what it loads with JavaScript and how, and see if you can duplicate those calls.
You can use browser automation tools for this purpose.
For example, this simple Selenium script can do the job:
from selenium import webdriver
from bs4 import BeautifulSoup

link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"

browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]
print vin
BTW, table.contents[0] prints the entire span, including the span tags.
table.span.contents[0] prints only the VIN number.
You could use Selenium, which drives a real browser. This works for me:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox()  # get a local session of Firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0")  # load the page
time.sleep(0.5)  # let the page load
# search for a "span" tag whose "id" attribute contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element_by_xpath("//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
print e.text
# Works for me : u'4JGBF7BE9BA648275'
browser.close()
You do not have to use Selenium.
Just make an additional get request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']