python Script for getting a captcha - python

So i was doing this site scraping for my app. I need to download the captcha image for displaying it to the user. But every time I visit the captcha url it generates a new captcha. How can I download the the dynamically generated captcha for automated Login
eg:https://academics.vit.ac.in/student/stud_login.asp
Here I download the captcha using below script>>>
from bs4 import BeautifulSoup
import urllib2
import urllib
url = "https://academics.vit.ac.in/student/stud_login.asp"
content = urllib2.urlopen(url)
soup = BeautifulSoup(content)
img = soup.find('img',id ='imgCaptcha')
print img
urllib.urlretrieve(img['src'],'captcha.bmp')
But some how this script doesn't seem to work.
1) One solution is to take screenshot and crop out the captcha.
But I need a different solution as I am going to work on devices with various screen sizes so taking screen shot would not solve the purpose.

img['src'] returns a relative url - captcha.asp. You have to make it into an absolute url before you can use it (https://academics.vit.ac.in/student/captcha.asp).
import urlparse
urllib.urlretrieve(urlparse.urljoin(url, img['src']), 'captcha.bmp')

Related

Find non placholder image when webscrpaing Python

I want to get an image from a website, but the website loads a placeholder image before the image I want. I need the image for the logic of the program.
This is the code:
import requests
from bs4 import BeautifulSoup
def main():
r = requests.get("https://www.simcoecountyschoolbus.ca/")
soup = BeautifulSoup(r.content, "html.parser")
northdiv= soup.find("div", id="status-icon-north")
northimages = northdiv.select('img')
statusNorth = northimages[0].get("src")
westdiv = soup.find("div", id="status-icon-west")
print(westdiv)
statusWest = westdiv.select("img")[0].get("src")
print(statusWest)
main()
I want to get the image "images/status-none.png" but it returns "images/status-some.png"
It looks like javascript is loading that data after the initial page load. You'd be better off getting the json data that is loaded from the backend request to this endpoint: https://www.simcoecountyschoolbus.ca/status.json
To find this you can open your browsers Developer Tools, then click the Network - fetch/XHR button and refresh the page... here you will see the backend api requests that load data after the initial page load. If you click on the one that says "status" you'll see the endpoint url, as well as the response which you can inspect for the data you want.

Need help on retrieving site key from dynamic website

I am creating an automated account creator using Pycharm. I am facing a problem to which I have yet to find a good solution. I want to get the site-key in order to pass the captcha to a service I have bought. I have used the requests.get method but it gives back "None" as a result. I am using selenium in my program. After some thought I realized that using requests.get method, if it worked, would bring me a different key than the one that my selenium driver is currently displayed. I googled a lot and found only that there is a module named Selenium-Requests which doesn't have Edge imported. I am using Edge as it is the only browser everyone has and doesn't require the developer version of it like Chrome and Firefox.
Generally I haven't found a fix that can help me retrieve the key within my driver.
This is the retrieve code:
registerurl = requests.get(url)
registerurlstring = ''.join(str(e) for e in registerurl)
soup = BeautifulSoup(registerurlstring, features="html5lib")
hidden_tags = soup.find({"id":"recaptcha-token"})
sitekey = hidden_tags
try:
print('Sitekey = ', sitekey)
except:
print('Sitekey = Not Found')
I am not sure this is what you are after or not.To get the recaptcha value which is inside iframe so you have to target that src value of that iframe and using python request module you can get value of that input.
import requests
from bs4 import BeautifulSoup
url='https://www.google.com/recaptcha/api2/anchor?ar=1&k=6Lc3HAsUAAAAACsN7CgY9MMVxo2M09n_e4heJEiZ&co=aHR0cHM6Ly9zaWdudXAuZXVuZS5sZWFndWVvZmxlZ2VuZHMuY29tOjQ0Mw..&hl=en&v=A1Aard-wURuGsXRGA7JMOqVO&theme=dark&size=invisible&badge=bottomright&cb=ezyy1frci5ms'
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input' ,attrs={"id":"recaptcha-token"})
print(hidden_tags['value'])
Output:
03AOLTBLQFd9hdHGmOesrT0xDcA8MkI6FGIiM3892Uws3aEWzPxUT8-U8IBEZHYzUEba2Jp9m3s9z_sz_fuij9OXZHABulFrI8YCD95kXV_H6xTO9vOubuZfzscleb6fdkkAE3IwUUSdTzPbXILy6SGLPI3LpPUptC1enZLIkQxQq9T8AEPPvCIsVgGe4jSE_l1jCWIRmBeBXsLgPLABZSq6ah6QWFfAngdC1rQaLMKWzLBmzh6ytEEGNYHmEG7P6UVtYcTI1IRIvq-ba-oGIUS1ELUb-1d3upQ29JWBtQ2t7_VNn237fguztf_FUDEHnAfHppUsrz-ZlkE00sMXFCuQ1XF6Qz7lH2j5g2z5KZQiODhRUBRRyd-ydjetz053bKRcgWpnNoZGNf1GBlW5inL9AtyYTkpruttw5sruAPuVgs5mrniQ5hrHNvfDIZKX905T2E21W2DsW1_07rItFYa-zkylMU83YXRQ
Hope this helps.
Updated code to get the iframe src value using webdriver.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver=webdriver.Chrome()
driver.get("https://signup.eune.leagueoflegends.com/en/signup/index")
url=driver.find_element_by_css_selector("iframe[role='presentation']").get_attribute('src')
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input' ,attrs={"id":"recaptcha-token"})
print(hidden_tags['value'])

Error while scraping image with beautifulsoup

The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So i am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file/Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a",{"class":"photo_link "}):
href = link.get('href')
print(href)
img_name = random.randrange(1,500)
full_name = str(img_name) + ".jpg"
urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript using XHR request to the following API
So you can reach it directly via API.
Note that you can increase parameter rpp=50 to any number as you want for getting more than 50 result.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['url'])
also you can access the image url itself in order to write it directly!
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['image_url'][-1])
Note that image_url key hold different img size. so you can choose your preferred one and save it. here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
result = []
for item in r['photos']:
print(f"Downloading {item['name']}")
save = req.get(item['image_url'][-1])
name = save.headers.get("Content-Disposition")[9:]
with open(name, 'wb') as f:
f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.

Python Web Scraping with search and non dynamic URI

I'm a begginer in the world of python and web-scrapers, i am used to make scrapers with dynamic URLs, where the URI change when i input specific parameters in the URL itself.
Ex: Wikipedia.
(if i input a search named "Stack Overflow" i will have a URI that looks like this: https://en.wikipedia.org/wiki/Stack_Overflow)
At the moment i was challenged to develop a web-scraper to collect data from this page.
The field "Texto/Termos a serem pesquisados" corresponds a search field, but when i input the search the URL stays the same not letting me to get the right HTML code for my research.
I am used to work with BeautifulSoup and Requests to do the scraping thing, but in this case it has no use, since the URL stays the same after the search.
import requests
from bs4 import BeautifulSoup
url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bs0bj = BeautifulSoup(html.content,'html.parser')
print(bsObj)
# And from now on i cant go any further
Usually i would do something like
url = 'https://en.wikipedia.org/wiki/'
input = input('Input your search :)
search = url + input
And then do all the BeautifulSoup thing, and findAll thing to get my data from the HTML code.
I have tried to use Selenium too, but im looking for something different than that due to all the webdriver thing. With the following piece of code i have achieved to some odd results but i still cant scrape the HTML in a good way.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
# Acess the page and input the search on the field
driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
driver.switch_to.frame('main2')
busca = driver.find_element_by_id("txtTermo")
busca.send_keys("GESTAO DE PESSOAS")
#data_inicio = driver.find_element_by_id('dt_publ_ini')
#data_inicio.send_keys("01/01/2018")
#data_fim = driver.find_element_by_id('dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element_by_id('ok')
botao.click()
So given all that:
There is a way to scrape data from these static urls ?
Can i input a search in the field via code ?
Why cant i scrape the right source code ?
The problem is that your initial search page is using frames for the searching & results, which makes it harder for BeautifulSoup to work with it. I was able to obtain the search results by using a slightly different URL and MechanicalSoup instead:
>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form() # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text' # input the text to search for
>>> sb.submit_selected() # submit the form
<Response [200]>
>>> page = sb.get_current_page() # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>
Note that the URL I'm using here is that of the frame that has the search form and not the page you provided that was inlining it. This removes one layer of indirection.
MechanicalSoup is built on top of BeautifulSoup and provides some tools for interacting with websites in a similar way to the old mechanize library.

Advice on how to scrape data from this website

I would like some advice on how to scrape data from this website.
I started with selenium, but got stuck at the beginning because, for example, I have no idea how to set the dates.
My code until now:
from bs4 import BeautifulSoup as soup
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import os
import time
import re
day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
cookieValue = '12-c12-cached|from:' +str(year)+ '-' +str(month)+ '-' +str(day-5)+ ','+'to:' +str(year)+ '-' +str(month)+ '-' + str(day) +',dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,durationComparator:ge,durationValue:5,durationUnit:day'
#saving url
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit'
browser.add_cookie({'name': 'tem', 'value': cookieValue})
browser.get(my_url)
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
browser.get(my_url)
Obviously I am not asking for code, just some suggestions on how to continue with Selenium (how to set dates and other data) or any idea on how to scrape this website
Thanks in advance.
EDIT: I am trying to follow the cookie way. That is my updated code, I read that the cookie need to be created before loading the page and so I did, any idea why it is not working?
Best approach for you will be changing cookies, because every filter data is saved in cookie.
Check cookies in chrome ( f12 -> application -> cookies ) and play with filters. If you will change it in programmers tools you have to refresh website :)
Check this post on how to change cookies in selenium python.
To get values from website you have to use classic way like u did here, but you will have to use classes:
radio = browser.find_elements_by_class_name('aaaaaa')
You can always use xPath to search elements ( chrome will generate them for you ).
Is there any particular reason why you have decided to use selenium over other web scraping tools (scrapy, urllib, etc.)? I personally have not used Selenium but I have used some of the other tools. Below is an example of a script to just pull all the html from a page.
import urllib
import urllib2
from bs4 import BeautifulSoup as soup
link = "https://ubuntu.com"
page = urllib2.urlopen(link)
data = soup(page, 'html.parser')
print (data)
This is just a short script to pull all the HTML off a page. I believe BeautifulSoup has additional tools for inputting data into fields, but the exact method slips my mind right now, if I can find my notes on it I will edit this post. I remember it being very straightforward, though.
Best of luck!
Edit: here's a discussion web scraping tools from reddit a while back that I had saved https://www.reddit.com/r/Python/comments/1qnbq3/webscraping_selenium_vs_conventional_tools/

Categories