How to check if (https://) is an image or a web link - python

How to check whether a hyperlink is an image link or a web link.
import requests
from bs4 import BeautifulSoup

image_list = []
url = 'http://www.image.jpg/'
if any(x in '.jpg .gif .png .jpeg' for x in url):
    image_list.append(url)
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    for link in soup.find_all('img'):
        src = link.get('src')
        if src.startswith("https"):
            image_list.append(src)
The code above works in finding out whether the hyperlink contains an image format; however, whenever I use a link that does not contain ".jpg" etc., it still appends the link to image_list and skips the else branch.

Let's look at this code:
any(x in '.jpg .gif .png .jpeg' for x in url):
This checks whether any single character of the URL appears somewhere in the string. The 'p' from 'http' is in the string, so you will always get a true result.
Here's how you could check the extension of a URL:
import posixpath
import urllib.parse

IMAGE_EXTS = {'.png', '.jpg', '.jpeg', '.gif'}

url = 'http://example.com/'
if posixpath.splitext(urllib.parse.urlparse(url).path)[1] in IMAGE_EXTS:
    pass  # Has image extension...
But that's a moot point, because the extension of a URL doesn't tell you whether it's an image. Unlike regular files, for URLs, the extension is completely irrelevant! You can have an .html URL which gives you a PNG image, or a .gif URL which is really an HTML web page. You need to check the Content-Type of the HTTP reply.
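For instance, a minimal sketch of such a check with requests; the helper name looks_like_image and the timeout are illustrative, not part of the original answer:
import requests

def looks_like_image(url):
    # Ask for the headers only; fall back to a streamed GET if HEAD is not allowed.
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code == 405:
            response = requests.get(url, stream=True, timeout=10)
        return response.headers.get('Content-Type', '').startswith('image/')
    except requests.RequestException:
        return False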

Image source is different in HTML between my browser and a GET request

I suspect this has happened due to my misunderstanding of how either lxml or html works and I'd appreciate if someone could fill in this blank in my knowledge.
My code is:
url = "https://prnt.sc/ca0000"
response = requests.get(url,headers={'User-Agent': 'Chrome'})
# Navigate to the correct img src.
tree = html.fromstring(response.content)
xpath = '/html/body/div[3]/div/div/img/#src'
imageURL = tree.xpath(xpath)[0]
print(imageURL)
When I do this, I expect to get a result such as:
data:image/png;base64,iVBORw0KGgoAAA...((THIS IS REALLY LONG))...Jggg==
This, if I understand correctly, is where the image is stored locally on my computer.
However when I run the code I get:
"https://prnt.sc/ca0000"
Why are these different?
The problem is that this page uses JavaScript to put the data:image/png;base64,... value in place of https://prnt.sc/ca0000, but requests can't run JavaScript.
However, there are two img tags with different src values: the first has a standard image URL (https://...) and the other has the placeholder https://prnt.sc/ca0000.
So this XPath works for me even without JavaScript:
xpath = '//img[@id="screenshot-image"]/@src'
This code gets the correct URL and downloads the image:
import requests
from lxml import html

url = "https://prnt.sc/ca0000"
response = requests.get(url, headers={'User-Agent': 'Chrome'})

tree = html.fromstring(response.content)
image_url = tree.xpath('//img[@id="screenshot-image"]/@src')[0]
print(image_url)

# -- download ---
response = requests.get(image_url, headers={'User-Agent': 'Chrome'})
with open('image.png', 'wb') as fh:
    fh.write(response.content)
Result
https://image.prntscr.com/image/797501c08d0a46ae93ff3a477b4f771c.png

How can I download images from a CAPTCHA with Python?

I need to download the images that are inside the custom-made CAPTCHA on this login site. How can I do it? :(
This is the login site (there are five images),
and this is the link: https://portalempresas.sb.cl/login.php
I've been trying with this code that another user (@EnriqueBet) helped me with:
from io import BytesIO
from PIL import Image

# Download image function
def downloadImage(element, imgName):
    img = element.screenshot_as_png
    stream = BytesIO(img)
    image = Image.open(stream).convert("RGB")
    image.save(imgName)

# Find all the web elements of the captcha images
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

# Output name for the images
image_base_name = "Imagen_[idx].png"

# Download each image
for i in range(len(image_elements)):
    downloadImage(image_elements[i], image_base_name.replace("[idx]", "%s" % i))
But when it tries to get all of the image elements
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")
It fails and doesn't get any of them. Please, help! :(
Instead of defining an explicit path to the images, why not simply download all the images present on the page? This will work since the page itself only has five images and you want to download all of them. See the method below.
The following should extract all images from a given page and write them to the directory where the script is being run.
import re
import requests
from bs4 import BeautifulSoup

site = ''  # set the page URL here

response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        continue  # skip sources that do not look like plain image files
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is, provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
The code is taken from here and credit goes to the respective owner.
This is a follow-on answer to my earlier post.
I have had no success getting my Selenium set-up to run, due to version issues between Selenium and my browser.
I have, though, thought of another way to download and extract all the images that appear in the captcha. As you can tell, the images change on each visit, so the best option is to automate their collection rather than manually saving each image from the site.
To automate it, follow the steps below.
Firstly, navigate to the site using selenium and take a screenshot of the site. For example,
from selenium import webdriver
DRIVER = 'chromedriver'
driver = webdriver.Chrome(DRIVER)
driver.get('https://www.spotify.com')
screenshot = driver.save_screenshot('my_screenshot.png')
driver.quit()
This saves the screenshot locally. You can then open it with a library such as PIL (Pillow) and crop out the captcha images.
This would be done like so:
from PIL import Image

im = Image.open('my_screenshot.png').convert('L')  # the screenshot saved above
im = im.crop((1, 1, 98, 33))  # example crop box: (left, upper, right, lower)
im.save('captcha_0.png')  # output file name is illustrative
Hopefully you get the idea here. You will need to do this one by one for all the images, ideally in a for loop with the crop dimensions changed appropriately, as sketched below.
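For example, a rough sketch of that loop; the crop boxes are purely illustrative and would need to be measured from your own screenshot:
from PIL import Image

# Hypothetical (left, upper, right, lower) boxes, one per captcha image.
crop_boxes = [
    (10, 100, 110, 200),
    (120, 100, 220, 200),
    (230, 100, 330, 200),
    (340, 100, 440, 200),
    (450, 100, 550, 200),
]

screenshot = Image.open('my_screenshot.png')
for i, box in enumerate(crop_boxes):
    screenshot.crop(box).save('captcha_{}.png'.format(i))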
You can also try this; it will save only the captcha image:
from PIL import Image

def get_captcha_text(location, size):
    im = Image.open('screenshot.png')
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))  # defines crop points
    im.save('screenshot.png')
    return True

element = driver.find_element_by_id('captcha_image')  # div id or the captcha container id
location = element.location
# print(location)
size = element.size
driver.save_screenshot('screenshot.png')
get_captcha_text(location, size)

How to download, in Python, big media links of a web page behind a log-in form?

I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the broadly used requests module (more than 35k stars on GitHub), and BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for the properties of HTML tags.
Below is a complete example, in Python 3.5.2, for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium), which sequentially downloads the links that have download in their URL.
import shutil
import sys

import requests
from bs4 import BeautifulSoup

""" Requirements: beautifulsoup4, requests """

SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/'  # this is the log-in URL

# here are the name properties of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']

client = requests.session()

request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")

data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}

# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))

soup = BeautifulSoup(request.text, features="html.parser")

generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a')
             if 'download' in tag['href'])

for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code, name),
                  file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response page, i.e. the media links (.mp3, .mp4, .jpg, etc.) in your case, for example:
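A minimal sketch of such a filter; the extension list is illustrative and should be adjusted to your media types:
import re

media_pattern = re.compile(r'\.(mp3|mp4|jpg|jpeg|png|gif)$', re.IGNORECASE)
media_links = [link for link in links if link and media_pattern.search(link)]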
Finally, use the requests module to stream the media files so that they don't take up too much memory, like so:
response = requests.get(url, stream=True)  # url here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk:  # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()
When the stream argument of get() is set to True, the content does not immediately start downloading into RAM; instead, the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() call. Before moving on to the next chunk, the previous chunk is written to disk, so the data is never held in RAM all at once.
You will have to put this last piece of code in a loop if you want to download the media from every link in the links list, for example:
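A rough sketch of that loop, assuming the links list from above holds absolute URLs (relative ones would need to be joined with the base URL first) and that deriving the file name from the URL is acceptable:
import os
import requests

for url in links:
    if not url:
        continue  # skip anchors without an href
    target_path = os.path.basename(url) or 'download.bin'
    response = requests.get(url, stream=True)
    with open(target_path, 'wb') as handle:
        for chunk in response.iter_content(chunk_size=512):
            if chunk:
                handle.write(chunk)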
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.

BeautifulSoup scraper downloaded images are corrupt

I greatly need help with my code. I was attempting to do an exercise from a book and I followed it exactly. The code worked and it downloaded the images. However, all the images that were downloaded were corrupted. I have no idea what's causing it or what I missed.
Thanks.
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'
os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s' % (comicUrl))
        res.raise_for_status()

        # Save the image to ./xkcd.
        imagefile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imagefile.write(chunk)
        imagefile.close()

    # Get the prev button's url.
    prevlink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevlink.get('href')

print('Done')
You are writing the wrong data to the file:
for chunk in res.iter_content(100000)
res still holds the data of the web page. You should be getting the data of the image at the URL comicUrl instead. I think you forgot this line:
print('Downloading image %s' %(comicUrl))
res = requests.get('http:' + comicUrl)
Note: I added http: before the URL because the image URLs you are extracting lack a scheme. You should define a function to check whether it is necessary to add the scheme, for example:
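A small sketch of such a helper; the name fix_scheme is illustrative, not from the original answer:
from urllib.parse import urlparse

def fix_scheme(url, default_scheme='http'):
    # Prepend a scheme to protocol-relative (//imgs.xkcd.com/...) or bare URLs.
    parsed = urlparse(url)
    if parsed.scheme:
        return url
    if url.startswith('//'):
        return default_scheme + ':' + url
    return default_scheme + '://' + url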

Python to list HTTP-files and directories

How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).
Use requests to get page content and BeautifulSoup to parse the result.
For example, if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'

def listFD(url, ext=''):
    page = requests.get(url).text
    print(page)
    soup = BeautifulSoup(page, 'html.parser')
    return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]

for file in listFD(url, ext):
    print(file)
You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "IP address" you want to access would do the job; see the sketch at the end of this answer), or set up an HTTP server on that machine that provides, for each path (http://192.168.2.100/directory), a list of the files in it (in whatever format) and parse that with Python.
If the server provides an "index of /bla/bla" kind of page (like Apache servers do for directory listings), you could parse the HTML output to find out the names of the files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :(, you can't do it.
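If the machine also runs an FTP server, Python's standard ftplib can list a directory directly. A minimal sketch, with a placeholder address and credentials:
from ftplib import FTP

ftp = FTP('192.168.2.100')         # placeholder IP address
ftp.login('username', 'password')  # or ftp.login() for anonymous access
for name in ftp.nlst():            # entries in the current remote directory
    print(name)
ftp.quit()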
Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup

def get_url_paths(url, ext='', params={}):
    response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
    return parent

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
HTTP does not work with "files" and "directories". Pick a different protocol.
You can use the following script to get the names of all files in directories and sub-directories on an HTTP server. A file writer can be used to download them.
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup

def read_url(url):
    url = url.replace(" ", "%20")
    req = Request(url)
    a = urlopen(req).read()
    soup = BeautifulSoup(a, 'html.parser')
    x = soup.find_all('a')
    for i in x:
        file_name = i.extract().get_text()
        url_new = url + file_name
        url_new = url_new.replace(" ", "%20")
        if file_name[-1] == '/' and file_name[0] != '.':
            read_url(url_new)
        print(url_new)

read_url("http://www.example.com/")
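The script above only prints the URLs; the urlretrieve it imports can act as the "file writer" mentioned earlier. A small, illustrative helper (not part of the original script):
import os
from urllib.request import urlretrieve

def download_file(url, dest_dir='.'):
    # Save a discovered file URL to disk under its own name.
    local_name = os.path.join(dest_dir, os.path.basename(url).replace("%20", " "))
    urlretrieve(url, local_name)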
