Python: get CAPTCHA file from HTML - python

I am trying to decode a CAPTCHA in Python, but I don't know how to get the image from the HTML.
I use
html = session.get(page, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
And the HTML looks like
<img src="/captcha.gif" style="width:1px;height:1px"/>
How can I extract it? Can I do it only by saving the image?

You can save the image to your PC like this:
import os
import urllib.request
from bs4 import BeautifulSoup as BS
tag = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
soup = BS(tag, 'html.parser')
img_tag = soup.find('img')
# 'https://absolute/path/to' is a placeholder for the site's base URL
urllib.request.urlretrieve('https://absolute/path/to' + img_tag['src'], os.path.join(os.getcwd(), 'temp_img'))
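If the goal is to pass the image to a decoder without writing it to disk, one option is to resolve the relative src against the page URL and request it through the same session, so the captcha matches the session's cookies. A minimal sketch, assuming the page lives at the placeholder address https://example.com/login (not from the question); captcha_url and fetch_captcha_bytes are hypothetical helper names:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def captcha_url(html, page_url):
    """Return the absolute URL of the first <img> in html, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img_tag = soup.find("img")
    if img_tag is None or not img_tag.get("src"):
        return None
    # urljoin resolves a relative src such as "/captcha.gif" against the page URL
    return urljoin(page_url, img_tag["src"])


def fetch_captcha_bytes(session, html, page_url, headers=None):
    """Download the captcha with an existing requests.Session; returns raw bytes."""
    url = captcha_url(html, page_url)
    if url is None:
        return None
    response = session.get(url, headers=headers)
    response.raise_for_status()
    return response.content  # image bytes, kept in memory


html = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
print(captcha_url(html, "https://example.com/login"))
```

The returned bytes can be wrapped in io.BytesIO and handed to an image library, so saving to disk is optional.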

Related

How can I get the url of "src=..."?

I am trying to loop through all the img classes here but I am not sure how I can get the src= link
import requests
from bs4 import BeautifulSoup
url = 'https://giphy.com/search/anxiety'
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
gifs = soup.findAll("img", attrs={"class": "giphy-gif-img"})
for gif in gifs:
    print(gif.get('image-src'))
In your last line, you can use gif.get('src').
However, gifs is empty since there are no images with class=giphy-gif-img on the page.
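The attribute lookup itself can be checked offline. A small sketch using made-up markup (the class name is taken from the question, the URLs are placeholders), showing that get('src') returns the link while a non-existent attribute such as 'image-src' gives None:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page's markup may differ.
html = """
<div>
  <img class="giphy-gif-img" src="https://example.com/a.gif">
  <img class="giphy-gif-img" src="https://example.com/b.gif">
  <img class="other" src="https://example.com/c.gif">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
gifs = soup.find_all("img", attrs={"class": "giphy-gif-img"})

# .get() returns None instead of raising KeyError when the attribute is missing
sources = [gif.get("src") for gif in gifs]
print(sources)
```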

How to get Html code after crawling with python

https://plus.google.com/s/casasgrandes27%40gmail.com/top
I need to crawl the following page with Python, but I need the HTML as it appears in the browser, not the raw source returned for the link.
For example, open the link plus.google.com/s/casasgrandes27%40gmail.com/top without logging in: the second-to-last thumbnail will be "G Suite".
<div class="Wbuh5e" jsname="r4nke">G Suite</div>
I am unable to find the above line of HTML after running this Python code:
from bs4 import BeautifulSoup
import requests
L = list()
r = requests.get("https://plus.google.com/s/casasgrandes27%40gmail.com/top")
data = r.text
soup = BeautifulSoup(data,"lxml")
print(soup)
To get the soup object, try the following:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can try this code to read an HTML page:
import urllib.request
urls = "https://plus.google.com/s/casasgrandes27%40gmail.com/top"
html_file = urllib.request.urlopen(urls)
html_bytes = html_file.read()
# decode the bytes; str(html_bytes) would give the "b'...'" repr instead of the text
html_text = html_bytes.decode("utf-8", errors="replace")
print(html_text)

How to reach deeper divs inside a <span> tag using a Python crawler?

The body tag has a <span> tag, and there are many other divs inside the span tag. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach the divs inside the span tag?
Can the <span> tag be parsed at all? If so, why am I not able to parse it?
By using this:
result = soup.body.span.contents
The output was:
[]
As discussed in the comments, urlopen(url) returns a file-like object, which means you need to read from it to get its contents.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case the code is similar, but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print result

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using python3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
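The parsing step can also be tried without a network request. A minimal sketch on a literal HTML string (the markup is invented for illustration), with a guard for pages that have no title:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded page
html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when the tag is absent, so guard before calling get_text()
title_tag = soup.find("title")
title_text = title_tag.get_text() if title_tag is not None else None
print(title_text)
```

soup.title is a shorthand for the same lookup.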

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt"  # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want.
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching urls. I've been trying to use soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags that are direct children of an a tag, which is itself a direct child of a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
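For readers on Python 3, the same selector can be tried against the snippet from the question without any network access. A sketch, in which the second div (class sidebar) is made up to show that the selector skips images outside div.media:

```python
from bs4 import BeautifulSoup

# First block is the snippet from the question; the sidebar div is invented.
html = """
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
<div class="sidebar"><img src="http://example.com/avatar.png"/></div>
"""

soup = BeautifulSoup(html, "html.parser")
# ">" is the child combinator: img must be a direct child of a, a of div.media
images = soup.select("div.media > a > img")
srcs = [image.get("src") for image in images]
print(srcs)
```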
To make parsing faster, you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install lxml module first, of course.
Also, you can use the SoupStrainer class to parse only the relevant part of the document.
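A minimal sketch of the SoupStrainer idea, assuming only img tags are of interest; the markup and URLs are placeholders. Passing the strainer as parse_only means the resulting soup holds just the matching tags:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Placeholder markup for illustration
html = """
<html><head><title>Demo</title></head><body>
<div class="media"><a href="#"><img src="http://example.com/one.jpg"/></a></div>
<p>Lots of unrelated text</p>
<img src="http://example.com/two.jpg"/>
</body></html>
"""

# Keep only <img> tags while parsing; everything else is discarded
only_img_tags = SoupStrainer("img")
soup = BeautifulSoup(html, "html.parser", parse_only=only_img_tags)
srcs = [img.get("src") for img in soup.find_all("img")]
print(srcs)
```

Note that parse_only is not supported by the html5lib parser.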
Hope that helps.
Have a look at combining BeautifulSoup.find_all with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile("media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
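The output above matches link tags because the filter is on href. To get the image URLs from the question instead, the same regex approach can be applied to the src attribute of img tags. A sketch in Python 3 on markup assembled from the snippets in this thread; the pattern is an assumption about which URLs count as matches:

```python
import re

from bs4 import BeautifulSoup

html = """
<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Assumed pattern: media.tumblr host, a tumblr_ file name, ending in .jpg
pattern = re.compile(r"media\.tumblr\S*tumblr_\S*\.jpg$")
# A compiled regex as an attribute filter is matched with pattern.search()
matches = [img["src"] for img in soup.find_all("img", src=pattern)]
print(matches)
```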
