This is the HTML code:
<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>
I am trying to print 42263 - Unencrypted Telnet Server using Beautiful Soup, but the output is an empty list, i.e. []
This is my Python code:
from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
divs = soup.find_all('div', attrs={'background':'#fdc431'})
print(divs)
background is not an attribute of the div tag; it is a property inside the value of the style attribute. The attributes of the div tag are:
{'xmlns': '', 'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'}
So, either you'll have to match the full style value exactly:
soup.find_all('div', attrs={'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})
or you can use a lambda function to check whether background: #fdc431 is in the style attribute value, like this:
soup = BeautifulSoup('<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>', 'html.parser')
print(soup.find(lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')).text)
# 42263 - Unencrypted Telnet Server
or, you can use RegEx, as shown by Jatimir in his answer.
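A variation on the lambda approach, as a minimal sketch: split the style value into a property-to-value mapping, so the match does not depend on spacing or on where the declaration sits in the string. The helper names parse_style and has_background are made up for this illustration, and the splitting is naive (it would break on a ';' inside a url(...) value).

```python
from bs4 import BeautifulSoup

html = ('<div xmlns="" style="box-sizing: border-box; padding: 5px 10px; '
        'background: #fdc431; color: #fff;">42263 - Unencrypted Telnet Server</div>'
        '<div>no style here</div>')


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (naive split, hypothetical helper)."""
    declarations = {}
    for part in style.split(';'):
        if ':' in part:
            name, value = part.split(':', 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations


def has_background(tag, colour='#fdc431'):
    # tag.get() falls back to '' for tags without a style attribute,
    # so the filter never raises KeyError.
    styles = parse_style(tag.get('style', ''))
    return tag.name == 'div' and styles.get('background') == colour


soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all(has_background)
print([tag.get_text() for tag in matches])
```

Passing the function itself to find_all means Beautiful Soup calls it once per tag and keeps the tags for which it returns True.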
Solution with regexes:
from bs4 import BeautifulSoup
import re
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
Let's find the div whose style matches the following regular expression: background:\s*#fdc431;. \s matches a single Unicode whitespace character. I assumed there can be zero or more whitespace characters after the colon, so I added the * quantifier, which matches zero or more repetitions of the preceding expression. You can read more about regexes here, as they sometimes come in handy. I also recommend this online regex tester.
div = soup.find('div', attrs={'style': re.compile(r'background:\s*#fdc431;')})
This however is equivalent to:
div = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))
You can read about that in the official Beautiful Soup documentation.
The sections about the kinds of filters you can pass to find and other similar methods are also worth reading.
You can supply a string, a regular expression, a list, True, or a function, as shown by Keyur Potdar in his answer.
Assuming the div exists, we can get its text with:
>>> div.text
'42263 - Unencrypted Telnet Server'
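The filter kinds listed above can be tried side by side. This is a small sketch on a made-up three-tag document (the second div and the p are assumptions added for contrast); it also shows a guard for the case where the div does not exist, since find returns None when nothing matches.

```python
import re
from bs4 import BeautifulSoup

html = ('<div style="background: #fdc431;">42263 - Unencrypted Telnet Server</div>'
        '<div style="background:#ffffff;">other</div>'
        '<p>no style</p>')
soup = BeautifulSoup(html, 'html.parser')

# String: the attribute value must match exactly.
exact = soup.find_all('div', style='background: #fdc431;')

# Regular expression: searched for inside the attribute value.
by_regex = soup.find_all('div', style=re.compile(r'background:\s*#fdc431;'))

# List: match any of the listed values.
by_list = soup.find_all('div', style=['background: #fdc431;', 'background:#ffffff;'])

# True: the attribute only has to be present.
any_style = soup.find_all(style=True)

# Function: called with the attribute value (None when the attribute is missing).
by_func = soup.find_all('div', style=lambda v: v is not None and '#fdc431' in v)

# find() returns None when nothing matches, so guard before reading .text.
missing = soup.find('div', style=re.compile(r'background:\s*#000000;'))
text = missing.text if missing is not None else None
```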
Let's say there's a file that lives in this GitHub repo:
https://github.com/someguy/brilliant/blob/master/somefile.txt
I'm trying to use requests to fetch this file and write its contents to disk in the current working directory, where it can be used later. Right now, I'm using the following code:
import requests
from os import getcwd
url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)
f = open(filename,'w')
f.write(r.content)
Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found · GitHub</title>
<style type="text/css" media="screen">
body {
background: #f1f1f1;
font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
text-rendering: optimizeLegibility;
margin: 0; }
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:visited { color: #4183c4 }
a:hover { text-decoration: none; }
h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
#error-suggestions { font-size: 14px; }
#next-steps { margin: 25px 0 50px 0;}
#next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
#next-steps a { font-weight: bold; }
.divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}
#parallax_wrapper {
position: relative;
z-index: 0;
}
#parallax_field {
overflow: hidden;
position: absolute;
left: 0;
top: 0;
height: 370px;
width: 100%;
}
etc etc.
Content from GitHub, but not the content of the file. What am I doing wrong?
The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and the blob/ part of the path is gone.
To demonstrate this with the requests GitHub repository itself:
>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print(r.text)
Requests: HTTP for Humans
=========================
.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
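The rewrite described above (swap the domain, drop the blob/ segment) can be sketched as a small function. The name to_raw_url is hypothetical, and the function only handles the github.com/user/repo/blob/branch/path shape used in this question.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(blob_url):
    """Rewrite a github.com '/user/repo/blob/branch/path' URL to its raw form.

    Hypothetical helper for illustration; it only handles the URL shape
    shown above and raises ValueError for anything else.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip('/').split('/')
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'blob':
        raise ValueError('not a github.com blob URL: ' + blob_url)
    # Drop the 'blob' segment and switch to the raw content host.
    raw_path = '/' + '/'.join(segments[:2] + segments[3:])
    return urlunsplit((parts.scheme, 'raw.githubusercontent.com', raw_path, '', ''))


print(to_raw_url('https://github.com/someguy/brilliant/blob/master/somefile.txt'))
```

The result can then be passed to requests.get as in the session above.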
You need to request the raw version of the file, from https://raw.githubusercontent.com.
See the difference:
https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py
Also, you should probably add a / between your directory and the filename:
>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
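Putting the path fix together with the write step, a minimal sketch could look like this. The helper name save_bytes is made up for illustration; it opens the file in binary mode because r.content is bytes in Python 3, and uses a with block so the file is closed after writing.

```python
import os


def save_bytes(content, filename, directory=None):
    """Write bytes to directory/filename and return the full path (hypothetical helper)."""
    directory = directory if directory is not None else os.getcwd()
    path = os.path.join(directory, filename)
    # 'wb' because the payload is bytes; the with block closes the file.
    with open(path, 'wb') as f:
        f.write(content)
    return path


# With requests this would be called as: save_bytes(r.content, 'somefile.txt')
```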
Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:
url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"
E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.
Adding a working example ready for copy+paste:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"
# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"
resp = requests.get(url, headers=headers)
print(resp.status_code)
(*) If the repo is not private, remove the headers part.
Bonus:
Check out this curl <--> Python-requests online converter.
I'm very new to Beautiful Soup. I'm attempting to get the text between tags.
databs.txt
<p>$343,343</p><h3>Single</h3><p class=3D'highlight-price' style=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 16px; line-height: 1.38;">$101,900</p><h3 class=3D"highlight-title" style=3D"margin: 0; margin-bottom: 6px; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;">Multi</h3><p class=3D'highlight-price' style=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 16px; line-height: 1.38;">$201,900</p><h3 class=3D"highlight-title" style=3D"margin: 0; margin-bottom: 6px; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;">Single</h3>
Python
#!/usr/bin/python
import os
from bs4 import BeautifulSoup
f = open(os.path.join("databs.txt"), "r")
text = f.read()
soup = BeautifulSoup(text, 'html.parser')
page1 = soup.find('p').getText()
print("P1:",page1)
page2 = soup.find('h3').getText()
print("H3:",page2)
Question:
How do I get the text "$101,900, Multi, $201,900, Single"?
If you want to get only the tags that carry a class attribute, you can pass a lambda function as attrs (Beautiful Soup treats a non-dict attrs value as a filter on the class attribute):
from bs4 import BeautifulSoup
html = """
<p>$343,343</p>
<h3>Single</h3>
<p class=3D'highlight-price' style=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 16px; line-height: 1.38;">$101,900</p><h3 class=3D"highlight-title" style=3D"margin: 0; margin-bottom: 6px; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;">Multi</h3><p class=3D'highlight-price' style=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 16px; line-height: 1.38;">$201,900</p><h3 class=3D"highlight-title" style=3D"margin: 0; margin-bottom: 6px; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;">Single</h3>
"""
soup = BeautifulSoup(html, 'lxml')
tags_with_attribute = soup.find_all(attrs=lambda x: x is not None)
clean_text = ", ".join([tag.get_text() for tag in tags_with_attribute])
Output would look like:
'$101,900, Multi, $201,900, Single'
Use the find_all method to find all matching tags:
for p, h3 in zip(soup.find_all('p'), soup.find_all('h3')):
    print("P:",p.getText())
    print("H3:",h3.getText())
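The =3D sequences in databs.txt are the quoted-printable escape for "=", which suggests (an assumption, since the question does not say) that the markup was copied from an email source. A sketch under that assumption: decode the text first with the standard quopri module, then select by class name. The sample below is a shortened version of the file with the style values trimmed.

```python
import quopri
from bs4 import BeautifulSoup

raw = (b"<p>$343,343</p><h3>Single</h3>"
       b"<p class=3D'highlight-price' style=3D\"margin: 0;\">$101,900</p>"
       b"<h3 class=3D\"highlight-title\" style=3D\"margin: 0;\">Multi</h3>"
       b"<p class=3D'highlight-price' style=3D\"margin: 0;\">$201,900</p>"
       b"<h3 class=3D\"highlight-title\" style=3D\"margin: 0;\">Single</h3>")

# Undo the quoted-printable escaping so class=3D'...' becomes class='...'.
decoded = quopri.decodestring(raw).decode('utf-8')

soup = BeautifulSoup(decoded, 'html.parser')
# A list filter matches either class; results come back in document order.
tags = soup.find_all(class_=['highlight-price', 'highlight-title'])
clean_text = ", ".join(tag.get_text() for tag in tags)
print(clean_text)
```

After decoding, the class values are clean, so the first untagged p and h3 are skipped and only the four highlighted values remain.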
How can I get the URL from this output of Selenium in Python?
<div style="z-index: 999; overflow: hidden; background-position: 0px 0px; text-align: center; background-color: rgb(255, 255, 255); width: 480px; height: 672.172px; float: left; background-size: 1054px 1476px; display: none; border: 0px solid rgb(136, 136, 136); background-repeat: no-repeat; position: absolute; background-image: url("https://photo.venus.com/im/19230307.jpg?preset=zoom");" class="zoomWindow"> </div>
I got the above output from the following command line:
driver.find_element_by_class_name('zoomWindowContainer')
First, get the style attribute:
div = driver.find_element_by_class_name('zoomWindow')
style = div.get_attribute("style") # str
Then use a regex to find the URL in the style:
import re
urls = re.findall(r"https?://.+?\.jpg", style) # list
print(urls[0])
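If the query string (?preset=zoom) matters, or the image is not a .jpg, an alternative sketch is to capture whatever sits inside url(...). The style string below is a shortened copy of the one in the question, and the pattern name URL_IN_CSS is made up for this example.

```python
import re

style = ('z-index: 999; background-color: rgb(255, 255, 255); '
         'background-image: url("https://photo.venus.com/im/19230307.jpg?preset=zoom");')

# Capture whatever sits inside url(...), with or without surrounding quotes.
URL_IN_CSS = re.compile(r'url\(\s*["\']?(.*?)["\']?\s*\)')

match = URL_IN_CSS.search(style)
image_url = match.group(1) if match else None
print(image_url)
```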
I know there are a lot of answers on this question, but many of them are outdated, and when I found one that "worked", it did not work well enough.
This is my current code:
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())
This is the output I get:
Example Domain
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
This is the output I want:
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
How can I get only the text I can read printed out from a website?
Something like this should work, as long as the "Plain text" part doesn't contain the character '}'.
import requests
from bs4 import BeautifulSoup
url = "http://example.com"
req = requests.get(url)
html = req.text
PlainText = BeautifulSoup(html, 'lxml')
text = PlainText.get_text()
split = text.split('}')
withoutCss = split[-1]
print(withoutCss)
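A sketch that does not depend on the '}' character: remove the style and script elements from the tree before asking for the text. The HTML below is a cut-down stand-in for the example.com page (an assumption, so the snippet runs without a network request); with requests you would parse req.text instead.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Example Domain</title>
<style>body { margin: 0; } div { width: 600px; }</style></head>
<body><div><h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div></body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# Remove the elements whose contents are code rather than readable text.
for tag in soup(['script', 'style']):
    tag.decompose()

# Read from <body> so the <title> is not repeated; fall back to the whole tree.
root = soup.body if soup.body is not None else soup
visible_text = root.get_text(separator='\n', strip=True)
print(visible_text)
```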
Here is a Python program that uses a function to drop everything between a < and the following >, and returns only the text that is outside the tags.
def striphtmltags(s):
    b=True
    r=''
    for i in range(0, len(s)):
        if(s[i]=='<'): b=False
        if(b): r+=s[i]
        if(s[i]=='>'): b=True
    return(r.strip())
html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)
print("text:", text)
This produces:
text: this is the headerthis is the main bodythis is bluethis is the footer
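The character scan above runs adjacent pieces of text together and would be confused by a < or > inside an attribute value. As an alternative sketch using only the standard library, html.parser.HTMLParser can collect the text runs so they can be joined with a separator. The class name TextCollector is made up for this example, and like the original it does not skip the contents of style or script elements.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of a document (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def strip_html_tags(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


html = ("<html><body><h1>this is the header</h1>this is the main body"
        "<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>")
print(strip_html_tags(html))
```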