I'm working on a project to parse an HTML page. It is for an internal company website, but I changed the example so you can try it.
I get the source code of an HTML page and search for a certain tag. Then I want to extract a substring of this tag, but it doesn't work: Python returns None. Below is my code, with Python's output in the comments:
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://www.resto.be/restaurant/liege/4000-liege/8219-le-bar-a-gouts/")
page_source = response.read()
soup = BeautifulSoup(page_source)
name = soup.find_all("meta", attrs={"itemprop":"name"})
print(name[0])
# <meta content="LE BAR A GOUTS" itemprop="name"/>
print(name[0].find("<meta"))
# None
You don't have a string, you have a Tag object. Printing the tag gives a nice HTML representation, but it is not a string object.
As such, you are calling BeautifulSoup's Tag.find() method, which returns None if there are no child tags with the tag name <meta. There are none here.
If you wanted to find the content attribute, use item access:
print(name[0]['content'])
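To make the distinction concrete, here is a minimal sketch that uses an inline HTML snippet in place of the live page (the snippet and the variable names are made up for illustration). It shows item access on the Tag, Tag.get() for attributes that may be missing, and str() for the case where you really do want string methods.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (assumption: same meta tag shape).
html = '<html><head><meta content="LE BAR A GOUTS" itemprop="name"/></head></html>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("meta", attrs={"itemprop": "name"})

# Attributes are read like dictionary keys on the Tag object.
content = tag["content"]

# .get() returns None rather than raising KeyError for a missing attribute.
missing = tag.get("does-not-exist")

# str() yields the HTML text, on which string methods such as str.find() work.
position = str(tag).find("<meta")

print(content, missing, position)
```

Item access is usually what you want; converting to a string and searching it is rarely necessary once the document is parsed.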
I am trying to convert an HTML string into a plain string using Python.
Here is the content I'm trying to convert:
htmltxt = "<b>Hello World</b>"
The result should appear as Hello World in bold, but I'm getting
<html><body><b>Hello World</b></body></html>
with the snippet below:
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
Can anyone suggest how to convert it?
In this situation you're trying to find a tag within your soup object. Given that this is the only one and there is no id or class name, you can use:
hello_world_tag = soup.find("b")
hello_world_tag_text = hello_world_tag.text
print(hello_world_tag_text) # Output: 'Hello World'
The key here is '.text'. Using Beautiful Soup to find a specific tag returns that entire tag, but the .text property returns just the text within that tag.
Edit following comment:
I would still recommend using bs4 to parse HTML. Once you have your text, if you'd like it in bold you can print it with:
print('\033[1m' + text + '\033[0m')
Note: you won't get a bold string per se; bold is always something applied by whatever interprets or formats the text (here, the terminal interpreting ANSI escape codes).
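As a small sketch of the same idea, the escape codes can be wrapped in a helper so the bold style is always switched off again. The names BOLD, RESET and bold() are made up for this example, and the result only looks bold in a terminal that understands ANSI escape sequences.

```python
# ANSI escape sequences: start bold, then reset all attributes.
BOLD = "\033[1m"
RESET = "\033[0m"


def bold(text):
    """Wrap text in ANSI codes so an ANSI-aware terminal renders it bold."""
    return f"{BOLD}{text}{RESET}"


print(bold("Hello World"))
```

The string itself still contains only plain characters plus the escape codes; writing it to a file or a non-ANSI console shows the raw codes instead of bold text.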
To extract text from an HTML string with BeautifulSoup, you can use the text property or call get_text():
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
Recently I've been taking online lessons, and after each lesson we have to export our work from Google Docs to an HTML file so we can upload it to our websites. The HTML file that Google Docs exports has really messy code, and the body tag always has different classes like "c47", "c72", "c21", "c38", etc.: just the letter "c" followed by what I guess is a random number. The main problem is that I want to add my button after the body tag (a CSS button, but that is not the problem), and this is what happened when I tried to use bs4:
from bs4 import BeautifulSoup as Soup
from bs4 import NavigableString
html = open("new_html_file.html", "w+")
soup = Soup(html, features="html.parser")
body = soup.find('body')
button = soup.new_tag('button')
button['class'] = "cupid-blue"
button['onclick'] = "location.href=\'index.html\';"
button.insert(0, NavigableString("Button text"))
body.insert_after(button)
br_line = soup.new_tag('br')
button.insert_after(br_line)
html.write(str(soup.prettify()))
html.close()
It says that 'NoneType' object has no attribute 'insert_after' and the error is in line with body = soup.find('body') so basically it can't find the body tag because it seems that the body tag contains this class: "c47 c73" right? Every file has a different body tag class though, so I can't just hard code it.
My question is: Can I insert my button tag below the body tag (it must be on top of the whole webpage) not knowing the body's tag class?
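One thing worth checking, judging only from the code shown: opening a file with mode "w+" truncates it, so the parser would receive an empty document and soup.find('body') would return None whatever classes the body tag has. Below is a hedged sketch that works on a string so it runs on its own. The add_button() helper and the sample markup are made up for illustration, and it assumes the button should be the first element inside the body.

```python
from bs4 import BeautifulSoup


def add_button(html_text):
    """Return html_text with a button and a <br> inserted at the top of <body>."""
    soup = BeautifulSoup(html_text, "html.parser")
    body = soup.body  # found by tag name, so its classes do not matter

    button = soup.new_tag("button")
    button["class"] = "cupid-blue"
    button["onclick"] = "location.href='index.html';"
    button.string = "Button text"

    # insert(0, ...) places the button as the first child of <body>;
    # body.insert_after(...) would place it after the closing </body> tag.
    body.insert(0, button)
    button.insert_after(soup.new_tag("br"))
    return str(soup)


# Made-up sample standing in for a Google Docs export.
sample = '<html><body class="c47 c73"><p>Lesson notes</p></body></html>'
result = add_button(sample)
print(result)
```

For a real file, one option is to read it first with open(path, "r"), pass the text to the helper, and then write the result back in a separate open(path, "w") call.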
I want to write a simple Python script to automate pulling .mov files from an IP camera's SD card. This model of IP camera supports HTTP requests that return HTML containing the .mov file info. My Python script so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT:
NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to only return the MOV file. So removing:
"NAME2041=Record_continiously/2018-06-02/8/"
I'm new to HTML parsing with python so I'm a bit confused with the functionality.
Is the returned HTML considered a string? If so, I understand that it is immutable and that I will have to create a new string instead of "stripping away" part of the existing one.
I have tried:
page.replace("NAME2041=Record_continiously/2018-06-02/8/","")
but I get an AttributeError. Is anyone aware of a method that could accomplish this?
Here is a sample of the HTML I am working with...
<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
</body>
</html>
Use str.split with negative indexing.
Example:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print( page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov
Since you asked for an explanation of your code, here it is:
# import statements
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3") # returns a Response object
soup = BeautifulSoup(page.content, 'html.parser') # parses the response body
page.content returns the content of the response as bytes (page.text returns it as a str). page itself is a Response object, not a string, which is why page.replace(...) raises an AttributeError.
You pass this content (page.content) to the BeautifulSoup class, which is initialized with two arguments: the content and the parser, here html.parser.
soup is the resulting BeautifulSoup object.
.prettify() is a method that returns the content pretty-printed.
String slicing can give wrong results when the length of the content varies, so it's better to split the content as suggested by @Rakesh; that's the best approach in your case.
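To apply the split approach to the whole listing rather than to a single line, here is a minimal sketch using the sample body from the question. The function name mov_file_names and the regular expression are assumptions made for this example; in the real script the text could come from page.text or soup.get_text().

```python
import re

# Sample body text copied from the question.
body_text = """000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
"""


def mov_file_names(text):
    """Return the bare .mov file names from NAME<n>=<path> entries."""
    # Capture the path that follows each NAME<n>= up to the .mov extension.
    paths = re.findall(r"NAME\d+=(\S+\.mov)", text)
    # Keep only the last path component, as in the str.split answer.
    return [path.split("/")[-1] for path in paths]


names = mov_file_names(body_text)
for name in names:
    print(name)
```

Because each path is split on "/" rather than sliced at a fixed offset, the result does not depend on how long the folder names or the NAME index are.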
I'm trying to search for a particular class on a webpage; when I use inspect element, I can clearly see that the class exists. But when I use BeautifulSoup to find the class, e.g.
import bs4
import requests
url = r"https://twitter.com/TheSun/status/998755828931932160"
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
body = soup.find(class_ = "permalink-inner permalink-tweet-container")
body is None, which indicates that BS4 was unable to find a class by the name permalink-inner permalink-tweet-container. Does anyone know why this is? If you go to the URL I provided, you can see that it is a Tweet, and the class I am trying to access represents the "body" of the Tweet. This code works for some tweets, but seemingly at random gives me None for body.
I want to verify that the HTML tags present in a source string are also present in a target string.
For example:
>> source = "<em>Hello</em><label>What's your name</label>"
>> verify_target('<em>Hi</em><label>My name is Jim</label>')
True
>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False
I would get rid of Regex and look at Beautiful Soup.
find_all(True) lists all the tags found in your source.
from bs4 import BeautifulSoup
soup = BeautifulSoup(source, 'html.parser')
all_tags = soup.find_all(True)
[tag.name for tag in all_tags]
['em', 'label']
Then you just need to remove possible duplicates and compare your tag lists.
This snippet verifies that ALL of the source's tags are present in the target's tags.
from bs4 import BeautifulSoup
def get_tags_set(source):
    # Parse the markup and collect the distinct tag names it contains.
    soup = BeautifulSoup(source, 'html.parser')
    all_tags = soup.find_all(True)
    return set(tag.name for tag in all_tags)
def verify(tags_source_orig, tags_source_to_verify):
    # True when every tag name in the original also occurs in the target.
    return tags_source_orig == set.intersection(tags_source_orig, tags_source_to_verify)
source = '<label>What\'s your name</label><label>What\'s your name</label><em>Hello</em>'
source_to_verify = '<em>Hello</em><label>What\'s your name</label><label>What\'s your name</label>'
print(verify(get_tags_set(source), get_tags_set(source_to_verify)))
I don't think regex is the right tool here, because HTML is not just a flat string: it is more complex, with nested tags.
I suggest using HTMLParser: create a class which parses the original source and builds a data structure from it, then verify that the same structure holds for the targets to be verified.
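One way to read that suggestion, sketched with the standard library's html.parser: record for every opening tag the chain of tags it is nested in, then require that each chain from the source also occurs in the target. TagPathCollector, tag_paths and the two-argument verify_target are hypothetical names for this example (the question's verify_target takes only the target), and the sketch assumes well-formed markup in which every opened tag is closed.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect the nesting path (as a tuple of tag names) of every opening tag."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.paths = set()   # e.g. {("em",), ("em", "label")}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add(tuple(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def verify_target(source, target):
    """True when every tag path found in source also occurs in target."""
    return tag_paths(source) <= tag_paths(target)


source = "<em>Hello</em><label>What's your name</label>"
print(verify_target(source, "<em>Hi</em><label>My name is Jim</label>"))
print(verify_target(source, "<label>My name is Jim</label><em>Hi</em>"))
print(verify_target(source, "<em>Hi<label>My name is Jim</label></em>"))
```

Comparing paths instead of bare tag names is what lets the third case fail: there the label only exists nested inside em, so the top-level label path from the source has no counterpart. Void elements written without a closing slash (such as a bare br) would stay on the stack in this sketch and need extra handling.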