I'm trying to write my first parser with BeautifulSoup (BS4) and hitting a conceptual issue, I think. I haven't done much with Python -- I'm much better at PHP.
I can get BeautifulSoup to find the table I want, but when I try to step into the table and find all the rows, I get some variation on:
AttributeError: 'ResultSet' object has no attribute 'attr'
I tried walking through the sample code at "How do I draw out specific data from an opened url in Python using urllib2?" and got more or less the same error. (Note: if you want to try it, you'll need a working URL.)
Some of what I'm reading says that the issue is that the ResultSet is a list. How would I know that? If I do print type(table), it just tells me <class 'bs4.element.ResultSet'>.
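The closest I've found to confirming it myself is checking against list directly, since ResultSet apparently subclasses list. A quick sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# made-up markup, just to get a ResultSet back
soup = BeautifulSoup("<table><tr><td>x</td></tr></table>", "html.parser")
table = soup.find_all('table')

print(type(table))              # <class 'bs4.element.ResultSet'>
print(isinstance(table, list))  # True, so indexing and iteration work
```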
I can find text in the table with:
for row in table:
    text = ''.join(row.findAll(text=True))
    print text
but if I try to search for HTML with:
for row in table:
    text = ''.join(row.find_all('tr'))
    print text
It complains: expected string, Tag found. So how do I wrangle this string (which is full of HTML) back into a BeautifulSoup object that I can parse?
BeautifulSoup data-types are bizarre to say the least. A lot of times they don't give enough information to easily piece together the puzzle. I know your pain! Anyway...on to my answer...
It's hard to provide a completely accurate example without seeing more of your code or knowing the actual site you're attempting to scrape, but I'll do my best.
The problem is your ''.join(). .findAll('tr') returns a list of elements of the BeautifulSoup datatype Tag; it's how BS represents the trs it finds. Because of this, you're passing the wrong datatype to ''.join().
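You can see the mismatch in isolation. This is a minimal sketch with made-up markup, not your actual page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>x</td></tr></table>", "html.parser")
rows = soup.findAll('tr')

try:
    ''.join(rows)             # joining Tag objects directly fails
except TypeError as err:
    print(err)                # expected str instance, Tag found

print(''.join(str(r) for r in rows))  # converting each Tag to str first works
```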
You should code one more iteration. (I'm assuming there are td tags within the trs.)
text_list = []
for row in table:
    table_row = row('tr')
    for table_data in table_row:
        td = table_data('td')
        for td_contents in td:
            content = td_contents.contents[0]
            text_list.append(content)
text = ' '.join(str(x) for x in text_list)
This returns the entire table content into a single string. You can refine the value of text by simply changing the locations of text_list and text =.
This probably looks like more code than is required, and that might be true, but I've found my scrapes to be much more thorough and accurate when I go about it this way.
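And to answer the last part of the question directly: once you do end up with a string full of HTML, you can hand it straight back to the BeautifulSoup constructor and get a parseable object again. A minimal sketch:

```python
from bs4 import BeautifulSoup

# a string of HTML, e.g. produced by an earlier join
html_fragment = "<tr><td>cell one</td><td>cell two</td></tr>"

# re-parse it into a fresh soup object and query it like any other
soup = BeautifulSoup(html_fragment, "html.parser")
print([td.get_text() for td in soup.find_all("td")])
```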
Related
I have created a script that parses through emails on a weekly basis looking for tables within specific emails. I know that I want things that are within a table tag with a specific class name. The goal then is to take those tables, essentially concatenate them with a tag in between, and put them into another email to automatically send every week.
What I have so far is the actual email scraping and the sending of the email at the end, but I just don't know how to combine the results of a find_all into one element. I'm obviously open to different approaches, which is why I phrased the question this way.
What I have for code is this:
def parse_messages(enhance_str):
    soup = BeautifulSoup(enhance_str, 'html.parser')
    table = soup.find_all('table', {'class': 'MsoNormalTable'})
    return table
which gives me a list-like object (I know find_all subclasses list), but the list methods I've tried don't work with this object. I thought I could just do something like
'<br/>'.join(table)
but this throws an attribute error.
I'm sure there is a simple answer, but I can't see it. Any help is greatly appreciated.
EDIT: As a clarification, I was just trying to preserve the HTML structure of these tables so I can pop them into a new email and send them as is. The solution below works for me, so I'm marking it as the accepted answer.
Thanks for the help!
The elements in the output list of soup.find_all are bs4.element.Tag objects, not some objects you can join together as-is to make a string.
I'm not sure what you're up to, but if you want to make them all a single str, you can iterate over the Tags, call str on them to get the string representation, and then join:
'<br/>'.join([str(tag) for tag in table])
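Putting that together with the parse_messages logic above, a self-contained sketch (the markup here is invented; the real emails will differ):

```python
from bs4 import BeautifulSoup

# invented stand-in for an email body
enhance_str = (
    "<table class='MsoNormalTable'><tr><td>week one</td></tr></table>"
    "<table class='MsoNormalTable'><tr><td>week two</td></tr></table>"
)

soup = BeautifulSoup(enhance_str, 'html.parser')
table = soup.find_all('table', {'class': 'MsoNormalTable'})

# str() each Tag, then join them with the separator tag
combined = '<br/>'.join(str(tag) for tag in table)
print(combined)
```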
I am writing a simple program to scrape flight data off of travel websites.
When I test this in the Python interpreter, it works fine. However, when I run it as a .py file, it returns an empty list.
I know that the program is able to get the data from the site, because I can print the entire source no problem. However, for some reason the find_all method is returning an empty list.
The part that should work:
import urllib2
from bs4 import BeautifulSoup

raw_page_data = urllib2.urlopen(query_url)
soup = BeautifulSoup(raw_page_data, 'html.parser')
raw_prices = soup.find_all(class_='price-whole')
for item in raw_prices:
    for character in item:
        if character.isdigit() == True:
            print character
Usually, in the interpreter, this returns a list of prices. I cannot print raw_prices at all, as it is empty. I believe that the error is with the find_all method, because the output is [].
I have been told that, perhaps, the JavaScript on the site is not loading the elements I am trying to find. However, this makes no sense to me, since the entire page source is loaded and printable. If I try print soup, it prints the entire source (which contains the elements I am trying to find).
I tried with different parser as well, to check if there was a bug.
Any help greatly appreciated, cheers!
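One way this can happen even though the source "contains" the elements: the class name may only occur inside a script block as text, not as an attribute on a real element, in which case find_all has nothing to match. A made-up illustration:

```python
from bs4 import BeautifulSoup

# invented page: 'price-whole' appears only inside JavaScript,
# not as a class attribute on any actual element
raw_page_data = "<html><body><script>render('price-whole', 99)</script></body></html>"

soup = BeautifulSoup(raw_page_data, 'html.parser')
print(soup.find_all(class_='price-whole'))   # [] -- no element carries that class
print('price-whole' in raw_page_data)        # True -- the text is still in the source
```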
I am a (very) new Python user, and decided some of my first work would be to grab some lyrics from a forum and sort according to word frequency. I obviously haven't gotten to the frequency part yet, but the following is the code that does not work for obtaining the string values I want, resulting in an "AttributeError: 'ResultSet' object has no attribute 'getText' ":
from bs4 import BeautifulSoup
import urllib.request
url = 'http://www.thefewgoodmen.com/thefgmforum/threads/gdr-marching-songs-section-b.14998'
wp = urllib.request.urlopen(url)
soup = BeautifulSoup(wp.read())
message = soup.findAll("div", {"class": "messageContent"})
words = message.getText()
print(words)
If I alter the code to have getText() operate on the soup object:
words = soup.getText()
I, of course, get all of the string values throughout the webpage, rather than those limited to only the class messageContent.
My question, therefore, is two-fold:
1) Is there a simple way to limit the tag-stripping to only the intended sections?
2) What simple thing do I not understand in that I cannot have getText() operate on the message object?
Thanks.
The message in this case is a BeautifulSoup ResultSet, which is a list of BeautifulSoup Tag(s). What you need to do is call getText on each element of message like so,
words = [item.getText() for item in message]
Similarly, if you are just interested in a single Tag (let's say the first one for the sake of argument), you could get its content with,
words = message[0].getText()
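A self-contained sketch of both forms, with invented markup standing in for the forum page:

```python
from bs4 import BeautifulSoup

html = ('<div class="messageContent">first verse</div>'
        '<div class="messageContent">second verse</div>')
soup = BeautifulSoup(html, 'html.parser')
message = soup.findAll("div", {"class": "messageContent"})

words = [item.getText() for item in message]   # text of every match
print(words)

first = message[0].getText()                   # text of just the first match
print(first)
```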
I'm having a hard time extracting data from an HTTP response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
42136 value basically means that string 'loginfield' exists on the response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' #continues extracting of strings until it stops where I want it to stop extracting.
Anybody got an idea on how should I approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, 'html.parser')
print(soup.find('title').text)
Should print:
Some title here
But that depends on the title you want being the first title tag on the page, since find returns the first match.
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
string.find will return -1 if the string does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first char of that string.
since you edited your question:
>>> r.text.find('loginfield')
42136
That means, the string "loginfield" starts at offset 42136 in the text. You could display say 200 chars starting at that position that way:
>>> print(r.text[42136:42136+200])
To find the various values you're looking for, you have to figure out where they are relative to that position.
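Putting the pieces together with a made-up page string (the offsets here apply to this string only, not the real page):

```python
text = "<html><input id='loginfield'></html>"

pos = text.find('loginfield')
print(pos)                    # index of the first character of the match
print(text[pos:pos + 10])     # slice out the match itself

print(text.find('no-such-string'))  # -1 when the string is absent
```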
I have some html code that contains many <table>s in it.
I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table') ?
When I do use soup.findAll('table'), I get an error:
ValueError: too many values to unpack
Is there a way to get the n-th tag in some code or another way that does not require going through all the tables? Or should I see if I can add titles to the tables? (like <table title="things">)
There are also headers (<h4>title</h4>) above each table, if that helps.
Thanks.
EDIT
Here's what I was thinking when I asked the question:
I was unpacking the objects into two values, when there were many more. I thought this would just give me the first two things from the list, but of course, it kept giving me the error mentioned above. I was unaware the return value was a list and thought it was a special object or something and I was basing my code off of my friends'.
I was thinking this error meant there were too many tables on the page and that it couldn't handle all of them, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.
Now I know it returns a list and I can use this in a for loop or get a value from it with soup.findAll('table')[someNumber]. I learned what unpacking was and how to use it, as well. Thanks everyone who helped.
Hopefully that clears things up, now that I know what I'm doing my question makes less sense than it did when I asked it, so I thought I'd just put a note here on what I was thinking.
EDIT 2:
This question is now pretty old, but I still see that I was never really clear about what I was doing.
If it helps anyone, I was attempting to unpack the findAll(...) results, of which the amount of them I didn't know.
useless_table, table_i_want, another_useless_table = soup.findAll("table")
Since the page didn't always contain the number of tables I had guessed, and every value in the tuple needs to be unpacked, I was receiving the ValueError:
ValueError: too many values to unpack
So, I was looking for the way to grab the second (or whichever index) table in the tuple returned without running into errors about how many tables were used.
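The unpacking behaviour can be reproduced with a plain list; there is nothing BeautifulSoup-specific about it:

```python
values = ['table0', 'table1', 'table2']

# unpacking demands exactly as many names as there are items
try:
    first, second = values
except ValueError as err:
    print(err)        # too many values to unpack

# indexing asks for one item and doesn't care how many there are
second = values[1]
print(second)
```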
To get the second table from the call soup.findAll('table'), use it as a list, just index it:
secondtable = soup.findAll('table')[1]
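A runnable sketch of the same thing with two invented tables:

```python
from bs4 import BeautifulSoup

html = ("<table><tr><td>first table</td></tr></table>"
        "<table><tr><td>second table</td></tr></table>")
soup = BeautifulSoup(html, 'html.parser')

secondtable = soup.findAll('table')[1]
print(secondtable.td.text)
```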
Martijn Pieters' answer will indeed make it work. I have had nested table tags break my code when I simply grabbed the second table in the list without paying attention.
When you use find_all and grab the n-th element, there is a chance you will pick up the wrong one, because nested tags are returned too. It is safer to locate the first element you want and make sure the n-th element is actually a sibling of it rather than a child.
You can use find_next_sibling() to make your code safer, or you can find the parent first and then use find_all(recursive=False) to limit the search range.
Just in case you need it, my code is below (using recursive=False).
from bs4 import BeautifulSoup
text = """
<html>
<head>
</head>
<body>
    <table>
        <p>Table1</p>
        <table>
            <p>Extra Table</p>
        </table>
    </table>
    <table>
        <p>Table2</p>
    </table>
</body>
</html>
"""
soup = BeautifulSoup(text)
tables = soup.find('body').find_all('table')
print len(tables)
print tables[1].text.strip()
#3
#Extra Table # which is not the table you want without warning
tables = soup.find('body').find_all('table', recursive=False)
print len(tables)
print tables[1].text.strip()
#2
#Table2 # your desired output
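For completeness, the find_next_sibling() route mentioned above looks like this with the same kind of markup (the h4 headers are invented):

```python
from bs4 import BeautifulSoup

html = ("<h4>first</h4><table><tr><td>Table1</td></tr></table>"
        "<h4>second</h4><table><tr><td>Table2</td></tr></table>")
soup = BeautifulSoup(html, 'html.parser')

# anchor on the header, then take the table that follows it
header = soup.find('h4', text='second')
table = header.find_next_sibling('table')
print(table.td.text)
```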
Here's my version
# Import bs4
from bs4 import BeautifulSoup
# Read your HTML
#html_doc = your html
# Get BS4 object
soup = BeautifulSoup(html_doc, "lxml")
# Find next Sibling Table to H3 Header with text "THE GOOD STUFF"
the_good_table = soup.find(name='h3', text='THE GOOD STUFF').find_next_sibling(name='table')
# Find Second tr in your table
your_tr = the_good_table.findAll(name='tr')[1]
# Find Text Value of First td in your tr
your_string = your_tr.td.text
print(your_string)
Output:
'I WANT THIS STRING'