I have created a script that parses through emails on a weekly basis, looking for tables within specific emails. I know that I want things that are within a table tag with a specific class name. The goal is then to take those tables, concatenate them with a tag in between, and put the result into another email that is sent automatically every week.
What I have so far is the actual email scraping and the sending of the email at the end, but I just don't know how to combine the results of a find_all into one element. I'm obviously open to different approaches, which is why I posed the question this way.
What I have for code is this:
from bs4 import BeautifulSoup

def parse_messages(enhance_str):
    soup = BeautifulSoup(enhance_str, 'html.parser')
    table = soup.find_all('table', {'class': 'MsoNormalTable'})
    return table
which gives me a list-like object (I know find_all subclasses list), but the list methods I know don't work with this object. I thought I could just do something like
'<br/>'.join(table)
but this throws an attribute error.
I'm sure there is a simple answer, but I can't see it. Any help is greatly appreciated.
EDIT: As a clarification, I was just trying to preserve the HTML structure of these tables so I can pop them into a new email and send them as-is. The solution below works for me, so I'm marking it as the accepted answer.
Thanks for the help!
The elements in the output list of soup.find_all are bs4.element.Tag objects, not some objects you can join together as-is to make a string.
I'm not sure what you're up to, but if you want to make them all a single str, you can iterate over the Tags, call str() on each to get its string representation, and then join:
'<br/>'.join([str(tag) for tag in table])
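For instance, a minimal end-to-end sketch of that approach (the MsoNormalTable class name comes from the question; the build_digest name is hypothetical):

from bs4 import BeautifulSoup

def build_digest(enhance_str):
    # Collect every matching table and join them into one HTML fragment
    soup = BeautifulSoup(enhance_str, 'html.parser')
    tables = soup.find_all('table', {'class': 'MsoNormalTable'})
    # str(tag) renders a Tag back to its HTML source, structure intact
    return '<br/>'.join(str(tag) for tag in tables)

The resulting string can be dropped straight into the HTML body of the outgoing email.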
Related
I'm crawling this page (https://boards.greenhouse.io/reddit/jobs/4330383) using Selenium in Python and am looping through all of the required fields using:
required = driver.find_elements_by_css_selector("[aria-required=true]")
The problem is that I can't view each element's id. The command required[0].id (which is the same as driver.find_element_by_id("first_name").id) returns a long string of alphanumeric characters and hyphens, even though the id is first_name in the HTML. Can someone explain to me why the id is being changed from first_name to this string? And how can I view the actual id that I'm expecting?
Additionally, what would be the simplest way to reference the associated label mentioned right before it in the HTML (i.e. "First Name " in this case)? The goal is to loop through the required list and be able to tell what each of these forms is actually asking the user for.
Thanks in advance! Any advice or alternatives are welcome, as well.
Your code is almost good. All you need to do is use the .get_attribute() method to get your ids:
required = driver.find_elements_by_css_selector("[aria-required=true]")
for r in required:
    print(r.get_attribute("id"))
driver.find_element_by_id("first_name") returns a web element object.
In order to get a web element's attribute value, like href or id, the get_attribute() method should be applied to the web element object.
So, you need to change your code to be
driver.find_element_by_id("first_name").get_attribute("id")
This will give you the id attribute value of that element
I'm going to answer my second question ("How to reference the element's associated label?") since I just figured it out using the find_element_by_xpath() method in conjunction with the .get_attribute("id") solutions mentioned in the previous answers:
ele_id = driver.find_element_by_id("first_name").get_attribute("id")
label_text = driver.find_element_by_xpath('//label[@for="{}"]'.format(ele_id)).text
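Putting the two answers together, a minimal sketch of the loop the question describes (assuming every required field has a label whose for attribute matches the field's id):

required = driver.find_elements_by_css_selector("[aria-required=true]")
for field in required:
    field_id = field.get_attribute("id")
    # Look up the label whose for attribute points at this field
    label = driver.find_element_by_xpath('//label[@for="{}"]'.format(field_id))
    print(field_id, '->', label.text)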
Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags both have class="pagination", but when I scrape based on that criterion I get two separate strings, not a delimited list separated by commas.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word "(current)" might be in a different place in the first string.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source, and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind. I was using print() when I should have been using return, .append(), or something else that actually does something with the value...
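For example, a minimal sketch of that fix, collecting the pagination text into a list and returning it (get_pagination is a hypothetical name; note also that the newline character is '\n', not '/n'):

from bs4 import BeautifulSoup

def get_pagination(page_html):
    soup = BeautifulSoup(page_html, 'lxml')
    results = []
    for tags in soup.find_all('ul', class_='pagination'):
        # Split on '\n' (backslash-n), the actual newline character
        results.append(tags.text.split('\n'))
    return results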
I'm scraping through a vendor link directory. I've created a soup and isolated all the data I want using the find_all method. However, the string I need is nested further within the soup. I understand that find_all returns a list, but I need to further distill the list to get what I need. Thanks for the help, because I'm about to chuck my laptop across the room. Below is my current code.
I'm new to the coding world, with a decent understanding of Python but only a basic understanding of Beautiful Soup.
from requests import get
from bs4 import BeautifulSoup

URL = get('https://www......')  # fetching the page I want to work over
soup = BeautifulSoup(URL.text, 'html.parser')  # making the soup
IsoUrl = soup.find_all('a', class_='xmd-listing-company-name')  # isolates the tags of the links I need
This is more or less where I get stuck. From the above isolation I get a list composed of the following. Below is only one item of the list.
<a class="xmd-listing-company-name" href="/rated.company.html" itemprop="url"><span itemprop="name">Company</span></a>
There are 10+ of these in the list. I want to pull '/rated.company.html' out of each one and append it to a list I can iterate through.
Any guidance is greatly appreciated. If I need to clarify anything, please let me know.
You can simply loop over the results of find_all and extract the href from each, like below:
results = [iso['href'] for iso in IsoUrl]
# >>> ["/rated.company.html", ...]
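Since those href values are relative, you may want absolute URLs before requesting them. A small sketch using the standard library's urljoin (the base URL here is a placeholder, not the site from the question):

from urllib.parse import urljoin

base = 'https://www.example.com'  # placeholder for the directory's real base URL
full_urls = [urljoin(base, iso['href']) for iso in IsoUrl]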
I have some html code that contains many <table>s in it.
I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table')?
When I do use soup.findAll('table'), I get an error:
ValueError: too many values to unpack
Is there a way to get the n-th tag in some code or another way that does not require going through all the tables? Or should I see if I can add titles to the tables? (like <table title="things">)
There are also headers (<h4>title</h4>) above each table, if that helps.
Thanks.
EDIT
Here's what I was thinking when I asked the question:
I was unpacking the objects into two values, when there were many more. I thought this would just give me the first two things from the list, but of course, it kept giving me the error mentioned above. I was unaware the return value was a list and thought it was a special object or something and I was basing my code off of my friends'.
I was thinking this error meant there were too many tables on the page and that it couldn't handle all of them, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.
Now I know it returns a list and I can use this in a for loop or get a value from it with soup.findAll('table')[someNumber]. I learned what unpacking was and how to use it, as well. Thanks everyone who helped.
Hopefully that clears things up, now that I know what I'm doing my question makes less sense than it did when I asked it, so I thought I'd just put a note here on what I was thinking.
EDIT 2:
This question is now pretty old, but I still see that I was never really clear about what I was doing.
If it helps anyone, I was attempting to unpack the findAll(...) results without knowing how many there would be:
useless_table, table_i_want, another_useless_table = soup.findAll("table")
Since the page didn't always have the number of tables I had guessed, and unpacking requires consuming every value in the sequence, I was receiving the ValueError:
ValueError: too many values to unpack
So I was looking for a way to grab the second (or whichever) table from the returned list without running into errors about how many tables the page contained.
To get the second table from the call soup.findAll('table'), use it as a list, just index it:
secondtable = soup.findAll('table')[1]
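If the page might not contain a second table at all, a small guard avoids trading the ValueError for an IndexError; a minimal sketch:

tables = soup.findAll('table')
if len(tables) > 1:
    secondtable = tables[1]
else:
    secondtable = None  # the page had fewer than two tables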
Martijn Pieters' answer will indeed make it work. However, I have run into nested table tags that broke my code when I simply grabbed the second table in the list without paying attention.
When you call find_all and take the n-th element, there is a real chance you will grab the wrong one; you had better locate the first element you want and make sure the n-th element is actually a sibling of that element rather than a child.
You can use find_next_sibling() to make your code safe.
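For instance, with the sample document parsed further down, a minimal sketch (find_next_sibling only walks siblings, so a table nested inside the first one is skipped):

first = soup.find('table')                  # the outer Table1 table
second = first.find_next_sibling('table')   # skips the nested Extra Table
print(second.text.strip())                  # Table2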
Alternatively, you can find the parent first and then use find_all(recursive=False) to restrict the search to direct children.
Just in case you need it, my code is listed below (using recursive=False).
from bs4 import BeautifulSoup
text = """
<html>
<head>
</head>
<body>
<table>
<p>Table1</p>
<table>
<p>Extra Table</p>
</table>
</table>
<table>
<p>Table2</p>
</table>
</body>
</html>
"""
soup = BeautifulSoup(text, 'html.parser')
tables = soup.find('body').find_all('table')
print(len(tables))
print(tables[1].text.strip())
# 3
# Extra Table  (not the table you want, and no warning)
tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))
print(tables[1].text.strip())
# 2
# Table2  (your desired output)
Here's my version
# Import bs4
from bs4 import BeautifulSoup
# Read your HTML
#html_doc = your html
# Get BS4 object
soup = BeautifulSoup(html_doc, "lxml")
# Find next Sibling Table to H3 Header with text "THE GOOD STUFF"
the_good_table = soup.find(name='h3', text='THE GOOD STUFF').find_next_sibling(name='table')
# Find Second tr in your table
your_tr = the_good_table.findAll(name='tr')[1]
# Find Text Value of First td in your tr
your_string = your_tr.td.text
print(your_string)
Output:
'I WANT THIS STRING'
I'm trying to write my first parser with BeautifulSoup (BS4) and hitting a conceptual issue, I think. I haven't done much with Python -- I'm much better at PHP.
I can get BeautifulSoup to find the table I want, but when I try to step into the table and find all the rows, I get some variation on:
AttributeError: 'ResultSet' object has no attribute 'attr'
I tried walking through the sample code at How do I draw out specific data from an opened url in Python using urllib2? and got more or less the same error (note: if you want to try it you'll need a working URL.)
Some of what I'm reading says that the issue is that the ResultSet is a list. How would I know that? If I do print type(table) it just tells me <class 'bs4.element.ResultSet'>
I can find text in the table with:
for row in table:
    text = ''.join(row.findAll(text=True))
    print(text)
but if I try to search for HTML with:
for row in table:
    text = ''.join(row.find_all('tr'))
    print(text)
It complains: expected string, Tag found. So how do I wrangle this string (which is a string full of HTML) back into a BeautifulSoup object that I can parse?
BeautifulSoup data types are bizarre, to say the least. A lot of the time they don't give enough information to easily piece together the puzzle. I know your pain! Anyway... on to my answer.
It's hard to provide a completely accurate example without seeing more of your code, or knowing the actual site you're attempting to scrape, but I'll do my best.
The problem is your ''.join(). .findAll('tr') returns a list of elements of the BeautifulSoup datatype Tag; that is how BS represents the trs it finds. Because of this, you're passing the wrong datatype to your ''.join().
You should code one more iteration. (I'm assuming there are td tags within the trs.)
text_list = []
for row in table:
    table_row = row('tr')
    for table_data in table_row:
        td = table_data('td')
        for td_contents in td:
            content = td_contents.contents[0]
            text_list.append(content)
text = ' '.join(str(x) for x in text_list)
This gathers the entire table's content into a single string. You can refine what ends up in text by changing where text_list is initialized and where the text = join happens.
This probably looks like more code than is required, and that might be true, but I've found my scrapes to be much more thorough and accurate when I go about it this way.
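As a point of comparison (not from the answer above), BeautifulSoup's stripped_strings generator can produce the same kind of flat text in one pass; a minimal sketch, where one_table is assumed to be a single Tag rather than a ResultSet:

text = ' '.join(one_table.stripped_strings)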