Beautiful Soup if Class "Contains" or Regex? - python

Suppose my class names vary from element to element, for example:
listing-col-line-3-11 dpt 41
listing-col-block-1-22 dpt 41
listing-col-line-4-13 CWK 12
Normally I could do:
for EachPart in soup.find_all("div", {"class" : "ClassNamesHere"}):
    print(EachPart.get_text())
There are far too many class names to list this way, so matching each one exactly is not an option.
I know Python doesn't have the ".contains" I would normally use, but it does have "in". I haven't been able to work out a way to incorporate that, though.
I'm hoping there's a way to do this with a regex, but my Python syntax is letting me down. I've been trying variations on:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all(regex):
But that doesn't seem to be doing the trick.

BeautifulSoup supports CSS selectors, which allow you to select elements based on the content of particular attributes. This includes the attribute selector *=, which matches when the attribute value contains the given substring.
The following will return all div elements with a class attribute containing the text 'listing-col-':
for EachPart in soup.select('div[class*="listing-col-"]'):
    print(EachPart.get_text())
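As a minimal, self-contained sketch of the substring selector above: the sample markup and the choice of the built-in `html.parser` are assumptions made for illustration, not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the last div should not match.
html = """
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-block-1-22 dpt 41">B</div>
<div class="sidebar">X</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [class*="..."] matches when the raw class attribute contains the substring.
texts = [div.get_text() for div in soup.select('div[class*="listing-col-"]')]
print(texts)
```

Because `*=` tests the whole attribute string, it would also match a class such as `my-listing-col-x`. If that matters, `[class^="listing-col-"]` anchors the match to the start of the attribute value instead.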

You can try this for loop:
import re
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
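A small runnable sketch of the regex approach, assuming made-up sample markup and the built-in `html.parser`. Beautiful Soup filters with the pattern's `search()` method, so the surrounding `.*` is optional; the shorter pattern below is an illustrative variant, not a required change.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the last div should not match.
html = """
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-line-4-13 CWK 12">C</div>
<div class="sidebar">X</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The pattern is applied with search(), so a bare substring pattern is enough.
pattern = re.compile(r"listing-col-")
matches = [div.get_text() for div in soup.find_all("div", class_=pattern)]
print(matches)
```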

You could avoid regex by using partial matching with gazpacho...
Input:
html = """\
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-block-1-22 dpt 41">B</div>
<div class="listing-col-line-4-13 CWK 12">C</div>
"""
Partial matching code:
from gazpacho import Soup
soup = Soup(html)
divs = soup.find("div", {"class": "listing-col-"}, partial=True)
[div.text for div in divs]
Output:
['A', 'B', 'C']

Related

Extract href link from html in python

I get this list of HTML as output:
HyperSense Software
QSS Technosoft - A CMMI Level 3 Certified Company
and more in the same format. How can I extract the href link from each of them?
My code:
mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
allurl = main_soup.find_all('h3')
for i in allurl:
    for a in i:
        print(a)
How can I extract href in this loop?
You're close. One small change in your for loop:
for i in allurl:
    print(i.a["href"])
This gets the child with tag "a" and then the "href" attribute for that tag.
If you aren't sure how many "a" tags there are in each "h3" block, or there is more than one, you can use another for loop (or, depending on what you're doing, a list comprehension):
for i in allurl:
    aa = i.find_all('a')
    for j in aa:
        print(j["href"])
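The nested loop can be tried offline with a sketch like the following. The markup and the example.com URLs are hypothetical, and `href=True` is used here as an extra guard so that anchors without an `href` attribute are skipped rather than raising a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h3 without a link target, one with two links.
html = """
<h3><a href="https://example.com/one">One</a></h3>
<h3><a>No link target</a></h3>
<h3><a href="https://example.com/two">Two</a> <a href="https://example.com/three">Three</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for h3 in soup.find_all("h3"):
    # href=True keeps only anchors that actually carry an href attribute.
    for a in h3.find_all("a", href=True):
        urls.append(a["href"])
print(urls)
```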
I found a way using a CSS selector:
urllist = []
mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
elms = main_soup.select("h3 a")
for i in elms:
    urllist.append(i.attrs["href"])
print(urllist)
Thanks !!

how to get text after a specific p tag in beautifulsoup?

How do I get all the text after the third p tag in BeautifulSoup? This is my code:
questions = soup.find('div',{'class':'entry-content'})
exp = questions.p[3].text
(Is there a way to do something like this? I can't get it to work.)
Can anyone here help? I would be very thankful.
Try the code below:
# This fetches the first div with class entry-content.
# If that is not the div you want, use find_all instead and pick the
# appropriate div by index.
questions = soup.find('div', class_='entry-content')
# This gets all the p tags inside questions.
p_tags = questions.find_all('p')
# Collect the text of every p tag after the third one.
lst = []
for tag in p_tags[3:]:
    lst.append(tag.text)
# This gets the text of the 4th <p> tag only.
exp = p_tags[3].text
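To see the slicing in action without the original page, here is a minimal sketch. The markup is invented for illustration, and the `None` check is an added precaution for pages where the div is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with five paragraphs inside the target div.
html = """
<div class="entry-content">
  <p>one</p><p>two</p><p>three</p><p>four</p><p>five</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="entry-content")

# Guard against a missing div, then slice past the first three <p> tags.
p_tags = content.find_all("p") if content is not None else []
after_third = [p.get_text() for p in p_tags[3:]]
print(after_third)
```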
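Note that questions = soup.find('div',{'class':'entry-content'})
returns only the first matching div, and questions.p gives only the first p tag inside it.
To get every p tag as a list, call find_all on that div:
p_tags = questions.find_all('p')
Then you can index the list, for example p_tags[3].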

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the "< b >" tags for each website.
response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml") # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print(soup.findAll('b'))
The soup.findAll('b') line prints all the b tags on the page. How can I limit it to just the dates that I want?
The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.
It is often easiest to search using CSS selectors, e.g.
soup.select('table.chart-wide > tr > td > nobr > font > a > b')
Sadly, if the tags carry no further identifying attributes, there is no direct way to select specific ones: BeautifulSoup has nothing to distinguish them by. If you roughly know what to expect in the tags you need, you could iterate over all of them and check whether they match:
def find_b(soup, whatever):
    # whatever is a placeholder for the text you expect
    for b in soup.find_all('b'):
        if b.get_text() == whatever:
            return b
or something like that.
Or you could get the surrounding tags, i.e. 'a' in your example, check whether each one matches, and then get the next occurrence of 'b'.
Why not search for all the b tags, and choose the ones which contain a month?
import requests
from bs4 import BeautifulSoup
MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
s = requests.get('http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm').content
soup = BeautifulSoup(s, "lxml") # or BeautifulSoup(s, "html5lib")
dates = []
for i in soup.find_all('b'):
    words = i.text.split()
    # Compare the first word against whole abbreviations; skip empty tags.
    if words and words[0].upper() in MONTHS:
        dates.append(i.text)
print(dates)
(Note: I did not check the exact abbreviations that the website uses. Please check these first and accordingly modify the code)
Looking at that page, it doesn't have any divs or class or id attributes, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was <b>Date:</b>. I would iterate over the <b> tags and then collect the tags after I hit the one with Date in it.
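One way to sketch that idea is to find the marker tag and then walk forward from it. The markup below is a made-up stand-in for the real page, whose exact structure is not shown here, so treat the tag layout and date strings as assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a "Date:" marker followed by the date cells.
html = """
<b>Title</b>
<b>Date:</b>
<a href="#"><b>Nov 22-24</b></a>
<a href="#"><b>Nov 29-Dec 1</b></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the marker tag whose text is exactly "Date:".
marker = soup.find("b", string="Date:")

# find_all_next walks forward in document order, excluding the marker itself.
dates = [b.get_text() for b in marker.find_all_next("b")] if marker else []
print(dates)
```

On a real page this would also pick up any later `<b>` tags that are not dates, so a stopping condition or a format check on each value would likely still be needed.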
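I would try something like:
dates = []
all_a = site.find_all('a')  # site is the BeautifulSoup object for the page
for a in all_a:
    # .get avoids a KeyError on anchors without an href
    if '?yr=?' in a.get('href', ''):
        dates.append(a.get_text())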

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies listed under Upcoming IPO. They are in a div with class="genTable thin floatL". Two divs have this class, and the target data is in the first one.
Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints every div matching the re.match pattern, and prints them multiple times as well. I tried adding [0] to the for divparent loop to retrieve only the first one, but this causes the repetition instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table= divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria. So running a for loop over the first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
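The effect of restricting the search to the first container can be checked offline with a sketch such as this one. The markup is a simplified, hypothetical imitation of the page; `select_one` is used here as an alternative to `find_all(...)[0]`, and the `None` check on `div.string` is an added precaution.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two containers with the same classes.
html = """
<div class="genTable thin floatL">
  <table><tr>
    <td><div class="ipo-cell-height">3/7/2014</div></td>
    <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td>
  </tr></table>
</div>
<div class="genTable thin floatL">
  <table><tr>
    <td><div class="ipo-cell-height">1/2/2014</div></td>
    <td><div class="ipo-cell-height">OTHER CO</div></td>
  </tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first element carrying all three classes.
first = soup.select_one("div.genTable.thin.floatL")

date_re = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")
rows = []
# Searching from `first` (not from `soup`) keeps the second container out.
for div in first.find_all("div", class_="ipo-cell-height"):
    s = div.string
    # div.string can be None, so check it before matching.
    if s and date_re.match(s):
        rows.append("{} - {}".format(s, div.find_next("div").string))
print(rows)
```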

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
< tr >
< td > < a >...
Thanks
As per the docs, you first make a parse tree:
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print(ana["href"])
Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
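The answers above use the older BeautifulSoup 3 import style. As a hedged sketch of the same idea with the `bs4` package, a CSS child combinator can express "an anchor directly inside a td" in one selector. The markup below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one cell without a link, one with a nested link.
html = """
<table><tr>
  <td><a href="foo">first</a></td>
  <td>no link here</td>
  <td><span><a href="nested">not a direct child</a></span></td>
  <td><a href="bar">second</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "td > a[href]" selects anchors that are direct children of a td and have an href.
hrefs = [a["href"] for a in soup.select("td > a[href]")]
print(hrefs)
```

Using `td a[href]` (a space instead of `>`) would also include the nested link, which may be what you want if the anchors are wrapped in other tags.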
