Beautiful soup - how to extract a string from an object - python

I am learning Beautiful Soup. I have succeeded in tracking down the HTML lines that I need.
My next step is to extract an Id value from those lines.
The code to find the lines looks like this:
object = soup_station.find('img',{'src': re.compile("^Controls")})
If I now print object I will get this, for example:
<img src="Controls/RiverLevels/ChartImage.jpg?Id=471&ChartType=Histogram" id="StationDetails_Chart1_chartImage" alt="Current river level" />
The part I want to extract in the line above is the "471" after Id=.
I tried using re.search on object but it seems that object is not text.
Any help would be much appreciated!

You can adapt the following:
s = '<img src="Controls/RiverLevels/ChartImage.jpg?Id=471&ChartType=Histogram" id="StationDetails_Chart1_chartImage" alt="Current river level" />'
from bs4 import BeautifulSoup
import re
from urlparse import urlsplit, parse_qs
soup = BeautifulSoup(s)
# find the node with a src starting with Controls
node = soup.find('img',{'src': re.compile("^Controls")})
# Break up the url in the src attribute
url_split = urlsplit(node['src'])
# Parse out the query parameter from the url
qs = parse_qs(url_split.query)
# Display the value for `Id`
print qs['Id'][0]
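For Python 3, where the urlparse module was merged into urllib.parse, the same idea might look like the sketch below. It works on the src string alone, so it assumes the attribute has already been read from the tag; the helper name query_param is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def query_param(src, name):
    """Return the first value of query parameter `name` in `src`, or None."""
    # parse_qs maps each parameter name to a list of values
    values = parse_qs(urlsplit(src).query).get(name)
    return values[0] if values else None


# Hypothetical helper applied to the src value from the question
src = "Controls/RiverLevels/ChartImage.jpg?Id=471&ChartType=Histogram"
print(query_param(src, "Id"))
```

Returning None for a missing parameter avoids the KeyError that qs['Id'] would raise on an image without an Id.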

Make sure that you run the regex search on the tag's src attribute, which is a string, rather than on the tag object itself. You can give this a try:
import re
ele = soup_station.find('img')
src = ele['src']
match = re.search(r'\?Id=(\d+)', src)
ele_id = match.group(1)
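re.search returns None when the pattern does not match, so match.group(1) would raise an AttributeError for an image whose src has no Id. A guarded variant could look like this sketch; the function name id_from_src is an assumption, and src stands for the string read from ele['src'].

```python
import re


def id_from_src(src):
    """Return the digits after '?Id=' in `src`, or None if absent."""
    match = re.search(r"\?Id=(\d+)", src)
    # re.search gives None on no match, so check before calling group()
    return match.group(1) if match else None


print(id_from_src("Controls/RiverLevels/ChartImage.jpg?Id=471&ChartType=Histogram"))
```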

Related

Removing Empty Lines of a list in python

My goal is to get a simple text output like:
https://widget.reviews.io/rating-snippet/dist.js
But I keep getting output like this:
https://widget.reviews.io/rating-snippet/dist.js
All these empty lines are the problem.
Before, there were [] entries in the output, but I removed them with ''.join.
Now I only have these empty lines.
Here is my code:
import requests
import re
from bs4 import BeautifulSoup
html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = ''
    kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')
    #Everything I tried :D
    kameleeonScript = kameleoonRegex.findall(javascript)
    text = ''.join(kameleeonScript)
    print(text)
It's probably not that hard but I've been on this for hours
if kameleeonScript: print(kameleeonScript[0])
did the job :)
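The empty lines come from printing the joined result for every script tag, including those where findall returned an empty list. Below is a sketch of the same filtering on a plain list of src values. The list and the function name are made up for illustration, and the dot in dist.js is escaped here, which the original pattern does not do.

```python
import re

pattern = re.compile(r"\S*rating-snippet/dist\.js")


def matching_sources(sources):
    """Keep only the src values that contain the rating-snippet script."""
    # pattern.search returns None for non-matching (or empty) strings
    return [m.group(0) for m in map(pattern.search, sources) if m]


# Hypothetical src values; '' stands for an inline script without src
sources = [
    "",
    "https://widget.reviews.io/rating-snippet/dist.js",
    "https://example.com/app.js",
]
for src in matching_sources(sources):
    print(src)
```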

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the content as a string. The re.sub is not needed either; a single regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
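On Python 3, urllib2 is gone, and urllib.request.urlopen(url).read() returns bytes, which would need a .decode() before a str pattern can be applied. The regex step itself can be tried offline on a sample string, as in this sketch. The sample anchor tag is invented for illustration and is not copied from the site.

```python
import re


def find_game_ids(html_text):
    """Return every run of digits that follows 'gid=' in the text."""
    # A single capturing group makes findall return just the digits
    return re.findall(r"gid=([0-9]+)", html_text)


sample = '<a href="/perl/chessgame?gid=1012809">Alekhine vs Naegeli</a>'
ids = find_game_ids(sample)
# findall always returns a list, so index into it to get one id
print(ids[0])
```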
You don't need a regex here at all. A CSS selector along with some string manipulation will get you there. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809

Beautiful Soup Nested Tag Search

I am trying to write a Python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there a simple method or another way to do it?
I'm guessing that what you are trying to do is first look in a specific div tag and then search all p tags in it, to count them or do whatever you want. For example:
import bs4
# content is the HTML of the page as a string
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
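The same nested lookup can also be written as a single CSS selector with select(). This is a sketch on a made-up HTML snippet; the class names some_class and hello follow the example above.

```python
from bs4 import BeautifulSoup

# Invented sample page: one matching nested <p>, two that should be skipped
html_doc = """
<div class="some_class">
  <p class="hello">first nested paragraph</p>
  <p>no class here</p>
</div>
<p class="hello">outside the div</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Descendant selector: p.hello anywhere inside div.some_class
nested = [p.get_text() for p in soup.select("div.some_class p.hello")]
print(nested)
```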
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
UPDATE: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text:
sure enough, the docs describe a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website= fd.read()
fd.close()
soup = BeautifulSoup(website)
contents= soup.get_text(separator=" ")
print "number of words %d" %len(contents.split(" "))
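On Python 3 the same count could be sketched as follows. It uses an inline HTML string instead of index.html, and split() without arguments, which treats any run of whitespace as one separator so that empty strings are not counted.

```python
from bs4 import BeautifulSoup

# Invented sample page standing in for index.html
html_doc = (
    "<html><body><div><p class='hello'>two words</p>"
    "<p>and three more</p></div></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")
# separator=" " keeps text from adjacent tags from running together
contents = soup.get_text(separator=" ")
word_count = len(contents.split())
print("number of words %d" % word_count)
```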
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"] # tags to be ignored
fd = open('index.html', 'r')
website= fd.read()
soup = BeautifulSoup(website)
tags=soup.find_all(True) # find everything
print "there are %d" %len(tags)
count= 0
matcher= re.compile("(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text) # Split using tokens such as \s and \n
    temp = filter(None, temp) # remove empty elements in the list
    count +=len(temp)
print "number of words in the document %d" %count
fd.close()
Please note that the count may not be accurate, for example because of formatting errors, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can also find <p> tags using regular expressions (the re module).
Note that r.text is a string containing the whole HTML of the page (r.content holds the same data as raw bytes).
For example:
import re
import requests
r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.text)
This should get you the <p> tags whether they are nested or not, as long as they are written exactly as <p>...</p> on one line; a tag with attributes, such as <p class="hello">, would need a pattern like <p[^>]*>. If you want the <a> tags inside a particular tag, pass that tag's HTML as a string in the second argument instead of r.text.
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the page's HTML, which you can then go on to parse.

Python 3 Regular expression on html links

My title may not be the most precise, but I had trouble coming up with a better one, and since it's work hours I'll go with this.
What I am trying to do is get the links from this specific page and then use a regex to find the links that are job ads containing certain keywords.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case "säljare", Swedish for sales).
I would appreciate it if anyone could look at my regex and fix it or hint at a fix. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))
If I understand your question correctly, you do not need a regex at all. Just check whether the link has a title attribute (it contains the job title) and then test it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
import re
import ssl
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
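The nested loop prints a link once per matching keyword. A variant that reports each link at most once can use any(); the sketch below runs on an invented HTML snippet rather than the live page, so the hrefs and titles are placeholders.

```python
from bs4 import BeautifulSoup

# Invented sample links; the real page structure may differ
html_doc = """
<a href="/jobb/1" title="Säljare till butik">Säljare till butik</a>
<a href="/jobb/2" title="Truckförare sökes">Truckförare sökes</a>
<a href="/jobb/3" title="Kock">Kock</a>
<a href="/om-oss">Om oss</a>
"""
keywords = ["säljare", "truckförare"]

soup = BeautifulSoup(html_doc, "html.parser")
matches = []
# title=True keeps only links that carry a title attribute
for link in soup.find_all("a", title=True):
    title = link["title"].lower()
    if any(kw in title for kw in keywords):
        matches.append((title, link["href"]))
print(matches)
```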
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.
Instead of using re, you can try the in keyword on the link's string form:
for link in dataSoup.find_all('a'):
    # keyword is the search term, e.g. 'säljare'
    if keyword in str(link):
        print(link)
A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a // literal
[^>]+ // 1 or more not '>'
href=\"([^\"]+)\" // href literal then 1 or more not '"' grouped
[^>]+ // 1 or more not '>'
title=\" // literal
( // start of group
(?=[^\"]*säljare[^\"]*) // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+ // 1 or more not '"'
)\" // end of group
Flags: global, case insensitive
Assumes: title after href
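In Python, this pattern could be applied as in the sketch below, with re.IGNORECASE standing in for the case-insensitive flag and findall for the global flag. The sample HTML is invented for illustration.

```python
import re

pattern = re.compile(
    r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare[^"]*)[^"]+)"',
    re.IGNORECASE,
)

# Invented sample: two anchors, only the first has the keyword in its title
sample = (
    '<a class="jobtitle" href="/jobb/1" target="_blank" title="Säljare till butik">x</a>'
    '<a class="jobtitle" href="/jobb/2" target="_blank" title="Kock">y</a>'
)

# findall returns one (href, title) tuple per matching anchor
for href, title in pattern.findall(sample):
    print(href, title)
```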

using lxml and requests in python to grab text between certain tags with a specific class name

I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests
def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)
if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
SNP
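The difference between .text and .tail can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree, whose element API lxml follows, so that it runs without lxml installed; the wrapping div is added only to give the snippet a root element.

```python
import xml.etree.ElementTree as ET

# The span markup from the question, wrapped in a div to form one root
snippet = (
    '<div><span class="rtq_exch">'
    '<span class="rtq_dash">-</span>SNP </span></div>'
)

root = ET.fromstring(snippet)
outer = root.find(".//span[@class='rtq_exch']")
inner = outer.find("span[@class='rtq_dash']")

# .text only covers text before the first child, so it is None here
print(repr(outer.text))
# Text that follows a child's closing tag is stored as that child's .tail
print(inner.text, inner.tail.strip())
```

This is why span_tag.text printed None for rtq_exch in the question: the text SNP sits after the inner span, so it belongs to the inner element's tail.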
Use BeautifulSoup:
import bs4
html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html)
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]
