Python - regex lookup for multiple lines of HTML - python

How do you parse multiple lines of HTML using regex in Python? I have managed to match patterns on the same line using the code below.
i = 0
while i < len(newschoollist):
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + newschoollist[i] + "&orgtypecode=6&"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '>Phone:</td><td>(.+?)</td></tr>'
    pattern = re.compile(regex)
    value = re.findall(pattern, htmltext)
    print newschoollist[i], valuetag, value
    i += 1
However, when I try to recognize more complicated HTML like this...
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
I get null values. I believe the problem is with my syntax. I have googled regex and read most of the documentation, but I am looking for help with this kind of application, and I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that tells regex to jump down a line of HTML?
What I want findall to pick up is the 90.1 when it finds "Attendance Rate".
Thanks!

Use an HTML Parser. Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'
soup = BeautifulSoup(urlopen(url))
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue
    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"
Prints (profile contact information):
...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...

I ended up using (soup.get_text()) and it worked great. Thanks!
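For the narrower question of getting re to step over a line break: \s matches newlines, so putting \s* between the two cells is enough. A minimal sketch on the snippet from the question (the surrounding <tr> wrapper is assumed, and this stays more brittle than a parser):

```python
import re

# The two cells from the question; the <tr> wrapper is an assumption.
html = """<tr>
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
</tr>"""

# \s* matches any run of whitespace, including the newline between the cells.
# [^>]* skips whatever attributes the second <td> carries.
pattern = re.compile(r"<td>Attendance Rate</td>\s*<td[^>]*>\s*([^<]+?)\s*</td>")

match = pattern.search(html)
rate = match.group(1) if match else None
print(rate)
```

If the label and the value can be separated by other tags, a parser plus find_next_sibling, as in the answer above, is the safer route.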

Related

How do I get the text within HTML tags not included in .content?

I want to scrape the text from pages like this: https://www.ncbi.nlm.nih.gov/protein/p22217 into a string. In particular, I want the block of text in DBSOURCE.
I've seen multiple suggestions for using soup.find_all(text=True) and the like, but they come up with nothing. Anything from before about 2018 also seems to be outdated (I'm using Python 3.7). I think the problem is that the content I want is not in r.text or r.content at all; when I search the response with Ctrl+F, the part I'm looking for just isn't there.
from bs4 import BeautifulSoup
import requests
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
PageInfo = soup.find("pre", attrs={"class":"genbank"})
print(PageInfo)
The result of this and other attempts is "None". No error message, it just doesn't return anything.
You can use this instead, since the page loads that content through XMLHttpRequest calls.
Code:
from bs4 import BeautifulSoup
import requests,re
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
soup = BeautifulSoup(r.content,features='html.parser')
pageId = soup.find('meta', attrs={'name':'ncbi_uidlist'})['content']
api = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}'.format(pageId))
data = re.search(r'DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD',api.text)
print(data.group(1).strip())
Explanation:
The first request, to url, is used to get the id of the record you are asking about, which is stored in a meta tag of the page.
With that id, the second request calls the website's API to fetch the description you want. A regex pattern is then used to separate the wanted part from the rest.
Regex :
DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD
The page makes an XHR call to get the information you are looking for.
The call is to https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
and it returns
<div class="sequence">
<a name="locus_P22217.3"></a><div class="localnav"><ul class="locals"><li>Comment</li><li>Features</li><li>Sequence</li></ul></div>
<pre class="genbank">LOCUS TRX1_YEAST 103 aa linear PLN 18-SEP-2019
DEFINITION RecName: Full=Thioredoxin-1; AltName: Full=Thioredoxin I;
Short=TR-I; AltName: Full=Thioredoxin-2.
ACCESSION P22217
VERSION P22217.3
**DBSOURCE** UniProtKB: locus TRX1_YEAST, accession P22217;
class: standard.
extra accessions:D6VY45
created: Aug 1, 1991.
...
So make that HTTP call from your own code to get the data.
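Once the viewer text is in hand, a field such as DBSOURCE can be cut out without listing every allowed character. The sketch below assumes the usual flat-file layout (field names at column 0, continuation lines indented) and uses a shortened sample based on the response above; the KEYWORDS line and the helper name extract_field are assumptions for illustration, not something the answers confirm.

```python
import re

# Shortened sample modelled on the response shown above.
# Column layout and the trailing KEYWORDS line are assumptions.
sample = (
    "ACCESSION   P22217\n"
    "VERSION     P22217.3\n"
    "DBSOURCE    UniProtKB: locus TRX1_YEAST, accession P22217;\n"
    "            class: standard.\n"
    "            created: Aug 1, 1991.\n"
    "KEYWORDS    .\n"
)

def extract_field(text, name):
    """Return the body of a flat-file field joined onto one line, or None."""
    # Capture lazily from the field name up to the next line that starts
    # with a non-space character (the next field) or the end of the text.
    pattern = r"^%s\s+(.*?)(?=^\S|\Z)" % re.escape(name)
    m = re.search(pattern, text, re.MULTILINE | re.DOTALL)
    if m is None:
        return None
    # Collapse the line breaks and indentation of continuation lines.
    return " ".join(m.group(1).split())

dbsource = extract_field(sample, "DBSOURCE")
print(dbsource)
```

re.DOTALL lets . cross line breaks and re.MULTILINE makes ^ match at the start of every line, which is what allows the lookahead to stop at the next field.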

Python 3 Regular expression on html links

My title may not be the most precise, but I had some trouble coming up with a better one, and considering it's work hours I'll go with this.
What I am trying to do is get the links from this specific page, then use a regular expression to find the links that are job ads containing certain keywords.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case "säljare", Swedish for salesperson).
I would appreciate it if anyone could look at my regex and fix it or hint at a fix. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))
If I understand your question correctly, you do not need a regex at all. Just check whether the title attribute, which contains the job title, is present on the link, and then check it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
import re
import ssl
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.
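A variant of the same idea, sketched with the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The sample links and the class name TitledLinkCollector are made up for illustration; using any() also avoids reporting a link twice when two keywords match the same title.

```python
from html.parser import HTMLParser

class TitledLinkCollector(HTMLParser):
    """Collect (title, href) for <a> tags whose title contains a keyword."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = [kw.lower() for kw in keywords]
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Links without a title attribute are treated as an empty title.
        title = (attrs.get("title") or "").lower()
        if any(kw in title for kw in self.keywords):
            self.hits.append((title, attrs.get("href")))

# Invented sample markup, not taken from the real page.
html = (
    '<a href="/jobb/1" title="Säljare till butik">x</a>'
    '<a href="/jobb/2">no title</a>'
    '<a href="/jobb/3" title="Truckförare natt">y</a>'
    '<a href="/jobb/4" title="Kock">z</a>'
)

collector = TitledLinkCollector(["säljare", "truckförare"])
collector.feed(html)
collector.close()
print(collector.hits)
```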
Instead of using re, you can try the in operator on the link text:
for link in dataSoup.find_all('a'):
    if keyword in link.get_text().lower():
        print(link)
A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a // literal
[^>]+ // 1 or more not '>'
href=\"([^\"]+)\" // href literal then 1 or more not '"' grouped
[^>]+ // 1 or more not '>'
title=\" // literal
( // start of group
(?=[^\"]*säljare[^\"]*) // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+ // 1 or more not '"'
)\" // end of group
Flags: global, case insensitive
Assumes: title after href
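To try the pattern from Python, something like the following should do. The sample anchors are invented; re.IGNORECASE stands in for the "case insensitive" flag, and findall plays the role of "global".

```python
import re

# Invented anchors with href before title, as the pattern assumes.
html = (
    '<a href="/jobb/1" class="x" title="Säljare till butik">Säljare till butik</a>\n'
    '<a href="/jobb/2" class="x" title="Truckförare">Truckförare</a>\n'
    '<a href="/jobb/3" class="x" title="Erfaren säljare sökes">Erfaren säljare sökes</a>\n'
)

link_re = re.compile(
    # group 1: the href value; group 2: the title, accepted only if the
    # lookahead finds "säljare" before the closing quote
    r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare)[^"]+)"',
    re.IGNORECASE,
)

matches = link_re.findall(html)
for href, title in matches:
    print(href, title)
```

Inside a raw Python string the double quotes need no backslash, which is why the \" escapes from the pattern above are dropped here.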

How to scrape data from a website using Python 2?

When I run this code, I keep getting empty brackets instead of the actual data.
I am trying to figure out why, since I don't receive any error messages.
import urllib
import re
symbolslist = ["aapl","spy","goog","nflx"]
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1" % (symbol)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_%s">(.+?)</span>' % (symbol.lower())
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
The empty brackets come up because the element id in the regex is wrong: it is not yfs_184 but yfs_l84, with a lowercase L rather than a one. The pattern never matches, so findall returns an empty list.
There are a number of libraries around that can help you scrape sites. Take a look at Scrapy or Beautiful Soup; as far as I know, both support Python 2 and 3.
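The effect is easy to reproduce offline. In this sketch the HTML fragment and the price are made up; only the two id spellings come from the question and the answer.

```python
import re

symbol = "aapl"
# Hypothetical fragment of the quote page; the value is invented.
htmltext = '<span id="yfs_l84_aapl">101.50</span>'

# Digit one in the id: never matches, so findall returns an empty list.
wrong = re.findall(r'<span id="yfs_184_%s">(.+?)</span>' % symbol, htmltext)
# Lowercase L in the id: matches and captures the span contents.
right = re.findall(r'<span id="yfs_l84_%s">(.+?)</span>' % symbol, htmltext)

print(wrong)
print(right)
```

re.findall does not raise when nothing matches, which is why the original script ran without any error message.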

Python re regex matching issue

Ok please be gentle - this is my first stackoverflow question and I've struggled with this for a few hours. I'm sure the answer is something obvious, staring me in the face but I give up.
I'm trying to grab an element from a webpage on a name website (i.e. determine the gender of a name).
The python code I've written is here:
import re
import urllib2
response=urllib2.urlopen("http://www.behindthename.com/name/janet")
html=response.read()
print html
patterns = ['Masculine','Feminine']
for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern,html)
    if re.findall(pattern,html):
        print "Found a match!"
        exit
    else:
        print "No match!"
When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?
Do not parse HTML with regex; use a specialized tool, an HTML parser.
Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://www.behindthename.com/name/janet'
soup = BeautifulSoup(urlopen(url))
print soup.select('div.nameinfo span.info')[0].text # prints "Feminine"
Or, you can find an element by text:
gender = soup.find(text='Feminine')
And then, see if it is None (not found) or not: gender is None.
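If a plain "is this word in the page" check is all that is needed, a small helper keeps the loop readable. This is only a sketch: the HTML fragment is a stand-in for the downloaded page and first_match is a made-up name. One detail worth knowing about the loop in the question is that a bare exit (without parentheses) is just an expression and does nothing, so the loop keeps going; break or return is what actually leaves it.

```python
import re

# Hypothetical fragment standing in for the downloaded page.
html = '<div class="nameinfo"><span class="info">Feminine</span></div>'

def first_match(patterns, text):
    """Return the first pattern found in text, or None if none is found."""
    for pattern in patterns:
        # re.escape treats each pattern as a literal word.
        if re.search(re.escape(pattern), text):
            return pattern  # leaves the loop, unlike a bare `exit`
    return None

gender = first_match(["Masculine", "Feminine"], html)
print(gender)
```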

Python: Print Specific line of text out of TD tag

This is an easy one, I am sure. I am parsing a website with Python and trying to get specific text in between tags. The text will be one of revoked, Active, or Default. I have been able to print out all the inner text results, but I have not been able to find a good solution on the web for picking out specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'active', 'default'])
This will select only those <a …> tags that have one of given strings as their text.
I've used the snippet below on a similar occasion. See if it works for your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)
