Python 3 Regular expression on html links - python

My title may not be the most precise, but I had trouble coming up with a better one, and since it's work hours I'll go with this.
What I am trying to do is get the links from this specific page and then use a regular expression to find the links that are job ads containing certain keywords.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case "säljare", Swedish for salesperson).
I would appreciate it if anyone could look at my regex and say what is wrong or hint towards fixing it. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))

If I understand your question correctly, you do not need a regex at all. Just check whether the link has a title attribute (which holds the job title) and then test it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
import re
import ssl
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.

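Instead of using re, can you try the in operator with your keyword?
keyword = 'säljare'
for link in dataSoup.find_all('a'):
    # compare against the link text, lower-cased so "Säljare" also matches
    if keyword in link.get_text().lower():
        print(link)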

A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a                         // literal
[^>]+                      // 1 or more not '>'
href=\"([^\"]+)\"          // href literal, then 1 or more not '"', grouped
[^>]+                      // 1 or more not '>'
title=\"                   // literal
(                          // start of group
(?=[^\"]*säljare[^\"]*)    // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+                     // 1 or more not '"'
)\"                        // end of group
Flags: global, case insensitive
Assumes: title after href
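A minimal sketch of how the pattern above might be used from Python, assuming the "global" flag corresponds to re.findall and "case insensitive" to re.IGNORECASE. The sample markup is made up for illustration and is not taken from the real page.

```python
import re

# Pattern from the answer above; assumes title comes after href inside the tag.
JOB_LINK = re.compile(
    r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare[^"]*)[^"]+)"',
    re.IGNORECASE,
)

# Hypothetical markup, invented for this example only.
sample_html = (
    '<a href="/jobb/1" class="job" title="Säljare till butik">Säljare</a>\n'
    '<a href="/jobb/2" class="job" title="Truckförare">Truckförare</a>\n'
    '<a href="/jobb/3" class="job" title="Innesäljare sökes">Innesäljare</a>\n'
)

# findall returns one (href, title) tuple per matching link.
matches = JOB_LINK.findall(sample_html)
for href, title in matches:
    print(href, title)
```

Because [^>]+ cannot cross a closing '>', each match stays inside a single opening tag, so the title of one link is never paired with the href of another.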

Related

How do I get the text within HTML tags not included in .content?

I want to scrape the text from pages like this: https://www.ncbi.nlm.nih.gov/protein/p22217 into a string. In particular, I want the block of text in DBSOURCE.
I've seen multiple suggestions for using soup.find_all(text=True) and the like, but they come up with nothing. Anything from before 2018 or so also seems to be outdated (I'm using Python 3.7). I think the problem is that the content I want is not in r.text or r.content at all; when I search them with Ctrl+F, the part I'm looking for just isn't there.
from bs4 import BeautifulSoup
import requests
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
PageInfo = soup.find("pre", attrs={"class":"genbank"})
print(PageInfo)
The result of this and other attempts is "None". No error message, it just doesn't return anything.
The page loads this data through XMLHttpRequest calls, so you can use this instead.
Code:
from bs4 import BeautifulSoup
import requests, re
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
soup = BeautifulSoup(r.content, features='html.parser')
# the record id is stored in a <meta name="ncbi_uidlist"> tag
pageId = soup.find('meta', attrs={'name':'ncbi_uidlist'})['content']
api = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}'.format(pageId))
# capture everything between DBSOURCE and KEYWORD; re.DOTALL lets . span newlines
data = re.search(r'DBSOURCE(.*?)KEYWORD', api.text, re.DOTALL)
print(data.group(1).strip())
Explanation:
The first request, to url, fetches the page so that the record id can be read from its meta tags.
With that id, the second request calls the site's API to get the description you are asking for. A regex pattern then separates the wanted part from the rest.
Regex:
DBSOURCE(.*?)KEYWORD (with the re.DOTALL flag)
The page makes an XHR call to get the information you are looking for.
The call is to https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
and it returns
<div class="sequence">
<a name="locus_P22217.3"></a><div class="localnav"><ul class="locals"><li>Comment</li><li>Features</li><li>Sequence</li></ul></div>
<pre class="genbank">LOCUS TRX1_YEAST 103 aa linear PLN 18-SEP-2019
DEFINITION RecName: Full=Thioredoxin-1; AltName: Full=Thioredoxin I;
Short=TR-I; AltName: Full=Thioredoxin-2.
ACCESSION P22217
VERSION P22217.3
DBSOURCE UniProtKB: locus TRX1_YEAST, accession P22217;
class: standard.
extra accessions:D6VY45
created: Aug 1, 1991.
...
So make the same HTTP call from your code to get the data.
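Once the text of the pre block is in hand, the DBSOURCE field could be pulled out without a regex. The sketch below assumes the usual flat-file layout, where a field name starts in the first column and continuation lines are indented. The helper extract_field, the indentation in the sample, and its closing KEYWORDS line are illustrative assumptions, not output copied from the server.

```python
def extract_field(flat_text, field):
    """Return the text of one top-level field, continuation lines included."""
    collected = []
    inside = False
    for line in flat_text.splitlines():
        if line.startswith(field):
            inside = True
            collected.append(line[len(field):].strip())
        elif inside and line[:1].isspace():
            # indented line: continuation of the current field
            collected.append(line.strip())
        elif inside:
            break  # next top-level field (or a blank line) reached
    return " ".join(collected)


# Hypothetical excerpt modelled on the response shown above.
sample = """\
ACCESSION   P22217
VERSION     P22217.3
DBSOURCE    UniProtKB: locus TRX1_YEAST, accession P22217;
            class: standard.
            extra accessions:D6VY45
            created: Aug 1, 1991.
KEYWORDS    .
"""

dbsource = extract_field(sample, "DBSOURCE")
print(dbsource)
```

Reading line by line avoids having to list every character that may appear inside the field, which is where a character-class regex tends to go wrong.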

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, and it contains the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before this, because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it by index (x[0], not x(1)). By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the content as a string. re.sub is not needed either; a single pattern with a capture group, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
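For readers on Python 3, the same idea can be sketched as follows. The page string here is a made-up stand-in for the downloaded source, and the guard against an empty result is an addition for illustration.

```python
import re

# Stand-in for the downloaded page source (in Python 3, the bytes returned by
# urllib.request.urlopen(url).read() would need .decode() first).
page = '<a href="/perl/chessgame?gid=1012809">Alekhine vs Naegeli</a>'

gids = re.findall(r"gid=([0-9]+)", page)   # list of captured digit strings
first_gid = gids[0] if gids else None      # guard against an empty list
print(first_gid)
```

With a capture group in the pattern, findall returns only the captured digits, so no re.sub step is needed afterwards.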
You don't need regex here at all. A CSS selector along with some string manipulation will get you there. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
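As an alternative to splitting on "gid=", the query string of the extracted href could be parsed with the standard library. This is only a sketch: the href is the one quoted in the question, and rebuilding the game URL with urlencode is an illustrative choice.

```python
from urllib.parse import urlparse, parse_qs, urlencode

href = "http://www.chessgames.com/perl/chessgame?gid=1012809"

# parse_qs maps each query parameter to a list of its values.
query = parse_qs(urlparse(href).query)
gid = query["gid"][0]

# Rebuild the game URL from the extracted identifier.
game_url = "http://www.chessgames.com/perl/chessgame?" + urlencode({"gid": gid})
print(gid, game_url)
```

Parsing the query string keeps working if the link later carries extra parameters after gid, where split("gid=")[1] would also return whatever follows.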

Need to extract data from a website and store in list using regex

So I have a task which requires me to extract data from a website to form a 'top 10 list'. I have chosen the IMDb Top 250 page, http://www.imdb.com/chart/top.
In other words, I need a little help using regex to isolate the names of the films and then store them in a list. I already have the HTML stored in a variable as a string (if this is the wrong way of approaching it, let me know).
Also, I am limited to the modules urlopen, re and htmlparser.
import HTMLParser
from urllib import urlopen
import re
site = urlopen("http://www.imdb.com/chart/top?tt0468569")
content = site.read()
print content
You really shouldn't use regex, but you stated in your comment that you have to, so here it is with regex:
import re
import requests
respText = requests.get("http://www.imdb.com/chart/top").text
for title in re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL):
    print(title)
In BeautifulSoup (which you can't use):
from bs4 import BeautifulSoup
soup = BeautifulSoup(respText, "html.parser")
for item in soup.find_all("td", {"class" : "titleColumn"}):
    print(item.find("a").text)
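Since the question allows htmlparser, here is a rough Python 3 sketch using html.parser from the standard library. It assumes the td class="titleColumn" markup used in the snippets above; the class name TitleColumnParser and the sample rows are invented for illustration.

```python
from html.parser import HTMLParser


class TitleColumnParser(HTMLParser):
    """Collect the link text found inside <td class="titleColumn"> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "titleColumn" in (attrs.get("class") or "").split():
            self.in_cell = True
        elif tag == "a" and self.in_cell:
            self.in_link = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


# Hypothetical rows standing in for the real page source.
sample = """
<table>
<tr><td class="titleColumn">1. <a href="/title/x1/">First Film</a> <span>(1994)</span></td></tr>
<tr><td class="titleColumn">2. <a href="/title/x2/">Second Film</a> <span>(1972)</span></td></tr>
</table>
"""

parser = TitleColumnParser()
parser.feed(sample)
top_titles = [t.strip() for t in parser.titles][:10]  # keep at most ten
print(top_titles)
```

In real use, the string passed to feed would be the decoded page content rather than the sample.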

Python - regex lookup for multiple lines of HTML

How do you parse multiple lines of HTML using regex in Python? I have managed to match patterns on the same line using the code below.
i=0
while i<len(newschoollist):
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode="+ newschoollist[i] +"&orgtypecode=6&"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '>Phone:</td><td>(.+?)</td></tr>'
    pattern = re.compile(regex)
    value = re.findall(pattern,htmltext)
    print newschoollist[i], valuetag, value
    i+=1
However, when I try to match more complicated HTML like this...
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
I get null values. I believe the problem is with my syntax. I have googled regex and read most of the documentation, but I am looking for some help with this kind of application. I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that will tell regex to jump down a line of HTML?
What I want findall to pick up is the 90.1 when it finds "Attendance Rate".
Thanks!
Use an HTML Parser. Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'
soup = BeautifulSoup(urlopen(url))
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue
    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"
Prints (profile contact information):
...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
I ended up using (soup.get_text()) and it worked great. Thanks!
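On the literal regex question: by default . does not match a newline, but \s does, so \s* between the two cells lets a pattern step over the line break. A minimal Python 3 sketch using the two lines quoted in the question (the variable names are illustrative):

```python
import re

# The two lines quoted in the question, joined by a newline.
htmltext = "<td>Attendance Rate</td>\n<td class='center'> 90.1</td>"

# \s* absorbs the newline between the cells and the space before the number.
pattern = re.compile(
    r"<td>Attendance Rate</td>\s*<td class='center'>\s*([\d.]+)\s*</td>"
)
values = pattern.findall(htmltext)
print(values)
```

This stays fragile if the markup changes (extra attributes, different quoting), which is why an HTML parser remains the more robust choice.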

Dont Want Spaces Between Paragraphs : Python

I am web scraping a news website to get news articles by using the following code :
import mechanize
from selenium import webdriver
from bs4 import BeautifulSoup
url = "http://www.thehindu.com/archive/web/2012/06/19/"
link_dictionary = {}
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)
for tag_li in soup.findAll('li', attrs={"data-section":"Editorial"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext
driver.close()
I am getting the desired result, but I want each news article on a single line. For some articles I get the whole article on a single line, while for others the paragraphs come out on separate lines. Can someone help me sort out the issue? I am new to Python programming. Thanks and regards.
This is likely related to the way whitespace is managed in the particular site's HTML, and to the fact that not all sites use "p" tags for their content. Your best bet is probably a regular expression replace that eliminates the extra whitespace (including newlines).
At the beginning of your file, import the regular expression module:
import re
Then after you've built your articletext, add the following code:
print re.sub(r'\s+', ' ', articletext)
You might also want to extract the text from elements other than "p" tags that may contain article content.
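The same whitespace collapse can be sketched in Python 3 as below; the sample text is invented, and the split/join variant is shown as a regex-free alternative that should give the same result.

```python
import re

# Hypothetical article text with paragraph breaks and uneven spacing.
articletext = "First paragraph.\n\nSecond   paragraph,\twith a tab.\n"

# Replace every run of whitespace (spaces, tabs, newlines) with one space.
one_line = re.sub(r"\s+", " ", articletext).strip()

# Regex-free equivalent: split on any whitespace, then rejoin with spaces.
same_result = " ".join(articletext.split())
print(one_line)
```

The raw-string prefix on the pattern keeps \s from being treated as a string escape.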
