extracting a string from html using HTMLParser & Python 2.7

I am developing an Alexa skill, so I only have the standard Python (2.7) libraries available to use. That means I don't have BeautifulSoup4.
I'm trying to identify the below string within a full HTML page and then extract the string '104 spaces' from it. "104 spaces" is variable, but the surrounding code remains constant:
<p class="jp1"><strong>Car Parking</strong>104 spaces</p>
Is this possible to do with HTMLParser, or alternatively, could this be done using a regex search? I appreciate regex is not the best for parsing HTML, but given that I am using urllib2 (which gives me the HTML as a string) and that I am looking to extract a specific string that follows another specific string, it may suffice in this case.
One option I am thinking of is:-
s1 = "<strong>Car Parking</strong>104 spaces</p>"
s2 = "<strong>Car Parking</strong>"
print s1[s1.index(s2) + len(s2):]
This will return all of the text commencing with "104 spaces" and onwards. Therefore, how can I isolate this specific segment of text?
Thanks

I don't use Python, but here's a regex (shown in JavaScript). After this you need to strip the strong and p tags manually, which should be easy with Python.
var test = '<p class="jp1"><strong>Car Parking</strong>104 spaces</p>'
test = test.match(/[<][/]strong[>](.*)[<][/]p[>]/gmi)
// at this point test is: [ '</strong>104 spaces</p>' ]
test = test.join('')
test = test.replace('</strong>','')
test = test.replace('</p>','')
console.log('test: ', test)
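For completeness, here is a minimal sketch of the same extraction using only the stdlib HTMLParser available in Python 2.7. It assumes the page contains a single <p class="jp1"> block whose label text is exactly "Car Parking", as in the question, and that html_page holds the string read from urllib2:
from HTMLParser import HTMLParser

class CarParkingParser(HTMLParser):
    # capture the text node that follows the 'Car Parking' label
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_target_p = False
        self.label_seen = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and ('class', 'jp1') in attrs:
            self.in_target_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_target_p = False
            self.label_seen = False

    def handle_data(self, data):
        if self.in_target_p and data.strip() == 'Car Parking':
            self.label_seen = True  # the next data node is the value
        elif self.label_seen and self.result is None:
            self.result = data.strip()  # '104 spaces'

parser = CarParkingParser()
parser.feed(html_page)  # html_page: the HTML string read via urllib2
print parser.result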

Related

How do I remove double quotes from within retrieved JSON data

I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.
I fix bugs with regex as they come along, but this particular issue has me stuck. When web-scraping a job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded within the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into strings, clean out the HTML leftovers, then convert the string into JSON. However, I've hit a snag due to text within the job listing using quotes. Since the actual data is large, I'll just use a substitute.
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
Now the above won't run in Python, but the string I have that has been extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I'm not at all sure how to address this issue.
EDIT: Here's the actual code leading to the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re
uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()
listing_soup = BeautifulSoup(page_html, "lxml")
json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings
extracted_json_str = ''.join(json_script)
## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern=r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>",  # last is to get rid of </p> and </strong>
                                   repl='',
                                   string=extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern=r"\\u2019",
                                   repl=r"'",
                                   string=extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
                                   repl=r" -",
                                   string=extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
                                   repl="",
                                   string=extracted_json_str_CLEAN3)
## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control"). The way I've been going about extracting information from these job listings, a simple instance of someone using quotes causes my whole approach to blow up. Surely there's a better way to build this script without liabilities like this (and without having to use regex to fix each breakdown as it arises).
Thanks!
You need to apply the escape character (\) if you want a double quote (") inside a value. So your string input to json.loads() should look like the below.
example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
json.loads can parse this.
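If you want to make that repair programmatically rather than by hand, one possible heuristic (a sketch, not a robust fix) is to treat a quote as structural only when its nearest non-space neighbour is a JSON delimiter, and escape everything else. It handles the substitute string above, but it can still misfire, e.g. on a quoted phrase immediately followed by a comma inside a value:
import json

def escape_inner_quotes(s):
    # A quote is structural if its nearest non-space neighbour is a
    # JSON delimiter: { [ , : before it, or : , } ] after it.
    out = []
    for i, ch in enumerate(s):
        if ch != '"':
            out.append(ch)
            continue
        prev = s[:i].rstrip()[-1:]
        nxt = s[i + 1:].lstrip()[:1]
        structural = prev in '{[,:' or nxt in ':,}]'
        out.append('"' if structural else '\\"')
    return ''.join(out)

broken = '{"Category_X" : "Here is where the "PROBLEM" lies"}'
print(json.loads(escape_inner_quotes(broken)))
# {'Category_X': 'Here is where the "PROBLEM" lies'}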
# When you extract this, I think you should add a check for it.
# example:
if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
print(extraction)
In this case you will convert the " characters in extraction to '. One way or another you will need to convert them: Python gives you both quote styles, and if you want to use " inside a "-delimited string, you have to escape those symbols:
example:
"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
# in case the condition is met
if "\"" in item:
    # use this
    item = item.replace("\"", "\'")
    # or use this
    item = item.replace("\"", "\\\"")

Unable to parse some information from a webpage using search keywords

I've created a script to parse some information related to some songs from a website. When I try with this link or this one, my script works flawlessly. What I could understand is that when I append my search keyword after this portion https://www.billboard.com/music/, I get the desired page with the information.
However, things go wrong when I try with these keywords 1 Of The Girls or Al B. Sure! or Ashford & Simpson and so on.
I can't figure out how to append the above keywords after the base link https://www.billboard.com/music/ to locate the pages with information.
Script I've tried with:
import requests
from bs4 import BeautifulSoup
LINK = "https://www.billboard.com/music/Adele"
res = requests.get(LINK)
soup = BeautifulSoup(res.text,"lxml")
scores = [item.text for item in soup.select("[class$='-history__stats'] > p > span")]
print(scores)
Result I'm getting (as expected):
['4 No. 1 Hits', '6 Top 10 Hits', '13 Songs']
The result I'm after is located on that page just after the chart history.
How can I fetch some information from a webpage using critical search keywords?
I don't know all the use cases, but the obvious pattern I have seen for the cases mentioned is that special characters are stripped out (without leaving whitespace in their place), words are lower-cased, and then spaces are replaced with "-". The tricky bit may be the definition and handling of special characters.
e.g.
https://www.billboard.com/music/ashford-simpson
https://www.billboard.com/music/al-b-sure
https://www.billboard.com/music/1-of-the-girls
You could start with writing something to perform those string manipulations and then test the response code. Perhaps see if there is any form of validation in js files.
EDIT:
Multiple blanks between words become a single blank before being replaced with "-"?
Answer developed with @Mithu for preparing the search terms:
import re
keywords = ["Y?N-Vee", "Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]
spec_char = ["!", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", ".", "/", ":", ";", "<", "=", ">", "?", "@", "[", "]", "^", "_", "`", "{", "|", "}", "~", '"', "\\"]
for elem in keywords:
    refined_keyword = re.sub('-+', '-', ''.join(i.replace(" ", "-") for i in elem.lower() if i not in spec_char))
    print(refined_keyword)
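Following the suggestion above to test the response code, a small sketch of such a validation step might look like this (the slug rule is an observed pattern, not a documented API, so a failing status code just means the rule needs refining for that keyword):
import requests

def slug_resolves(slug, base="https://www.billboard.com/music/"):
    # a 200 response suggests the generated slug points at a real artist page
    return requests.get(base + slug).status_code == 200

print(slug_resolves("ashford-simpson"))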

Extract information from a webpage in a particular format

I am trying to make a simple Python script to extract certain links from a webpage. I am able to extract the links successfully, but now I want to extract some more information, like bitrate, size, and duration, given on that webpage.
I am using the below XPath to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that, for a particular link, the info I require is generated in the form of a tuple like (bitrate, size, duration).
The XPath I mentioned above gets the required info, but it is ill-formatted: it is not possible to achieve my required format with any straightforward logic, at least none that I can see.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job; it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
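As a rough sketch of the same extraction done with BeautifulSoup instead of XPath (assuming the id='song_html' container that the question's XPath targets):
import bs4
import urllib

html = urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read()
soup = bs4.BeautifulSoup(html)
song_div = soup.find(id='song_html')
if song_div is not None:
    # stripped_strings yields the text nodes with surrounding whitespace removed
    print list(song_div.div.stripped_strings)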
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for the whole thing, or:
info.rfind(" ")
since the translate leaves a space character, which you could replace with whatever you wanted.
Additional info found here
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice, and for the triple tuple, the Python tuple syntax takes care of it. Simply match against members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
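A rough sketch of how the (bitrate, size, duration) pieces could be pulled out of the info list shown in the question: strip the whitespace, drop the empty entries, then classify each surviving string by shape. The patterns below are guesses based on the sample output, not on the site's full markup:
import re

raw_info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t',
            '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
cleaned = [s.strip() for s in raw_info if s.strip()]
# cleaned -> ['3.71 mb', '3.49 mb', '192 kbps', '2:41']

def classify(text):
    if re.match(r'\d+(\.\d+)?\s*mb$', text, re.I):
        return 'size'
    if re.match(r'\d+\s*kbps$', text, re.I):
        return 'bitrate'
    if re.match(r'\d+:\d{2}$', text):
        return 'duration'
    return 'unknown'

print [(classify(t), t) for t in cleaned]
# [('size', '3.71 mb'), ('size', '3.49 mb'), ('bitrate', '192 kbps'), ('duration', '2:41')]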

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
A simple Python replace that substitutes <a href="http://example.com">example</a> for every occurrence of example would also mangle the URL inside the existing anchor, outputting something like:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html_body ="""
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)
# protect the text of existing links with '|' markers
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')
# replace the keyword in all remaining text nodes
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))
# restore the original link text
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]
print soup
Basically I'm stripping out all the text from the html_body and replacing the example word with the given link, without touching the text of existing links, which is protected by the '|' characters during the parsing.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
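As a rough sketch of that re.sub-with-a-function approach on plain text only (the keyword table and helper name here are made up for illustration):
import re

KEYWORD_LINKS = {'example': 'http://example.com'}  # hypothetical keyword/link pairs

def linkify(match):
    # wrap the matched word in an anchor if it is a known keyword
    word = match.group(0)
    url = KEYWORD_LINKS.get(word.lower())
    return '<a href="%s">%s</a>' % (url, word) if url else word

print re.sub(r'\w+', linkify, 'Here is an example link.')
# Here is an <a href="http://example.com">example</a> link.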

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using Xpath with lxml, but have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
