Extract information from a webpage in a particular format - python

I am trying to make a simple Python script to extract certain links from a webpage. I can extract the links successfully, but now I want to extract more information, such as bitrate, size, and duration, given on that webpage.
I am using the XPath below to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is the info for a particular link gathered into a tuple like (bitrate, size, duration).
The XPath above does produce the required info, but it is so ill-formatted that I have not found any logic that gets it into my required format; at least, I am not able to.
So, is there any way to achieve the output in my format?

I think BeautifulSoup will do the job; it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup. For example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
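
To get from there to your (bitrate, size, duration) tuples, you can collect the text fragments per result and pick the fields out by their units. A sketch, not a drop-in answer: the structure below is inferred from your xpath (repeated id='song_html' blocks with an info div inside) and may not match the real page exactly:

import bs4
import urllib

html = urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read()
soup = bs4.BeautifulSoup(html, 'html.parser')
for song in soup.find_all(id='song_html'):
    info_div = song.find('div')  # corresponds to div[1] in the xpath
    # keep only the non-empty text fragments, stripped of layout whitespace
    fields = [t.strip() for t in info_div.find_all(text=True) if t.strip()]
    bitrate = next((f for f in fields if 'kbps' in f), None)
    size = next((f for f in fields if 'mb' in f.lower()), None)
    duration = next((f for f in fields if ':' in f), None)
    print((bitrate, size, duration))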

You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
to take it all together, or:
info.rfind(" ")
since the translate leaves a space character behind, which you could replace with whatever you wanted.
Additional info can be found here.
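
For reference, here is a sketch of running that translate() through lxml from Python. Note that translate() is an XPath 1.0 string function, so it applies to the string value of the first matching text node; the '\n' and '\t' below become real newline/tab characters once Python processes the escapes:

import lxml.html

doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
# text()[2] picks the second text node of the div, i.e. '\n\t\t\t\t3.71 mb\t\t\t'
cleaned = doc.xpath("translate(.//*[@id='song_html']/div[1]/text()[2], '\n\t', '')")
print(cleaned)  # '3.71 mb' for the sample data in the question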

How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as for the triple, Python's tuple syntax takes care of it. Simply match against the members of your info array with re.match.
import re
matching_re = '.*'  # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
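
For this particular data, '.*' will not isolate anything useful. Here is a sketch of patterns that would pull the fields out of strings like those in the question (the regexes are guesses based on the sample values '3.71 mb', '192 kbps' and '2:41'):

import re

bitrate_re = re.compile(r'(\d+)\s*kbps', re.I)
size_re = re.compile(r'([\d.]+)\s*mb', re.I)
duration_re = re.compile(r'(\d+:\d{2})')

def extract(info):
    text = ' '.join(info)  # flatten the fragment list from the question
    bitrate = bitrate_re.search(text)
    size = size_re.search(text)
    duration = duration_re.search(text)
    return (bitrate.group(1) if bitrate else None,
            size.group(1) if size else None,
            duration.group(1) if duration else None)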

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
    <a href="https://longGibberishDownloadURL" title="Download">
        <img src="\azure_storage_blob\includes\download_for_windows.png"/>
    </a>
    sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile using "Name", "^N", etc.
<section class="onecol habonecol">
    <h3>
        Name
    </h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking it. Just remove the regular expression part and take the text, and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want, like so:
soup.find('section', attrs={'class': 'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text that has sentinel in it. Be careful that you also have to match surrounding characters such as whitespace; that is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
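
Putting the two steps together: the reason string= on the section itself returns None is that a tag's .string is None once it has child tags, so the filter never matches; searching inside the section sidesteps that. A sketch assuming the html structure from the question (the url is a placeholder):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/downloads'  # placeholder for the real page
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for section in soup.find_all('section', attrs={'class': 'onecol habonecol'}):
    # match on the section's text nodes rather than on the tag itself
    if section.find(string=re.compile(r'CIcyano')):
        print(section.find('a').get('href'))
        break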
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class': 'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

BeautifulSoup, getting more returns than expected with regex

Using BeautifulSoup, I have the following line:
dimensions = SOUP.select(".specs__title > h4", text=re.compile(r'Dimensions'))
However, it's returning more than just the tags that have a text of 'Dimensions' as shown in these results:
[<h4>Dimensions</h4>, <h4>Details</h4>, <h4>Warranty / Certifications</h4>]
Am I using the regex incorrectly with the way SOUP works?
The select interface doesn't have a text keyword. Before we go further, the following assumes you are using BeautifulSoup 4.7+.
If you'd like to filter by text, you might be able to do something like this:
dimensions = SOUP.select(".specs__title > h4:contains(Dimensions)")
More information on the :contains() pseudo-class implementation is available here: https://facelessuser.github.io/soupsieve/selectors/#:contains.
EDIT: To clarify, there is no way to incorporate regex directly into a select call currently. You would have to filter the elements after the fact to use regex. In the future there may be a way to use regex via some custom pseudo-class, but currently there is no such feature available in Soup Sieve (Beautiful Soup's select implementation in 4.7+).
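
For example, a post-filter sketch with regex, assuming the same SOUP object:

import re

dimensions = [h4 for h4 in SOUP.select('.specs__title > h4')
              if re.search(r'Dimensions', h4.get_text())]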

web-scraping, regex and iteration in python

I have the following url 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access urls from page=2 till page=12
For example, this url is needed 'http://www.alriyadh.com/file/278?&page=4', but not page = 14
I reckon what will work is a function that iterates over the specified 10 pages to access all the urls within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using the newspaper package. I simply want this data for my research.
Thanks in advance.
This does not require regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
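
If the goal is then to hand each page to the newspaper package mentioned in the question, here is a sketch of newspaper's usual Article workflow (reusing the url prefix above; how well newspaper extracts this particular site is untested here):

from newspaper import Article

for page in range(2, 13):
    article = Article(url + str(page))
    article.download()
    article.parse()
    print(article.text)  # the extracted body text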
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
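
In context, a sketch of using that range regex to filter candidate urls; anchoring it to 'page=' and the end-of-string $ keeps page=13 and page=14 from slipping through:

import re

page_re = re.compile(r'page=([2-9]|1[012])$')
urls = ['http://www.alriyadh.com/file/278?&page=' + str(n) for n in range(1, 15)]
wanted = [u for u in urls if page_re.search(u)]  # page=2 through page=12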
Judging by what you have now, I am unsure that your regex will work as you intend. Perhaps I am misinterpreting it altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by an '=' immediately followed by any number 2-9?
How familiar are you with regexes in general? This particular one seems dangerously vague to produce a meaningful match.

Extract template arguments in Python from MediaWiki's API wikitext

Is there a way to extract parts of the text from MediaWiki's API? For example, this link dumps all the content into XML format:
http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content&format=xml
But there isn't much structure to it, even in the json format.
I'd like to get the text of Writer1_1, Penciler1_1, etc. Perhaps I'm not making my parameters right, so maybe there are other options I could output.
You can see the content in a more user-readable way here.
I'm sure the regex and final splitting could be more efficient, but this gets the job done for what you asked.
import urllib2
import re

data = urllib2.urlopen('http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content')
regex = re.compile('(Writer1_1|Penciler1_1)')
for line in data.read().split('|'):
    if regex.search(line):
        # assume everything after = is the full name
        print ' '.join(line.split()[2:])
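
If installing a library is an option, mwparserfromhell parses the wikitext into template objects, so arguments such as Writer1_1 can be read by name rather than split on '|'. A sketch, assuming the classic action=query JSON layout (query -> pages -> revisions -> '*'):

import requests
import mwparserfromhell

api = ('http://marvel.wikia.com/api.php?action=query&prop=revisions'
       '&titles=All-New%20X-Men%20Vol%201%201&rvprop=content&format=json')
pages = requests.get(api).json()['query']['pages']
wikitext = next(iter(pages.values()))['revisions'][0]['*']

for template in mwparserfromhell.parse(wikitext).filter_templates():
    for name in ('Writer1_1', 'Penciler1_1'):
        if template.has(name):
            print(str(template.get(name).value).strip())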

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href="http://example.com">example.com</a>
By a simple Python replace function which replaces example with <a href="http://example.com">example</a>, it would output broken HTML:
Here is an <a href="http://example.com">example</a> link:<a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link:<a href="http://example.com">example.com</a>
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""

soup = BeautifulSoup(html_body)

# mark the text of existing links so it is skipped below
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

for text in soup.findAll(text=True):
    text_formatted = ["<a href='http://example.com'>example</a>"
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# restore the original link text
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm stripping all the text out of the post body and replacing the example word with the given link, without touching the text of existing links, which is protected by the '|' characters during the parsing.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function; but unless you are operating on plain text, you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of an element.
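
A minimal sketch of the re.sub-with-a-function approach on plain text (the keyword/link mapping is hypothetical, and this deliberately skips the HTML-parsing step described above):

import re

keywords = {'example': 'http://example.com'}  # hypothetical keyword/link pairs

def linkify(match):
    word = match.group(0)
    return '<a href="%s">%s</a>' % (keywords[word.lower()], word)

pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, keywords)), re.IGNORECASE)
print(pattern.sub(linkify, 'Here is an example link.'))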
