How to extract JavaScript variables using Python bs4

<script type="text/javascript">var csrfMagicToken = "sid:bf8be784734837a64a47fcc30b9df99,162591180";var csrfMagicName = "__csrf_magic";</script>
The above script tag is from a webpage.
script = soup.find_all('script')[5]
Using the above line of code I was able to extract the script tag I want, but I need to extract the values of the variables in a Python script. I am using BeautifulSoup to extract the data.

You could use
(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"
See a demo on regex101.com.
Note: there are a couple of general drawbacks to using regular expressions on code. For example, with the above, something like let x = -10; would not be matched even though it is perfectly valid JavaScript. Single quotes are not supported (yet) either - it all depends on your actual input.
That being said, you could go for:
(?:var|let)\s+
(?P<key>\w+)\s*=\s*
(['"])?(?(2)(?P<value1>.+?)\2|(?P<value2>[^;]+))
See another demo on regex101.com.
This still leaves you helpless against escaped quotes like let x = "some \" string"; or against variable declarations in comments. In general, favour a parser solution.
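For illustration, here is a minimal sketch combining the first pattern with BeautifulSoup (the markup is the script tag from the question; index [0] stands in for the [5] used on the real page):

import re
from bs4 import BeautifulSoup

html = '<script type="text/javascript">var csrfMagicToken = "sid:bf8be784734837a64a47fcc30b9df99,162591180";var csrfMagicName = "__csrf_magic";</script>'
soup = BeautifulSoup(html, 'html.parser')
script = soup.find_all('script')[0]  # [5] on the real page

# collect every var/let declaration with a double-quoted value
variables = dict(re.findall(r'(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"', script.string))
print(variables['csrfMagicToken'])  # sid:bf8be784734837a64a47fcc30b9df99,162591180
print(variables['csrfMagicName'])   # __csrf_magic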

Related

How do I remove the AdSense code in requests_html?

I am using the requests_html library to scrape a website, but the grabbed text also includes the AdSense code from that site. The example looks something like this:
some text some text some text some text and then this:
(adsbygoogle = window.adsbygoogle || []).push({});
some text some text some text after a line break and then this:
sas.cmd.push(function() {
    sas.call("std", {
        siteId: 301357,    //
        pageId: 1101926,   // Page : Seneweb_AF/rg
        formatId: 49048,   // Format : Pave 2 300x250
        target: ''         // Ciblage
    });
});
Now how can I get rid of the ad code shown above?
If requests_html doesn't have a built-in mechanism for handling this, then a solution is to use pure Python; this is what I found so far:
curated_article = article.text.split('\n')
curated_article = "\n".join(list(filter(lambda a: not a.startswith("&#"), curated_article)))
print(curated_article)
where article is the HTML for a scraped article.
Assuming you are able to get hold of the text as a string before you need to remove the unwanted parts, you can search and replace.
If (adsbygoogle = window.adsbygoogle || []).push({}); is always the exact same string (including the same whitespace every time), then you can use str.replace().
See How to use string.replace() in python 3.x.
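A one-line sketch of that case (the sample input is assumed; in practice it would be the text grabbed with requests_html):

article_text = 'some text\n(adsbygoogle = window.adsbygoogle || []).push({});\nmore text'  # assumed sample
cleaned = article_text.replace('(adsbygoogle = window.adsbygoogle || []).push({});', '')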
If the text is not the exact same thing every time--and I am guessing that at least the second example you showed is not the same every time--then you can use regular expressions. See the python documentation of the re module.
If you only use a few regular expressions in your program you can just call re.sub,
something like this:
sanitized_text = re.sub(regularexpression, '', original_text, flags=re.MULTILINE|re.DOTALL)
It may take some trial and error to get a pattern that matches every case like the second example.
You'll want re.MULTILINE if your pattern anchors on ^ or $, since there will almost certainly be newlines inside the retrieved article, and re.DOTALL so that . matches across line boundaries, which it appears the second example will require.
If you end up having to use several regular expressions you can compile them using re.compile before you start scraping:
pattern = re.compile(regularexpression, flags=re.MULTILINE|re.DOTALL)
Later, when you have text to remove pieces from, you can do the search and replace like this:
sanitized_text = pattern.sub('', original_text)
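For instance, patterns along these lines would strip both ad snippets from the question (the exact regexes are assumptions; adjust them to the pages you scrape):

import re

# assumed sample input; in practice this is the text grabbed with requests_html
original_text = ('some text some text\n'
                 '(adsbygoogle = window.adsbygoogle || []).push({});\n'
                 'more text\n'
                 'sas.cmd.push(function() {\n'
                 '    sas.call("std", {\n'
                 '        siteId: 301357\n'
                 '    });\n'
                 '});\n')

# one pattern per ad snippet; DOTALL lets .*? run across line breaks
ads_pattern = re.compile(r'\(adsbygoogle = window\.adsbygoogle \|\| \[\]\)\.push\(\{\}\);')
sas_pattern = re.compile(r'sas\.cmd\.push\(function\(\)\s*\{.*?\}\);\s*\}\);', re.DOTALL)

sanitized_text = sas_pattern.sub('', ads_pattern.sub('', original_text))
print(sanitized_text)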

Regular Expression - HTML

I am kind of new to regular expressions, but the one I made myself doesn't work. It is supposed to give me data from a website's HTML.
I basically want to get this out of the HTML, and all of the multiple ones like it. I have the page URL as a string, by the way:
<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>
And what I've done for my regexp is:
<a\bhref="http://store.steampowered.com/search/?category2=2"\bclass="name"*>(.*?)</a>\g
You should never parse HTML/XML or any other language that allows cascading using regular expressions.
A nice thing with HTML however, is that it can be converted to XML and XML has a nice toolkit for parsing:
echo '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>' | tidy -asxhtml -numeric 2> /dev/null | xmllint --html --xpath 'normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])' - 2>/dev/null
With query:
normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])
// means any tag (regardless of its depth), a means the a tag, and we furthermore specify the constraints class="name" and href=(the link). We then return the normalize-space'd content between the matching <a> and </a> tags.
In Python you can use:
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://store.steampowered.com/app/24860/").read()
soup = BeautifulSoup(page, 'html.parser')
print(soup.find_all('a', attrs={'class': 'name', 'href': 'http://store.steampowered.com/search/?category2=2'}))
Comment on your regex:
The problem is that it contains tokens like ? that are interpreted as regex directives rather than literal characters. You need to escape them. It should probably read:
<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>\g
I also replaced \b with \s, \s means space characters like space, tab, new line. Although the regex is quite fragile: if one ever decides to swap href and class, the program has a problem. For most of these problems, there are indeed solutions, but you better use an XML analysis tool.
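A minimal sketch of the corrected pattern in Python (the trailing \g from the original attempt is dropped, since it is not valid in a Python pattern; the sample markup is reconstructed from the question):

import re

page_html = '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>'
pattern = re.compile(r'<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>')
print(pattern.findall(page_html))  # ['Co-Op']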

Finding a random sentence in HTML with python regex

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether, simply splitting the HTML into a list separated by <br> and pulling [3]. It made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
from urllib.request import urlopen
import re

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read().decode('utf-8')
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if len(output) > 0:
    print(output)
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace the ones inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.
All jokes on that page follow the same model, with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag because there's no dot.
This is uh, 7 years later, but for future reference:
Use the beautifulsoup library for this kind of purpose, as suggested by Floris in the comments.
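A minimal sketch of that approach, assuming the page still serves a single quote in the body as described above:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
soup = BeautifulSoup(html, 'html.parser')
# get_text() drops the tags entirely, so no <br> handling is needed
print(soup.body.get_text().strip())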

How to modify lxml autolink to be more liberal?

I am using the autolink function of the great lxml library as documented here: http://lxml.de/api/lxml.html.clean-module.html
My problem is that it only detects URLs that start with http://.
I would like to use a broader URL-detection regex like this one:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I tried to make that regex work with the lxml autolink function without success.
I always end up with a:
lxml\html\clean.py", line 571, in _link_text
host = match.group('host')
IndexError: no such group
Any python/regex gurus out there who know how to make this work?
There are two things to do in order to adapt the regexp to lxml's autolink. First, wrap the entire URL pattern in a (?P<body> .. ) group - this lets lxml know what goes inside the href="" attribute.
Next, wrap the host part in a (?P<host> .. ) group and pass the avoid_hosts=[] parameter when you call the autolink function. The reason is that the regexp pattern you're using doesn't always find a host (sometimes the host part will be None), since it matches partial URLs and ambiguous URL-like patterns.
I've modified the regexp to include the above changes and given a snippet test case:
import re
import lxml.html
import lxml.html.clean
url_regexp = re.compile(r"""(?i)\b(?P<body>(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|(?P<host>[a-z0-9.\-]+[.][a-z]{2,4}/))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")
DOC = """<html><body>
http://foo.com/blah_blah
http://foo.com/blah_blah/.
http://www.extinguishedscholar.com/wpglob/?p=364.
http://✪df.ws/1234
rdar://1234
rdar:/1234
message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e
What about <mailto:gruber@daringfireball.net?subject=TEST> (including brokets).
bit.ly/foo
</body></html>"""
tree = lxml.html.fromstring(DOC)
body = tree.find('body')
lxml.html.clean.autolink(body, [url_regexp], avoid_hosts=[])
print(lxml.html.tostring(tree))
Output:
<html><body>
http://foo.com/blah_blah
http://foo.com/blah_blah/.
http://www.extinguishedscholar.com/wpglob/?p=364.
http://✪df.ws/1234
rdar://1234
rdar:/1234
message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e
What about <mailto:gruber@daringfireball.net?subject=TEST>
(including brokets).
bit.ly/foo
</body></html>
You don't really give enough information to be sure, but I bet that you're having escaping issues with the backslashes in Gruber's regex. Try using a raw string, which allows backslashes without escaping, and triple-quotes, which allow you to use quotes in the string without having to escape those either. E.g.
re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source (a link tag):
<a href="http://www.ptop.se">ptop</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
This should work, although there might be more elegant ways:
import re

html = '<a href="http://www.ptop.se">ptop</a>'  # the question's link tag
r = re.compile('(?<=href=").*?(?=")')
print(r.findall(html))  # ['http://www.ptop.se']
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
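A rough sketch of that approach, reusing the raw-string version of the pattern quoted in the lxml question above (treat it as illustrative; Gruber has published revised versions since):

import re

# Gruber's liberal URL pattern, as quoted earlier on this page
url_pattern = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")

text = 'See <a href="http://www.ptop.se">ptop</a> or bit.ly/foo'
for m in url_pattern.finditer(text):
    print(m.group(1))  # http://www.ptop.se, then bit.ly/foo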
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
This regex can help you; you should get the first group by \1 or whatever method your language provides.
href="([^"]*)
example:
<a href="http://www.amghezi.com">amghezi</a>
result:
http://www.amghezi.com
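In Python, that looks like this (a small sketch using the answer's own example):

import re

match = re.search(r'href="([^"]*)', '<a href="http://www.amghezi.com">amghezi</a>')
if match:
    print(match.group(1))  # http://www.amghezi.com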
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well using an optional match (it prints what comes after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
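The same pattern in Python (a sketch; the Wikipedia-style markup is an assumed stand-in for the page that produced the matches above):

import re

html = '<a href="/wiki/Main_Page">Main page</a> <a href="/wiki/Special:Random">Random</a>'
print(re.findall(r'(?:href=[\'"])([:/.A-z?<_&\s=>0-9;-]+)', html))
# ['/wiki/Main_Page', '/wiki/Special:Random']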
You can use this.
<a[^>]+href=["'](.*?)["']
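For completeness, a sketch of this last pattern in use (sample markup assumed):

import re

html = '<a class="name" href="http://www.ptop.se">ptop</a>'
print(re.findall(r'<a[^>]+href=["\'](.*?)["\']', html))  # ['http://www.ptop.se']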
