I am using the requests_html library to scrape a website, but along with the article text I am also getting the AdSense and ad-serving JavaScript from that page. The result looks something like this:
some text some text some text some text and then this:
(adsbygoogle = window.adsbygoogle || []).push({});
some text some text some text after a line break and then this:
sas.cmd.push(function() {
    sas.call("std", {
        siteId: 301357, //
        pageId: 1101926, // Page : Seneweb_AF/rg
        formatId: 49048, // Format : Pave 2 300x250
        target: '' // Ciblage
    });
});
Now how can I get rid of the JavaScript snippets shown above?
If requests_html doesn't have a built-in mechanism for handling this, then a solution in pure Python would be fine too; this is what I have found so far:
curated_article = article.text.split('\n')
# keep only the lines that do not start with "&#"
curated_article = "\n".join(line for line in curated_article if not line.startswith("&#"))
print(curated_article)
where article is the HTML element of a scraped article.
Assuming you are able to get hold of the text as a string before you need to remove the unwanted parts, you can search and replace.
If (adsbygoogle = window.adsbygoogle || []).push({}); is always the exact same string (including the same whitespace every time), then you can use str.replace().
See How to use string.replace() in python 3.x.
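For example, a minimal sketch (here article_text stands for whatever string you extracted):
ad_line = '(adsbygoogle = window.adsbygoogle || []).push({});'
article_text = article_text.replace(ad_line, '')  # removes every exact occurrence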
If the text is not the exact same thing every time--and I am guessing that at least the second example you showed is not the same every time--then you can use regular expressions. See the Python documentation of the re module.
If you only use a few regular expressions in your program you can just call re.sub,
something like this:
sanitized_text = re.sub(regularexpression, '', original_text, flags=re.MULTILINE|re.DOTALL)
It may take some trial and error to get a pattern that matches every case like the second example.
You'll need re.DOTALL to make . match across line boundaries, which it appears the second example will require, and re.MULTILINE only if your pattern anchors ^ or $ to the start or end of individual lines inside the article.
If you end up having to use several regular expressions you can compile them using re.compile before you start scraping:
pattern = re.compile(regularexpression, flags=re.MULTILINE|re.DOTALL)
Later, when you have text to remove pieces from, you can do the search and replace like this:
sanitized_text = pattern.sub('', original_text)
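For your second snippet, putting it together might look like this (a sketch only: the pattern is my guess at the general shape of those sas.cmd.push(...) blocks, and original_text stands for the scraped article text):
import re

# match a whole sas.cmd.push(function() { ... }); block, spanning lines
pattern = re.compile(r'sas\.cmd\.push\(function\(\)\s*\{.*?\}\);\s*\}\);', flags=re.DOTALL)
sanitized_text = pattern.sub('', original_text)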
<script type="text/javascript">var csrfMagicToken = "sid:bf8be784734837a64a47fcc30b9df99,162591180";var csrfMagicName = "__csrf_magic";</script>
The above script tag is from a webpage.
script = soup.find_all('script')[5]
By using the above line of code I was able to extract the script tag I want, but I still need to extract the values of the variables inside it. I am using BeautifulSoup in my Python script to extract the data.
You could use
(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"
See a demo on regex101.com.
Note: there are a couple of general drawbacks to using regular expressions on code. E.g. with the above, something like let x = -10; would not be matched but would be totally valid JavaScript. Also, single quotes are not supported (yet); it totally depends on your actual input.
That being said, you could go for:
(?:var|let)\s+
(?P<key>\w+)\s*=\s*
(['"])?(?(2)(?P<value1>.+?)\2|(?P<value2>[^;]+))
See another demo on regex101.com.
This still leaves you helpless against escaped quotes like let x = "some \" string"; or against variable declarations in comments. In general, favour a parser solution.
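A minimal sketch of how the first pattern could be wired up with BeautifulSoup (using the script tag from your example):
import re
from bs4 import BeautifulSoup

html = '<script type="text/javascript">var csrfMagicToken = "sid:bf8be784734837a64a47fcc30b9df99,162591180";var csrfMagicName = "__csrf_magic";</script>'
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script')

# every var/let declaration with a double-quoted value, as a name -> value dict
variables = dict(re.findall(r'(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"', script.string))
print(variables['csrfMagicToken'])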
I have tried to use this:
import requests

c = requests.get('https://www.uniberg.com/referenzen.html').text
c.count('Programmierung')
But the output shows 2 occurrences while there are actually none.
Also I tried this:
a = requests.get('https://www.uniberg.com/index.html').text.count('Mitarbeiter')
but it also returns the count of words like Mitarbeiterphilosophie, which I don't want.
Can someone find a way to improve this or suggest another method?
Today https://www.uniberg.com/referenzen.html does contain 2 occurrences of Programmierung.
I think you need to check the HTML source code, not the page as rendered by a browser.
The word Programmierung appears in an HTML section hidden with this CSS:
section .detail {
    display: none;
}
For the second point, try this (using a regex):
import re
import requests

len(re.findall(r'\WMitarbeiter\W', requests.get('https://www.uniberg.com/index.html').text))
In this regex:
\w stands for "word character", usually [A-Za-z0-9_].
\W is short for [^\w], the negated version of \w.
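One caveat: \W needs an actual non-word character on each side, so an occurrence at the very start or end of the string is missed. The zero-width word boundary \b avoids that, for example:
import re
import requests

html = requests.get('https://www.uniberg.com/index.html').text
# \b matches at a word/non-word transition without consuming a character,
# so Mitarbeiterphilosophie still does not match
count = len(re.findall(r'\bMitarbeiter\b', html))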
requests.get(URL) returns the entire web page (look at it with Ctrl+U in Google Chrome, or just use wget to download the page), not just what is rendered by the browser. That's why the count shows up as 2.
I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This however needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether, simply splitting the HTML into a list separated by <br> and pulling [3]; it made for cleaner code and fewer string operations. Keeping this question up for future reference and other people with similar questions.
You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> pairs on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
from urllib import urlopen
import re
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if len(output) > 0:
    print(output)
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace the ones inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.
All jokes on that page follow the same model, with nothing ambiguous, so you can use this:
output = re.findall(r'(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag because there's no dot in the pattern.
This is, uh, 7 years later, but for future reference:
use the BeautifulSoup library for this kind of purpose, as suggested by Floris in the comments.
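A minimal sketch of what that could look like (assuming the page still serves the quote as body text separated by <br> tags):
import requests
from bs4 import BeautifulSoup

html = requests.get('http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1').text
soup = BeautifulSoup(html, 'html.parser')
# get_text() drops all tags; the quote is whatever text the body contains
quote = soup.body.get_text(separator=' ', strip=True)
print(quote)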
I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '''
... http://www.google.com Some other text.
... And even more text! http://stackoverflow.com
... '''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the URL from the href attribute, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an A element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http://.*?)[\"' <]", content))
and see if it serves your purpose.
Writing a regex pattern that matches every valid URL is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = 'http://www.google.com Some other text. And even more text! http://www.google.com http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
<a href="is.gd/test">is.gd/test</a> http://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)

for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, a regex will not be able to parse a string like this. Regexes are capable of simple matching only, and you can't handle parsing a grammar as complicated as HTML with just one or two regexes.
I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
    <h3 class="price">
        $12.99
    </h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+?, which works except that it also matches within the HTML tags, so h3 class becomes h3class, and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in place, and generate your output again using said library.
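A minimal sketch of that approach with lxml, applied to the HTML fragment inside the braces (the [$$price$$]{...} wrapper itself isn't HTML, so it would need separate handling):
import lxml.html as lh

fragment = '<h3 class="price">\n\t\t$12.99\n\t</h3>'
tree = lh.fromstring(fragment)

# collapse whitespace in every text node; the tags themselves are untouched
for node in tree.iter():
    if node.text:
        node.text = ' '.join(node.text.split())
    if node.tail:
        node.tail = ' '.join(node.tail.split())

print(lh.tostring(tree).decode())  # <h3 class="price">$12.99</h3>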
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
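Applied to the sample it gives exactly the wanted output (a quick sketch; it works here because every whitespace run in the template begins with a newline):
import re

text = '''[$$price$$]
{
    <h3 class="price">
        $12.99
    </h3>
}'''
print(re.sub(r'\r?\n[ \t]*', '', text))  # [$$price$$]{<h3 class="price">$12.99</h3>}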
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the text nodes only. It sounds like overkill, but it's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then you know that any text has to be between > and < except at the very beginning and end (assuming you just get a snippet of the page; otherwise there is at least the <html> tag.)
So... You still need two regex's: find the text '>[^<]+<', then apply the other regex as you mentioned.
The other way is to use an alternation, with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or a run of whitespace. When the match is a tag, keep it as is; when it is whitespace, replace it with an empty string, as in the sketch below.
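A sketch of that idea using a replacement callback (untested against real templates; note the \t added to the whitespace class, since the question mentions tabs):
import re

def tag_or_space(match):
    # group 1 is set when a tag matched: keep it verbatim;
    # otherwise a whitespace run matched: drop it
    return match.group(1) or ''

text = '''[$$price$$]
{
    <h3 class="price">
        $12.99
    </h3>
}'''
print(re.sub(r'(<[^>]*>)|[\r\n\f\t ]+', tag_or_space, text))
# [$$price$$]{<h3 class="price">$12.99</h3>}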