Consider this example, which I've run on Python 2.7:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
tstr = r''' <div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp"> </span></span><a
id="Xtester"></a><span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
'''
# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print( re.findall(regstr, tout2, re.DOTALL)) # finds
print("------") #
print( re.sub(regstr, "AAAAAAA", tout2, re.DOTALL )) # does nothing?
When I run this, the first regex is replaced/sub'd as expected (the <a id="Xtester"></a> is gone); then in the output I get:
[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
... which means that the second regex is written correctly (all three parts are found) - but then, when I try to replace all of that snippet with "AAAAAAA", nothing happens in that part of the output:
------
<div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp"> </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
Clearly, there is no "AAAAAAA" here, as I'd expect.
What is the problem, and what should I do to get sub to replace the matches that have apparently been found?
Why not use an HTML parser for parsing and modifying HTML?
Example, using BeautifulSoup and replace_with():
from bs4 import BeautifulSoup
data = """Your html here"""
soup = BeautifulSoup(data)
for link in soup('a', id=True):
    link.replace_with('AAAAAA')
print(soup.prettify())
This replaces all of the links that have an id attribute with the text AAAAAA:
<div class="thebibliography">
<p class="bibitem">
<span class="biblabel">
[1]
<span class="bibsp">
</span>
</span>
AAAAAA
<span class="cmcsc-10">
...
Also see:
RegEx match open tags except XHTML self-contained tags
Your replacement doesn't work due to a misuse of the re.sub method. If you look at the documentation:
re.sub(pattern, repl, string, count=0, flags=0)
But in your code, you put the flag in the count place. This is why the re.DOTALL flag is ignored: it sits in the wrong position.
Since you don't need to use the count param, you can remove the re.DOTALL flag and use an inline modifier instead:
regstr = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''
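For example, with that pattern the substitution then needs no extra arguments at all (a minimal sketch reusing tout2 from the question):
# (?s) turns on DOTALL inside the pattern itself,
# so nothing can accidentally land in the count slot
print(re.sub(regstr, "AAAAAAA", tout2))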
However, using something like bs4 is probably more convenient (as you can see in alecxe's answer).
It's quite simple: the Python Standard Library Reference says the syntax of re.sub is re.sub(pattern, repl, string, count=0, flags=0). So your last sub is in fact (as re.DOTALL == 16):
re.sub(regstr, "AAAAAAA", tout2, count = 16, flags = 0 )
when you need:
re.sub(regstr, "AAAAAAA", tout2, flags = re.DOTALL )
and that last sub works perfectly...
The problem is that your arguments were wrong.
Python 2.7 source (signature of re.sub):
def sub(pattern, repl, string, count=0, flags=0):
    # ...
Here, your re.DOTALL argument is being treated as the count argument.
FIX: Use re.sub(regstr, "AAAAAAA", tout2, flags=re.DOTALL ) instead
Note: If you try using compile with your regex, sub works just fine.
Well, in this case apparently I should have used a compiled regex object (instead of going directly through the re module-level call), and all seems to work (I can even use backreferences) - but I still don't understand why the problem occurred at all. It would be good to learn why eventually... Anyway, this is the corrected code snippet:
# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, flags=re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
pat = re.compile(regstr, re.DOTALL)
#~ print( re.findall(regstr, tout2, re.DOTALL)) # finds
print( pat.findall(tout2)) # finds
print("------") #
# re.purge() # no need
print( pat.sub(r'\1AAAAAAA\3', tout2) ) # now the replacement works
... and this is the output:
[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
------
<div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp"> </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" AAAAAAA ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
Related
I'm trying to scrape data from a listing website with the following HTML structure:
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-building_size="31" data-category="condominium" data-condominiumname="Twin Lakes Countrywoods" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>
<div class="ListingCell-TitleWrapper">
<h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<a class="js-listing-link" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
</a>
</h3>
<div class="ListingCell-KeyInfo-address ellipsis">
<a class="js-listing-link ellipsis" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<span class="icon-pin">
</span>
<span>
Tagaytay Hi-Way
Dayap Itaas, Laurel
</span>
</a>
</div>
What I want to get is the info inside <div class="ListingCell-AllInfo ListingUnit"... - the data-bathrooms, data-bedrooms, data-block, etc. attributes.
I tried to scrape it using Python BeautifulSoup:
details = container.find('div',class_="ListingCell-AllInfo ListingUnit").text if container.find('div',class_="ListingCell-AllInfo ListingUnit") else "-"
It's been returning "-" for all listings. Complete newbie here!
You can use BeautifulSoup; that would be better - it has always worked for me.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request("put your url here", headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage)
title = soup.find_all('tag you want to scrape', class_='class of that tag')
Visit the link for more info: https://pypi.org/project/beautifulsoup4/
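If the goal is specifically the data-* attributes from the question's snippet, they can be read straight off the tag's attribute dictionary. A minimal sketch, assuming the listing markup is stored in a string named html (the variable name is just for illustration):
from bs4 import BeautifulSoup

html = """put the listing html here"""
soup = BeautifulSoup(html, "html.parser")

cell = soup.find("div", class_="ListingCell-AllInfo ListingUnit")
if cell:
    # .attrs is a plain dict of every attribute on the tag,
    # including data-bathrooms, data-bedrooms, data-price, ...
    data = {k: v for k, v in cell.attrs.items() if k.startswith("data-")}
    print(data)
else:
    print("-")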
Hi there! You could use a regular expression to solve your issue.
I have introduced a few comments in my solution, but for more information
take a look at the official documentation
or read this
import re # regular expression module
txt = """insert your html here"""
# we create a regex pattern called p1 that will match a string starting with
# <div class="ListingCell-AllInfo ListingUnit"
# followed by anything (any character) found 0 or more times
# and the string must end with '>'
p1 = re.compile(r'<div class="ListingCell-AllInfo ListingUnit".*>')
# findall returns a list of the strings that match the pattern p1 in txt
ls = p1.findall(txt)
# now, what you want is the data, so we can create another pattern where the word
# "data" will be found:
# match a string starting with "data", followed by '-', then 0 or more word characters,
# then '=', then any characters after the '=' that are not whitespace
p2 = re.compile(r'(data-\w*=\S*)')
data = p2.findall(ls[0])
print(data)
Note: Don't be scared by the funky symbols - they look way worse than they truly are.
I'm trying to extract tuples from a url, and I've managed to extract text strings and tuples using re.search(pattern_str, text_str). However, I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
text_above = "..." # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of_tuples:
    # supposed to print "11111 -> blah blah 111"
    print(t[0], '->', t[1])
Maybe I'm trying something weird & impossible, and maybe it's better to extract the data using primitive string manipulation... but in case a solution exists?
Your regex does not take into account the whitespace (indentation) between \n and <span. (Nor the whitespace at the start of the line you want to capture, but that's not as much of a problem.) To fix it, you could add some \s*:
pat_str = '<a href="(\d+)">\n\s*(.+)\n\s*<span class'
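A quick sketch of that fix against the sample text (text_above is the string shown in the question):
import re

pat_str = r'<a href="(\d+)">\n\s*(.+)\n\s*<span class'
for href, text in re.findall(pat_str, text_above):
    print(href, '->', text.strip())   # e.g. 11111 -> some text 111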
As suggested in the comments, use a html parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h)
You can get the href and the previous_sibling to the span:
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or use .find(text=True) to get only the tag's own text and not that of its children.
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also if you just want the anchors inside the list tags, you can specifically parse those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]
I tried using "<.+>\s*(.*?)\s*<\/?.+>" on an HTML file. The following is the Python code I used:
import re
def recursiveExtractor(content):
    re1 = '(<.+>\s*(.+?)\s*<\/?.+>)'
    m = re.findall(re1, content)
    if m:
        for (id, item) in enumerate(m):
            text = m[id][1]
            if text: print text, "\n"
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
recursiveExtractor(f)
But it skips some text since HTML is nested and regex restarts search from the end of the matched part.
For the above input,
the output is
<div class='b'>
<div class='d'>text2</div>
</div>
But the expected Output is:
text1
text2
Edit:
I read that HTML is not a regular language and hence can't be parsed with regex. From what I understand, it is not possible to match nested tags with .* (i.e. pair tags with their matching closing tags).
But what I need is just the text between any tags, for instance text1, text2, text3 - so I am fine with a list of "text1", "text2", "text3".
Why not just do this:
import re
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
x = re.sub('<[^>]*>', '', f) # you can also use re.sub('<[A-Za-z\/][^>]*>', '', f)
print '\n'.join(x.split())
This will have the following output:
text1
text2
I have the following XML file:
<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
and I just want to get the text between <div xml:lang="unknown"> and </div>.
So I've tried this code:
import os, re
html = open("2.xml", "r")
text = html.read()
lon = re.compile(r'<div xml:lang="unknown">\n(.+)\n</div>', re.MULTILINE)
lon = lon.search(text).group(1)
print lon
but it doesn't seem to work.
1) Don't parse XML with regex. It just doesn't work. Use an XML parser.
2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.
3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? operator.
lon = re.compile(r'<div xml:lang="unknown">\n(.+?)\n</div>', re.DOTALL)
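A minimal sketch of those two fixes applied to the question's code (assuming text already holds the file contents):
import re

lon = re.compile(r'<div xml:lang="unknown">\n(.+?)\n</div>', re.DOTALL)
m = lon.search(text)
if m:
    # prints everything between the opening div line and the closing </div>
    print(m.group(1))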
You can parse a block of code like this: when you are inside the block, set a flag to True, and when you leave it, set the flag to False and break out.
def get_infobox(self):
    """returns Infobox wikitext from text blob
    learning from https://github.com/siznax/wptools/blob/master/wp_infobox.py
    """
    if self._rawtext:
        text = self._rawtext
    else:
        text = self.get_rawtext()
    output = []
    region = False
    braces = 0
    lines = text.split("\n")
    if len(lines) < 3:
        raise RuntimeError("too few lines!")
    for line in lines:
        match = re.search(r'(?im){{[^{]*box$', line)
        braces += len(re.findall(r'{{', line))
        braces -= len(re.findall(r'}}', line))
        if match:
            region = True
        if region:
            output.append(line.lstrip())
            if braces <= 0:
                region = False
                break
    self._infobox = "\n".join(output)
    assert self._infobox
    return self._infobox
You can try splitting on the div and just matching on the list item. This works well for regexes on large data as well.
import re
html = """<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
"""
for div in html.split('<div'):
m = re.search(r'xml:lang="unknown">.+(<p[^<]+)', div, re.DOTALL)
if m:
print m.group(1)
Given a string like
"<p> >this line starts with an arrow <br /> this line does not </p>"
or
"<p> >this line starts with an arrow </p> <p> this line does not </p>"
How can I find the lines that start with an arrow and surround them with a div?
So that it becomes:
"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"
Since it is an HTML you are parsing, use the tool for the job - an HTML parser, like BeautifulSoup.
Use find_all() to find all text nodes that start with > and wrap() them with a new div tag:
from bs4 import BeautifulSoup
data = "<p> >this line starts with an arrow <br /> this line does not </p>"
soup = BeautifulSoup(data)
for item in soup.find_all(text=lambda x: x.strip().startswith('>')):
item.wrap(soup.new_tag('div'))
print soup.prettify()
Prints:
<p>
<div>
>this line starts with an arrow
</div>
<br/>
this line does not
</p>
You can try the >\s+(>.*?)< regex pattern.
import re
regex = re.compile("\\>\\s{1,}(\\>.{0,}?)\\<")
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches
and replace the matched group with <div> matched_group </div>. The pattern looks for anything that is enclosed between "> >" and "<".
Here is a demo on debuggex.
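One hedged way to do that substitution in Python (the exact spacing around the inserted div is an assumption):
import re

s = "<p> >this line starts with an arrow <br /> this line does not </p>"
pattern = re.compile(r">\s+(>.*?)<")
# keep the leading "> ", wrap the captured part in a div, and restore the "<"
result = pattern.sub(lambda m: "> <div> %s </div> <" % m.group(1).strip(), s)
print(result)
# <p> <div> >this line starts with an arrow </div> <br /> this line does not </p>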
You could try this regex,
>(\w[^<]*)
DEMO
The Python code would be:
>>> import re
>>> str = '"<p> >this line starts with an arrow <br /> this line does not </p>"'
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", str)
>>> m
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'