I'm trying to extract tuples from an url and I've managed to extract string text and tuples using the re.search(pattern_str, text_str). However, I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
text_above = "..." # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of tuples:
# supposed to print "11111 -> blah blah 111"
print(t[0], '->', t[1])
Maybe I'm trying something weird & impossible, maybe its better to extract the data using primitive string manipulations... But in case there exists a solution?
Your regex does not take into account the whitespace (indentation) between \n and <span. (And neither the whitespace at the start of the line you want to capture, but that's not as much of a problem.) To fix it, you could add some \s*:
pat_str = '<a href="(\d+)">\n\s*(.+)\n\s*<span class'
As suggested in the comments, use a html parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h)
You can get the href and the previous_sibling to the span:
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or with .find(text=True) to only get the tag text and not from the children.
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also if you just want the anchors inside the list tags, you can specifically parse those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]
Related
I'm trying to scrape data from a listing website with the following html structure
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-building_size="31" data-category="condominium" data-condominiumname="Twin Lakes Countrywoods" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>
<div class="ListingCell-TitleWrapper">
<h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<a class="js-listing-link" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
</a>
</h3>
<div class="ListingCell-KeyInfo-address ellipsis">
<a class="js-listing-link ellipsis" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<span class="icon-pin">
</span>
<span>
Tagaytay Hi-Way
Dayap Itaas, Laurel
</span>
</a>
</div>
What I want to get is the info beside <div class="ListingCell-AllInfo ListingUnit"... which are data-bathrooms, data-bedrooms, data-block, etc.
I tried to scrape it using Python BeautifulSoup
details = container.find('div',class_="ListingCell-AllInfo ListingUnit").text if container.find('div',class_="ListingCell-AllInfo ListingUnit") else "-"
It's been returning "-" for all listings. Complete newbie here!
You can use Beautiful soup that would be better it has alway worked for me .
req = Request("put your url here",headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage)
title = soup.find_all('tag you want to scrape', class_='class of that tag')
visit the link for more info : https://pypi.org/project/beautifulsoup4/
there ! You could use regular expression to solve your issue
I have introduced a few comment in my solution but for more information,
take a look at the official documentation
or read this
import re # regular expression module
txt = """insert your html here"""
# we create a regex patern called p1 and this that will match a string starting with
# <div class="ListingCell-AllInfo ListingUnit"
# following by anything (any character) found 0 or more times
# and the string must end by '>'
p1 = re.compile(r'<div class="ListingCell-AllInfo ListingUnit".*>')
# findall return a list of strings that matches the patern p1 in txt
ls = p1.findall(txt)
# now, what you want is the data, so we can create another patern where the word
# "data" will be found
# match string starting with data following by '-' then by 0 or more alphanumeric char
# then with '=' then with any character found in after the '=' that is not not
# a space, a tab
p2 = re.compile(r'(data-\w*=\S*)')
data = p2.findall(ls[0])
print(data)
Note : Don't be scared by the funky symbols they look way worse than what they truly are
I have a webpage with following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse the output such that the result will be extracting words like: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin
Can anyone please help with this
You can use .findAll() to get all the li elements and use find() 'a' and 'i' tag
for item in soup.findAll('li'):
print(item.find('a').text,item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
Try simplified_scrapy's solution, its fault tolerance
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]
Hey Im currently trying to parse through a website and I'm almost done, but there's a little problem. I wannt to exclude inner tags from a html code
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1
Any suggestions?
sincierly :)
Excluding inner tags would also exclude Text 1 because it's in an inner tag <strong>.
You can however just find strong inside of your current soup:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
soup = BeautifulSoup(html)
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1
I have few tags like
<span attrib="5_5"> <font size="3">Text:Hello World </font> </span>
<span attrib="5_5"> <font size="1">Text_Hello New World </font> </span>
At the same time , some folks didn't want font ... so they didnt have it
<span attrib="5_5"> Text:Hello World </span>
<span attrib="5_5"> Text_Hello New World </span>
I need to convert all these to
<font size="3">Test_Hello_World_5_5</font>
<font size="1">Text_Hello_New_World_5_5</font
How do I do this in BeautifulSoup ? I can do a regex and replace text but I lose the fonts. I need to retain the children and in the same loop relace the inner text with regex . Can anyone tell me how to do this ? Basically I want a each.replaceWithChildren and then change each.text ... in the SAME LOOP since I can't lose the context . 5_5 is a number that comes from the attribute of the parent span .
In pseudo code I want something like :
span is a beautiful soup collection of all the span tags.
for each in span:
span.replaceWithChildren()
each.text = something
something like this:
for x in doc.findAll('span'):
s = x["attrib"]
t = x.find('font')
t.string = t.text.strip() + '_' + s
x.replaceWithChildren()
update:
t = x.find('font')
if not t:
x.string += s
else:
t.string += s
x.replaceWithChildren()
I am using Beautiful Soup to parse a html to find all text that is
1.Not contained inside any anchor elements
I came up with this code which finds all links within href but not the other way around.
How can I modify this code to get only plain text using Beautiful Soup, so that I can do some find and replace and modify the soup?
for a in soup.findAll('a',href=True):
print a['href']
EDIT:
Example:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this operation to find text not within <a></a> : then find "Identify" and do replace operation with "Replaced"
So the final output will be like this:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Repalced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time !
If I understand you correct, you want to get the text that is inside an a element that contains an href attribute. If you want to get the text of the element, you can use the .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('this is some text')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text elements, with identified in them:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check if the parent is an a element you can do txt.parent.name == 'a'.
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
if re.search('identi',txt,re.I) and txt.parent.name != 'a':
newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
txt.replaceWith(newtext)
print(soup)
<html><body>
<div> test1 </div>
<div><br /></div>
<div>test2</div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>