I have a Python script that uses BeautifulSoup to scrape a property sales website.
I am trying to get the number of beds from the HTML.
The data-reactid changes for each listing in the search results; the number $11606747 is unique to a listing.
I am trying to do a wildcard search for "*$beds.0.0" to return the number of beds (3 in the example).
There is no error message; the code runs but doesn't return the number.
What am I doing wrong?
The HTML:
<div class="property-features is-regular listing-result__features" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2"><span class="property-feature__feature" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds"><span class="property-feature__feature-text-container" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0"><span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">3</span><span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.1"> </span><span class="property-features__feature-text" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">Beds</span></span>
The Python code
beds = listing.findAll('span',{"data-reactid":re.compile('*$beds.0.0')})
You can try a CSS attribute selector like this to get the bed count:
content='''
<html>
<body>
<div class="property-features is-regular listing-result__features" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2">
<span class="property-feature__feature" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds">
<span class="property-feature__feature-text-container" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0">
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">
3
</span>
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.1">
</span>
<span class="property-features__feature-text" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">
Beds
</span>
</span>
</span>
</div>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
item = soup.select("div span[data-reactid*='$11606747']")[0].text
print(' '.join(item.split()))
Result:
3 Beds
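A regex has no leading "*" wildcard: BeautifulSoup matches a compiled pattern with search(), so the pattern already matches anywhere in the attribute value. You only need to escape $ and ., because they are special in regex, and you can anchor the pattern to the end of the value:
re.compile(r'\$beds\.0\.0$')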
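As a minimal sketch of the regex approach, here is the pattern applied to a trimmed copy of the HTML from the question. The variable names (html, pattern, beds) are made up for illustration, and html.parser is used only so the example has no extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = '''
<div class="property-features" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2">
<span class="property-feature__feature" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds">
<span class="property-feature__feature-text-container" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0">
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">3</span>
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.1"> </span>
<span class="property-features__feature-text" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">Beds</span>
</span>
</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# "$" and "." are escaped so they match literally; the trailing "$"
# anchors the match to the end of the attribute value.
pattern = re.compile(r"\$beds\.0\.0$")

beds = [span.get_text(strip=True)
        for span in soup.find_all("span", {"data-reactid": pattern})]
print(beds)  # ['3']
```

Only the span whose data-reactid ends in $beds.0.0 matches, so the listing-specific prefix never has to be known in advance.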
I'm trying to scrape text from a website to put directly into an Excel sheet, rather than copying and pasting. The website uses HTML to include information about the original typeface. This is an example of how one line of text is coded on the page:
<div class="line">
<span class="milestone_wrap"> </span>
<a id="tln-2212" href="index.html#tln-2212" class="milestone tln invisible" title="TLN: 2212">2212</a>
<span class="milestone_wrap">When </span>
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">uch ill dealing mu</span>
<span class="ligature" data-precomposed="ſt">
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">t</span>
</span>
<span class="milestone_wrap"> be </span>
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">eene in thought. </span>
<span class="sd exit">
<span class="space" style="padding-right:1em;" xml:space="preserve"></span>
<i>Exit</i>
<span class="milestone_wrap">.</span>
</span>
</div>
I have tried using the find_all method
import requests
from bs4 import BeautifulSoup as bs
url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')
divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)
This is what I get back:
When
<span class="typeform" data-setting="ſ">s</span>
uch ill dealing mu
<span class="ligature" data-precomposed="ſt"><span class="typeform" data-setting="ſ">s</span>t</span>
be
<span class="typeform" data-setting="ſ">s</span>
eene in thought.
<span class="sd exit"><span class="space" style="padding-right:1em;" xml:space="preserve"> </span><i>Exit</i>.</span>
Everything with the tag <span class="milestone_wrap"> appears without the tag: therefore, when I use .find_all for 'span', these strings then don't come up and so I'm left with random letters. Is there a reason why that class isn't appearing?
Work at the level of the line class, but decompose the a tags to remove the line numbers. If you really want the line numbers, I would instead add a space between them and the following text:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html')
soup = bs(r.content, 'lxml')
for line in soup.select('.line'):
    line.select_one('a').decompose()
    print(line.text)
If you run your code with a small adjustment (the requests module must be imported), you should get the content of the site.
from bs4 import BeautifulSoup as bs
import requests
url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')
divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)
The text can be found within the <span class="milestone_wrap"> tags. You can check this with the inspector of your browser. The text is delivered in small portions tag by tag, e.g. "Which in a". You should be able to extract the text.
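To illustrate how the small portions join back into a readable line, here is a minimal offline sketch. It assumes a compacted copy of the line markup from the question (the stage direction is left out) and uses get_text() on the whole div rather than looking at individual spans; the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Compacted copy of one line from the question, without the stage direction.
html = (
    '<div class="line">'
    '<a id="tln-2212" class="milestone tln invisible" title="TLN: 2212">2212</a>'
    '<span class="milestone_wrap">When </span>'
    '<span class="typeform" data-setting="ſ">s</span>'
    '<span class="milestone_wrap">uch ill dealing mu</span>'
    '<span class="ligature" data-precomposed="ſt">'
    '<span class="typeform" data-setting="ſ">s</span>'
    '<span class="milestone_wrap">t</span>'
    '</span>'
    '<span class="milestone_wrap"> be </span>'
    '<span class="typeform" data-setting="ſ">s</span>'
    '<span class="milestone_wrap">eene in thought. </span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
line = soup.select_one("div.line")

# Drop the line-number anchor if this line has one.
number = line.select_one("a")
if number is not None:
    number.decompose()

# get_text() concatenates every nested text fragment in document order.
text = line.get_text().strip()
print(text)  # When such ill dealing must be seene in thought.
```

Reading the text from the enclosing div avoids having to care which fragments sit in milestone_wrap spans and which sit in typeform spans.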
I'm trying to parse the following HTML code in Python using BeautifulSoup. I would like to search for the text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags, so that for a given text category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tag
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it but I can't find the right code that will enable me to find the text inside one tag and return the text in the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text
color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. The only thing left is to get the text of the second <span>, which is what key_tag.find_all('span')[1].text does.
Give this a go. It can also give you the corresponding values. To see how it works, wrap the HTML elements in triple quotes and assign them to a content variable (content = """ ... """).
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
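If you want every category at once rather than one lookup per key, a possible variation is to build a dictionary from all the _JDu rows. This is only a sketch over a shortened copy of the HTML from the question; the details name is made up for illustration, and it assumes each row holds the label span followed by the value span.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<div class="section-inner">
  <div class="_UCu">
    <h3 class="_mEu">General</h3>
    <div class="_JDu">
      <span class="_IDu">Color</span>
      <span class="_KDu">Slate, mykonos</span>
    </div>
  </div>
  <div class="_UCu">
    <h3 class="_mEu">Carrying Case</h3>
    <div class="_JDu">
      <span class="_IDu">Type</span>
      <span class="_KDu">Protective cover</span>
    </div>
    <div class="_JDu">
      <span class="_IDu">Protection</span>
      <span class="_KDu">Impact protection</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for row in soup.select("div._JDu"):
    spans = row.find_all("span")
    # Skip rows that do not have both a label and a value.
    if len(spans) >= 2:
        label = spans[0].get_text(strip=True)
        details[label] = spans[1].get_text(strip=True)

print(details["Color"])  # Slate, mykonos
```

A later lookup such as details.get("Protection") then costs no further parsing, and a missing key returns None instead of raising.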
I'm quite new to programming, and to OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's my first approach:
I need to fetch the data from this page: http://europa.eu/youth/volunteering/evs-organisation_en
First, I view the page source to find the HTML elements:
view-source:https://europa.eu/youth/volunteering/evs-organisation_en
Note: I need to fetch the data that comes right below this line:
EVS accredited organisations search results: 6066
I chose BeautifulSoup for this job, since it is very powerful.
I use find_all:
soup.find_all('p')[0].get_text() # Searching for tags by class and id
Note: Classes and IDs are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify specific elements we want to scrape.
See the class:
<div class="col-md-4">
<div class="vp ey_block block-is-flex">
<div class="ey_inner_block">
<h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
<p class="ey_info">
<i class="fa fa-location-arrow fa-lg"></i>
Zaporizhzhya, <strong>Ukraine</strong>
</p> <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
<p><strong>PIC no:</strong> 935175449</p>
<div class="empty-block">
Read more </div>
</div>
so this leads to:
# import libraries
import requests
from bs4 import BeautifulSoup
page = requests.get("https://europa.eu/youth/volunteering/evs-organisation_en")
soup = BeautifulSoup(page.content, 'html.parser')
soup
Now we can use the find_all method to search for items by class or by id. In the example below, we'll search for any tag that has the class col-md-4:
<div class="col-md-4">
so we choose:
soup.find_all(class_="col-md-4")
Now I have to combine it all.
Update, my approach so far:
I have extracted data wrapped within multiple HTML tags from a webpage using BeautifulSoup4. I want to store all of the extracted data in a list. To be more concrete: I want each piece of extracted data as a separate list element, separated by commas (i.e. CSV-formatted).
To start from the beginning, here is the HTML content structure:
<div class="view-content">
<div class="row is-flex"></span>
<div class="col-md-4"></span>
<div class </span>
<div class= >
<h4 Data 1 </span>
<div class= Data 2</span>
<p class=
<i class=
<strong>Data 3 </span>
</p> <p class= Data 4 </span>
<p class= Data 5 </span>
<p><strong>Data 6</span>
<div class=</span>
<a href="Data 7</span>
</div>
</div>
Code to extract:
for data in elem.find_all('span', class_=""):
This should give an output:
data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)
Output:
[' Data 1 ', ' Data 2 ', ' Data 3 ' and so forth]
Question: I need help with the extraction part.
Try this, which keeps every non-empty text node in the document:
data = [ele for ele in soup.find_all(text=True) if ele.strip() != '']
print(data)
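To go from a flat list of strings to one CSV row per organisation, something along these lines might work. It is a minimal sketch that assumes the card markup shown earlier in the question (one div.col-md-4 per organisation, an h4 for the name and p tags for the other fields); the rows and buffer names are made up, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# One organisation card, copied from the question (the "Read more" block is omitted).
html = '''
<div class="col-md-4">
  <div class="vp ey_block block-is-flex">
    <div class="ey_inner_block">
      <h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
      <p class="ey_info"><i class="fa fa-location-arrow fa-lg"></i> Zaporizhzhya, <strong>Ukraine</strong></p>
      <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
      <p><strong>PIC no:</strong> 935175449</p>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select("div.col-md-4"):
    name = card.find("h4").get_text(strip=True)
    # Collapse runs of whitespace inside each paragraph.
    fields = [" ".join(p.get_text().split()) for p in card.find_all("p")]
    rows.append([name] + fields)

# csv.writer takes care of quoting fields that contain commas or quotes.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Letting the csv module do the quoting matters here, because both the organisation name and the location contain characters that would break a hand-joined comma-separated line.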
Sorry, I have asked a similar question before (How can i crawl web data that not in tags), but I still have a problem with data that is not inside a tag.
This question is slightly different from the one I asked.
<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>
I only need
I am here
The string is a child of the main div of type NavigableString, so you can loop through div.children and filter based on the type of the node:
from bs4 import BeautifulSoup, NavigableString
[x.strip() for x in soup.find("div", {'id': 'main-content'}).children if isinstance(x, NavigableString) and x.strip()]
# [u'I am here']
Data:
soup = BeautifulSoup("""<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>""", "html.parser")
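soup = BeautifulSoup(that_html, "html.parser")
div_tag = soup.find("div", id="main-content")
# .string is None when a tag has more than one child, so read the
# div's direct text nodes instead and take the last one
required_string = div_tag.find_all(string=True, recursive=False)[-1].strip()
Go through the BeautifulSoup documentation for details.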
I want to extract excerpts of data like company name and address from a website using BeautifulSoup. I am getting, however, the following failure:
Calgary's Notary Public
Traceback (most recent call last):
File "test.py", line 16, in <module>
print item.find_all(class_='jsMapBubbleAddress').text
AttributeError: 'ResultSet' object has no attribute 'text'
The HTML code snippet is below. I want to extract all the text information and convert it into a CSV file. Can anyone please help me?
<div class="listing__right article hasIcon">
<h3 class="listing__name jsMapBubbleName" itemprop="name"><a data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"busname","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true" title="See detailed information for Calgary's Notary Public">Calgary's Notary Public</a> </h3>
<div class="listing__address address mainLocal">
<em class="itemCounter">1</em>
<span class="listing__address--full" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="jsMapBubbleAddress" itemprop="streetAddress">340-600 Crowfoot Cres NW</span>, <span class="jsMapBubbleAddress" itemprop="addressLocality">Calgary</span>, <span class="jsMapBubbleAddress" itemprop="addressRegion">AB</span> <span class="jsMapBubbleAddress" itemprop="postalCode">T3G 0B4</span></span>
<a class="listing__direction" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1a","lk_relevancy":"1","lk_name":"directions-step1","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/merchant/directions/100971374?what=Notary&where=Calgary%2C+AB&useContext=true" rel="nofollow" title="Get direction to Calgary's Notary Public">Get directions »</a>
</div>
<div class="listing__details">
<p class="listing__details__teaser" itemprop="description">We offer you a convenient, quick and affordable solution for your Notary Public or Commissioner for Oaths in Calgary needs.</p>
</div>
<div class="listing__ratings--root">
<div class="listing__ratings ratingWarp" itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating">
<meta content="5" itemprop="ratingValue"/>
<meta content="1" itemprop="ratingCount"/>
<span class="ypStars" data-analytics-group="stars" data-clicksent="false" data-rating="rating5" title="Ratings: 5 out of 5 stars">
<span class="star1" data-analytics-name="stars" data-label="Optional : Why did you hate it?" title="I hated it"></span>
<span class="star2" data-analytics-name="stars" data-label="Optional : Why didn't you like it?" title="I didn't like it"></span>
<span class="star3" data-analytics-name="stars" data-label="Optional : Why did you like it?" title="I liked it"></span>
<span class="star4" data-analytics-name="stars" data-label="Optional : Why did you really like it?" title="I really liked it"></span>
<span class="star5" data-analytics-name="stars" data-label="Optional : Why did you love it?" title="I loved it"></span>
</span><a class="listing__ratings__count" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"read_yp_reviews","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true#ypgReviewsHeader" rel="nofollow" title="1 of Review for Calgary's Notary Public">1<span class="hidden-phone"> YP review</span></a>
</div>
</div>
<div class="listing__details detailsWrap">
<ul>
<li>Notaries
,
</li>
<li>Notaries Public</li>
</ul>
</div>
</div>
There are many divs with the classes listing__right article hasIcon. I am using a for loop to extract the information.
The Python code I have written so far is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content)
g_data=soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print item.find('h3').text
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    print item.find_all(class_='jsMapBubbleAddress').text
find_all returns a ResultSet (a list of tags), which has no 'text' attribute, so you get an error. Loop over the results and read .text from each tag instead. I'm not sure what output you are looking for, but this code seems to work:
import requests
from bs4 import BeautifulSoup
url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content,"lxml")
g_data=soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print(item.find('h3').text)
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    # use a separate name so the inner loop does not shadow the outer item
    address_parts = item.find_all(class_='jsMapBubbleAddress')
    for part in address_parts:
        print(part.text)
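Since the goal is a CSV file, one possible next step is to collect one row per listing and hand the rows to the csv module. The sketch below is written for Python 3 and runs offline on a trimmed copy of the listing HTML from the question; keying the address parts by their itemprop attribute and the rows and buffer names are choices made for this example, not something the site guarantees.

```python
import csv
import io
from bs4 import BeautifulSoup

# Trimmed copy of one listing from the question.
html = """
<div class="listing__right article hasIcon">
  <h3 class="listing__name jsMapBubbleName" itemprop="name"><a href="#">Calgary's Notary Public</a> </h3>
  <div class="listing__address address mainLocal">
    <span class="listing__address--full" itemprop="address">
      <span class="jsMapBubbleAddress" itemprop="streetAddress">340-600 Crowfoot Cres NW</span>,
      <span class="jsMapBubbleAddress" itemprop="addressLocality">Calgary</span>,
      <span class="jsMapBubbleAddress" itemprop="addressRegion">AB</span>
      <span class="jsMapBubbleAddress" itemprop="postalCode">T3G 0B4</span>
    </span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
columns = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]

rows = []
for listing in soup.select("div.listing__right"):
    name = listing.find("h3").get_text(strip=True)
    # Map each address part to its itemprop so missing parts become empty cells.
    parts = {span.get("itemprop"): span.get_text(strip=True)
             for span in listing.find_all(class_="jsMapBubbleAddress")}
    rows.append([name] + [parts.get(col, "") for col in columns])

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name"] + columns)
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file, replace the StringIO buffer with open("listings.csv", "w", newline="") and pass that file object to csv.writer.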