How to find text with a particular value with BeautifulSoup (Python 2.7)

I have the following HTML. I'm trying to get the following values saved as variables: Available Now, 7, 148.89, Hatchback, Good. The problem I'm running into is that I'm not able to pull them out independently, since they don't have a class attached to them. I'm wondering how to solve this. Below is the HTML, followed by my futile code.
</div>
<div class="car-profile-info">
<div class="col-md-12 no-padding">
<div class="col-md-6 no-padding">
<strong>Status:</strong> <span class="statusAvail"> Available Now </span><br/>
<strong>Min. Booking </strong>7 Days ($148.89)<br/>
<strong>Style: </strong>Hatchback<br/>
<strong>Transmission: </strong>Automatic<br/>
<strong>Condition: </strong>Good<br/>
</div>
Python 2.7 code (this gives me the entire html!):
soup = BeautifulSoup(html)
print soup.find("span", {"class": "statusAvail"}).getText()
for i in soup.select("strong"):
    if i.getText() == "Min. Booking ":
        print i.parent.getText().replace("Min. Booking ", "")

Find all the strong elements under the div element with class="car-profile-info" and, for each element found, get the .next_siblings until you meet the br element:
from bs4 import BeautifulSoup, Tag

for strong in soup.select(".car-profile-info strong"):
    label = strong.get_text()
    value = ""
    for elm in strong.next_siblings:
        if getattr(elm, "name", None) == "br":
            break
        if isinstance(elm, Tag):
            value += elm.get_text(strip=True)
        else:
            value += elm.strip()
    print(label, value)
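To actually save the values the question asks for, a variation of the same loop can collect label/value pairs into a dict. This is only a sketch, assuming the question's snippet is already stored in a variable html:
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "html.parser")  # `html` is assumed to hold the snippet from the question
details = {}
for strong in soup.select(".car-profile-info strong"):
    label = strong.get_text(strip=True).rstrip(":")  # "Status:" -> "Status", "Min. Booking" stays as-is
    value = ""
    for elm in strong.next_siblings:
        if getattr(elm, "name", None) == "br":
            break
        value += elm.get_text(strip=True) if isinstance(elm, Tag) else elm.strip()
    details[label] = value
print(details.get("Status"))        # Available Now
print(details.get("Min. Booking"))  # 7 Days ($148.89)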

You can use ".next_sibling" to navigate to the text you want like this:
for i in soup.select("strong"):
    if i.get_text(strip=True) == "Min. Booking":
        print(i.next_sibling)  # this will print: 7 Days ($148.89)
See also http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways

Related

How to get the value of an element within a tag?

<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
I want to get the values of data-account and data-id which are elements within play-js tag.
I tried the following, but I couldn't get the values:
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
With JavaScript, I was able to get the values with the code below.
var elems = document.querySelectorAll('.player__AAtt play-js');
console.log(elems[0].dataset.account)
console.log(elems[0].dataset.id)
How can I get the value of an element within a tag rather than the tag itself?
You can use the .get_attribute() method:
elemname = driver.find_element_by_xpath('xpath.../div/play-js')  # find_element (singular) returns one WebElement
elemname.get_attribute("data-account")
elemname.get_attribute("data-id")
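If you keep find_elements (plural), it returns a list of WebElements, so you index or loop over it. A minimal sketch, mirroring the JavaScript querySelectorAll call from the question:
# find_elements returns a list; read both data-* attributes from each match
elems = driver.find_elements_by_css_selector('.player__AAtt play-js')
for elem in elems:
    print(elem.get_attribute("data-account"))  # 1234567890
    print(elem.get_attribute("data-id"))       # 32667_32797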
In Python, BeautifulSoup is mostly used for parsing HTML pages. Here is code that will get the values of all the play-js elements in the provided HTML:
from bs4 import BeautifulSoup
res_page = """
<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
"""
soup = BeautifulSoup(res_page, 'html.parser')
output_list = soup.find_all('play-js')
data_account_list = [data_account['data-account'] for data_account in output_list]
data_id_list = [data_id['data-id'] for data_id in output_list]
print(data_account_list)
print(data_id_list)
The output is:
['1234567890']
['32667_32797']
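If you want each account paired with its id instead of two separate lists, a small follow-up sketch using the same output_list:
for tag in output_list:
    print(tag['data-account'], tag['data-id'])  # 1234567890 32667_32797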

Python beautifulsoup print by line #

Okay, so I'm currently using Python BeautifulSoup to output a specific line from an HTML file. Since the HTML contains multiple divs with the same class, it outputs every div containing that class. Example of this:
CONTENT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
OUTPUT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
Now I only want #2 of div class border,
<div class=border>example</a>
Now, if I view the source within Chrome, it shows the content in numbered lines, so line 1 will contain
<div class=border>aaaa</a>
and line 2 will contain
<div class=border>example</a>
Is it possible to output by line number using BeautifulSoup?
find_all returns a list, so you can index it with [1] to get the second element.
from bs4 import BeautifulSoup
html_doc = """<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(class_="border")[1]
returns
<div class="border">example</div>
If you have a list with, say, 200 elements generated by soup.find_all, and the list is called div_list, you could just do an index loop (you want indices 1, 4, 7, etc.):
count = 1
while True:
    try:
        print(div_list[count])
        count += 3
    except IndexError:
        # stop once the index runs past the end of the list
        break
Or even shorter:
count = 1
while count < len(div_list):
    print(div_list[count])
    count += 3
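A list slice gives the same every-third-element selection without a manual counter (a sketch, assuming div_list as above):
# indices 1, 4, 7, ...: start at 1, step by 3
for div in div_list[1::3]:
    print(div)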

Python Selenium - iterate through search results

In the search results, I need to verify that all of them contain the search key. This is the HTML source code:
<div id="content">
<h1>Search Results</h1>
<a id="main-content" tabindex="-1"></a>
<ul class="results">
<li><img alt="Icon for Metropolitan trains" title="Metropolitan trains" src="themes/transport-site/images/jp/iconTrain.png" class="resultIcon"/> <strong>Stop</strong> Sunshine Railway Station (Sunshine)</li>
<li><img alt="Icon for Metropolitan trains" title="Metropolitan trains" src="themes/transport-site/images/jp/iconTrain.png" class="resultIcon"/> <strong>Stop</strong> Albion Railway Station (Sunshine North)</li>
</ul>
</div>
I have written this code to enter the search key and get the results, but it fails to loop through the search results:
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/lovea/OneDrive/Documents/Semester 2 2016/ISYS1087/w3-4/chromedriver')
driver.get('http://www.ptv.vic.gov.au')
next5Element = driver.find_element_by_link_text('Next 5 departures')
next5Element.click()
searchBox = driver.find_element_by_id('Form_ModeSearchForm_Search')
searchBox.click()
searchBox.clear()
searchBox.send_keys('Sunshine')
submitBtn = driver.find_element_by_id('Form_ModeSearchForm_action_doModeSearch')
submitBtn.click()
assert "Sorry, there were no results for your search." not in driver.page_source
results = driver.find_elements_by_xpath("//ul[@class='results']/li/a")
for result in results:
    assert "Sunshine" in result  # Error: argument of type 'WebElement' is not iterable
Anyone please tell me what is the proper way to to that? Thank you!
You should check whether the text value of the particular element contains the key string, not the element itself, so try:
for result in results:
    assert "Sunshine" in result.text
I made a mistake in the assert statement.
Because result is a WebElement, there is no text to look up in it directly.
I just changed it to: assert "Sunshine" in result.text
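Putting the fix together (a sketch that keeps the question's locator unchanged; the extra assert guards against an empty result list, which would otherwise pass vacuously):
results = driver.find_elements_by_xpath("//ul[@class='results']/li/a")
assert results, "no search results found"
for result in results:
    assert "Sunshine" in result.text  # the visible text of each result must contain the search key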

Select text from either a DIV or an underlying container, if it exists

I have a div table where each row has two cells/columns.
The second cell/column sometimes has plain text (<div class="something">Text</div>), while sometimes the text is hidden within an "a" tag inside: <div class="something"><a href="...">Text</a></div>.
Now, I have no problem getting everything but the linked text. I can also get the linked text separately, but I don't know how to get everything at once, so that I get three columns of data:
1. first column text,
2. second column text, whether it is linked or not,
3. link, if it exists
The code that extracts everything not linked and works is:
times = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " time ")]/text()').extract()
titles = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " name ")]/text()').extract()
for time, title in zip(times, titles):
    print time.strip(), title.strip()
I can get the linked items only with
ltitles = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " name ")]/a/text()').extract()
for ltitle in ltitles:
    print ltitle.strip()
But I don't know how to combine the queries to get everything together.
Here's a sample HTML:
<div class="programRow rowOdd">
<div class="time ColorVesti">
22:55
</div>
<div class="name">
Dnevnik
</div>
</div>
<div class="programRow rowEven">
<div class="time ColorOstalo">
23:15
</div>
<div class="name">
<a class="recnik" href="/page/tv/sr/story/20/rts-1/2434373/kulturni-dnevnik.html" rel="/ajax/storyToolTip.jsp?id=2434373">Kulturni dnevnik</a>
</div>
</div>
Sample output (one I cannot get):
22:55, Dnevnik, []
23:15, Kulturni dnevnik, /page/tv/sr/story/20/rts-1/2434373/kulturni-dnevnik.html
I either get the first two columns (without the linked text) or just the linked text with the code samples above.
If I understand you correctly, then you should probably just iterate through the program nodes and create an item on every cycle. Also, there's an XPath shortcut //text() which captures all text under the node and its children.
Try something like:
programs = response.xpath("//div[contains(@class,'programRow')]")
for program in programs:
    item = dict()
    item['name'] = program.xpath(".//div[contains(@class,'name')]//text()").extract_first()
    item['link'] = program.xpath(".//div[contains(@class,'name')]/a/@href").extract_first()
    item['time'] = program.xpath(".//div[contains(@class,'time')]//text()").extract_first()
    yield item  # yield (not return), so the loop emits one item per program row
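As a standalone check against the sample HTML above, the same selectors can be run through scrapy.Selector directly. A sketch, assuming the sample is stored in a variable sample_html (joining the text nodes so the linked name is not lost):
from scrapy import Selector

sel = Selector(text=sample_html)
for program in sel.xpath("//div[contains(@class,'programRow')]"):
    time = program.xpath(".//div[contains(@class,'time')]/text()").extract_first(default="").strip()
    name = "".join(program.xpath(".//div[contains(@class,'name')]//text()").extract()).strip()
    link = program.xpath(".//div[contains(@class,'name')]/a/@href").extract_first()
    print("{}, {}, {}".format(time, name, link))  # e.g. 23:15, Kulturni dnevnik, /page/tv/sr/story/...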

Parse html in python using beautifulsoup

How do I get output as below from the given HTML page?
html_string = '''<td class="status_icon" rowspan="2"><img alt="QUEUED" src="images/arts/status_QUEUED.png" style="border:none" title="QUEUED"/></td>
<td class="test"> v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
    <div class="start">(04.02) 23:29</div>
    <div class="end">~
        <span style="color:green"> () </span>
    </div>
</td>
<td>mcordeix</td>
<td>1614809</td>
<td>0/0/0 of 0</td>
<td>high</td>
<td style="white-space:nowrap"><img class="pbar" src="images/arts/bar_green.gif" style="border-right:2px;border-right-style:solid;border-right-color:#ffffff" width="1%"/><img class="pbar" src="images/arts/bar_gray.gif" width="99%"/></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td colspan="4">
<!-- Florent Vial: this can be alway shown if admin=1 -->
XML
Raw XML
CINFO
</td>
<td></td>
<td><!-- <script type="text/javascript">DIVShowHideDetails('func:DoPrintArtsDetails')</script> --> </td>
<td></td>
<td></td>
<td></td>
<td></td>
'''
Expected output:
-------
Status="QUEUED"
test=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
start=(04.02) 23:29
end=~
user=mcordeix
Welcome to StackOverflow!
Please read the How to ask a question section of our FAQ.
Explain how you encountered the problem you're trying to solve, and any difficulties that have prevented you from solving it yourself.
What have you tried to solve this problem so far?
Let's give you a start.
All you're gonna need are the find and find_all functions.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')
status = soup.find('img').get('alt')  # get the 'alt' content of the first <img> tag
# find the first <td> tag with class="test", get its content, split it on whitespace,
version = soup.find('td', class_='test').text.split()[0]  # and get the first substring
time_start = soup.find('div', class_='start').text
time_end = soup.find('div', class_='end').text
user = soup.find_all('td')[2].text  # get the third <td>'s content
print status  # QUEUED
print version  # v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
print time_start  # (04.02) 23:29
print time_end  # "~" plus the span's " () " and surrounding whitespace
print user  # mcordeix
That's just from reading the bs4 documentation for about 10 minutes and trying it yourself.
Just open the Python interpreter, assign the html_string variable, import the BeautifulSoup library, and try.
I'm sure you could work out the problem left with the time_end content yourself. It's not that hard.
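For reference, one way to clean up time_end is to take only the div's own leading text node rather than all of its text. A sketch, assuming the same soup as above:
time_end = soup.find('div', class_='end').contents[0].strip()  # the first child is the "~" text node
print time_end  # ~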
