I'm trying to get the text "Something here I want to get" from inside a div element in an HTML file, using Python and BeautifulSoup.
This is what the relevant part of the HTML looks like:
<div xmlns="" id="idp46819314579224" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #d43f3a; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;" class="" onclick="toggleSection('idp46819314579224-container');" onmouseover="this.style.cursor='pointer'">Something here I want to get<div id="idp46819314579224-toggletext" style="float: right; text-align: center; width: 8px;">
-
</div>
</div>
And this is what I tried:
vu = soup.find_all("div", {"style" : "background: #d43f3a"})
for div in vu:
    print(div.text)
I use a loop because there are several divs with different ids, but all of them have the same background colour. The code runs without errors, but I get no output.
How can I get the text using the background colour as the condition?
The style attribute has other content inside it
style="box-sizing: ....; ....;"
Your current code asks whether style == "background: #d43f3a", which is false because the attribute holds the whole list of declarations.
What you can do instead is ask whether "background: #d43f3a" is in style, i.e. a substring check.
One approach is passing a regular expression.
>>> import re
>>> vu = soup.find_all("div", style=re.compile("background: #d43f3a"))
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get
You can also express the same thing with a CSS selector:
soup.select('div[style*="background: #d43f3a"]')
Or by passing a function/lambda. The function may also be called with None for divs that have no style attribute, so guard against that:
>>> vu = soup.find_all("div", style=lambda style: style and "background: #d43f3a" in style)
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get
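A substring check depends on the exact spacing and letter case used in the style attribute. As a hedged alternative, the sketch below splits the style into declarations and compares the background value directly. The helper names style_declarations and has_background and the sample HTML are made up for illustration; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def style_declarations(style):
    """Split an inline style string into a {property: value} dict (lowercased)."""
    decls = {}
    for part in (style or '').split(';'):
        if ':' in part:
            prop, value = part.split(':', 1)
            decls[prop.strip().lower()] = value.strip().lower()
    return decls


def has_background(tag, colour='#d43f3a'):
    """True for div tags whose inline style sets background to the given colour."""
    return tag.name == 'div' and style_declarations(tag.get('style')).get('background') == colour


# Illustrative markup only, loosely modelled on the question.
html = '''
<div style="box-sizing: border-box; background:#D43F3A; color: #fff;">First title<div style="float: right;">-</div></div>
<div style="background: #ffffff;">Other</div>
<div>No style at all</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Take only the first string that is a direct child, so the nested "-" toggle is skipped.
titles = [(div.find(string=True, recursive=False) or '').strip()
          for div in soup.find_all(has_background)]
print(titles)
```

Passing a function to find_all makes BeautifulSoup call it once per tag, so the missing-style case is handled inside the helper rather than in a lambda.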
I have an SVG file like the following (an example):
<svg
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-ec55d4518d3c43e391ffce0b97c713ab-0-2" stroke-width="2px" d="M420,89.5 C420,2.0 575.0,2.0 575.0,89.5" fill="none" stroke="currentColor"/>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-ec55d4518d3c43e391ffce0b97c713ab-0-2" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">pd</textPath>
</text>
<path class="displacy-arrowhead" d="M575.0,91.5 L583.0,79.5 567.0,79.5" fill="currentColor"/>
</g>
</svg>
I have tried to access what is inside the 'textPath' node using the code below:
import xml.dom.minidom
doc = xml.dom.minidom.parse('my_file.svg')
name = doc.getElementsByTagName('textPath')
for t in name:
    print([x.nodeValue for x in t.childNodes])
However, I would also like to get the other information included in the 'textPath' element, like the values of 'side' or 'fill', but I do not know how to access those.
Just for future reference, I wrote a function based on the links that @Aswath sent in the comments:
from bs4 import BeautifulSoup
def extract_data_from_report3(filename):
    # html.parser lowercases tag names, hence 'textpath' rather than 'textPath'
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    for element in soup.find_all('textpath'):
        print(element.get('side'))
extract_data_from_report3('my_file.svg')
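The same attributes can also be read with xml.dom.minidom, which the question started from, using getAttribute on each element. This is a minimal sketch on an inline SVG string rather than the real file. The function name textpath_attributes is hypothetical, and the sample declares the xlink namespace because minidom rejects an undeclared xlink: prefix.

```python
import xml.dom.minidom

# Small stand-in for the real file; note the xmlns:xlink declaration on <svg>.
SVG = '''<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <text dy="1.25em">
    <textPath xlink:href="#arrow-0" class="displacy-label" side="left" fill="currentColor">pd</textPath>
  </text>
</svg>'''


def textpath_attributes(svg_source):
    """Return the label text plus the 'side' and 'fill' attributes of every textPath."""
    doc = xml.dom.minidom.parseString(svg_source)
    rows = []
    for node in doc.getElementsByTagName('textPath'):
        label = ''.join(child.data for child in node.childNodes
                        if child.nodeType == child.TEXT_NODE)
        rows.append({
            'text': label,
            'side': node.getAttribute('side'),   # '' if the attribute is missing
            'fill': node.getAttribute('fill'),
        })
    return rows


print(textpath_attributes(SVG))
```

For a file on disk, xml.dom.minidom.parse('my_file.svg') would replace parseString, assuming the file is well-formed XML.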
All I am trying to do is select the drop down & then select "Export Excel Spread Sheet".
Example of Drop Down
Code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
url2 = ["https://example.com/reports"]
driver = webdriver.Chrome()
driver.implicitly_wait(15)
driver.get("https://example.com")
for u in url2:
    driver.implicitly_wait(15)
    driver.get(u)
I have tried so many different XPaths and IDs:
#driver.find_element_by_xpath("//a[contains(@class,'dropdown__trigger header-export-menu--toggle-btn')]").click()
#driver.find_element_by_xpath("//li[contains(text(),'Export Excel Spread Sheet')]").click()
#act.click().perform()
#act.click(driver.find_element_by_xpath("//a[contains(@class,'dropdown__trigger header-export-menu--toggle-btn')]")).perform()
#act.move_to_element(driver.find_element_by_xpath("//a[contains(@class,'dropdown__trigger header-export-menu--toggle-btn')]")).perform()
#WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.ID, "Header-Dropdown-Menu")).click())
#driver.find_element_by_class_name("//div[contains(@class,'dropdown__content header-export-menu--content')]").click()
#driver.find_element_by_xpath('//div[@class="dropdown header-export-menu" and @class="dropdown dropdown--active header-export-menu"]')
#driver.quit()
HTML Code
Click Me For HTML Example1
<!-- Under React Empty: 32 -->
<div class = "dropdown header-export-menu">
Take a look at the class name in Example 1 vs Example 2.
Click Me for HTML Example2
You'll notice the HTML code has changed to
<!-- Under React Empty: 32 -->
<div class = "dropdown dropdown--active header-export-menu">
Which I think is part of the problem I am having. Pretty Stuck.
I have also tried to use ChroPath & XPath Helper to try and resolve the issue but no luck.
Thank you in Advance !
Update:
The Comments have asked for further detail of the HTML code & I have gathered the following block.
<div class="header-container">
<!-- react-empty: 429 -->
<div class="header-event-info" id="header-event-info">
<div class="">
<div>
<div class="single-event-info">
<div class="event-data">
<p class="data-dd">25</p>
<p class="data-mmyy">Jun 2017</p>
</div>
<div class="event-detail">
<p class="event-name">"A name of a musical"</p>
<p class="event-more-details">
<!-- react-text: 437 -->
"Tuesday, 7:00 pm, Some Theatre"
<!-- /react-text -->
<a class="popup" data-content="" data-icon="" data-position="bottom" data-width="350" data-height="auto" data-trigger="click" data-scrollable="false">
<span class="popup-icon">
<svg width="19px" height="19px" viewBox="0 0 19 19" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<desc>Created with sketchtool.</desc>
<defs></defs>
<g id="Totals-For-Today" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g class="svg-icon-path" id="01-Event-Audit-Narrow---No-TFT" transform="translate(-983.000000, -121.000000)" stroke="#919598">
<g id="Group" transform="translate(27.500000, 15.560000)">
<g id="iButton" transform="translate(956.000000, 106.000000)">
<path d="M9,17.4399996 C13.6944204,17.4399996 17.5,13.63442 17.5,8.93999958 C17.5,4.24557921 13.6944204,0.43999958 9,0.43999958 C4.30557963,0.43999958 0.5,4.24557921 0.5,8.93999958 C0.5,13.63442 4.30557963,17.4399996 9,17.4399996 Z" id="outline">
</path>
<path class="svg-icon-text" d="M10.4765625,13.3169527 L7.68164062,13.3169527 L7.68164062,12.930234 C7.77148482,12.9224214 7.86425733,12.914609 7.95996094,12.9067965 C8.05566454,12.8989839 8.13867152,12.8833591 8.20898438,12.8599215 C8.31835992,12.824765 8.3994138,12.7632422 8.45214844,12.6753511 C8.50488308,12.5874601 8.53125,12.4732034 8.53125,12.3325777 L8.53125,8.76421833 C8.53125,8.63921771 8.50292997,8.52496104 8.44628906,8.42144489 C8.38964815,8.31792875 8.31054738,8.23101556 8.20898438,8.16070271 C8.13476525,8.11382747 8.02734445,8.07378881 7.88671875,8.04058552 C7.74609305,8.00738223 7.61718809,7.98687462 7.5,7.97906208 L7.5,7.59820271 L9.5390625,7.46929646 L9.62109375,7.55132771 L9.62109375,12.2622652 C9.62109375,12.3989846 9.64746067,12.5122648 9.70019531,12.602109 C9.75292995,12.6919532 9.83593693,12.7583587 9.94921875,12.8013277 C10.0351567,12.8364841 10.1191402,12.8648042 10.2011719,12.8862886 C10.2832035,12.9077731 10.3749995,12.9224214 10.4765625,12.930234 L10.4765625,13.3169527 Z M9.73828125,5.18999958 C9.73828125,5.41265694 9.66503979,5.60699094 9.51855469,5.77300739 C9.37206958,5.93902385 9.19140732,6.02203083 8.9765625,6.02203083 C8.77734275,6.02203083 8.60449292,5.94293006 8.45800781,5.78472614 C8.31152271,5.62652223 8.23828125,5.44585997 8.23828125,5.24273396 C8.23828125,5.02788913 8.31152271,4.84039101 8.45800781,4.68023396 C8.60449292,4.5200769 8.77734275,4.43999958 8.9765625,4.43999958 C9.19921986,4.43999958 9.38183522,4.51519414 9.52441406,4.66558552 C9.6669929,4.81597689 9.73828125,4.99077983 9.73828125,5.18999958 L9.73828125,5.18999958 Z" id="i-2-copy-2" stroke-width="0.25" fill="#919598"></path>
</g>
</g>
</g>
</g>
</svg>
</span>
</a>
</p>
</div>
</div>
<div class="header-export">
<!-- react-empty: 32 -->
<div class="dropdown dropdown--active header-export-menu">
<a class="dropdown__trigger header-export-menu--toggle-btn">
<svg width="9px" height="5px" viewBox="0 0 9 5" version="1.1">
<desc>Created with Sketch.</desc>
<defs></defs>
<g id="Page-1" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="Artboard" transform="translate(-109.000000, -97.000000)" fill="#FFFFFF">
<g id="Header-Dropdown-Menu" transform="translate(96.000000, 82.000000)">
<path d="M18.1734867,19.6470682 C17.8014721,20.0543213 17.1922167,20.0476427 16.8263028,19.6470682 L13.2549246,15.7373969 C12.8829101,15.3301438 13.0295754,15 13.5787039,15 L21.4210856,15 C21.9719185,15 22.1107787,15.3368224 21.7448649,15.7373969 L18.1734867,19.6470682 Z" id="options-dropdown-menu-arrow"></path>
</g>
</g>
</g>
</svg>
</a>
<div class="dropdown__content header-export-menu--content">
<ul class="export-menu">
<li class="export-menu-item ">
<svg width="17px" height="14px" viewBox="0 0 17 15" version="1.1">
<desc>Created with Sketch.</desc>
<defs></defs>
<g id="Basic-Report-Template-SPECS" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="Basic-Report-Template---EXPORT-SPECS-OnClick" transform="translate(-522.000000, -180.000000)" stroke="#484B4D" stroke-width="2">
<g id="SPECS" transform="translate(487.000000, 14.000000)">
<g id="Download-Icon-Copy-2" transform="translate(43.500000, 172.000000) rotate(-180.000000) translate(-43.500000, -172.000000) translate(35.000000, 164.000000)">
<path d="M5.36902902,13.624518 L5.36902902,6.56257607 L12.430971,6.56257607" id="Rectangle-242-Copy-5" transform="translate(8.900000, 10.093547) rotate(-315.000000) translate(-8.900000, -10.093547) "></path>
<path d="M8.9,6.99999999 L8.9,12.9999999" id="Line-Copy-10" stroke-linecap="square"></path>
<path d="M16.9,0.0208873076 L16.9,4.02297419 C16.9,5.12639113 16.0054862,6.02088731 14.9059397,6.02088731 L2.89406028,6.02088731 C1.7927712,6.02088731 0.9,5.12262668 0.9,4.02297419 L0.9,0.0208873076" id="Rectangle-243-Copy-4" transform="translate(8.900000, 3.020887) rotate(-180.000000) translate(-8.900000, -3.020887) "></path>
</g>
</g>
</g>
</g>
</svg>
<!-- react-text: 55 -->
"Export PDF"
<!-- /react-text -->
</li>
<li class="export-menu-item">
<svg width="17px" height="14px" viewBox="0 0 17 15" version="1.1">
<desc>Created with Sketch.</desc>
<defs></defs><g id="Basic-Report-Template-SPECS" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="Basic-Report-Template---EXPORT-SPECS-OnClick" transform="translate(-522.000000, -180.000000)" stroke="#484B4D" stroke-width="2">
<g id="SPECS" transform="translate(487.000000, 14.000000)">
<g id="Download-Icon-Copy-2" transform="translate(43.500000, 172.000000) rotate(-180.000000) translate(-43.500000, -172.000000) translate(35.000000, 164.000000)">
<path d="M5.36902902,13.624518 L5.36902902,6.56257607 L12.430971,6.56257607" id="Rectangle-242-Copy-5" transform="translate(8.900000, 10.093547) rotate(-315.000000) translate(-8.900000, -10.093547) "></path>
<path d="M8.9,6.99999999 L8.9,12.9999999" id="Line-Copy-10" stroke-linecap="square">
</path>
<path d="M16.9,0.0208873076 L16.9,4.02297419 C16.9,5.12639113 16.0054862,6.02088731 14.9059397,6.02088731 L2.89406028,6.02088731 C1.7927712,6.02088731 0.9,5.12262668 0.9,4.02297419 L0.9,0.0208873076" id="Rectangle-243-Copy-4" transform="translate(8.900000, 3.020887) rotate(-180.000000) translate(-8.900000, -3.020887)
"></path>
</g>
</g>
</g>
</g>
</svg>
<!-- react-text: 79 -->
"Export Excel Spread Sheet"
<!-- /react-text -->
</li>
<li class="export-menu-item ">
<svg width="16px" height="12px" viewBox="0 0 16 13" version="1.1">
<desc>Created with Sketch.</desc>
<defs></defs>
<g id="Basic-Report-Template-SPECS" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="Basic-Report-Template---EXPORT-SPECS-OnClick" transform="translate(-554.000000, -182.000000)" stroke="#484B4D" stroke-width="2">
<g id="SPECS" transform="translate(487.000000, 14.000000)">
<g id="Email-Icon-Copy" transform="translate(67.000000, 168.000000)">
<rect id="Rectangle-40" x="0" y="0.669998169" width="16" height="11" rx="2"></rect>
<path d="M1.55761719,3.08300781 L8.07275391,7.10009766 L14.8974609,3.14355469" id="Path-41"></path>
</g>
</g>
</g>
</g>
</svg>
<!-- react-text: 90 -->
"Email/Schedule Report"
<!-- /react-text -->
</li>
</ul>
</div>
</div>
Update 2 (Solution)
Here's where I went wrong.
I didn't provide enough of the HTML code.
Just a few lines above there was an iframe enclosing this block, and Selenium cannot locate elements inside an iframe until you switch into it.
After switching into the iframe, I was able to click the button and complete the task of exporting the Excel report.
Example code (generalized for your future endeavors):
import time
# Find the frame
iframes = driver.find_element_by_id("IDofFrame")
# Switch to that frame
driver.switch_to.frame(iframes)
# Find the dropdown button element and click it
driver.find_element_by_xpath("XPathOfButton").click()
# Delay before the export click
time.sleep(3)
# Export click
driver.find_element_by_xpath("XPathOfButtonToExport").click()
# If you need to switch out of the frame to go back to the original HTML block
driver.switch_to.default_content()
CHECK YOUR HTML CODE FOR FRAMES !!!!
good video to reference.
https://www.youtube.com/watch?v=NhRx99uFUNk
Actually, every one of your attempts is incorrect.
driver.find_element_by_xpath("//li[contains(text(),'Export Excel Spread Sheet')]").click()
Here you are using contains(text(), ...), which is incorrect. If you pass the node set selected by text() to contains(), as you did, it is converted to a string by taking the string value of the first node in the node set. In your HTML, <svg> is the first inner element of the <li>, so the first text node is most likely just the whitespace before it, and the visible label is never checked. I would suggest using a dot (.) instead, which takes the whole string value of the node, together with an explicit wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.XPATH, "//li[contains(.,'Export Excel Spread Sheet')]")))
element.click()
Hope it helps
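The difference between the first text node and the whole string value can be seen without a browser. The sketch below uses the standard library's ElementTree on a cut-down copy of the <li>. ElementTree has no contains(), so this only illustrates which text each expression looks at; it is not the XPath Selenium evaluates. The markup is a simplified assumption based on the HTML in the question.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one menu item: whitespace, an <svg> icon, then the label.
li = ET.fromstring(
    '<li class="export-menu-item">\n'
    '  <svg width="17px"><desc>Created with Sketch.</desc></svg>\n'
    '  Export Excel Spread Sheet\n'
    '</li>'
)

# Roughly what contains(text(), ...) inspects here: only the text before the first child.
first_text_node = li.text or ''

# Roughly what contains(., ...) inspects: all descendant text concatenated.
string_value = ''.join(li.itertext())

print('Export Excel Spread Sheet' in first_text_node)
print('Export Excel Spread Sheet' in string_value)
```

The first check is False because the label sits after the <svg> element, which is why the text() version of the XPath never matches.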
I am extracting text from an html file which contains a lot of div tags. However, at some places there are say 4 nested div tags and when I print text, it prints it 4 times.
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
For example, here if I do:
for item in page_soup.find_all('div'):
    if "27" in item.text:
        print(item)
It prints the number 27 four times and therefore messes up the whole text.
How can I get my code to only print the nested text once?
EDIT 1:
This works well for this part of the code. But like I said, this is only true at some places. For example, when I do:
for item in page_soup.find_all('div', recursive = False):
    print(item)
It does not print anything. For reference, this is the document I am trying to scrape.
EDIT 2:
From the given html, I am trying to extract the section "ITEM 1A. RISK FACTORS".
should_print = False
for item in page_soup.find_all('div'):
    if "ITEM 1A." in item.text:
        should_print = True
    elif "ITEM 1B." in item.text:
        break
    if should_print:
        print(item)
So I am printing everything starting from ITEM 1A. until it finds ITEM 1B.
In some places there are nested div tags, which get printed multiple times with this piece of code.
If I use recursive = False, it does not print anything.
Here is one option
import bs4, re
html = '''<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
</div>'''
soup = bs4.BeautifulSoup(html,'html.parser')
elements = soup.find_all(text=re.compile('27'))
print(elements)
output
[u'27']
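If the goal is to emit each piece of text once without searching for a known value, one possible approach is to keep only the innermost divs, i.e. those that contain no further div. This is a sketch on simplified markup, not the asker's real document, and it assumes the text of interest lives in leaf divs; text placed directly inside a wrapper div would be skipped.

```python
from bs4 import BeautifulSoup

# Simplified markup: a page-number block nested four levels deep, plus a plain div.
html = '''<div>
<div id="PGBRK"><div id="PN"><div style="TEXT-ALIGN: center"><font>27</font></div></div></div>
</div>
<div>Plain paragraph</div>'''

soup = BeautifulSoup(html, 'html.parser')

# A div with no div descendant is a leaf, so its text is reported exactly once.
leaf_texts = [div.get_text(strip=True)
              for div in soup.find_all('div')
              if div.find('div') is None]
print(leaf_texts)
```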
printing everything starting from ITEM 1A. until it finds ITEM 1B
Through the .string attribute (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm'
html_doc = requests.get(url).content
page_soup = BeautifulSoup(html_doc, 'html.parser')
do_print = False
for el in page_soup.find_all('div'):
    if el.string:
        if "ITEM 1A" in el.string:
            do_print = True
        elif "ITEM 1B" in el.string:
            break
    if do_print:
        print(el)
The output (I'll show the representative start and end blocks without middle part, to make a short dump):
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"><font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1A. RISK FACTORS</font></font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
</div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">GENERAL RISKS OF OUR REGULATED OPERATIONS</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block">
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"> </font></div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">The regulatory environment in Ohio has recently become unpredictable and increasingly uncertain. – Affecting AEP and OPCo</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
.....
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">37</font></div>
<div style="TEXT-ALIGN: center; WIDTH: 100%">
<hr noshade="" size="2" style="COLOR: black"/>
</div>
<div id="HDR">
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt"> </font></div>
</div>
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt"> </font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"> </div>
You can pass the option text="27" to search the divs by text and match only that exact div. The code below should work fine. If you want to get all the divs, just remove text="27", or replace it with whatever text you want to find. You can also use recursive=False to get only the top-level divs.
Edit 1:
from bs4 import BeautifulSoup
t = '''
<div>
27
</div>
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
</div>
'''
page_soup = BeautifulSoup(t, 'html.parser')
for item in page_soup.find_all('div', text="27"):
    print(item.text)
Edit 2:
I have added code that works for your specific problem. Try the code below. The div range that you are expecting is 567 to 715, with the page numbers removed.
import requests
from bs4 import BeautifulSoup
resp = requests.get(
    r'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm')
t = resp.text
page_soup = BeautifulSoup(t, 'html.parser')
s = 'body > div:not(#PGBRK)'
for i in page_soup.select(s)[567:715]:
    print(i.get_text(strip=True))
Well, I think that is a cool question, and I don't see a simple answer if you want to generalize it to find out what text there is at each level without resorting to searching for a specific number like 27. Beautiful Soup doesn't seem to have a function for showing only the text at the top level of a tag. recursive=False simply prevents the search from delving below the first level, but each match still includes everything below it as its contents, so a tag found at the top level is captured along with everything beneath it.
So I think you'd actually have to recurse down the tree of divs and compare the text at each level. I figured this out: it prints in reverse order as it bubbles up from the recursion, but that could be stored in a list and output in forward order.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div>1A<div>2A</div>1B<div>2B<div>3A</div><div>3A</div>2C</div>1C</div>', 'html.parser')
def mangle(node):
    divs = node.find_all('div')
    if len(divs):
        result = [divs[0]] + [n for n in divs[0].next_siblings if n.__class__.__name__ == 'Tag']
        txt = []
        for r in result:
            txt.append(r.__repr__())
            for c in mangle(r):
                txt[-1] = txt[-1].replace(c.__repr__(), '')
        print(''.join(BeautifulSoup(t, 'html.parser').text for t in txt))
        return result
    else:
        return []
if __name__ == '__main__':
    mangle(soup)
Basically it walks down the branches of divs and builds lists at each fork of the tree, including the tags, then the caller removes anything found below it leaving just the text that is defined at that level. I keep the tags in place so that text patterns appearing at multiple levels don't get removed by mistake.
Output from the html 1A2A1B2B3A3A2C1C was
3A3A
2A2B2C
1A1B1C
which is the 3rd, 2nd and 1st nesting levels respectively. Hope this helps.
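As a possible alternative to removing child markup from the repr, the text that belongs directly to a div can be collected by asking for its string children only, with find_all(string=True, recursive=False). The sketch below groups that text by nesting depth for the same sample markup. The helpers own_text and depth are names invented for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div>1A<div>2A</div>1B<div>2B<div>3A</div><div>3A</div>2C</div>1C</div>',
    'html.parser')


def own_text(tag):
    """Text that is a direct child of the tag, ignoring text inside nested tags."""
    return ''.join(tag.find_all(string=True, recursive=False))


def depth(tag):
    """Nesting level of a div, counting the div itself and its div ancestors."""
    return 1 + sum(1 for parent in tag.parents if parent.name == 'div')


levels = {}
for div in soup.find_all('div'):
    levels.setdefault(depth(div), []).append(own_text(div))

# Join the pieces found at each level, outermost level first.
by_level = {level: ''.join(parts) for level, parts in sorted(levels.items())}
print(by_level)
```

This walks the divs once in document order, so the levels come out in forward order without a separate reversal step.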
I will answer my own question since I finally got it to work.
The solution was easy; I was just overthinking it.
I just added the condition that the parent of the item should not be a "div". Now the program does not print the text multiple times.
should_print = False
for item in page_soup.find_all('div'):
    if item.name == "div" and item.parent.name != "div":
        if "ITEM 1A." in item.text:
            should_print = True
        elif "ITEM 1B." in item.text:
            break
        if should_print:
            print(item)
Thank you everyone for your contributions. Appreciated...
My code works well and I have no problem extracting what I need. My problem is that I get a different result when I run the extraction on the live response of the web service than when I run it on the same response saved in a variable. I have been blocked on this for days, and I hope you can help.
NOTE: the answers to the suggested duplicate questions don't work for me; this isn't a duplicate question.
I'm consuming a web service. The response I get is stored in the variable answerService. It is a very long string, and from it I extract what is inside the span tags that have this structure:
<span style="font-weight:bold"> xxx </span>
"xxx" is what I want to extract
# with this I get the "xxx"
arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', answerService)
I get an array of length "n", depending on how many spans with this structure exist.
If I do this directly on the web service response it does not work, and I only get this result:
['áGILMENTE']
Now, if I paste the response of the web service into my code as sameStringOfAnswer, the result is different:
print(arraySpan)
['ADV', 'áGILMENTE']
Logically the response is the same and never changes, but for some strange reason, when I process the live response from the web service I only get ['áGILMENTE'], when the result I expect is ['ADV', 'áGILMENTE'].
This is the key piece that shows that 2 spans always come with the structure I need:
Here is my code:
import requests
import re
session = requests.Session()
getId=session.get('http://cartago.lllf.uam.es/grampal/grampal.cgi')
cookie=session.cookies.get_dict()
getId=session.cookies.get_dict()
getId=getId["CGISESSID"]
#getting an ID for request a webservice
getService=requests.get("http://cartago.lllf.uam.es/grampal/grampal.cgi?m=analiza&csrf="+getId+"&e="+"ágilmente", cookies=cookie)
answerService=getService.text
#get the value of the <span>
arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', answerService)
print(answerService)
print("array",arraySpan)
#same code but using the result of service web
sameStringOfAnswer='<html xmlns="http://www.w3.org/TR/REC-html40"><head><title>Grampal </title><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><meta name="Content-Language" content="EN"><meta name="author" content="jmguirao#ugr.es"><link rel="icon" type="image/ico" href="/favicon.ico"/><style type="text/css">html,body,form,ul,li,h1,h3,p{margin:0; padding:0}body{font-family: Arial, Helvetica, sans-serif; background-color:#fff}a{text-decoration: none;}a:hover{text-decoration: underline}ul{list-style-type: none}td{padding: 0.5pc 2pc 0pc 0pc}.nav{float: right; padding: 0.5pc 0.5pc 0.5pc 0.5pc; margin-left:5px}.nav li{display:inline; border-left: 1px solid #444; padding:0 0.4em;}.nav li.first{border-left:0}.hide{display:none}input{text-indent: 2px}input[type="submit"]{text-indent: 0}DIV.delPage{padding: 0.5ex 5em 0.5em 5em; background-color:#ffd6ba;}.delMain{padding: 2ex 0.5em 0.5pc 0.5em;}.post{margin-bottom: 0.25pc; font-size: 100%; padding-top: 0.5ex;}.posts, #posts{padding: 0.5ex 0.5em 0.5pc 50px;}.banner{padding: 0.5ex 0 0.5pc 0.5em;background-color: #ffc6aa;clear: both}.banner h1{font-weight: bolder; font-size: 150%;margin:0; padding:0 0 0 26px; display: inline;}h2{font-weight: bolder; font-size: 140%; color: red; margin:0; padding:0 0 0 26px; display: inline;}.resaltado{font-weight: bolder;font-size: 100%}</style></head><body><div class="banner"><ul class="hide"><li>skip to content</li></ul><ul class="nav">Análsis de:<li class="first"><a title="Analizador morfosintáctico" href="/grampal/grampal.cgi?m=analiza&e=ágilmente">palabras</a></li><li><a title="Desambiguador contextual" href="/grampal/grampal.cgi?m=etiqueta&e=ágilmente">oraciones</a></li><li><a title="Etiquetado de textos" href="/grampal/grampal.cgi?m=xml">textos</a></li><li><a title="Formas de una palabra" href="/grampal/grampal.cgi?m=genera&e=ágilmente">Generación de formas</a></li><!--<li><a title="Transcripción fonética" 
href="/grampal/grampal.cgi?m=transcribe&e=ágilmente">Transcripción</a></li>--><li>Etiquetario</li><li>Autores</li></ul><h1>Grampal</h1></div><div class="delPage" style="font-size: 80%;"><form method="GET" action="/grampal/grampal.cgi"><input type="hidden" name="m" value="analiza"><input type="hidden" name="csrf" value="94508700a0ae409a90718299ae00b0e0"><span class="resaltado">Palabra : </span><input name="e" size="60" value="ágilmente"><input type="submit" value="Analiza"> </form></div><br><h2>ágilmente</h2><div class="delMain"><div id="posts"><table><tr><td style="font-style:italic;font-size:90%">categoría <span style="font-weight:bold"> ADV </span></td><td style="font-style:italic;font-size:90%">lema <span style="font-weight:bold"> áGILMENTE </span></td></tr></table></div></div></body></html>'
arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', sameStringOfAnswer)
print(arraySpan)
What am I doing wrong?
The HTML from the webservice contains:
<span style="font-weight:bold"> ADV\n </span>
But your minified code contains the tag without the newline \n:
<span style="font-weight:bold"> ADV </span>
You can test the difference yourself:
>>> pattern = r'<span style="font-weight:bold">(.*?)<'
>>> re.findall(pattern, '<span style="font-weight:bold">AAA\n<')
[]
>>> re.findall(pattern, '<span style="font-weight:bold">AAA<')
['AAA']
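That is why they are different: by default, the dot in a regex does not match a newline, so the first span never matches in the live response. You should have mentioned that you use a minifier, as minifiers alter the HTML, and you cannot apply a regex afterwards and still expect the same output.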
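If you do stay with a regex, a small sketch of one way to make it tolerate the newline is to pass re.DOTALL, so that the dot also matches line breaks, and strip the captured text. The sample string is a shortened stand-in for the service response, not the real one.

```python
import re

# Shortened stand-in for the live response, with the newline inside the first span.
live_html = ('<td>categoría <span style="font-weight:bold"> ADV\n </span></td>'
             '<td>lema <span style="font-weight:bold"> áGILMENTE </span></td>')

pattern = r'<span style="font-weight:bold">(.*?)<'

# Default behaviour: "." stops at the newline, so the first span is skipped.
without_flag = [s.strip() for s in re.findall(pattern, live_html)]

# With re.DOTALL the dot also matches "\n", so both spans are captured.
with_flag = [s.strip() for s in re.findall(pattern, live_html, flags=re.DOTALL)]

print(without_flag)
print(with_flag)
```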
This whole problem would have been avoided if you used an XML parser instead of regex, just like the linked question suggests: RegEx match open tags except XHTML self-contained tags
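As a rough illustration of the parser route, here is a minimal sketch using only the standard library's html.parser. It collects the text of every span whose style is font-weight:bold, regardless of newlines or spacing. The class name BoldSpanCollector and the sample string are assumptions made for this example, and nested spans are not handled.

```python
from html.parser import HTMLParser


class BoldSpanCollector(HTMLParser):
    """Collect the stripped text of <span style="font-weight:bold"> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while between a matching <span> and </span>
        self._buffer = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get('style') or '').replace(' ', '').rstrip(';')
        if tag == 'span' and style == 'font-weight:bold':
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._inside:
            self.values.append(''.join(self._buffer).strip())
            self._inside = False


# Shortened stand-in for the live response, newline included.
live_html = ('<td>categoría <span style="font-weight:bold"> ADV\n </span></td>'
             '<td>lema <span style="font-weight: bold"> áGILMENTE </span></td>')

collector = BoldSpanCollector()
collector.feed(live_html)
print(collector.values)
```

Because the parser works on tags rather than raw characters, the result is the same for the live response and for a minified copy of it.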