Extracting parent and child information - python

Using Python and beautifulsoup, I need help extracting information from a parent div and a child div at the same time.
Here is the first example code:
<div id="slide-609becd056bb40a7ad42607a4d1c67f5"
class="slide has-link slick-slide"
data-label="April 2 2018 Acura TLX Offer 2000x700.jpg"
data-link="/new-inventory/index.htm?model=TLX&year=2018" data-target="_self"
style="background-image: url("https://pictures.dealer.com/a/adw/0877/5eabcb338dc604c09b28a4df5a49ad78x.jpg?impolicy=resize&h=514");
width: 1897px; position: relative; left: 0px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;" data-slick-index="0" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide00">
Here is example code 2:
<div id="slide-7ae8b29ddc9e45d1a219beffe5793b2b"
class="html-slide slide slick-slide"
data-label="March-Madness.jpg" data-link="" data-target=""
data-promo-id="" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option"
aria-describedby="slick-slide02"
style="width: 1897px; position: relative; left: -3794px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;">
<div class="slide-background"
style="background-image: linear-gradient(rgba(0, 0, 0, 0), rgba(0, 0, 0, 0)), url("https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514"); height: 514px;">
<img src="https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514" class="placeholder-image pull-left"> </div>
I need to get the style element from both examples of code so I can get the background image url. The issue is that the first code has the style in the parent div and the second set of code has the style in the child div. How do I get those two style elements at the same time using Python and beautifulsoup?
Here is the code I have tried:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.goodsonacura.com/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
banner_info = page_soup.findAll('div',{'class':['slide has-link', 'html-slide slide has-link']})
picture = [banner.get('style') for banner in banner_info]
This code gives me the correct style element for the first example code, but it gives me the wrong style element for the second example code.

Add "slide-background" class in the find_all query. See the example below:-
banner_info = page_soup.find_all('div',{'class':['slide has-link', 'html-slide slide has-link', 'slide-background']})
It works for me. May this helps you.
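Building on that, if what you ultimately need is the background-image URL rather than the whole style string, here is a minimal sketch (assuming the class-based query above matches the slides) that uses a regular expression to pull the url(...) out of whichever div carries the style:
import re
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.goodsonacura.com/'
uClient = uReq(my_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# match slides whether the style sits on the parent div or on the child slide-background div
banner_info = page_soup.find_all('div', {'class': ['slide has-link', 'html-slide slide has-link', 'slide-background']})

urls = []
for banner in banner_info:
    style = banner.get('style', '')
    # pull the address out of the background-image: url("...") declaration
    match = re.search(r'url\("?([^")]+)"?\)', style)
    if match:
        urls.append(match.group(1))
print(urls)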

Related

Parse div element from html with style attributes

I'm trying to get the text "Something here I want to get" inside the div element from an HTML file using Python and BeautifulSoup.
This is how part of the code looks like in html:
<div xmlns="" id="idp46819314579224" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #d43f3a; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;" class="" onclick="toggleSection('idp46819314579224-container');" onmouseover="this.style.cursor='pointer'">Something here I want to get<div id="idp46819314579224-toggletext" style="float: right; text-align: center; width: 8px;">
-
</div>
</div>
And this is how I tried to do:
vu = soup.find_all("div", {"style" : "background: #d43f3a"})
for div in vu:
    print(div.text)
I use loop because there are several div with different id but all of them has the same background colour. It has no errors, but I got no output.
How can I get the text using the background colour as the condition?
The style attribute has other content inside it
style="box-sizing: ....; ....;"
Your current code is asking if style == "background: #d43f3a" which it is not.
What you can do is ask if "background: #d43f3a" in style -- a sub-string check.
One approach is passing a regular expression.
>>> import re
>>> vu = soup.find_all("div", style=re.compile("background: #d43f3a"))
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get
You can also say the same thing using CSS Selectors
soup.select('div[style*="background: #d43f3a"]')
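A quick usage sketch of the selector version, printing the same text:
for div in soup.select('div[style*="background: #d43f3a"]'):
    print(div.text.strip())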
Or by passing a function/lambda (guard against None, since tags without a style attribute pass None to the function):
>>> vu = soup.find_all("div", style=lambda style: style and "background: #d43f3a" in style)
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get
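Since a style attribute is just a semicolon-separated list of property: value pairs, another option is to parse it into a dict and compare the value directly. A minimal sketch (the parse_style helper is just for illustration, not part of BeautifulSoup):
def parse_style(style):
    # turn 'a: 1; b: 2' into {'a': '1', 'b': '2'} with a naive split on ';'
    decls = {}
    for part in (style or '').split(';'):
        if ':' in part:
            prop, _, value = part.partition(':')
            decls[prop.strip()] = value.strip()
    return decls

for div in soup.find_all('div', style=True):
    if parse_style(div['style']).get('background') == '#d43f3a':
        print(div.text.strip())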

beautifulsoup find_all title

The HTML is:
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
I want to get each title value.
This is what I have written so far:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
title = html.find_all(class_='trn-defstat__value')[4]
print(title)
Result ->
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" style="height: 35px; padding-right: 8px;" title="ASH"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" style="height: 35px; padding-right: 8px;" title="JÄGER"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" style="height: 35px; padding-right: 8px;" title="BANDIT"/>
</div>
What should I do?
This script will print all <img> titles from Top Operators section:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
# find Top Operators tag
operators = html.find(class_='trn-defstat__name', text='Top Operators')
for img in operators.find_next('div').find_all('img'):
    print(img['title'])
Prints:
ASH
JÄGER
BANDIT
Or using CSS:
for img in html.select('.trn-defstat__name:contains("Top Operators") + * img'):
    print(img['title'])
Just use the .get() function to get the attribute and pass in the attribute name.
pip install html5lib
I suggest you use that, I believe it's a better parser.
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.content, 'html5lib')
container = html.find("div", class_="trn-defstat mb0 top-operators")
imgs = container.find_all("img")
for img in imgs:
    print(img.get("title"))
I wasn't sure exactly which part of the site you were trying to scrape, but note that it often helps to first grab the block of HTML that contains the details you want, and then search inside it. :)
This should help you:
from bs4 import BeautifulSoup
html = """
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
"""
soup = BeautifulSoup(html,'html.parser')
imgs = soup.find_all('img')
for img in imgs:
    print(img['title'])
Output:
ASH
JÄGER
BANDIT
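If the page mixes images with and without captions, a small variation on the snippet above is to restrict the search to tags that actually carry a title attribute by passing title=True:
for img in soup.find_all('img', title=True):
    print(img['title'])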
Here is the complete code:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
divs = html.find_all('div', class_="trn-defstat__value")
imgs = []
for div in divs:
    try:
        imgs.append(div.find_all('img'))
    except:
        pass
# drop empty results, then flatten the nested lists
imgs = [ele for ele in imgs if ele != []]
imgs = [j for sub in imgs for j in sub]
for img in imgs:
    print(img['title'])
Output:
ASH
JÄGER
BANDIT

I have 4 nested div tags and when I print text using find_all, it prints the text 4 times

I am extracting text from an html file which contains a lot of div tags. However, at some places there are say 4 nested div tags and when I print text, it prints it 4 times.
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
For example, here if I do:
for item in page_soup.find_all('div'):
    if "27" in item.text:
        print(item)
It prints the number 27 four times and therefore messes up whole text.
How can I get my code to only print the nested text once?
EDIT 1:
This works well for this part of the code. But like I said, this is only true at some places. For example, when I do:
for item in page_soup.find_all('div', recursive=False):
    print(item)
It does not print anything. For reference, this is the document I am trying to scrape.
EDIT 2:
From the given html, I am trying to extract the section "ITEM 1A. RISK FACTORS".
should_print = False
for item in page_soup.find_all('div'):
    if "ITEM 1A." in item.text:
        should_print = True
    elif "ITEM 1B." in item.text:
        break
    if should_print:
        print(item)
So I am printing everything starting from ITEM 1A. until it finds ITEM 1B.
Here, at some places, there are nested div tags, which get printed multiple times with this piece of code.
If I use recursive = False, it does not print anything.
Here is one option
import bs4, re
html = '''<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
'''
soup = bs4.BeautifulSoup(html,'html.parser')
elements = soup.find_all(text=re.compile('27'))
print(elements)
output
[u'27']
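A follow-up sketch, building on the elements list above: find_all(text=...) returns NavigableString objects, so if you also need the tag that encloses that text exactly once, you can step up with find_parent:
node = elements[0]                       # the NavigableString u'27'
innermost_div = node.find_parent('div')  # only the div that directly wraps the text
print(innermost_div)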
printing everything starting from ITEM 1A. until it finds ITEM 1B
Through the .string attribute (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string):
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm'
html_doc = requests.get(url).content
page_soup = BeautifulSoup(html_doc, 'html.parser')
do_print = False
for el in page_soup.find_all('div'):
    if el.string:
        if "ITEM 1A" in el.string:
            do_print = True
        elif "ITEM 1B" in el.string:
            break
    if do_print:
        print(el)
The output (I'll show representative start and end blocks without the middle part, to keep the dump short):
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"><font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1A.   RISK FACTORS</font></font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
</div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">GENERAL RISKS OF OUR REGULATED OPERATIONS</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block">
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"> </font></div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">The regulatory environment in Ohio has recently become unpredictable and increasingly uncertain. – Affecting AEP and OPCo</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
.....
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">37</font></div>
<div style="TEXT-ALIGN: center; WIDTH: 100%">
<hr noshade="" size="2" style="COLOR: black"/>
</div>
<div id="HDR">
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt">  </font></div>
</div>
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt">  </font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"> </div>
You can provide the option text="27" to search the divs by text and identify only that exact div. The code below should work fine. If you want to get all the divs, just remove text="27" or replace it with whatever text you want to find. You can also use recursive=False to get only the top-level divs.
Edit 1:
from bs4 import BeautifulSoup
t = '''
<div>
27
</div>
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
'''
page_soup = BeautifulSoup(t, 'html.parser')
for item in page_soup.find_all('div', text="27"):
    print(item.text)
Edit 2:
I have added code that works for your problem specifically. Try the code below. The div range you are expecting is 567 - 715, with the page numbers removed.
import requests
from bs4 import BeautifulSoup
resp = requests.get(
    r'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm')
t = resp.text
page_soup = BeautifulSoup(t, 'html.parser')
s = 'body > div:not(#PGBRK)'
for i in page_soup.select(s)[567:715]:
    print(i.get_text(strip=True))
Well, I think that is a cool question, and I don't see a simple answer if you want to generalize it to find out what text there is at each level without resorting to searching for a specific number like 27. Beautiful Soup doesn't seem to have a function for showing only the text in the top-level div, and recursive=False simply prevents the search from delving below the first level but still includes everything below the first level as contents; so if a div is at the top level of tags, the search will capture it and everything below it.
So I think you'd actually have to recurse down the tree of divs and compare the text at each level. I figured this out below. It prints in reverse order as it bubbles up from the recursion, but the results could be stored in a list and output in forward order.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div>1A<div>2A</div>1B<div>2B<div>3A</div><div>3A</div>2C</div>1C</div>', 'html.parser')
def mangle(node):
    divs = node.find_all('div')
    if len(divs):
        result = [divs[0]] + [n for n in divs[0].next_siblings if n.__class__.__name__ == 'Tag']
        txt = []
        for r in result:
            txt.append(r.__repr__())
            for c in mangle(r):
                txt[-1] = txt[-1].replace(c.__repr__(), '')
        print(''.join(BeautifulSoup(t, 'html.parser').text for t in txt))
        return result
    else:
        return []

if __name__ == '__main__':
    mangle(soup)
Basically it walks down the branches of divs and builds lists at each fork of the tree, including the tags, then the caller removes anything found below it leaving just the text that is defined at that level. I keep the tags in place so that text patterns appearing at multiple levels don't get removed by mistake.
Output from the html 1A2A1B2B3A3A2C1C was
3A3A
2A2B2C
1A1B1C
which is the 3rd, 2nd and 1st nesting levels respectively. Hope this helps.
I will answer my own question since I finally got it to work.
The solution was easy; I was just overthinking it.
I just added the condition that the parent of the item should not be "div". Now the program does not print the text multiple times.
should_print = False
for item in page_soup.find_all('div'):
    if item.name == "div" and item.parent.name != "div":
        if "ITEM 1A." in item.text:
            should_print = True
        elif "ITEM 1B." in item.text:
            break
        if should_print:
            print(item)
Thank you everyone for your contributions. Appreciated...

How can I get the span with BeautifulSoup find_all?

I'm trying to get the following span from this website:
https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level
<span class="indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied" aria-labelledby="indeed-apply-button-label" data-indeed-apply-jobtitle="Growth Associate" data-indeed-apply-apitoken="aa102235a5ccb18bd3668c0e14aa3ea7e2503cfac2a7a9bf3d6549899e125af4" data-indeed-apply-coverletter="optional" data-indeed-apply-resume="required" data-indeed-apply-jk="40da42b64688bda8" data-indeed-apply-jobid="19c5d6a1fff8d6ba9724" data-indeed-apply-joblocation="New York, NY" data-indeed-apply-jobcompanyname="Via" data-indeed-apply-joburl="https://www.indeed.com/viewjob?jk=40da42b64688bda8" data-indeed-apply-posturl="https://dradisindeedapply.sandbox.indeed.net/process-indeedapply" data-indeed-apply-jobmeta="{"vtk":"1csimi0m80g7f002", "tk":""}" data-indeed-apply-advnum="7404493598529036" data-indeed-apply-onapplied="indeedApplyHandleApply" data-indeed-apply-onclose="indeedApplyHandleModalClose" data-indeed-apply-onclick="indeedApplyHandleButtonClick" data-indeed-apply-oncontinueclick="indeedApplyHandleModalClose" data-indeed-apply-pingbackurl="https://gdc.indeed.com/conv/orgIndApp?trk.origin=unknown&jk=40da42b64688bda8&vjtk=1csimi0m80g7f002&advn=7404493598529036&co=US&acct_key=899c31afcc98f5e9&sj=0" data-indeed-apply-skipcontinue="false" data-acc-payload="1,2,22,1,144,1,552,1,3648,1,4392,1" style="padding: 0px !important; margin: 0px !important; text-indent: 0px !important; vertical-align: top !important; position: relative; zoom: 1 !important; display: inline-block;"><a class="indeed-apply-button" href="javascript:void(0);" id="indeed-ia-1542520898760-0"><span class="indeed-apply-button-inner" id="indeed-ia-1542520898760-0inner"><span class="indeed-apply-button-label" id="indeed-ia-1542520898760-0label">Apply Now</span><span class="indeed-apply-button-cm"><img src="https://d3fw5vlhllyvee.cloudfront.net/indeedapply/s/14096d1/check.png" style="border: 0px;"></span></span></a></span>
And I tried this code:
url = "https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level"
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features = 'lxml')
soup.find_all("span", {"class":"indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied",
"aria-labelledby":"indeed-apply-button-label"})
But the result is [].
There is no such element at the URL you mentioned above, but it does exist on the /viewjob?jk=.. page.
The class in your code is generated by JavaScript; if you view the page source, the real class is indeed-apply-widget, and there is only one such element.
# https://www.indeed.com/viewjob?jk=0ee200c5fc30ce02&from=recjobs&vjtk=1csj1b3nmbi4v800
soup.find("span", {"class":"indeed-apply-widget"})

Selenium2Library: move mouse position for click

I am new to the whole Robotframework and the Selenium2Library and I have a problem.
I have two divs: rasterContainer and anlageContainer.
They have the same x- and y-offset. The anlageContainer has a z-index of 3 and the rasterContainer has 0, so the anlageContainer lies on top of the rasterContainer.
Together they build a time bar. The anlageContainer has just one id, while the rasterContainer contains many other divs, each of them with an id.
If you mouse over those divs, the rasterContainer shows you the time. If you click there, you actually click on the anlageContainer, and some other methods calculate the offset to get the time and open a window with this time in a textbox.
What I want to do:
I want to move my mouse to an element of the rasterContainer and click at the same position on the anlageContainer.
What I have tried:
I began to write my own library in python. I have just one method which gets an instance of the Selenium2Library, the vertical value of the mouse position (the mouse is on top of the anlageContainer) and the vertical value of the rasterContainer's element.
def click_on_element(self, vertEl, vertMo, se2lib):
    v = vertEl - vertMo
    # Get WebDriver
    driver = se2lib._current_browser()
    # ActionChains instance
    ac = webdriver.ActionChains(driver)
    ac.move_by_offset(0, v)
    ac.click().perform()
    return "On my way"
With move_by_offset: The window opens but with the wrong time (07:00). I wanted to have 09:30.
I also tried:
#Get Element
elmfinder = ElementFinder()
elm = elmfinder.find(driver, "5_09_30")[0]
ac.move_to_element(elm)
ac.move_to_element_with_offset(elm, 461, 422)
The window opened neither with move_to_element nor with move_to_element_with_offset.
I really don't know what I am missing here.
Any hints would help.
EDIT:
HTML code:
<div id="resource_id_5_2013-07-30" class="resource" daylenght="720" loaded="false" date="2013-07-30" time="07:00" style="top: 0px; height: 1540px; width: 309.75px; left: 619.5px;">
<div class="terminContainer"></div>
<div class="overlapContainer" style="width: 10%; position: absolute; left: 90%; height: 1560.0px; top: 0px;"></div>
<div id="5" class="anlageContainer" style="width: 10%; height: 1440px; top: 0px;" title="08:53"></div>
<div class="rasterContainer" style="width: 10%; height: 1440px; top: 0px;">
<div id="5_07_00" class="rasterLabel" style="position: absolute; top: 0px;">7:00</div>
<div id="5_07_15" class="rasterLabel" style="position: absolute; top: 30px;">7:15</div>
<div id="5_07_30" class="rasterLabel" style="position: absolute; top: 60px;">7:30</div>
etc...
</div>
</div>
CSS style:
.rasterContainer {
    position: absolute;
    background-color: #EEEEEE;
}
.anlageContainer:hover + .rasterContainer {
    background-color: #e3e3e3;
}
.rasterLabel {
    z-index: 2;
    font-size: 0.7em;
    color: #000;
    border-top: solid 1px #888;
}
.anlageContainer {
    z-index: 3;
    cursor: pointer;
    position: absolute;
}
There you can see that the anlageContainer is above the rasterContainer. And between them are the rasterLabels --> z-index.
The anlageContainer has
dojo.connect(anlageContainer, 'onclick', function(clickevt){
    addTermin(resourceId, getOffsetY(clickevt)/g_terminMultiplikator, datum);
});
Two links to images:
Time bar
3D time bar
element = driver.find_element_by_xpath(".//div[@id='resource_id_5_2013-07-30']//div[@class='anlageContainer']")
element.click()
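If a plain click still lands on the default time, another sketch worth trying (assumptions: the rasterLabel ids such as 5_09_30 from the HTML above, a plain Selenium WebDriver handle, and Selenium 3 semantics where move_to_element_with_offset measures from the element's top-left corner) is to compute the label's vertical offset inside the anlageContainer and click there:
from selenium.webdriver.common.action_chains import ActionChains

def click_time_slot(driver, label_id="5_09_30"):
    # rasterLabel that marks the target time (id taken from the HTML above)
    label = driver.find_element_by_id(label_id)
    container = driver.find_element_by_xpath(
        ".//div[@id='resource_id_5_2013-07-30']//div[@class='anlageContainer']")
    # vertical distance of the label from the top of the overlaying container
    y_offset = label.location['y'] - container.location['y']
    # move a few pixels in from the left edge, down to the label's row, and click;
    # the click lands on the anlageContainer because it sits on top (z-index 3)
    ActionChains(driver).move_to_element_with_offset(container, 5, y_offset).click().perform()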
