How can I get the span with BeautifulSoup find_all? - python

I'm trying to get the following span from this website:
https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level
<span class="indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied" aria-labelledby="indeed-apply-button-label" data-indeed-apply-jobtitle="Growth Associate" data-indeed-apply-apitoken="aa102235a5ccb18bd3668c0e14aa3ea7e2503cfac2a7a9bf3d6549899e125af4" data-indeed-apply-coverletter="optional" data-indeed-apply-resume="required" data-indeed-apply-jk="40da42b64688bda8" data-indeed-apply-jobid="19c5d6a1fff8d6ba9724" data-indeed-apply-joblocation="New York, NY" data-indeed-apply-jobcompanyname="Via" data-indeed-apply-joburl="https://www.indeed.com/viewjob?jk=40da42b64688bda8" data-indeed-apply-posturl="https://dradisindeedapply.sandbox.indeed.net/process-indeedapply" data-indeed-apply-jobmeta="{"vtk":"1csimi0m80g7f002", "tk":""}" data-indeed-apply-advnum="7404493598529036" data-indeed-apply-onapplied="indeedApplyHandleApply" data-indeed-apply-onclose="indeedApplyHandleModalClose" data-indeed-apply-onclick="indeedApplyHandleButtonClick" data-indeed-apply-oncontinueclick="indeedApplyHandleModalClose" data-indeed-apply-pingbackurl="https://gdc.indeed.com/conv/orgIndApp?trk.origin=unknown&jk=40da42b64688bda8&vjtk=1csimi0m80g7f002&advn=7404493598529036&co=US&acct_key=899c31afcc98f5e9&sj=0" data-indeed-apply-skipcontinue="false" data-acc-payload="1,2,22,1,144,1,552,1,3648,1,4392,1" style="padding: 0px !important; margin: 0px !important; text-indent: 0px !important; vertical-align: top !important; position: relative; zoom: 1 !important; display: inline-block;"><a class="indeed-apply-button" href="javascript:void(0);" id="indeed-ia-1542520898760-0"><span class="indeed-apply-button-inner" id="indeed-ia-1542520898760-0inner"><span class="indeed-apply-button-label" id="indeed-ia-1542520898760-0label">Apply Now</span><span class="indeed-apply-button-cm"><img src="https://d3fw5vlhllyvee.cloudfront.net/indeedapply/s/14096d1/check.png" style="border: 0px;"></span></span></a></span>
And I tried this code:
url = "https://www.indeed.com/jobs?q=data&l=New+York%2C+NY&explvl=entry_level"
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features = 'lxml')
soup.find_all("span", {"class":"indeed-apply-widget indeed-apply-button-container js-IndeedApplyWidget indeed-apply-status-not-applied",
"aria-labelledby":"indeed-apply-button-label"})
But the result is [].

There is no such element at the URL you mentioned, but it does exist on the /viewjob?jk=.. page.
The class in your code is generated by JavaScript. If you view the page source, the real class is indeed-apply-widget, and only one element has it.
# https://www.indeed.com/viewjob?jk=0ee200c5fc30ce02&from=recjobs&vjtk=1csj1b3nmbi4v800
soup.find("span", {"class":"indeed-apply-widget"})
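To illustrate why searching for a single class works better than searching for the full class string, here is a small offline sketch. The HTML snippet and the js-extra class are made up for the example; they are not taken from the Indeed page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span carrying two classes.
html = '<span class="indeed-apply-widget js-extra">Apply Now</span>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches even when the tag has other classes too.
single = soup.find_all("span", {"class": "indeed-apply-widget"})

# A multi-class string is compared against the exact attribute value,
# so the same classes in a different order do not match.
reordered = soup.find_all("span", {"class": "js-extra indeed-apply-widget"})

# A CSS selector matches the classes regardless of their order.
css = soup.select("span.js-extra.indeed-apply-widget")

print(len(single), len(reordered), len(css))
```

If the classes are added by JavaScript after page load, none of these will find them in the raw HTML, which is why checking the page source first is worthwhile.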

Related

Parse div element from html with style attributes

I'm trying to get the text Something here I want to get inside the div element from a html file using Python and BeautifulSoup.
This is what part of the HTML looks like:
<div xmlns="" id="idp46819314579224" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #d43f3a; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;" class="" onclick="toggleSection('idp46819314579224-container');" onmouseover="this.style.cursor='pointer'">Something here I want to get<div id="idp46819314579224-toggletext" style="float: right; text-align: center; width: 8px;">
-
</div>
</div>
And this is what I tried:
vu = soup.find_all("div", {"style" : "background: #d43f3a"})
for div in vu:
    print(div.text)
I use a loop because there are several divs with different ids, but all of them have the same background colour. It raises no errors, but I get no output.
How can I get the text using the background colour as the condition?
The style attribute has other content inside it
style="box-sizing: ....; ....;"
Your current code is asking if style == "background: #d43f3a" which it is not.
What you can do is ask if "background: #d43f3a" in style -- a sub-string check.
One approach is passing a regular expression.
>>> import re
>>> vu = soup.find_all("div", style=re.compile("background: #d43f3a"))
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get
You can also say the same thing using CSS Selectors
soup.select('div[style*="background: #d43f3a"]')
Or by passing a function/lambda
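>>> # style is None for divs without a style attribute, so guard before the substring test
>>> vu = soup.find_all("div", style=lambda style: style and "background: #d43f3a" in style)
>>> for div in vu:
...     print(div.text.strip())
...
Something here I want to get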
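Here is a self-contained sketch comparing the function and CSS selector approaches on a page that also has a div with no style attribute. The HTML and the helper name has_red_background are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: one matching div, one with another colour, one with no style.
html = """
<div style="width: 100%; background: #d43f3a; color: #fff;">First</div>
<div style="background: #00ff00;">Other</div>
<div>No style</div>
"""
soup = BeautifulSoup(html, "html.parser")

def has_red_background(style):
    # Beautiful Soup passes None when the tag has no style attribute.
    return style is not None and "background: #d43f3a" in style

by_function = [d.get_text(strip=True)
               for d in soup.find_all("div", style=has_red_background)]

# [style*="..."] is the CSS "attribute contains substring" selector.
by_selector = [d.get_text(strip=True)
               for d in soup.select('div[style*="background: #d43f3a"]')]

print(by_function, by_selector)
```

Note that a substring test is sensitive to spacing, so "background:#d43f3a" (no space) would not match; a regular expression with optional whitespace would be more tolerant.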

beautifulsoup find_all title

html is
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
I want to get each title value.
So far I have written this:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
title = html.find_all(class_='trn-defstat__value')[4]
print(title)
Result:
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" style="height: 35px; padding-right: 8px;" title="ASH"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" style="height: 35px; padding-right: 8px;" title="JÄGER"/>
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" style="height: 35px; padding-right: 8px;" title="BANDIT"/>
</div>
What should I do?
This script will print all <img> titles from Top Operators section:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
# find the "Top Operators" label tag
operators = html.find(class_='trn-defstat__name', text='Top Operators')
# the <img> badges live in the next <div> after the label
for img in operators.find_next('div').find_all('img'):
    print(img['title'])
Prints:
ASH
JÄGER
BANDIT
Or using CSS:
# newer soupsieve releases spell :contains() as :-soup-contains()
for img in html.select('.trn-defstat__name:contains("Top Operators") + * img'):
    print(img['title'])
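The same idea can be tried offline. The markup below is only an assumption about how the label and value divs sit next to each other on the profile page; the real page may differ.

```python
from bs4 import BeautifulSoup

# Assumed structure: a label div followed by a sibling value div.
html = """
<div class="trn-defstat">
  <div class="trn-defstat__name">Top Operators</div>
  <div class="trn-defstat__value">
    <img src="ash.png" title="ASH">
    <img src="bandit.png" title="BANDIT">
  </div>
</div>
<div class="trn-defstat">
  <div class="trn-defstat__name">Level</div>
  <div class="trn-defstat__value">120</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the label by its class and exact text.
label = soup.find(class_="trn-defstat__name", string="Top Operators")

# The badges are in the next sibling div; whitespace text nodes are skipped.
value_div = label.find_next_sibling("div")
titles = [img["title"] for img in value_div.find_all("img")]
print(titles)
```

Anchoring on the label text avoids relying on a positional index such as [4], which breaks as soon as the page gains or loses a stat block.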
Use the .get() method to read the attribute, passing in the attribute name.
I also suggest installing html5lib, which I believe is a better parser:
pip install html5lib
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.content, 'html5lib')
# narrow the search to the Top Operators block first
container = html.find("div", class_= "trn-defstat mb0 top-operators")
imgs = container.find_all("img")
for img in imgs:
    print(img.get("title"))
I was not sure which part of the site you were trying to scrape, but as a general tip, first grab the block of HTML that contains the details you want, then search inside it :)
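As a small sketch of the difference between .get() and square-bracket access, using made-up markup where one image has no title:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second <img> has no title attribute.
html = ('<div class="trn-defstat__value">'
        '<img src="a.png" title="ASH"><img src="b.png"></div>')
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute instead of raising.
titles = [img.get("title") for img in soup.find_all("img")]

# img["title"] would raise KeyError on the second image, so filter first:
# title=True keeps only tags that actually have the attribute.
only_titled = [img["title"] for img in soup.find_all("img", title=True)]

print(titles, only_titled)
```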
This should help you:
from bs4 import BeautifulSoup
html = """
<div class="trn-defstat__value">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-ash.16913d82e3.png" title="ASH" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-jager.600b2773be.png" title="JÄGER" style="height: 35px; padding-right: 8px;">
<img src="https://trackercdn.com/rainbow6-ubi/assets/images/badge-bandit.385144d970.png" title="BANDIT" style="height: 35px; padding-right: 8px;">
</div>
"""
soup = BeautifulSoup(html,'html.parser')
imgs = soup.find_all('img')
for img in imgs:
    print(img['title'])
Output:
ASH
JÄGER
BANDIT
Here is the complete code:
from bs4 import BeautifulSoup as bs
import requests
bsURL = "https://r6.tracker.network/profile/pc/Spoit.GODSENT"
response = requests.get(bsURL)
html = bs(response.text, 'html.parser')
divs = html.find_all('div', class_="trn-defstat__value")
# collect every <img> inside the matching divs, flattened into one list
imgs = [img for div in divs for img in div.find_all('img')]
for img in imgs:
    print(img['title'])
Output:
ASH
JÄGER
BANDIT

I have 4 nested div tags and when I print text using find_all, it prints the text 4 times

I am extracting text from an html file which contains a lot of div tags. However, at some places there are say 4 nested div tags and when I print text, it prints it 4 times.
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
For example, here if I do:
for item in page_soup.find_all('div'):
    if "27" in item.text:
        print(item)
It prints the number 27 four times and therefore messes up the whole text.
How can I get my code to only print the nested text once?
EDIT 1:
This works well for this part of the code. But like I said, this is only true at some places. For example, when I do:
for item in page_soup.find_all('div', recursive = False):
    print(item)
It does not print anything. For reference, this is the document I am trying to scrape.
EDIT 2:
From the given html, I am trying to extract the section "ITEM 1A. RISK FACTORS".
should_print = False
for item in page_soup.find_all('div'):
    if "ITEM 1A." in item.text:
        should_print = True
    elif "ITEM 1B." in item.text:
        break
    if should_print:
        print(item)
So I am printing everything starting from ITEM 1A. until it finds ITEM 1B.
In some places there are nested div tags, which get printed multiple times by this piece of code.
If I pass recursive = False, it does not print anything.
Here is one option
import bs4, re
html = '''<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
</div>'''
soup = bs4.BeautifulSoup(html,'html.parser')
elements = soup.find_all(text=re.compile('27'))
print(elements)
output
[u'27']
To print everything starting from ITEM 1A. until it finds ITEM 1B:
Use the .string attribute (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm'
html_doc = requests.get(url).content
page_soup = BeautifulSoup(html_doc, 'html.parser')
do_print = False
for el in page_soup.find_all('div'):
    if el.string:
        if "ITEM 1A" in el.string:
            do_print = True
        elif "ITEM 1B" in el.string:
            break
    if do_print:
        print(el)
The output (I'll show the representative start and end blocks without middle part, to make a short dump):
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"><font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1A.   RISK FACTORS</font></font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
</div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">GENERAL RISKS OF OUR REGULATED OPERATIONS</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block">
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold"> </font></div>
<div align="justify" style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt"><font style="FONT-STYLE: italic; DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">The regulatory environment in Ohio has recently become unpredictable and increasingly uncertain. – Affecting AEP and OPCo</font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"><br/>
.....
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">37</font></div>
<div style="TEXT-ALIGN: center; WIDTH: 100%">
<hr noshade="" size="2" style="COLOR: black"/>
</div>
<div id="HDR">
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt">  </font></div>
</div>
<div align="right" id="GLHDR" style="WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 8pt">  </font></div>
<div style="TEXT-INDENT: 0pt; DISPLAY: block"> </div>
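A minimal sketch of why testing .string behaves differently from testing .text, on made-up markup with ids added only so the divs can be told apart:

```python
from bs4 import BeautifulSoup

# Made-up sample: an outer div with two children, one wrapping the number 27.
html = ("<div id='outer'><div id='a'><font>27</font></div>"
        "<div id='b'>x</div></div>")
soup = BeautifulSoup(html, "html.parser")

outer = soup.find(id="outer")
a = soup.find(id="a")

# .string is None when a tag has more than one child,
# but follows a chain of single children down to the text.
print(outer.string, a.string)

# .text concatenates all descendant text, so every ancestor matches.
by_text = [d["id"] for d in soup.find_all("div") if "27" in d.text]

# .string only matches tags that wrap exactly that one piece of text.
by_string = [d["id"] for d in soup.find_all("div")
             if d.string and "27" in d.string]

print(by_text, by_string)
```

One caveat: a chain of wrappers that each have a single child would all report the same .string, so this avoids most duplicates but not necessarily all of them.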
You can pass text = "27" to search the divs by text and match only that exact div. The code below should work fine. If you want all the divs, remove text = "27", or replace it with the text you want to find. You can also use recursive = False to get only the top-level divs.
Edit 1:
from bs4 import BeautifulSoup
t = '''
<div>
27
</div>
<div>
<div id="PGBRK" style="TEXT-INDENT: 0pt; WIDTH: 100%; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt">
<div id="PN" style="PAGE-BREAK-AFTER: always; WIDTH: 100%">
<div style="TEXT-ALIGN: center; WIDTH: 100%"><font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt">27</font></div>
</div>
</div>
</div>
</div>
'''
page_soup = BeautifulSoup(t, 'html.parser')
for item in page_soup.find_all('div', text="27"):
    print(item.text)
Edit 2:
I have added code that targets your problem specifically. Try the code below. The divs you are expecting are in the range 567 - 715, with page numbers removed.
import requests
from bs4 import BeautifulSoup
resp = requests.get(
    r'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm')
t = resp.text
page_soup = BeautifulSoup(t, 'html.parser')
s = 'body > div:not(#PGBRK)'
for i in page_soup.select(s)[567:715]:
    print(i.get_text(strip=True))
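To see what that selector does without downloading the filing, here is a sketch on a tiny made-up document. The text content is invented; only the PGBRK id comes from the question.

```python
from bs4 import BeautifulSoup

# Made-up document: a page-break block plus two ordinary top-level divs.
html = ("<html><body>"
        "<div id='PGBRK'><div>27</div></div>"
        "<div>Risk text</div>"
        "<div><div>nested</div></div>"
        "</body></html>")
soup = BeautifulSoup(html, "html.parser")

# "body > div" keeps only direct children of <body>, so nested divs are
# never returned on their own; :not(#PGBRK) drops the page-number blocks.
top_level = soup.select("body > div:not(#PGBRK)")
texts = [d.get_text(strip=True) for d in top_level]
print(texts)
```

The slice [567:715] in the answer is specific to that one document, so for other filings the start and end would need to be located by text instead of by index.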
Well, I think that is a cool question, and I don't see a simple answer if you want to generalize it to find out what text there is at each level without searching for a specific number like 27. Beautiful Soup doesn't seem to have a function for showing only the text at the top level of a tag. recursive=False simply prevents the search from descending below the first level, but each tag it returns still includes everything below it as contents, so a match at the top level captures that tag and everything beneath it.
So I think you'd actually have to recurse down the tree of divs and compare the text at each level. I figured this out. It prints in reverse order as it bubbles up from the recursion, but that could be stored in a list and output in forward order.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div>1A<div>2A</div>1B<div>2B<div>3A</div><div>3A</div>2C</div>1C</div>', 'html.parser')
def mangle(node):
    divs = node.find_all('div')
    if len(divs):
        # first div at this level plus its sibling tags
        result = [divs[0]] + [n for n in divs[0].next_siblings if n.__class__.__name__ == 'Tag']
        txt = []
        for r in result:
            txt.append(r.__repr__())
            # strip the markup of every div found one level down
            for c in mangle(r):
                txt[-1] = txt[-1].replace(c.__repr__(), '')
        print(''.join(BeautifulSoup(t, 'html.parser').text for t in txt))
        return result
    else:
        return []
if __name__ == '__main__':
    mangle(soup)
Basically it walks down the branches of divs and builds lists at each fork of the tree, including the tags, then the caller removes anything found below it leaving just the text that is defined at that level. I keep the tags in place so that text patterns appearing at multiple levels don't get removed by mistake.
Output from the html 1A2A1B2B3A3A2C1C was
3A3A
2A2B2C
1A1B1C
which is the 3rd, 2nd and 1st nesting levels respectively. Hope this helps.
I will answer my own question since I finally got it to work.
The solution was easy; I was just overthinking it.
I just added the condition that the parent of the item should not be a "div". Now the program does not print the text multiple times.
should_print = False
for item in page_soup.find_all('div'):
    if item.name == "div" and item.parent.name != "div":
        if "ITEM 1A." in item.text:
            should_print = True
        elif "ITEM 1B." in item.text:
            break
        if should_print:
            print(item)
Thank you everyone for your contributions. Appreciated...
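For anyone who wants to try the parent check without the full filing, here is a sketch on a small made-up document. The headings mimic the real ones, but the structure is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up sample: two headings with a triple-nested page number between them.
html = """
<body>
<div>ITEM 1A. RISK FACTORS</div>
<div><div><div>27</div></div></div>
<div>ITEM 1B. UNRESOLVED STAFF COMMENTS</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div")

# Keep only divs whose parent is not itself a div, i.e. the outermost ones.
top_level = [d for d in all_divs if d.parent.name != "div"]
texts = [d.get_text(strip=True) for d in top_level]

print(len(all_divs), texts)
```

Each nested group is now visited once through its outermost div, whose text already includes everything inside it.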

Python: SQL Generated Variables written to HTML file

I have a script in Python which connects to SQL using pyodbc and returns a set of values from a calendar for the 30 days following today. I prototyped it by using the print('') function to generate the HTML for the file I was creating then copying and pasting it in to an HTML file with Notepad++ and I know the HTML is sound and will be good for its purpose. However when it comes to generating the file I'm running aground with including the SQL results in the variable that is passed to the file writer.
I have tried both the {variable} and %v methods. With %, it errors out with:
unsupported format character ';' (0x3b) at index 1744
With {inset}, the output just includes the literal word rather than the variable's value. Below is the code I have in JN:
from os import getenv
import pyodbc
cnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=MYSERVER\SQLEXPRESS;DATABASE=MyTable;UID=test;PWD=t')
f = open('tes.html','w')
cursor = cnxn.cursor()
cursor.execute('DECLARE @today as date SET @today = GetDate() SELECT style112, day, month, year, dayofweek, showroom_name, isbusy from ShowroomCal where Date Between @today and dateadd(month,1,@today) ')
row = cursor.fetchone()
while row is not None:
    inset = ('<div class="',row.isbusy,'">',row.day,'</div>')
    row = cursor.fetchone()
html_str = """
<html lang="en" ><head><meta charset="UTF-8"><title>Calendar</title>
<link rel=\'stylesheet prefetch\' href=\'https://netdna.bootstrapcdn.com/font-awesome/3.2.1/css/font-awesome.css\'>
<style>
body{background-color: #ffffff;}
a{color:#462955; text-decoration: none; display: block;}a:hover{color:#ffffff; text-decoration: none; display: block;}#yes a {color:#ffffff !important; text-decoration: none; display: block;}#yes a:hover {color:#ffffff !important; text-decoration: none; display: block;}
#calendar{margin-left: auto;margin-right: auto;width: 800px;font-family: \'Lato\', sans-serif;}
#calendar_weekdays div{display:inline-block;vertical-align:top;}
#calendar_content, #calendar_weekdays, #calendar_header{position: relative;width: 800px;overflow: hidden;float: left;z-index: 10;}
#calendar_weekdays div, #calendar_content div{width: 25px;height: 25px;overflow: hidden;text-align: center;background-color: #FFFFFF;color: #787878;}
.Yes{background-color: #990000 !important;color: #CDCDCD !important;}
.None{background-color: #ffffff !Important;color: #462955 !important;}
.None:hover{background-color: #462955 !Important;color: #ffffff !important;}
.wend{background-color: #676767 !important;color: #999999 !important;}
#calendar_content{background-colour: #ff0000;-webkit-border-radius: 0px 0px 12px 12px;-moz-border-radius: 0px 0px 12px 12px; border-radius: 0px 0px 12px 12px;}
#calendar_content div{float: left;}
#yes {background-color: #ff0000 !important;}
#calendar_content div:hover{background-color: #F8F8F8;}
#calendar_content div.blank{background-color: #E8E8E8;}
#calendar_header, #calendar_content div.today{zoom: 1;filter: alpha(opacity=70);opacity: 0.7;}
#calendar_content div.today{color: #FFFFFF;}
#calendar_header{width: 100%;height: 25px;text-align: center;background-color: #FF6860;padding: 8px 0;-webkit-border-radius: 12px 12px 0px 0px;-moz-border-radius: 12px 12px 0px 0px; border-radius: 12px 12px 0px 0px;}
#calendar_header h1{font-size: 1.5em;color: #FFFFFF;float:left;width:70%;
i[class^=icon-chevron]{color: #FFFFFF;float: left;width:15%;border-radius: 50%;}
</style>
<link href=\'https://fonts.googleapis.com/css?family=Lato\' rel=\'stylesheet\' type=\'text/css\'>
</head><base target="_parent">
<div id="calendar"><div id="calendar_header"><h1>07 2018</h1></div><div id="calendar_weekdays"></div><div id="calendar_content">
{inset}
</div></div><script src=\'jquery.min.js\'></script>
<script>
$(function(){function c(){p();var e=h();var r=0;var u=false;l.empty();while(!u){if(s[r]==e[0].weekday){u=true}else{l.append(\'<div class="blank"></div>\');r++}}for(var c=0;c<42-r;c++){if(c>=e.length){l.append(\'<div class="blank"></div>\')}else{var v=e[c].day;var m=g(new Date(t,n-1,v))?\'<div class="today">\':"<div>";l.append(m+""+v+"</div>")}}var y=o[n-1];a.css("background-color",y).find("h1").text(i[n-1]+" "+t);f.find("div").css("color",y);l.find(".today").css("background-color",y);d()}function h(){var e=[];for(var r=1;r<v(t,n)+1;r++){e.push({day:r,weekday:s[m(t,n,r)]})}return e}function p(){f.empty();for(var e=0;e<7;e++){f.append("<div>"+s[e].substring(0,3)+"</div>")}}function d(){var t;var n=$("#calendar").css("width",e+"px");n.find(t="#calendar_weekdays, #calendar_content").css("width",e+"px").find("div").css({width:e/7+"px",height:e/14+"px","line-height":e/14+"px"});n.find("#calendar_header").css({height:e*(1/14)+"px"}).find(\'i[class^="icon-chevron"]\').css("line-height",e*(1/14)+"px")}function v(e,t){return(new Date(e,t,0)).getDate()}function m(e,t,n){return(new Date(e,t-1,n)).getDay()}function g(e){return y(new Date)==y(e)}function y(e){return e.getFullYear()+"/"+(e.getMonth()+1)+"/"+e.getDate()}function b(){var e=new Date;t=e.getFullYear();n=e.getMonth()+1}var e=700;var t=2018;var n=9;var r=[];var i=["JANUARY","FEBRUARY","MARCH","APRIL","MAY","JUNE","JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER","DECEMBER"];var s=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"];var o=["#462955","#462955","#462955","#462955","#462955","#462955","#462955","#462955","#462955","#462955","#462955","#462955"];var u=$("#calendar");var a=u.find("#calendar_header");var f=u.find("#calendarweekdays");var l=u.find("#calendarcontent");b();c();a.find(\'i[class^="icon-chevron"]\').on("click",function(){var e=$(this);var r=function(e){n=e=="next"?n+1:n-1;if(n<1){n=12;t--}else 
if(n>12){n=1;t++}c()};if(e.attr("class").indexOf("left")!=-1){r("previous")}else{r("next")}})})
function updateValue(val, event) {document.getElementById("field17").value = val;event.preventDefault();}
</script>
</body></html><wehavechangedit>
"""
cnxn.close()
f.write(html_str)
f.close()
Can anyone point me in the direction of a better way to include the variables? Do I need to have the inset as an array for this model?
It's Py3.6, on Windows 10.
Have you tried saving your html_str in a template .html file, building your inset lines into one long string, reading the template into a string, doing the replace, and then writing the result back out?
with open('C:\\template.html') as file:
    wholefile = file.read()  # read() returns one string; readlines() would return a list
Use this inside your row loop to build a string of your results (start with inset = '').
inset = inset + '<div class="'+ str(row.isbusy) + '">' + str(row.day) + '</div>' + '\n'
Then do the replace, so you have the complete file in a string, and write it back out.
wholefile = wholefile.replace('{inset}', inset)  # replace() returns a new string
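Here is a minimal sketch of the replace approach that runs without a database. The rows list is a hypothetical stand-in for the pyodbc cursor results, and the template is cut down to a few lines. Plain str.replace is used because the CSS and JavaScript in the real template are full of %, { and } characters, which is what trips up % formatting and str.format.

```python
from html import escape

# Cut-down template; the CSS braces and percent sign are left untouched.
template = """<style>
#calendar{width: 100%;}
</style>
<div id="calendar_content">
{inset}
</div>"""

# Hypothetical (isbusy, day) pairs standing in for the cursor rows.
rows = [("Yes", 1), (None, 2)]

# Build one string containing a div per row.
inset = "\n".join(
    '<div class="{}">{}</div>'.format(escape(str(isbusy)), day)
    for isbusy, day in rows
)

# Only the literal marker {inset} is substituted; nothing else is parsed.
page = template.replace("{inset}", inset)
print(page)
```

In the real script the page string would then be written out with f.write(page) in place of html_str.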

Extracting parent and child information

Using Python and beautifulsoup, I need help extracting information from a parent div and a child div at the same time.
Here is the first example code:
<div id="slide-609becd056bb40a7ad42607a4d1c67f5"
class="slide has-link slick-slide"
data-label="April 2 2018 Acura TLX Offer 2000x700.jpg"
data-link="/new-inventory/index.htm?model=TLX&year=2018" data-target="_self"
style="background-image: url("https://pictures.dealer.com/a/adw/0877/5eabcb338dc604c09b28a4df5a49ad78x.jpg?impolicy=resize&h=514");
width: 1897px; position: relative; left: 0px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;" data-slick-index="0" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide00">
Here is example code 2:
<div id="slide-7ae8b29ddc9e45d1a219beffe5793b2b"
class="html-slide slide slick-slide"
data-label="March-Madness.jpg" data-link="" data-target=""
data-promo-id="" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option"
aria-describedby="slick-slide02"
style="width: 1897px; position: relative; left: -3794px; top: 0px; z-index: 998; opacity: 0; height: 514px; transition: opacity 750ms ease;">
<div class="slide-background"
style="background-image: linear-gradient(rgba(0, 0, 0, 0), rgba(0, 0, 0, 0)), url("https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514"); height: 514px;">
<img src="https://pictures.dealer.com/g/goodsonacuraofdallasadw/1747/13ed067a023df8ad412feea2c6eddec9x.jpg?impolicy=resize&h=514" class="placeholder-image pull-left"> </div>
I need to get the style element from both examples of code so I can get the background image url. The issue is that the first code has the style in the parent div and the second set of code has the style in the child div. How do I get those two style elements at the same time using Python and beautifulsoup?
Here is the code I have tried:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.goodsonacura.com/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
banner_info = page_soup.findAll('div',{'class':['slide has-link', 'html-slide slide has-link']})
picture = [banner.get('style') for banner in banner_info]
This code gives me the correct style element for the first example code, but it gives me the wrong style element for the second example code.
Add the "slide-background" class to the find_all query. See the example below:
banner_info = page_soup.find_all('div',{'class':['slide has-link', 'html-slide slide has-link', 'slide-background']})
It works for me. I hope this helps.
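If you want the image URL itself rather than the whole style string, one possible sketch is below. The HTML is a simplified stand-in for the two slide layouts in the question (with example.com URLs), and URL_RE is a helper pattern introduced here, not something from the site or from Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Simplified stand-in: style on the slide itself, then on a child div.
html = """
<div class="slide has-link" style="background-image: url('https://example.com/a.jpg'); width: 10px;"></div>
<div class="html-slide slide" style="width: 10px;">
  <div class="slide-background" style="background-image: url('https://example.com/b.jpg');"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Captures whatever sits inside url(...), with or without quotes.
URL_RE = re.compile(r"url\(['\"]?(.*?)['\"]?\)")

urls = []
for slide in soup.select("div.slide"):
    # Check the slide first, then any .slide-background inside it.
    for tag in [slide] + slide.select(".slide-background"):
        match = URL_RE.search(tag.get("style", ""))
        if match:
            urls.append(match.group(1))
            break

print(urls)
```

Looking at each slide and then its children keeps one URL per slide, whichever element carries the background image.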
