How to scrape nested divs with BeautifulSoup - python

I'm trying to scrape data from a website. First I authenticate and start the session. There is no problem in this part. But I would like to scrape my test questions. So there are 100 Questions in a test with a unique url, but only members can have access to.
with requests.session() as s:
s.post(loginURL, data=payLoad)
res = s.get(targetURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
elems = soup.find_all("div", class_="Question-Container")
print(elems)
After I try to run this code, I didn't receive the data which I wanted.
The output looks likes this
[<div class="Questionboard-body Question-Container">
<div class="clearfix">
<div class="text-right">
<span><b>Question Id: </b></span><span class="DisplayQNr"></span>
</div>
</div>
<div class="QuestionText">
<div class="qText"></div>
</div>
<div class="QuestionOptions" hideanswer="false"></div>
<div class="QuestionSolution" hideanswer="false">
<button class="showSolutionBtn btn btn-primary-alt">Show Solution</button>
<div class="QuestionCorrectOptions text-center"></div>
<div class="DetailedSolution text-center"></div>
</div>
</div>]
Output which I want is the data inside those elements.
The div trees looks like this. There are alot of divs, where class="DisplayQNr" is for questionID, there is one more div QuestionText but the question Text is inside class="qText". There are four options for each question, class=QuestionOptions and so on. I want to scrape all of them. Image attach for better clarity.
Screenshot of nested divs

for i in elems:
gg = i.find_all('div')
print(gg)

Assuming as you mentioned in the comments, all data / content is in your soup you could go with:
...
soup = bs4.BeautifulSoup(res.text, "html.parser")
data = []
for e in soup.select('.Question-Container'):
d = {
'question': e.select_one('.qText').text if e.select_one('.qText') else None
}
d.update(dict(s.stripped_strings for s in e.select('.answerText')))
data.append(d)
df = pd.DataFrame(data)
Output would be something like that:
question
A
B
0
my question text
answer text a
answer text b
...

Related

extract tags from soup with BeautifulSoup

'''
<div class="kt-post-card__body>
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
according to picture I attached, I want to extract all "kt-post-card__body" attrs and then from each one of them, extract:
("kt-post-card__title", "kt-post-card__description")
like a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I only access to "kt-post-card__title" while "kt-post-card__body" has three other sub tags like: "kt-post-card__description" and "kt-post-card__bottom" ... , why is that?
Cause your question is not that clear - To extract the classes:
for e in soup.select('.kt-post-card__body'):
print([c for t in e.find_all() for c in t.get('class')])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts you also have to iterate your ResultSet and could access each elements text to fill your list or use stripped_strings.
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
soup = BeautifulSoup(html_doc)
for e in soup.select('.kt-post-card__body'):
data = [
e.select_one('.kt-post-card__title').text,
e.select_one('.kt-post-card__description').text
]
print(data)
Output:
['Example_1', 'Example_2']
or
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
Try this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
ads[0]
I think you're getting only the first div because you called ads[0].div

get attributes of a html class

i got the code below
h = """<div class="SB-kickOffInfo">
<div class="SB-kickOff">
<div class="SB-kickOff" data-eventdatetime='05/17/2022 18:45:00'></div>
</div>
</div>"""
soup = BeautifulSoup(h)
#print(soup)
kick_off = soup.find(class_="SB-kickOffInfo").get('data-eventdatetime')
print(kick_off)
i want to extract the date but fro the code above am getting None, what should i change to extract the date?
Issue here is that the selected element do not have this attribute you are looking for directly, it is one of its children:
soup.find(class_="SB-kickOffInfo").find(attrs={"data-eventdatetime": True}).get('data-eventdatetime')
Here also a solution with css selectors:
soup.select_one('.SB-kickOffInfo [data-eventdatetime]').get('data-eventdatetime')
Output:
05/17/2022 18:45:00

Web scraping from the span element

I am on a scraping project and I am lookin to scrape from the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output.(Without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I make this done?
There are several ways you can fix this, I would suggest the following - Find all <span> in <div> that have not the class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
import requests
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')])
Output
Christian, Islam

How to get child value of div seperately using beautifulsoup

I have a div tag which contains three anchor tags and have url in them.
I am able to print those 3 hrefs but they get merged into one value.
Is there a way I can get three seperate values.
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div not the urls array. You need something more like:
social_media_div = soup.find_all('div', class_ = 'foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
print(anchor.get('href'))

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span</div>
</div>
"""
My ultimate goal would be to able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting python and urllib to produce any text between the . Here is what I am running:
import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
h = str(i)
root = 'target_' + h
taggr = soup.find("span", { "id" : root })
print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?
use this code :
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
your_data.append(line.text)
...
similarly add all class attributes which you need to extract data from and write your_data list in csv file. Hope this will help if this doesn't work out. let me know.
You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
span = soup.find("span", id='target_{}'.format(i))
if span:
grouping = span.parent.parent
print list(grouping.stripped_strings)[:-1] # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note, if the HTML you are getting back from your URL is different to that seen been viewing the source from your browser (i.e. the data you want is missing completely) then you will need to use a solution such as selenium to connect to your browser and extract the HTML. This is because in this case, the HTML is probably being generated locally via Javascript, and urllib does not have a Javascript processor.

Categories