Python beautifulsoup print by line # - python

Okay so I'm currently using python beautifulsoup to output a specific line from a html file, since the html contains multiple of the same div class, it'll output every div containing the same class, example of this
CONTENT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
OUTPUT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
Now I only want #2 of div class border,
<div class=border>example</a>
now if i view source within chrome, it'll show content in number lines, so line 1 will contain
<div class=border>aaaa</a>
& line 2 will contain
<div class=border>example</a>
is it possible to output via numbered line using beautiful soup?

find_all returns a list, so you can index it with [1] to get the second element.
from bs4 import BeautifulSoup
html_doc = """<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(class_="border")[1]
returns
<div class="border">example</div>

If you have the list with say 200 elements generated by soup.find_all... If the list is called div_list, you could just do an index loop (you want index 1,4,7 etc...)
count = 1
while True:
try:
print(div_list[count])
count+=3
except:
# happens because of index error
break
Or even shorter:
count = 1
while count<= len(div_list):
print(div_list[count])
count+=3

Related

extract tags from soup with BeautifulSoup

'''
<div class="kt-post-card__body>
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
according to picture I attached, I want to extract all "kt-post-card__body" attrs and then from each one of them, extract:
("kt-post-card__title", "kt-post-card__description")
like a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I only access to "kt-post-card__title" while "kt-post-card__body" has three other sub tags like: "kt-post-card__description" and "kt-post-card__bottom" ... , why is that?
Cause your question is not that clear - To extract the classes:
for e in soup.select('.kt-post-card__body'):
print([c for t in e.find_all() for c in t.get('class')])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts you also have to iterate your ResultSet and could access each elements text to fill your list or use stripped_strings.
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
soup = BeautifulSoup(html_doc)
for e in soup.select('.kt-post-card__body'):
data = [
e.select_one('.kt-post-card__title').text,
e.select_one('.kt-post-card__description').text
]
print(data)
Output:
['Example_1', 'Example_2']
or
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
Try this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
ads[0]
I think you're getting only the first div because you called ads[0].div

How to get the value of an element within a tag?

<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
I want to get the values of data-account and data-id which are elements within play-js tag.
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
I tried like below, but I couldn't get the value.
With javascript, I was able to import it with the code below.
var elems = document.querySelectorAll('.player__AAtt play-js');
console.log(elems[0].dataset.account)
console.log(elems[0].dataset.dataid)
How can I get the value of an element within a tag rather than the tag itself?
You can use the .get_attribute() method:
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
elemname.get_attribute("data-account")
In python we use BeautifulSoup mostly for parsing the html page. Here the code that will help you to get the value of all the play-js element in the provided html file.
from bs4 import BeautifulSoup
res_page = """
<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
"""
soup = BeautifulSoup(res_page, 'html.parser')
output_list = soup.find_all('play-js')
data_account_list = [data_account['data-account'] for data_account in output_list]
data_id_list = [data_id['data-id'] for data_id in output_list]
print(data_account_list)
print(data_id_list)
The output is:
['1234567890']
['32667_32797']

quickest way to find a child of html element bs4

I have some HTML that has the following structure:
<div class="article">
<h1 class="header">Birth Date between 1919-01-01 and 2019-01-01, Oscar-Winning, Oscar-Nominated, Males (Sorted by Popularity Ascending) </h1>
<br class="clear"/>
<div class="desc">
<span>1-100 of 716 names.</span> // I WANT THIS ELEMENT
<span class="ghost">|</span> <a class="lister-page-next next-page" href="/search/name?birth_date=1919-01-01,2019-01-01&groups=oscar_winner,oscar_nominee&gender=male&count=100&start=101&ref_=rlm">Next ยป</a>
</div>
<br class="clear"/>
</div>
Now I am trying to get a specific element out of this html with bs4. I tried to do:
webSoup = BeautifulSoup(html, 'html.parser')
nextUrl = webSoup.findChildren()[2][0]
but this gives me the following error:
return self.attrs[key]
KeyError: 0
So, to summarize my question:
How do I get a specific child at a certain index from an html document with bs4?
If you want the first match for span following class desc then you can use a css child combinator to pair the parent class with child element tag:
webSoup.select_one('.desc > span')
You can also choose to specify the parent must be a div
div.desc > span
If there is more than one match then use webSoup.select and then index into the list returned.
You can use:
nextUrl = webSoup.findChildren()[3].findChildren()[0]
print(nextUrl)

How to find an ID in a div class with multiple values BS4 Python

I am trying to find an ID in a div class which has multiple values using BS4 the HTML is
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk, my current code is soup =
bs(size.text,"html.parser")
sizes = soup.find_all("div",{"class":"size"})
size = sizes[0]["data-test5-uk"]
size.text is from a get request to the site with the html, however it returns
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation and then the solution.
.find_all('tag') is used to find all instances of that tag and we can later loop through them.
.find('tag') is used to find the ONLY first instance.
We can either extract the content of the argument with ['arg'] or ..get('arg') it is the SAME.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size ')
print( one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't in the a tag
# we have found the tag that contains the tag .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size '):
# we loop through each instance and do the same
datauk = each.find('a')['data-test5-uk']
print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg']? - You've tried to extract the ["data-test5-uk"] of the div. <div class="size "> the div has no arguments like that except one class="size "

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span</div>
</div>
"""
My ultimate goal would be to able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting python and urllib to produce any text between the . Here is what I am running:
import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
h = str(i)
root = 'target_' + h
taggr = soup.find("span", { "id" : root })
print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?
use this code :
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
your_data.append(line.text)
...
similarly add all class attributes which you need to extract data from and write your_data list in csv file. Hope this will help if this doesn't work out. let me know.
You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
span = soup.find("span", id='target_{}'.format(i))
if span:
grouping = span.parent.parent
print list(grouping.stripped_strings)[:-1] # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note, if the HTML you are getting back from your URL is different to that seen been viewing the source from your browser (i.e. the data you want is missing completely) then you will need to use a solution such as selenium to connect to your browser and extract the HTML. This is because in this case, the HTML is probably being generated locally via Javascript, and urllib does not have a Javascript processor.

Categories