I’m trying to extract data from this webpage and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here's a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits.
            # link.string provides the string that appears to be hyperlinked;
            # in this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the <div> element is what's throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL, if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    # guard against pages that have no PubMed link at all
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first <a> tag that matches the target URL (the one ending in digits), then check whether 'PMID' is in its previous element.
This site is really inconsistent, and I tried many times; hope this helps.
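For reference, the text-based search the question asks about ("any element with the text 'PMID'") is also workable: find the string itself, then walk forward in document order. A minimal sketch, reusing soup from the loop above and assuming the PubMed link follows the "PMID:" text as in both HTML samples:

# search for the text node itself rather than a specific tag
pmid_text = soup.find(text=re.compile(r'PMID'))
# find_next('a') continues the search from that point in document order
link = pmid_text.find_next('a') if pmid_text else None
if link:
    print(link.string, link.get('href'))
else:
    print("Not available")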
The webpage I'm scraping has paragraphs and headings structured this way:
<p>
<strong>
<a href="https://dummy.com" class="">This is a link heading
</strong>
</p>
<p>
Content To Be Pulled
</p>
I wrote the following code to pull the link heading's content:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
The next part is confusing me because the text I want to collect next is in the <p> tag after the <p> tag which contains the link. I cannot use .next_sibling here because it is outside of the parent <p> tag.
How do I choose the following paragraph given that the <p> before it contained a link?
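One direct way, given that structure, is to climb from each matched <a> to its enclosing <p>, then take that paragraph's next <p> sibling. A minimal, self-contained sketch against the snippet above (the #pcl-full-content wrapper is assumed from the selector in the question):

from bs4 import BeautifulSoup

html = '''<div id="pcl-full-content">
<p><strong><a href="https://dummy.com">This is a link heading</a></strong></p>
<p>Content To Be Pulled</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

for anchor in soup.select('#pcl-full-content > p > strong > a'):
    heading_p = anchor.find_parent('p')           # climb out to the enclosing <p>
    content_p = heading_p.find_next_sibling('p')  # the paragraph right after it
    if content_p is not None:
        print(anchor.get_text(strip=True), '->', content_p.get_text(strip=True))

This prints "This is a link heading -> Content To Be Pulled".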
One way seems to be to extract it from a script tag, though you will need to split the text by horoscope:
import requests, re, json
r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:
import requests, re, json
r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])
signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA', 'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
p = re.compile('|'.join(signs))
signs = p.findall(data['articleBody'])
for number, sign in enumerate(signs):
    if number < len(signs) - 1:
        print(re.search(f'({sign}.*?){signs[number + 1]}', data['articleBody']).group(1))
    else:
        print(re.search(f'({sign}.*)', data['articleBody']).group(1))
I am attempting to scrape data from a website that uses non-specific span classes to format/display content. The pages present information about chemical products and each product is described within a single div class.
I first parsed by that div class and am working to pull the data I need from there. I have been able to get many things, but the parts I can't seem to pull are within the span class "ppisreportspan".
If you look at the code, you will note that it appears multiple times within each chemical description.
<tr>
<td><h4 id='stateprod'>MAINE STATE PRODUCT REPORT</h4><hr class='report'><span style="color:Maroon;" Class="subtitle">Company Number: </span><span style='color:black;' Class="subtitle">38</span><br /><span Class="subtitle">MONSANTO COMPANY <br/>800 N. LINDBERGH BOULEVARD <br/>MAIL STOP FF4B <br/>ST LOUIS MO 63167-0001<br/></span><br/><span style="color:Maroon;" Class="subtitle">Number of Currently Registered Products: </span><span style='color:black; font-size:14px' class="subtitle">80</span><br /><br/><p class='noprint'><img alt='' src='images/epalogo.png' /> View the label in the US EPA Pesticide Product Label System (PPLS).<br /><img alt='' src='images/alstar.png' /> View the label in the Accepted Labels State Tracking and Repository (ALSTAR).<br /></p>
<hr class='report'>
<div class='nopgbrk'>
<span class='ppisreportspanprodname'>PRECEPT INSECTICIDE </span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:100-1075" target='blank'>100-1075-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/></a>
<span class='line-break'></span>
<span class=ppisProd>ME Product Number: </span>
<**span class="ppisreportspan"**>2014000996</span>
<br />Registration Year: <**span class="ppisreportspan"**>2019</span>
Type: <span class="ppisreportspan">RESTRICTED</span><br/><br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">3.0000</span></td>
<td><span class="ppisreportspan">Tefluthrin (128912)</span></td>
</tr>
</table><hr />
</div>
<div class='nopgbrk'>
<span class='ppisreportspanprodname' >ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN</span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:264-789" target='blank'>264-789-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/>
</a><span class='line-break'></span>
<span class=ppisProd>ME Product Number: <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span>
<img alt='ALSTAR Link' src='images/alstar.png'/></a>
<br />Registration Year: <span class="ppisreportspan">2019</span>
<br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">48.0000</span></td>
<td><span class="ppisreportspan">Clothianidin (44309)</span></td>
</tr>
</table><hr />
</div>
This sample includes two chemicals. One has an "alstar" ID and link and one does not. Both have registration years. Those are the data points that are hanging me up.
You may also note that there is a 10-digit code stored in "ppisreportspan" in the first example. I was able to extract that as part of the "ppisProd" span for any record that doesn't have the ALSTAR link. I don't understand why, but it reinforces the point that my parsing process seems to ignore that span class.
I have tried various methods for the last two days based on all kinds of different answers on SO, so I can't possibly list them all. I seem to be able to either get everything from the first span to the end of the last span, or I get "NoneType" errors or empty lists.
This one gets the closest:
It returns the correct spans for many div chunks, but it still skips (returns a blank list []) any of the ones that have ALSTAR links, like the second one in the example.
[picture showing data and then a series of three sets of empty brackets where the data should be]
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
url = input('Enter URL:')
hand = open(url)
soup = BeautifulSoup(hand, 'html.parser')
#create a list of chunks by product (div)
products = soup.find_all('div' , class_ ='nopgbrk')
print(type(products))
print(len(products))
tempalstars =[]
rptspanclasses = []
regyears = []
alstarIDs = []
asltrlinks = []
# read the span tags
for product in products:
    tempalstar = product.find_all('span', class_="ppisreportspan")
    tempalstars.append(tempalstar)
    print(tempalstar)
Ultimately, I want to be able to select the text for the year as well as the ALSTAR link out of these span statements for each div chunk, but I will cross that bridge when I can get the code to find all the instances of that class.
Alternatively, is there some easier way I can get the Registration Year and the ALSTAR link (e.g. <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span> <img alt='ALSTAR Link' src='images/alstar.png'/></a>) rather than what I am trying to do?
I am using Python 3.7.2. Thank you!
I managed to get some data from this site. All you need to know is the company number; in the case of Monsanto, the number is 38 (this number is shown after selecting Maine and typing monsanto in the search box):
import re
import requests
from bs4 import BeautifulSoup
url_1 = 'http://npirspublic.ceris.purdue.edu/state/state_menu.aspx?state=ME'
url_2 = 'http://npirspublic.ceris.purdue.edu/state/company.aspx'
company_name = 'monsanto'
company_number = '38'
with requests.session() as s:
    r = s.get(url_1)
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    data['ctl00$ContentPlaceHolder1$search'] = 'company'
    data['ctl00$ContentPlaceHolder1$TextBoxInput1'] = company_name

    r = s.post(url_1, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    data = {k: v for k, v in data.items() if not k.startswith('ctl00$ContentPlaceHolder1$')}
    data['ctl00$ContentPlaceHolder1${}'.format(company_number)] = 'Display+Products'

    r = s.post(url_2, data=data)
    soup = BeautifulSoup(r.text, 'lxml')

    for div in soup.select('.nopgbrk'):
        # extract name
        print(div.select_one('.ppisreportspanprodname').text)

        # extract ME product number
        s = ''.join(re.findall(r'\d{10}', div.text))
        print(s)

        # extract ALSTAR link
        s = div.select_one('a[href*="alstar_label.aspx"]')
        if s:
            print(s['href'])
        else:
            print('No ALSTAR link')

        # extract Registration Year
        s = div.find(text=lambda t: 'Registration Year:' in t)
        if s:
            print(s.next.text)
        else:
            print('No registration year.')

        print('-' * 80)
Prints:
PRECEPT INSECTICIDE
2014000996
No ALSTAR link
2019
--------------------------------------------------------------------------------
ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN
2009005053
alstar_label.aspx?LabelId=117531
2019
--------------------------------------------------------------------------------
ACCELERON D-342 FUNGICIDE SEED TREATMENT
2015000498
alstar_label.aspx?LabelId=117538
2019
--------------------------------------------------------------------------------
ACCELERON DX-309
2009005026
alstar_label.aspx?LabelId=117559
2019
--------------------------------------------------------------------------------
... and so on.
My task is to automate printing the Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek) to extract the infobox section from the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have done this:
from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)
This gives me everything from the info box. A snippet is shown below:
[<tr><th class="summary" colspan="2" style="text-align:center;font-
size:125%;font-weight:bold;font-style: italic; background: lavender;">
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"
I want to extract the data only and print it on screen. So what I want is:
Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s)
List of reference books
List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s)
Star Trek: The Magazine
Star Trek Magazine
And so on till the end of the infobox. So basically, I need a way of printing every row of the infobox data so I can automate it for any wiki page. (The class of the infobox table on all wiki pages is 'infobox vevent', as shown in the code.)
This page should help you parse your HTML as a simple string without the HTML tags: Using BeautifulSoup Extract Text without Tags.
This is code from that page; it belongs to @0605002:
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Using BeautifulSoup, you need to reformat the data as you want. Use fresult = [e.text for e in results] to get the text of each result.
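For the infobox specifically, a minimal sketch along those lines might look like this (assuming the 'infobox vevent' class from the question):

from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Star_Trek')
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table', attrs={'class': 'infobox vevent'})
for row in table.find_all('tr'):
    # get_text() drops the markup; the separator keeps label and value apart
    print(row.get_text(' ', strip=True))

This prints one line per table row, e.g. "Created by Gene Roddenberry".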
If you want to read an HTML table you can try some code like this, though this one uses pandas.
import pandas
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x][0]
    second = data.iloc[x][1] if not null.iloc[x][1] else ""
    print(first, second, "\n")
Some of the words in the output are split when running this code. For example, the word "tolerances" is split into "tole rances". I looked at the HTML source and it seems that's how the page was created.
There are also many other places where words are split. How do I recombine them before writing to text?
import requests, codecs
from bs4 import BeautifulSoup
from bs4.element import Comment
path='C:\\Users\\jason\\Google Drive\\python\\'
def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True
ticker = 'TSLA'
quarter = '18Q2'
mark1= 'ITEM 1A'
mark2= 'UNREGISTERED SALES'
url_new='https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'
def get_text(url, mark1, mark2):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    for hr in soup.select('hr'):
        hr.find_previous('p').extract()
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text = u" ".join(t.strip() for t in visible_texts)
    return text[text.find(mark1): text.find(mark2)]
text = get_text(url_new,mark1,mark2)
file=codecs.open(path + "test.txt", 'w', encoding='utf8')
file.write (text)
file.close()
You are dealing with HTML formatted with Microsoft Word. Don't extract text and try to process it without that context.
The section you want to process is clearly delineated with <a name="..."> tags. Let's start with selecting all elements from the <a name="ITEM_1A_RISK_FACTORS"> marker, all the way up to but not including the <a name="ITEM2_UNREGISTERED_SALES"> marker:
def sec_section(soup, item_name):
    """iterate over SEC document paragraphs for the section named item_name

    Item name must be a link target, starting with ITEM
    """
    # ask BS4 to find the section
    elem = soup.select_one('a[name={}]'.format(item_name))
    # scan up to the parent text element
    # html.parser does not support <text> but lxml does
    while elem.parent is not soup and elem.parent.name != 'text':
        elem = elem.parent
    yield elem
    # now we can yield all next siblings until we find one that
    # also contains an a[name^=ITEM] element:
    for elem in elem.next_siblings:
        if not isinstance(elem, str) and elem.select_one('a[name^=ITEM]'):
            return
        yield elem
This function gives us all child nodes from the <text> node in the HTML document that start at a paragraph containing a specific link target, all the way through to the next link target that names an ITEM.
Next, the usual Word cleanup task is to remove <font> tags and style attributes:
def clean_word(elem):
    if isinstance(elem, str):
        return elem
    # remove last-rendered break markers, non-rendering but messy
    for lastbrk in elem.select('a[name^=_AEIOULastRenderedPageBreakAEIOU]'):
        lastbrk.decompose()
    # remove font tags and styling from the document, leaving only the contents
    if 'style' in elem.attrs:
        del elem.attrs['style']
    for e in elem:  # recursively do the same for all child nodes
        clean_word(e)
    if elem.name == 'font':
        elem = elem.unwrap()
    return elem
The Tag.unwrap() method is what'll most help your case, as the text is divided up almost arbitrarily by <font> tags.
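As a toy illustration of what unwrap() does to the split words (the principle, not the SEC document itself):

from bs4 import BeautifulSoup

snippet = BeautifulSoup('<p><font>tole</font><font>rances</font></p>', 'html.parser')
for font in snippet.find_all('font'):
    font.unwrap()          # remove the tag but keep its contents in place
print(snippet.get_text())  # -> tolerances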
Now it's suddenly trivial to extract the text cleanly:
for elem in sec_section(soup, 'ITEM_1A_RISK_FACTORS'):
    clean_word(elem)
    if not isinstance(elem, str):
        elem = elem.get_text(strip=True)
    print(elem)
This outputs, among the rest of the text:
•that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;
The text is now properly joined up, no re-combining required any more.
The whole section is still in a table, but clean_word() has now cleaned it up to the much more reasonable:
<div align="left">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td valign="top">
<p> </p></td>
<td valign="top">
<p>•</p></td>
<td valign="top">
<p>that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;</p></td></tr></table></div>
so you can use smarter text extraction techniques to further ensure a clean text conversion here; you could convert such bullet tables to a * prefix, for example:
def convert_word_bullets(soup, text_bullet="*"):
    for table in soup.select('div[align=left] > table'):
        div = table.parent
        bullet = div.find(string='\u2022')
        if bullet is None:
            # not a bullet table, skip
            continue
        text_cell = bullet.find_next('td')
        div.clear()
        div.append(text_bullet + ' ')
        for i, elem in enumerate(text_cell.contents[:]):
            if i == 0 and elem == '\n':
                continue  # no need to include the first linebreak
            div.append(elem.extract())
In addition, you probably want to skip the page breaks too (a combination of <p>[page number]</p> and <hr/> elements):
for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()
This is more explicit than your own version, where you removed all <hr/> elements and their preceding <p> elements regardless of whether they were actually siblings.
Execute both before cleaning up your Word HTML. Combined with your function, that together becomes:
import os

def get_text(url, item_name):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
        pagebrk.find_previous_sibling('p').decompose()
        pagebrk.decompose()
    convert_word_bullets(soup)
    cleaned_section = map(clean_word, sec_section(soup, item_name))
    return ''.join([
        elem.get_text(strip=True) if elem.name else elem
        for elem in cleaned_section])

text = get_text(url_new, 'ITEM_1A_RISK_FACTORS')
with open(os.path.join(path, 'test.txt'), 'w', encoding='utf8') as f:
    f.write(text)
This page's markup is really bad. You will need to remove the excess tags to fix your issue. Luckily for you, BeautifulSoup can do the heavy lifting. The code below will remove all font tags.
soup = BeautifulSoup(html.text, 'html.parser')
for font in soup.find_all('font'):
    font.unwrap()
I am trying to find an ID in a div class which has multiple values, using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div", {"class": "size"})
size = sizes[0]["data-test5-uk"]
size.text is from a GET request to the site with the HTML; however, it returns:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation and then the solution.
.find_all('tag') is used to find all instances of that tag and we can later loop through them.
.find('tag') is used to find the ONLY first instance.
We can extract the value of an attribute either with ['arg'] or with .get('arg'); they are the SAME.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')

one_div = soup.find('div', class_='size ')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't in the <a> tag;
# we have found the tag that contains the attribute with .find('a')['data-test5-uk']

# for multiple divs
for each in soup.find_all('div', class_='size '):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why didn't your ['arg'] work? You tried to extract ["data-test5-uk"] from the div, but <div class="size "> has no attribute like that; its only attribute is class="size ".
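As a side note, .get() is the forgiving variant when an attribute might be missing; a small sketch, reusing the soup parsed above:

a_tag = soup.find('a', class_='selectSize')
print(a_tag.get('data-test5-uk'))  # '7'
print(a_tag.get('data-missing'))   # None, instead of raising a KeyError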