<div class="content">
<div class="container">
<div class="row pt-2">
<div class="col pe-1">
<div class="grid-cell p-2">
<a href="united-states_florida/company/met-west-commercial-lender/tom-mchugh-975">
Tom Mchugh
</a>
</div>
</div>
<div class="col ps-1">
<div class="grid-cell p-2">
Company:
<span>
<a href="united-states_florida/company/met-west-commercial-lender">
Met West Commercial Lender
</a>
</span>
</div>
</div>
</div>
My current result is not in the shape I want. I would like it to look like the following table:
Column A     Column B
Tom Mchugh   Met West Commercial Lender
There might be different approaches. Here is an elegant one.
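import pandas as pd
y = df.Name.values  # assumes the scraped values sit in one "Name" column, alternating person and company
df = pd.DataFrame({'A' : y[::2], 'B' : y[1::2]})  # even positions -> A, odd positions -> B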
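The slicing idea can be tried without pandas. This is a minimal sketch, assuming the scraped values arrive as one flat list that alternates person and company. The second pair of values and the helper name `pair_up` are hypothetical and only there for illustration.

```python
# Hypothetical flat list, alternating person and company (stand-in for df.Name.values)
values = ["Tom Mchugh", "Met West Commercial Lender", "Jane Doe", "Example Corp"]

def pair_up(flat):
    """Split an alternating flat sequence into (A, B) rows."""
    # flat[::2] takes positions 0, 2, 4, ...; flat[1::2] takes positions 1, 3, 5, ...
    return list(zip(flat[::2], flat[1::2]))

rows = pair_up(values)
print(rows)
```

If the list has an odd number of items, `zip` silently drops the last one, so it may be worth checking the length first.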
I am trying to pull the name and position of random people from Sales Navigator. Each person shows up as a card that contains all the information. I can obtain a list of the cards, but for each card I then want to get the name and title. I have tried the code below to get the information from a card; the HTML of one result follows the code.
So far, my attempts always return an error indicating that the element could not be found. How could I solve this?
def testeo(driver):
    # "lista" holds one element per result card
    lista = driver.find_elements_by_xpath("//*[contains(@class,'pv5 ph2 search-results__result-item')]")
    nombres = []
    for card in lista:
        # the leading "." keeps each lookup relative to the current card
        nombres.append((card.find_element_by_xpath(".//*[contains(@class,'result-lockup__name')]").text,
                        card.find_element_by_xpath(".//*[contains(@class,'t-14 t-bold')]").text))
    return nombres
<li class="pv5 ph2 search-results__result-item" data-scroll-into-view="urn:li:fs_salesProfile:(ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T)">
<div class="pt5 absolute search-results__select-container">
<input id="search-result-ember6830" class="small-input ember-checkbox ember-view" type="checkbox">
<label class="m0" for="search-result-ember6830">
<span class="a11y-text">
Select Jean Jongejan
</span>
</label>
</div>
<div style="" id="ember6866" class="flex full-width deferred-area ember-view"> <div class="search-results__result-container full-width pl2">
<div id="ember6981" class="ember-view"> <div id="ember6982" class="ember-view">
<article>
<h3 class="a11y-text">
Profile result – Jean Jongejan
</h3>
<section class="result-lockup">
<h4 class="a11y-text">
Profile result lockup – Jean Jongejan
</h4>
<div class="result-lockup__profile-info flex flex-column">
<div class="horizontal-person-entity-lockup-4 result-lockup__entity ml6">
<figure>
<a href="/sales/people/ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T?_ntb=ErSmZYlWS8KlI9CD0cB6Yg%3D%3D" id="ember6985" class="result-lockup__icon-link ember-view">
<div class="presence-entity--size-4 relative mr2">
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" loading="lazy" alt="Go to Jean Jongejan’s profile" id="ember6986" class="max-width max-height circle-entity-4 lazy-image ghost-person loaded ember-view">
<div class="presence-indicator presence-indicator--size-4 hidden presence-entity__indicator presence-entity__indicator--size-4" title="Reachable">
<span class="a11y-text">
Jean Jongejan is reachable
</span>
</div>
</div>
</a> </figure>
<dl>
<dt class="result-lockup__name">
<a href="/sales/people/ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T?_ntb=ErSmZYlWS8KlI9CD0cB6Yg%3D%3D" id="ember6989" class="ember-view"> Jean Jongejan
</a> </dt>
<dd class="inline-flex vertical-align-middle">
<ul class="ml1 flex align-items-center list-style-none">
<li class="mr1">
<span class="a11y-text">
3rd degree contact
</span>
<span class="label-16dp block" aria-hidden="true">
3rd
</span>
</li>
<!----><!----><!----> </ul>
</dd>
<dd class="result-lockup__highlight-keyword">
<span class="t-14 t-bold">EXT Key Account Management & Consultancy</span>
<span>
at
<span data-entity-hovercard-id="urn:li:fs_salesCompany:36314" class="result-lockup__position-company">
<a href="/sales/company/36314?_ntb=Z6Rvdg6sRMiPD6xsYlUuFQ%3D%3D" id="ember6991" class="Sans-14px-black-75%-bold ember-view"> <span aria-hidden="true">
Marimekko
</span>
<span class="a11y-text">
Go to Marimekko account page
</span>
</a> <button aria-expanded="false" aria-label="See more about Marimekko" class="entity-hovercard__a11y-trigger p0 b0" data-entity-hovercard-id="urn:li:fs_salesCompany:36314" data-entity-hovercard-trigger="click"></button>
</span>
</span>
</dd>
<dd>
<span class="t-12 t-black--light">
3 years 11 months in role and company
</span>
</dd>
<dd>
<ul class="mv1 t-12 t-black--light result-lockup__misc-list">
<li class="result-lockup__misc-item">Breda, North Brabant, Netherlands</li>
</ul>
</dd>
</dl>
</div>
<!----> </div>
<div class="result-lockup__actions flex">
<ul class="result-lockup__common-actions">
<li class="result-lockup__action-item mb3">
<div class="display-flex">
<div id="ember6993" class="ember-view"> <div id="ember6995" class="save-to-list-dropdown artdeco-dropdown artdeco-dropdown--placement-bottom artdeco-dropdown--justification-right ember-view"><button aria-expanded="false" id="ember6996" class="save-to-list-dropdown__trigger ph4 artdeco-button artdeco-button--secondary artdeco-button--pro artdeco-button--1 m-type--message artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view" type="button" tabindex="0"> Save
<!----></button><div tabindex="-1" aria-hidden="true" id="ember6997" class="save-to-list-dropdown__content-container artdeco-dropdown__content artdeco-dropdown--is-dropdown-element artdeco-dropdown__content--has-arrow artdeco-dropdown__content--arrow-right artdeco-dropdown__content--justification-right artdeco-dropdown__content--placement-bottom ember-view"><div class="artdeco-dropdown__content-inner">
<!---->
</div>
</div></div>
<div id="ember6998" class="ember-view">
<!---->
<!----></div>
</div> <div class="relative">
<div id="ember6999" class="ember-view">
<div id="ember7000" class="artdeco-dropdown artdeco-dropdown--placement-bottom artdeco-dropdown--justification-right ember-view"><button aria-expanded="false" id="ember7001" class="artdeco-dropdown__trigger result-lockup__action-button m-type--more artdeco-dropdown__trigger--non-button artdeco-dropdown__trigger--placement-bottom ember-view" type="button" tabindex="0"> <span class="a11y-text">See more actions for this result</span>
<li-icon aria-hidden="true" type="ellipsis-horizontal-icon" class="artdeco-button artdeco-button--tertiary artdeco-button--1 artdeco-button--muted p0" size="medium"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
<path d="M2 10h4v4H2v-4zm8 4h4v-4h-4v4zm8-4v4h4v-4h-4z"></path>
</svg></li-icon>
<!----></button><div tabindex="-1" aria-hidden="true" id="ember7002" class="artdeco-dropdown__content result-lockup__dropdown-more artdeco-dropdown--is-dropdown-element artdeco-dropdown__content--has-arrow artdeco-dropdown__content--arrow-right artdeco-dropdown__content--justification-right artdeco-dropdown__content--placement-bottom ember-view"><!----></div></div>
<div id="ember7003" class="ember-view"><!----></div>
<!----></div> </div>
</div>
</li>
<!----> </ul>
</div>
</section>
<section class="result-context relative pt1">
<h4 class="a11y-text">Profile result context – Jean Jongejan</h4>
<!---->
<!---->
<!----> </section>
</article>
</div>
</div> </div>
</div>
</li>
Can you try this?
//*[name()='dt'][@class='result-lockup__name']
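The per-card lookup can be checked offline before running it in a browser. The sketch below is not Selenium: it uses the standard library's `xml.etree.ElementTree` on a trimmed-down, well-formed stand-in for two cards (the second card and the function name `names_and_titles` are hypothetical). ElementTree only supports a subset of XPath, so it matches the full `class` value exactly instead of using `contains()`.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the result list; the second card is made up
SNIPPET = """
<ul>
  <li class="pv5 ph2 search-results__result-item">
    <dl>
      <dt class="result-lockup__name"><a href="#"> Jean Jongejan </a></dt>
      <dd class="result-lockup__highlight-keyword">
        <span class="t-14 t-bold">EXT Key Account Management &amp; Consultancy</span>
      </dd>
    </dl>
  </li>
  <li class="pv5 ph2 search-results__result-item">
    <dl>
      <dt class="result-lockup__name"><a href="#"> Example Person </a></dt>
      <dd class="result-lockup__highlight-keyword">
        <span class="t-14 t-bold">Example Title</span>
      </dd>
    </dl>
  </li>
</ul>
"""

def names_and_titles(markup):
    root = ET.fromstring(markup)
    results = []
    # One iteration per card; each lookup starts with "." so it stays inside that card
    for card in root.findall(".//li[@class='pv5 ph2 search-results__result-item']"):
        name = card.find(".//dt[@class='result-lockup__name']/a")
        title = card.find(".//span[@class='t-14 t-bold']")
        results.append((name.text.strip(), title.text.strip()))
    return results

print(names_and_titles(SNIPPET))
```

If the same expressions work here but still fail in the browser, the cards may not have finished rendering when the lookup runs; that is a guess, not something the thread confirms.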
I am trying to scrape a title from an h1 class, but I keep getting "None"
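import requests
from bs4 import BeautifulSoup
# URL and headers are assumed to be defined earlier in the script
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)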
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
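The thread gives no answer to this question, so what follows is only a possible first check. The sketch uses the standard library's `html.parser` (as a stand-in for BeautifulSoup) on a shortened copy of the posted source to see whether an `h1` with class `prod-name` can be found at all. The class name `H1Finder` is hypothetical.

```python
from html.parser import HTMLParser

class H1Finder(HTMLParser):
    """Collect the text of every <h1> whose class list contains a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and self.wanted_class in classes:
            self.inside = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found[-1] += data

# Shortened copy of the source shown in the question
posted_source = '''
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
</div>
'''

finder = H1Finder("prod-name")
finder.feed(posted_source)
print(finder.found)
```

Since the element is found in the posted markup, one plausible explanation for `None` and the `IndexError` is that the body returned by `requests` differs from what the browser inspector shows, for example because the page fills in that section with JavaScript. Printing `page.text` and searching it for `prod-name` would confirm or rule this out.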
The following HTML is held as a string in my code. That part is fine; what I need is a way to extract:
class="company-image company-34"
For each company-## there is also a price, found in this tag further down in the HTML:
class="small-12 medium-4 cell text-right" data-after="kr./år">1.813
I tried the following code:
for x in html:
    # html is a str, so x is a single character here and the test can never be true
    if "company-image company" in x:
        print("Oh yes")
    else:
        print("Nahh")
but it doesn't really work. My idea is to find every occurrence of "company-image company", take that whole string plus the number that follows (always two digits, ##), and then look for the next data-after="kr./år" and take the number after it. Eventually this would run in a for loop, since there are multiple companies and prices.
<app-offer-match _ngcontent-vdv-c20="" _nghost-vdv-c22="" class="ng-star-inserted">
<div _ngcontent-vdv-c22="" class="box">
<!---->
<div _ngcontent-vdv-c22="" class="line1">
<div _ngcontent-vdv-c22="" class="company-image company-34"><img _ngcontent-vdv-c22="" src="/assets/images/companies/34.svg"></div>
<div _ngcontent-vdv-c22="" class="button compare">Sammenlign </div>
</div>
<div _ngcontent-vdv-c22="" class="line2">
<div _ngcontent-vdv-c22="" class="container-button">
<div _ngcontent-vdv-c22="" class="button mini-accordion"></div>
</div>
<div _ngcontent-vdv-c22="" class="container-insurance-list">
<!---->
<div _ngcontent-vdv-c22="" class="indbo ng-star-inserted">
<div _ngcontent-vdv-c22="" class="grid-x container-product-overview">
<div _ngcontent-vdv-c22="" class="small-5 cell detail"><span _ngcontent-vdv-c22="">Indbo</span>
<!----><span _ngcontent-vdv-c22="" class="ng-star-inserted">Kongshaven 3</span>
</div>
<div _ngcontent-vdv-c22="" class="small-6 cell">
<div _ngcontent-vdv-c22="" class="grid-x price">
<div _ngcontent-vdv-c22="" class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div _ngcontent-vdv-c22="" class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</app-offer-match>
EDIT: Added desired output.
Desired output would be a pandas dataframe of:
Company Price
company-image company-34 1.813
EDIT 2:
It looks like XML only because I formatted it that way for readability. When I output it, it is of type str. Thank you.
Try this:
import lxml.html as lh
import pandas as pd
company = """[your string above]"""
doc = lh.fromstring(company)
columns = ["Company", "Price"]
rows = []
targets = doc.xpath('//div[contains(@class,"company-image company")]')
for target in targets:
    row = []
    row.append(target.attrib['class'])
    # go up to the parent "line1" div, then search the sibling divs after it for the yearly price
    price = target.xpath('../following-sibling::div//div[@data-after="kr./år"]')[0]
    row.append(price.text.strip())  # strip the trailing space in "1.813 "
    rows.append(row)
rows
pd.DataFrame(rows,columns=columns)
Output:
Company Price
0 company-image company-34 1.813
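The string-matching idea described in the question (find each "company-image company-##", then the next `data-after="kr./år"` value) can also be sketched with the standard library's `re` module. This is an alternative to the lxml answer, not a replacement for it. It assumes a shortened copy of the markup and accepts any number of digits after `company-` rather than exactly two.

```python
import re

# Shortened copy of the markup from the question, held as one string
html = '''
<div class="line1">
<div class="company-image company-34"><img src="/assets/images/companies/34.svg"></div>
</div>
<div class="grid-x price">
<div class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
</div>
'''

# Capture the class string, then lazily skip ahead to the first yearly price after it
PATTERN = re.compile(
    r'class="(company-image company-\d+)"'    # group 1: the company class
    r'.*?data-after="kr\./år">\s*([\d.,]+)',  # group 2: number after data-after="kr./år"
    re.DOTALL,
)

rows = PATTERN.findall(html)
print(rows)
```

Regular expressions are brittle against markup changes (attribute order, extra whitespace), so an HTML parser is usually the safer choice once the structure gets more complex.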
I just found out about how to process webpages in python using BeautifulSoup.
There's a list of div from which I want to get those in a specific range. The range is defined by two div that have a h2 child.
How would I do that? Thank you for your support!
EDIT: I added an actual representation of my html code below instead of a previous "simplified" version that was missing tags.
The new code shows a root div with class foo-bar-details.
Nested are 9 div tags. Two of which have a nested h2 tag. All of those 9 div tags contain img elements deeply nested within. What I need is each img element of those divs that are between the ones containing the h2 element.
An expected outcome if applied to the html code below would be:
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
This is the html code:
<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><span class="icon-help"></span>
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>
Here is a solution involving lxml.html:
We extract all divs between the first and last divs which contain an h2 tag:
import lxml.html
# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
    tree = lxml.html.fromstring(f.read())
# all_div = tree.findall('div')
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
    if div.findall('h2') and start is None:
        print("Range starts at %d" % k)
        start = k
        continue
    if div.findall('h2') and start is not None:
        print("Range stops at %d" % k)
        stop = k + 1 # add one because the end of a slice is exclusive
        continue
# div_list = all_div[start:stop]
# The slice includes the two h2 divs themselves, hence the empty lists at both ends
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]
# Or
img_list = [_.xpath('.//img/@src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
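The same range logic can be written so that the two marker divs are left out and the result is a flat list. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml, on a simplified, well-formed stand-in for the question's markup (the `<img/>` tags are self-closed here, which the real page does not do). The function name `images_between_headers` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the question's markup
MARKUP = """
<div class="foo-bar-details">
  <div><div><img src="../../images/39826_thumb.JPG"/></div></div>
  <div id="elem-4"><h2>Foo</h2></div>
  <div><div><img src="../../images/123456_thumb.jpg"/></div></div>
  <div><div><img src="../../images/67890_thumb.JPG"/></div></div>
  <div id="elem-5"><h2>Bar</h2></div>
  <div><div><img src="../../images/209876_thumb.JPG"/></div></div>
</div>
"""

def images_between_headers(markup):
    root = ET.fromstring(markup)
    children = root.findall("div")  # direct child divs only
    # positions of the child divs that have an <h2> as a direct child
    marks = [i for i, d in enumerate(children) if d.find("h2") is not None]
    if len(marks) < 2:
        return []
    inner = children[marks[0] + 1 : marks[-1]]  # strictly between first and last marker
    # collect every nested <img> in document order
    return [img.get("src") for d in inner for img in d.iter("img")]

print(images_between_headers(MARKUP))
```

For real-world HTML that is not well-formed XML, the same slicing would need a forgiving parser such as lxml.html or BeautifulSoup.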
Another solution involving SimplifiedDoc:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div class="foo-bar-details">
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">Test 1</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-1">Test 2</div>
<div class="padding-y-10 padding-x-40 " id="foo-feat-4-2">Test 3</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-3">Test 4</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.select('div.foo-bar-details').divs.contains('<h2')
print ([div.id for div in divs])
divs = doc.select('div.foo-bar-details').divs.notContains('<h2')
print ([div.id for div in divs])
Result:
['elem-4', 'elem-5']
['info-panel-header', 'foo-feat-4-1', 'foo-feat-4-2', 'foo-feat-4-3']
The SimplifiedDoc library does not depend on third-party libraries, which keeps it lightweight and may make it a good fit for beginners.
More examples were linked in the original answer.
If I understand you correctly, you want to find the <img> tags and the corresponding <h2> that each image belongs to.
This example (the txt variable contains the HTML snippet from your question):
from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(txt, 'html.parser')
out = {}
# every <img> inside a div that follows a sibling div containing an <h2>
for img in soup.select('div:has(h2) ~ div img'):
    # group each image under the nearest preceding <h2>
    out.setdefault(img.find_previous('h2').get_text(strip=True), []).append(img['src'])
pprint(out)
Prints:
{'Bar': ['../../images/39826_thumb.JPG', '../../images/209876_thumb.JPG'],
'Foo': ['../../images/123456_thumb.jpg', '../../images/67890_thumb.JPG']}
I just came across lxml in Python and I need some help, as I have no experience with XPath.
I want to get text data from a webpage into a dictionary.
I'm referring to the html snippet I posted below. Within the original html page there's a div element of the class general-info that I retrieve using the following line:
general_info = document_tree.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]")
From here on I want to iterate over the nested divs and get the 2 <p> tags as key and value. The text inside the <strong> being the key.
There can also be empty div tags and there can be a special case where the key and the value for the dictionary can be within the same div (see the last element).
EDIT:
The number of elements can change, so it would be best to use the <strong> tags as starting point and then search for the next <p> tag.
This is code that I was able to write using BeautifulSoup:
generalinfo = documentSoup.findAll("div", {"class": "general-info"})
details_dict = {}  # added here so the snippet runs on its own
if generalinfo:
    strongs = generalinfo[0].find_all('strong')
    for descr in strongs:
        # special case: the value <p> is a sibling of the <strong>
        p = descr.find_next_sibling("p")
        if p:
            key = descr.text.strip().rstrip(':')
            details_dict[key] = p.text.strip()
        else:
            # usual case: the value sits in the next column div
            nextdiv = descr.parent.parent.find_next_sibling("div")
            if nextdiv:
                child = nextdiv.findChild()
                if child:
                    key = descr.text.strip()[:-1]
                    details_dict[key] = child.text.strip()
I am going for the following output:
{'Title:' : 'This is a title',
'Owner:' : 'This is an owner',
'Category:' : 'This is a category',
'Type:' : 'This is a type',
'Special case:' : 'This is a special case'}
If anyone can help me out here I will appreciate this!
html code:
<body>
<main>
<div>
...
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
...
I believe this is about as generalized as I can get given the html provided:
general_info = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//p[@class='margin-0']")
topic = None
for i in general_info:
    if len(i.xpath('./strong/text()')) > 0:
        # this <p> holds the key
        topic = i.xpath('./strong/text()')[0]
    if i.text and len(i.text.strip()) > 0:
        # this <p> holds the value for the last key seen
        print(topic + ' ' + i.text.replace('\n', '').strip())
# the special case has <strong> and <p> as siblings inside the same div
special = general_info[0].xpath('./ancestor::div[@class="general-info margin-bottom-20 margin-top-20"]//div/div/strong')[0]
print(special.text + " ", special.xpath('./following-sibling::p/text()')[0])
Output:
('Title: This is a title',
'Owner: This is an owner',
'Category: This is a category',
'Type: This is a type',
'Special case: This is a special case')
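The suggestion in the question's edit (start from each `<strong>` and take the next `<p>` that follows it) can also be sketched with the standard library alone. This is not the lxml answer above: it uses `xml.etree.ElementTree` on a shortened, well-formed copy of the markup, and the function name `strong_to_dict` is hypothetical. It builds the dictionary the question asks for and treats the special case the same way as the regular rows.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup
MARKUP = """
<div class="general-info margin-bottom-20 margin-top-20">
  <div class="row padding-x-20">
    <div class="col-sm-4"><p class="margin-0"><strong>Title:</strong></p></div>
    <div class="col-sm-8"><p class="margin-0">This is a title</p></div>
  </div>
  <h2 class="h3">Validity</h2>
  <div class="row padding-x-40"></div>
  <div class="row padding-x-40">
    <div><strong>Special case:</strong><p>This is a special case</p></div>
  </div>
</div>
"""

def strong_to_dict(markup):
    details = {}
    pending_key = None
    # iter() walks elements in document order, so the first <p> with text
    # after a <strong> is taken as that key's value
    for el in ET.fromstring(markup).iter():
        text = (el.text or "").strip()
        if el.tag == "strong":
            pending_key = text
        elif el.tag == "p" and text and pending_key is not None:
            details[pending_key] = text
            pending_key = None
    return details

print(strong_to_dict(MARKUP))
```

Empty divs and the `<h2>` are skipped naturally because they are neither `<strong>` nor a `<p>` with text. A `<p>` that only wraps a `<strong>` is skipped because its own text is empty.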
I recommend another solution, which is very suitable for extracting data from XML.
from simplified_scrapy.spider import SimplifiedDoc
html='''
<body>
<main>
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
'''
data={}
doc = SimplifiedDoc(html) # create doc
divs = doc.selects('div.general-info')
# First way
for div in divs:
    strongs = div.strongs
    for strong in strongs:
        p = strong.next
        if not p:
            p=strong.parent.next
        data[strong.text]=p.text
print(data)
data={}
# Second way
for div in divs:
    ds = div.selects('strong|p>text()')
    for i in range(0,len(ds),2):
        data[ds[i]]=ds[i+1]
print(data)
Result:
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/