I am trying to extract the company name, address, and zip code from www.quicktransportsolutions.com. I have written the following code to crawl the site and return the information I need.
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)

trade_spider(1)
After running the code, I see the information that I want, but I am confused about how to get it to print without all of the non-pertinent information. Above the print(link), I thought that I could have link.string pull the company names, but that failed. Any suggestions?
Output:
div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">227 E 2ND</span>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
Trucks: 2 Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795 <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">
Everyone,
Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:
def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print link['href']
Output:
abbeville.php
adamsville.php
addison.php
adger.php
akron.php
alabaster.php
alberta.php
albertville.php
alexander-city.php
alexandria.php
aliceville.php
alpine.php
... # goes all the way to Z; I cut the output short for spacing.
What I'm trying to accomplish here is to pull all of the hrefs ending in city.php and write them to a file. But right now, I am stuck in an infinite loop where it keeps cycling through the URL. Any tips on how to increment it? My end goal is to create another function that feeds back into my trade_spider with www.site.com/state/city.php and then loops through all 50 states... Something to the effect of
while i < len(states, cities):
    url = "http://www.quicktransportsolutions.com/carrier/" + states + cities[i]
And then this would loop into my trade_spider function, pulling all of the information that I needed.
But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
I tried to create a crawler that would cycle through every link on the page and then, if it found content that trade_spider could crawl, write it to a file... However, that was a bit out of my skill set for now, so I'm trying this method.
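A side note on the infinite loop: as posted, i is never defined or incremented, and the while condition doesn't depend on anything the loop changes, so it can never end. Since this page only needs to be fetched once, the loop can be dropped entirely. A minimal sketch of that idea (the cities.txt file name is an assumption):

    import requests
    from bs4 import BeautifulSoup

    def crawl_state_page():
        url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        # One pass over the table is enough -- no while loop needed.
        cities = [link['href'] for link in table.find_all(href=True)]
        with open('cities.txt', 'w') as f:  # file name is an assumption
            f.write('\n'.join(cities))
        return cities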
I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for name, url, address etc:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()
            print name, link, address
            print "----"
        page += 1  # advance the counter so the loop terminates

trade_spider(1)
Prints:
2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND
Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453
Adrian, MO 64720
----
I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
Here's the structure of the HTML element I'm trying to access:
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm searching for doesn't appear; maybe because it's nested inside a div? I would appreciate a little guidance!
The price is being loaded by JavaScript, so it is not possible to scrape it using BeautifulSoup alone.
The data is loaded from an API Endpoint which takes in the course-id of the course.
Course-id of this course: 1565838
You can directly get the info from that endpoint like this.
import requests
course_id = '1565838'
url= f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
I tried your first approach several times and it works more-or-less for me, although it has returned a different number of span elements on different attempts (10 is the usual number but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>
So I have been going to a website to get NDC codes, https://ndclist.com/?s=Solifenacin, and I need the 10-digit NDC codes, but on the current webpage only 8-digit NDC codes are shown, like in the picture below.
So I click on the underlined NDC code and get this webpage.
So I copy and paste these 2 NDC codes into an Excel sheet, and repeat the process for the rest of the codes on the first webpage I've shown. But this process takes a good bit of time, and I was wondering if there was a library in Python that could copy and paste the 10-digit NDC codes for me, or store them in a list so that I could print the list once I'm finished with all the 8-digit NDC codes on the first page. Would BeautifulSoup work, or is there a better library to achieve this?
EDIT <<<<
I actually need to go another level deep, and I've been trying to figure it out, but I've been failing. Apparently the last level of the webpage is this dumb HTML table, and I only need one element of it. Here is the last webpage after you click on the 2nd-level codes.
Here is the code that I have, but it's returning a tr and a None object when I run it.
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))

        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)
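A note on why that last loop prints tr and None: soup3.findAll('tr', limit=7)[1] is a single Tag, and iterating over a Tag walks its children, where whitespace text nodes have a .name of None. If the goal is one specific cell from that row, grabbing the td directly is simpler. A rough sketch (the row index and which cell to take are assumptions, since the table layout isn't shown):

    # Assumption: the wanted value sits in the second row's first <td>.
    rows = soup3.find_all('tr', limit=7)
    target_row = rows[1]             # a single <tr> Tag
    cell = target_row.find('td')     # first cell in that row
    if cell:
        all_data.append(cell.get_text(strip=True))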
Yes, BeautifulSoup is ideal in this case. This script will print all 10-digit codes from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all codes, uncomment to print them:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
EDIT: (Version where I get the description too):
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# in all_data you have all codes:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
I'm pretty new to Python and getting to know Beautiful Soup.
So I have this problem: I need to get data from an event company, specifically contact data. They have this main table with all participant names and their locations, but to get the contact data (phone, email) you need to click on each company name in the table, which opens a new window with all the additional information. I'm looking for a way to get that info from the href and combine it with the data in the main table.
So I can get the table and all the hrefs:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
test_url = "https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09?oid=2656&lang=2"
test_data = urlopen(test_url)
test_html = test_data.read()
test_data.close()
page_soup = soup(test_html, "html.parser")
test_table = page_soup.findAll("div", {"class": "exh-table-col"})
print(test_table)
As a result I get the whole table and have this kind of info (example of one row), including name, address, and href:
<a class="flush" href="/vis/v1/en/exhibitors/aluminium2020.2661781?oid=2656&lang=2">
<h2 class="exh-table-item__name" itemprop="name">Aerospace Engineering Equipment (Suzhou) Co LTD</h2>
</a>
</div>, <div class="exh-table-col exh-table-col--address">
<span class=""><i class="fa fa-map-marker"></i> <span class="link-fix--text">Hall 9 / G57</span></span>
That's where my problem starts: I have no idea how to get the additional data from the href and combine it with the main data.
I would be very thankful for any possible solutions, or at least a tip on where I can find one.
Updating the question:
I need a table which contains the information in the following columns:
1. Name; 2. Hall; 3. PDF; 4. Phone; 5. Email.
If you collect the data by hand, to get the phone and email you need to click the appropriate link for them to show. I wanted to know if there was a way to extract the phone and email from those links and add them to the first three columns using Python.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

params = {
    "oid": "2656",
    "lang": "2"
}

def main(url):
    with requests.Session() as req:
        r = req.get(url, params=params)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = soup.select("div.exh-table-item")
        names = [name.h2.text for name in target]
        hall = [hall.span.text.strip() for hall in target]
        pdf = [pdf.select_one("a.color--darkest")['href'] for pdf in target]
        # url[:46] is the site root (https://standconstruction.messe-duesseldorf.de)
        links = [f"{url[:46]}{link.a['href']}" for link in target]
        phones = []
        emails = []
        for num, link in enumerate(links):
            print(f"Extracting {num + 1} of {len(links)}")
            r = req.get(link)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = soup.select_one("div[class^=push--bottom]")
            try:
                phone = goal.select_one("span[itemprop=telephone]").text
            except:
                phone = "N/A"
            try:
                email = goal.select_one("a[itemprop=email]").text
            except:
                email = "N/A"
            emails.append(email)
            phones.append(phone)
            sleep(1)  # be polite between detail-page requests
        df = pd.DataFrame(list(zip(names, hall, pdf, phones, emails)), columns=[
            "Name", "Hall", "PDF", "Phone", "Email"])
        print(df)
        df.to_csv("data.csv", index=False)

main("https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09")
Hey guys, so I'm trying to scrape the Indeed website, and so far everything has been working well except for the salary (lol, of course).
I am far from an expert in requests, requests_html, and bs4, and after a good hour of looking everywhere I cannot find an answer to my particular problem, so here I am...
I have shortened the code to make things easier (I will scrape more later); I have left in, as an example, a span with .text which works fine, but that one is a single span, not a span inside a span like the salary:
<div class="salarySnippet salarySnippetDemphasizeholisticSalary">
<span class="salary no-wrap">
<span class="salaryText">
1 900 € - 2 100 € par mois</span>
</span>
</div>
I'm French, so Indeed is the French version; if you want to try it yourself, enter Paris or another French city in the input:
from bs4 import BeautifulSoup
import requests as req
print("_____Indeed Job Scaper_____")
city = str(input("Enter your city name here: "))
url = ("https://www.indeed.fr/emplois?l=" + city)
u_req = req.get(url)
soup = BeautifulSoup(u_req.content, 'html.parser')
job_elems = soup.find_all('div' , class_='jobsearch-SerpJobCard')
for job_elem in job_elems:
    compagny_name = job_elem.find('span', class_="company")
    salary = job_elem.find("span", {"class": "salaryText"})
    print(compagny_name.text)
    print(salary.text)
If I do not call .text on salary, I get:
<span class="salaryText">
1 800 € - 4 000 € par mois</span>
It appears that this salaryText span does not exist for every card. So, in your loop, you'll need to check if the span exists before accessing its inner text.
Here is an updated version of your code:
from bs4 import BeautifulSoup
import requests as req
print("_____Indeed Job Scaper_____")
city = str(input("Enter your city name here: "))
url = ("https://www.indeed.fr/emplois?l=" + city)
u_req = req.get(url)
soup = BeautifulSoup(u_req.content, 'html.parser')
job_elems = soup.find_all('div' , class_='jobsearch-SerpJobCard')
for job_elem in job_elems:
    compagny_name = job_elem.find('span', class_="company")
    salary = job_elem.find("span", {"class": "salaryText"})
    print(compagny_name.text)
    # check if salary is not None before accessing its text property.
    if salary:
        print(salary.text)
Could you let me know if it works?
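One small extra, in case the leading whitespace inside the salary span is a bother: get_text(strip=True) trims it (a minor addition, not part of the original fix):

    if salary:
        print(salary.get_text(strip=True))  # e.g. '1 900 € - 2 100 € par mois'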
I've been trying to work with BeautifulSoup because I want to try and scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I have scraped some elements with success, but now I wanted to scrape a movie description and I've been struggling. The description is situated like this in the HTML:
<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>
I want to scrape the second paragraph, which seemed easy to do, but everything I tried gave me 'None' as output. I've been digging around to find an answer. In another Stack Overflow post I found that
find('p:nth-of-type(1)')
or
find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')
could do the trick, but it still gives me None as output.
Below you can find a piece of my code; it's a bit low-grade because I'm just trying things out to learn.
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description
the above code gives me this output:
$ python scrape.py
Logan
(2017)
8.1
None
I would like to learn the correct method of selecting html tags because it will be useful to know for future projects.
find_all() method looks through a tag’s descendants and retrieves
all descendants that match your filters.
You can then use the list's index to get the element you need. Index starts at 0, so 1 will give the second item.
Change the first_description to this.
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
Full code
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description
Output
Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.
Read the Documentation to learn the correct method of selecting html tags.
Also consider moving to python 3.
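For reference, a minimal sketch of the same script in Python 3 (only the print calls change, and the unused urllib2 import can be dropped):

    from bs4 import BeautifulSoup
    from requests import get

    url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')

    movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
    first_movie = movie_containers[0]

    print(first_movie.h3.a.text)           # title
    print(first_movie.h3.find('span', class_='lister-item-year text-muted unbold').text)  # year
    print(float(first_movie.strong.text))  # IMDb rating
    print(first_movie.find_all('p', {"class": "text-muted"})[1].text.strip())  # description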
Just playing around with .next_sibling I was able to get it. There's probably a more elegant way, though. At least it might give you a start/some direction.
from bs4 import BeautifulSoup
html = '''<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('div',{'class':'lister-item mode-advanced'}).text.strip()
second_p = soup.find('div',{'class':'lister-item mode-advanced'}).next_sibling.next_sibling.text.strip()
print (second_p)
Output:
paragraph I need
BeautifulSoup 4.7.1 supports :nth-child() and other CSS4 selectors.
first_description = soup.select_one('.lister-item-content p:nth-child(4)')
# or
#first_description = soup.select_one('.lister-item-content p:nth-of-type(2)')
print(first_description)