Extract webpage elements up to a point - python

I want to extract elements from a webpage up to a certain point: the line <div class="clear"></div>. This line appears twice on the webpage I am trying to scrape, so I want to extract all the elements before the first occurrence and then stop.
For example:
import requests
from bs4 import BeautifulSoup

hhref = ['https://www.ukfirestations.co.uk/stations/bedfordshire',
         'https://www.ukfirestations.co.uk/stations/buckinghamshire']

dats = []
for i in range(0, 2, 1):
    r = requests.get(hhref[i], headers=headers)
    soup = BeautifulSoup(r.content)
    station = soup.find('div', {'id': 'stations-grid'}).find_all('a')
    for j in station:
        dats.append(j['href'])
This extracts everything, including the elements after <div class="clear"></div>. The webpage splits the stations into 'current' and 'old'. I want to grab only those in the 'current' section, but I am unsure how to tell BeautifulSoup to stop extracting at a given point.

You can use a CSS selector with :not to filter out the unwanted stations:
import requests
from bs4 import BeautifulSoup

hhref = [
    "https://www.ukfirestations.co.uk/stations/bedfordshire",
    "https://www.ukfirestations.co.uk/stations/buckinghamshire",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

for link in hhref:
    print(f"{link=}")
    print()

    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    stations = soup.select(
        'h2:-soup-contains("Current Stations") ~ .stations-row .station:not(h2:-soup-contains("Old Stations") ~ .stations-row .station)'
    )
    for s in stations:
        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )
    print()
print()
Prints:
link='https://www.ukfirestations.co.uk/stations/bedfordshire'
Ampthill https://www.ukfirestations.co.uk/station/ampthill
Bedford https://www.ukfirestations.co.uk/station/bedford
Biggleswade https://www.ukfirestations.co.uk/station/biggleswade
Dunstable https://www.ukfirestations.co.uk/station/dunstable
Harrold https://www.ukfirestations.co.uk/station/harrold
Headquarters https://www.ukfirestations.co.uk/station/headquarters-10
Kempston https://www.ukfirestations.co.uk/station/kempston
Leighton Buzzard https://www.ukfirestations.co.uk/station/leighton-buzzard
Luton https://www.ukfirestations.co.uk/station/luton
Potton https://www.ukfirestations.co.uk/station/potton
Sandy https://www.ukfirestations.co.uk/station/sandy
Shefford https://www.ukfirestations.co.uk/station/shefford
Stopsley https://www.ukfirestations.co.uk/station/stopsley
Toddington https://www.ukfirestations.co.uk/station/toddington
Woburn https://www.ukfirestations.co.uk/station/woburn

link='https://www.ukfirestations.co.uk/stations/buckinghamshire'
Amersham https://www.ukfirestations.co.uk/station/amersham
Aylesbury & HQ https://www.ukfirestations.co.uk/station/aylesbury-hq
Beaconsfield https://www.ukfirestations.co.uk/station/beaconsfield
Brill https://www.ukfirestations.co.uk/station/brill
Broughton https://www.ukfirestations.co.uk/station/broughton
Buckingham https://www.ukfirestations.co.uk/station/buckingham
Chesham https://www.ukfirestations.co.uk/station/chesham
Gerrards Cross https://www.ukfirestations.co.uk/station/gerrards-cross
Great Missenden https://www.ukfirestations.co.uk/station/great-missenden
Haddenham https://www.ukfirestations.co.uk/station/haddenham
High Wycombe https://www.ukfirestations.co.uk/station/high-wycombe
Marlow https://www.ukfirestations.co.uk/station/marlow
Newport Pagnell https://www.ukfirestations.co.uk/station/newport-pagnell
Olney https://www.ukfirestations.co.uk/station/olney
Princes Risborough https://www.ukfirestations.co.uk/station/princes-risborough
Stokenchurch https://www.ukfirestations.co.uk/station/stokenchurch
Waddesdon https://www.ukfirestations.co.uk/station/waddesdon
West Ashland https://www.ukfirestations.co.uk/station/milton-keynes
Winslow https://www.ukfirestations.co.uk/station/winslow

Or use .find_previous() to check whether you're in the correct section:
for link in hhref:
    print(f"{link=}")
    print()

    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    for s in soup.select(".station"):
        h2 = s.find_previous("h2")
        if h2.get_text(strip=True) != "Current Stations":
            continue
        print(
            "{:<30} {}".format(s.select_one(".station-name").text, s.a["href"])
        )
    print()
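If you prefer the loop-and-break idea from the question, a third option is to walk the children of the stations grid in document order and stop at the first <div class="clear">. This is a minimal sketch, assuming the divider really is a direct child of #stations-grid separating the "current" markup from the "old" markup:

import requests
from bs4 import BeautifulSoup

r = requests.get(hhref[0], headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

dats = []
for child in soup.find("div", {"id": "stations-grid"}).children:
    if getattr(child, "name", None) is None:
        continue  # skip bare text nodes between tags
    if child.name == "div" and "clear" in (child.get("class") or []):
        break  # first divider reached: everything after it is the "old" section
    dats.extend(a["href"] for a in child.find_all("a"))
print(dats)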

How do I get rid of part of an object / dictionary?

I'm scraping a site for data using beautifulsoup4, and I'm not sure how to limit the output to just the data I want; an unwanted object keeps appearing, and I've failed to get rid of it.
import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
url = "https://elitejobstoday.com/job-category/education-jobs-in-uganda/"
r = requests.get(url, headers=headers)
c = r.content
soup = BeautifulSoup(c, "html.parser")
table = soup.find("div", attrs={"article": "loadmore-item"})

def jobScan(link):
    the_job = {}
    job = requests.get(url, headers=headers)
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "html.parser")
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title
    print('The job is: {}'.format(title))
    print(the_job)
    return the_job

jobScan(table)
This is the result it fetches:
PS C:\Users\MUHUMUZA IVAN\Desktop\JobPortal> py absa.py
The job is: 25 Credit Officers (Group lending) at ENCOT Microfinance Ltd
{'urlLink': 'https://elitejobstoday.com/job-category/education-jobs-in-uganda/', 'title': '25 Credit Officers (Group lending) at ENCOT Microfinance Ltd'}
I want to be able to retain "The job is: 25 Credit Officers (Group lending) at ENCOT Microfinance Ltd" and drop "{'urlLink': 'https://elitejobstoday.com/job-category/education-jobs-in-uganda/', 'title': '25 Credit Officers (Group lending) at ENCOT Microfinance Ltd'}"
If you just want the desired output to be printed, you don't need the dictionary or any return; just print the title and remove the second print.
import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
url = "https://elitejobstoday.com/job-category/education-jobs-in-uganda/"
r = requests.get(url, headers=headers)
c = r.content
soup = BeautifulSoup(c, "html.parser")
table = soup.find("div", attrs={"article": "loadmore-item"})

def jobScan(link):
    job = requests.get(url, headers=headers)
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "html.parser")
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    print('The job is: {}'.format(title))

jobScan(table)
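If you do want to keep the dictionary around (say, to collect the jobs into a list for a CSV later), keep the return but drop the second print, and let the caller decide what to do with the result. A sketch of that variant:

def jobScan(link):
    the_job = {}
    job = requests.get(url, headers=headers)
    jobSoup = BeautifulSoup(job.content, "html.parser")
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    the_job['title'] = name.a.text
    print('The job is: {}'.format(the_job['title']))
    return the_job  # returned, not printed

job_data = jobScan(table)  # the dict is available here without being printed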

Extracting company name and other information inside all urls present in a webpage using beautifulsoup

<li>
  <strong>Company Name</strong>
  ":"
  <span itemprop="name">PT ERA MURNI BUSANA</span>
</li>
In the above HTML code, I am trying to extract the company name, which is PT ERA MURNI BUSANA.
If I use a single test link, I can get the name using the one-line code I wrote:
soup.find_all("span",attrs={"itemprop":"name"})[3].get_text()
But I want to extract this information from all such pages linked from a single web page. So I wrote a for loop, but it fails to fetch the right details. I am pasting the part of the code that I have been trying, which needs some modification.
Code:
for link in supplierlinks:  # links have been extracted and merged with the base url
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    companyname = soup.find_all("span", attrs={"itemprop": "name"})[2].get_text()
Output looks like:
{'Company Name': 'AIRINDO SAKTI GARMENT PT'}
{'Company Name': 'Garments'}
{'Company Name': 'Garments'}
Instead of 'Garments' popping up in the output, I need the company name. How do I modify the code within the for loop?
Link: https://idn.bizdirlib.com/node/5290
Try this code:
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
r = requests.get('https://idn.bizdirlib.com/node/5290', headers=headers).text
soup = BeautifulSoup(r, 'html5lib')
print(soup.find_all("span", attrs={"itemprop": "name"})[-1].get_text())

div = soup.find('div', class_="content clearfix")
li_tags = div.div.find_all('fieldset')[1].find_all('div')[-1].ul.find_all('li')
supplierlinks = []
for li in li_tags:
    try:
        supplierlinks.append("https://idn.bizdirlib.com/" + li.a['href'])
    except:
        pass

for link in supplierlinks:
    r = requests.get(link, headers=headers).text
    soup = BeautifulSoup(r, 'html5lib')
    print(soup.find_all("span", attrs={"itemprop": "name"})[-1].get_text())
Output:
PT ERA MURNI BUSANA
PT ELKA SURYA ABADI
PT EMPANG BESAR MAKMUR
PT EMS
PT ENERON
PT ENPE JAYA
PT ERIDANI TOUR AND TRAVEL
PT EURO ASIA TRADE & INDUSTRY
PT EUROKARS CHRISDECO UTAMA
PT EVERAGE VALVES METAL
PT EVICO
This code prints the company names for all the links on the page.
You can select the sibling element of the <strong> element that contains the text "Company Name" (also, don't forget to set the User-Agent HTTP header):
import requests
from bs4 import BeautifulSoup
url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )
Prints:
PT ERA MURNI BUSANA
EDIT: To get the contact person:
import requests
from bs4 import BeautifulSoup
url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )
print( soup.select_one('strong:contains("Contact") + *').text )
Prints:
PT ERA MURNI BUSANA
Mr. Yohan Kustanto
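The same sibling-selector pattern generalizes to any labelled field on the page. A hypothetical helper (only "Company Name" and "Contact" are confirmed labels here; any other label is an assumption about the page):

def labeled_field(soup, label):
    # find the <strong> holding the label, then take the text of the element after it
    el = soup.select_one('strong:contains("{}") + *'.format(label))
    return el.get_text(strip=True) if el else None

print(labeled_field(soup, "Company Name"))  # PT ERA MURNI BUSANA
print(labeled_field(soup, "Contact"))       # Mr. Yohan Kustanto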

Unable to expand "More..." - python

I can scrape all the reviews from the web page, but I am not getting the full content, only half of each review. I need to scrape the full content.
from bs4 import BeautifulSoup
import requests
import re
s = requests.Session()

def get_soup(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    r = s.get(url, headers=headers)

    #with open('temp.html', 'wb') as f:
    #    f.write(r.content)
    #webbrowser.open('temp.html')

    if r.status_code != 200:
        print('status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def parse(url, response):
    if not response:
        print('no response:', url)
        return

    # get number of reviews
    # num_reviews = response.find('span', class_='reviews_header_count').text
    # num_reviews = num_reviews[1:-1]  # remove `( )`
    # num_reviews = num_reviews.replace(',', '')  # remove `,`
    # num_reviews = int(num_reviews)
    num_reviews = 20  # hardcoded for now
    print('num_reviews:', num_reviews, type(num_reviews))

    # create template for urls to pages with reviews
    url = url.replace('Hilton_New_York_Grand_Central-New_York_City_New_York.html',
                      'or{}-Hilton_New_York_Grand_Central-New_York_City_New_York.html')
    print('template:', url)

    # request every page of reviews (5 reviews per page)
    for offset in range(0, num_reviews, 5):
        url_ = url.format(offset)
        print('url:', url_)
        parse_reviews(url_, get_soup(url_))
        #return  # for test only - to stop after first page

def parse_reviews(url, response):
    print('review:', url)
    if not response:
        print('no response:', url)
        return

    for idx, review in enumerate(response.find_all('div', class_='review-container')):
        item = {
            'hotel_name': response.find('h1', class_='heading_title').text,
            'review_title': review.find('span', class_='noQuotes').text,
            'review_body': review.find('p', class_='partial_entry').text,
            'review_date': review.find('span', class_='relativeDate')['title'],
            # 'num_reviews_reviewer': review.find('span', class_='badgetext').text,
            'reviewer_name': review.find('span', class_='scrname').text,
            'bubble_rating': review.select_one('div.reviewItemInline span.ui_bubble_rating')['class'][1][7:],
        }
        #~ yield item
        results.append(item)

        for key, val in item.items():
            print(key, ':', val)
        print('----')
        #return  # for test only - to stop after first review

start_urls = [
    'https://www.tripadvisor.in/Hotel_Review-g60763-d93339-Reviews-Hilton_New_York_Grand_Central-New_York_City_New_York.html',
    #'https://www.tripadvisor.com/Hotel_Review-g60795-d102542-Reviews-Courtyard_Philadelphia_Airport-Philadelphia_Pennsylvania.html',
    #'https://www.tripadvisor.com/Hotel_Review-g60795-d122332-Reviews-The_Ritz_Carlton_Philadelphia-Philadelphia_Pennsylvania.html',
]

results = []
for url in start_urls:
    parse(url, get_soup(url))

import pandas as pd
df = pd.DataFrame(results)  # convert the list of dicts to a DataFrame
df.to_csv('output.csv')
I am getting output samples like this in the CSV file:
I went on a family trip and it was amazing, I hope to come back soon. The room was small but what can you expect from New York. It was close to many things and the staff was perfect.I will come back again soon.More...
I just want to expand that "More..." link. I really have no clue how to do it. Please help.
I have written some more code, but I am unable to pull the id from the next page. The code is given below:
import re
import urllib
#import webbrowser

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'}

for i in range(0, 10, 5):
    url = "https://www.tripadvisor.in/Hotel_Review-g60763-d93339-Reviews-or{}-Hilton_New_York_Grand_Central-New_York_City_New_York.html".format(i)
    print(url)
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    pattern = re.compile(r"UID_(\w+)\-SRC_(\w+)")
    id = soup.find("div", id=pattern)["id"]
    uid = pattern.match(id).group(2)
    print(uid)
    url1 = "https://www.tripadvisor.in/ShowUserReviews-g60763-d93339-r" + str(uid) + "-Hilton_New_York_Grand_Central-New_York_City_New_York.html#CHECK_RATES_CONT"
    print(url1)
    url2 = '"' + url1 + '"'
    print(url2)
The site uses ajax to expand the review content. The full content is not downloaded until the More link is clicked.
One way to access the content would be to figure out the ajax request format and then issue an HTTP request of the same shape yourself. That might be difficult, or it might not.
Another, easier way is to notice that each review title is a clickable link that loads the full review on its own page. You can therefore scrape the URL for each review, send a similar GET request, and scrape the full text from the response, as in the sketch below.
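A rough sketch of that second approach, reusing get_soup() from the question (the class names come from the question's own selectors; TripAdvisor's markup changes often, so treat them as assumptions):

def scrape_full_reviews(url, response):
    if not response:
        print('no response:', url)
        return
    for review in response.find_all('div', class_='review-container'):
        # the title span used in the question sits inside the link
        # to the standalone review page
        link = review.find('span', class_='noQuotes').find_parent('a')
        full = get_soup('https://www.tripadvisor.in' + link['href'])
        if full:
            # on the standalone page the body is no longer truncated
            body = full.find('p', class_='partial_entry')
            print(body.get_text(strip=True) if body else 'no body found')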

Scraping through every product on retailer website

We are trying to scrape every product in every category on Forever 21's website. Given a product page, we know how to extract the information we need, and given a category, we can extract every product. However, we do not know how to crawl through every product category. Here is our code for getting every product in a given category:
import requests
from bs4 import BeautifulSoup
import json
#import re

params = {"action": "getcategory",
          "br": "f21",
          #"category": re.compile('\S+'),
          "category": "dress",
          "pageno": 1,
          "pagesize": "",
          "sort": "",
          "fsize": "",
          "fcolor": "",
          "fprice": "",
          "fattr": ""}

url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
js = requests.get(url, params=params).json()
soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")

i = 0
j = 0
while len(soup.select("div.item_pic a")) != 0:
    for a in soup.select("div.item_pic a"):
        #print a["href"]
        i = i + 1
    params["pageno"] = params["pageno"] + 1
    j = j + 1
    js = requests.get(url, params=params).json()
    soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
print i
print j
As you can see in the comments, we tried to use a regular expression for the category but had no success. i and j are just product and page counters. Any suggestions on how to modify or extend this code to get every product category?
You can scrape the category page and get all subcategories from the navigation menu:
import requests
from bs4 import BeautifulSoup
url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=app-main"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
menues = [li["class"][0] for li in soup.select("#has_sub .white nav ul > li")]
print(menues)
Prints:
[u'women-new-arrivals', u'want_list', u'dress', u'top_blouses', u'outerwear_coats-and-jackets', u'bottoms', u'intimates_loungewear', u'activewear', u'swimwear_all', u'acc', u'shoes', u'branded-shop-women-clothing', u'sale_women|women', u'women-new-arrivals-clothing-dresses', u'women-new-arrivals-clothing-tops', u'women-new-arrivals-clothing-outerwear', u'women-new-arrivals-clothing-bottoms', u'women-new-arrivals-clothing-intimates-loungewear', u'women-new-arrivals-clothing-swimwear', u'women-new-arrivals-clothing-activewear', u'women-new-arrivals-accessories|women-new-arrivals', u'women-new-arrivals-shoes|women-new-arrivals', u'promo-web-exclusives', u'promo-best-sellers-app', u'backinstock-women', u'promo-shop-by-outfit-women', u'occasion-shop-wedding', u'contemporary-main', u'promo-basics', u'21_items', u'promo-summer-forever', u'promo-coming-soon', u'dress_casual', u'dress_romper', u'dress_maxi', u'dress_midi', u'dress_mini', u'occasion-shop-dress', u'top_blouses-off-shoulder', u'top_blouses-lace-up', u'top_bodysuits-bustiers', u'top_graphic-tops', u'top_blouses-crop-top', u'top_t-shirts', u'sweater', u'top_blouses-sweatshirts-hoodies', u'top_blouses-shirts', u'top_plaids', u'outerwear_bomber-jackets', u'outerwear_blazers', u'outerwear_leather-suede', u'outerwear_jean-jackets', u'outerwear_lightweight', u'outerwear_utility-jackets', u'outerwear_trench-coats', u'outerwear_faux-fur', u'promo-jeans-refresh|bottoms', u'bottoms_pants', u'bottoms_skirt', u'bottoms_shorts', u'bottoms_shorts-active', u'bottoms_leggings', u'bottoms_sweatpants', u'bottom_jeans|', u'intimates_loungewear-bras', u'intimates_loungewear-panties', u'intimates_loungewear-bodysuits-slips', u'intimates_loungewear-seamless', u'intimates_loungewear-accessories', u'intimates_loungewear-sets', u'activewear_top', u'activewear_sports-bra', u'activewear_bottoms', u'activewear_accessories', u'swimwear_tops', u'swimwear_bottoms', u'swimwear_one-piece', u'swimwear_cover-ups', u'acc_features', u'acc_jewelry', u'acc_handbags', u'acc_glasses', u'acc_hat', u'acc_hair', u'acc_legwear', u'acc_scarf-gloves', u'acc_home-and-gift-items', u'shoes_features', u'shoes_boots', u'shoes_high-heels', u'shoes_sandalsflipflops', u'shoes_wedges', u'shoes_flats', u'shoes_oxfords-loafers', u'shoes_sneakers', u'Shoes_slippers', u'branded-shop-new-arrivals-women', u'branded-shop-women-clothing-dresses', u'branded-shop-women-clothing-tops', u'branded-shop-women-clothing-outerwear', u'branded-shop-women-clothing-bottoms', u'branded-shop-women-clothing-intimates', u'branded-shop-women-accessories|branded-shop-women-clothing', u'branded-shop-women-accessories-jewelry|', u'branded-shop-shoes-women|branded-shop-women-clothing', u'branded-shop-sale-women', u'/brandedshop/brandlist.aspx', u'promo-branded-boho-me', u'promo-branded-rare-london', u'promo-branded-selfie-leslie', u'sale-newly-added', u'sale_dresses', u'sale_tops', u'sale_outerwear', u'sale_sweaters', u'sale_bottoms', u'sale_intimates', u'sale_swimwear', u'sale_activewear', u'sale_acc', u'sale_shoes', u'the-outlet', u'sale-under-5', u'sale-under-10', u'sale-under-15']
Note the values of the br and category GET parameters: f21 is the "Women" section, and app-main is the main page for a category.
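A sketch of tying the two pieces together: feed each scraped subcategory back into the Ajax endpoint from the question's code (the params dict is the one defined there; entries containing | or a raw path such as /brandedshop/brandlist.aspx may need filtering first, which is untested):

ajax_url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
for category in menues:
    params["category"] = category
    params["pageno"] = 1
    js = requests.get(ajax_url, params=params).json()
    soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
    while len(soup.select("div.item_pic a")) != 0:
        for a in soup.select("div.item_pic a"):
            print(a["href"])  # one product link
        params["pageno"] = params["pageno"] + 1
        js = requests.get(ajax_url, params=params).json()
        soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")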

Beautifulsoup parsing error

I am trying to extract some information about an app on Google Play, and BeautifulSoup doesn't seem to work.
The link is this (say):
https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)
0 #How is this 0?! There is clearly a div with that class
I decided to go all in, which didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?
You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})

html = r.content
soup = BeautifulSoup(html, "html.parser")

title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()

print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
    return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com
