BeautifulSoup: Unable to extract href with several conditions - Python

I'm trying to extract every link with BeautifulSoup from the SEC website, such as this one, by using the code from this GitHub. The thing is I do not want to extract every 8-K, but only the ones matching the items "2.02" within the column "Description". So I edited the "Download.py" file and identified the following:
while continuation_tag:
    r = requests_get(browse_url, params=requests_params)
    if continuation_tag == 'first pass':
        logger.debug("EDGAR search URL: " + r.url)
        logger.info('-' * 100)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a', {'id': 'documentsbutton'}):
        URL = sec_website + link['href']
        linkList.append(URL)
    continuation_tag = soup.find('input', {'value': 'Next ' + str(count)})  # a button labelled 'Next 100' for example
    if continuation_tag:
        continuation_string = continuation_tag['onclick']
        browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
        requests_params = None
return linkList
I've tried to add another loop to match my regex, but it doesn't work:
for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)
Any help would be really appreciated, thanks!

First find the tr that encapsulates both the a tag and the td tag that contains the items 2.02 text. Then find the url in the tr if the td actually contains the text items 2.02:
for link in soup.find_all("tr"):
td = link.find('td', {'class': 'small'})
if td:
if 'items 2.02' in td.text:
URL = sec_website + link.find('a', {'id': 'documentsbutton'})['href']
linkList.append(URL)
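If the Description column capitalises the word "Items" differently from the lowercase "items 2.02" used above, a case-insensitive check may be safer. A minimal sketch of that, reusing sec_website and linkList from the question's code:
for row in soup.find_all("tr"):
    td = row.find('td', {'class': 'small'})
    # match on the item number alone so "Items 2.02", "items 2.02" etc. all pass
    if td and '2.02' in td.text:
        button = row.find('a', {'id': 'documentsbutton'})
        if button:
            linkList.append(sec_website + button['href'])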

You can write something more concise by using CSS pseudo-classes. The following looks for td child elements, of the parent with class tableFile2, that have an adjacent sibling td (i.e. the next column) which is both the third column of the table (nth-of-type) and contains 2.02; from those tds it filters down to the child a tags that have the id documentsbutton.
import requests
from bs4 import BeautifulSoup as bs # version 4.7.1 +
base = 'https://www.sec.gov'
r = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=8-K&dateb=&owner=exclude&start=0&count=40')
soup = bs(r.content, 'lxml') # or html.parser
links = [base + i['href'] for i in soup.select('.tableFile2 td:has(+ td:nth-of-type(3):contains("2.02")) #documentsbutton')]
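One caveat: in more recent Soup Sieve releases (the library bs4 delegates CSS selectors to), the :contains() pseudo-class has been deprecated in favour of :-soup-contains(). If you see a deprecation warning, the equivalent selector should be roughly:
links = [base + i['href'] for i in soup.select('.tableFile2 td:has(+ td:nth-of-type(3):-soup-contains("2.02")) #documentsbutton')]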

Related

Grabbing first link in this code with Python

Hello, this is the page I want to grab the first link from using BeautifulSoup:
view-source:https://www.binance.com/en/blog
I want to grab the first article here, so it would be "Trust Wallet Now Supports Stellar Lumens, 4 More Tokens".
I am trying to use Python for this.
I use this code, but it grabs all the links; I only want to grab the first one:
with open('binanceblog1.html', 'w') as article:
    before13 = requests.get("https://www.binance.com/en/blog", headers=headers2)
    data1b = before13.text
    xsoup2 = BeautifulSoup(data1b, "lxml")
    for div in xsoup2.findAll('div', attrs={'class':'title sc-0 iaymVT'}):
        before_set13 = div.find('a')['href']
How can I do this?
The simplest solution I can think of at the moment that works with your code is to use break; this is because findAll returns every match:
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    break
For just the first element you can use find:
before_set13 = soup.find('div', attrs={'class':'title sc-62mpio-0 iIymVT'}).find('a')['href']
Try extracting the href from the 'Read more' button:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.binance.com/en/blog')
soup = BeautifulSoup(r.text, "html.parser")
div = soup.find('div', attrs={'class': 'read-btn sc-62mpio-0 iIymVT'})
print(div.find('a')['href'])
You can assess the situation inside the loop and break when you find a satisfactory result.
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    if before_set13 != '/en/blog':
        break
    print('skipping ' + before_set13)
print('grab ' + before_set13)
Output of the code with these changes:
skipping /en/blog
grab /en/blog/317619349105270784/Trust-Wallet-Now-Supports-Stellar-Lumens-4-More-Tokens
Use the class name with a class CSS selector (.) for the content section, then a descendant combinator with a type CSS selector to specify the child a tag element. select_one returns the first match:
soup.select_one('.content a')['href']
Code:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.binance.com/en/blog')
soup = bs(r.content, 'lxml')
link = soup.select_one('.content a')['href']
print('https://www.binance.com' + link)
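Note that select_one returns None when nothing matches, so if the class names on the page change it is worth guarding before indexing; a small sketch:
link_tag = soup.select_one('.content a')
if link_tag is not None and link_tag.has_attr('href'):
    print('https://www.binance.com' + link_tag['href'])
else:
    print('no matching link found')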

Adding objects for each item added from scraping data from a website

I am trying to retrieve data from a website and add an object for each row of data. I am new to Python and I am clearly missing something, because I can only get 1 object; what I'm trying to get is all the objects, sorted as key-value pairs:
import urllib.request
import bs4 as bs

url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
search = ''
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')
tracks_info = [{}]

spans = soup.find_all('span', {'class': 'ptxt-artist'})
for span in spans:
    arts = span.find_all('a')
    for art in arts:
        print(art.text)

spans = soup.find_all('span', {'class': 'ptxt-track'})
for span in spans:
    tracks = span.find_all('a')
    for track in tracks:
        print(track.text)

for download_links in soup.find_all('a', {'title': 'Download'}):
    print(download_links.get('href'))

for info in tracks_info:
    info.update({'artist': art.text})
    info.update({'track': track.text})
    info.update({'link': download_links.get('href')})
    print(info)
I failed to add an object for each element I get from the website. I'm clearly doing something wrong (or not doing something), and any help would be much appreciated!
You could use a slightly different structure and syntax such as below.
I use a contains CSS class selector to retrieve the rows of info, as the id is different for each track.
The CSS selector combination div[class*="play-item gcol gid-electronic tid-"]
looks for div elements with a class attribute whose value contains play-item gcol gid-electronic tid-.
Within that, the various columns of interest are then selected by their class name, and a descendant CSS selector is used for the a tag element for the final download link.
import urllib.request
import bs4 as bs
import pandas as pd
url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')
tracks_Info = []
headRow = ['Artist','TrackName','DownloadLink']
for item in soup.select('div[class*="play-item gcol gid-electronic tid-"]'):
    tracks_Info.append([item.select_one(".ptxt-artist").text.strip(),
                        item.select_one(".ptxt-track").text,
                        item.select_one(".playicn a").get('href')])
df = pd.DataFrame(tracks_Info,columns=headRow)
print(df)

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

I am working on a web scraping project and have run into the following error.
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
Below is my code. I retrieve all of the links from the html table and they print out as expected. But when I try to loop through them (links) with requests.get I get the error above.
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)
Your mistake is the second for loop in the code:
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
ref['href'] gives you a single url, but you use it as a list in the next for loop.
So you have
for link in ref['href']:
and it gives you the first char from the url http://properties.kimcore... which is h
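A tiny illustration of what iterating over a string does (the url is shortened here just for the example):
url = 'http://properties.kimco'
for link in url:
    print(link)  # prints 'h', then 't', then 't', ... one character per iteration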
Full working code
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print (df)
BTW: if you use a comma in (ref['href'], ) then you get a tuple, and then the second for loop works correctly.
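A quick sketch of that difference:
links = (ref['href'])    # parentheses alone do nothing: links is still a single string
links = (ref['href'], )  # the trailing comma makes a one-element tuple, so the inner loop yields the whole url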
EDIT: it creates the list table_data at the start and adds all the data into this list, converting it into a DataFrame at the end.
But now I see it reads the same page a few times - because in every row the same url is in every column. You would have to get the url from only one column.
EDIT: now it doesn't read the same url many times.
EDIT: now it gets the text and href from the first link and adds them to every element in the list when you use append().
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in table except first ([1:]) - headers
rows = soup.select('table tr')[1:]
for row in rows:
    # link in first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)

Findall to div tag using beautiful soup yields blank return

<div class="columns small-5 medium-4 cell header">Ref No.</div>
<div class="columns small-7 medium-8 cell">110B60329</div>
Website is https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results
I would like to run a loop and return '110B60329'. I have run BeautifulSoup and done a find_all('div'); I then define the 2 different tags as head and data based on their class. I then iterate through the 'head' tags, hoping it would return the info in the div tag I have defined as data.
Python returns a blank (the cmd prompt just reprinted the filepath).
Would anyone kindly know how I might fix this? My full code is below, thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)

# content of search page in soup
html = soup(response.content, "lxml")

properties_col = html.find_all('div')
for col in properties_col:
    ref = 'n/a'
    des = 'n/a'
    head = col.find_all("div", {"class": "columns small-5 medium-4 cell header"})
    data = col.find_all("div", {"class": "columns small-7 medium-8 cell"})
    for i, elem in enumerate(head):
        #for i in range(elems):
        if head[i].text == "Ref No.":
            ref = data[i].text
            print ref
You can do this in two ways.
1) If you are sure that the website you are scraping won't change its content, you can find all divs by that class and get the content by providing an index.
2) Find all left side divs (the titles) and, if one of them matches what you want, get the next sibling to get the text.
Example:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html = soup(response.content,"lxml")
#Method 1
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell")
Reference = LeftBlockData[0].get_text().strip()
Description = LeftBlockData[2].get_text().strip()
print(Reference)
print(Description)
#Method 2
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print (RightColumn)
    if "Description" in column.get_text().strip():
        print (RightColumn)
The prints will output (in order):
110B60329
STORE
110B60329
STORE
Your problem is that you are trying to match a node text that contains a lot of tabs and whitespace against a non-spaced string.
For example, your head[i].text variable contains 'Ref No.' surrounded by whitespace, so if you compare it with 'Ref No.' it'll give a false result. Stripping it will solve this.
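A minimal fix along those lines, keeping the rest of your loop unchanged, would be:
if head[i].text.strip() == "Ref No.":
    ref = data[i].text.strip()
    print(ref)
Another option is to read each table row in one go and let get_text() do the cleaning: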
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results")
soup = BeautifulSoup(r.text, 'lxml')
for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))
out:
['Ref No.', '110B60329']
['Office', 'LOTHIAN VJB']
['Description', 'STORE']
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR']
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.']
['Tenant', 'PROPRIETOR']
['Occupier']
['Net Annual Value', '£1,750']
['Marker']
['Rateable Value', '£1,750']
['Effective Date', '01-APR-10']
['Other Appeal', 'NO']
['Reval Appeal', 'NO']
get_text() is a very powerful tool: you can strip the white space and put a separator in the text.
You can use this method to get clean data and filter it.
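As a small sketch of that, the split rows can be packed into a dict so a single field such as 'Ref No.' can be looked up directly (assuming the first item of each row is its label):
details = {}
for row in soup.find_all(class_='table-row'):
    parts = row.get_text(strip=True, separator='|').split('|')
    if len(parts) > 1:
        details[parts[0]] = parts[1:]

print(details.get('Ref No.'))  # e.g. ['110B60329']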

Webcrawler BeautifulSoup - how do I get titles from links without class tags

The site I am trying to gather data from is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. Right now I want to get all the titles of the movies on this page and later move on to the rest of the data (studio, etc.) and additional data inside each of the links. This is what I have so far:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'div':'body'}):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id':'postingbody'}):
        print item_name.text

trade_spider(1)
The section I am having trouble with is
for link in soup.findAll('a', {'div':'body'}):
    href = 'http://www.boxofficemojo.com' + link.get('href')
The issue is that on the site there's no identifying class that all the links are part of. The links just have an "<a href>" tag.
How can I get all the titles of the links on this page?
One possible way is using the .select() method, which accepts a CSS selector parameter:
for link in soup.select('td > b > font > a[href^=/movies/?]'):
    ......
    ......
A brief explanation of the CSS selector being used:
td > b : find all td elements, then from each td find the direct child b element
> font : from the filtered b elements, find the direct child font element
> a[href^=/movies/?] : from the filtered font elements, return the direct child a elements whose href attribute value starts with "/movies/?"
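A runnable sketch of that idea; note this answer predates BeautifulSoup 4.7, and with the newer Soup Sieve backend the attribute value generally needs to be quoted (and print is a function in Python 3):
import requests
from bs4 import BeautifulSoup

url = 'http://www.boxofficemojo.com/yearly/chart/?page=0&view=releasedate&view2=domestic&yr=2015&p=.htm'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for link in soup.select('td > b > font > a[href^="/movies/?"]'):
    title = link.get_text(strip=True)
    href = 'http://www.boxofficemojo.com' + link['href']
    print(title, href)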
Sorry for not giving a full answer, but here's a clue.
I have made-up names for these problems in scraping.
When I use the find(), find_all() methods I call this Abstract Identification, since you could get random data when tag class/id names are not data oriented.
Then there's Nested Identification. That's when you have to find data not by using the find(), find_all() methods, but by literally crawling through a nest of tags. This requires more proficiency in BeautifulSoup.
Nested Identification is a longer process that's generally messy, but is sometimes the only solution.
So how to do it? When you have hold of a <class 'bs4.element.Tag'> object, you can locate tags that are stored as attributes of the tag object.
from bs4 import element, BeautifulSoup as BS

# note: the href value in the sample markup below is a placeholder; the anchor tag
# around "Some Important Link" was lost in formatting, but the code needs one to work.
html = '' +\
    '<body>' +\
    '<h3>' +\
    '<p>Some text to scrape</p>' +\
    '<p>Some text NOT to scrape</p>' +\
    '</h3>' +\
    '\n\n' +\
    '<strong>' +\
    '<p>Some more text to scrape</p>' +\
    '\n\n' +\
    '<a href="http://example.com">Some Important Link</a>' +\
    '</strong>' +\
    '</body>'

soup = BS(html)

# Starting point to extract a link
h3_tag = soup.find('h3') # finds the first h3 tag in the soup object
child_of_h3__p = h3_tag.p # locates the first p tag in the h3 tag

# climbing in the nest
child_of_h3__forbidden_p = h3_tag.p.next_sibling
# or
#child_of_h3__forbidden_p = child_of_h3__p.next_sibling

# sometimes `.next_sibling` will yield '' or '\n', think of this element as a
# tag separator in which case you need to continue using `.next_sibling`
# to get past the separator and onto the tag.

# Grab the tag below the h3 tag, which is the strong tag
# we need to go up 1 tag, and down 2 from our current object.
# (down 2 so we skip the tag_separator)
tag_below_h3 = child_of_h3__p.parent.next_sibling.next_sibling

# Heres 3 different ways to get to the link tag using Nested Identification

# 1.) getting a list of children from our object
children_tags = tag_below_h3.contents
p_tag = children_tags[0]
tag_separator = children_tags[1]
a_tag = children_tags[2] # or children_tags[-1] to get the last tag

print (a_tag)
print '1.) We Found the link: %s' % a_tag['href']

# 2.) Theres only 1 <a> tag, so we can just grab it directly
a_href = tag_below_h3.a['href']
print '\n2.) We Found the link: %s' % a_href

# 3.) using next_sibling to crawl
tag_separator = tag_below_h3.p.next_sibling
a_tag = tag_below_h3.p.next_sibling.next_sibling # or tag_separator.next_sibling

print '\n3.) We Found the link: %s' % a_tag['href']
print '\nWe also found a tag separator: %s' % repr(tag_separator)

# our tag separator is a NavigableString.
if type(tag_separator) == element.NavigableString:
    print '\nNavigableString\'s are usually plain text that reside inside a tag.'
    print 'In this case however it is a tag separator.\n'
Now, if I remember right, accessing a certain tag or a tag separator will change the object from a Tag to a NavigableString, in which case you need to pass it through BeautifulSoup to be able to use methods such as find(). To check for this you can do so as follows:
from bs4 import element, BeautifulSoup
# ... Do some beautiful soup data mining
# reach a NavigableString object
if type(formerly_a_tag_obj) == element.NavigableString:
    formerly_a_tag_obj = BeautifulSoup(formerly_a_tag_obj) # is now a soup
