Extract href links from HTML in Python

My scraper prints a list of h3 elements whose link text looks like:
HyperSense Software
QSS Technosoft - A CMMI Level 3 Certified Company
and more in the same format. How can I extract the href link from each of them?
My code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
allurl = main_soup.find_all('h3')
for i in allurl:
    for a in i:
        print(a)
How can I extract the href inside this loop?

You're close. One small change in your for loop:
for i in allurl:
    print(i.a["href"])
This gets the first child with tag "a" and then the "href" attribute of that tag.
If you aren't sure how many "a" tags there are in each "h3" block, or there is more than one, you can use another for loop (or, depending on what you're doing, a list comprehension):
for i in allurl:
    aa = i.find_all('a')
    for j in aa:
        print(j["href"])
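As a sketch of the list-comprehension variant mentioned above (assuming allurl from the question's code):
# One flat list of hrefs from every <a> inside every <h3>;
# the nested loop order mirrors the for-loop version above.
hrefs = [a["href"] for h3 in allurl for a in h3.find_all('a')]
print(hrefs)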

I found a way using a CSS selector:
urllist = []
mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
elms = main_soup.select("h3 a")
for i in elms:
    urllist.append(i.attrs["href"])
print(urllist)
Thanks !!

Related

Trying to isolate URL suffixes from a list of href tags

I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
 <a href="/scp-2756">SCP-2756</a>,
 <a href="/scp-002">SCP-002</a>,
 <a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re

def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    # Searches the HTML for text containing "SCP-" and href attributes containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-"))
    param = "SCP-" + str(SkateP(x))  # SkateP takes an int and inserts an appropriate number of 0's.
    for i in ref:  # The loop below filters out references to the article being searched
        if str(param) in i:
            ref.remove(i)
    if ref != []:
        print(ref)
The main idea I've tried to use is finding every item that contains items in quotations, but obviously that just returned the same list. What I want to be able to do is select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, change the initial code to only extract the href content in quotations to the list.
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
If I understand correctly, you want to extract the href attribute - for that, you can use i.get('href') (or probably even just i['href']).
With .select and list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
If you want the parent url attached:
root_url = 'https://PARENT-URL.com'  # replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
if str(param) was 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
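As an aside, if you would rather not concatenate strings by hand, urllib.parse.urljoin from the standard library attaches the parent URL and copes with trailing slashes and absolute hrefs. A minimal sketch, assuming the same crawl soup and a placeholder root_url:
from urllib.parse import urljoin

root_url = 'https://PARENT-URL.com'  # placeholder, as above
# urljoin normalizes each relative path against the root
scp_links = [urljoin(root_url, a.get('href'))
             for a in crawl.select('*[href*="scp-"]')
             if 'SCP-' in a.get_text()]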

How to get text after a specific p tag in BeautifulSoup?

How do I get all the text after the third p tag in this code, using BeautifulSoup?
questions = soup.find('div', {'class': 'entry-content'})
exp = questions.p[3].text
(I think there is a way to do something like this, but I can't get it to work.)
Can anyone help? I'd be very thankful.
Try the code below, if that helps:
# This will fetch the first div with class entry-content.
# If that is not the first such div, use find_all instead and pick the
# appropriate div by indexing.
questions = soup.find('div', class_='entry-content')
# This will get all the p tags present in questions.
p_tags = questions.find_all('p')
lst = []
for tag in p_tags[3:]:
    lst.append(tag.text)
# This will get you the text of the 4th <p> tag.
exp = p_tags[3].text
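A more compact variant (a sketch, assuming the same soup) uses a CSS selector plus a list comprehension:
# Text of every <p> after the third one inside the entry-content div
texts = [p.text for p in soup.select('div.entry-content p')[3:]]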
Note that soup.find('div', {'class': 'entry-content'}) only returns the first matching div. If you need every matching div, use:
questions = soup.find_all('div', {'class': 'entry-content'})
Then index into the result, e.g. questions[0].find_all('p')[3], to reach the p tag you want.

Get last page number - web scraping

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 sub pages within that leading page, ideally the function would therefore return 29.
By subpage I mean, page 1 of 29, 2 of 29 etc etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li><span class="selected">1</span></li><li><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>
I have the following code, which finds all the ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
<further processing>
Any help/suggestions much appreciated.
Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort and select the largest number.
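For completeness, a sketch of that last step (assuming the pagination links hold plain page numbers; the "Weiter »" link does not, hence the isdigit filter):
# Keep only link texts that are integers, then take the largest
page_numbers = [int(item.text) for item in soup.select("ol a")
                if item.text.strip().isdigit()]
last_page = max(page_numbers)
print(last_page)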
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # Finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as:  # Expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print(all_as)
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "lxml")
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the last li is the "Weiter" link, so take the one before it
print(last_page)
Which for your website will display:
30

Beautiful Soup if Class "Contains" or Regex?

If my class names are constantly different say for example:
listing-col-line-3-11 dpt 41
listing-col-block-1-22 dpt 41
listing-col-line-4-13 CWK 12
Normally I could do:
for EachPart in soup.find_all("div", {"class": "ClassNamesHere"}):
    print(EachPart.get_text())
There are way too many class names to work with here so a bunch of these are out.
I know Python doesn't have the ".contains" I would normally use, but it does have "in"; I just haven't been able to work out a way to incorporate that.
I'm hoping there's a way to do this with regex, though my Python syntax is really letting me down. I've been trying variations on:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all(regex):
But that doesn't seem to be doing the trick.
BeautifulSoup supports CSS selectors which allow you to select elements based on the content of particular attributes. This includes the selector *= for contains.
The following will return all div elements with a class attribute containing the text 'listing-col-':
for EachPart in soup.select('div[class*="listing-col-"]'):
    print(EachPart.get_text())
You can try this for loop:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class": regex}):
    print(EachPart.get_text())
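Another regex-free option is to pass a function as the class_ argument; bs4 tests it against each class value individually, so a sketch (assuming the same soup) looks like:
# Match any div with a class value starting with "listing-col-"
for EachPart in soup.find_all("div", class_=lambda c: c and c.startswith("listing-col-")):
    print(EachPart.get_text())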
You could avoid regex by using partial matching with gazpacho...
Input:
html = """\
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-block-1-22 dpt 41">B</div>
<div class="listing-col-line-4-13 CWK 12">C</div>
"""
Partial matching code:
from gazpacho import Soup
soup = Soup(html)
divs = soup.find("div", {"class": "listing-col-"}, partial=True)
[div.text for div in divs]
Output:
['A', 'B', 'C']

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
<tr>
  <td><a>...
Thanks
As per the docs, you first make a parse tree:
from bs4 import BeautifulSoup

html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup(html, "html.parser")
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.find_all('a'):
    if ana.parent.name == 'td':
        print(ana["href"])
Something like this?
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
anchors = [td.find('a') for td in soup.find_all('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific, or else use find_all if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
anchors = [a for a in (td.find('a') for td in soup.find_all('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
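In current bs4 you could also reach for a CSS selector, which expresses the nesting directly; a sketch, assuming the same html:
# "td a" matches every <a> anywhere inside a <td>
anchors = soup.select('td a')
hrefs = [a.get('href') for a in anchors]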
