How to solve finding two of each link (BeautifulSoup, Python)

I'm using BeautifulSoup 4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it's taking the link from the title as well as from the item colour. I have tried a few things but cannot get BS to only parse the title link and give me a list with one of each link instead of two. I imagine it's something really simple but I'm missing it. Thanks in advance.

This code will give you the result without duplicates (also, using set() may be a good idea, as #Tarum Gupta suggested), but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Get all divs with class inner-article, then search for an <a> with the
# name-link class that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)

alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

unique_links = set(linksList1)   # use set() to remove duplicate links
linksList1 = list(unique_links)  # use list() to convert the set back to a list if you need one
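Note that set() does not preserve the original ordering of the links. If the order matters, one order-preserving alternative (plain standard-library Python, nothing specific to this page) is dict.fromkeys, which keeps the first occurrence of each link:

# de-duplicate while keeping the original order (dicts preserve insertion order in Python 3.7+)
linksList1 = list(dict.fromkeys(linksList1))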

Related

How to get rid of duplicate links using python

I'm new at coding and I'm trying to scrape all unique web links from https://www.census.gov/programs-surveys/popest.html. I've tried to put the links into a set, but the output comes back as {'/'}. I don't know any other way to get rid of duplicates. Below is my code. Thank you for your help.
from bs4 import BeautifulSoup
import urllib
import urllib.request
import requests

with urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html') as response:
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):
    links = (link['href'])
    link = str(link.get('href'))
    if link.startswith('https'):
        print(link)
    elif link.endswith('html'):
        print(link)
unique_links = set(link)
print(unique_links)
Let's say all the links are stored in a list called links1. Here is how you can remove duplicates without the use of set():
links2 = []
for link in links1:
    if link not in links2:
        links2.append(link)
Your set only contains the final link; declare the set() earlier, then add to it inside the loop.
unique_links = set()
for link in soup.find_all('a', href=True):
    link = str(link.get('href'))
    if link.startswith('https'):
        print(link)
    elif link.endswith('html'):
        print(link)
    unique_links.add(link)
print(unique_links)
Create the set outside the for loop, then add to the set inside the loop.
link_set = set()
for link in soup.find_all('a', href=True):
    link_set.add(link['href'])

Search in Sub-Pages of a Main Webpage using BeautifulSoup

I am trying to search for a div with class='class', but I need to find all matches in the main page as well as in the sub- (or child) pages. How can I do this using BeautifulSoup or anything else?
I have found the closest answer in this search
Search the frequency of words in the sub pages of a webpage using Python
but this method only retrieved partial results; the page of interest has many more subpages. Is there another way of doing this?
My code so far:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.mainpage.nl/')
soup = BeautifulSoup(page.content, 'html.parser')
subpages = []
for anchor in soup.find_all('a', href=True):
    string = 'https://www.mainpage.nl/' + str(anchor['href'])
    subpages.append(string)
for subpage in subpages:
    try:
        soup_sub = BeautifulSoup(requests.get(subpage).content, 'html.parser')
        promotie = soup_sub.find_all('strong', class_='c-action-banner__subtitle')
        if len(promotie) > 0:
            print(promotie)
    except Exception:
        pass
Thanks!
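A minimal sketch of one way to do this, assuming the same placeholder URL https://www.mainpage.nl/ used in the question: resolve every href with urljoin so relative and absolute links both work, de-duplicate the subpage URLs with a set, keep only same-domain pages, and then run the same find_all over the main page and every subpage.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

base = 'https://www.mainpage.nl/'  # placeholder URL, as in the question
soup = BeautifulSoup(requests.get(base).content, 'html.parser')

# Resolve every href against the base URL and keep only same-domain pages
subpages = {urljoin(base, a['href']) for a in soup.find_all('a', href=True)}
subpages = {url for url in subpages if urlparse(url).netloc == urlparse(base).netloc}

# Search the main page plus every subpage for the target elements
for url in [base] + sorted(subpages):
    sub_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    matches = sub_soup.find_all('strong', class_='c-action-banner__subtitle')
    if matches:
        print(url, matches)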

How to extract the link of the first gallery on imgur using BS4 and urllib

I am trying to extract the gallery link of the first result on an imgur search.
theurl = "https://imgur.com/search?q=" +text
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
link = soup.findAll('a',{"class":"image-list-link"})[0].decode_contents()
Here is what is being displayed for link:
I am mainly trying to get the href value from only this section (the first result for the search)
Here is what the inspect element looks like:
Actually, it's pretty easy to accomplish what you're trying to do. As shown in the image, the href of the first image (or any image, for that matter) is located inside the <a> tag with the attribute class="image-list-link". So, you can use the find() function, which returns the first match found, and then use ['href'] to get the link.
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://imgur.com/search?q=python')
soup = BeautifulSoup(r.text, 'lxml')
first_image_link = soup.find('a', class_='image-list-link')['href']
print(first_image_link)
# /gallery/AxKwQ2c
If you want to get the links for all the images, you can use a list comprehension.
all_image_links = [a['href'] for a in soup.find_all('a', class_='image-list-link')]
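Note that these href values are relative (e.g. /gallery/AxKwQ2c). If you need full URLs, a small sketch using urljoin from the standard library, reusing the same soup object as above:

from urllib.parse import urljoin

# Turn the relative hrefs into absolute imgur URLs
all_image_urls = [urljoin('https://imgur.com/', a['href'])
                  for a in soup.find_all('a', class_='image-list-link')]
print(all_image_urls[0])  # e.g. https://imgur.com/gallery/AxKwQ2c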

Python BeautifulSoup4 - Scrape Section/Table Header and Values from Multiple Sections/Tables

I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine using BS4 in Python, but having the year, season, titles, and episodes associated with the links is ideal. The desired output would look like this:
I've started with the code below, but don't know how to loop through the code to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class': 'md wiki'}  # so it can be reused later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try the following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
It's better to put the second argument of find into a dict; find_all will return all elements that match your search.
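To associate each season title with its table, here is a sketch that zips the two lists together. It assumes each <strong> season heading in the wiki markup is followed by exactly one <table>, which may not hold for every year:

md_wiki = soup.find('div', {'class': 'md wiki'})
titles = md_wiki.find_all('strong')
tables = md_wiki.find_all('table')

# Pair each season heading with the table that follows it
for title, table in zip(titles, tables):
    print(title.get_text(strip=True))
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all(['td', 'th'])]
        links = [a['href'] for a in row.find_all('a', href=True)]
        print(cells, links)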

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls metadata from websites and saves the information into a CSV. I have followed some guides but am now stuck with this error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2, 16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article

    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from the original article
    divBegin = articlePage.find('<div>')          # Locate the div provided
    article = articlePage[divBegin:(divBegin + 1000)]  # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for i in paragList:
        print i
    print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is
File "C:\Users......", line 24, in (module)
print findPatTitle[i] # the title
IndexError:list of index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since an HTML document usually has only one title element.
A simple way to grab the title of an HTML document is to use bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you try to iterate over findPatLink in the current way, since it has length 6. It is not clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
And finally, since you don't want some of the URLs, you can slice link_href_list however you want. An improved version of the last expression, which excludes the first and second results, could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
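Putting it together, a minimal sketch of the same idea (written here with Python 3's urllib.request and print(), unlike the Python 2 code above) that sidesteps the IndexError by only slicing elements that actually exist:

from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage, 'html.parser')

# The document title, without any regex
title = soup.title.text if soup.title else ''
print(title)

# Skip the first two <link> hrefs, mirroring the original range(2, 16),
# without indexing past the end of the list
link_href_list = [link['href'] for link in soup.find_all('link', href=True)]
for href in link_href_list[2:16]:
    print(href)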
