A am learning how to use BeautifulSoup and I have run into an issue with double printing in a loop I have written.
Any insight would be greatly appreciated!
from bs4 import BeautifulSoup
import requests
import re
page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en' #main page
#url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page) #requests html document
data = r.text #set data = to html text
soup = BeautifulSoup(data, "html.parser") #parse data with BS
for link in soup.find_all('a'):
#if contains /news/
if ('/news/' in link.get('href')):
print(link.get('href'))
Examples:
for link in soup.find_all('a'):
#if contains cointelegraph/news/
#if ('https://cointelegraph.com/news/' in link.get('href')):
url = link.get('href') #local var store url
if '/news/' in url:
print(url)
print(count)
count += 1
if count == 5:
break
output:
https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin
0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
4
For some reason my code keeps printing out the same url twice...
Based on your code and the provided link there seems to be duplicates in the results of BeautifulSoup find_all search. The html structure needs to be checked to see why duplicates are returned (check the find_all search options to filter some in the documentation. But if you want a quick fix and want to remove the duplicates from the printed results you can use the modified loop with a set as below to keep track of seen entries (based on this).
In [78]: l = [link.get('href') for link in soup.find_all('a') if '/news/' in link.get('href')]
In [79]: any(l.count(x) > 1 for x in l)
Out[79]: True
The above output shows duplicate exists in the list. Now to remove them use something like
seen = set()
for link in soup.find_all('a'):
lhref = link.get('href')
if '/news/' in lhref and lhref not in seen:
print lhref
seen.add(lhref)
Related
I'm having a bit of trouble trying to save the links from a website into a list without repeating urls with same domain
Example:
www.python.org/download and www.python.org/about
should only save the first one (www.python.org/download) and not repeat it later
This is what i've got so far
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
atag = doc.find_all('a', href=True)
links = []
#below should be some kind of for loop
As a one-liner:
links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
Explained:
links = set() # define empty set
for a in doc.find_all('a', href=True): # loop over every <a> element
nl = urlparse(a["href"]).netloc # get netloc from url
if nl:
links.add(nl) # add to set if exists
output:
{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
So I have just started learning about python using the Coursera online course "Python for Everybody", and I have this assignment where I have to follow links using beautiful soup. I saw this question pop up before but when I tried using it, it just didn't work. I managed to create something but the thing doesn't actually follow through the links but instead just stays on the same page. If possible can anyone provide materials that can give better insight on this assignment as well? Thanks.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn)-1
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
lst = list()
for tag in tags:
lst.append(tag.get('href', None))
indxpos = lst[position]
count = count - 1
print("Retrieving:", indxpos)
You never set url to the new URL.
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
...
for tag in tags:
lst.append(tag.get('href', None)) # Appends all the links to lst
indxpos = lst[position]
count = count - 1
print("Retrieving:", indxpos)
# What happens to lst?? you never use it
You should probably replace indxpos with url instead.
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
...
for tag in tags:
lst.append(tag.get('href', None)) # Appends all the links to lst
url = lst[position]
count = count - 1
print("Retrieving:", url)
This way, the next time the loop runs, it will fetch the new URL.
Also: If the page does not have pn links (e.g. pn=12, page has 2 links), you will get an exception if you try and access lst[position], because lst has less than pn elements.
You don't have a function that interacts with the list of hyperlinks in your code, whatsoever.
It will only print contents of "lst" list, but won't do anything with them.
How would i save the output from this python script as a dataframe?
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('https://webscraper.io/test-sites/e-commerce/allinone')
for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
print(link['href'])
You might consider putting your code into a function that returns the desired output
def myfunction(response = response, parser = "html.parser", ...):
'''
this function saves output
'''
....
return(output)
or you could define a list and index over the loop
mylist = []
i = 1
for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
mylist[i] = link['href']
i += 1
Without more detail about what it is exactly you're trying to do, it is difficult to provide more guidance
I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links() .
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
links.append(tag['href'])
def filter_links(links):
filtered_links = []
for link in links:
if link.startswith(links[0]):
filtered_links.append(link)
return filtered_links
print(filter_links(links))
I have used the built-in startswith function, but its filtering out everything except the starting url.
Eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url.I think I could use regex but this function should work too?
Try this :
import requests
from bs4 import BeautifulSoup
import re
import tldextract
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
links.append(tag['href'])
def filter_links(links):
ext = tldextract.extract(start_url)
domain = ext.domain
filtered_links = []
for link in links:
if domain in link:
filtered_links.append(link)
return filtered_links
print(filter_links(links))
Note :
You need to get that return statement out of the for loop. It is just returning the result after iterating over just one element and thus only the first item inside a list is only getting returned.
Use tldextract module to better extract the domain name from the urls. If you want to explicitly check whether the links starts with links[0], it's up to you.
Output :
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
Okay so you made an indentation error in filter_links(links). The function should be like this
def filter_links(links):
filtered_links = []
for link in links:
if link.startswith(links[0]):
filtered_links.append(link)
return filtered_links
Notice that in your code, you kept the return statement inside the for loop so, the for loop gets executed once and then returns the list.
Hope this helps :)
Possible Solution
What about if you kept all links which 'contain' the domain?
For example
import pandas as pd
links = []
for tag in soup.find_all('a', href=True):
links.append(tag['href'])
all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems".
If you want to search multiple domains, see this answer
I'm new to python and web scraping.
I wrote some codes by using requests and beautifulsoup. One code is for scraping prices and names and links. Which works fine and is as follows:
from bs4 import BeautifulSoup
import requests
urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1"
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')
for figcaption in soup.find_all('figcaption'):
price = figcaption.div.text
name = figcaption.find('a', class_='title').text
link = figcaption.find('a', class_='title')['href']
print(price)
print(name)
print(link)
and also one for making other urls that I need those information scraped from, which also gives the correct urls when I use print():
x = 0
counter = 1
for x in range(0, 70)
urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
counter += 1
x += 1
print(urls)
But when I try to combine these two in order to scrape a page and then change url to new one and then scrape it, it just gives the scraped information on the first page 70 times. please guide me through this. the whole code is as follows:
from bs4 import BeautifulSoup
import requests
x = 0
counter = 1
for x in range(0, 70):
urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')
counter += 1
x += 1
print(urls)
for figcaption in soup.find_all('figcaption'):
price = figcaption.div.text
name = figcaption.find('a', class_='title').text
link = figcaption.find('a', class_='title')['href']
print(price)
print()
print(name)
print()
print(link)
Your x=0 and then incriminating it by 1 is redundant and not needed, as you have it iterating through that range range(0, 70). I'm also not sure why you have a counter as you don't need that either. Here's how you would do it below:
HOWEVER, I believe that issue is not with the iteration or looping, but the url itself. If you manually go to the two pages as listed below, the content doesn’t change:
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1
and then
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-2
Since the site is dynamic, you'll need to find a different way to iterate page to page, or figure out what the exact url is. So try:
from bs4 import BeautifulSoup
import requests
for x in range(0, 70):
try:
urls = 'https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html&pagesize[]=24&order[]=new&stock[]=1&page[]=' +str(x+1) + '&ajax=ok?_=1561559181560'
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')
print('Page: %s' %(x+1))
for figcaption in soup.find_all('figcaption'):
price = figcaption.find('span', {'class':'new_price'}).text.strip()
name = figcaption.find('a', class_='title').text
link = figcaption.find('a', class_='title')['href']
print('%s\n%s\n%s' %(price, name, link))
except:
break
You can find that link by going to the website and looking at the dev tools (Ctrl +Shift+I or right-click 'Inspect') -> network -> XHR
When I did that and then physically click to the next page, I can see how that data was rendered, and found the reference url.