How to get all links containing a phrase from a changing website - python

I want to retrieve all links from a website that contain a specific phrase.
An example on a public website would be to retrieve all videos from a large youtube channel (for example Linus Tech Tips):
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
current_link = ''
for link in soup.find_all('a'):
current_link = link.get('href')
print(current_link)
Now I have 3 problems here:
How do I get only hyperlinks containing a phrase like "watch?v="
Most hyperlinks aren't shown. In the browser: They appear when you scroll down. BeautifulSoup does only find the links which can be found without scrolling. How can I retrieve all hyperlinks?
All hyperlinks appear two times. How can I only choose each hyperlink once?
Any suggestions?

How do I get only hyperlinks containing a phrase like "watch?v="
Add a single if statement above your print statement
if 'watch?v=' in current_link:
print(current_link)
All hyperlinks appear two times. How can I only choose each hyperlink once?
Store all hyperlinks in a dictionary as the key and set the value to any arbitrary number (dictionaries only allow a single key entry so you wont be able to add duplicates)
Something like this:
myLinks = {} //declare a dictionary variable to hold your data
if 'watch?v=' in current_link:
print(current_link)
myLinks[currentLink] = 1
You can iterate over the keys (links) in the dictionary like this:
for link,val in myLinks:
print(link)
This will print all the links in your dictionary
Most hyperlinks aren't shown. In the browser: They appear when you scroll down. BeautifulSoup does only find the links which can be found without scrolling. How can I retrieve all hyperlinks?
I'm unsure as to how you directly get around the scripting on the page you have directed us to but you could always crawl the links you get from the initial scrape and rip new links off the side panels/traverse them, this should give you most, if not all, of the links you want.
To do so you would want another dictionary to store the already traversed links/check if you already traversed them. You can check for a key in a dictionary like so:
if key in myDict:
print('myDict has this key already!')

I would use the request library,
for python3
import urllib.request
import requests
SearchString="SampleURL.com"
response = requests.get(SearchString, stream=True)
zeta= str(response.content)
with open ("File.txt" , "w") as l:
l.write(zeta)
l.close()
#And now open up the file with the information written to it
x = open("File.txt", "r")
jello = []
for line in x:
jello.append(line)
t = (jello[0].split(""""salePrice":""",1)[1].split(",",1)[0] )
#you'll notice above that I have the keyword "salePrice", this should be a unique identifier in the pages xpath. typically f12 in chrome and then navigating til the item is highlighted gives you the xpath if you right click and copy
#Now this will only return a single result, youll want to use a for loop to iterate over the File.txt until you find all the separate results
I hope this helps Ill keep an eye on this thread if you need more help.

Part One and Three:
Create a list and append links to the list:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
links = [] # see here
for link in soup.find_all('a'):
links.append(link.get('href')) # and here
Then create a set and convert it back to list to remove duplicates:
links = list(set(links))
Now return the items of interest:
clean_links = [i for i in links if 'watch?v=' in i]
Part Two:
In order to navigate through the site you may need more than just Beautiful Soup. Scrapy has a great API that allows you to pull down a page and explore how you want to parse parent and child elements with xpath. I highly encourage you to try Scrapy and use the interactive shell to tweak your extraction method.
HELPFUL LINK

Related

Python web-scraping trouble

I've been with this all day and I'm getting a little overwhelmed, I explain, I have a personal project, scrape all the links of the acestream: // protocol from a website and turn them into a playlist for acestream. For now I can either remove the links from the web (something like the site map) or remove the acestream links from a specific subpage. One of the problems I have is that since the same acestream link appears several times on the page,
Obviously I get the same link multiple times and I only want it once. Besides, I don't know how to do it either (I'm very new to this) so that instead of putting the link in it, it automatically takes it from a list of links in a .csv, because I need to get an acestream link from each link that I put on it. in the .csv. I'm sorry about the tirade, I hope it's not a nuisance.
Hope you understand, I translated it with Google Translate
from bs4 import BeautifulSoup
import requests
# creating empty list
urls = []
# function created
def scrape(site):
# getting the request from url
r = requests.get(site)
# converting the text
s = BeautifulSoup(r.text, "html.parser")
for i in s.find_all("a"):
href = i.attrs['href']
if href.startswith("acestream://"):
site = site + href
if site not in urls:
urls.append(site)
print(site)
# calling the scrape function itself
# generally called recursion
scrape(site)
# main function
if __name__ == "__main__":
site = "https://www.websitehere.com/index.htm"
scrape(site)
Based off your last comment and your code, you can read in a .csv using
import pandas as pd
file_path = 'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.

Retrieve all strings in a webpage in Python

I am trying to retrieve all strings from a webpage using BeautifulSoup and return a list of all the retrieved strings.
I have 2 approaches in mind:
Find all elements who have a text that is not null, append the text to result list and return it. I am having a hard time implementing this as I couldn't find any way to do it in BeautifulSoup.
Use BeautifulSoup's "find_all" method to find all attributes that I am looking for such as "p" for paragraphs, "a" for links etc. The problem I am facing with this approach is that for some reason, find_all is returning a duplicated output. For example, if a website has a link with a text "Get Hired", I am receiving "Get Hired" more than once in the output.
I am honestly not sure how to proceed from here and I have been stuck for several hours trying to figure out how to get all strings form a webpage.
Would really appreciate your help.
Use .stripped_strings to get all the strings with whitespaces stripped off.
.stripped_strings - Read the Docs.
Here is the code that returns a list of strings present inside the <body> tag.
import requests
from bs4 import BeautifulSoup
url = 'YOUR URL GOES HERE...'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
b = soup.find('body')
list_of_strings = [s for s in b.stripped_strings]
list_of_strings will have a list of all the strings present in the URL.
Post the code that you've used.
If I remember correctly, something like this should get the complete page in one variable "page" and all the text of the page would be available as page.text
import requests
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)

Passing a list as an argument to a function in python

I'm new to Python and I'm struggling with passing a list as an argument to a function.
I've written a block of code to take a url, extract all links from the page and put them into a list (links=[]). I want to pass this list to a function that filters out any link that is not from the same domain as the starting link (aka the first in the list) and output a new list (filtered_list = []).
This is what I have:
import requests
from bs4 import BeautifulSoup
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
links.append(tag['href'])
def filter_links(links):
filtered_links = []
for link in links:
if link.startswith(links[0]):
filtered_links.append(link)
print(filter_links(links))
When I run this, I get an unfiltered list and below that, I get None.
Eventually I want to pass the filtered list to a function that grabs the html from each page in the domain linked on the homepage, but I am trying to tackle this problem 1 process at a time. Any tips would be much appreciated, thank you :)
EDIT
I now can pass the list of urls to the filter_links() function however, I'm filtering out too much now. eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url. I have used the built-in startswith function, but its filtering out everything except the starting url. I think I could use regex but this should work too?

Writing CSV file while looping through web pages

This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
url="http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)
while True:
# some code to grab the data
job_tag={'class': re.compile("job_title")}
all_jobs=soup.findAll(attrs=job_tag)
jobs=[]
for text in (all_jobs):
t=str(''.join(text.findAll(text=True)).strip())
jobs.append(t)
writer=csv.writer(open('test.csv','a', newline=''))
writer.writerows(jobs)
# click next link
try:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='reviews']/a/span[starts-with(.,'Next')]")))
element.click()
except TimeoutException:
break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I am suspecting that I need to "redefine" the soup for each new page, I am looking into how to make bs4 access those urls.
2) the last page has no "next" button, so the code does not append last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with the data of the second-to-last page writing into the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance, which is most likely the URL you find behind the anchor (tag a) which you've located in those last lines:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link.read()))
You should still add a try - except clause to capture the error it'll generate on the last page when it cannot find the last "Next" link and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable for its writerows method. You are passing it a single string (which might have kommas in them, but that's not what csv.writer will look at: it will add kommas (or whichever delimiter you specified in its construction) between every 2 items of the iterable). A Python string is iterable (per character), so writer.writerows("some_string") doesn't result in an error. But you most likely wanted this:
for text in (all_jobs):
t = [x.strip() for x in text.find_all(text=True)]
jobs.append(t)
As a follow-up on the comments:
You'll want to update the soup based on the new url, which you retrieve from the 1, 2, 3 Next >> (it's in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
#Python3.x
import urllib
from bs4 import BeautifulSoup
url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)
while True:
raw_html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(raw_html)
# scrape the page for the desired info
# ...
last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
if last_link.text.startswith('Next'):
next_url_parts = urllib.parse.urlparse(last_link['href'])
url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
next_url_parts.path, next_url_parts.params, next_url_parts.query,
next_url_parts.fragment))
print(url)
else:
break

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
# Read the web page in XML mode
soup = BeautifulSoup(html.read(), "xml")
try:
for s in soup.find_all("signature"):
# Scrape the names from the XML
firstname = s.find('firstname').contents[0]
lastname = s.find('lastname').contents[0]
results.append(str(firstname) + " " + str(lastname))
except:
pass
# Find the next page to scrape
prev = soup.find("prev_signature")
# Check if another page of result exists - if not break from loop
if prev == None:
break
# Get the previous URL
url = prev.contents[0]
# Open the next page of results
html = urllib2.urlopen(url)
print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned though there is a lot of data there to go through and I have no idea if this is against the terms of service of the website so you would need to check it out.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time slowing down legitimate users requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request, it'll most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Reedit, I looked at their robots.txt file. It dissallows /xml/ Please respect this.
what do you mean by not working? empty list or error?
if you are receiving an empty list, it is because the class "name_location" does not exist in the document. also checkout bs4's documentation on findAll

Categories