What would be the best way to scrape this website? (Not Selenium) - python

Before I begin: the TLDR is at the bottom.
So I'm trying to scrape https://rarbgmirror.com/ for torrent magnet links and their torrent titles, based on user-entered searches. I've already figured out how to do this using BeautifulSoup and Requests through this code:
from bs4 import BeautifulSoup
import requests
import re
query = input("Input a search: ")
link = 'https://rarbgmirror.com/torrents.php?search=' + query
magnets = []
titles = []
try:
    request = requests.get(link)
except:
    print("ERROR")
source = request.text
soup = BeautifulSoup(source, 'lxml')
for page_link in soup.findAll('a', attrs={'href': re.compile("^/torrent/")}):
    page_link = 'https://www.1377x.to/' + page_link.get('href')
    try:
        page_request = requests.get(page_link)
    except:
        print("ERROR")
    page_source = page_request.content
    page_soup = BeautifulSoup(page_source, 'lxml')
    link = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    magnets.append(link.get('href'))
    title = page_soup.find('h1')
    titles.append(title)
print(titles)
print(magnets)
I am almost certain that this code has no errors, because it was originally written for https://1377x.to for the same purpose, and if you look through the HTML structure of both websites, they use the same tags for magnet links and title names. But if the code is faulty, please point that out to me!
After some research, I found the issue to be that https://rarbgmirror.com/ uses JavaScript to load its pages dynamically. After some more research, I found that Selenium is recommended for this purpose. After using Selenium for a while, though, I found some drawbacks:
The slow speed of scraping
The system the app runs on must have the browser that Selenium drives installed (I'm planning to package the app with PyInstaller, so this would be an issue)
So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.
TLDR:
I want an alternative to selenium to scrape a website which is dynamically loaded using JavaScript.
PS: GitHub Repo:
https://github.com/eliasbenb/MagnetMagnet

If you are using only Chrome, you can check out Puppeteer by Google. It is fast and integrates quite well with Chrome DevTools.

WORKING SOLUTION
DISCLAIMER FOR PEOPLE LOOKING FOR AN ANSWER: this method WILL NOT work for any website other than RARBG
I posted this same question to Reddit's r/learnpython, and someone there found a great answer that met all my requirements. You can find the original comment here
What he found out was that RARBG gets its info from here
You can change what is searched for by changing "QUERY" in the link. That page contains all the information for each torrent, so I scraped it using requests and bs4.
Here is the working code:
from bs4 import BeautifulSoup
import requests
magnets = []
titles = []
query = input("Input a search: ")
rarbg_link = 'https://torrentapi.org/pubapi_v2.php?mode=search&search_string=' + query + '&token=lnjzy73ucv&format=json_extended&app_id=lol'
try:
    request = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'})
except requests.RequestException:
    print("ERROR")
    raise SystemExit(1)  # nothing to parse if the request failed
source = request.text
soup = str(BeautifulSoup(source, 'lxml'))
soup = soup.replace('<html><body><p>{"torrent_results":[', '')
soup = soup.split(',')
# keep only the fragments that hold a title, then strip the JSON punctuation
found_titles = str([i for i in soup if i.startswith('{"title":')])
found_titles = found_titles.replace('{"title":"', '')
found_titles = found_titles.replace('"', '')
found_titles = found_titles.split("', '")
for title in found_titles:
    titles.append(title)
# same treatment for the magnet links
links = str([i for i in soup if i.startswith('"download":')])
links = links.replace('"download":"', '')
links = links.replace('"', '')
links = links.split("', '")
for link in links:
    magnets.append(link)
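Since the link asks for format=json_extended, the response is JSON, so it may be simpler to parse it as JSON than to split the text on commas (a title that contains a comma would break the string approach). Here is a minimal sketch, assuming the payload has the shape implied by the code above: a "torrent_results" list whose entries have "title" and "download" keys. The helper name and the sample payload are made up for illustration.

```python
import json


def extract_titles_and_magnets(payload):
    """Return (titles, magnets) from a torrentapi-style JSON string.

    Assumes the shape implied by the answer above: a "torrent_results"
    list whose items carry "title" and "download" keys.
    """
    data = json.loads(payload)
    results = data.get("torrent_results", [])
    titles = [item["title"] for item in results]
    magnets = [item["download"] for item in results]
    return titles, magnets


# Hypothetical sample payload, used here instead of a live request.
sample = json.dumps({
    "torrent_results": [
        {"title": "Example, with a comma", "download": "magnet:?xt=urn:btih:abc"},
        {"title": "Second example", "download": "magnet:?xt=urn:btih:def"},
    ]
})

titles, magnets = extract_titles_and_magnets(sample)
print(titles)
print(magnets)
```

With a live response, request.text from the code above would take the place of sample.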

Related

search pdf links from all over the website

I want to search a website and look for all PDF links. I know there are several solutions with BeautifulSoup that look for PDF files using <a> tags, but I want to search the whole domain starting from the base URL, instead of just the one page.
My idea was to a) first search a whole website for all sub links and then b) filter out the links that have a .pdf extension. For the first part, I tried this https://github.com/mujeebishaque/extract-urls:
import requests
from bs4 import BeautifulSoup
if __name__ == '__main__':
    user_input_url = "https://www.aurednik.de/"
    if not user_input_url or len(user_input_url) < 1:
        raise Exception("INFO: Invalid Input")
    _start = user_input_url.find('//')
    _end = user_input_url.find('.com')
    readable_website_name = user_input_url[_start+2:_end].strip()
    try:
        website_content = requests.get(user_input_url.strip()).text
    except:
        check_internet = requests.get('https://google.com').status_code
        if check_internet != requests.codes.ok:
            raise ConnectionError("ERROR: Check internet connection.")
    _soup = BeautifulSoup(website_content, features='lxml')
    internal_url_links = []
    external_url_links = []
    for link in _soup.find_all('a', href=True):
        if readable_website_name in link.get('href'):
            internal_url_links.append(link['href'])
        if readable_website_name not in link.get('href') and len(link.get('href')) > 3:
            external_url_links.append(link['href'])
    print(internal_url_links, '\n')
    print(external_url_links, '\n')
I was expecting that it would be able to crawl and return all links such as
https://www.aurednik.de/info-service/downloads/#unserekataloge
and https://www.aurednik.de/downloads/AUREDNIK_Haupt2021.pdf
but that is not the case. I don't see the second PDF link at all, and for the first link I only see
/info-service/downloads/#unserekataloge
when I print out the external links. I want the full link and preferably also all pdf links on the website domain. How else could I achieve this? I am open to using any tools or libraries.
Maybe the website has dynamic content. Check whether the HTML loaded by BeautifulSoup is the same as what you see when you inspect the website in your browser. If it is not, use a tool such as Selenium to scrape the dynamically loaded content.
from bs4 import BeautifulSoup
from selenium import webdriver
# user_input_url and readable_website_name are defined as in the question's code
driver = webdriver.Firefox()
driver.get(user_input_url)  # load the page so its JavaScript can run
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
internal_url_links = []
external_url_links = []
for link in soup.find_all('a', href=True):
    if readable_website_name in link.get('href'):
        internal_url_links.append(link['href'])
    if readable_website_name not in link.get('href') and len(link.get('href')) > 3:
        external_url_links.append(link['href'])
print(internal_url_links, '\n')
print(external_url_links, '\n')
driver.close()
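On the other part of the question (getting full links and picking out the PDFs): a relative href such as /info-service/downloads/#unserekataloge can be resolved against the base URL with urllib.parse.urljoin, and PDF links can then be filtered by the path's extension. The sketch below is only an illustration under those assumptions. It uses the standard library's HTMLParser so that it runs without extra packages, and the class name, function names and sample HTML are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def absolute_links(html, base_url):
    """Resolve every link found in html against base_url."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def pdf_links(links):
    """Keep links whose path ends in .pdf (case-insensitive)."""
    return [link for link in links if urlparse(link).path.lower().endswith(".pdf")]


# Hypothetical snippet standing in for a downloaded page.
sample_html = """
<a href="/info-service/downloads/#unserekataloge">Downloads</a>
<a href="/downloads/AUREDNIK_Haupt2021.pdf">Catalogue</a>
<a href="https://example.com/other">External</a>
"""

links = absolute_links(sample_html, "https://www.aurednik.de/")
print(links)
print(pdf_links(links))
```

To cover a whole domain, the same two functions could be applied to each internal link in turn while keeping a set of already visited URLs; whether that finds every PDF depends on how the site links to them.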

How can I reach seq tag data via web scraping with BeautifulSoup?

I am a newbie to web scraping. I am trying to get a FASTA file from here, but I cannot. The problem starts with the span tag: I tried a couple of suggestions, but none of them worked for me, and I suspect there may be a privacy restriction.
The FASTA data is in this class, but when I run this code I only see the FASTA title:
import requests
from bs4 import BeautifulSoup
url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
    print(link.text)
## When I try to reach the data directly via span, the output is empty.
div = soup.find("div", {'id':'viewercontent1'})
spans = div.find_all('span')
for span in spans:
    print(span.string)
Every scraping job involves two phases:
1. Understand the page that you want to scrape. (How does it work? Is content loaded via Ajax? Are there redirections? POST? GET? iframes? Anti-scraping measures?...)
2. Emulate the webpage using your favourite framework.
Do not write a single line of code before working through point 1. The network inspector in Chrome DevTools is your friend, use it!
Regarding your webpage, it seems that the report is loaded into a viewer getting data from this url:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=193211599&db=nuccore&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Use that url and you will get your report.
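If you need the report for more than one sequence, the URL can be assembled from its query parameters. This is only a sketch: the parameter values are copied from the URL quoted above, it assumes that only the id needs to change, and the function name is made up.

```python
from urllib.parse import urlencode

VIEWER_ENDPOINT = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def build_viewer_url(sequence_id):
    """Build the viewer URL for one sequence id.

    The parameters are copied from the URL quoted above; only the id varies.
    """
    params = {
        "id": sequence_id,
        "db": "nuccore",
        "report": "fasta",
        "extrafeat": 0,
        "fmt_mask": 0,
        "retmode": "html",
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",
        "maxdownloadsize": 1000000,
    }
    return VIEWER_ENDPOINT + "?" + urlencode(params)


url = build_viewer_url(193211599)
print(url)
```

urlencode percent-encodes the $ in log$ as %24, which is an equivalent spelling of the same parameter name. The resulting URL can be passed to requests.get in place of the one in the question.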

How to scrape dynamic webpages by Python

[What I'm trying to do]
Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1
[Issue]
I want to scrape all of the result pages. At the URL above, only the first 30 items are shown; those can be scraped with the code below, which I wrote. Links to the other pages are displayed like 1 2 3..., but the link addresses seem to be generated by JavaScript. I googled for useful information but couldn't find any.
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string
# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)
for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    #title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    #price of car itself
    print(soup.find(class_='price1').string)
    #price of car including tax
    print(soup.find(class_='price2').string)
    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)
[What I'd like to know]
How to scrape all of the pages. I prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me other ones.
[My environment]
Windows 8.1
Python 3.5
PyDev (Eclipse)
BeautifulSoup4
Any guidance would be appreciated. Thank you.
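You can use Selenium, as in the sample below:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://example.com')
# find_element_by_class_name was removed in Selenium 4; use find_element with a By locator
element = driver.find_element(By.CLASS_NAME, "yourClassName")  # or find by text, etc.
element.click()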
The Python module splinter may be a good starting point. It drives an external browser (such as Firefox) and accesses the browser's DOM rather than dealing with HTML only.

Python - Reddit web crawler using BeautifulSoup4 returns nothing

I've attempted to create a web crawler for Reddit's /r/all that gathers the links of the top posts. I have been following part one of thenewboston's web crawler tutorial series on YouTube.
In my code, I've removed the while loop that sets a limit to the number of pages to crawl in thenewboston's case (I'm only going to crawl the top 25 posts of /r/all, only one page). Of course, I've made these changes to suit the purpose of my web crawler.
In my code, I've changed the URL variable to 'http://www.reddit.com/r/all/' (for obvious reasons) and the Soup.findAll iterable to Soup.findAll('a', {'class': 'title may-blank loggedin'}) (title may-blank loggedin is the class of a title of a post on Reddit).
Here is my code:
import requests
from bs4 import BeautifulSoup
def redditSpider():
    URL = 'http://www.reddit.com/r/all/'
    sourceCode = requests.get(URL)
    plainText = sourceCode.text
    Soup = BeautifulSoup(plainText)
    for link in Soup.findAll('a', {'class': 'title may-blank loggedin'}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)
redditSpider()
I've done some amateur bug-checking using print statements between each line and it seems that the for loop is not being executed.
To follow along or compare thenewboston's code with mine, skip to part two in his mini-series and find a spot in his video where his code is shown.
EDIT: thenewboston's code on request:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://buckysroom.org/trade/search.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = 'http://buckysroom.org' + link.get('href')
            print(href)
        page += 1
trade_spider(1)
This isn't exactly a direct answer to your question, but I thought I'd let you know that there is a Python API for Reddit called PRAW (the Python Reddit API Wrapper). You may want to check it out, as it can do what you're looking for much more easily.
Link: https://praw.readthedocs.org/en/v2.1.20/
So first of all, thenewboston seems to be a screencast, so getting that code in here would be helpful.
Secondly, I would recommend saving the page to a local file so you can open it in a browser and use the developer tools to find what you want. I would also recommend using IPython to play around with BeautifulSoup on the local file rather than scraping the site every time.
If you throw this in there you can accomplish that:
plainText = sourceCode.text
# write the page to disk as UTF-8 so it can be opened in a browser
with open('something.html', 'w', encoding='utf8') as f:
    f.write(plainText)
When I ran your code, first of all I had to wait because several times it gave me back an error page that I was requesting too often. That could be your first problem.
When I did get the page, there were plenty of links but none with your classes. I'm not sure what 'title may-blank loggedin' is supposed to represent without watching that entire Youtube series.
Now I see the problem
It's the logged in class, you are not logged in with your scraper.
You shouldn't need to login just to see /r/all, just use this instead:
soup.findAll('a', {'class': 'title may-blank '})
You are not "logged in", hence that class styling is never applied. This works without the logged in:
import requests
from bs4 import BeautifulSoup
def redditSpider():
    URL = 'http://www.reddit.com/r/all'
    source = requests.get(URL)
    Soup = BeautifulSoup(source.text)
    for link in Soup.findAll('a',attrs={'class' : 'title may-blank '}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)
redditSpider()
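The class attribute is a space-separated list of class names, so comparing it with one exact string (with or without loggedin, with or without the trailing space) is fragile. A more forgiving idea is to require the individual class names and ignore any others. The sketch below illustrates that idea using only the standard library's HTMLParser; the class name and the sample markup are made up, and the markup only mimics the class strings discussed above.

```python
from html.parser import HTMLParser


class ClassLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags that carry all of the required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # split the class attribute into individual class names
        classes = set((attributes.get("class") or "").split())
        if self.required <= classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# Hypothetical markup: a title link as seen logged in and logged out.
sample_html = """
<a class="title may-blank loggedin" href="/r/all/comments/1">logged in</a>
<a class="title may-blank " href="/r/all/comments/2">logged out</a>
<a class="thumbnail" href="/r/all/comments/3">not a title</a>
"""

finder = ClassLinkFinder(["title", "may-blank"])
finder.feed(sample_html)
print(finder.hrefs)
```

BeautifulSoup supports the same kind of matching through CSS selectors, for example Soup.select('a.title.may-blank').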

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site and when I do, I get an unexpected result. I figured out that this is happening because of javascript.
Under the "See All categories" tab you will find all the major product categories. If you hover the mouse over any category, it expands to show its subcategories. I want those links from each major category.
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But, this gave me a different result than what I expected (I turned off javascript and looked at the page source and output was from this source)
I just want to find all the sub-links from each major category. Any suggestions will be appreciated.
This is happening because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to pass html.parser explicitly when parsing the page.
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as Selenium. It provides a web driver, which is its main advantage; I couldn't handle JavaScript with bs4 myself.
Categories Menu is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
To examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can check the XHR requests a website makes under the Network tab of either tool.
Use a web scraping tool such as scrapy or mechanize
In mechanize, to get all the links in the snapdeal homepage,
from mechanize import Browser
br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print(link.text)
    print(link.url)
I have been looking into a way to scrape links from webpages that are only rendered in an actual browser but wanted the results to be run using a headless browser.
I was able to achieve this using phantomJS, selenium and beautiful soup
#!/usr/bin/python
import bs4
import requests
from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
browser = driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
links = [a.attrs.get('href') for a in soup.find_all('a')]
for paths in links:
    print paths
driver.close()
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2
url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
# default context (with certificate verification) for opening HTTPS URLs
gcontext = ssl.create_default_context()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.
