How to remove completed torrent using libtorrent rasterbar python binding? - python

I have a python script that downloads files using libtorrent python binding. I just want to know how to remove the torrent once the download is complete.
I'm posting here the example script I used to make mine (I'm not posting mine because it's too large, it has database parts).
import libtorrent as lt
import time

ses = lt.session()
params = {'save_path': '/home/downloads/'}
link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)

print 'downloading metadata...'
while (not handle.has_metadata()):
    time.sleep(1)
print 'got metadata, starting torrent download...'
while (handle.status().state != lt.torrent_status.seeding):
    print '%d %% done' % (handle.status().progress*100)
    time.sleep(1)
Thanks.

You call remove_torrent() on the session object, passing in the torrent_handle you want to remove.
http://libtorrent.org/reference-Core.html#remove_torrent()
In your script:
ses.remove_torrent(handle)
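For instance, a minimal sketch of how the end of the script from the question could look, removing the torrent from the session once it reaches the seeding state (ses, handle and the loop are the ones from the script above):

# ... same setup as above: session, magnet link, metadata wait ...
while handle.status().state != lt.torrent_status.seeding:
    print '%d %% done' % (handle.status().progress * 100)
    time.sleep(1)

print 'download complete, removing torrent from the session'
ses.remove_torrent(handle)
# Note: this only stops the session from tracking the torrent; the downloaded
# files stay in save_path. Whether a "delete files" option is available when
# removing depends on your libtorrent version, so check the reference linked above.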

Related

Problem using a proxy while running a Python script

I'm having a problem related to proxies in my Python script. When I run the script below to access NCBI BLAST through Biopython, the network at the company where I work blocks the access for security reasons. The IT guys gave me a proxy for this kind of situation, which has to be incorporated into my script. I've tried a lot of potential solutions, but nothing seems to work. Am I missing something here?
def main(seq):
    import os
    from Bio.Blast import NCBIWWW
    import time

    start_time = time.time()
    try:
        print('Connecting to NCBI...')
        blast_handle = NCBIWWW.qblast('blastn', 'nt', sequence=seq, format_type='Text', megablast=True)
        text = blast_handle.read()
        print(text)
        print("--- %s seconds ---" % (time.time() - start_time))
    except Exception as e:
        print(e)

if __name__ == '__main__':
    import os
    os.environ['http_proxy'] = 'http://123.456.78.90:80'  # The proxy IT guys gave me
    seq = 'CAACTTTTTTTTTTATTACAGACAATCAAGAAATTTTCTATTGAAATAAAATATTTTAAA\
ACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGTAGCGAATTGCGATAA\
GTAATGTGAATTGCAGATTCTCGTGAATCATTGAATTTTTGAACGCACATTGCGCCCTCT\
GGTATTCCAGAGGGCATGCCTGTTTGAGCGTCATTTCCTTCTCAAAAACCCAGTTTTTGG\
TTGTGAGTGATACTCTGCTTCAGGGTTAACTTGAAAATGCTATGCCCCTTTGGCTGCCCT\
TCTTTGAGGGGACTGCGCGTCTGTGCAGGATGTAACCAATGTATTTAGGTATTCATACCA\
ACTTTCATTGTGCGCGTCTTATGCAGTTGTAGTCCACCCAACCTCAGACACACAGGCTGG\
CTGGGCCAACAGTATTCATAAAGTTTGACCTCA'
    main(seq)
Thank you very much.
It seems that NCBIWWW.qblast doesn't have support for proxies, so you will need to adapt the code yourself.
In your local Biopython installation, go find the biopython/Bio/Blast/NCBIWWW.py file and add your proxy settings at line 203:
request = Request(url_base, message, {"User-Agent": "BiopythonClient"})
request.set_proxy('123.456.78.90:80', 'http')  # <-- Add this line; set_proxy() expects host:port without the scheme
handle = urlopen(request)
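If you'd rather not patch the library file, an alternative that may work: qblast goes through urllib under the hood, so installing a global opener with a ProxyHandler before calling it can route its requests through the proxy. This is only a sketch under that assumption, reusing the placeholder proxy address from the question:

import urllib.request
from Bio.Blast import NCBIWWW

# Route all urllib requests (including the ones qblast makes internally)
# through the proxy given by IT.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://123.456.78.90:80',
    'https': 'http://123.456.78.90:80',
})
urllib.request.install_opener(urllib.request.build_opener(proxy_handler))

# 'ACGT' stands in for the real query sequence from the question.
blast_handle = NCBIWWW.qblast('blastn', 'nt', sequence='ACGT', format_type='Text', megablast=True)
print(blast_handle.read())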

Regular file download with Python scheduler and wget

I wrote a simple script that schedules the download of a file from a web page once a week, using the schedule module. Before downloading, it checks with BeautifulSoup whether the file was updated; if yes, it downloads the file using wget. Another script then uses the file to perform calculations.
The problem is that the file won't appear in the directory until I manually interrupt the script. So each time I have to interrupt the script and rerun it so that it gets scheduled for the next week.
Is there any way to download and save the file "on the fly", without interrupting the script?
Here is the code:
import wget
import ssl
import schedule
import time
from urllib.request import urlopen  # needed by check_for_updates()
from bs4 import BeautifulSoup
import datefinder
from datetime import datetime

# disable certificate checks
ssl._create_default_https_context = ssl._create_unverified_context

# check if the file was updated; if yes, download it, otherwise wait and retry
def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        wget.download(url)
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()

# check if the website was updated today
def check_for_updates():
    url2 = 'https://fgisonline.ams.usda.gov/ExportGrainReport/default.aspx'
    html = urlopen(url2).read()
    soup = BeautifulSoup(html, "lxml")
    text_to_search = soup.body.ul.li.string
    matches = list(datefinder.find_dates(text_to_search[30:]))
    found_date = matches[0].date()
    today = datetime.today().date()
    return found_date == today

schedule.every().tuesday.at('09:44').do(download_file)

while True:
    schedule.run_pending()
    time.sleep(1)
You need to specify the output directory. I think that unless you do this, the file gets saved to a temporary directory somewhere and is only copied over when you stop the script in PyCharm.
Change to:
wget.download(url, out=output_directory)
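For example, in the download_file() function from the question this would look like the following (the output path is just a placeholder for wherever the other script expects the file):

def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        # Save directly into the directory the calculation script reads from.
        wget.download(url, out='/home/user/data')
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()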
Based on the following clue you should be able to solve your issue:
import requests
import urllib3

urllib3.disable_warnings()

def main(url):
    r = requests.head(url, verify=False)
    print(r.headers['Last-Modified'])

main("https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv")
Output:
Mon, 28 Sep 2020 15:02:22 GMT
Now you can run your script daily via a cron job at the time you prefer, poll the file's Last-Modified header until it equals today's date, and then download the file.
Note that I used a HEAD request, which is far cheaper for polling; once the date matches you can fetch the file itself with requests.get.
I also prefer to work within the same session.
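A minimal sketch of that idea, assuming the same URL as above and using one requests.Session for both the HEAD polling and the eventual download (the output file name and the retry interval are arbitrary choices):

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import time

import requests
import urllib3

urllib3.disable_warnings()

URL = "https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv"

def wait_and_download():
    with requests.Session() as session:
        while True:
            # Cheap HEAD request just to read the Last-Modified header.
            head = session.head(URL, verify=False)
            modified = parsedate_to_datetime(head.headers['Last-Modified']).date()
            if modified == datetime.now(timezone.utc).date():
                # The file was refreshed today: download it with a GET request.
                response = session.get(URL, verify=False)
                with open("CY2020.csv", "wb") as f:
                    f.write(response.content)
                return
            time.sleep(600)  # try again in ten minutes

wait_and_download()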

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python, I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script which finds the Google Analytics request for a certain website, parses the request payload and does something with it.
Here are the requirements:
Ask the user for 2 URLs (for comparing the payloads from 2 different HAR captures)
Use selenium to open the two URLs, and use browsermobproxy/phantomJS to get all the HAR entries
Store the HAR as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If the Google Analytics tag is found, then do things... like parse the payload, compare the payloads, etc.
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete, i.e. my program will say "GA Not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and the complete HAR didn't get captured. I tried "wait for traffic to stop" but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- as written, the else branch exits the programme on the first entry in the ent list that doesn't match, so if the GA request is not the first entry, it won't be found
2 - Call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
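As a rough sketch, the body of the outer loop in the question's script could look like this with both changes applied, so that "No GA Found" is only reported after the whole HAR has been scanned (all names are taken from the script above):

proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
driver.get(urlList)

found_ga = False
for ent in proxy.har['log']['entries']:
    request_url = ent['request']['url']
    if re.search(r'google-analytics.com/r\b', request_url):
        print 'Found GA'
        collectTags.append(request_url)
        found_ga = True
        break

if not found_ga:
    print 'No GA Found for %s' % urlList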

Amazon Product API for python - A strange error that I can't figure out

Here is the situation: I'm trying to get this API to work for me, but I can't seem to figure out all of its quirks. Even when I run the example below (taken from https://bitbucket.org/basti/python-amazon-product-api/src/2b6b628300c4/examples/all-galileo-titles.py) I get random errors. Right now I'm getting an error saying that __init__() takes 4 arguments and I'm only feeding it two. I put all of the credentials in the correct place, though, and I can't find anywhere in the modules where __init__() demands extra arguments. Anyone have any thoughts?
Here is what I was running. More details about the program can be found at that link above.
from amazonproduct.api import API
import lxml

if __name__ == '__main__':
    # Don't forget to create file ~/.amazon-product-api
    # with your credentials (see docs for details)
    api = API(locale='uk')
    result = api.item_search('Books', Publisher='RosettaBooks',
                             ResponseGroup='Large')

    # extract paging information
    total_results = result.results
    total_pages = len(result)  # or result.pages

    for book in result:
        print 'page %d of %d' % (result.current, total_pages)
        #~ from lxml import etree
        #~ print etree.tostring(book, pretty_print=True)
        print book.ASIN,
        print unicode(book.ItemAttributes.Author), ':',
        print unicode(book.ItemAttributes.Title),
        if hasattr(book.ItemAttributes, 'ListPrice'):
            print unicode(book.ItemAttributes.ListPrice.FormattedPrice)
        elif hasattr(book.OfferSummary, 'LowestUsedPrice'):
            print u'(used from %s)' % book.OfferSummary.LowestUsedPrice.FormattedPrice
I had the same problem. On the Bitbucket site for python-amazon-product-api I found a question from a couple of years ago that enabled me to work around it: see "11/1/2011 new requirement AssociateTag -- update required?".
What I did to work around it was to not use the .amazon-product-api file to specify the credentials, but instead pass them as part of the API call:
AWS_KEY = '……….'
SECRET_KEY = '……………'
ASSOCIATE_TAG = '…………….'
api = API(access_key_id=AWS_KEY, secret_access_key=SECRET_KEY, locale="us", associate_tag=ASSOCIATE_TAG)

Loading Magnet LINK using Rasterbar libtorrent in Python

How would one load a Magnet link via rasterbar libtorrent python binding?
import libtorrent as lt
import time

ses = lt.session()
params = {'save_path': '/home/downloads/'}
link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)

print 'downloading metadata...'
while (not handle.has_metadata()):
    time.sleep(1)
print 'got metadata, starting torrent download...'
while (handle.status().state != lt.torrent_status.seeding):
    print '%d %% done' % (handle.status().progress*100)
    time.sleep(1)
