I have a website with a link structure like this:
https://example.com/assets/contents/1627347928.mp4
https://example.com/assets/contents/1627342345.mp4
https://example.com/assets/contents/1627215324.mp4
I want to use Python to get all the links so I can download them. When I access the folder /assets/contents/ I get a 404 error, so I can't list the media in that web folder, but I know every MP4 filename has 10 characters and starts with "1627", i.e. "1627******.mp4".
Can I write a loop that checks all the possible links on that website and keeps only the valid ones? Thanks! I'm a Python newbie.
With the code below I can see the headers of a single file, but how do I make a loop that checks all the links and downloads them automatically, or at least shows me the valid links? Thanks!
import requests
link = 'https://example.com/assets/contents/1627347923.mp4'
r = requests.get(link, stream=True)
print(r.headers)
Print whether the file exists or not:
import requests

names = [1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'

for item in names:
    link = base.format(item)
    print(link)
    r = requests.head(link, allow_redirects=True)
    if r.status_code == 200:
        print("found {}.mp4".format(item))
        # to actually download, use requests.get() here; the HEAD response has no body
        # open('{}.mp4'.format(item), 'wb').write(requests.get(link).content)
    else:
        print("File not found or error getting headers")
Or try to download it
import requests

names = [1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'

for item in names:
    link = base.format(item)
    print(link)
    # uncomment below to download
    # r = requests.get(link, allow_redirects=True)
    # open('{}.mp4'.format(item), 'wb').write(r.content)
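If you want to combine the check and the download in one pass, here is a rough sketch. It streams the body so large files are not held in memory; the ID list is just the example values above, swap in whatever range you want to scan.

import requests

base = 'https://example.com/assets/contents/{}.mp4'

# assumption: probe an explicit list of IDs; replace with whatever IDs you want to scan
for item in [1627347923, 1627347924, 1627347925]:
    link = base.format(item)
    head = requests.head(link, allow_redirects=True)
    if head.status_code != 200:
        print("{} not found".format(link))
        continue
    # the file exists, so fetch it with GET and stream it to disk in chunks
    r = requests.get(link, stream=True)
    with open('{}.mp4'.format(item), 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    print("saved {}.mp4".format(item))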
Yes, you can run a loop, check the status code (or whether requests.get() raises an error) and collect all the files that exist. But there are some problems that might make you reconsider that approach:
Your files follow the pattern "1627******.mp4", so a loop would have to check 10^6 candidates if the remaining characters are all digits, which is not efficient. If letters and special characters are allowed as well, it becomes far worse.
What if in the future you have more than 10^6 files? The naming format will have to change, and so will your code.
A much simpler and more efficient solution is to keep a record of your files somewhere you control, such as a file or, better, a database, so you can just query it for the details you need.
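For example, a minimal sketch of that idea using the standard-library sqlite3 module (the table and column names here are made up for illustration):

import sqlite3

# hypothetical schema: one row per uploaded media file
conn = sqlite3.connect('media.db')
conn.execute('CREATE TABLE IF NOT EXISTS media (filename TEXT PRIMARY KEY)')

# record a file whenever it is uploaded
conn.execute('INSERT OR IGNORE INTO media (filename) VALUES (?)', ('1627347928.mp4',))
conn.commit()

# later: list every known file instead of probing a million URLs
for (filename,) in conn.execute('SELECT filename FROM media'):
    print('https://example.com/assets/contents/' + filename)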
Also, a 404 error means the page you are trying to reach was not found; in your case it essentially means that URL doesn't exist.
Sample code to check whether a link exists:
import requests

files = []
links = ["https://www.youtube.com/", "https://docs.python.org", "https://qewrt.org"]

for i in links:
    try:
        r = requests.get(i)
        r.raise_for_status()  # raises an error if the link doesn't exist or returns 4xx/5xx
        files.append(i)       # otherwise the link is appended to the files list
    except requests.exceptions.RequestException:
        print(i + " doesn't exist")

print(files)
Building on this, and based on your filename pattern, checking whether each candidate file exists:
import requests

file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []

for i in range(10**6):
    suffix = str(i).zfill(6) + ".mp4"  # zero-pad to 6 digits, e.g. 000123.mp4
    file_name = file_prefix + suffix
    try:
        r = requests.get(file_name)
        r.raise_for_status()  # treat 404 and other error codes as "does not exist"
        file_lists.append(file_name)
    except requests.exceptions.RequestException:
        continue

for i in file_lists:
    print(i)
Based on your code and LMC's code, I put something together that tests all the MP4 files and shows me the headers. How can I pick only the links that actually point to a valid MP4 file, like the links above?
import requests

file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []

for i in range(10**6):
    suffix = str(i).zfill(6) + ".mp4"
    file_name = file_prefix + suffix
    try:
        r = requests.get(file_name, stream=True)
        r.raise_for_status()
        file_lists.append(file_name)
        print(file_name)
        print(r.headers)
    except requests.exceptions.RequestException:
        continue

for i in file_lists:
    print(i)
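One way to keep only the links that actually serve an MP4 is to look at the Content-Type response header instead of printing all the headers. A rough sketch, assuming the server reports video/mp4 for valid files (some servers use application/octet-stream, so adjust the check if needed):

import requests

file_prefix = 'https://example.com/assets/contents/1627'
valid_links = []

for i in range(10**6):
    file_name = file_prefix + str(i).zfill(6) + ".mp4"
    try:
        r = requests.head(file_name, allow_redirects=True)
        # keep the link only when the response is OK and the body is reported as MP4
        if r.status_code == 200 and r.headers.get('Content-Type', '').startswith('video/mp4'):
            valid_links.append(file_name)
    except requests.exceptions.RequestException:
        continue

for link in valid_links:
    print(link)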
Related
I scraped a website (url = "http://bla.com/bla/bla/bla/bla.txt") for all the links containing .pdf that were important to me.
These are now stored in relative_paths:
['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]
Now I want to store the PDFs "behind" the links in a local folder, with each filename derived from its URL.
None of the somewhat similar questions on the internet seems to get me to my goal. The closest I got was some weird file that didn't even have an extension. Here are some of the more promising code samples I already tried:
for link in relative_paths:
    content = requests.get(link, verify=False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)

for link in relative_paths:
    response = requests.get(url, verify=False)
    with open(join(r'C:/Users/', basename(url)), 'wb') as f:
        f.write(response.content)

for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify=False).content)

for link in relative_paths:
    pdf_response = requests.get(link, verify=False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)
Now I am confused and don't know how to move forward. Can you transform one of the for loops and provide a small explanation, please? If the URLs are too long for a filename, splitting at the third-to-last / is also fine. Thanks :)
Also, the website host asked me not to scrape all of the PDFs at once so the server does not get overloaded, since there are thousands of PDFs behind the links stored in relative_paths. That is why I am looking for a way to add some sort of delay between my requests.
give this a shot:
import time
import requests

count_downloads = 25   # <--- pause after every 25 downloads
time_delay = 60        # <--- wait 60 seconds at each pause

for idx, link in enumerate(relative_paths):
    if idx % count_downloads == 0:
        print('Waiting %s seconds...' % time_delay)
        time.sleep(time_delay)
    filename = link.split('jjjjj-')[-1]  # <-- whatever that is is where you want to split
    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(link).content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
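If the 'jjjjj-' split doesn't match your real URLs, a more general way to build a safe local filename is to take the last path segment of the URL. A minimal sketch, assuming Python 3 and reusing the relative_paths list from your question:

import os
import time
import requests
from urllib.parse import urlparse  # on Python 2 use: from urlparse import urlparse

for link in relative_paths:
    # derive the local name from the last path segment, e.g. jjjjj-99-0065.pdf
    filename = os.path.basename(urlparse(link).path)
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify=False).content)
    time.sleep(2)  # small fixed pause between requests to go easy on the server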
I was working with the Python Confluence API to download attachments from a Confluence page, and I need to download only files with the .mpp extension. I tried glob and direct parameters, but it didn't work.
Here is my code:
file_name = glob.glob("*.mpp")
attachments_container = confluence.get_attachments_from_content(page_id=33110, start=0, limit=1, filename=file_name)
print(attachments_container)
attachments = attachments_container['results']
for attachment in attachments:
    fname = attachment['title']
    download_link = confluence.url + attachment['_links']['download']
    r = requests.get(download_link, auth=HTTPBasicAuth(confluence.username, confluence.password))
    if r.status_code == 200:
        if not os.path.exists('phoenix'):
            os.makedirs('phoenix')
        fname = ".\\phoenix\\" + fname
glob.glob() operates on your local folder, so you can't use it as a filter for get_attachments_from_content(). Also, don't specify a limit of 1, since that gets you just the first attachment. Specify a high limit, or whatever default will include all of them. (You may have to paginate results.)
However, you can exclude the files you don't want by checking the title of each attachment before you download it, which you already have as fname = attachment['title'].
attachments_container = confluence.get_attachments_from_content(page_id=33110, limit=1000)
attachments = attachments_container['results']
for attachment in attachments:
    fname = attachment['title']
    if not fname.lower().endswith('.mpp'):
        # skip the file if it doesn't have that extension
        continue
    download_link = ...
    # rest of your code here
Also, your code looks like a copy-paste from this answer, but you've changed the actual "downloading" part of it. So if your next Stack Overflow question is going to be "how to download a file from Confluence", use that answer's code.
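For completeness, a rough sketch of the download step, reusing the request and folder handling already present in your own code; confluence, attachment, and fname are the objects from the question, not new API names:

import os
import requests
from requests.auth import HTTPBasicAuth

# inside the `for attachment in attachments:` loop, after the extension check;
# `confluence`, `attachment`, and `fname` come from the question's code
download_link = confluence.url + attachment['_links']['download']
r = requests.get(download_link, auth=HTTPBasicAuth(confluence.username, confluence.password))
if r.status_code == 200:
    if not os.path.exists('phoenix'):
        os.makedirs('phoenix')                       # create the target folder once
    with open(os.path.join('phoenix', fname), 'wb') as f:
        f.write(r.content)                           # write the attachment bytes to phoenix\<title>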
Please correct me if I am wrong, as I am a beginner in Python.
I have a web service URL which returns an XML file:
http://abc.tch.xyz.edu:000/patientlabtests/id/1345
I have a list of values, and I want to append each value to the URL, download the file for each value, and name each downloaded file after the value appended from the list.
It is possible to download one file at a time, but I have thousands of values in the list, and I got stuck trying to write a function with a for loop.
x = [1345, 7890, 4729]

for i in x:
    url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
    response = requests.get(url)
    # ****** Missing part of the code ********
    with open('.xml', 'wb') as file:
        file.write(response.content)
        file.close()
The files downloaded from the URL should be named like:
"1345patientlabresults.xml"
"7890patientlabresults.xml"
"4729patientlabresults.xml"
I know there is a part of the code that is missing, and I am unable to fill in that missing part. I would really appreciate it if anyone could help me with this.
Accessing your web service URL doesn't seem to work for me; check that first.
import requests

x = [1345, 7890, 4729]

for i in x:
    url2 = "http://abc.tch.xyz.edu:000/patientlabresults/id/"
    response = requests.get(url2 + str(i))  # i must be converted to a string
Note: When you use 'with' to open a file, you do not have to close the file, since it will be closed automatically.
with open(filename, mode) as file:
    file.write(data)
Since the URL you provided is not working, I am going to use a different URL. I hope you get the idea of how to write to a file using a custom name.
import requests

categories = ['fruit', 'car', 'dog']

for category in categories:
    url = "https://icanhazdadjoke.com/search?term="
    r = requests.get(url + category)
    file_name = category + "_JOKES_2018"  # files will be saved as fruit_JOKES_2018, etc.
    data = r.status_code  # storing the status code in the 'data' variable
    with open(file_name + ".txt", 'w+') as f:
        f.write(str(data))  # writing the status code of each URL to the file
After running this code, the status code of each request will be written to its own file, and the files will be named as follows:
car_JOKES_2018.txt
dog_JOKES_2018.txt
fruit_JOKES_2018.txt
I hope this gives you an understanding of how to name the files and write into the files.
I think you just want to create a path using str.format, as you (almost) are doing for the URL. Maybe something like the following:
import os.path
import requests

x = [1345, 7890, 4729]

for i in x:
    path = '{}patientlabresults.xml'.format(i)
    # ignore this file if we've already got it
    if os.path.exists(path):
        continue
    # try to get the file, raising an exception on failure
    url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
    res = requests.get(url)
    res.raise_for_status()
    # write the successful response out (binary mode, since content is bytes)
    with open(path, 'wb') as fd:
        fd.write(res.content)
I've added some error handling, and skipping files that already exist makes it safe to re-run after a failure.
I'm trying to download images from a list of URLs. Each URL returns a txt file with JPEG information. The URLs are uniform except for an incremental change in the folder number. Below are example URLs:
Min: https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407
Max: https://marco.ccr.buffalo.edu/data/train/train-00407-of-00407
I want to read each of these URLs and store their output in another folder. I was looking into the requests library to do this, but I'm wondering how to iterate over the URLs, essentially writing my loop to increment the number in the URL. Apologies in advance if I misuse the terminology. Thanks!
# This may be terrible starting code
# imported the requests library
import requests

# URL of the file to be downloaded
url = "https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407"

# send a HTTP request to the server and save
# the HTTP response in a response object called r
r = requests.get(url)

with open("data.txt", 'wb') as f:
    # write the contents of the response (r.content)
    # to a new file in binary mode
    f.write(r.content)
You can generate the URLs like this and perform a GET for each:
for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    print(url)
Also, use the loop variable in the filename to keep a separate copy of each. For example, use this:
with open("data" + str(i) + ".txt",'wb') as f:
Overall, the code may look something like this (not exactly this):
import requests

for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    r = requests.get(url)
    # you might have to change the extension
    with open("data" + str(i).zfill(5) + ".txt", 'wb') as f:
        f.write(r.content)
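Since you mentioned storing the output in another folder, here is a hedged variant that writes into a target directory; the folder name "train_data" is just an example:

import os
import requests

out_dir = "train_data"           # example target folder name (an assumption)
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    r = requests.get(url)
    # you might have to change the extension
    with open(os.path.join(out_dir, "data" + str(i).zfill(5) + ".txt"), 'wb') as f:
        f.write(r.content)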
So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into times where I have to create multiple databases. The code here works to download all the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multithreading and could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (I'm new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get this sometimes with certain combinations of retstart and retmax. It crashes the program, and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0,Count,retmax):
    #Create the position string
    position = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + position + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen

def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None

def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
You should use multithreading; it's the right approach for download tasks.
"these files take more than 10 seconds to download and I do not know how to handle stalling":
I don't think this would be a problem, because Python's multithreading handles exactly this kind of I/O-bound work. While one thread is waiting for a download to complete, the CPU lets the other threads do their work.
Anyway, you'd better at least try it and see what happens.
Two other ways to tackle the task: 1. use processes instead of threads; multiprocessing is the module to use. 2. use an event-based approach; gevent is the right module, as sketched below.
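A rough illustration of the gevent approach, for the same Python 2 / urllib2 setup as your script; the (url, filename) pairs are placeholders for however you build your efetch URLs from retstart/retmax:

# rough gevent sketch; fill url_filename_pairs however fits your retstart/retmax scheme
from gevent import monkey
monkey.patch_all()   # make the socket layer cooperative; do this before other imports

import gevent
from urllib2 import urlopen

def fetch(url, filename):
    with open(filename, 'wb') as f:
        f.write(urlopen(url).read())

url_filename_pairs = [
    # ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?...', 'chunk_0.fasta'),
]

jobs = [gevent.spawn(fetch, url, filename) for url, filename in url_filename_pairs]
gevent.joinall(jobs)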
The 502 error is not your script's fault. Simply put, the following pattern can be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
        break  # success, stop retrying
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
In the except clause, you can inspect the concrete HTTP status code and handle each case accordingly.
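For example, a sketch of inspecting the status code inside the except clause; download_task and clean_environment_for_retry are the same placeholders as in the snippet above:

import time
import urllib2

try_count = 3
while try_count > 0:
    try:
        download_task()              # placeholder from the snippet above
        break                        # success, stop retrying
    except urllib2.HTTPError as e:
        if e.code == 502:            # Bad Gateway is usually transient
            time.sleep(30)           # back off a little before retrying
            clean_environment_for_retry()   # placeholder from the snippet above
            try_count -= 1
        else:
            raise                    # any other HTTP error is likely a real problem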