Downloading XML files from a web services URL in Python

Please correct me if I am wrong, as I am a beginner in Python.
I have a web services URL which returns an XML file:
http://abc.tch.xyz.edu:000/patientlabtests/id/1345
I have a list of values, and I want to append each value in the list to the URL, download the file for each value, and name each downloaded file after the value appended from the list.
It is possible to download one file at a time, but I have thousands of values in the list, so I was trying to write a function with a for loop and I am stuck.
x = [1345, 7890, 4729]
for i in x:
    url = "http://abc.tch.xyz.edu:000/patientlabresults/id/{}".format(i)
    response = requests.get(url)
    # ****** Missing part of the code ********
    with open('.xml', 'wb') as file:
        file.write(response.content)
        file.close()
The files downloaded from the URL should be named like:
"1345patientlabresults.xml"
"7890patientlabresults.xml"
"4729patientlabresults.xml"
I know part of the code is missing and I am unable to fill in that missing part. I would really appreciate it if anyone could help me with this.

Accessing your web service URL does not seem to be working. Check this:
import requests

x = [1345, 7890, 4729]
for i in x:
    url2 = "http://abc.tch.xyz.edu:000/patientlabresults/id/"
    response = requests.get(url2 + str(i))  # i must be converted to a string
Note: When you use 'with' to open a file, you do not have to close the file, since it will be closed automatically:
with open(filename, mode) as file:
    file.write(data)
Since the URL you provided is not working, I am going to use a different URL. I hope you get the idea of how to write to a file using a custom name:
import requests

categories = ['fruit', 'car', 'dog']
for category in categories:
    url = "https://icanhazdadjoke.com/search?term="
    response = requests.get(url + category)
    file_name = category + "_JOKES_2018"  # files will be saved as e.g. fruit_JOKES_2018
    data = response.status_code  # storing the status code in the 'data' variable
    with open(file_name + ".txt", 'w+') as f:
        f.write(str(data))  # writing the status code of each URL to the file
After running this code, the status code will be written to each of the files, and the files will be named as follows:
car_JOKES_2018.txt
dog_JOKES_2018.txt
fruit_JOKES_2018.txt
I hope this gives you an understanding of how to name the files and write to them.
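Applied back to your XML case, a minimal sketch, assuming your endpoint is reachable and returns the XML in the response body:
import requests

x = [1345, 7890, 4729]
base_url = "http://abc.tch.xyz.edu:000/patientlabresults/id/"

for i in x:
    response = requests.get(base_url + str(i))
    file_name = str(i) + "patientlabresults.xml"  # e.g. 1345patientlabresults.xml
    with open(file_name, 'wb') as f:
        f.write(response.content)  # write the raw XML bytes to disk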

I think you just want to create a path using str.format, as you (almost) are for the URL. Maybe something like the following:
import os.path
import requests

x = [1345, 7890, 4729]
for i in x:
    path = '{}patientlabresults.xml'.format(i)
    # ignore this file if we've already got it
    if os.path.exists(path):
        continue
    # try and get the file, throwing an exception on failure
    url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
    res = requests.get(url)
    res.raise_for_status()
    # write the successful file out (binary mode, since res.content is bytes)
    with open(path, 'wb') as fd:
        fd.write(res.content)
I've added some error handling and better behaviour on re-runs: files that already exist are skipped.
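If you'd rather skip failures than stop at the first bad ID, a variant of the loop (sketch only, reusing x from above) can catch the exception and continue:
import requests

for i in x:
    url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
    try:
        res = requests.get(url)
        res.raise_for_status()
    except requests.RequestException as exc:
        print('skipping {}: {}'.format(i, exc))  # log the failure and carry on
        continue
    with open('{}patientlabresults.xml'.format(i), 'wb') as fd:
        fd.write(res.content)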

Related

Python -> Get all valid media download URLs from a web folder

I have a website that has a link structure like this:
https://example.com/assets/contents/1627347928.mp4
https://example.com/assets/contents/1627342345.mp4
https://example.com/assets/contents/1627215324.mp4
I want to use Python to get all the download links. When I access the folder /assets/contents/ I get a 404 error, so I can't list the media in this web folder, but I know all the MP4 filenames have 10 characters and all of them start with "1627******.mp4".
Can I do a loop to check all the links from that website and get all the valid ones? Thanks! I am a newbie in Python right now.
With the code below I can see the headers of a file, but how do I make a loop to check all the links and download them automatically, or just show me the valid links? Thanks!
import requests
link = 'https://example.com/assets/contents/1627347923.mp4'
r = requests.get(link, stream=True)
print(r.headers)
Print whether the file exists or not:
import requests

names = [1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'

for item in names:
    link = base.format(item)
    print(link)
    r = requests.head(link, allow_redirects=True)
    if r.status_code == 200:
        print("found {}.mp4".format(item))
        #open('{}.mp4'.format(item), 'wb').write(r.content)
    else:
        print("File not found or error getting headers")
Or try to download it
import requests

names = [1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'

for item in names:
    link = base.format(item)
    print(link)
    # uncomment below to download
    #r = requests.get(link, allow_redirects=True)
    #open('{}.mp4'.format(item), 'wb').write(r.content)
Yes, you can run a loop, check the status code (or whether requests.get() throws an error), and in that way find all the files, but there are some problems which might stop you from choosing that approach:
Your files are in the format "1627******.mp4", which means a loop would have to check 10^6 candidates if all the * are digits, which is not efficient. If you plan to include letters and special characters as well, it becomes far more inefficient.
What if in the future you have more than 10^6 files? Your naming format will have to change, and so will your code.
A much simpler, more straightforward and efficient solution is to have a place where your data is recorded, a file or, better, a database, which you can query for all your files and the necessary details; see the sketch below.
Also, a 404 error means the page you are trying to reach was not found; in your case, it essentially means the file doesn't exist.
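For illustration only, a minimal sketch of that idea using sqlite3 from the standard library (the database file, table and column names are invented for this example):
import sqlite3

conn = sqlite3.connect("media.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY)")

# record each file once, as it is created or uploaded
conn.execute("INSERT OR IGNORE INTO files (name) VALUES (?)", ("1627347928.mp4",))
conn.commit()

# later: query the database instead of probing a million URLs
for (name,) in conn.execute("SELECT name FROM files"):
    print("https://example.com/assets/contents/" + name)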
Sample code to check whether a link exists:
import requests

files = []
links = ["https://www.youtube.com/", "https://docs.python.org", "https://qewrt.org"]
for i in links:
    try:
        # if the link doesn't exist, requests.get throws an error;
        # otherwise the link is appended to the files list
        requests.get(i)
        files.append(i)
    except:
        print(i + " doesn't exist")
print(files)
Building on this, and based on your naming condition, checking whether all files in the given format exist:
import requests

file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []
for i in range(10**6):
    suffix = (6 - len(str(i))) * "0" + str(i) + ".mp4"  # zero-pad i to 6 digits
    file_name = file_prefix + suffix
    try:
        requests.get(file_name)
        file_lists.append(file_name)
    except:
        continue
for i in file_lists:
    print(i)
Based on your code and LMC's code, I made a thing that tests all the MP4 files and shows me the headers. How can I pick only the links that have a valid MP4 file, like the example link?
import requests

file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []
for i in range(10**6):
    suffix = (6 - len(str(i))) * "0" + str(i) + ".mp4"
    file_name = file_prefix + suffix
    try:
        requests.get(file_name)
        file_lists.append(file_name)
        r = requests.get(file_name, stream=True)
        print(file_name)
        print(r.headers)
    except:
        continue
for i in file_lists:
    print(i)
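One way to answer that follow-up, offered as a sketch rather than a definitive recipe: inspect the Content-Type response header and keep only the links that report an MP4 payload (this assumes the server sets that header correctly):
import requests

valid_links = []
for link in file_lists:  # file_lists comes from the loop above
    r = requests.head(link, allow_redirects=True)
    # keep only links that resolve and report an MP4 content type
    if r.status_code == 200 and r.headers.get("Content-Type", "").startswith("video/mp4"):
        valid_links.append(link)
print(valid_links)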

Importing and naming a remote file

I have written some code to read the contents from a specific URL:
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  # this is a call to another function that encodes the URL
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            return my_response.text
    return None
This checks if there's a file named doc_ID.txt (where doc_ID could be any name provided). If there's no such file, it reads the contents from a specific URL and returns them. What I would like to do is store those returned contents in a file called doc_ID.txt. That is, I would like to finish my function by creating a new file in case it didn't exist at the beginning.
How can I do that? I tried this:
my_text = my_response.text
output = os.rename(my_text, filename)
return output
but then, the actual contents of the file would become the name of the file and I would get an error saying the filename is too long.
So the issue I think I'm seeing is that you want to put the contents of your request's response into the file, rather than naming the file with the contents (os.rename expects two file paths, not file data). The code below should create a file with the filename you want and insert the text from your response!
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  # this is a call to another function that encodes the URL
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            with open(filename, "w") as file:
                file.write(my_response.text)
            return my_response.text  # return the text, as the original function did
    return None
To write the response text to the file, you can simply use a Python file object; see https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
with open(filename, "w") as file:
    file.write(my_text)
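For completeness, a quick usage sketch; encode_url is stubbed here, since the question does not show its definition:
def encode_url(doc_ID):
    # hypothetical stub for this example; the real encode_url is defined elsewhere
    return "https://example.com/docs/" + doc_ID

text = read_doc("report42")
print(text[:100] if text else "file already on disk, or the request failed")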

Downloading File from URL and Writing to Location

I'm trying to download images from a list of URLs. Each URL returns a txt file with JPEG information. The URLs are uniform except for an incremental change in the folder number. Below are example URLs:
Min: https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407
Max: https://marco.ccr.buffalo.edu/data/train/train-00407-of-00407
I want to read each of these URLs and store their output in another folder. I was looking into the requests Python library to do this, but I'm wondering how to iterate over the URLs and essentially write my loop to increment over that number in the URL. Apologies in advance if I misuse the terminology. Thanks!
# This may be terrible starting code
# import the requests library
import requests

# URL of the file to be downloaded
url = "https://marco.ccr.buffalo.edu/data/train/train-00001-of-00407"
# send an HTTP request to the server and save
# the HTTP response in a response object called r
r = requests.get(url)
with open("data.txt", 'wb') as f:
    # write the contents of the response (r.content)
    # to a new file in binary mode
    f.write(r.content)
You can generate the URLs like this and perform a GET for each:
for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    print(url)
Also use a variable in the filename to keep a different copy of each. For example, use this:
with open("data" + str(i) + ".txt", 'wb') as f:
Overall, the code may look something like this (not exactly this):
import requests

for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    r = requests.get(url)
    # you might have to change the extension
    with open("data" + str(i).zfill(5) + ".txt", 'wb') as f:
        f.write(r.content)
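If these files turn out to be large, a streamed variant (just a sketch) avoids holding each whole response in memory:
import requests

for i in range(1, 408):
    url = "https://marco.ccr.buffalo.edu/data/train/train-" + str(i).zfill(5) + "-of-00407"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open("data" + str(i).zfill(5) + ".txt", 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)  # write the response to disk piece by piece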

Zamzar API download failed

I am unable to download the converted file from the Zamzar API using a Python program, as specified at https://developers.zamzar.com/docs, even though I am using the code correctly along with my API key. It only shows error code 20. I have wasted 4 hours on this error; someone please help.
import requests
from requests.auth import HTTPBasicAuth

file_id = 291320
local_filename = 'afzal.txt'
api_key = 'my_key_of_zamzar_api'
endpoint = "https://sandbox.zamzar.com/v1/files/{}/content".format(file_id)
response = requests.get(endpoint, stream=True, auth=HTTPBasicAuth(api_key, ''))

try:
    with open(local_filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
    print("File downloaded")
except IOError:
    print("Error")
This is the code I am using for downloading the converted file.
This code easily converts files into different formats:
import requests
from requests.auth import HTTPBasicAuth

#--------------------------------------------------------------------------#
api_key = 'Put_Your_API_KEY'          # your API key from developers.zamzar.com
source_file = "tmp/armash.pdf"        # source file path
target_file = "results/armash.txt"    # target file path and name
target_format = "txt"                 # target format
#--------------------------------------------------------------------------#

def check(job_id, api_key):
    check_endpoint = "https://sandbox.zamzar.com/v1/jobs/{}".format(job_id)
    response = requests.get(check_endpoint, auth=HTTPBasicAuth(api_key, ''))
    checked_data = response.json()
    value_list = checked_data['target_files']
    return value_list[0]['id']

def download(file_id, api_key, local_filename):
    download_endpoint = "https://sandbox.zamzar.com/v1/files/{}/content".format(file_id)
    download_response = requests.get(download_endpoint, stream=True, auth=HTTPBasicAuth(api_key, ''))
    try:
        with open(local_filename, 'wb') as f:
            for chunk in download_response.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        print("File downloaded")
    except IOError:
        print("Error")

endpoint = "https://sandbox.zamzar.com/v1/jobs"
file_content = {'source_file': open(source_file, 'rb')}
data_content = {'target_format': target_format}
res = requests.post(endpoint, data=data_content, files=file_content, auth=HTTPBasicAuth(api_key, ''))
data = res.json()
print(data)
print("=========== Job ID ============\n\n")
print(data['id'])
target_id = check(data['id'], api_key)
print("\n================= target_id ===========\n\n")
print(target_id)
download(target_id, api_key, target_file)
Hope this helps somebody!
I'm the lead developer for the Zamzar API.
So the Zamzar API docs contain a section on error codes (see https://developers.zamzar.com/docs#section-Error_codes). The relevant code for your error is:
{
  "message" : "API key was missing or invalid",
  "code" : 20
}
This can mean either that you did not specify an API key at all or that the API key used was invalid for the file you are attempting to download. It seems more likely to be the latter, since your code contains an api_key variable.
Looking at your code it's possible that you have used the job ID (291320) to try and download your file, when in fact you should be using a file ID.
Each conversion job can output one or more converted files, and you need to specify the file ID for the one you wish to grab. You can see a list of all converted file IDs for your job by querying /jobs/ID and looking at the target_files array. This is outlined in the API docs at https://developers.zamzar.com/docs#section-Download_the_converted_file
So if you change your code to use the file ID from the target_files array of your Job your download should spring into life.
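A minimal sketch of that fix, reusing the endpoints from the code above (the job ID is the one from the question):
import requests
from requests.auth import HTTPBasicAuth

api_key = 'my_key_of_zamzar_api'
job_id = 291320  # this is a job ID, not a file ID

# look up the job to find the IDs of its converted target files
job = requests.get("https://sandbox.zamzar.com/v1/jobs/{}".format(job_id),
                   auth=HTTPBasicAuth(api_key, '')).json()
file_id = job['target_files'][0]['id']

# now download the content using the *file* ID
content_url = "https://sandbox.zamzar.com/v1/files/{}/content".format(file_id)
response = requests.get(content_url, stream=True, auth=HTTPBasicAuth(api_key, ''))
with open('afzal.txt', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)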
I'm sorry you wasted time on this. Clearly, if it has reached S.O., our docs haven't done a good enough job of explaining this distinction, so we'll look at what we can do to make them clearer.
Happy converting !

In Python, is there a way I can download all/some of the image files (e.g. JPG/PNG) from a Google Images search result?

Is there a way I can download all/some of the image files (e.g. JPG/PNG) from a Google Images search result?
I can use the following code to download one image whose URL I already know:
import urllib.request

file = "Facts.jpg"  # file to be written to
url = "http://www.compassion.com/Images/Hunger-Facts.jpg"
response = urllib.request.urlopen(url)
fh = open(file, "wb")  # open the file for writing
fh.write(response.read())  # read from the request while writing to the file
To download multiple images, it has been suggested that I define a function and use it to repeat the task for each image URL that I would like to write to disk:
def image_request(url, file):
    response = urllib.request.urlopen(url)
    fh = open(file, "wb")  # open the file for writing
    fh.write(response.read())
And then loop over a list of URLs with:
for i, url in enumerate(urllist):
    image_request(url, str(i) + ".jpg")
However, what I really want to do is download all/some of the image files (e.g. JPG/PNG) from my own Google Images search result, without necessarily having a list of the image URLs beforehand.
P.S.
Please, I am a complete beginner and would favour an answer that breaks down the broad steps to achieve this over one that bogs down in specific code. Thanks.
You can use the Google API like this, where BLUE and DOG are your search parameters:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG
There is a developer guide about this here:
https://developers.google.com/image-search/v1/jsondevguide
You need to parse this JSON format before you can use the links directly.
Here's a start to your JSON parsing:
import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print(j['two'])
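Putting the pieces together with the question's image_request function, a sketch, assuming the responseData/results field names described in that developer guide (note this older AJAX API has been deprecated, so the exact response shape is an assumption here):
import json
import urllib.request

query = "BLUE%20DOG"
api_url = "https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=" + query

with urllib.request.urlopen(api_url) as response:
    data = json.loads(response.read().decode())

# 'responseData' and 'results' are the field names from the JSON dev guide
for i, result in enumerate(data["responseData"]["results"]):
    image_request(result["url"], str(i) + ".jpg")  # reuses the asker's function above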
