YouTube video Downloader python - python

I made a YouTube video download manager. It downloads a video, but I am facing one issue: when I download the same video again, it doesn't download it. How can I download it again under a numbered title, for example pic.png the first time and pic1.png the second time?
def Download(self):
    video_url = self.lineEdit.text()
    save_location = self.lineEdit_2.text()
    if video_url == '' or save_location == '':
        QMessageBox.warning(self, "Data Error", "Provide a Valid Video URL or save Location")
    else:
        video = pafy.new(video_url)
        video_stream = video.streams
        video_quality = self.comboBox.currentIndex()
        download = video_stream[video_quality].download(filepath=save_location, callback=self.Handel_Progress, )

Ok, this one is interesting.
The real problem begins here.
download = video_stream[video_quality].download(filepath=save_location, callback=self.Handel_Progress, )
Here you are calling the download function of the video_stream object. It receives filepath as the save location but no file name, so the file is saved under the video's own title.
Root cause of your problem:
If you look into the definition of the download function, you will find that if a file with the same name already exists, it does not download the file at all.
Now comes the part about how to make sure it downloads no matter what.
There are two things you need to do:
1. Check whether a file with the same name exists, and if it does, add 1 to the end of the file name, just before the extension. So if abc.mp4 exists, save abc1.mp4.
[I will tell you how to handle the scenario where abc.mp4, abc1.mp4 and so on all exist, but for now, let's get back to the problem.]
2. Pass the new file name (abc1.mp4) to the download method.
The following piece of code handles both.
I have added comments for your understanding.
import os
import re
import pafy
from pafy.util import xenc
# pafy uses this function to generate the file name when saving, so the same
# logic is reused here to get the name that is checked for existence.
# DO NOT CHANGE IT
def generate_filename(title, extension):
    """ Generate filename. """
    max_length = 251
    ok = re.compile(r'[^/]')
    if os.name == "nt":
        ok = re.compile(r'[^\\/:*?"<>|]')
    filename = "".join(x if ok.match(x) else "_" for x in title)
    if max_length:
        max_length = max_length + 1 + len(extension)
        if len(filename) > max_length:
            filename = filename[:max_length - 3] + '...'
    filename += "." + extension
    return xenc(filename)
def get_file_name_for_saving(save_location, full_name):
    file_path_with_name = os.path.join(save_location, full_name)
    # if the file exists, add 1 at the end; otherwise return the name as it is
    if os.path.exists(file_path_with_name):
        split = file_path_with_name.split(".")
        file_path_with_name = ".".join(split[:-1]) + "1." + split[-1]
    return file_path_with_name
def Download(self):
    video_url = self.lineEdit.text()
    save_location = self.lineEdit_2.text()
    if video_url == '' or save_location == '':
        QMessageBox.warning(self, "Data Error", "Provide a Valid Video URL or save Location")
    else:
        # video file
        video = pafy.new(video_url)
        # available video streams
        video_stream = video.streams
        video_quality = self.comboBox.currentIndex()
        # video title/name
        video_name = video.title
        # take the file extension from the selected video stream
        extension = video_stream[video_quality].extension
        # full name with extension
        full_name = generate_filename(video_name, extension)
        final_path_with_file_name = get_file_name_for_saving(save_location, full_name)
        download = video_stream[video_quality].download(filepath=final_path_with_file_name,
                                                        callback=self.Handel_Progress, )
Let me know if you face any issues.
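To cover the scenario mentioned earlier, where abc.mp4, abc1.mp4 and so on already exist, one option is to keep incrementing a counter until an unused name is found. The following is only a minimal sketch and not part of pafy: next_free_path is a hypothetical helper name, and it assumes the same save_location and full_name values as in the code above.

```python
import os


def next_free_path(save_location, full_name):
    """Return a path inside save_location that does not exist yet.

    Tries full_name first, then name1.ext, name2.ext, and so on.
    (Hypothetical helper, shown for illustration only.)
    """
    # Split only the file name, so dots in the folder path are left alone.
    base, extension = os.path.splitext(full_name)
    candidate = os.path.join(save_location, full_name)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(save_location, f"{base}{counter}{extension}")
        counter += 1
    return candidate
```

It could be called in place of get_file_name_for_saving(save_location, full_name), and the returned path passed as filepath to download in the same way.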

Related

I am getting an error while replacing the URL in the original code python,selenium,re

I used the code from this question: https://codereview.stackexchange.com/questions/241842/webscraping-with-selenium-a-course-downloader-and-sorter/248712#248712
I replaced the URL in that code with my own URL.
When I run it, I get the error shown below:
line 66, in
current_file_name = re.search(r'https://player.hdflixcore.workers.dev//0://Courses//Account%20Cracking%20--MrSihag//TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
I noticed that in the original code the website address used in "current_file_name" has some extra characters, such as backslashes.
I have no idea what they are for.
I tried adding some extra characters in the same way, but that did not fix it.
When I run the original code, it works fine.
When I use it on my desired site, it ends with the error mentioned above.
My edited code is below:
from selenium import webdriver
import time
import os
import shutil
import re
path = r'https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/'
# For changing the download location for this browser temporarily
options = webdriver.ChromeOptions()
preferences = {"download.default_directory": r"C:\Users\shanid\Desktop\test", "safebrowsing.enabled": "false"}
options.add_experimental_option("prefs", preferences)
# Acquire the Course Link and Get all the directories
browser = webdriver.Chrome(chrome_options=options)
browser.get(r"https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/")
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
# loop for as many directories there are
for i in range(0, len(elements)):
    print("deft")
    # At each directory, it refreshes the page to update the webelements in the list, and returns the current directory that is being worked on
    browser.get(path)
    time.sleep(2)
    elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
    element = elements[i]
    # checks if the folder for the directory already exists
    current_directory_name = element.text[11:].strip(" .")
    current_folder_path = "C:\\Users\\shanid\\Desktop\\test\\" + current_directory_name
    if os.path.exists(current_folder_path):
        pass
    else:
        os.mkdir(current_folder_path)
    # Formatting what has been downloaded and sorted, and
    print(current_directory_name, "------------------------------", sep="\n")
    # moves on to the directory to get the page with the files
    element.click()
    # pausing for a few secs for the page to load, and running the same mechanism to get each file using the same method used in directory
    time.sleep(3)
    files = browser.find_elements_by_css_selector(".mdui-text-truncate")
    for j in range(len(files)):
        files = browser.find_elements_by_css_selector(".mdui-text-truncate")
        _file = files[j]
        # constants for some if statements
        download = True
        move = True
        current_file_name = _file.text[17:].strip()
        # If file exists, then pass over it, and don't do anything, and moveon to next file
        if os.path.exists(current_folder_path + "\\" + current_file_name):
            pass
        # If it doesnt exist, then depending on its extension, do specific actions with it
        else:
            # Downloads the mp4 files by clicking on it, and finding the input tag which contains the download link for vid in its value attribute
            if ".mp4" in current_file_name:
                _file.click()
                time.sleep(2)
                download_path = browser.find_element_by_css_selector("input").get_attribute("value")
                current_file_name = re.search(r'https://player.hdflixcore.workers.dev//0://Courses//Account%20Cracking%20--MrSihag//TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1)
                # Checks if file exists again, incase the filename is different then the predicted filename orderly generated.
                if os.path.exists(current_folder_path + "\\" + current_file_name):
                    move = False
                    download = False
                # returns to the previous page with the files
                browser.back()
            # self explanatory
            elif ".html" in current_file_name:
                download_path = path + current_directory_name + "/" + current_file_name
                if os.path.exists(current_folder_path + "\\" + current_file_name):
                    move = False
                    download = False
            else:
                # acquires the download location by going to the parent tag which is an a tag containing the link for html in its 'href' attribute
                download_path = _file.find_element_by_xpath('..').get_attribute('href').replace(r"%5E", "^")
                current_file_name = re.search(r'https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1).replace("%20", " ")
                time.sleep(2)
            current_file_path = "C:\\Users\\shanid\\Desktop\\test\\" + current_file_name
            # responsible for downloading it using a path, get allows downloading, by source links
            if download:
                browser.get(download_path)
                # while the file doesn't exist/ it hasn't been downloaded yet, do nothing
                while True:
                    if os.path.exists(current_file_path):
                        break
                time.sleep(1)
            # moves the file from the download spot to its own folder
            if move:
                shutil.move(current_file_path, current_folder_path + "\\" + current_file_name)
                print(current_file_name)
    # formatter
    print("------------------------------", "", sep="\n")
    time.sleep(3)
The original code is below:
from selenium import webdriver
import time
import os
import shutil
import re
path = r'https://coursevania.courses.workers.dev/[coursevania.com]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20+%20Algorithms/'
# For changing the download location for this browser temporarily
options = webdriver.ChromeOptions()
preferences = {"download.default_directory": r"E:\Utilities_and_Apps\Python\MY PROJECTS\Test data\Downloads", "safebrowsing.enabled": "false"}
options.add_experimental_option("prefs", preferences)
# Acquire the Course Link and Get all the directories
browser = webdriver.Chrome(chrome_options=options)
browser.get(r"https://coursevania.courses.workers.dev/[coursevania.com]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20+%20Algorithms/")
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
# loop for as many directories there are
for i in range(0, len(elements)):
    # At each directory, it refreshes the page to update the webelements in the list, and returns the current directory that is being worked on
    browser.get(path)
    time.sleep(2)
    elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
    element = elements[i]
    # checks if the folder for the directory already exists
    current_directory_name = element.text[11:].strip(" .")
    current_folder_path = "E:\\Utilities_and_Apps\\Python\\MY PROJECTS\\Test data\Downloads\\" + current_directory_name
    if os.path.exists(current_folder_path):
        pass
    else:
        os.mkdir(current_folder_path)
    # Formatting what has been downloaded and sorted, and
    print(current_directory_name, "------------------------------", sep="\n")
    # moves on to the directory to get the page with the files
    element.click()
    # pausing for a few secs for the page to load, and running the same mechanism to get each file using the same method used in directory
    time.sleep(3)
    files = browser.find_elements_by_css_selector(".mdui-text-truncate")
    for j in range(len(files)):
        files = browser.find_elements_by_css_selector(".mdui-text-truncate")
        _file = files[j]
        # constants for some if statements
        download = True
        move = True
        current_file_name = _file.text[17:].strip()
        # If file exists, then pass over it, and don't do anything, and moveon to next file
        if os.path.exists(current_folder_path + "\\" + current_file_name):
            pass
        # If it doesnt exist, then depending on its extension, do specific actions with it
        else:
            # Downloads the mp4 files by clicking on it, and finding the input tag which contains the download link for vid in its value attribute
            if ".mp4" in current_file_name:
                _file.click()
                time.sleep(2)
                download_path = browser.find_element_by_css_selector("input").get_attribute("value")
                current_file_name = re.search(r'https://coursevania.courses.workers.dev/\[coursevania.com\]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20\+%20Algorithms/.+/(.+)', download_path, re.DOTALL).group(1)
                # Checks if file exists again, incase the filename is different then the predicted filename orderly generated.
                if os.path.exists(current_folder_path + "\\" + current_file_name):
                    move = False
                    download = False
                # returns to the previous page with the files
                browser.back()
            # self explanatory
            elif ".html" in current_file_name:
                download_path = path + current_directory_name + "/" + current_file_name
                if os.path.exists(current_folder_path + "\\" + current_file_name):
                    move = False
                    download = False
            else:
                # acquires the download location by going to the parent tag which is an a tag containing the link for html in its 'href' attribute
                download_path = _file.find_element_by_xpath('..').get_attribute('href').replace(r"%5E", "^")
                current_file_name = re.search(r'https://coursevania.courses.workers.dev/\[coursevania.com\]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20\+%20Algorithms/.+/(.+)', download_path, re.DOTALL).group(1).replace("%20", " ")
                time.sleep(2)
            current_file_path = "E:\\Utilities_and_Apps\\Python\\MY PROJECTS\\Test data\Downloads\\" + current_file_name
            # responsible for downloading it using a path, get allows downloading, by source links
            if download:
                browser.get(download_path)
                # while the file doesn't exist/ it hasn't been downloaded yet, do nothing
                while True:
                    if os.path.exists(current_file_path):
                        break
                time.sleep(1)
            # moves the file from the download spot to its own folder
            if move:
                shutil.move(current_file_path, current_folder_path + "\\" + current_file_name)
                print(current_file_name)
    # formatter
    print("------------------------------", "", sep="\n")
    time.sleep(3)
This code works fine,
but it does not work when I change the website to https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/
The site I used is a clone of the original site.
I have no idea why I am getting the error.
The issue is with the CSS selector for the input box on the page below.
https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/01%20Course%20Introduction/1%20Course%20Introduction.mp4?a=view
There are two input boxes on that page, and a bare "input" selector returns only the first match, so you have to write a more specific CSS path: "#content > div > div:nth-child(6) > input".
Code with the issue:
download_path = browser.find_element_by_css_selector("input").get_attribute("value")
Replace it with:
download_path = browser.find_element_by_css_selector("#content > div > div:nth-child(6) > input").get_attribute("value")
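The AttributeError in the question appears whenever re.search returns None, which happens if download_path is empty or does not match the hard-coded site prefix. As a hedged alternative to that pattern, the file name can be taken from the last path segment of the URL with the standard library. This is only a sketch: file_name_from_url is a hypothetical helper, example.com is a placeholder address, and unlike the mp4 branch of the question's code it also decodes %20 and other escapes.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def file_name_from_url(download_url):
    """Return the decoded last path segment of a URL, or None if there is none.

    (Hypothetical helper, shown for illustration only.)
    """
    if not download_url:
        # An empty value is what a wrongly selected input box would give.
        return None
    # urlsplit drops the query string (such as ?a=view) from the path.
    path = urlsplit(download_url).path
    name = posixpath.basename(path)
    return unquote(name) or None


name = file_name_from_url(
    "https://example.com/0:/Courses/01%20Intro/1%20Course%20Introduction.mp4?a=view"
)
print(name)
```

Checking the result for None before using it gives a clear place to report that the wrong element was selected, instead of failing later on .group(1).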

How to play streaming audio using pyglet?

The goal of this question is to figure out how to play streaming audio using pyglet. The first step is just making sure you're able to play mp3 files using pyglet, which is the purpose of this first snippet:
import sys
import inspect
import requests
import pyglet
from pyglet.media import *
pyglet.lib.load_library('avbin')
pyglet.have_avbin = True
def url_to_filename(url):
    return url.split('/')[-1]
def download_file(url, filename=None):
    filename = filename or url_to_filename(url)
    with open(filename, "wb") as f:
        print("Downloading %s" % filename)
        response = requests.get(url, stream=True)
        total_length = response.headers.get('content-length')
        if total_length is None:
            f.write(response.content)
        else:
            dl = 0
            total_length = int(total_length)
            for data in response.iter_content(chunk_size=4096):
                dl += len(data)
                f.write(data)
                done = int(50 * dl / total_length)
                sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
                sys.stdout.flush()
url = "https://freemusicarchive.org/file/music/ccCommunity/DASK/Abiogenesis/DASK_-_08_-_Protocell.mp3"
filename = "mcve.mp3"
download_file(url, filename)
music = pyglet.media.load(filename)
music.play()
pyglet.app.run()
If you've installed the libraries (pip install pyglet requests) and also installed AVBin, at this point you should be able to listen to the mp3 once it's been downloaded.
Once we've reached this point, I'd like to figure out how to play and buffer the file in a similar way to most of the existing web video/audio players, using pyglet+requests. This means playing the file without waiting until it has been downloaded completely.
After reading the pyglet media docs you can see that these classes are available:
media
    sources
        base
            AudioData
            AudioFormat
            Source
            SourceGroup
            SourceInfo
            StaticSource
            StreamingSource
            VideoFormat
    player
        Player
        PlayerGroup
I've seen there are other similar SO questions, but they haven't been solved properly and their content doesn't provide a lot of relevant details:
Play streaming audio using pyglet
How can I play audio stream without saving it into the file with pyglet?
That's why I've created a new question. How do you play streaming audio using pyglet? Could you provide a little example using the above mcve as a base?
Assuming you don't want to import a new package to do this for you - this can be done with a bit of effort.
First, let's head over to the Pyglet source code and have a look at media.load in media/__init__.py.
def load(filename, file=None, streaming=True, decoder=None):
    """Load a Source from a file.
    All decoders that are registered for the filename extension are tried.
    If none succeed, the exception from the first decoder is raised.
    You can also specifically pass a decoder to use.
    :Parameters:
        `filename` : str
            Used to guess the media format, and to load the file if `file` is
            unspecified.
        `file` : file-like object or None
            Source of media data in any supported format.
        `streaming` : bool
            If `False`, a :class:`StaticSource` will be returned; otherwise
            (default) a :class:`~pyglet.media.StreamingSource` is created.
        `decoder` : MediaDecoder or None
            A specific decoder you wish to use, rather than relying on
            automatic detection. If specified, no other decoders are tried.
    :rtype: StreamingSource or Source
    """
    if decoder:
        return decoder.decode(file, filename, streaming)
    else:
        first_exception = None
        for decoder in get_decoders(filename):
            try:
                loaded_source = decoder.decode(file, filename, streaming)
                return loaded_source
            except MediaDecodeException as e:
                if not first_exception or first_exception.exception_priority < e.exception_priority:
                    first_exception = e
        # TODO: Review this:
        # The FFmpeg codec attempts to decode anything, so this codepath won't be reached.
        if not first_exception:
            raise MediaDecodeException('No decoders are available for this media format.')
        raise first_exception
add_default_media_codecs()
The critical line here is loaded_source = decoder.decode(...). Essentially, to load audio Pyglet takes a file and hauls it over to a media decoder (eg. FFMPEG), which then returns a list of 'frames' or packets that Pyglet can play with a built-in Player class. If the audio format is compressed (eg. mp3 or aac), Pyglet will use an external library (currently only AVBin is supported) to convert it to raw, decompressed audio. You probably already know some of this.
So if we want to see how we can stuff a stream of bytes into Pyglet's audio engine rather than a file, we'll need to take a look at one of the decoders. For this example, let's use FFMPEG as it's the easiest to access.
In media/codecs/ffmpeg.py:
class FFmpegDecoder(object):
    def get_file_extensions(self):
        return ['.mp3', '.ogg']
    def decode(self, file, filename, streaming):
        if streaming:
            return FFmpegSource(filename, file)
        else:
            return StaticSource(FFmpegSource(filename, file))
The 'object' it inherits from is MediaDecoder, found in media/codecs/__init__.py. Back at the load function in media/__init__.py, you'll see pyglet will choose a MediaDecoder based on file extension, then return its decode function with the file as a parameter to get the audio in the form of a packet stream. That packet stream is a Source object; each decoder has its own flavor, in the form of StaticSource or StreamingSource. The former is used to store audio in memory, and the latter to play it immediately. FFmpeg's decoder only supports StreamingSource.
We can see that FFMPEG's is FFmpegSource, also located in media/codecs/ffmpeg.py. We find this Goliath of a class:
class FFmpegSource(StreamingSource):
    # Max increase/decrease of original sample size
    SAMPLE_CORRECTION_PERCENT_MAX = 10
    def __init__(self, filename, file=None):
        if file is not None:
            raise NotImplementedError('Loading from file stream is not supported')
        self._file = ffmpeg_open_filename(asbytes_filename(filename))
        if not self._file:
            raise FFmpegException('Could not open "{0}"'.format(filename))
        self._video_stream = None
        self._video_stream_index = None
        self._audio_stream = None
        self._audio_stream_index = None
        self._audio_format = None
        self.img_convert_ctx = POINTER(SwsContext)()
        self.audio_convert_ctx = POINTER(SwrContext)()
        file_info = ffmpeg_file_info(self._file)
        self.info = SourceInfo()
        self.info.title = file_info.title
        self.info.author = file_info.author
        self.info.copyright = file_info.copyright
        self.info.comment = file_info.comment
        self.info.album = file_info.album
        self.info.year = file_info.year
        self.info.track = file_info.track
        self.info.genre = file_info.genre
        # Pick the first video and audio streams found, ignore others.
        for i in range(file_info.n_streams):
            info = ffmpeg_stream_info(self._file, i)
            if isinstance(info, StreamVideoInfo) and self._video_stream is None:
                stream = ffmpeg_open_stream(self._file, i)
                self.video_format = VideoFormat(
                    width=info.width,
                    height=info.height)
                if info.sample_aspect_num != 0:
                    self.video_format.sample_aspect = (
                        float(info.sample_aspect_num) /
                        info.sample_aspect_den)
                self.video_format.frame_rate = (
                    float(info.frame_rate_num) /
                    info.frame_rate_den)
                self._video_stream = stream
                self._video_stream_index = i
            elif (isinstance(info, StreamAudioInfo) and
                  info.sample_bits in (8, 16) and
                  self._audio_stream is None):
                stream = ffmpeg_open_stream(self._file, i)
                self.audio_format = AudioFormat(
                    channels=min(2, info.channels),
                    sample_size=info.sample_bits,
                    sample_rate=info.sample_rate)
                self._audio_stream = stream
                self._audio_stream_index = i
                channel_input = avutil.av_get_default_channel_layout(info.channels)
                channels_out = min(2, info.channels)
                channel_output = avutil.av_get_default_channel_layout(channels_out)
                sample_rate = stream.codec_context.contents.sample_rate
                sample_format = stream.codec_context.contents.sample_fmt
                if sample_format in (AV_SAMPLE_FMT_U8, AV_SAMPLE_FMT_U8P):
                    self.tgt_format = AV_SAMPLE_FMT_U8
                elif sample_format in (AV_SAMPLE_FMT_S16, AV_SAMPLE_FMT_S16P):
                    self.tgt_format = AV_SAMPLE_FMT_S16
                elif sample_format in (AV_SAMPLE_FMT_S32, AV_SAMPLE_FMT_S32P):
                    self.tgt_format = AV_SAMPLE_FMT_S32
                elif sample_format in (AV_SAMPLE_FMT_FLT, AV_SAMPLE_FMT_FLTP):
                    self.tgt_format = AV_SAMPLE_FMT_S16
                else:
                    raise FFmpegException('Audio format not supported.')
                self.audio_convert_ctx = swresample.swr_alloc_set_opts(None,
                                                                       channel_output,
                                                                       self.tgt_format, sample_rate,
                                                                       channel_input, sample_format,
                                                                       sample_rate,
                                                                       0, None)
                if (not self.audio_convert_ctx or
                        swresample.swr_init(self.audio_convert_ctx) < 0):
                    swresample.swr_free(self.audio_convert_ctx)
                    raise FFmpegException('Cannot create sample rate converter.')
        self._packet = ffmpeg_init_packet()
        self._events = [] # They don't seem to be used!
        self.audioq = deque()
        # Make queue big enough to accomodate 1.2 sec?
        self._max_len_audioq = 50 # Need to figure out a correct amount
        if self.audio_format:
            # Buffer 1 sec worth of audio
            self._audio_buffer = \
                (c_uint8 * ffmpeg_get_audio_buffer_size(self.audio_format))()
        self.videoq = deque()
        self._max_len_videoq = 50 # Need to figure out a correct amount
        self.start_time = self._get_start_time()
        self._duration = timestamp_from_ffmpeg(file_info.duration)
        self._duration -= self.start_time
        # Flag to determine if the _fillq method was already scheduled
        self._fillq_scheduled = False
        self._fillq()
        # Don't understand why, but some files show that seeking without
        # reading the first few packets results in a seeking where we lose
        # many packets at the beginning.
        # We only seek back to 0 for media which have a start_time > 0
        if self.start_time > 0:
            self.seek(0.0)
---
[A few hundred lines more...]
---
    def get_next_video_timestamp(self):
        if not self.video_format:
            return
        if self.videoq:
            while True:
                # We skip video packets which are not video frames
                # This happens in mkv files for the first few frames.
                video_packet = self.videoq[0]
                if video_packet.image == 0:
                    self._decode_video_packet(video_packet)
                if video_packet.image is not None:
                    break
                self._get_video_packet()
            ts = video_packet.timestamp
        else:
            ts = None
        if _debug:
            print('Next video timestamp is', ts)
        return ts
    def get_next_video_frame(self, skip_empty_frame=True):
        if not self.video_format:
            return
        while True:
            # We skip video packets which are not video frames
            # This happens in mkv files for the first few frames.
            video_packet = self._get_video_packet()
            if video_packet.image == 0:
                self._decode_video_packet(video_packet)
            if video_packet.image is not None or not skip_empty_frame:
                break
        if _debug:
            print('Returning', video_packet)
        return video_packet.image
    def _get_start_time(self):
        def streams():
            format_context = self._file.context
            for idx in (self._video_stream_index, self._audio_stream_index):
                if idx is None:
                    continue
                stream = format_context.contents.streams[idx].contents
                yield stream
        def start_times(streams):
            yield 0
            for stream in streams:
                start = stream.start_time
                if start == AV_NOPTS_VALUE:
                    yield 0
                start_time = avutil.av_rescale_q(start,
                                                 stream.time_base,
                                                 AV_TIME_BASE_Q)
                start_time = timestamp_from_ffmpeg(start_time)
                yield start_time
        return max(start_times(streams()))
    @property
    def audio_format(self):
        return self._audio_format
    @audio_format.setter
    def audio_format(self, value):
        self._audio_format = value
        if value is None:
            self.audioq.clear()
The line you'll be interested in here is self._file = ffmpeg_open_filename(asbytes_filename(filename)). This brings us here, once again in media/codecs/ffmpeg.py:
def ffmpeg_open_filename(filename):
    """Open the media file.
    :rtype: FFmpegFile
    :return: The structure containing all the information for the media.
    """
    file = FFmpegFile() # TODO: delete this structure and use directly AVFormatContext
    result = avformat.avformat_open_input(byref(file.context),
                                          filename,
                                          None,
                                          None)
    if result != 0:
        raise FFmpegException('Error opening file ' + filename.decode("utf8"))
    result = avformat.avformat_find_stream_info(file.context, None)
    if result < 0:
        raise FFmpegException('Could not find stream info')
    return file
and this is where things get messy: it calls to a ctypes function (avformat_open_input) that when given a file, will grab its details and fill out all the information it needs for our FFmpegSource class. With some work, you should be able to get avformat_open_input to take a bytes object rather than a path to a file which it will open to get the same information. I'd love to do this and include a working example, but I don't have the time right now. You'd then need to make a new ffmpeg_open_filename function utilizing the new avformat_open_input function, and then a new FFmpegSource class utilizing the new ffmpeg_open_filename function. All you need now is a new FFmpegDecoder class utilizing the new FFmpegSource class.
You could then implement this by adding it to your pyglet package directly. Afterwards, you'd want to add support for a bytes object argument in the load() function (located in media/__init__.py) and override the decoder with your new one. With that in place, you would be able to stream audio without saving it.
Or, you could simply use a package that already supports it. Python-vlc does. You could use the example here to play whatever audio you'd like from a link. If you aren't doing this just for a challenge, I would strongly recommend you use another package. Otherwise: good luck.

Python detecting file type before operation

I'm working on this piece of code and this weird bug showed up on the Try command near the end of the code. The whole script is aimed towards .flac files, and sometimes it'd read .jpg files in the folders and blow up. Simply enough I went ahead and added if (".flac" or ".FLAC" in Song): before the Try, this way easily enough it would only process the correct filetype. However this made absolutely no difference and I kept on getting the following error
Traceback (most recent call last):
File ".\musync.py", line 190, in <module>
match_metadata(CurrentAlbum + Song, CoAlbum + Song)
File ".\musync.py", line 152, in match_metadata
TagSource = FLAC(SrcFile)
File "C:\Python34\lib\site-packages\mutagen\_file.py", line 41, in __init__
self.load(filename, *args, **kwargs)
File "C:\Python34\lib\site-packages\mutagen\flac.py", line 721, in load
self.__check_header(fileobj)
File "C:\Python34\lib\site-packages\mutagen\flac.py", line 844, in __check_header
"%r is not a valid FLAC file" % fileobj.name)
mutagen.flac.FLACNoHeaderError: 'C:/Users/berna/Desktop/Lib/Andrew Bird/Armchair Apocrypha/cover.jpg' is not a valid FLAC file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".\musync.py", line 194, in <module>
check_song(CurrentAlbum + Song, CoAlbum)
File ".\musync.py", line 83, in check_song
TagSource = FLAC(SrcFile)
File "C:\Python34\lib\site-packages\mutagen\_file.py", line 41, in __init__
self.load(filename, *args, **kwargs)
File "C:\Python34\lib\site-packages\mutagen\flac.py", line 721, in load
self.__check_header(fileobj)
File "C:\Python34\lib\site-packages\mutagen\flac.py", line 844, in __check_header
"%r is not a valid FLAC file" % fileobj.name)
mutagen.flac.FLACNoHeaderError: 'C:/Users/berna/Desktop/Lib/Andrew Bird/Armchair Apocrypha/cover.jpg' is not a valid FLAC file
Why is the if condition not doing its job and how can I fix this? The code is currently as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import shutil
import os
from mutagen.flac import FLAC # Used for metadata handling.
from os import listdir # Used for general operations.
from fuzzywuzzy import fuzz # Last resource name association.
# Insert here the root directory of your library and device respectively.
lib = 'C:/Users/berna/Desktop/Lib/'
dev = 'C:/Users/berna/Desktop/Dev/'
# Faster file copying function, arguments go as follows: Source file location,
# target directory, whether to keep the filename intact and whether to create
# the target directory in case it doesn't exist.
def copy_file(SrcFile, TgtDir, KeepName=True, MakeDir=True):
SourceFile = None
TargetFile = None
KeepGoing = False
# Checks is TgtDir is valid and creates if needed.
if MakeDir and not os.path.isdir(TgtDir):
os.makedirs(TgtDir)
# Processes TgtDir depending on filename choice.
if KeepName is True:
TgtDir += os.path.basename(SrcFile)
print(TgtDir)
try:
SourceFile = open(SrcFile, 'rb')
TargetFile = open(TgtDir, 'wb')
KeepGoing = True
Count = 0
while KeepGoing:
# Read blocks of size 2**20 = 1048576
Buffer = SourceFile.read(2 ** 20)
if not Buffer:
break
TargetFile.write(Buffer)
Count += len(Buffer)
finally:
if TargetFile:
TargetFile.close()
if SourceFile:
SourceFile.close()
return KeepGoing
# XXX TODO
# Copies a directory (SrcDir) to TgtDir, if Replace is True will delete same
# name directory and replace with new one.
def copy_tree(SrcDir, TgtDir, Replace=True):
if not os.path.isdir(TgtDir):
os.makedirs(TgtDir)
Target = format_dir(TgtDir, os.path.basename(SrcDir))
if os.path.isdir(Target) and Replace:
shutil.rmtree(Target)
if not os.path.isdir(Target):
os.makedirs(Target)
for File in listdir(SrcDir):
FileDir = format_dir(SrcDir, File)
# copy_file(FileDir, Tgt)
return()
# Checks for new and deleted folders and returns their name.
def check_folder(SrcDir, TgtDir):
# Lists Source and Target folder.
Source = listdir(SrcDir)
Target = listdir(TgtDir)
# Then creates a list of deprecated and new directories.
Deleted = [FileName for FileName in Target if FileName not in Source]
Added = [FileName for FileName in Source if FileName not in Target]
# Returns both lists.
return (Added, Deleted)
# Checks for song in case there's a name mismatch or missing file.
def check_song(SrcFile, TgtDir):
Matches = []
# Invariably the new name will be that of the source file, the issue here
# is finding which song is the correct one.
NewName = TgtDir + '/' + os.path.basename(SrcFile)
TagSource = FLAC(SrcFile)
# Grabs the number of samples in the original file.
SourceSamples = TagSource.info.total_samples
# Checks if any song has a matching sample number and if true appends the
# song's filename to Matches[]
for Song in listdir(TgtDir):
SongInfo = FLAC(TgtDir + '/' + Song)
if (SongInfo.info.total_samples == SourceSamples):
Matches.append(Song)
# If two songs have the same sample rate (44100Hz for CDs) and the same
# length it matches them to the source by filename similarity.
if (len(Matches) > 1):
Diffs = []
for Song in Matches:
Diffs.append(fuzz.ratio(Song, os.path.basename(SrcFile)))
if (max(Diffs) > 0.8):
BestMatch = TgtDir + '/' + Matches[Diffs.index(max(Diffs))]
os.rename(BestMatch, NewName)
else:
shutil.copy(SrcFile, TgtDir)
# If there's no match at all simply copy over the missing file.
elif (len(Matches) == 0):
shutil.copy(SrcFile, TgtDir)
# If a single match is found the filename will be the first item on the
# Matches[] list.
else:
os.rename(TgtDir + '/' + Matches[0], NewName)
# Syncs folders in a directory and return the change count.
def sync(SrcDir, TgtDir):
AddCount = 0
DeleteCount = 0
# Grabs the folders to be added and deleted.
NewDir, OldDir = check_folder(SrcDir, TgtDir)
# Checks if any and then does add/rm.
if OldDir:
for Folder in OldDir:
shutil.rmtree(TgtDir + Folder)
DeleteCount += 1
if NewDir:
for Folder in NewDir:
shutil.copytree(SrcDir + Folder, TgtDir + Folder)
AddCount += 1
return(AddCount, DeleteCount)
# Fixes missing metadata fields.
def fix_metadata(SrcFile, TgtFile):
TagSource = FLAC(TgtFile)
TagTarget = FLAC(SrcFile)
# Checks for deleted tags on source file and deletes them from target.
if (set(TagTarget) - set(TagSource)):
OldTags = list(set(TagTarget) - set(TagSource))
for Tag in OldTags:
# TODO Right now I haven't quite figured out how to delete
# specific tags, so workaround is to delete them all.
TagTarget.delete()
# Checks for new tags on source file and transfers them to target.
if (set(TagSource) != set(TagTarget)):
NewTags = list(set(TagSource) - set(TagTarget))
for Tag in NewTags:
TagTarget["%s" % Tag] = TagSource[Tag]
TagTarget.save(TgtFile)
# Does metadata transfer between two files.
def match_metadata(SrcFile, TgtFile):
Altered = 0
TagSource = FLAC(SrcFile)
TagTarget = FLAC(TgtFile)
# For every different Tag in source song copy it to target and save.
for Tag in TagSource:
if TagSource[Tag] != TagTarget[Tag]:
Altered += 1
TagTarget[Tag] = TagSource[Tag]
TagTarget.save(TgtFile)
return(Altered)
# Simply does directory formatting to make things easier.
def format_dir(Main, Second, Third=""):
# Replaces \ with /
Main = Main.replace('\\', '/')
# Adds a / to the end of Main and concatenates Main and Second.
if(Main[len(Main) - 1] != '/'):
Main += '/'
Main += Second + '/'
# Concatenates Main and Third if necessary.
if (Third):
Main += Third + '/'
return (Main)
# Sync main folders in lib with dev.
sync(lib, dev)
# For every Artist in lib sync it's Albums
for Artist in listdir(lib):
sync(format_dir(lib, Artist), format_dir(dev, Artist))
# For every Album in Artist match songs
for Album in listdir(format_dir(lib, Artist)):
# Declares lib Album and dev Album to make function calls shorter.
CurrentAlbum = format_dir(lib, Artist, Album)
CoAlbum = format_dir(dev, Artist, Album)
for Song in listdir(CurrentAlbum):
if (".flac" or ".FLAC" in Song):
try:
# Tries to match lib and dev song's metadata.
match_metadata(CurrentAlbum + Song, CoAlbum + Song)
except:
# If that fails will try to fix both Filename and Tag
# fields.
check_song(CurrentAlbum + Song, CoAlbum)
fix_metadata(CurrentAlbum + Song, CoAlbum + Song)
try:
# Try again after fix.
match_metadata(CurrentAlbum + Song, CoAlbum + Song)
except Exception as e:
# If it still doesn't work there's black magic in place
# go sleep, drink a beer and try again later.
print("""Ehm, something happened and your sync failed.\n
Error:{}""".format(e))
raise SystemExit(0)
Try it:
Songs = ["a.flac", "a.mp3", "b.FLAC"]
flac_files = [s for s in Songs if s.lower().endswith('.flac')]
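As pointed out by #EliKorvigo, the error was caused by a simple mistake in the if condition: ".flac" or ".FLAC" in Song is always true, because the non-empty string ".flac" is truthy on its own. The fix looks as follows: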
for Song in listdir(CurrentAlbum):
    if (".flac" in Song or ".FLAC" in Song):
        try:
            # Tries to match lib and dev song's metadata.
            match_metadata(CurrentAlbum + Song, CoAlbum + Song)
        except:
            # If that fails will try to fix both Filename and Tag
            # fields.
            check_song(CurrentAlbum + Song, CoAlbum)
            fix_metadata(CurrentAlbum + Song, CoAlbum + Song)
            try:
                # Try again after fix.
                match_metadata(CurrentAlbum + Song, CoAlbum + Song)
            except Exception as e:
                # If it still doesn't work there's black magic in place
                # go sleep, drink a beer and try again later.
                print("""Ehm, something happened and your sync failed.\n
                Error:{}""".format(e))
                raise SystemExit(0)
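To see why the original condition never filtered anything, here is a minimal sketch (written for Python 3; buggy_check and is_flac are hypothetical helper names, not part of the script above). The expression ".flac" or ".FLAC" in Song is parsed as ".flac" or (".FLAC" in Song), so the membership test is never even reached.

```python
import os


def buggy_check(song):
    # Parsed as: ".flac" or (".FLAC" in song).
    # The non-empty string ".flac" is truthy, so the result is truthy
    # for every input, whatever the file name is.
    return bool(".flac" or ".FLAC" in song)


def is_flac(song):
    # Hypothetical helper: compare only the extension, ignoring case.
    return os.path.splitext(song)[1].lower() == ".flac"


songs = ["a.flac", "a.mp3", "b.FLAC", "cover.flac.jpg"]
print([s for s in songs if buggy_check(s)])
print([s for s in songs if is_flac(s)])
```

The first list contains all four names; the second contains only a.flac and b.FLAC. Comparing the extension also avoids matching names such as cover.flac.jpg, which a plain substring test would accept.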

Python: How to loop through several ini files with ConfigParser?

I somewhat understand how looping works in Python; it seems easy enough to say "for each file in this directory, do something". I'm now having a hard time figuring out how to loop through a series of .ini files in a directory, read lines from them, and use the text in the ini files as variables in the same Python script. For example, in this script, a single .ini file provides the values for 12 variables in the script. Currently, to run the script multiple times, one has to replace the single ini file with another one that contains a different set of 12 values. The script performs routine maintenance of an online mapping service provider. The thing is, I have dozens of services I'd like to manage with the script. From the script, it appears that the name of the .ini file is fixed, and I'm not sure whether it's even possible to loop through multiple ini files. The good news is that the script is using ConfigParser. I hope this makes sense!
[FS_INFO]
SERVICENAME = MyMapService
FOLDERNAME = None
MXD = D:\nightly_updates\maps\MyMap.mxd
TAGS = points, dots, places
DESCRIPTION = This is the description text
MAXRECORDS = 1000
[FS_SHARE]
SHARE = True
EVERYONE = true
ORG = true
GROUPS = None
[AGOL]
USER = user_name
PASS = pass_word1
The script below is reading from the ini file above.
# Import system modules
import urllib, urllib2, json
import sys, os
import requests
import arcpy
import ConfigParser
from xml.etree import ElementTree as ET
class AGOLHandler(object):
def __init__(self, username, password, serviceName, folderName):
self.username = username
self.password = password
self.serviceName = serviceName
self.token, self.http = self.getToken(username, password)
self.itemID = self.findItem("Feature Service")
self.SDitemID = self.findItem("Service Definition")
self.folderName = folderName
self.folderID = self.findFolder()
def getToken(self, username, password, exp=60):
referer = "http://www.arcgis.com/"
query_dict = {'username': username,
'password': password,
'expiration': str(exp),
'client': 'referer',
'referer': referer,
'f': 'json'}
query_string = urllib.urlencode(query_dict)
url = "https://www.arcgis.com/sharing/rest/generateToken"
token = json.loads(urllib.urlopen(url + "?f=json", query_string).read())
if "token" not in token:
print token['error']
sys.exit()
else:
httpPrefix = "http://www.arcgis.com/sharing/rest"
if token['ssl'] == True:
httpPrefix = "https://www.arcgis.com/sharing/rest"
return token['token'], httpPrefix
def findItem(self, findType):
#
# Find the itemID of whats being updated
#
searchURL = self.http + "/search"
query_dict = {'f': 'json',
'token': self.token,
'q': "title:\""+ self.serviceName + "\"AND owner:\"" + self.username + "\" AND type:\"" + findType + "\""}
jsonResponse = sendAGOLReq(searchURL, query_dict)
if jsonResponse['total'] == 0:
print "\nCould not find a service to update. Check the service name in the settings.ini"
sys.exit()
else:
print("found {} : {}").format(findType, jsonResponse['results'][0]["id"])
return jsonResponse['results'][0]["id"]
def findFolder(self):
#
# Find the ID of the folder containing the service
#
if self.folderName == "None":
return ""
findURL = self.http + "/content/users/{}".format(self.username)
query_dict = {'f': 'json',
'num': 1,
'token': self.token}
jsonResponse = sendAGOLReq(findURL, query_dict)
for folder in jsonResponse['folders']:
if folder['title'] == self.folderName:
return folder['id']
print "\nCould not find the specified folder name provided in the settings.ini"
print "-- If your content is in the root folder, change the folder name to 'None'"
sys.exit()
def urlopen(url, data=None):
# monkey-patch URLOPEN
referer = "http://www.arcgis.com/"
req = urllib2.Request(url)
req.add_header('Referer', referer)
if data:
response = urllib2.urlopen(req, data)
else:
response = urllib2.urlopen(req)
return response
def makeSD(MXD, serviceName, tempDir, outputSD, maxRecords):
#
# create a draft SD and modify the properties to overwrite an existing FS
#
arcpy.env.overwriteOutput = True
# All paths are built by joining names to the tempPath
SDdraft = os.path.join(tempDir, "tempdraft.sddraft")
newSDdraft = os.path.join(tempDir, "updatedDraft.sddraft")
arcpy.mapping.CreateMapSDDraft(MXD, SDdraft, serviceName, "MY_HOSTED_SERVICES")
# Read the contents of the original SDDraft into an xml parser
doc = ET.parse(SDdraft)
root_elem = doc.getroot()
if root_elem.tag != "SVCManifest":
raise ValueError("Root tag is incorrect. Is {} a .sddraft file?".format(SDdraft))
# The following 6 code pieces modify the SDDraft from a new MapService
# with caching capabilities to a FeatureService with Query,Create,
# Update,Delete,Uploads,Editing capabilities as well as the ability to set the max
# records on the service.
# The first two lines (commented out) are no longer necessary as the FS
# is now being deleted and re-published, not truly overwritten as is the
# case when publishing from Desktop.
# The last three pieces change Map to Feature Service, disable caching
# and set appropriate capabilities. You can customize the capabilities by
# removing items.
# Note you cannot disable Query from a Feature Service.
#doc.find("./Type").text = "esriServiceDefinitionType_Replacement"
#doc.find("./State").text = "esriSDState_Published"
# Change service type from map service to feature service
for config in doc.findall("./Configurations/SVCConfiguration/TypeName"):
if config.text == "MapServer":
config.text = "FeatureServer"
#Turn off caching
for prop in doc.findall("./Configurations/SVCConfiguration/Definition/" +
"ConfigurationProperties/PropertyArray/" +
"PropertySetProperty"):
if prop.find("Key").text == 'isCached':
prop.find("Value").text = "false"
if prop.find("Key").text == 'maxRecordCount':
prop.find("Value").text = maxRecords
# Turn on feature access capabilities
for prop in doc.findall("./Configurations/SVCConfiguration/Definition/Info/PropertyArray/PropertySetProperty"):
if prop.find("Key").text == 'WebCapabilities':
prop.find("Value").text = "Query,Create,Update,Delete,Uploads,Editing"
# Add the namespaces which get stripped, back into the .SD
root_elem.attrib["xmlns:typens"] = 'http://www.esri.com/schemas/ArcGIS/10.1'
root_elem.attrib["xmlns:xs"] ='http://www.w3.org/2001/XMLSchema'
# Write the new draft to disk
with open(newSDdraft, 'w') as f:
doc.write(f, 'utf-8')
# Analyze the service
analysis = arcpy.mapping.AnalyzeForSD(newSDdraft)
if analysis['errors'] == {}:
# Stage the service
arcpy.StageService_server(newSDdraft, outputSD)
print "Created {}".format(outputSD)
else:
# If the sddraft analysis contained errors, display them and quit.
print analysis['errors']
sys.exit()
def upload(fileName, tags, description):
#
# Overwrite the SD on AGOL with the new SD.
# This method uses 3rd party module: requests
#
updateURL = agol.http+'/content/users/{}/{}/items/{}/update'.format(agol.username, agol.folderID, agol.SDitemID)
filesUp = {"file": open(fileName, 'rb')}
url = updateURL + "?f=json&token="+agol.token+ \
"&filename="+fileName+ \
"&type=Service Definition"\
"&title="+agol.serviceName+ \
"&tags="+tags+\
"&description="+description
response = requests.post(url, files=filesUp);
itemPartJSON = json.loads(response.text)
if "success" in itemPartJSON:
itemPartID = itemPartJSON['id']
print("updated SD: {}").format(itemPartID)
return True
else:
print "\n.sd file not uploaded. Check the errors and try again.\n"
print itemPartJSON
sys.exit()
def publish():
#
# Publish the existing SD on AGOL (it will be turned into a Feature Service)
#
publishURL = agol.http+'/content/users/{}/publish'.format(agol.username)
query_dict = {'itemID': agol.SDitemID,
'filetype': 'serviceDefinition',
'overwrite': 'true',
'f': 'json',
'token': agol.token}
jsonResponse = sendAGOLReq(publishURL, query_dict)
print("successfully updated...{}...").format(jsonResponse['services'])
return jsonResponse['services'][0]['serviceItemId']
def enableSharing(newItemID, everyone, orgs, groups):
#
# Share an item with everyone, the organization and/or groups
#
shareURL = agol.http+'/content/users/{}/{}/items/{}/share'.format(agol.username, agol.folderID, newItemID)
if groups == None:
groups = ''
query_dict = {'f': 'json',
'everyone' : everyone,
'org' : orgs,
'groups' : groups,
'token': agol.token}
jsonResponse = sendAGOLReq(shareURL, query_dict)
print("successfully shared...{}...").format(jsonResponse['itemId'])
def sendAGOLReq(URL, query_dict):
#
# Helper function which takes a URL and a dictionary and sends the request
#
query_string = urllib.urlencode(query_dict)
jsonResponse = urllib.urlopen(URL, urllib.urlencode(query_dict))
jsonOuput = json.loads(jsonResponse.read())
wordTest = ["success", "results", "services", "notSharedWith", "folders"]
if any(word in jsonOuput for word in wordTest):
return jsonOuput
else:
print "\nfailed:"
print jsonOuput
sys.exit()
if __name__ == "__main__":
#
# start
#
print "Starting Feature Service publish process"
# Find and gather settings from the ini file
localPath = sys.path[0]
settingsFile = os.path.join(localPath, "settings.ini")
if os.path.isfile(settingsFile):
config = ConfigParser.ConfigParser()
config.read(settingsFile)
else:
print "INI file not found. \nMake sure a valid 'settings.ini' file exists in the same directory as this script."
sys.exit()
# AGOL Credentials
inputUsername = config.get( 'AGOL', 'USER')
inputPswd = config.get('AGOL', 'PASS')
# FS values
MXD = config.get('FS_INFO', 'MXD')
serviceName = config.get('FS_INFO', 'SERVICENAME')
folderName = config.get('FS_INFO', 'FOLDERNAME')
tags = config.get('FS_INFO', 'TAGS')
description = config.get('FS_INFO', 'DESCRIPTION')
maxRecords = config.get('FS_INFO', 'MAXRECORDS')
# Share FS to: everyone, org, groups
shared = config.get('FS_SHARE', 'SHARE')
everyone = config.get('FS_SHARE', 'EVERYONE')
orgs = config.get('FS_SHARE', 'ORG')
groups = config.get('FS_SHARE', 'GROUPS') #Groups are by ID. Multiple groups comma separated
# create a temp directory under the script
tempDir = os.path.join(localPath, "tempDir")
if not os.path.isdir(tempDir):
os.mkdir(tempDir)
finalSD = os.path.join(tempDir, serviceName + ".sd")
#initialize AGOLHandler class
agol = AGOLHandler(inputUsername, inputPswd, serviceName, folderName)
# Turn map document into .SD file for uploading
makeSD(MXD, serviceName, tempDir, finalSD, maxRecords)
# overwrite the existing .SD on arcgis.com
if upload(finalSD, tags, description):
# publish the sd which was just uploaded
newItemID = publish()
# share the item
if shared:
enableSharing(newItemID, everyone, orgs, groups)
print "\nfinished."
If I understand your question correctly, you just want to add another loop in your main block and move most of what is currently there into a new function (in my example, the new function is called 'process_ini').
So, try replacing everything from your if __name__ == "__main__": line through the end with:
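def process_ini(fileName):
# upload(), publish() and enableSharing() look up agol as a module-level
# name, so it has to be assigned as a global here, not as a local.
global agol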
settingsFile = os.path.join(localPath, fileName)
if os.path.isfile(settingsFile):
config = ConfigParser.ConfigParser()
config.read(settingsFile)
else:
print "INI file not found: {}\nMake sure it exists in the same directory as this script.".format(settingsFile)
sys.exit()
# AGOL Credentials
inputUsername = config.get( 'AGOL', 'USER')
inputPswd = config.get('AGOL', 'PASS')
# FS values
MXD = config.get('FS_INFO', 'MXD')
serviceName = config.get('FS_INFO', 'SERVICENAME')
folderName = config.get('FS_INFO', 'FOLDERNAME')
tags = config.get('FS_INFO', 'TAGS')
description = config.get('FS_INFO', 'DESCRIPTION')
maxRecords = config.get('FS_INFO', 'MAXRECORDS')
# Share FS to: everyone, org, groups
shared = config.get('FS_SHARE', 'SHARE')
everyone = config.get('FS_SHARE', 'EVERYONE')
orgs = config.get('FS_SHARE', 'ORG')
groups = config.get('FS_SHARE', 'GROUPS') #Groups are by ID. Multiple groups comma separated
# create a temp directory under the script
tempDir = os.path.join(localPath, "tempDir")
if not os.path.isdir(tempDir):
os.mkdir(tempDir)
finalSD = os.path.join(tempDir, serviceName + ".sd")
#initialize AGOLHandler class
agol = AGOLHandler(inputUsername, inputPswd, serviceName, folderName)
# Turn map document into .SD file for uploading
makeSD(MXD, serviceName, tempDir, finalSD, maxRecords)
# overwrite the existing .SD on arcgis.com
if upload(finalSD, tags, description):
# publish the sd which was just uploaded
newItemID = publish()
# share the item
if shared:
enableSharing(newItemID, everyone, orgs, groups)
print "\nfinished."
if __name__ == "__main__":
    print "Starting Feature Service publish process"
    # Find and gather settings from the ini file
    localPath = sys.path[0]
    for fileName in ['settings.ini', 'flurb.ini', 'durf.ini']:
        process_ini(fileName)
You'd have to write all the ini filenames in the list found in the penultimate line of my example.
Alternatively, you could identify all the .ini files in the directory via code:
if __name__ == "__main__":
    print "Starting Feature Service publish process"
    # Find and gather settings from the ini file
    localPath = sys.path[0]
    fileNames = [os.path.join(localPath, i) for i in os.listdir(localPath) if i.endswith('.ini')]
    for fileName in fileNames:
        process_ini(fileName)
It also might help to set the working directory (e.g., os.chdir(localPath)), but I'm going off of what you already had.
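As a rough, self-contained illustration of the same idea, the sketch below collects every .ini file in a directory with glob and reads a few of the keys from the question's file. It is written for Python 3, where the module is called configparser (ConfigParser on Python 2). The function names load_settings and load_all_settings and the dictionary keys are made up for this example. It also reads SHARE with getboolean(), since config.get() returns a string and a non-empty string such as "False" still passes an if test.

```python
import configparser  # named ConfigParser on Python 2
import glob
import os


def load_settings(ini_path):
    """Read one .ini file and return a few of its values as a dict."""
    config = configparser.ConfigParser()
    # read() returns the list of files it managed to parse.
    if not config.read(ini_path):
        raise IOError("Could not read {}".format(ini_path))
    return {
        "service_name": config.get("FS_INFO", "SERVICENAME"),
        "max_records": config.get("FS_INFO", "MAXRECORDS"),
        # getboolean() understands true/false, yes/no, on/off and 1/0,
        # case-insensitively, and returns a real bool.
        "shared": config.getboolean("FS_SHARE", "SHARE"),
    }


def load_all_settings(directory):
    """Yield (path, settings) for every .ini file in directory, sorted by name."""
    for ini_path in sorted(glob.glob(os.path.join(directory, "*.ini"))):
        yield ini_path, load_settings(ini_path)
```

A caller could then loop with for path, settings in load_all_settings(localPath): and pass the values on to the publishing steps, one service per file.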

How to save an image with the correct file extension?

I have a script that parses HTML and saves the images to disk.
However, for some reason it outputs the filename wrongly.
It is not saving the file with the correct file extension on Windows. E.g., the image should be saved as <filename>.jpg or <filename>.gif. Instead, the images are being saved with no filename extension.
Could you help me to see why this script is not saving the extension correctly in the filename?
I'm running Python 2.7.
""" Tumblr downloader
This program will download all the images from a Tumblr blog """
from urllib import urlopen, urlretrieve
import os, sys, re
def download_images(images, path):
for im in images:
print(im)
filename = re.findall("([^/]*).(?:jpg|gif|png)",im)[0]
filename = os.path.join(path,filename)
try:
urlretrieve(im, filename.replace("500","1280"))
except:
try:
urlretrieve(im, filename)
except:
print("Failed to download "+im)
def main():
#Check input arguments
if len(sys.argv) < 2:
print("usage: ./tumblr_rip.py url [starting page]")
sys.exit(1)
url = sys.argv[1]
if len(sys.argv) == 3:
pagenum = int(sys.argv[2])
else:
pagenum = 1
if (check_url(url) == ""):
print("Error: Malformed url")
sys.exit(1)
if (url[-1] != "/"):
url += "/"
blog_name = url.replace("http://", "")
blog_name = re.findall("(?:.[^\.]*)", blog_name)[0]
current_path = os.getcwd()
path = os.path.join(current_path, blog_name)
#Create blog directory
if not os.path.isdir(path):
os.mkdir(path)
html_code_old = ""
while(True):
#fetch html from url
print("\nFetching images from page "+str(pagenum)+"\n")
f = urlopen(url+"page/"+str(pagenum))
html_code = f.read()
html_code = str(html_code)
if(check_end(html_code, html_code_old, pagenum)):
break
images = get_images_page(html_code)
download_images(images, path)
html_code_old = html_code
pagenum += 1
print("Done downloading all images from " + url)
if __name__ == '__main__':
main()
The line
filename = re.findall("([^/]*).(?:jpg|gif|png)",im)[0]
does not do what you think it does. First off, the dot is unescaped, meaning it will match any character, not just a period.
But the bigger problem is that you messed up the groups. You're accessing the value of the first group in the match, which is the first part inside parentheses, giving you only the base filename without the extension. The second group, containing the extension, is a separate, noncapturing group. The (?:...) syntax makes a group noncapturing.
The way I fixed it was by putting a group around the entire match and making the existing groups noncapturing.
re.findall("((?:[^/]*)\.(?:jpg|gif|png))",im)[0]
P.S. Another problem is that the pattern is greedy so it can match multiple filenames at once. However, this isn't necessarily invalid, since spaces and periods are allowed in filenames. So if you want to match multiple filenames here, you'll have to figure out what to do yourself. Something like "((?:\w+)\.(?:jpg|gif|png))" would be more intuitive though.
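The difference between the two patterns can be checked on a single URL. The sketch below is written for Python 3, and the example URL and the helper name filename_from_url are made up for illustration. It also shows an alternative that skips the regex and takes the last path segment of the URL with urllib.parse (the urlparse module on Python 2.7).

```python
import posixpath
import re
from urllib.parse import urlparse

url = "http://example.com/media/tumblr_abc123_500.jpg"

# Original pattern: the only capturing group is the base name, so the
# extension is matched but not returned by findall().
original = re.findall(r"([^/]*).(?:jpg|gif|png)", url)[0]

# Pattern from the answer: one capturing group around the whole file name.
fixed = re.findall(r"((?:[^/]*)\.(?:jpg|gif|png))", url)[0]


def filename_from_url(image_url):
    # Hypothetical helper: last segment of the URL path, query string excluded.
    return posixpath.basename(urlparse(image_url).path)


print(original)
print(fixed)
print(filename_from_url(url))
```

Here original is the name without an extension, while fixed and filename_from_url(url) both give tumblr_abc123_500.jpg. The path-based version does not depend on a fixed list of extensions, which may or may not be what you want for this script.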
