I have written a simple function to resize an image from 1500x2000px to 900x1200px.
def resizeImage(file_list):
if file_list:
if not os.path.exists('resized'):
os.makedirs('resized')
i = 0
for files in file_list:
i += 1
im = Image.open(files)
im = im.resize((900,1200),Image.ANTIALIAS)
im.save('resized/' + files, quality=90)
print str(i) + " files resized successfully"
else:
print "No files to resize"
i used the timeit function to measure how long it takes to run with some example images. Here is an example of the results.
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.25000018229 | 5.31371171493 | 5.27186083393 |
+---------------+-----------+---------------+---------------+---------------+
But if i repeat the test the times gradually keep increasing i.e.
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.36660298734 | 5.57177596057 | 5.45903467485 |
+---------------+-----------+---------------+---------------+---------------+
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.58739076382 | 5.76515489024 | 5.70014196601 |
+---------------+-----------+---------------+---------------+---------------+
+---------------+-----------+---------------+---------------+-------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+-------------+
| Resize normal | 10 | 5.77366483042 | 6.00337707034 | 5.891541538 |
+---------------+-----------+---------------+---------------+-------------+
+---------------+-----------+---------------+--------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+--------------+---------------+
| Resize normal | 10 | 5.91993466793 | 6.1294756299 | 6.03516199948 |
+---------------+-----------+---------------+--------------+---------------+
This is how im running the test.
def resizeTest(repeats):
os.chdir('C:/Users/dominic/Desktop/resize-test')
files = glob.glob('*.jpg')
t = timeit.Timer(
"resizeImage(filess)",
setup="from imageToolkit import resizeImage; import glob; filess = glob.glob('*.jpg')"
)
time = t.repeat(repeats, 1)
results = {
'name': 'Resize normal',
'files': len(files),
'min': min(time),
'max': max(time),
'average': averageTime(time)
}
resultsTable(results)
I have moved the images that are processed from my mechanical hard drive to the SSD and the issue persists. I have also checked the Memory being used and it stays pretty steady through all the runs, topping out at around 26Mb, the process uses around 12% of one core of the CPU.
Going forward i like to experiment with the multiprocessing library to increase the speed, but i'd like to get to the bottom of this issue first.
Would this be an issue with my loop that causes the performance to degrade?
The im.save() call is slowing things down; repeated writing to the same directory is perhaps thrashing OS disk caches. When you removed the call, the OS was able to optimize the image read access times via disk caches.
If your machine has multiple CPU cores, you can indeed speed up the resize process, as the OS will schedule multiple sub-processes across those cores to run each resize operation. You'll not get a linear performance improvement, as all those processes still have to access the same disk for both reads and writes.
Related
I'm running a pytorch + ray hyperparameter optimization, and the output to screen while the algorithm is running is:
== Status ==
Current time: 2022-09-20 16:10:16 (running for 00:11:50.27)
Memory usage on this node: 29.0/128.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 1.0/1 GPUs, 0.0/83.03 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P2000)
Result logdir: /home/runs/train_fn_2022-09-20_15-58-22
Number of trials: 1/1 (1 RUNNING)
+-------------------+----------+-------------------+------------+-----------+------------------+--------------+
| Trial name | status | loc | c_hidden | dp_rate | dp_rate_linear | num_layers |
|-------------------+----------+-------------------+------------+-----------+------------------+--------------|
| train_fn_0fcfd440 | RUNNING | 172.18.0.2:356387 | 64 | 0.3 | 0.1 | 3 |
+-------------------+----------+-------------------+------------+-----------+------------------+--------------+
You can see it's saying I've requested 1 GPU - but then 0.0/1.0 accelerators is mentioned further on in the same line.
Can someone tell me how to tell if pytorch is actually implementing a GPU as it should be?
My current attempt:
This is my current code:
from moviepy.editor import *
clips = [VideoFileClip('a.mp4'), VideoFileClip('b.mp4'), VideoFileClip('c.mp4')]
transitioned_clips = [demo_clip.crossfadein(2) for demo_clip in clips]
for_delivery = concatenate_videoclips(transitioned_clips)
for_delivery.write_videofile(target_path, fps=clip.fps, bitrate='%dK' % (bitrate), threads=50, verbose=False, logger=None, preset='ultrafast')
I also tried using CompositeVideoClip, but:
It resulted in a completely black video.
Even for the completely black video it took 50 times longer to write the video file than for without transitions.
My current output:
My current output is a video with the 3 videos concatenated (which is good), but no transitions between the clips (which is not good).
My goal:
My goal is to add the crossfadein transition for 2 seconds between the clips and concatenate the clips into one video and output it.
In other words, I want it like (in order from left to right):
| | + | | + | |
| clip 1 | transition 1 | clip 2 | transition 2 | clip 3 |
| | + | | + | |
Is there anyway to have transitions? Any help appreciated.
You could try this approach of manually setting the start time to handle the transitions.
padding = 2
video_clips = [VideoFileClip('a.mp4'), VideoFileClip('b.mp4'), VideoFileClip('c.mp4')]
video_fx_list = [video_clips[0]]
idx = video_clips[0].duration - padding
for video in video_clips[1:]:
video_fx_list.append(video.set_start(idx).crossfadein(padding))
idx += video.duration - padding
final_video = CompositeVideoClip(video_fx_list)
final_video.write_videofile(target_path, fps=clip.fps) # add any remaining params
Edit:
Here's an attempt using concatenate:
custom_padding = 2
final_video = concatenate(
[
clip1,
clip2.crossfadein(custom_padding),
clip3.crossfadein(custom_padding)
],
padding=-custom_padding,
method="chain"
)
final_video.write_videofile(target_path, fps=clip.fps) # add any remaining params
TL;DR - The actual problem is that I am working on something that provides information about the entries in an archive file and specifies 'where' the size in the archive is coming from. The example below is sort of exactly not like my real problem(which has hundreds of thousands of entries) but highlights the actual problem I'm running into. My problem is that there's a non-trivial amount of size in my archive that is unaccounted for (actually used in the overhead for compression is my guess). The sum of the parts of my archive (the total compressed size of all of my entries + the expected gaps between them) is less than the actual size of the archive. How do I inspect the archive in a way that provides insight into this hidden overhead?
Where I'm at:
I have a directory that contains three files:
doc.pdf
cat.jpg
model.stl
Using a freeware program I dump these into a zip file: demo.zip
Using python I can inspect these pretty easily:
info_list= zipfile.ZipFile('demo.zip').infolist()
for i in info_list:
print i.orig_filename
print i.compress_size
print i.header_offset
Using this info we can get some info.
The total size of demo.zip is 84469
The compressed size of:
|---------------------|-----------------|---------------|
| File | Compressed Size | Header Offset |
|---------------------|-----------------|---------------|
| doc.pdf | 21439 | 0 |
|---------------------|-----------------|---------------|
| cat.jpg | 48694 | 21495 |
|---------------------|-----------------|---------------|
| model.stl | 13870 | 70232 |
|---------------------|-----------------|---------------|
I know that zipping will result in some space between entries. (Thus the difference between the sums of previous entry sizes and the header offset for every entry). You can calculate this small 'Gap':
gap = offset - previous_entry_size - previous_entry_offset
I can update my chart to look like:
|---------------------|-----------------|---------------|---------------|
| File | Compressed Size | Header Offset | 'Gap' |
|---------------------|-----------------|---------------|---------------|
| doc.pdf | 21439 | 0 | 0 |
|---------------------|-----------------|---------------|---------------|
| cat.jpg | 48694 | 21495 | 56 |
|---------------------|-----------------|---------------|---------------|
| model.stl | 13870 | 70232 | 43 |
|---------------------|-----------------|---------------|---------------|
Cool. So now one might expect that the size of demo.zip would be equal to the sum of the size of all entries and their gaps. (84102 in the example above).
But that's not the case. So, obviously, zipping requires headers and information about how zipping occurred (and how to unzip). But I'm running into a problem on how to define this or access any more information about it.
I could just take 84469 - 84102 and say ~magic zip overhead~ = 367 bytes. But that seems less than ideal because this number obviously is not magic. Is there a way to inspect the underlying zip data that is taking up this space?
An empty zip file is 22 bytes, containing only the End of Central Directory Record.
In [1]: import zipfile
In [2]: z = zipfile.ZipFile('foo.zip', 'w')
In [3]: z.close()
In [4]: import os
In [5]: os.stat('foo.zip').st_size
Out[5]: 22
If the zip-file is not empty, for every file you have a central directory file header (at least 46 bytes), and a local file header (at least 30 bytes).
The actual headers have a variable length because the given lengths do not include space for the file name which is part of the header.
I am trying to find out my rate limit remaining by using rate_limit_remaining call, but it doesn't seems work properly. According to pydoc:
class TwitterResponse(__builtin__.object)
| Response from a twitter request. Behaves like a list or a string
| (depending on requested format) but it has a few other interesting
| attributes.
|
| `headers` gives you access to the response headers as an
| httplib.HTTPHeaders instance. You can do
| `response.headers.get('h')` to retrieve a header.
|
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| rate_limit_limit
| The rate limit ceiling for that given request.
|
| rate_limit_remaining
| Remaining requests in the current rate-limit.
|
| rate_limit_reset
| Time in UTC epoch seconds when the rate limit will reset.
My code:
from twitter import *
t = Twitter(
auth=OAuth('all the tokens'))
t.TwitterResponse.rate_limit_remaining
In a Python dict of 50 items would there be any known noticeable speed difference in matching an integer key (2 digits) to find a string value VERSUS matching a string key (5 - 10+ letters) to find an integer value over a large number of loops (100,000+)?
As a minor bonus; is there any benefit to performing an activity like this in MYSQL versus Python if you're able to?
Micro-benchmarking language
features is a useful exercise, but you have to take it with
a grain of salt. It's hard to do
benchmarks in accurate and meaningful ways, and generally
what people care about are total performance, not individual feature
performance.
I find using a "test harness" makes it easier to run
differnet alternatives in a comparable way.
For dictionary lookup, here's an example using the benchmark module from PyPI.
100
randomized runs, setting up dicts of N=50 items
each--either int keys and str values or the
reverse, then trying both the try/excepts
and get access paradigms. Here's the code:
import benchmark
from random import choice, randint
import string
def str_key(length=8, alphabet=string.ascii_letters):
return ''.join(choice(alphabet) for _ in xrange(length))
def int_key(min=10, max=99):
return randint(min, max)
class Benchmark_DictLookup(benchmark.Benchmark):
each = 100 # allows for differing number of runs
def setUp(self):
# Only using setUp in order to subclass later
# Can also specify tearDown, eachSetUp, and eachTearDown
self.size = 1000000
self.n = 50
self.intdict = { int_key():str_key() for _ in xrange(self.n) }
self.strdict = { str_key():int_key() for _ in xrange(self.n) }
self.intkeys = [ int_key() for _ in xrange(self.size) ]
self.strkeys = [ str_key() for _ in xrange(self.size) ]
def test_int_lookup(self):
d = self.intdict
for key in self.intkeys:
try:
d[key]
except KeyError:
pass
def test_int_lookup_get(self):
d = self.intdict
for key in self.intkeys:
d.get(key, None)
def test_str_lookup(self):
d = self.strdict
for key in self.strkeys:
try:
d[key]
except KeyError:
pass
def test_str_lookup_get(self):
d = self.strdict
for key in self.strkeys:
d.get(key, None)
class Benchmark_Hashing(benchmark.Benchmark):
each = 100 # allows for differing number of runs
def setUp(self):
# Only using setUp in order to subclass later
# Can also specify tearDown, eachSetUp, and eachTearDown
self.size = 100000
self.intkeys = [ int_key() for _ in xrange(self.size) ]
self.strkeys = [ str_key() for _ in xrange(self.size) ]
def test_int_hash(self):
for key in self.intkeys:
id(key)
def test_str_hash(self):
for key in self.strkeys:
id(key)
if __name__ == '__main__':
benchmark.main(format="markdown", numberFormat="%.4g")
And the results:
$ python dictspeed.py
Benchmark Report
================
Benchmark DictLookup
--------------------
name | rank | runs | mean | sd | timesBaseline
---------------|------|------|--------|---------|--------------
int lookup get | 1 | 100 | 0.1756 | 0.01619 | 1.0
str lookup get | 2 | 100 | 0.1859 | 0.01477 | 1.05832996073
int lookup | 3 | 100 | 0.5236 | 0.03935 | 2.98143047487
str lookup | 4 | 100 | 0.8168 | 0.04961 | 4.65108861267
Benchmark Hashing
-----------------
name | rank | runs | mean | sd | timesBaseline
---------|------|------|----------|-----------|--------------
int hash | 1 | 100 | 0.008738 | 0.000489 | 1.0
str hash | 2 | 100 | 0.008925 | 0.0002952 | 1.02137781609
Each of the above 600 runs were run in random, non-consecutive order by
`benchmark` v0.1.5 (http://jspi.es/benchmark) with Python 2.7.5
Darwin-13.4.0-x86_64 on 2014-10-28 19:23:01.
Conclusion: String lookup in dictionaries is not that much more expensive than integer lookup. BUT the supposedly Pythonic "ask forgiveness not permission" paradigm takes much longer than simply using the get method call. Also, hashing a string (at least of size 8) is not much more expensive than hashing an integer.
But then things get even more interesting if you run on a different implementation, like PyPy:
$ pypy dictspeed.py
Benchmark Report
================
Benchmark DictLookup
--------------------
name | rank | runs | mean | sd | timesBaseline
---------------|------|------|---------|-----------|--------------
int lookup get | 1 | 100 | 0.01538 | 0.0004682 | 1.0
str lookup get | 2 | 100 | 0.01993 | 0.001117 | 1.295460397
str lookup | 3 | 100 | 0.0203 | 0.001566 | 1.31997704025
int lookup | 4 | 100 | 0.02316 | 0.001056 | 1.50543635375
Benchmark Hashing
-----------------
name | rank | runs | mean | sd | timesBaseline
---------|------|------|-----------|-----------|--------------
str hash | 1 | 100 | 0.0005657 | 0.0001609 | 1.0
int hash | 2 | 100 | 0.006066 | 0.0005283 | 10.724346492
Each of the above 600 runs were run in random, non-consecutive order by
`benchmark` v0.1.5 (http://jspi.es/benchmark) with Python 2.7.8
Darwin-13.4.0-x86_64 on 2014-10-28 19:23:57.
PyPy is about 11x faster, best case, but the ratios are much different. PyPy doesn't suffer the significant exception-handling cost that CPython does. And, hashing an integer is 10x slower than hashing a string. How about that for an unexpected result?
I would have tried Python 3, but benchmark didn't install well there. I also tried increasing the string length to 50. It didn't markedly change the results, the ratios, or the conclusions.
Overall, hashing and lookups are so fast that, unless you have to do them by the millions or billions, or have extraordinarily long keys, or some other unusual circumstance, developers generally needn't be concerned about their micro-performance.