I am using Python in an AWS Lambda function to list keys in an S3 bucket that begin with a specific id
for object in mybucket.objects.all():
    file_name = os.path.basename(object.key)
    match_id = file_name.split('_', 1)[0]
The problem is that if the S3 bucket has several thousand files, the iteration is very inefficient and the Lambda function sometimes times out.
Here is an example file name:
https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg
I want to iterate only over objects whose key contains "012345".
Any good suggestions on how I can accomplish that?
Here is how you can solve it.
S3 stores everything as objects; there are no real folders or file names, those are just a convenience for the user.
aws s3 ls s3://bucket/folder1/folder2/filenamepart --recursive
will list all S3 object keys that match that path prefix.
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
# the Prefix filter is applied server-side, so only matching keys are returned
for obj in my_bucket.objects.filter(Prefix='012345'):
    print(obj)
To speed up the listing further, you can run several listings in parallel, for example one per known key prefix, as sketched below.
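A minimal sketch of that idea, assuming the ids you care about are known up front (the prefixes below are made up) and using a thread pool around the same Prefix filter:
import boto3
from concurrent.futures import ThreadPoolExecutor

def keys_for_prefix(prefix):
    # boto3 resources are not thread-safe, so each task creates its own
    bucket = boto3.resource('s3').Bucket('bucketname')
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]

prefixes = ['012345', '012346', '012347']  # hypothetical ids
with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
    results = dict(zip(prefixes, pool.map(keys_for_prefix, prefixes)))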
Hope it helps.
You can improve speed by 30-40% by dropping os.path.basename() and using plain string methods.
Depending on the assumptions you can make about the file path string, you can get additional speedups:
Using os.path.basename():
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
os.path.basename(fname).split("_")[0] == match
# 1.03 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without os, splitting first on / and then on _:
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("/")[-1].split("_")[0] == match
# 657 ns ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you know that the only underscores occur in the actual file name, you can use just one split():
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("_")[0][-6:] == match
# 388 ns ± 5.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Related
Objective:
I have thousands of data dumps whose format, after unzipping, is a long string containing 150K JSON objects separated by '\n'.
big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'
Each JSON contains dozens of useless keys like garbage, but my objective is only to sum the qty for each mineral.
result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
How to reproduce:
import random
import json

minerals = ['gold', 'silver', 'adamantium']

# use json.dumps so that each line is valid JSON
# (str() of a dict gives single-quoted output that json/ujson cannot parse)
big_string = '\n'.join([
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
])
def solution(big_string):
    # Show me your move
    return dict()  # or pd.DataFrame()
My current solution (which I find slower than expected):
Splitting the string on the '\n' separator with a generator (see https://stackoverflow.com/a/9770397/4974431)
Loading each line with the ujson library (supposed to be faster than the standard json lib)
Accessing only the values needed, 'mineral' and 'qty'
Doing the aggregation using pandas
Which gives:
import ujson
import re
import pandas as pd

# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
def my_solution(big_string):
    useful_fields = ['mineral', 'qty']
    filtered_data = []
    for line in lines(big_string, sep="\n"):
        line = ujson.loads(line)
        filtered_data.append([line[field] for field in useful_fields])
    result = pd.DataFrame(filtered_data, columns=useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()
Any improvement, even by 25%, would be great because I have thousands of these to process!
I must confess: I'm going to use a library of mine, convtools (on GitHub). The solution below:
relies on iterating over io.StringIO, which splits the lines on \n itself
processes the data as a stream, without additional allocations
import random
from io import StringIO

import ujson
from convtools import conversion as c

minerals = ["gold", "silver", "adamantium"]
big_string = "\n".join(
    [
        ujson.dumps(
            {
                "mineral": random.choice(minerals),
                "qty": random.randint(1, 1000),
                "garbage": random.randint(1, 777),
                "other_garbage": random.randint(-10, 10),
            }
        )
        for _ in range(150000)
    ]
)
# define a conversion and generate an ad hoc converter
converter = (
    c.iter(c.this.pipe(ujson.loads))
    .pipe(
        c.group_by(c.item("mineral")).aggregate(
            {
                "mineral": c.item("mineral"),
                "qty": c.ReduceFuncs.Sum(c.item("qty")),
            }
        )
    )
    .gen_converter()
)
# let's check
"""
In [48]: converter(StringIO(big_string))
Out[48]:
[{'mineral': 'silver', 'qty': 24954551},
{'mineral': 'adamantium', 'qty': 25048483},
{'mineral': 'gold', 'qty': 24975201}]
In [50]: OPs_solution(big_string)
Out[50]:
mineral qty
0 adamantium 25048483
1 gold 24975201
2 silver 24954551
"""
Let's profile:
In [53]: %timeit OPs_solution(big_string)
339 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: %timeit converter(StringIO(big_string))
93.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
After looking at where the time is spent, it appears that 90% of the total time is spent in the generator.
Changing it to the approach in https://stackoverflow.com/a/59071238/4974431 gave an 85% time improvement.
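For reference, a simpler line generator along those lines (a sketch based on str.find, not necessarily identical to the linked answer) avoids the regex machinery entirely:
def lines(string, sep="\n"):
    # yield the substrings between separators without building a regex
    start = 0
    while True:
        end = string.find(sep, start)
        if end == -1:
            yield string[start:]
            return
        yield string[start:end]
        start = end + len(sep)
str.splitlines() is another option if materializing a list of 150K lines is acceptable.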
I was attempting to determine, via IPython's %%timeit mechanism, whether set.remove is faster than list.remove, when a conundrum came up.
I could do
In [1]: %%timeit
a_list = list(range(100))
a_list.remove(50)
and then do the same thing but with a set. However, this would include the overhead from the list/set construction. Is there a way to re-build the list/set each iteration but only time the remove method?
Put your setup code on the same line as %%timeit to create any names or run any precursor operations you need:
https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit
In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
%%timeit setup_code
...
Unfortunately, only a single loop per run can be used here, since the setup code is not re-run between loops (after the first loop the element has already been removed):
%%timeit -n1 x = list(range(100))
x.remove(50)
Surprisingly, this doesn't accept a string the way the timeit module does, so combined with the single-loop requirement, I'd still defer to timeit with a string setup= (and repeat it) if a lot of setup or statistically higher precision is needed.
See Kelly Bundy's much more precise answer below for more!
Alternatively, using the timeit module with more repetitions and some statistics:
list: 814 ns ± 3.7 ns
set: 152 ns ± 1.6 ns
list: 815 ns ± 4.3 ns
set: 154 ns ± 1.6 ns
list: 817 ns ± 4.3 ns
set: 153 ns ± 1.6 ns
Code (Try it online!):
from timeit import repeat
from statistics import mean, stdev

for _ in range(3):
    for kind in 'list', 'set':
        ts = repeat('data.remove(50)', f'data = {kind}(range(100))', number=1, repeat=10**5)
        ts = [t * 1e9 for t in sorted(ts)[:1000]]
        print('%4s: %3d ns ± %.1f ns' % (kind, mean(ts), stdev(ts)))
I usually use timeit in a Jupyter notebook like this:
def some_func():
    for x in range(1000):
        return x

timeit(some_func())
and get the result like this:
6.3 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but today I got an error like this:
TypeError Traceback (most recent call last)
<ipython-input-11-fef6a46355f1> in <module>
----> 1 timeit(some_func())
TypeError: 'module' object is not callable
Why does this occur?
You are currently trying to call the timeit module itself, rather than the timeit function contained within it.
You should change your import statement from import timeit to from timeit import timeit. Alternatively, you can call the function using timeit.timeit.
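For instance, a minimal sketch of both options, reusing the function from the question and passing the callable itself rather than its return value:
import timeit

def some_func():
    for x in range(1000):
        return x

# option 1: fully qualified call into the module
print(timeit.timeit(some_func, number=1_000_000))

# option 2: import the function directly (the name now refers to the function)
from timeit import timeit
print(timeit(some_func, number=1_000_000))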
After searching and trying for a while, I realized that when we want to use timeit(some_func()) we do not need to import timeit, but we should write it in a separate input cell of the Jupyter notebook (IPython's automagic then runs it as the %timeit line magic), like this:
In [1]:
def some_func():
    for x in range(1000):
        return x
In [2]:
timeit(some_func())
and we will get output like this:
280 ns ± 2.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When we write it in one input like this:
In [1]:
def some_func():
    for x in range(1000):
        return x

timeit(some_func())
we'll get a NameError saying that timeit is not defined, and when we import timeit we'll get the other error I showed in the question, TypeError: 'module' object is not callable.
This is because when we import timeit we need to call timeit.timeit and specify the stmt (and setup, if needed), e.g.:
import timeit

SETUP = """
import yourmodul_here
"""

TEST_CODE = """
def some_func():
    for x in range(1000):
        return x
"""

timeit.timeit(stmt=TEST_CODE, setup=SETUP, number=2000000)
And we'll get the output like this:
0.12415042300017376
stmt is the code to run
setup is code that needs to run before stmt is timed
stmt will be executed number times (the default is 1000000)
So when we import timeit we need to write a bit more.
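Note that TEST_CODE above only defines some_func and never calls it, so what gets timed is the def statement itself. If the goal is to time the call, one variant (a sketch; the numbers will of course differ from the output above) is to put the definition in setup and the call in stmt:
import timeit

SETUP = """
def some_func():
    for x in range(1000):
        return x
"""

# the definition lives in setup, so only the call itself is timed
print(timeit.timeit(stmt="some_func()", setup=SETUP, number=2000000))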
I have the following code, which I need to run more than once. Currently, it takes too long. Is there a more efficient way to write these two for loops?
ErrorEst = []
for i in range(len(embedingFea)):  # 17000
    temp = []
    for j in range(len(emedingEnt)):  # 15000
        if cooccurrenceCount[i][j] > 0:
            # print(cooccurrenceCount[i][j] / count_max)
            weighting_factor = np.min(
                [1.0,
                 math.pow(np.float32(cooccurrenceCount[i][j] / count_max), scaling_factor)])
            embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
            # tf.log(tf.to_float(self.__cooccurrence_count))
            log_cooccurrences = np.log(np.float32(cooccurrenceCount[i][j]))
            distance_expr = np.square([
                embedding_product +
                focal_bias[i],
                context_bias[j],
                -log_cooccurrences])
            single_losses = weighting_factor * distance_expr
            temp.append(single_losses)
    ErrorEst.append(np.sum(temp))
You can use Numba or Cython
First, make sure to avoid lists wherever possible and write simple, readable code with explicit loops, like you would for example in C. All inputs and outputs should be only NumPy arrays or scalars.
Your Code
import numpy as np
import numba as nb
import math

def your_func(embedingFea, emedingEnt, cooccurrenceCount, count_max, scaling_factor, focal_bias, context_bias):
    ErrorEst = []
    for i in range(len(embedingFea)):  # 17000
        temp = []
        for j in range(len(emedingEnt)):  # 15000
            if cooccurrenceCount[i][j] > 0:
                weighting_factor = np.min([1.0, math.pow(np.float32(cooccurrenceCount[i][j] / count_max), scaling_factor)])
                embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
                log_cooccurrences = np.log(np.float32(cooccurrenceCount[i][j]))
                distance_expr = np.square([embedding_product + focal_bias[i], context_bias[j], -log_cooccurrences])
                single_losses = weighting_factor * distance_expr
                temp.append(single_losses)
        ErrorEst.append(np.sum(temp))
    return ErrorEst
Numba Code
@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def your_func_2(embedingFea, emedingEnt, cooccurrenceCount, count_max, scaling_factor, focal_bias, context_bias):
    ErrorEst = np.empty((embedingFea.shape[0], 2))
    for i in nb.prange(embedingFea.shape[0]):
        temp_1 = 0.
        temp_2 = 0.
        for j in range(emedingEnt.shape[0]):
            if cooccurrenceCount[i, j] > 0:
                weighting_factor = (cooccurrenceCount[i, j] / count_max) ** scaling_factor
                if weighting_factor > 1.:
                    weighting_factor = 1.
                embedding_product = emedingEnt[j] * embedingFea[i]
                log_cooccurrences = np.log(cooccurrenceCount[i, j])
                temp_1 += weighting_factor * (embedding_product + focal_bias[i]) ** 2
                temp_1 += weighting_factor * (context_bias[j]) ** 2
                temp_1 += weighting_factor * (log_cooccurrences) ** 2
                temp_2 += weighting_factor * (1. + focal_bias[i]) ** 2
                temp_2 += weighting_factor * (context_bias[j]) ** 2
                temp_2 += weighting_factor * (log_cooccurrences) ** 2
        ErrorEst[i, 0] = temp_1
        ErrorEst[i, 1] = temp_2
    return ErrorEst
Timings
embedingFea=np.random.rand(1700)+1
emedingEnt=np.random.rand(1500)+1
cooccurrenceCount=np.random.rand(1700,1500)+1
focal_bias=np.random.rand(1700)
context_bias=np.random.rand(1500)
count_max=100
scaling_factor=2.5
%timeit res_1=your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
1min 1s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
17.6 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you need to increase the performance of your code, you could write the hot loops in a low-level language like C and try to avoid the use of floating-point numbers where possible.
Possible solution: Can we use C code in Python?
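As a rough illustration of that route (a sketch only: it assumes a hypothetical shared library libloops.so compiled from your own C implementation of the inner loops, exposing a sum_losses function), ctypes can be used to call C from Python:
import ctypes
import numpy as np

# hypothetical shared library built from your own C code
lib = ctypes.CDLL("./libloops.so")
lib.sum_losses.restype = ctypes.c_double
lib.sum_losses.argtypes = [
    ctypes.POINTER(ctypes.c_double),  # flattened co-occurrence matrix
    ctypes.c_size_t,                  # number of rows
    ctypes.c_size_t,                  # number of columns
]

counts = np.ascontiguousarray(np.random.rand(1700, 1500))
total = lib.sum_losses(
    counts.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
    counts.shape[0],
    counts.shape[1],
)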
You could try using numba and wrapping your code with the @jit decorator. Usually the first execution needs to compile some things and thus won't see much speedup, but subsequent calls will be much faster.
You may need to put your loop in a function for this to work.
from numba import jit

@jit(nopython=True)
def my_double_loop(some, arguments):
    for i in range(len(embedingFea)):  # 17000
        temp = []
        for j in range(len(emedingEnt)):  # 15000
            # ...
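To illustrate the compile-on-first-call behaviour mentioned above, here is a small self-contained toy (the function and data are made up for the example):
import numpy as np
from numba import jit

@jit(nopython=True)
def row_sums(matrix):
    # explicit loops are fine: numba compiles them to machine code
    out = np.empty(matrix.shape[0])
    for i in range(matrix.shape[0]):
        s = 0.0
        for j in range(matrix.shape[1]):
            s += matrix[i, j]
        out[i] = s
    return out

data = np.random.rand(2000, 1500)
row_sums(data)  # first call: includes compilation time
row_sums(data)  # later calls run the already-compiled machine code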
How would I be able to get the RGB value of a pixel on my screen live with Python? I have tried using
from PIL import ImageGrab as ig

while True:
    screen = ig.grab()
    g = screen.getpixel((358, 402))
    print(g)
to get the value but there is noticeable lag.
Is there another way to do this without capturing the whole screen? I think that is the cause of the lag.
Is there a way to drastically speed up this process?
Is it possible to constrain ig.grab() to (358, 402) and get the value from there?
You will probably find it faster to use mss, which is specifically designed to provide high speed screenshot capabilities in Python, and can be used like so:
import mss

with mss.mss() as sct:
    pic = sct.grab({'mon': 1, 'top': 358, 'left': 402, 'width': 1, 'height': 1})
    g = pic.pixel(0, 0)
See the mss documentation for more information. The most important thing is to avoid repeatedly entering with mss.mss() as sct; instead, re-use a single object, as in the sketch below.
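For a live loop, a minimal sketch of that re-use (same 1x1 region as in the snippet above):
import mss

sct = mss.mss()  # create the grabber once, outside the loop
region = {'top': 358, 'left': 402, 'width': 1, 'height': 1}

while True:
    g = sct.grab(region).pixel(0, 0)  # (R, G, B) of the single captured pixel
    print(g)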
The following change will speed it up by 15%:
from PIL import ImageGrab as ig

def get_pixel(pixel=(358, 402)):
    # grab only a 1x1 bounding box around the pixel instead of the full screen
    pixel_boundary = pixel + (pixel[0] + 1, pixel[1] + 1)
    g = ig.grab(pixel_boundary)
    return g.getpixel((0, 0))
Runtime:
proposed: 383 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
original: 450 ms ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)