Copy multiple Azure containers to newly created containers efficiently using Python

I am copying the contents of multiple containers in Azure Storage and writing them to a set of new containers, and I want to know the most efficient way to do this.
The existing containers are called cycling-input-1, cycling-input-2, ..., and their contents are written to new containers called cycling-output-1, cycling-output-2, etc. The containers all hold the same type of content (JPEGs).
The for loop below creates a new container (cycling-output) with the required suffix and then copies the blobs from the matching cycling-input container into it. I have about 30 containers, each with thousands of images, so I'm not sure this is the best way to do it (it's slow). Is there a better way?
from azure.storage.blob.baseblobservice import BaseBlobService

account_name = 'name'
account_key = 'key'

# connect to the storage account
blob_service = BaseBlobService(account_name=account_name, account_key=account_key)

# get a list of the containers that need to be processed
cycling_containers = blob_service.list_containers(prefix='cycling-input')

# check the list of containers
for c in cycling_containers:
    print(c.name)

# copy across the blobs from existing containers to new containers with a prefix cycling-output
prefix_of_new_container = 'cycling-output-'

for c in cycling_containers:
    contname = c.name
    generator = blob_service.list_blobs(contname)
    container_index = ''.join(filter(str.isdigit, contname))
    for blob in generator:
        flag_of_new_container = blob_service.create_container("%s%s" % (prefix_of_new_container, container_index))
        blob_service.copy_blob("%s%s" % (prefix_of_new_container, container_index), blob.name,
                               "https://%s.blob.core.windows.net/%s/%s" % (account_name, contname, blob.name))

The simple way is to use the multiprocessing module to copy the blobs of all containers in parallel into new containers, named by replacing input with output.
Here is my sample code for reference.
from azure.storage.blob.baseblobservice import BaseBlobService
import multiprocessing

account_name = '<your account name>'
account_key = '<your account key>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

cycling_containers = blob_service.list_containers(prefix='cycling-input')

def putBlobCopyTriples(queue, num_of_workers):
    for c in cycling_containers:
        container_name = c.name
        new_container_name = container_name.replace('input', 'output')
        blob_service.create_container(new_container_name)
        for blob in blob_service.list_blobs(container_name):
            blob_url = "https://%s.blob.core.windows.net/%s/%s" % (account_name, container_name, blob.name)
            queue.put((new_container_name, blob.name, blob_url))
    # one sentinel per worker so every worker knows when to stop
    for i in range(num_of_workers):
        queue.put((None, None, None))

def copyWorker(lock, queue, sn):
    while True:
        with lock:
            (new_container_name, blob_name, new_blob_url) = queue.get()
        if new_container_name is None:
            break
        print(sn, new_container_name, blob_name, new_blob_url)
        blob_service.copy_blob(new_container_name, blob_name, new_blob_url)

if __name__ == '__main__':
    num_of_workers = 4  # the number of workers you want; for example, 4 is my CPU core count

    lock = multiprocessing.Lock()
    queue = multiprocessing.Queue()

    multiprocessing.Process(target=putBlobCopyTriples, args=(queue, num_of_workers)).start()
    workers = [multiprocessing.Process(target=copyWorker, args=(lock, queue, i)) for i in range(num_of_workers)]

    for p in workers:
        p.start()
Note: Besides the CPU core count of your environment, the copy speed is limited by your IO bandwidth. More workers is not always better; I recommend a worker count equal to or less than your CPU core count or hyper-threading count.
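For reference, the same producer/worker pattern can also be written more compactly with multiprocessing.Pool, which manages the task queue and shutdown for you. This is only a sketch under the same assumptions as above (blob_service and account_name defined at module level, fork-style process start):

import multiprocessing

def copy_one(new_container_name, blob_name, blob_url):
    # copy_blob starts a server-side copy from the source URL
    blob_service.copy_blob(new_container_name, blob_name, blob_url)

def build_copy_triples():
    # gather (destination container, blob name, source URL) for every blob
    triples = []
    for c in blob_service.list_containers(prefix='cycling-input'):
        new_container_name = c.name.replace('input', 'output')
        blob_service.create_container(new_container_name)
        for blob in blob_service.list_blobs(c.name):
            blob_url = "https://%s.blob.core.windows.net/%s/%s" % (account_name, c.name, blob.name)
            triples.append((new_container_name, blob.name, blob_url))
    return triples

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        pool.starmap(copy_one, build_copy_triples())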

Related

Unable to add Task To Azure Batch Model

I am trying to create an Azure Batch job with a task that uses output_files as a task parameter:
tasks = list()
command_task = (r"cmd /c dir")
# Not providing actual property value for security purpose
containerName = r'ContainerName'
azureStorageAccountName = r'AccountName'
azureStorageAccountKey = r'AccountKey'
sas_Token = generate_account_sas(
    account_name=azureStorageAccountName,
    account_key=azureStorageAccountKey,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True, write=True),
    expiry=datetime.datetime.utcnow() + timedelta(hours=1))
url = f"https://{azureStorageAccountName}.blob.core.windows.net/{containerName}?{sas_Token}"
output_file = batchmodels.OutputFile(
    file_pattern=r"..\std*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(container_url=url),
        path="abc"),
    upload_options='taskCompletion')
tasks.append(batchmodels.TaskAddParameter(
    id='Task1', display_name='Task1', command_line=command_task,
    user_identity=user, output_files=[output_file]))
batch_service_client.task.add_collection(job_id, tasks)
On debugging this code I get an exception, but on removing the output_files parameter everything works fine and the job is created with the task.
I had missed the OutputFileUploadOptions wrapper while creating the OutputFile object:
output_file = batchmodels.OutputFile(
    file_pattern=r"..\std*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(container_url=url),
        path="abc"),
    upload_options=batchmodels.OutputFileUploadOptions('taskCompletion'))

How to provide multiprocessing.Process unique variables

I have a list containing ID numbers. I want each process to make an API call with a unique ID number while running the same functions and applying the same conditional statements as the others. I have tried to make sense of it, but there is not a lot online about this procedure.
I thought of using a for loop, but I don't want every process running the loop and picking up every item in the list; I just need each item to be associated with one process.
I was thinking something like this:
from multiprocessing import Process
import requests, json

ID_NUMBERS = ["ID 1", "ID 2", "ID 3".... ETC]
BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

a = 0
for x in ID_NUMBERS:
    def[a]():   # pseudocode: one function per ID
        while Active_live_data == True:
            # continuously loops over, requesting data from the website
            unique_api_call = "{}/livedata[{}]".format(BASE_URL, x)
            request_it = requests.get(unique_api_call, headers=KEY)
            show_it = (json.loads(request_it.content))
            #some extra conditional code...
    a += 1

processes = []
b = 0
for _ in range(len(ID_NUMBERS)):
    p = multiprocessing.Process(target = b)
    p.start()
    processes.append(p)
    b += 1
Any help would be greatly appreciated!
Kindest regards,
Andrew
You can use the map function:
import multiprocessing as mp
num_cores = mp.cpu_count()
pool = mp.Pool(processes=num_cores)
results = pool.map(your_function, list_of_IDs)
This will execute the function your_function, each time with a different item from the list list_of_IDs, and the values returned by your_function will be stored in a list of values (results).
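To tie this to the question, your_function would take one ID and make the API call for it. A sketch (the URL pattern and key are the placeholders from the question):

import multiprocessing as mp
import requests, json

BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

def your_function(id_number):
    # each worker process receives a different ID from the list
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id_number)
    response = requests.get(unique_api_call, headers=KEY)
    return json.loads(response.content)

if __name__ == '__main__':
    list_of_IDs = ["ID 1", "ID 2", "ID 3"]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(your_function, list_of_IDs)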
Same approach as @AlessiaM's, but using the high-level API in the concurrent.futures module.
import concurrent.futures as mp
import requests, json

BASE_URL = ''
KEY = {"KEY": "12345"}
ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]

def job(id):
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id)
    request_it = requests.get(unique_api_call, headers=KEY)
    show_it = (json.loads(request_it.content))
    return show_it

# Default to as many workers as there are processors,
# but since your job is IO bound (vs CPU bound),
# you could increase this to an even bigger figure by giving the `max_workers` parameter
with mp.ProcessPoolExecutor() as pool:
    results = pool.map(job, ID_NUMBERS)
# Process results here
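Since the job is IO bound, a ThreadPoolExecutor variant of the same sketch with an explicit max_workers is also an option (threads avoid the process start-up and pickling overhead):

import concurrent.futures

# assumes the same job() and ID_NUMBERS as above
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(job, ID_NUMBERS))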

Memory scanner for any program in Python

I am trying to create a memory scanner, similar to Cheat Engine, but only to extract information.
I know how to get the PID (in this case for "notepad.exe"), but I don't have any idea how to know which specific addresses belong to the program I am scanning.
While looking for examples, I saw someone scanning every address from one point to another, but that is too slow. Then I tried to use a batch size (scanning a chunk of memory instead of one address at a time). The problem is that if the batch size is too small it still takes a long time, and if it is too big it is possible to miss many addresses that belong to the program, because ReadProcessMemory returns False for the first address even though the next one might succeed. Here is my example.
import ctypes as c
from ctypes import wintypes as w
import psutil
import sys
from sys import stdout
write = stdout.write
import numpy as np

def get_client_pid(process_name):
    pid = None
    for proc in psutil.process_iter():
        if proc.name() == process_name:
            pid = int(proc.pid)
            print(f"Found '{process_name}' PID = ", pid, f" hex_value = {hex(pid)}")
            break
    if pid == None:
        print('Program Not found')
    return pid

pid = get_client_pid("notepad.exe")
if pid == None:
    sys.exit()

k32 = c.WinDLL('kernel32', use_last_error=True)

OpenProcess = k32.OpenProcess
OpenProcess.argtypes = [w.DWORD, w.BOOL, w.DWORD]
OpenProcess.restype = w.HANDLE

ReadProcessMemory = k32.ReadProcessMemory
ReadProcessMemory.argtypes = [w.HANDLE, w.LPCVOID, w.LPVOID, c.c_size_t, c.POINTER(c.c_size_t)]
ReadProcessMemory.restype = w.BOOL

GetLastError = k32.GetLastError
GetLastError.argtypes = None
GetLastError.restype = w.DWORD

CloseHandle = k32.CloseHandle
CloseHandle.argtypes = [w.HANDLE]
CloseHandle.restype = w.BOOL

processHandle = OpenProcess(0x10, False, int(pid))  # 0x10 = PROCESS_VM_READ

# addr = 0x0FFFFFFFFFFF
data = c.c_ulonglong()
bytesRead = c.c_ulonglong()
start = 0x000000000000
end = 0x7fffffffffff
batch_size = 2**13
MemoryData = np.zeros(batch_size, 'l')
Size = MemoryData.itemsize * MemoryData.size
index = 0
Data_address = []

for c_adress in range(start, end, batch_size):
    result = ReadProcessMemory(processHandle, c.c_void_p(c_adress), MemoryData.ctypes.data,
                               Size, c.byref(bytesRead))
    if result:  # save the addresses that could be read
        Data_address.extend(list(range(c_adress, c_adress + batch_size)))

e = GetLastError()
CloseHandle(processHandle)
I decided to go from 0x000000000000 to 0x7fffffffffff because Cheat Engine scans this range. I am still a beginner with this kind of memory scanning; maybe there are things I can do to improve the efficiency.
I suggest you take advantage of existing Python libraries that can analyse Windows 10 memory.
I'm no specialist, but I've found Volatility. It seems to be pretty useful for your problem.
For running that tool you need Python 2 (Python 3 won't work).
For running Python 2 and 3 on the same Windows 10 machine, follow this tutorial (the screenshots are in Spanish but it can easily be followed).
Then see this cheat sheet with the main commands. You can dump the memory and then operate on the file.
Perhaps this leads you to the solution :) At least the most basic command, pslist, dumps all the running processes' addresses.
psutil has proc.memory_maps().
Pass the result as map to the function below.
TargetProcess is, for example, 'Calculator.exe'.
def get_memSize(self, TargetProcess, map):
    for m in map:
        if TargetProcess in m.path:
            memSize = m.rss
            break
    return memSize
If you use this function, it returns the memory size of your target process.
my_pid is the PID for 'Calculator.exe':
def getBaseAddressWmi(self, my_pid):
    PROCESS_ALL_ACCESS = 0x1F0FFF
    processHandle = win32api.OpenProcess(PROCESS_ALL_ACCESS, False, my_pid)
    modules = win32process.EnumProcessModules(processHandle)
    processHandle.close()
    base_addr = modules[0]  # for me it worked to select the first item in the list...
    return base_addr
This gets the base address of your program, so your search range is from base_addr to base_addr + memSize.
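Putting the two pieces together outside a class, a usage sketch might look like this (it assumes psutil and pywin32 are installed; get_pid is a hypothetical helper, the rest follows the functions above):

import psutil
import win32api, win32process

PROCESS_ALL_ACCESS = 0x1F0FFF
TargetProcess = 'Calculator.exe'

def get_pid(process_name):
    # hypothetical helper: find the PID of the target process by name
    for proc in psutil.process_iter():
        if proc.name() == process_name:
            return proc.pid
    return None

my_pid = get_pid(TargetProcess)

# memory size of the target module, as in get_memSize above
maps = psutil.Process(my_pid).memory_maps()
memSize = next(m.rss for m in maps if TargetProcess in m.path)

# base address of the first loaded module, as in getBaseAddressWmi above
processHandle = win32api.OpenProcess(PROCESS_ALL_ACCESS, False, my_pid)
base_addr = win32process.EnumProcessModules(processHandle)[0]
win32api.CloseHandle(processHandle)

# the search range is then base_addr .. base_addr + memSize
print(hex(base_addr), hex(base_addr + memSize))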

Manager dictionary very slow for updating values from 100+ processes

I am having a hard time working with the Python multiprocessing module.
In a nutshell, I have a dictionary object which tracks, say, occurrences of a string across lots of S3 files. The key of the dictionary is the occurrence I am looking for, and its value is incremented by 1 each time it is found.
Sample code:
import boto3
from multiprocessing import Process, Manager
import simplejson

client = boto3.client('s3')
bucket_name = "bucket_name"
occurences_to_find = ["occ1", "occ2", "occ3"]
list_contents = []

def getS3Key(prefix_name, occurence_dict):
    kwargs = {'Bucket': bucket_name, 'Prefix': prefix_name}
    while True:
        value = client.list_objects_v2(**kwargs)
        try:
            contents = value['Contents']
            for obj in contents:
                key = obj['Key']
                yield key
            try:
                kwargs['ContinuationToken'] = value['NextContinuationToken']
            except KeyError:
                break
        except KeyError:
            break

def getS3Object(s3_key, occurence_dict):
    s3_object = client.get_object(Bucket=bucket_name, Key=s3_key)
    records = s3_object['Body'].read().splitlines()  # one JSON record per line
    for record in records:
        record_json = simplejson.loads(record)
        msg = record_json["msg"]
        for occurence in occurence_dict:
            if occurence in msg:
                occurence_dict[str(occurence)] += 1
                break

'''each process will hit this function'''
def doWork(prefix_name_list, occurence_dict):
    for prefix_name in prefix_name_list:
        for s3_key in getS3Key(prefix_name, occurence_dict):
            getS3Object(s3_key, occurence_dict)

def main():
    manager = Manager()
    '''shared dictionary between processes'''
    occurence_dict = manager.dict()
    procs = []
    s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
    for occurrence in occurences_to_find:
        occurence_dict[occurrence] = 0
    for index, prefix_name_list in enumerate(s3_prefixes):
        proc = Process(target=doWork, args=(prefix_name_list, occurence_dict))
        procs.append(proc)
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(occurence_dict)

main()
I am having issues with the speed of the code, as it takes hours to run with more than 10000 S3 prefixes and keys. I think the manager dictionary is shared and locked by each process, so it is not being updated concurrently; rather, one process waits for it to be "released".
How can I update the dictionary in parallel? Or how can I maintain a separate dict per process and combine the results at the end?
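One way to sketch the second idea from the question (a separate dict per process, merged at the end) is to have each worker build a local collections.Counter and return it, then combine the counters in the parent. This is only an illustration; count_prefix_list and iter_messages are hypothetical stand-ins for the doWork/getS3Object logic above:

from collections import Counter
from multiprocessing import Pool

occurences_to_find = ["occ1", "occ2", "occ3"]

def count_prefix_list(prefix_name_list):
    # hypothetical stand-in for doWork(): count matches in a purely local
    # Counter, with no shared state and therefore no locking
    local_counts = Counter({occ: 0 for occ in occurences_to_find})
    for prefix_name in prefix_name_list:
        for msg in iter_messages(prefix_name):  # iter_messages: placeholder for the S3 reading code above
            for occ in occurences_to_find:
                if occ in msg:
                    local_counts[occ] += 1
                    break
    return local_counts

if __name__ == '__main__':
    s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
    with Pool(processes=len(s3_prefixes)) as pool:
        partial_counts = pool.map(count_prefix_list, s3_prefixes)
    total = sum(partial_counts, Counter())  # merge the per-process counters once
    print(dict(total))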

Pickling issue while using Pool to check the count of files

I have code that spawns multiple processes to check the line count of files and maintain the records in a database. The working code is below:
import multiprocessing as mp
from multiprocessing import Pool
import os
import time
import mysql.connector

"""Function to check the count of the file"""
def file_wc(fname):
    with open('/home/vaibhav/Desktop/Input_python/' + fname) as f:
        count = sum(1 for line in f)
    return (fname, count)

class file_audit:
    def __init__(self):
        """Initialising the constructor for getting the names of files
        and referencing the outside-class function"""
        folder = '/home/vaibhav/Desktop/Input_python'
        self.fnames = (name for name in os.listdir(folder))
        self.file_wc = file_wc

    def count_check(self):
        "Creating 4 worker processes to check the count of the files in parallel"
        pool = Pool(4)
        self.m = list(pool.map(self.file_wc, list(self.fnames), 4))
        pool.close()
        pool.join()

    def database_updation(self):
        """To maintain an entry in the database with details
        like filename and records present in the file"""
        self.db = mysql.connector.connect(host="localhost", user="root", password="root", database="python_showtime")
        # prepare a cursor object using cursor() method
        self.cursor = self.db.cursor()
        query_string = ("INSERT INTO python_showtime.audit_capture"
                        "(name,records) "
                        "VALUES(%s,%s)")
        #data_user = (name,records)
        for each in self.m:
            self.cursor.execute(query_string, each)
        self.db.commit()
        self.cursor.close()

start_time = time.time()
print("My program took", time.time() - start_time, "to run")
#if __name__ == '__main__':
x = file_audit()
x.count_check()        # To check the count by spawning multiple processes
x.database_updation()  # To maintain the entry in the database
Point to be considered:
Now, if I put the function inside the class and comment out self.file_wc = file_wc in the constructor, I get the error "can't pickle generator objects". I have a fair understanding that some objects cannot be pickled, so I want to know what exactly is happening in the background, in very simple terms. I got the reference from here or here to make the code work.
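For what it's worth, the error comes from the fact that pool.map has to pickle whatever it sends to the worker processes. When file_wc is a plain module-level function it pickles by reference just fine, but a bound method self.file_wc drags the whole instance with it, including the generator self.fnames, and generators cannot be pickled. A minimal illustration:

import pickle

gen = (name for name in ["a.txt", "b.txt"])
try:
    pickle.dumps(gen)
except TypeError as e:
    print(e)  # e.g. "cannot pickle 'generator' object"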
