How can i optimize my Python loop for speed - python

I wrote some code that uses OCR to extract text from screenshots of follower lists and then transfer them into a data frame.
The reason I have to do the hustle with "name" / "display name" and removing blank lines is that the initial text extraction looks something like this:
Screenname 1
name 1
Screenname 2
name 2
(and so on)
So I know in which order each extraction will be.
My code works well for 1-30 images, but if I take more than that its gets a bit slow. My goal is to run around 5-10k screenshots through it at once. I'm pretty new to programming so any ideas/tips on how to optimize the speed would be very appreciated! Thank you all in advance :)
from PIL import Image
from pytesseract import pytesseract
import os
import pandas as pd
from itertools import chain
list_final = [""]
list_name = [""]
liste_anzeigename = [""]
list_raw = [""]
anzeigename = [""]
name = [""]
sort = [""]
f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"
os.listdir(f)
for file in os.listdir(f):
f_img = f+"/"+file
img = Image.open(f_img)
img = img.crop((240, 400, 800, 2400))
img.save(f_img)
for file in os.listdir(f):
f_img = f + "/" + file
test = pytesseract.image_to_string(PIL.Image.open(f_img), config=myconfig)
lines = test.split("\n")
list_raw = [line for line in lines if line.strip() != ""]
sort.append(list_raw)
name = {list_raw[0], list_raw[2], list_raw[4],
list_raw[6], list_raw[8], list_raw[10],
list_raw[12], list_raw[14], list_raw[16]}
list_name.append(name)
anzeigename = {list_raw[1], list_raw[3], list_raw[5],
list_raw[7], list_raw[9], list_raw[11],
list_raw[13], list_raw[15], list_raw[17]}
liste_anzeigename.append(anzeigename)
reihenfolge_name = list(chain.from_iterable(list_name))
index_anzeigename = list(chain.from_iterable(liste_anzeigename))
sortieren = list(chain.from_iterable(sort))
print(list_raw)
sort_name = sorted(reihenfolge_name, key=sortieren.index)
sort_anzeigename = sorted(index_anzeigename, key=sortieren.index)
final = pd.DataFrame(zip(sort_name, sort_anzeigename), columns=['name', 'anzeigename'])
print(final)

Use a multiprocessing.Pool.
Combine the code under the for-loops, and put it into a function process_file.
This function should accept a single argument; the name of a file to process.
Next using listdir, create a list of files to process.
Then create a Pool and use its map method to process the list;
import multiprocessing as mp
def process_file(name):
# your code goes here.
return anzeigename # Or watever the result should be.
if __name__ is "__main__":
f = r'/Users/PycharmProjects/pythonProject/images'
p = mp.Pool()
liste_anzeigename = p.map(process_file, os.listdir(f))
This will run your code in parallel in as many cores as your CPU has.
For a N-core CPU this will take approximately 1/N times the time as doing it without multiprocessing.
Note that the return value of the worker function should be pickleable; it has to be returned from the worker process to the parent process.

Related

how to provide multiprocessing.process unique variblables

I have a list containing ID Number's, I want to implement every unique ID Number in an API call for each Multiprocessor whilst running the same corresponding functions, implementing the same conditional statements to each processor etc. I have tried to make sense of it but there is not a lot online about this procedure.
I thought to use a for loop but I don't want every processor running this for loop picking up every item in a list. I just need each item to be associated to each processor.
I was thinking something like this:
from multiprocessing import process
import requests, json
ID_NUMBERS = ["ID 1", "ID 2", "ID 3".... ETC]
BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}
a = 0
for x in ID_NUMBERS:
def[a]():
while Active_live_data == True:
# continuously loops over, requesting data from the website
unique_api_call = "{}/livedata[{}]".format(BASE_URL, x)
request_it = requests.get(unique_api_call, headers=KEY)
show_it = (json.loads(request_it.content))
#some extra conditional code...
a += 1
processes = []
b = 0
for _ in range(len(ID_NUMBERS))
p = multiprocessing.Process(target = b)
p.start()
processes.append(p)
b += 1
Any help would be greatly appreciated!
Kindest regards,
Andrew
You can use the map function:
import multiprocessing as mp
num_cores = mp.cpu_count()
pool = mp.Pool(processes=num_cores)
results = pool.map(your_function, list_of_IDs)
This will execute the function your_function, each time with a different item from the list list_of_IDs, and the values returned by your_function will be stored in a list of values (results).
Same approach as #AlessiaM but uses the high-level api in the concurrent.futures module.
import concurrent.futures as mp
import requests, json
BASE_URL = ''
KEY = {"KEY": "12345"}
ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]
def job(id):
unique_api_call = "{}/livedata[{}]".format(BASE_URL, id)
request_it = requests.get(unique_api_call, headers=KEY)
show_it = (json.loads(request_it.content))
return show_it
# Default to as many workers as there are processors,
# But since your job is IO bound (vs CPU bound),
# you could increase this to an even bigger figure by giving the `max_workers` parameter
with mp.ProcessPoolExecutor() as pool:
results = pool.map(job,ID_NUMBERS)
# Process results here

Using concurrent.futures within a for statement

I store QuertyText within a pandas dataframe. Once I've loaded all the queries into I want to conduct an analysis again each query. Currently, I have ~50k to evaluate. So, doing it one by one, will take a long time.
So, I wanted to implement concurrent.futures. How do I take the individual QueryText stored within fullAnalysis as pass it to concurrent.futures and return the output as a variable?
Here is my entire code:
import pandas as pd
import time
import gensim
import sys
import warnings
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
fullAnalysis = pd.DataFrame()
def fetch_data(jFile = 'ProcessingDetails.json'):
print("Fetching data...please wait")
#read JSON file for latest dictionary file name
baselineDictionaryFileName = 'Dictionary/Dictionary_05-03-2020.json'
#copy data to pandas dataframe
labelled_data = pd.read_json(baselineDictionaryFileName)
#Add two more columns to get the most similar text and score
labelled_data['SimilarText'] = ''
labelled_data['SimilarityScore'] = float()
print("Data fetched from " + baselineDictionaryFileName + " and there are " + str(labelled_data.shape[0]) + " rows to be evalauted")
return labelled_data
def calculateScore(inputFunc):
warnings.filterwarnings("ignore", category=DeprecationWarning)
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
inp = inputFunc
print(inp)
out = dict()
strEvaluation = inp.split("most_similar ",1)[1]
#while inp != 'quit':
split_inp = inp.split()
try:
if split_inp[0] == 'help':
pass
elif split_inp[0] == 'similarity' and len(split_inp) >= 3:
pass
elif split_inp[0] == 'most_similar' and len(split_inp) >= 2:
for pair in model.most_similar(positive=[split_inp[1]]):
out.update({pair[0]: pair[1]})
except KeyError as ke:
#print(str(ke) + "\n")
inp = input()
return out
def main():
with ThreadPoolExecutor(max_workers=5) as executor:
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
#for item in executor.map(calculateScore, arg):
output = executor.map(calculateScore, arg)
return output
if __name__ == "__main__":
fullAnalysis = fetch_data()
results = main()
print(f'results: {results}')
The Python Global Interpreter Lock or GIL allows only one thread to hold control of the Python interpreter. Since your function calculateScore might be cpu-bound and requires the interpreter to execute its byte code, you may be gaining little by using threading. If, on the other hand, it were doing mostly I/O operations, it would be giving up the GIL for most of its running time allowing other threads to run. But that does not seem to be the case here. You probably should be using the ProcessPoolExecutor from concurrent.futures (try it both ways and see):
def main():
with ProcessPoolExecutor(max_workers=None) as executor:
the_futures = {}
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
future = executor.submit(calculateScore, arg)
the_futures[future] = i # map future to request
for future in as_completed(the_futures): # results as they become available not necessarily the order of submission
i = the_futures[future] # the original index
result = future.result() # the result
If you omit the max_workers parameter (or specify a value of None) from the ProcessPoolExecutor constructor, the default will be the number of processors you have on your machine (not a bad default). There is no point in specifying a value larger than the number of processors you have.
If you do not need to tie the future back to the original request, then the_futures can just be a list to which But simplest yest in not even to bother to use the as_completed method:
def main():
with ProcessPoolExecutor(max_workers=5) as executor:
the_futures = []
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
future = executor.submit(calculateScore, arg)
the_futures.append(future)
# wait for the completion of all the results and return them all:
results = [f.result() for f in the_futures()] # results in creation order
return results
It should be mentioned that code that launches the ProcessPoolExecutor functions should be in a block governed by a if __name__ = '__main__':. If it isn't you will get into a recursive loop with each subprocess launching the ProcessPoolExecutor. But that seems to be the case here. Perhaps you meant to use the ProcessPoolExecutor all along?
Also:
I don't know what the line ...
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
... in function calculateStore does. It may be the one i/o-bound statement. But this appears to be something that does not vary from call to call. If that is the case and model is not being modified in the function, shouldn't this statement be moved out of the function and computed just once? Then this function would clearly run faster (and be clearly cpu-bound).
Also:
The exception block ...
except KeyError as ke:
#print(str(ke) + "\n")
inp = input()
... is puzzling. You are inputting a value that will never be used right before returning. If this is to pause execution, there is no error message being output.
With Booboo assistance, I was able to update code to include ProcessPoolExecutor. Here is my updated code. Overall, processing has been speed up by more than 60%.
I did run into a processing issue and found this topic BrokenPoolProcess that addresses the issue.
output = {}
thePool = {}
def main(labelled_data, dictionaryRevised):
args = sys.argv[1:]
with ProcessPoolExecutor(max_workers=None) as executor:
for i in range(len(labelled_data)):
text = labelled_data['QueryText'][i]
arg = 'most_similar'+ ' '+ text
output = winprocess.submit(
executor, calculateScore, arg
)
thePool[output] = i #original index for future to request
for output in as_completed(thePool): # results as they become available not necessarily the order of submission
i = thePool[output] # the original index
text = labelled_data['QueryText'][i]
result = output.result() # the result
maximumKey = max(result.items(), key=operator.itemgetter(1))[0]
maximumValue = result.get(maximumKey)
labelled_data['SimilarText'][i] = maximumKey
labelled_data['SimilarityScore'][i] = maximumValue
return labelled_data, dictionaryRevised
if __name__ == "__main__":
start = time.perf_counter()
print("Starting to evaluate Query Text for labelling...")
output_Labelled_Data, output_dictionary_revised = preProcessor()
output,dictionary = main(output_Labelled_Data, output_dictionary_revised)
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Memory scanner for any program in Python

I am trying to create a memory scanner. similar to Cheat Engine. but only for extract information.
I know how to get the pid (in this case is "notepad.exe"). But I don't have any Idea about how to know wicht especific adress belong to the program that I am scanning.
Trying to looking for examples. I could see someone it was trying to scan every adress since one point to other. But it's to slow. Then I try to create a batch size (scan a part of memory and not one by one each adress). The problem is if the size is to short. still will take a long time. and if it is to long, is possible to lose many adress who are belong to the program. Because result from ReadMemoryScan is False in the first Adress, but It can be the next one is true. Here is my example.
import ctypes as c
from ctypes import wintypes as w
import psutil
from sys import stdout
write = stdout.write
import numpy as np
def get_client_pid(process_name):
pid = None
for proc in psutil.process_iter():
if proc.name() == process_name:
pid = int(proc.pid)
print(f"Found '{process_name}' PID = ", pid,f" hex_value = {hex(pid)}")
break
if pid == None:
print('Program Not found')
return pid
pid = get_client_pid("notepad.exe")
if pid == None:
sys.exit()
k32 = c.WinDLL('kernel32', use_last_error=True)
OpenProcess = k32.OpenProcess
OpenProcess.argtypes = [w.DWORD,w.BOOL,w.DWORD]
OpenProcess.restype = w.HANDLE
ReadProcessMemory = k32.ReadProcessMemory
ReadProcessMemory.argtypes = [w.HANDLE,w.LPCVOID,w.LPVOID,c.c_size_t,c.POINTER(c.c_size_t)]
ReadProcessMemory.restype = w.BOOL
GetLastError = k32.GetLastError
GetLastError.argtypes = None
GetLastError.restype = w.DWORD
CloseHandle = k32.CloseHandle
CloseHandle.argtypes = [w.HANDLE]
CloseHandle.restype = w.BOOL
processHandle = OpenProcess(0x10, False, int(pid))
# addr = 0x0FFFFFFFFFFF
data = c.c_ulonglong()
bytesRead = c.c_ulonglong()
start = 0x000000000000
end = 0x7fffffffffff
batch_size = 2**13
MemoryData = np.zeros(batch_size, 'l')
Size = MemoryData.itemsize*MemoryData.size
index = 0
Data_address = []
for c_adress in range(start,end,batch_size):
result = ReadProcessMemory(processHandle,c.c_void_p(c_adress), MemoryData.ctypes.data,
Size, c.byref(bytesRead))
if result: # Save adress
Data_address.extend(list(range(c_adress,c_adress+batch_size)))
e = GetLastError()
CloseHandle(processHandle)
I decided from 0x000000000000 to 0x7fffffffffff Because cheat engine scan this size. I am still a begginer with this kind of this about memory scan. maybe there are things that I can do to improve the efficiency.
I suggest you take advantage of existing python libraries that can analyse Windows 10 memory.
I'm no specialist but I've found Volatility. Seems to be pretty useful for your problem.
For running that tool you need Python 2 (Python 3 won't work).
For running python 2 and 3 in the same Windows 10 machine, follow this tutorial (The screenshots are in Spanish but it can easily be followed).
Then see this cheat sheet with main commands. You can dump the memory and then operate on the file.
Perhaps this leads you to the solution :) At least the most basic command pslist dumps all the running processes addresses.
psutil has proc.memory_maps()
pass the result as map to this function
TargetProcess eaxample 'Calculator.exe'
def get_memSize(self,TargetProcess,map):
for m in map:
if TargetProcess in m.path:
memSize= m.rss
break
return memSize
if you use this function, it returns the memory size of your Target Process
my_pid is the pid for 'Calculator.exe'
def getBaseAddressWmi(self,my_pid):
PROCESS_ALL_ACCESS = 0x1F0FFF
processHandle = win32api.OpenProcess(PROCESS_ALL_ACCESS, False, my_pid)
modules = win32process.EnumProcessModules(processHandle)
processHandle.close()
base_addr = modules[0] # for me it worked to select the first item in list...
return base_addr
to get the base address of your prog
so you search range is from base_addr to base_addr + memSize

Python script to alert on empty/missing logs

I am working on a project to check a file directory and automatically add log files as they are created. A file is being generated every five minutes, but some of the files are being created with a "0" filesize and I would like to alert when this happens.
So the sequence of steps I would like to have are essentially:
Get time (MM:DD:YY HH:MM:SS) *Not sure if I need to do this...
CD to Folder Directory /Netflow/YY/MM/DD
Search for filename "nfcapd.YYYYMMDDHHMM" where MM increments by 5.
If filesize is 0, then email Johnny, Sally and Jimmy
Wait 6 minutes and repeat
This is what I have pieced together thus far. How can I get the desired functionality?
import os
def is_non_zero_file(fpath): storage/Netflow/
return True if os.path.isfile(fpath) and os.path.getsize(fpath) > 0 else False
# I need to check storage/Netflow for files named by time e.g 13_56_05.txt
while True:
time.sleep(360)
In addition to enumerating the files in a given path, and subsequently filtering the files which are only zero-length, you probably want to maintain some type of state to ensure you're aren't notified multiple times of the same zero length file. That is, you probably don't want to get a notification that the same file is zero-length indefinitely (although you can modify the example below if you want said behavior).
You may optionally want to do things like verify that the file name strictly meets your naming convention. You may also want to validate the the string date-stamp included in the file name is a valid datetime.
The example below uses the glob module (itself leveraging os.listdir() and fnmatch.fnmatch()) to build up a set of possible files for inclusion. [1]
The example is intentionally simple, and leverages a single class to store log sample 'state'. KEEP_SAMPLES samples are maintained (instances of logState() in the log_states list, achieved by using list slicing.
A single alert(msg) function is supplied as a stub to something that might send mail, etc...
References:
[1] https://docs.python.org/3.2/library/glob.html
#!/usr/bin/python3
import os
import glob
import re
from datetime import datetime, timezone
import time
from pprint import pprint
class logState():
def __init__(self, log_path, glob_patt, re_patt, dt_fmt):
self.dt = datetime.now(timezone.utc)
self.log_path = log_path
self.glob_patt = glob_patt
self.re_patt = re_patt
self.dt_fmt = dt_fmt
self.empty_logs = []
self.nonempty_logs = []
# Retrieve only files from glob
self.files = [ f for f in
glob.glob(self.log_path + self.glob_patt)
if os.path.isfile(f) ]
for f in self.files:
unq_fname = f.split('/')[-1]
if unq_fname == None:
continue
# Tighter pattern matching
if re.match(re_patt, unq_fname) == None:
continue
# Get the datetime portion of the file name
f_dtstamp = unq_fname.split('.')[-1]
# Make sure the datetime stamp represents
# a valid date
if datetime.strptime(f_dtstamp, self.dt_fmt) == None:
continue
# Check file size, add to the appropriate
# list
if os.path.getsize(f) <= 0:
self.empty_logs.append(f)
else:
self.nonempty_logs.append(f)
def alert(msg):
print("ALERT!: {0}".format(msg))
if __name__ == "__main__":
# How long to sleep
SLEEP_SECS = 5
# How many samples to keep
KEEP_SAMPLES = 5
log_states = []
# Definition for what logs states we'll look for
log_path = './'
glob_patt = 'nfcapd.[0-9]*'
re_patt = 'nfcapd.([0-9]{12})'
dt_fmt = "%Y%m%d%H%M"
print("-- Setup --")
print("Sample files in '{0}'".format(log_path))
print("\t{0} samples kept:".format(KEEP_SAMPLES))
print("\tglob pattern: '{0}'".format(glob_patt))
print("\tregex pattern: '{0}'".format(re_patt))
print("\tdatetime string: '{0}'".format(dt_fmt))
print("")
# Collect the initial state
log_states.append(logState(log_path,
glob_patt,
re_patt, dt_fmt))
while True:
# Print state inventory and current state detail
print( "-- Log States Stored --")
for i, log_state in enumerate(log_states):
print("Log state {0} # {1}".format(i, log_state.dt))
print(" -- Logs size > 0 --")
pprint(log_states[-1].nonempty_logs)
print(" -- Logs size <= 0 --")
pprint(log_states[-1].empty_logs)
print("")
time.sleep(SLEEP_SECS)
log_states = log_states[-KEEP_SAMPLES+1:]
log_states.append(logState(log_path,
glob_patt,
re_patt,
dt_fmt))
# p = previous sample, c = current
p = set(log_states[-2].empty_logs)
c = set(log_states[-1].empty_logs)
# only report the items in the current sample
# not in the last
if len(c.difference(p)) > 0:
alert("\nNew zero length logs: " + str(c.difference(p)) + "\n")

Unable to display all the information except for first selection

I am using the following code to process a list of images that is found in my scene, before the gathered information, namely the tifPath and texPath is used in another function.
However, example in my scene, there are 3 textures, and hence I should be seeing 3 sets of tifPath and texPath but I am only seeing 1 of them., whereas if I am running to check surShaderOut or surShaderTex I am able to see all the 3 textures info.
For example, the 3 textures file path is as follows (in the surShaderTex): /user_data/testShader/textureTGA_01.tga, /user_data/testShader/textureTGA_02.tga, /user_data/testShader/textureTGA_03.tga
I guess what I am trying to say is that why in my for statement, it is able to print out all the 3 results and yet anything bypass that, it is only printing out a single result.
Any advices?
surShader = cmds.ls(type = 'surfaceShader')
for con in surShader:
surShaderOut = cmds.listConnections('%s.outColor' % con)
surShaderTex = cmds.getAttr("%s.fileTextureName" % surShaderOut[0])
path = os.path.dirname(surShaderTex)
f = surShaderTex.split("/")[-1]
tifName = os.path.splitext(f)[0] + ".tif"
texName = os.path.splitext(f)[0] + ".tex"
tifPath = os.path.join(path, tifName)
texPath = os.path.join(path, texName)
convertText(surShaderTex, tifPath, texPath)
Only two lines are part of your for loop. The rest only execute once.
So first this runs:
surShader = cmds.ls(type = 'surfaceShader')
for con in surShader:
surShaderOut = cmds.listConnections('%s.outColor' % con)
surShaderTex = cmds.getAttr("%s.fileTextureName" % surShaderOut[0])
Then after that loop, with only one surShader, one surShaderOut, and one surShaderTex, the following is executed once:
path = os.path.dirname(surShaderTex)
f = surShaderTex.split("/")[-1]
tifName = os.path.splitext(f)[0] + ".tif"
texName = os.path.splitext(f)[0] + ".tex"
tifPath = os.path.join(path, tifName)
texPath = os.path.join(path, texName)
Indent that the same as the lines above it, and it'll be run for each element of surShader instead of only once.

Categories