I have a project where I try to disassemble a bunch of executables with the help of angr, but I have a memory leak.
This is the main function, where I have a while loop like this:
def main():
    mypath = Path("/home/baroj/Thesis/smart_obfuscation_generator/MalwareDir")
    binaries = [join(mypath, f) for f in listdir(mypath) if
                isfile(join(mypath, f)) and '_patched' not in f.__str__()]
    while binaries:
        binary = binaries.pop(0)
        print(
            f"TIME: {time.asctime(time.localtime(time.time()))} - Starting with binary: {os.path.basename(binary)}\n")
        try:
            gmm = GeneticMalwareModifier(binary.__str__())
        except Exception as e:
            print(traceback.format_exc())
            print(f"An error occurred while reading this binary... (see log {binary}))\n")
            continue
etc...
The GeneticMalwareModifier init is like this:
class GeneticMalwareModifier:
    def __init__(self, input_file_path, population_size=40, crossover_probability=0.8, mutation_probability=0.15,
                 ngen=7, min_actions=1, max_actions=15):
        self.input_file_path = input_file_path
        self.max_actions = max_actions
        self.min_actions = min_actions
        self.cfg = binary_analyzer.make_cfg(self.input_file_path)
        self.code = binary_analyzer.make_code_dict(self.input_file_path, self.cfg)
        self.functions = binary_analyzer.build_function_objects(self.cfg, self.code)
        not_imported_functions_list = [f for f in self.functions.values() if f.address_to_instruction_dictionary]
        self.not_imported_functions = {f.address: f for f in not_imported_functions_list}
        self.levels = function.classify_functions(self.not_imported_functions)
        function.analyze_functions(self.not_imported_functions, self.levels)
        self.randomizable_functions = [f for f in self.not_imported_functions.values() if "_SEH_"
etc...
binary_analyzer.make_cfg:
def make_cfg(file_path):
    angr_project = angr.Project(file_path)
    cfg = angr_project.analyses.CFGEmulated()
    return cfg
Usually I catch errors coming from self.cfg = binary_analyzer.make_cfg(self.input_file_path), but that's not a problem and execution resumes with another file. The problem is that angr seems to keep some references, which causes memory leaks.
I want to add that currently this program doesn't get past self.cfg = binary_analyzer.make_cfg(self.input_file_path).
I used memory_profiler, tracemalloc and heapy/guppy, and the problem seems to be angr.
Whenever it starts to read an executable (and maybe hits an error), it causes a huge memory leak.
This is heapy/guppy before/after gmm = GeneticMalwareModifier(binary.__str__()):
Heap Status After Creating Few Objects :
Heap Size : 109499491 bytes
Partition of a set of 1039344 objects. Total size = 109499491 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 120515 12 27959480 26 27959480 26 dict of angr.sim_type.SimTypePointer
1 105025 10 19799760 18 47759240 44 dict (no owner)
2 124235 12 12920440 12 60679680 55 dict of angr.sim_type.SimTypeInt
3 49188 5 7083072 6 67762752 62 dict of angr.sim_type.SimStruct
4 124235 12 5963280 5 73726032 67 angr.sim_type.SimTypeInt
5 120515 12 5784720 5 79510752 73 angr.sim_type.SimTypePointer
6 49904 5 5190272 5 84701024 77 dict of angr.sim_type.SimTypeChar
7 37824 4 3238976 3 87940000 80 list
8 29947 3 3114488 3 91054488 83 dict of angr.sim_type.SimTypeBottom
9 28830 3 2557378 2 93611866 85 str
Once it has read one executable, this stuff remains on the heap forever, and it keeps growing.
This is a few iterations later:
Partition of a set of 1220070 objects. Total size = 131637306 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 120578 10 27982160 21 27982160 21 dict of angr.sim_type.SimTypePointer
1 105137 9 19961560 15 47943720 36 dict (no owner)
2 124417 10 12939368 10 60883088 46 dict of angr.sim_type.SimTypeInt
3 42652 3 9212832 7 70095920 53 frozenset
4 49250 4 7092000 5 77187920 59 dict of angr.sim_type.SimStruct
5 124417 10 5972016 5 83159936 63 angr.sim_type.SimTypeInt
6 120578 10 5787744 4 88947680 68 angr.sim_type.SimTypePointer
7 49929 4 5192872 4 94140552 72 dict of angr.sim_type.SimTypeChar
8 38065 3 3256696 2 97397248 74 list
9 14598 1 3185936 2 100583184 76 set
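Not part of the original question, but a common mitigation for exactly this pattern is to run each angr analysis in a short-lived child process, so that whatever references angr keeps die with the process and the OS reclaims the memory. A minimal sketch under that assumption, reusing the names from the post (run_one is a hypothetical wrapper around the per-binary work):
import multiprocessing

def run_one(binary):
    # Hypothetical wrapper: all angr work happens inside the child process.
    gmm = GeneticMalwareModifier(str(binary))
    # ... rest of the per-binary processing ...

if __name__ == "__main__":
    for binary in binaries:
        p = multiprocessing.Process(target=run_one, args=(binary,))
        p.start()
        p.join()  # when the child exits, all of its heap goes back to the OS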
Apparently list(a) doesn't overallocate, [x for x in a] overallocates at some points, and [*a] overallocates all the time?
Here are sizes n from 0 to 12 and the resulting sizes in bytes for the three methods:

 n   list(a)   [x for x in a]   [*a]
 0      56           56           56
 1      64           88           88
 2      72           88           96
 3      80           88          104
 4      88           88          112
 5      96          120          120
 6     104          120          128
 7     112          120          136
 8     120          120          152
 9     128          184          184
10     136          184          192
11     144          184          200
12     152          184          208
Computed like this, reproducible at repl.it, using Python 3.8:
from sys import getsizeof

for n in range(13):
    a = [None] * n
    print(n, getsizeof(list(a)),
          getsizeof([x for x in a]),
          getsizeof([*a]))
So: How does this work? How does [*a] overallocate? Actually, what mechanism does it use to create the result list from the given input? Does it use an iterator over a and use something like list.append? Where is the source code?
(Plots zooming in to smaller n and out to larger n, and the Colab with the data and code that produced them, omitted here.)
[*a] is internally doing the C equivalent of:

1. Make a new, empty list.
2. Call newlist.extend(a).
3. Return the list.
So if you expand your test to:
from sys import getsizeof

for n in range(13):
    a = [None] * n
    l = []
    l.extend(a)
    print(n, getsizeof(list(a)),
          getsizeof([x for x in a]),
          getsizeof([*a]),
          getsizeof(l))
you'll see the results for getsizeof([*a]) and l = []; l.extend(a); getsizeof(l) are the same.
This is usually the right thing to do; when extending you're usually expecting to add more later, and similarly for generalized unpacking, it's assumed that multiple things will be added one after the other. [*a] is not the normal case; Python assumes there are multiple items or iterables being added to the list ([*a, b, c, *d]), so overallocation saves work in the common case.
By contrast, a list constructed from a single, presized iterable (with list()) may not grow or shrink during use, and overallocating is premature until proven otherwise; Python recently fixed a bug that made the constructor overallocate even for inputs with known size.
As for list comprehensions, they're effectively equivalent to repeated appends, so you're seeing the final result of the normal overallocation growth pattern when adding an element at a time.
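That equivalence is easy to spot-check (my sketch, again on 64-bit CPython 3.8): a list built by repeated append ends up with the same allocation as the comprehension:
from sys import getsizeof

for n in range(13):
    appended = []
    for _ in range(n):
        appended.append(None)  # grows through the normal overallocation pattern
    # the comprehension also appends one element at a time internally
    assert getsizeof(appended) == getsizeof([None for _ in range(n)])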
To be clear, none of this is a language guarantee. It's just how CPython implements it. The Python language spec is generally unconcerned with specific growth patterns in list (aside from guaranteeing amortized O(1) appends and pops from the end). As noted in the comments, the specific implementation changes again in 3.9; while it won't affect [*a], it could affect other cases where what used to be "build a temporary tuple of individual items and then extend with the tuple" now becomes multiple applications of LIST_APPEND, which can change when the overallocation occurs and what numbers go into the calculation.
Full picture of what happens, building on the other answers and comments (especially ShadowRanger's answer, which also explains why it's done like that).
Disassembling shows that BUILD_LIST_UNPACK gets used:
>>> import dis
>>> dis.dis('[*a]')
1 0 LOAD_NAME 0 (a)
2 BUILD_LIST_UNPACK 1
4 RETURN_VALUE
That's handled in ceval.c, which builds an empty list and extends it (with a):
case TARGET(BUILD_LIST_UNPACK): {
    ...
    PyObject *sum = PyList_New(0);
    ...
    none_val = _PyList_Extend((PyListObject *)sum, PEEK(i));
_PyList_Extend uses list_extend:
_PyList_Extend(PyListObject *self, PyObject *iterable)
{
    return list_extend(self, iterable);
}
Which calls list_resize with the sum of the sizes:
list_extend(PyListObject *self, PyObject *iterable)
{
    ...
    n = PySequence_Fast_GET_SIZE(iterable);
    ...
    m = Py_SIZE(self);
    ...
    if (list_resize(self, m + n) < 0) {
And that overallocates as follows:
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    ...
    new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
Let's check that. Compute the expected number of spots with the formula above, and compute the expected byte size by multiplying it with 8 (as I'm using 64-bit Python here) and adding an empty list's byte size (i.e., a list object's constant overhead):
from sys import getsizeof

for n in range(13):
    a = [None] * n
    expected_spots = n + (n >> 3) + (3 if n < 9 else 6)
    expected_bytesize = getsizeof([]) + expected_spots * 8
    real_bytesize = getsizeof([*a])
    print(n,
          expected_bytesize,
          real_bytesize,
          real_bytesize == expected_bytesize)
Output:
0 80 56 False
1 88 88 True
2 96 96 True
3 104 104 True
4 112 112 True
5 120 120 True
6 128 128 True
7 136 136 True
8 152 152 True
9 184 184 True
10 192 192 True
11 200 200 True
12 208 208 True
This matches except for n = 0, where list_extend takes a shortcut, so actually that matches, too:
if (n == 0) {
    ...
    Py_RETURN_NONE;
}
...
if (list_resize(self, m + n) < 0) {
These are going to be implementation details of the CPython interpreter, and so may not be consistent across other interpreters.
That said, you can see where the comprehension and list(a) behaviors come in here:
https://github.com/python/cpython/blob/master/Objects/listobject.c#L36
Specifically for the comprehension:
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
...
new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
Just below those lines, there is list_preallocate_exact which is used when calling list(a).
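That exact preallocation is also easy to confirm (my sketch, assuming 64-bit CPython 3.8 as in the question): list(a) allocates exactly n slots, so its size is the empty-list overhead plus 8 bytes per pointer:
from sys import getsizeof

for n in range(13):
    a = [None] * n
    # list(a) goes through list_preallocate_exact: no spare capacity
    assert getsizeof(list(a)) == getsizeof([]) + 8 * n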
I'm trying to solve the knapsack problem using Python, implementing a greedy algorithm. The result I'm getting back makes no sense to me.
Knapsack:
The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
Python code:
import os

def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap):
        best = max(values)
        i = values.index(best)
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight

def read_kfile(fname):
    with open(fname, 'rU') as kfile:
        lines = kfile.readlines()  # reads the whole file
    n = int(lines[0])
    c = int(lines[n+1])
    vs = []
    ws = []
    lines = lines[1:n+1]  # Removes the first and last line
    for l in lines:
        numbers = l.split()  # Converts the string into a list
        vs.append(int(numbers[1]))  # Appends value, need to convert to int
        ws.append(int(numbers[2]))  # Appends weight, need to convert to int
    return n, c, vs, ws

dir_path = os.path.dirname(os.path.realpath(__file__))  # Get the directory where the file is located
os.chdir(dir_path)  # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, cap, values, weights = read_kfile(knapfile)
val1, val2 = constructive()
print('knapsack', val1)
print('weight', val2)
print('cap', cap)
Result:
knapsack [18, 0, 8, 13, 3, 8, 1, 0, 3]
weight 570
cap 524
Welcome. The reason your program gives a weight over the cap limit is that you never check whether the final item you put in the knapsack actually fits. To fix this, just add an if statement. You should also check whether the list of values is empty. Note that I append (i+1), since your text file's indices start at 1 but Python list indices start at 0:
def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap and values):
        best = max(values)
        i = values.index(best)
        if weights[i] <= cap - Weight:
            knapsack.append(i+1)
            Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
The problem is that in the last step the best item you find can exceed the maximum weight, but since you have already entered the loop you add it anyway.
In the next iteration you recognize that you are over the cap and stop.
I am not sure how you want to proceed once the next-best item is too heavy. In case you simply want to stop and not add anything more, you can modify your constructive to look as follows:
def constructive():
    knapsack = []
    Weight = 0
    while(True):
        best = max(values)
        i = values.index(best)
        if Weight + weights[i] > cap:
            break
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
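As an aside, and not part of either answer above: the classic greedy heuristic for 0/1 knapsack picks by value-to-weight ratio rather than by raw value. A minimal sketch using the same values/weights/cap names as the question:
def ratio_greedy(values, weights, cap):
    # Consider items in decreasing value/weight order and take each one
    # that still fits. This is a heuristic, not an exact 0/1 solution.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    knapsack, weight = [], 0
    for i in order:
        if weight + weights[i] <= cap:
            knapsack.append(i + 1)  # 1-based, matching the input file's indices
            weight += weights[i]
    return knapsack, weight
Unlike the loops above, this never mutates values and weights, so it can be called repeatedly on the same data.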
I want to stop the genetic algorithm when the fitness doesn't increase.
I'm using the DEAP library in python.
Typically, I have the following log file:
gen nevals mean max
0 100 0.352431 0.578592
1 83 -0.533964 0.719633
2 82 -0.567494 0.719633
3 81 -0.396759 0.751318
4 74 -0.340427 0.87888
5 80 -0.29756 0.888443
6 86 -0.509486 0.907789
7 85 -0.335586 1.06199
8 69 -0.23967 1.12339
9 73 -0.10727 1.20622
10 88 -0.181696 1.20622
11 77 -0.188449 1.20622
12 72 0.135398 1.25254
13 67 0.0304611 1.26931
14 74 -0.0436463 1.3181
15 70 0.289306 1.37582
16 79 -0.0441134 1.37151
17 73 0.339611 1.37204
18 68 -0.137938 1.37204
19 76 0.000527522 1.40034
20 84 0.198005 1.40078
21 69 0.243705 1.4306
22 74 0.11812 1.4306
23 83 0.16235 1.4306
24 82 0.270455 1.43492
25 76 -0.200259 1.43492
26 77 0.157181 1.43492
27 74 0.210868 1.43492
I initially set ngen = 200, but as you can see, the fitness function reaches a local maximum around the 22nd generation. So I want to stop the genetic algorithm when this happens.
def main():
    random.seed(64)
    pop = toolbox.population(n=100)
    CXPB, MUTPB = 0.5, 0.2
    print "Start of evolution"
    fitnesses = list(map(toolbox.evaluate, pop))
    for ind, fit in zip(pop, fitnesses):
        ind.fitness.values = fit
    print " Evaluated %i individuals" % len(pop)
    fits = [ind.fitness.values[0] for ind in pop]
    g = 0
    while max(fits) < 0.67 and g < 1000000:
        g = g + 1
        print "-- Generation %i --" % g
        offspring = toolbox.select(pop, len(pop))
        offspring = list(map(toolbox.clone, offspring))
        for child1, child2 in zip(offspring[::2], offspring[1::2]):
            if random.random() < CXPB:
                toolbox.mate(child1, child2)
                del child1.fitness.values
                del child2.fitness.values
        for mutant in offspring:
            if random.random() < MUTPB:
                toolbox.mutate(mutant)
                del mutant.fitness.values
        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit
        pop[:] = offspring
        fits = [ind.fitness.values[0] for ind in pop]
        print "fitness-- ", max(fits)
    print "-- End of (successful) evolution --"
    best_ind = tools.selBest(pop, 1)[0]
    triangle_to_image(best_ind).save('best.jpg')
This will stop the loop when the desired fitness value is reached or when a particular number of generations is over. You can also set it up so that it stops when the fitness doesn't change for some time, i.e. when it reaches a local maximum and gets stuck there (see the sketch below). The while condition on line 12 is what makes this example stop once the fitness crosses 0.67, and the result is then saved. This is the way to do it when you are not using something like a hall of fame; I don't know how to do it in that case -- if you find out, tell me too.
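A minimal sketch of that stall-based criterion (my addition, not from DEAP; patience is a made-up parameter): track the best fitness per generation and stop once it hasn't improved for a given number of generations.
def stalled(best_history, patience=10):
    # True once the best fitness has not improved for `patience` generations.
    if len(best_history) <= patience:
        return False
    return max(best_history[-patience:]) <= max(best_history[:-patience])
Inside the generation loop above, after fits is recomputed, you would append max(fits) to best_history and break when stalled(best_history) returns True.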
Honestly, I was looking into that issue recently too.
Following the research I've done, here is what I found:
There's a DEAP example which implements the CMA-ES algorithm. It has stopping criteria included (Python DEAP, how to stop the evolution when the fitness doesn't increase after X generations?).
There is a dissertation thesis worth reading on that: https://heal.heuristiclab.com/system/files/diss%20gkr2.pdf
The above-mentioned solution implements what's mentioned in this issue: https://github.com/DEAP/deap/issues/271
I haven't tried any of the above yet, but I'm fairly sure it will work.
I have this code (see this Thread) for saving a two-column array to a file. The thing is that I need to call this function N times:
def save(self):
    n = self.n
    with open("test.csv", "a") as f:
        f.write("name\tnum\n")
        for k, v in tripo.items():
            if v:
                f.write(n + "\t")
                f.write("{}\n".format(k.split(".")[0]))
                for s in v:
                    f.write(n + "\t")
                    f.write("\n".join([s.split(".")[0]]) + "\n")
This is the sample content of tripo for n=1:
{
    '1.txt': [],
    '2.txt': [],
    '5.txt': [],
    '4.txt': ['3.txt', '6.txt'],
    '7.txt': ['8.txt']
}
This is the expected output for n=1...N:
name num
1 4
1 3
1 6
1 7
1 8
...
N 3
N 6
N ...
However, the above-given code puts some values in the same column.
UPDATE:
For instance, if I have the entry '170.txt': ['46.txt','58.txt','86.txt'], then I receive this result:
1 1 1 1 170
46
58
86
instead of:
1 170
1 46
1 58
1 86
import os

tripo = [
    ('1.txt', []),
    ('2.txt', []),
    ('5.txt', []),
    ('4.txt', ['3.txt', '6.txt']),
    ('7.txt', ['8.txt'])
]

def getname(f):
    return os.path.splitext(f)[0]

def getresult(t):
    result = []
    for k, v in t:
        values = [getname(n) for n in v]
        if len(values) > 0:
            result.append(getname(k))
            for x in values:
                result.append(x)
    return result

def writedown(n, r):
    with open("test.csv", "a") as f:
        for x in r:
            f.write("%s\t%s\n" % (n, x))
            print("%s\t%s\n" % (n, x))

print(getresult(tripo))
writedown(1, getresult(tripo))
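Equivalently, the writing step could lean on the standard csv module instead of hand-formatting the rows; a small sketch (my variant, Python 3, same filename and tab layout as above):
import csv

def writedown_csv(n, result, path="test.csv"):
    # Append one name<TAB>num row per extracted value.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for x in result:
            writer.writerow([n, x])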
Use Pickle. Use pickle.dump to store to file and pickle.load to load it.
I don't quite understand your question.
Is the object representation correct, but the writing to the file incorrect?
If that's the case then, as Dan said, using pickle could be useful:
import pickle

# writing
with open('test.csv', 'wb') as f:
    pickle.dump(tripo, f)

# reading
with open('test.csv', 'rb') as f:
    serialized_object = pickle.load(f)
The serialized_object variable should have the structure you want to preserve.
I have a bunch of files, and every file has a header of 5 lines. In the rest of the file, each pair of lines forms an entry. I need to randomly select entries from these files.
How can I select random files and random entries (pairs of lines, excluding the header)?
If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.
For an intuitive proof, imagine a file with 3 lines:
          1                 Pick line 1.
         / \
       .5   .5
       /     \
      2       1             Switch to line 2?
     / \     / \
  .67  .33 .33  .67
   /     \  /     \
  2      3 3       1        Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
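For reference, the same 1/N rule as a minimal Python sketch (my addition; it picks single lines and leaves out the header/pair handling of the actual question, which the Perl version below includes):
import random

def pick_random_line(lines):
    # Keep line n with probability 1/n; after N lines, every line
    # has been the pick with probability exactly 1/N.
    pick = None
    for n, line in enumerate(lines, start=1):
        if random.randrange(n) == 0:
            pick = line
    return pick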
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;
    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;
You may find perlfaq5 useful.
sed "1,5d" < FILENAME | sort -R | head -2
Python solution - it reads the file only once and requires little memory.
Invoke it like so: getRandomItems(file('myHuge.log'), 5, 2) - it will return a list of 2 lines.
from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0
    r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration:
            break
        n += 1
        if not randrange(n):
            r = nxt
    return r
It returns an empty list if it could not get the first passable items from f. The code's only requirement is that the argument f is an iterator (supports the next() method). Hence we can pass something other than a file; say we want to see the distribution:
>>> s = {}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r, 0)
...
>>> for i in s:
...     print i, '*' * s[i]
...
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
The answer is in Python, assuming you can read a whole file into memory.
# using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)
    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)
    for path in chosen_paths:
        # join with the directory, since listdir returns bare names
        chosen_entries = get_random_entries(os.path.join(directory, path), num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]
        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))
        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    # use optparse here to do fancy things with the command line args
    main(sys.argv[1])
Links: itertools, random, optparse
Another Python option; reading the contents of all files into memory:
import random
import fileinput
def openhook(filename, mode):
f = open(filename, mode)
headers = [f.readline() for _ in range(5)]
return f
num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
Two other means to do so:
1- by generators (may still require a lot of memory): http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html
2- by clever seeking (the best method, actually): http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html
Here I copy the code of the clever Jonathan Kupferman:
#!/usr/bin/python
import os, random

filename = "averylargefile"
file = open(filename, 'r')

# Get the total file size
file_size = os.stat(filename)[6]

while 1:
    # Seek to a place in the file which is a random distance away.
    # Mod by file size so that it wraps around to the beginning.
    file.seek((file.tell() + random.randint(0, file_size - 1)) % file_size)

    # Don't use the first readline since it may fall in the middle of a line.
    file.readline()
    # This will return the next (complete) line from the file.
    line = file.readline()

    # Here is your random line in the file.
    print line