I have some Python code that is supposed to read large files into a dictionary in memory and do some operations. What puzzles me is that it goes out of memory in only one case: when the values in the file are integers...
The structure of my file is like this:
string value_1 .... value_n
The files I have vary in size from 2G to 40G, and I have 50G of memory to read them into. When a line looks like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ...
with n=100 and 10M rows, I'm able to read it into memory relatively fast; the file size is about 10G. However, when a line looks like
string 4 -2 3 1 1 1 ...
with the same dimension (n=100) and the same number of rows, I'm not able to read it into memory.
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore w2v first line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1
    features = {}
    for dim, val in enumerate(tokens[1:]):
        val = float(val)
        features[dim] = val
    matrix[word] = features
This gets Killed in the second case, while it works in the first case.
I know this does not answer the question specifically, but it probably offers a better solution to the problem you're trying to solve:
May I suggest you use Pandas for this kind of work? It seems a lot more appropriate for what you're trying to do: http://pandas.pydata.org/index.html
import pandas as pd
pd.read_csv('file.txt', sep=' ', skiprows=1)
then do all your manipulations
Pandas is a package designed specifically for handling and processing large datasets. It has tons of useful features you'll probably end up needing if you're dealing with big data.
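For example, a minimal sketch (the header=None / index_col choice and the 'some_word' lookup are assumptions for illustration, not from the original post):

import pandas as pd

# Read the space-separated file, skipping the word2vec header line;
# the word sits in the first column and the n feature values follow.
df = pd.read_csv('file.txt', sep=' ', skiprows=1, header=None, index_col=0)

# A row lookup by word then plays the role of matrix[word]:
features = df.loc['some_word']  # 'some_word' is a placeholder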
I am trying to develop a way to address large parallel tasks for brute-forcing a keyspace. I'd like to be able to pass a worker a value such that, given a chunk size, the value tells the worker what to output.
Simply speaking: given a charset (a-z), a max length of 1 (so basically just a-z), and a chunk size of 5: if I send worker 1 the number 0, it will take items 0-4 of the iterator (a, b, c, d, e); if I send worker 2 the number 1, it will take items 5-9, and so on. I have that code basically working:
#!/usr/bin/python
import itertools

maxlen = 5
charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
chunksize = 1000
chunkpart = 5

# Chain the products of every length from 1 to maxlen into one stream,
# then slice out just the chunk this worker is responsible for.
candidates = itertools.chain.from_iterable(
    (''.join(l) for l in itertools.product(charset, repeat=i))
    for i in range(1, maxlen + 1))
for s in itertools.islice(candidates, chunksize * chunkpart, chunksize * (chunkpart + 1)):
    print s
OK, that works great: if I send chunkpart 5 to worker 1, it will do what it needs to do on that chunk.

The issue comes into play when I need a small chunk (1000 records) that sits very far into a large set. Say maxlen is 10 and chunkpart is 50,000,000: it takes Python a LONG time to get to that point, because islice cannot jump ahead; it has to consume every item before the start of the slice.

So I think I know WHY this happens; what I am wondering is whether there's a better way to do this that shortcuts the stepping. My gut tells me itertools has the answer; my brain says I need to understand itertools better.
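For what it's worth, one possible shortcut (my sketch, not from the original post): because the candidates are just fixed-length strings of digits in base len(charset), the n-th candidate can be computed directly, letting a worker jump straight to its chunk:

def nth_string(n, charset):
    # Hypothetical sketch: return the string that the chained
    # itertools.product streams would yield at index n.
    # Assumes n is within the total keyspace.
    base = len(charset)
    count = base       # number of strings of the current length
    length = 1
    while n >= count:
        n -= count     # skip past every string of this length
        count *= base
        length += 1
    chars = []
    for _ in range(length):
        n, digit = divmod(n, base)
        chars.append(charset[digit])
    return ''.join(reversed(chars))

A worker handed chunkpart p could then compute each of its candidates as nth_string(chunksize * p + i, charset) in constant time per item, instead of islice-ing past fifty million of them.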
I'm working with some very large arrays. One issue, of course, is running out of RAM, but even before that my code runs so slowly that, even with infinite RAM, it would still take far too long. I'll give a bit of my code to show what I'm trying to do:
import numpy as np

# samplez is a 3 million element 1-D array
# zfit is a 10,000 x 500 2-D array
b = np.arange(len(zfit))
for x in samplez:
    a = x - zfit
    mask = np.ma.masked_array(a)
    mask[a <= 0] = np.ma.masked
    index = mask.argmin(axis=1)
    # These past 4 lines give me an index array of the smallest positive number
    # in x - zfit
    d = zfit[b, index]
    e = zfit[b, index + 1]
    f = (x - d) / (e - d)
    # f is the calculation I am after
    if x == samplez[0]:
        g = f
        index_stack = index
    else:
        g = np.vstack((g, f))
        index_stack = np.vstack((index_stack, index))
I need to use g and index_stack, each of which is a 3 million x 10,000 2-D array, in a further calculation. Each iteration of this loop takes almost 1 second, so 3 million seconds total, which is way too long.

Is there anything I can do to make this calculation run much faster? I've tried to think of a way to do without the for loop, but the only way I can imagine is making 3 million copies of zfit, which is unfeasible.

And is there some way I can work with these arrays without keeping everything in RAM? I'm a beginner, and everything I've found on this is either irrelevant or something I can't understand. Thanks in advance.
It is good to know that the smallest positive number will never show up at the end of a row.

In samplez there are 1 million unique values, but in zfit each row can have at most 500 unique values, and the entire zfit can have as many as 50 million unique values. The algorithm can be greatly sped up if the number of 'find the smallest positive number > each_element_in_samplez' calculations can be greatly reduced. Doing all 5e13 comparisons is probably overkill, and careful planning will get rid of a large proportion of them. That will depend heavily on your actual underlying mathematics.

Without knowing it, there are still some small things that can be done: 1) there are not that many possible values of (e-d), so the differences can be taken out of the loop; 2) the loop can be replaced by map. These two small fixes, on my machine, result in about a 22% speed-up.
import numpy as np

def function_map(samplez, zfit):
    rows = np.arange(len(zfit))
    diff = zfit[:, 1:] - zfit[:, :-1]   # all possible (e - d), computed once
    def _fuc1(x):
        a = x - zfit
        mask = np.ma.masked_array(a)
        mask[a <= 0] = np.ma.masked
        index = mask.argmin(axis=1)
        d = zfit[rows, index]
        f = (x - d) / diff[rows, index]  # constraint: smallest value never at the very end
        return (index, f)
    result = map(_fuc1, samplez)
    return (np.array([item[1] for item in result]),
            np.array([item[0] for item in result]))
Next: masked_array can be avoided completely, which should bring a significant improvement. samplez needs to be sorted (ascending) as well.
>>> from numpy import *
>>> def function_map2(samplez, zfit):
...     _diff = diff(zfit, axis=1)
...     _zfit = zfit * 1    # work on a copy; it is modified in place below
...     rows = arange(len(zfit))
...     def _fuc1(x):
...         # samplez is sorted, so entries knocked out for a previous
...         # (smaller) x stay knocked out for every later x
...         _zfit[_zfit < x] = inf
...         index = nanargmin(_zfit, axis=1) - 1   # last element still below x
...         d = zfit[rows, index]
...         f = (x - d) / _diff[rows, index]  # constraint: smallest value never at the very end
...         return (index, f)
...     result = map(_fuc1, samplez)
...     return (array([item[1] for item in result]),
...             array([item[0] for item in result]))
...
>>> x1 = arange(50)
>>> x2 = random.random(size=(20, 10)) * 120
>>> x2 = sort(x2, axis=1)  # just to make sure the last element of each row > largest value in x1
>>> x3 = x2 * 1            # a separate copy for function_map2
>>> f0 = lambda: function_map(x1, x2)
>>> f1 = lambda: function_map2(x1, x3)
>>> import timeit
>>> t0 = timeit.Timer('f0()', 'from __main__ import f0')
>>> t1 = timeit.Timer('f1()', 'from __main__ import f1')
>>> t0.timeit(5)
0.09083795547485352
>>> t1.timeit(5)
0.05301499366760254
>>> t0.timeit(50)
0.8838210105895996
>>> t1.timeit(50)
0.5063929557800293
>>> t0.timeit(500)
8.900799036026001
>>> t1.timeit(500)
4.614129018783569
So that is roughly another 50% speed-up.

masked_array is avoided, and that saves some RAM. I can't think of anything else to reduce RAM usage; it may be necessary to process samplez in parts. Also, depending on the data and the required precision, if you can use float16 or float32 instead of the default float64, that can save you a lot of RAM.
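For instance, a minimal sketch of processing samplez in parts (the function name, chunk size, and the float32/int32 downcasts are assumptions for illustration), reusing function_map2 from above:

import numpy as np

def process_in_chunks(samplez, zfit, chunk=100000):
    # Run the vectorized routine on slices of samplez so only one
    # chunk of results has to live in RAM at a time, and downcast
    # to float32 to halve the memory footprint of each result.
    for start in range(0, len(samplez), chunk):
        g, index_stack = function_map2(samplez[start:start + chunk], zfit)
        yield g.astype(np.float32), index_stack.astype(np.int32)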
I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (at least, not ugly). The only "ugly" thing I do is pre-formatting the input and keeping it in the solution file (for some technical reasons, and because I want to concentrate on the numeric part of the problem).

The problem is: "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."

I wrote some code that should work, as far as I know, but it gives the wrong result. I've checked the input several times and it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
for i in xrange(10):
    print result_sum[i]
Your code works by adding all the numbers in nums like a person would: column by column. It fails because, when you sum the far-left column, you treat it like every other column; when people get to the far left, they write down the entire remaining sum. So this line

result_sum.insert(0, int(tmp_sum % 10))

doesn't work for the far-left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project-Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
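For illustration, the whole problem then collapses to two lines (this snippet is mine, not from the answer above, and it does give away the straightforward approach):

total = sum(nums)       # one exact arbitrary-precision integer, no overflow
print str(total)[:10]   # the first ten digits of the sum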
Also, regarding your code: you need to account for the digits left in tmp_sum, which holds your carry-over. You can add them into result_sum like this:

while tmp_sum:
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum /= 10

This will fix your issue. Here, it works.
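Put together with the loop from the question, the whole thing reads as follows (same code as above, just showing where the carry flush goes):

result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
while tmp_sum:  # flush the remaining carry digits
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum /= 10
print ''.join(map(str, result_sum[:10]))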
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
As simple as this (values.txt contains all the numbers, one per line):
nums = []
with open("values.txt", 'r') as f:
    for num in f:
        nums.append(int(num))
print(str(sum(nums))[:10])
Just as easy is storing the numbers in a CSV file and using pandas:

def foo():
    import pandas as pd
    table = pd.read_csv("data.txt", header=None, usecols=[0])

and then iterating through the pandas DataFrame:

    total = 0
    for x in range(len(table)):
        total += int(table[0][x])
    return str(total)[:10]
Just keep in mind that Python handles the large integers for you.
I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it always seems to be a power of 2, though I don't think that's important.

So, if I have channels a, b, and c, sampled at dividers of 1, 2, and 4, my stream will look like:

a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...

For added fun, the channels can independently be floats or ints (though I know which for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big-endian and sometimes little-endian, though I know which I'm dealing with up front.
I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):
for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ...  # build the struct format string from channels_to_sample
    data = struct.unpack(format_str, my_file.read(...))  # read and unpack the data
    # iterate over the data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num / ch.divider] = val
And it's slow -- a few seconds to read a 20MB file on my laptop. The profiler tells me I'm spending a bunch of time in Channel#samples_for(), which makes sense; there's a bit of conditional logic there.

My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.
Update
Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 seconds to about 0.2 seconds, for a speedup of 50x.
The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.
First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams, b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then

N = lcm*(b_1/d_1 + ... + b_k/d_k)

is the number of bytes after which the pattern of streams repeats. Once you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
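As a sanity check against the question's example (assuming, hypothetically, 4-byte samples for every channel):

from fractions import Fraction

dividers = [1, 2, 4]   # channels a, b, c from the question
sizes    = [4, 4, 4]   # assumed sample sizes in bytes

lcm = 4                # least common multiple of the dividers
N = lcm * sum(Fraction(b, d) for b, d in zip(sizes, dividers))
print N   # 28: one repeat block is a a a a, b b, c = 7 samples * 4 bytes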
You can now build the array of stream indices for the first N bytes by something similar to
stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)
Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.
Now you can do
data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]
You possibly need to pad the data a bit at the end to make the reshape() work.
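A minimal sketch of that padding step (my assumption, not from the original answer), replacing the fromfile/reshape line above:

raw = numpy.fromfile(my_file, dtype=numpy.uint8)
pad = -len(raw) % N   # zero bytes needed to reach a multiple of N
raw = numpy.concatenate([raw, numpy.zeros(pad, dtype=numpy.uint8)])
data = raw.reshape(-1, N)   # the padded tail can be trimmed from each stream later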
Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write
streams[0].dtype = ">i"
This won't change the data in the array in any way, just the way it is interpreted.
This may look a bit cryptic, but should be much better performance-wise.
Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:
for (chan, sample_data) in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)
To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.
def iter_channels(channels):
    for i in itertools.count():
        for chan in channels:
            if i % chan.period == 0:
                yield chan
The grouper() recipe along with itertools.izip() should be of some help here.
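For reference, the grouper() recipe from the itertools documentation (Python 2 flavor, built on izip_longest):

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)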