Python - file processing - memory error - speed up the performance

I'm dealing with huge numbers. I have to write them into a .txt file. Right now I have to write all the numbers between 1M and 1B (1,000,000 to 1,000,000,000) into a .txt file. Since it throws a MemoryError if I do it in a single list, I sliced them into chunks (I don't like this solution but couldn't find any other).
The problem is, even with the first 50M numbers (1M-50M), I can't even open the .txt file. It's 458 MB and took around 15 minutes to write, so I guess the full file would be around 9 GB and take over 4 hours if I write all the numbers.
When I try to open the .txt file that contains the numbers between 1M and 50M, I get:
myfile.txt has stopped working
So right now the file contains the numbers between 1M and 50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.
I have to shuffle the numbers between 1M and 1B and store these numbers in a .txt file. Basically it's a freelance job and I'll have to deal with bigger numbers like 100B etc. Even the first 50M has this problem, and I don't know how to finish when the numbers are bigger.
Here is the code for 1M-50M:
import random
x = 1000000
y = 10000000
while x < 50000001:
    nums = [a for a in range(x, x + y)]
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:
        for z in nums:
            f.write(str(z) + "\n")
    x += 10000000
How can I speed up this process?
How can I open this .txt file? Should I create a new file every time? If
I choose this option I have to slice the numbers further, since even 50M numbers
cause a problem.
Is there any module you can suggest that may be useful for this process?

Is there any module you can suggest that may be useful for this process?
Using Numpy is really helpful for working with large arrays.
How can I speed up this process?
Using Numpy's arange and tofile functions dramatically speeds up the process (see the code below). Generating the initial array is about 50 times faster, and writing the array to a file is about 7 times faster.
The code just performs each operation once (change number=1 to a higher value for better accuracy) and only generates the numbers between 1M and 2M, but you can see the general picture.
import random
import timeit
import numpy
x = 10**6
y = 2 * 10**6
def list_rand():
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')
print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))
print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))
print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))
print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))
list generation, random [secs]
1.3995
numpy array generation, random [secs]
0.0319
standard write [secs]
2.5745
numpy write [secs]
0.3622
How can I open this .txt file? Should I create a new file every time? If
I choose this option I have to slice the numbers further, since even 50M
numbers cause a problem.
It really depends on what you are trying to do with the numbers. Find their relative positions? Delete one from the list? Restore the array?
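If the goal is just to get the numbers back into a program rather than into a text editor, here is a minimal sketch of reading the file back into a NumPy array (my own addition, assuming the newline-separated nums.txt written above):

import numpy as np

# parse the newline-separated integers back into an array instead of opening
# the file in an editor; loadtxt is the simple, well-known option
nums = np.loadtxt('nums.txt', dtype=np.int64)
# a usually faster alternative for this layout:
# nums = np.fromfile('nums.txt', dtype=np.int64, sep='\n')
print(nums.size, nums[:5])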

I can't help with the Python side, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125 MB. Generate a random number. If it is not present in the bit array, add it there and write it to the file. Repeat until you have 99% of the numbers in the file.
Now convert the unused numbers in the bit array into an ordinary array - it would be about 80 MB. Shuffle them and write them to the file.
You needed about 200 MB of memory for 1E9 items (and 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20 GB of RAM in less than a day.
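A minimal Python sketch of that idea (my own translation of the description above, not the original C# code; the 99% cutoff, the file name and the reduced N are illustrative assumptions):

import random

N = 10**7          # use 10**9 for the real job; 10**7 keeps the sketch quick
CUTOFF = 0.99      # stop the random probing once 99% of the numbers are placed

seen = bytearray(N // 8 + 1)   # bit array: one bit per candidate number

def test_and_set(i):
    byte, bit = divmod(i, 8)
    mask = 1 << bit
    already = seen[byte] & mask
    seen[byte] |= mask
    return bool(already)

with open('nums_shuffled.txt', 'w') as f:
    # phase 1: rejection sampling until 99% of the numbers have been written
    written = 0
    target = int(N * CUTOFF)
    while written < target:
        i = random.randrange(N)   # the sketch shuffles 0..N-1; add an offset for 1M..1B
        if not test_and_set(i):
            f.write(str(i) + '\n')
            written += 1
    # phase 2: collect the ~1% of unused numbers, shuffle them, append them
    leftovers = [i for i in range(N) if not ((seen[i >> 3] >> (i & 7)) & 1)]
    random.shuffle(leftovers)
    f.writelines(str(i) + '\n' for i in leftovers)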

Related

avoid for loop in python normal(size={})

My goal is to create an array where each element is normal(size=...) evaluated for each element of it.
I am trying to optimize:
it = 2 ** arange(6, 25)
M = zeros(len(it))
for x in range(len(it)):
    M[x] = (normal(size=it[x]))
Here is what I have tried so far, which is not working:
N = zeros(len(it))
it = 2 ** arange(6, 25)
N = (normal(size=it))
Further I tried:
N = (normal(size=it[:]))
Given my data, I believe that such manual work, or a for loop, is really inefficient, so I am trying to come up with vectorized operations.
I receive:
File "mtrand.pyx", line 1335, in numpy.random.mtrand.RandomState.normal
File "common.pyx", line 557, in numpy.random.common.cont
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
You've not been very precise about where these functions are coming from, but I'm guessing that by normal(size=it[:]) you mean:
import numpy as np
it = 2 ** np.arange(6, 25)
np.random.normal(size=it)
which would be telling numpy to create a 19-dimensional array (i.e. len(it)) that contains about 6 × 10^85 elements (i.e. np.prod(it.astype(float)), as floats because the number overflows an int64). numpy is saying that it can't do that, which seems like a reasonable thing to do.
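A quick way to see the scale involved (my own check, not part of the original answer):

import numpy as np

it = 2 ** np.arange(6, 25)
print(len(it))                    # 19 requested sizes, one per dimension
print(np.prod(it.astype(float)))  # ~6.2e85 total elements, hence the error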
Numpy doesn't like the "ragged arrays" you're trying to create, neither do most matrix/numeric libraries, hence support is limited!
I'm unsure why you consider that the "loop is really inefficient". You're creating ~33 million floats from 19 iterations of a simple Python loop. The vast majority of the time will be spent in highly optimised Numpy library code, and some tiny (basically unmeasurable) amount of time will be spent evaluating your Python bytecode.
If you really want a one-liner then you can do:
X = [np.random.normal(size=2**i) for i in range(6, 25)]
which makes the split between Numpy and Python worlds more obvious.
Note that on my laptop, the Python code executes in ~5µs while the Numpy code runs for ~800ms. So you're trying to optimise the 0.0006% part!
Note that it's not always a win to use Numpy's vectorization; it only helps with larger arrays. For example, the above loop is "faster" than:
X = [np.random.normal(i) for i in 2**np.arange(6, 25)]
4.8 vs 5.1 µs for the Python code, because of the time spent marshalling objects into/out of the Numpy world. Again, none of this matters; just use whichever solution makes your code easier to understand. A few microseconds is nothing compared to seconds.

Resampling (upsampling, interpolating) a series of numbers

I have a comma separated series of integer values that I'd like to resample so that I have twice as many, where a new value is added half way between each of the existing values. For example, if this is my source:
1,5,11,9,13,21
the result would be:
1,3,5,8,11,10,9,11,13,17,21
In case that's not clear, I'm trying to add a number between each of the values in my source series, like this:
1       5       11      9       13      21
1   3   5   8   11  10  9   11  13  17  21
I've searched quite a bit and it seems that something like scipy.signal.resample or panda should work, but I'm completely new at this and I haven't been able to get it working. For example, here's one of my attempts with scipy:
import numpy as np
from scipy import signal
InputFileName = "sample.raw"
DATA250 = np.loadtxt(InputFileName, delimiter=',', dtype=int);
print(DATA250)
DATA500 = signal.resample(DATA250, 11)
print(DATA500)
Which outputs:
[ 1 5 11 9 13 21]
[ 1. -0.28829461 6.12324489 10.43251996 10.9108191 9.84503237
8.40293529 10.7641676 18.44182898 21.68506897 12.68267746]
Obviously I'm using signal.resample incorrectly. Is there a way I can do this with signal.resample or pandas? Should I be using some other method?
Also, in my example all of the source numbers have an integer halfway in between. In my actual data, that won't be the case. So if two of the numbers are 10 and 15, the new number would be 12.5. However, I'd like all of the resulting numbers to be integers, so the new number that gets inserted would need to be either 12 or 13 (it doesn't matter to me which).
Note that once I get this working, the source file will actually be a comma-separated list of 2,000 numbers, and the output should be 4,000 numbers (or technically 3,999, since there won't be one added at the end). Also, this is going to be used to process something similar to an ECG recording: currently the ECG is sampled at 250 Hz for 8 seconds, which is then passed to a separate process that analyzes the recording. However, that separate process needs the recording to be sampled at 500 Hz. So the workflow will be that I'll take a 250 Hz recording every 8 seconds, upsample it to 500 Hz, and pass the resulting output to the analysis process.
Thanks for any guidance you can provide.
Since the interpolation is simple, you can do it by hand:
import numpy as np
a = np.array([1,5,11,9,13,21])
b = np.zeros(2*len(a)-1, dtype=np.uint32)
b[0::2] = a
b[1::2] = (a[:-1] + a[1:]) // 2
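For a quick sanity check with the question's example as a, continuing directly from the snippet above:

print(b.tolist())
# expected: [1, 3, 5, 8, 11, 10, 9, 11, 13, 17, 21]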
You can also use scipy.signal.resample this way:
import numpy as np
from scipy import signal
a = np.array([1,5,11,9,13,21])
b = signal.resample(a, len(a) * 2)
b_int = b.astype(int)
The trick is to have exactly twice the number of elements, so that odd points match your initial points. Also I think that the Fourier interpolation done by scipy.signal.resample is better for your ECG signal than the linear interpolation you're asking for.
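As a rough check of that claim (my own addition, continuing from the snippet above), the original samples should reappear, up to floating-point error, at every second position of the resampled array:

print(np.round(b[::2], 6))   # approximately [1, 5, 11, 9, 13, 21], the original points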
Although I probably would just use NumPy here (pretty similar to J. Martinot-Lagarde's answer), you don't actually have to.
First, you can read a single row of comma-separated numbers with just the csv module:
import csv

with open(path) as f:
    numbers = map(int, next(csv.reader(f)))
… or just string operations:
with open(path) as f:
    numbers = map(int, next(f).split(','))
And then you can interpolate that easily:
def interpolate(numbers):
    last = None
    for number in numbers:
        if last is not None:
            yield (last + number) // 2
        yield number
        last = number
If you want it to be fully general and reusable, just take a function argument and yield function(last, number), and replace None with sentinel = object().
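A minimal sketch of that generalisation (the name interpolate_with is my own, not from the answer):

def interpolate_with(function, numbers):
    # generalised version: 'function' decides what goes between neighbours
    sentinel = object()
    last = sentinel
    for number in numbers:
        if last is not sentinel:
            yield function(last, number)
        yield number
        last = number

# same midpoint behaviour as before:
# list(interpolate_with(lambda a, b: (a + b) // 2, [1, 5, 11]))  ->  [1, 3, 5, 8, 11]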
And now, all you need to do is join the results and write them:
with open(outpath, 'w') as f:
    f.write(','.join(map(str, interpolate(numbers))))
Are there any advantages to this solution? Well, other than the read/split and join/write, it's purely lazy. And we can write lazy split and join functions pretty easily (or just do it manually). So if you ever had to deal with a billion comma-separated numbers instead of a thousand, that's all you'd have to change.
Here's a lazy split:
def isplit(s, sep):
    start = 0
    while True:
        nextpos = s.find(sep, start)
        if nextpos == -1:
            yield s[start:]
            return
        yield s[start:nextpos]
        start = nextpos + 1
And you can use an mmap as a lazily-read string (well, bytes, but our data are pure ASCII, so that's fine):
import mmap

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        numbers = map(int, isplit(mm, b','))
And let's use a different solution for lazy writing, just for variety:
def icsvwrite(f, seq, sep=','):
    first = next(seq, None)
    if not first:
        return
    f.write(first)
    for value in seq:
        f.write(sep)
        f.write(value)
So, putting it all together:
with open(inpath, 'rb') as inf, open(outpath, 'w') as outf:
    with mmap.mmap(inf.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        numbers = map(int, isplit(mm, b','))
        icsvwrite(outf, map(str, interpolate(numbers)))
But, even though I was able to slap this together pretty quickly, and all of the pieces are nicely reusable, I'd still probably use NumPy for your specific problem. You're not going to read a row of a billion numbers. You already have NumPy installed on the only machine that's ever going to run this script. The cost of importing it every 8 seconds is minor (and you could avoid it by just having the script sleep between runs). So, it's hard to beat an elegant 3-line solution.
Since you suggested a pandas solution, here is one possibility:
import pandas as pd
import numpy as np
l = [1,4,11,9,14,21]
n = len(l)
df = pd.DataFrame(l, columns = ["l"]).reindex(np.linspace(0, n-1, 2*n-1)).interpolate().astype(int)
print(df)
It feels unnecessarily complicated, though. I tagged this with pandas so that people more familiar with pandas functionality will see it.

Fastest way to compute e^x?

What is the fastest way to compute e^x, given that x can be a floating point value?
Right now I have used Python's math library to compute this. Below is the complete code, where result = -0.490631 + 0.774275 * math.exp(0.474907 * sum) is the main logic; the rest is file handling code which the problem demands.
import math
import sys
def sum_digits(n):
    r = 0
    while n:
        r, n = r + n % 10, n // 10
    return r

def _print(string):
    fo = open("output.txt", "w+")
    fo.write(string)
    fo.close()

try:
    f = open('input.txt')
except IOError:
    _print("error")
    sys.exit()

data = f.read()
num = data.split('\n', 1)[0]

try:
    val = int(num)
except ValueError:
    _print("error")
    sys.exit()

sum = sum_digits(int(num))
f.close()

if sum == 2:
    _print("1")
else:
    result = -0.490631 + 0.774275 * math.exp(0.474907 * sum)
    _print(str(math.ceil(result)))
The rvalue of result is the equation of a curve (which is the solution to a programming problem) that I derived in Wolfram Mathematica using my own data set.
But this doesn't seem to pass the par criteria of the assessment!
I have also tried the Newton-Raphson way, but the convergence for larger x is causing problems; other than that, calculating the natural log ln(x) is a challenge there as well.
I don't have any language constraint, so any solution is acceptable. Also, if Python's math library is the fastest, as some of the comments say, can anyone give an insight into the time complexity and execution time of this program, in short, the efficiency of the program?
I don't know if the exponential curve math is accurate in this code, but it certainly isn't the slow point.
First, you read the input data in one read call. It does have to be read, but that loads the entire file. The next step takes the first line only, so it would seem more appropriate to use readline. That split itself is O(n) where n is the file size, at least, which might include data you were ignoring since you only process one line.
Second, you convert that line into an int. This probably requires Python's long integer support, but the operation could be O(n) or O(n^2). A single pass algorithm would multiply the accumulated number by 10 for each digit, allocating one or two new (longer) longs each time.
Third, sum_digits breaks that long int down into digits again. It does so using division, which is expensive, and two operations as well, rather than using divmod. That's O(n^2), because each division has to process every higher digit for each digit. And it's only needed because of the conversion you just did.
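For reference, a small sketch of the divmod variant mentioned here (same result as sum_digits above, but with one combined quotient/remainder step per digit):

def sum_digits_divmod(n):
    r = 0
    while n:
        n, d = divmod(n, 10)   # quotient and remainder in one operation
        r += d
    return r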
Summing the digits found in a string is likely easier done with something like sum(int(c) for c in l if c.isdigit()) where l is the input line. It's not particularly fast, as there's quite a bit of overhead in the digit conversions and the sum might grow large, but it does make a single pass with a fairly tight loop; it's somewhere between O(n) and O(n log n), depending on the length of the data, because the sum might grow large itself.
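A minimal sketch of that approach, combined with the readline suggestion above (the file name matches the question's code; the variable names are illustrative):

with open('input.txt') as f:
    line = f.readline()        # only the first line is needed
digit_sum = sum(int(c) for c in line if c.isdigit())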
As for the unknown exponential curve, the existence of an exception for a low number is concerning. There's likely some other option that's both faster and more accurate if the answer's an integer anyway.
Lastly, you have at least four distinct output data formats: error, 2, 3.0, 3e+20. Do you know which of these is expected? Perhaps you should be using formatted output rather than str to convert your numbers.
One extra note: if the data is really large, processing it in chunks will definitely speed things up (instead of running out of memory, needing to swap, etc.). As you're looking for a digit sum, your space complexity can be reduced from O(n) to O(log n).
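A minimal sketch of that chunked processing (digit_sum_chunked is a hypothetical helper, not from the answer; it stops at the first newline to match the "first line only" behaviour above):

def digit_sum_chunked(path, chunk_size=1 << 16):
    total = 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            end = chunk.find('\n')
            if end != -1:                 # the first line ends inside this chunk
                total += sum(int(c) for c in chunk[:end] if c.isdigit())
                break
            total += sum(int(c) for c in chunk if c.isdigit())
    return total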

Generating 100 million integers using Python

I am writing an application to generate 10 thousand to 100 million integers, and I am unsure whether a .txt file is the right representation in which to hold them. Below is my code:
import random
def printrandomInts(n, file):
    for i in range(n):
        x = random.random()
        x = x * 10000000
        x = int(x)
        file.write(str(x))
        file.write("\n")
    file.close()
file = open("10k","w")
n = 10000
printrandomInts(n,file)
file = open("100k","w")
n*=10
printrandomInts(n,file)
file = open("1M","w")
n*=10
printrandomInts(n,file)
file = open("10M","w")
n*=10
printrandomInts(n,file)
file = open("100M","w")
printrandomInts(n*10,file)
When I run the above code, the size of the largest file Windows reports is 868,053 KB. Should I use a binary representation to store the integers more efficiently? I also have to generate similar data for floats and strings. What should I do to make things more space efficient?
If all you want are the counts for later analysis, you can use @TomKarzes's idea of counting the numbers as you generate them, along with the pickle module for storing the counts:
import random, pickle

counts = [0] * 10000000
for i in range(100000000):
    num = random.randint(0, 9999999)
    counts[num] += 1

pickle.dump(bytes(counts), open('counts.p', 'wb'))
The file counts.p is just 9.53 MB on my Windows box -- an impressive average of a little less than 1 byte per number (the overwhelming majority of the counts will be between 5 and 15, so the stored numbers are on the smallish side).
To load them:
counts = pickle.load(open('counts.p','rb'))
counts = [int(num) for num in counts]
A final remark: I used bytes(counts) rather than simply counts in the pickle dump because the chance of any count being greater than 255 is vanishingly small in this problem. If in some other scenario the counts could be larger, skip this step.
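A minimal sketch of that fallback, plus a middle-ground packing; the array variant is my own suggestion, not part of the original answer:

import array, pickle

counts = [0] * 10000000   # the per-number tallies, as built in the snippet above

# option 1: skip the bytes() step and pickle the plain list (a noticeably larger file)
pickle.dump(counts, open('counts_list.p', 'wb'))

# option 2 (assumption): pack into unsigned 16-bit ints, which stays compact
# as long as every count is below 65536
pickle.dump(array.array('H', counts), open('counts_u16.p', 'wb'))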

Python: Number ranges that are extremely large?

import time
from random import shuffle

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only a small fraction of the values from the range, e.g., to get k random values from the [0, n) range:
import random
from functools import partial
def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are getting billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
    pop = range(1, maximum + 1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
