Generating 100 million integers using Python

I am writing an application to generate 10 thousand to 100 million integers and I am unsure whether a .txt file is the right representation to hold the integers. Below is my code:
import random

def printrandomInts(n, file):
    for i in range(n):
        x = random.random()
        x = x * 10000000
        x = int(x)
        file.write(str(x))
        file.write("\n")
    file.close()

file = open("10k", "w")
n = 10000
printrandomInts(n, file)

file = open("100k", "w")
n *= 10
printrandomInts(n, file)

file = open("1M", "w")
n *= 10
printrandomInts(n, file)

file = open("10M", "w")
n *= 10
printrandomInts(n, file)

file = open("100M", "w")
printrandomInts(n * 10, file)
When I run the above code, the size of the largest file Windows reports is 868,053 KB. Should I use a binary representation to store the integers more efficiently? I also have to generate similar data for floats and strings. What should I do to make things more space efficient?

If all you want are the counts for later analysis, you can use @TomKarzes's idea of counting them as you generate them, along with the pickle module for storing them:
import random, pickle

counts = [0] * 10000000
for i in range(100000000):
    num = random.randint(0, 9999999)
    counts[num] += 1

pickle.dump(bytes(counts), open('counts.p', 'wb'))
The file counts.p is just 9.53 MB on my Windows box -- an impressive average of a little less than 1 byte per number (the overwhelming majority of the counts will be between 5 and 15, so the stored numbers are on the smallish side).
To load them:
counts = pickle.load(open('counts.p','rb'))
counts = [int(num) for num in counts]
A final remark -- I used bytes(counts) rather than simply counts in the pickle dump because the chance of any count being greater than 255 is vanishingly small in this problem. If in some other scenarios the counts could be larger, skip this step.
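On the binary-representation part of the question: a minimal sketch with the standard array module is shown below (not from the answer above; the file name and helper names are made up for illustration). It assumes the values fit in a signed 32-bit int, which they do here since they are all below 10,000,000; each value then takes a fixed 4 bytes on typical platforms, so 100 million integers come to roughly 400 MB instead of ~850 MB of decimal text. The struct module or numpy offer similar fixed-width options for floats.
import array
import os
import random

def write_random_ints_binary(n, path):
    # 'i' = signed int, typically 4 bytes per value on disk
    nums = array.array('i', (random.randrange(10000000) for _ in range(n)))
    with open(path, 'wb') as f:
        nums.tofile(f)

def read_ints_binary(path):
    nums = array.array('i')
    with open(path, 'rb') as f:
        nums.fromfile(f, os.path.getsize(path) // nums.itemsize)
    return list(nums)

write_random_ints_binary(10000, '10k.bin')   # hypothetical file name
print(len(read_ints_binary('10k.bin')))      # 10000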


Is there a way to estimate the size of file to be written based on the pandas dataframe that holds the data?

I am extracting data from a table in a database and writing it to a CSV file in a Windows directory using pandas and Python.
I want to partition the data and split it into multiple files if the file size exceeds a certain threshold.
So, for example, if that threshold is 32 MB and my CSV data file is going to be less than 32 MB, I will write the data to a single CSV file.
But if the file size would exceed 32 MB, say 50 MB, I would split the data and write two files: one of 32 MB and the other of (50-32) = 18 MB.
The only thing I found is how to measure the memory a DataFrame occupies, using the memory_usage method or Python's getsizeof function. But I am not able to relate that in-memory size to the actual size of the data file; the in-process memory is generally 5-10 times greater than the file size.
Appreciate any suggestions.
Do some checks in your code: write a portion of the DataFrame as CSV to an io.StringIO() object and examine the length of that object; use the percentage it is over or under your goal to redefine the DataFrame slice; repeat; when satisfied, write to disk, then use that slice size to write the rest.
Something like...
from io import StringIO

g = StringIO()
n = 100
limit = 3000
tolerance = .90

while True:
    data[:n].to_csv(g)
    p = g.tell() / limit
    print(n, g.tell(), p)
    if tolerance < p <= 1:
        break
    else:
        n = int(n / p)
        g = StringIO()
        if n >= nrows:
            break
    _ = input('?')

# with open(somefilename, 'w') as f:
#     g.seek(0)
#     f.write(g.read())
# some type of loop where successive slices of size n are written to a new file:
# [0n:1n], [1n:2n], [2n:3n] ...
Caveat: the docs for .tell() say:
Return the current stream position as an opaque number. The number does not usually represent a number of bytes in the underlying binary storage
My experience is that .tell() at the end of the stream is the number of bytes for an io.StringIO object -- I must be missing something. Maybe if it contains multibyte unicode characters it is different.
Maybe it is safer to use the length of the CSV string for testing, in which case the io.StringIO object is not needed. This is probably better/simpler; if I had thoroughly read the docs first I would not have proposed the io.StringIO version.
n = 100
limit = 3000
tolerance = .90

while True:
    q = data[:n].to_csv()
    p = len(q) / limit
    print(f'n:{n}, len(q):{len(q)}, p:{p}')
    if tolerance < p <= 1:
        break
    else:
        n = int(n / p)
        if n >= nrows:
            break
    _ = input('?')
Another caveat: if the number of characters in each row of the first n rows varies significantly from the other n-sized slices, it is possible to overshoot or undershoot your limit if you don't test and adjust each slice before you write it.
setup for example:
import numpy as np
import pandas as pd
nrows = 1000
data = pd.DataFrame(np.random.randint(0,100,size=(nrows, 4)), columns=list('ABCD'))
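As a variation on the same idea (a rough sketch, not part of the answer above): serialize a small sample to CSV, derive an average bytes-per-row figure, and size the slices from that. The helper name estimate_rows_per_file and the part_*.csv file names are made up for illustration, and the estimate inherits the caveat above about rows of varying width.
import numpy as np
import pandas as pd

def estimate_rows_per_file(df, limit_bytes, sample_rows=100):
    # Estimate how many rows fit under limit_bytes, based on a small CSV sample.
    header_len = len(df.head(0).to_csv(index=False))
    sample = df.head(sample_rows)
    bytes_per_row = (len(sample.to_csv(index=False)) - header_len) / max(len(sample), 1)
    return max(int((limit_bytes - header_len) / bytes_per_row), 1)

# usage with the example DataFrame from the setup above
rows_per_file = estimate_rows_per_file(data, limit_bytes=3000)
for i, start in enumerate(range(0, nrows, rows_per_file)):
    data[start:start + rows_per_file].to_csv(f'part_{i}.csv', index=False)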

Constraining frequencies when building a list of lists of integers

I am trying to write a function which will return a list of lists of integers corresponding to pools that I will pool chemicals in. I want to keep the number of chemicals in each pool as uniform as possible. Each chemical is replicated some number of times across pools (in this example, 3 times across 9 pools). For the example below, I have 31 chemicals. Thus, each pool should have 10.333 drugs in it (or, more specifically, each pool should have floor(93/9) = 10 drugs, with 93 % 9 = 3 pools having 11 drugs). My function for doing so is below. Currently, I'm trying to get the function to loop until there is one set of integers left (i.e. 3 pools with 9 chemicals) so that I can code the function to recognize which pools are allowed one more chemical and finalize the list of lists that tells me which pools to put each chemical in.
However, as written right now, the function will not always give my desired distribution of 11,11,11,10,10,10,9,9,9 for the frequencies of integers appearing in the list of lists. I've written the following rules to attempt to constrain the distribution:
1) Randomly select, without replacement, a list of bits (pool numbers). If any of the bits in the selected list have frequency >= 10 in the output list and I already have 3 pools with frequency 11, discard this list of bits.
2) If any of the bits in the selected list have frequency >= 9 in the output list, and there are 6 pools with frequency >= 10, discard this list of bits.
3) If any of the bits in the selected list have frequency >= 11 in the output list, discard this list of bits.
It seems that this bit of code isn't working properly. I'm thinking it's related either to me improperly coding these three conditions (it appears that some lists of bits are being accidentally discarded while others are improperly added to the output list), or to a scenario in which two pools go from 9 to 10 chemicals in the same step, resulting in 4 pools of 10 instead of 3 pools of 10. Am I thinking about this problem wrong? Is there an obvious place where my code isn't working?
The function for generating normalized pools:
(overlapping_kbits returns a list of lists of bits, each of length replicates, with each bit being an integer pool number, filtered such that no two lists have more than overlaps elements in common.)
import numpy as np
import pandas as pd
import itertools
import re
import math
from collections import Counter

def normalized_pool(pools, replicates, overlaps, ligands):
    solvent_bits = [list(bits) for bits in itertools.combinations(range(pools), replicates)]
    print(len(solvent_bits))
    total_items = ligands * replicates
    norm_freq = math.floor(total_items / pools)
    num_extra = total_items % pools
    num_norm = pools - 3
    normed_bits = []
    count_extra = 0
    count_norm = 0
    while len(normed_bits) < ligands - 1 and len(solvent_bits) > 0:
        rand = np.random.randint(0, len(solvent_bits))
        bits = solvent_bits.pop(rand)  # sample without replacement
        print(bits)
        bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
        print(bin_freqs)
        previous = len(normed_bits)
        # Constrain the frequency distribution
        count_extra = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq + 1])
        count_norm = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq])
        if any(bin_freqs[bit] >= norm_freq for bit in bits) and count_extra == num_extra:
            print('rejected')
            continue  # i.e. only allow num_extra bits to have a frequency higher than norm_freq
        elif any(bin_freqs[bit] >= norm_freq + 1 for bit in bits):
            print('rejected')
            continue  # i.e. never allow any bit to exceed norm_freq + 1
        elif any(bin_freqs[bit] >= norm_freq - 1 for bit in bits) and count_norm >= num_norm:
            if count_extra == num_extra:
                print('rejected')
                continue  # only num_norm bins can have norm_freq
        normed_bits.append(bits)
        bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
    return normed_bits

test_bits = normalized_pool(9, 3, 2, 31)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
print(len(test_bits))
I can get anything from 11,11,11,10,10,10,9,9,9 (my desired output) to 11,11,11,10,10,10,10,10,7. For a minimal example, try:
test_bits = normalized_pool(7,3,2,10)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
Which should return 5,5,4,4,3,3,3 as the elements of the test_freqs Counter.
EDIT: Modified the function so it can run from being copied and pasted. Merged the function call into the larger block of code since it was being overlooked.
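For reference, the target counts themselves can be computed up front rather than discovered by trial and error. The helper below is only a sketch of that calculation (the name target_distribution is made up, and it does not fix the sampling logic in the function above); it gives the counts once all ligands chemicals are placed, whereas the 11,11,11,10,10,10,9,9,9 distribution quoted above appears to be the state one chemical earlier, since the loop stops at ligands - 1.
def target_distribution(pools, replicates, ligands):
    # Per-pool counts once every chemical has been assigned to `replicates` pools.
    total_items = ligands * replicates   # 31 * 3 = 93 placements
    base = total_items // pools          # 93 // 9 = 10
    extra = total_items % pools          # 93 % 9 = 3 pools get one extra
    return [base + 1] * extra + [base] * (pools - extra)

print(target_distribution(9, 3, 31))  # [11, 11, 11, 10, 10, 10, 10, 10, 10]
print(target_distribution(7, 3, 10))  # [5, 5, 4, 4, 4, 4, 4]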

Fastest way to compare two huge CSV files in Python (numpy)

I am trying to find the intersection subset between two pretty big CSV files of phone numbers (one has 600k rows, the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1D numpy arrays, and then using numpy's intersect1d to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)
I will give you a general solution with some Python pseudo code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how long (how many digits) a phone number is.
Let's say a phone number is at most 10 digits long; then the maximum phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits to identify each possible number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
having the number 659-234-4567 would be addressed by bits[6592344567]
Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1192 MB, i.e. around 1.2 GB of memory.
Holding the intersection at the end also costs memory: at most 600k ints will be stored, i.e. 64 bits * 600k ≈ 4.6 MB of raw data (Python ints are not actually stored that efficiently and may use much more); if these are kept as strings you'll probably end up with even higher memory requirements.
Parsing a phone number string from a CSV file (line by line or with a buffered file reader), converting it to a number and then doing a constant-time memory lookup will, IMO, be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with the file open function and retrieving
# the next found phone number
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get the generator with numbers from it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True
    # open second file and check if the number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)
I used BitArray from the bitstring pip package and it needed around 2 secs to initialize the entire bitstring. Afterwards, scanning the file will use constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to just use a list. In that case, in the second loop, as soon as a number matches, its bit must be reset so that duplicates do not match again.
Note 2: Storing in the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
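A rough sketch of that set-based approach (assuming each line of the files holds one phone number; the output file name intersection.txt is made up):
# Build a set from the small file, then stream the big file against it.
with open('dncTest.csv') as small:
    dnc = {line.strip() for line in small if line.strip()}

with open('phoneTest.csv') as big, open('intersection.txt', 'w') as out:
    for line in big:
        number = line.strip()
        if number in dnc:
            out.write(number + '\n')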

Python - file processing - memory error - speed up the performance

I'm dealing with huge numbers. I have to write them into a .txt file. Right now I have to write all the numbers between 1,000,000 and 1,000,000,000 (1M-1B) into a .txt file. Since it throws a memory error if I do it in a single list, I sliced them (I don't like this solution, but couldn't find any other).
The problem is that even with just the first 50M numbers (1M-50M) I can't open the .txt file. It's 458 MB and took around 15 minutes to write, so I guess the full file will be around 9 GB and take more than 4 hours to write.
When I try to open the .txt file containing the numbers between 1M-50M I get:
myfile.txt has stopped working
So right now the file contains the numbers between 1M-50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.
I have to shuffle the numbers between 1M-1B and store them in a .txt file. Basically it's a freelance job and I'll have to deal with bigger numbers, like 100B, etc. Even the first 50M has this problem, and I don't know how to finish when the numbers are bigger.
Here is the code for 1M-50M:
import random

x = 1000000
y = 10000000
while x < 50000001:
    nums = [a for a in range(x, x + y)]
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:
        for z in nums:
            f.write(str(z) + "\n")
    x += 10000000
How can I speed up this process?
How can I open this .txt file? Should I create a new file every time? If I choose this option I have to slice the numbers more, since even 50M numbers are a problem.
Is there any module you can suggest that may be useful for this process?
Is there any module you can suggest that may be useful for this process?
Using Numpy is really helpful for working with large arrays.
How can I speed up this process?
Using Numpy's functions arange and tofile dramatically speeds up the process (see code below). Generation of the initial array is about 50 times faster and writing the array to a file is about 7 times faster.
The code just performs each operation once (change number=1 to a higher value to get better accuracy) and only generates the numbers between 1M and 2M, but you can see the general picture.
import random
import timeit
import numpy

x = 10**6
y = 2 * 10**6

def list_rand():
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')

print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))

print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))

print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))

print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))
list generation, random [secs]
1.3995
numpy array generation, random [secs]
0.0319
standard write [secs]
2.5745
numpy write [secs]
0.3622
How can I open this .txt file? Should I create a new file every time? If I choose this option I have to slice the numbers more, since even 50M numbers are a problem.
It really depends on what you are trying to do with the numbers. Find their relative position? Delete one from the list? Restore the array?
I can't help you with the Python, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125 MB. Generate a random number; if it is not present in the bit array yet, add it there and write it to the file. Repeat until you have 99% of the numbers in the file.
Now convert the unused numbers in the bit array into an ordinary array -- it would be about 80 MB. Shuffle them and write them to the file.
You need about 200 MB of memory for 1E9 items (and it took 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20 GB of RAM in less than a day.
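A rough Python sketch of that bit-array idea follows (the timings above come from a C# implementation; pure Python will be much slower at 1E9 items, so this only illustrates the structure, with the range [0, N), the 99% cut-off and the file name chosen for the example):
import random

N = 10**9                        # size of the consecutive sequence to shuffle
bits = bytearray(N // 8 + 1)     # ~125 MB, one bit per candidate number

def test_and_set(i):
    # Return whether bit i was already set, and set it.
    byte, mask = i >> 3, 1 << (i & 7)
    seen = bits[byte] & mask
    bits[byte] |= mask
    return seen

with open("shuffled.txt", "w") as f:
    # phase 1: draw random numbers until 99% of the range has been written
    written, target = 0, int(N * 0.99)
    while written < target:
        i = random.randrange(N)
        if not test_and_set(i):
            f.write(str(i) + "\n")
            written += 1
    # phase 2: collect the ~1% of numbers never drawn, shuffle, append
    leftovers = [i for i in range(N) if not bits[i >> 3] & (1 << (i & 7))]
    random.shuffle(leftovers)
    f.writelines(str(i) + "\n" for i in leftovers)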

Python: Number ranges that are extremely large?

import time
from random import shuffle

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only a small fraction of values from the range, e.g., to get k random values from the range [0, n):
import random
from functools import partial

def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result

print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are getting billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random

def get10kRandomNumbers(maximum):
    pop = range(1, maximum + 1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
