Finding out who else is referring, big data - python

I have 50 million rows of data like:
referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003
The goal is to find all the cases that share incoming connections; a numerical example should help:
If we want to calculate the measure for 1001, we can see that it has an incoming connection from 1000, so we look at who else has an incoming connection from 1000, and that is 1002.
So the result would be [1002].
For 1002 we can see that 1000 and 1001 are referring to it, so we look at who else they refer to; the result is [1001,1000] (1000 refers to 1001, 1001 refers to 1000).
If the data were smaller, I would just store a set of outgoing connections for every referring id, and then loop over the referred ids and take the union of their referrers' outgoing sets.
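For reference, a minimal sketch of that small-data approach (plain Python, dicts of sets; the column names follow the sample above):
import csv
from collections import defaultdict

outgoing = defaultdict(set)   # referring_id -> set of referred_ids
incoming = defaultdict(set)   # referred_id  -> set of referring_ids

with open('data.csv') as fobj:
    for row in csv.DictReader(fobj):
        ref, tgt = int(row['referring_id']), int(row['referred_id'])
        outgoing[ref].add(tgt)
        incoming[tgt].add(ref)

def shares_incoming(node):
    # union of everything the referrers of `node` also refer to, minus `node` itself
    result = set()
    for referrer in incoming[node]:
        result |= outgoing[referrer]
    result.discard(node)
    return result

print(shares_incoming(1001))  # {1002}
print(shares_incoming(1002))  # {1000, 1001}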
The problem is that this doesn't fit in memory.
I'm using the csv module to loop over the file and process the lines one at a time so as not to load it all into memory, even though I have 16 GB of RAM.
Does anyone have an idea how to handle it?

You should give pandas a try. It uses NumPy arrays to store the data. This can help to save memory. For example, an integer has the size of 8 bytes instead of 24 in Python 2 or 28 in Python 3. If the numbers are small, you might be able to use np.int16 or np.int32 to reduce the size to 2 or 4 bytes per integer.
This solution seems to fit your description:
s = """referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003"""
import csv
import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(s), dtype=np.int16)
# for the real file use: df = pd.read_csv('data.csv', dtype=np.int16)
by_refered = df.groupby('referred_id')['referring_id'].apply(frozenset)
by_refering = df.groupby('referring_id')['referred_id'].apply(frozenset)
with open('connections.csv', 'w', newline='') as fobj:
    writer = csv.writer(fobj)
    writer.writerow(['id', 'connections'])
    for x in by_refered.index:
        # union of everything that x's referrers also refer to, minus x itself
        tmp = set()
        for id_ in by_refered[x]:
            tmp.update(by_refering[id_])
        tmp.remove(x)
        writer.writerow([x] + list(tmp))
Content of connections.csv:
id,connections
1000,1002
1001,1002
1002,1000,1001
1003
Depending on your data you might get away with this. If there are many repeated connections, the number of sets and their size may be small enough. Otherwise, you would need to use some chunked approach.
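One possible shape for such a chunked approach (only a sketch: it reads the CSV with pandas' chunksize so the whole file is never parsed at once, but the per-id sets still have to fit in memory):
import pandas as pd
from collections import defaultdict

by_referred = defaultdict(set)   # referred_id  -> set of referring_ids
by_referring = defaultdict(set)  # referring_id -> set of referred_ids

for chunk in pd.read_csv('data.csv', dtype='int32', chunksize=1_000_000):
    for referring, referred in zip(chunk['referring_id'], chunk['referred_id']):
        by_referred[referred].add(referring)
        by_referring[referring].add(referred)
The writing loop from above can then run over these two dictionaries instead of the groupby results.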

Is there a way to estimate the size of the file to be written based on the pandas DataFrame that holds the data?

I am extracting data from a table in a database and writing it to a CSV file in a Windows file directory using pandas and Python.
I want to partition the data and split it into multiple files if the file size exceeds a certain amount of memory.
So, for example, if that threshold is 32 MB and my CSV data file is going to be less than 32 MB, I will write the data to a single CSV file.
But if the file size would exceed 32 MB, say 50 MB, I would split the data and write two files, one of 32 MB and the other of (50-32)=18 MB.
The only thing I found is how to find the memory a DataFrame occupies, using the memory_usage method or Python's getsizeof function. But I am not able to relate that memory to the actual size of the data file; the in-process memory is generally 5-10 times greater than the file size.
Appreciate any suggestions.
Do some checks in your code. Write a portion of the DataFrame as CSV to an io.StringIO() object and examine the length of that object; use the percentage it is over or under your goal to redefine the DataFrame slice; repeat; when satisfied, write it to disk, then use that slice size to write the rest.
Something like...
from io import StringIO

g = StringIO()
n = 100            # initial slice size (rows)
limit = 3000       # target size in characters
tolerance = .90
while True:
    data[:n].to_csv(g)
    p = g.tell() / limit              # fraction of the target we hit
    print(n, g.tell(), p)
    if tolerance < p <= 1:
        break
    else:
        n = int(n / p)                # scale the slice size towards the target
        g = StringIO()
        if n >= nrows:
            break
    _ = input('?')                    # debug pause
# with open(somefilename, 'w') as f:
#     g.seek(0)
#     f.write(g.read())
# some type of loop where successive slices of size n are written to a new file:
# [0n:1n], [1n:2n], [2n:3n] ...
Caveat, the docs for .tell() say:
Return the current stream position as an opaque number. The number does not usually represent a number of bytes in the underlying binary storage.
My experience is that .tell() at the end of the stream is the number of bytes for an io.StringIO object, so I must be missing something. Maybe it is different if the data contains multibyte Unicode characters.
Maybe it is safer to use the length of the CSV string for testing, in which case the io.StringIO object is not needed. This is probably better/simpler. If I had thoroughly read the docs first, I would not have proposed the io.StringIO version.
n = 100
limit = 3000
tolerance = .90
while True:
    q = data[:n].to_csv()
    p = len(q) / limit
    print(f'n:{n}, len(q):{len(q)}, p:{p}')
    if tolerance < p <= 1:
        break
    else:
        n = int(n / p)
        if n >= nrows:
            break
    _ = input('?')
Another caveat: if the number of characters per row in the first n rows differs significantly from other n-sized slices, it is possible to overshoot or undershoot your limit if you don't test and adjust each slice before you write it.
setup for example:
import numpy as np
import pandas as pd
nrows = 1000
data = pd.DataFrame(np.random.randint(0,100,size=(nrows, 4)), columns=list('ABCD'))
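A minimal sketch of the final step hinted at in the commented-out loop of the first snippet (writing successive n-sized slices to numbered files; the filename pattern is just an example):
# write successive n-sized slices to part_000.csv, part_001.csv, ...
for part, start in enumerate(range(0, nrows, n)):
    data[start:start + n].to_csv(f'part_{part:03d}.csv')
Per the caveat above, slices with unusually long rows can still overshoot the limit unless each one is checked before writing.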

Write (large amount of) zeros into a binary file

This might be a stupid question, but I'm unable to find a proper answer to it. I want to store (don't ask why) a binary representation of a (2000, 2000, 2000) array of zeros on disk, in binary format. The traditional approach to achieve this would be:
with open('myfile', 'wb') as f:
    f.write(b'\0' * 4 * 2000 * 2000 * 2000)  # 4 bytes = float32
But that would imply creating a very large string which is not necessary at all. I know of two other options:
Iterate over the elements and store one byte at a time (extremely slow)
Create a numpy array and flush it to disk (as memory expensive as the string creation in the example above)
I was hoping to find something like write(char, ntimes) (as exists in C and other languages) that copies char to disk ntimes at C speed, not at Python-loop speed, without having to create such a big array in memory.
I don't know why you are making such a fuss about "Python's loop speed", but writing in the way of
for i in range(2000 * 2000):
    f.write(b'\0' * 4 * 2000)  # 4 bytes = float32
will tell the OS to write 8000 zero bytes per call. After write returns, the next loop iteration calls it again.
It might be that the loop is executed slightly slower than it would be in C, but that definitely won't make a difference.
If it is OK to have a sparse file, you can also seek to the position of the desired file size and then truncate the file.
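A minimal sketch of that sparse-file idea (on file systems that support sparse files this allocates almost no disk space up front; the size assumes float32, i.e. 4 bytes per element):
size = 4 * 2000 * 2000 * 2000
with open('myfile', 'wb') as f:
    f.seek(size)
    f.truncate()   # the file now reads back as `size` zero bytes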
This would be a valid answer to fill a file from Python using NumPy's memmap:
import numpy as np

shape = (2000, 2000, 2000)  # or just (2000 * 2000 * 2000,)
fp = np.memmap(filename, dtype='float32', mode='w+', shape=shape)  # filename: path to the output file
fp[...] = 0
To write zeroes, there's a nice hack: open the file for writing, seek to the desired size minus one, and write a single zero byte.
Everything before that position reads back as zeroes as well:
mega_size = 100000
with open("zeros.bin", "wb+") as f:
    f.seek(mega_size - 1)
    f.write(bytearray(1))   # one zero byte at the end; the rest is zero-filled
The original poster does not make clear why this really needs to be done in Python, so here is a little shell command which does the same thing, probably slightly faster, assuming the OP is on a Unix-like system (Linux / Mac).
Definitions: bs (block size) = 2000 * 2000, count = 4 * 2000.
The if (input file) is a special 'zero-producing' device. The of (output file) has to be specified as well.
dd bs=4000000 count=8000 if=/dev/zero of=/Users/me/thirtygig
On my computer (ssd) this takes about 110 seconds:
32000000000 bytes transferred in 108.435354 secs (295106705 bytes/sec)
You can always call this little shell command from Python; a sketch follows below.
For comparison, @glglgl's little Python script runs in 303 seconds on the same computer, so it is only about 3 times slower, which is probably fast enough.
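A minimal sketch of calling dd from Python with subprocess (the block size, count and output path are the ones from the command above; adjust them to your own file):
import subprocess

# writes 8000 blocks of 4,000,000 zero bytes each (32 GB in total)
subprocess.run(
    ['dd', 'bs=4000000', 'count=8000', 'if=/dev/zero', 'of=/Users/me/thirtygig'],
    check=True,
)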

Fastest way to compare two huge csv files in python(numpy)

I am trying to find the intersecting subset between two pretty big CSV files of
phone numbers (one has 600k rows, and the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1D numpy arrays, and then using numpy's intersect1d to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)
I will give you a general solution with some Python pseudo-code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how many digits the phone numbers have.
Let's say a phone number is at most 10 digits long; then the maximum phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits, one for each possible number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
having the number 659 234-4567 would be addressed by bits[6592344567]
Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1192 MB, i.e. around 1.2 GB of memory.
Holding the intersection of numbers at the end will not be a problem memory-wise: at most 600k ints will be stored => 64 bits * 600k ≈ 4.6 MB (actually a Python int is not stored that efficiently and might use much more, but this is still small next to the bit array); if these are strings you'll probably end up with somewhat higher memory requirements, but still far below the 1.2 GB.
Parsing a phone number string from the CSV file (line by line or with a buffered file reader), converting it to a number and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with opening the file and retrieving
# the next phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get the generator with numbers from it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True
    # open the second file and check whether each number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)
I used BitArray from the bitstring pip package, and it needed around 2 seconds to initialize the entire bit string. Afterwards, scanning the files uses constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to just use a list. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset so that duplicates do not match again.
Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
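A minimal sketch of the modification from Note 1 (collect into a list and reset each matched bit so duplicates in the second file are reported only once):
result = []
for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
    if found_phone_numbers[number]:
        result.append(number)
        found_phone_numbers[number] = False  # reset so a duplicate does not match again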
Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
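A minimal sketch of that approach (the file names follow the question's code; how each row is parsed depends on the actual files):
import csv

with open('dncTest.csv', newline='') as f:
    dnc_phones = {row[0] for row in csv.reader(f) if row}

with open('phoneTest.csv', newline='') as f_in, open('matches.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        if row and row[0] in dnc_phones:
            writer.writerow(row)   # write matches immediately instead of keeping them in memory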

Avoid reading all data into memory from hdf5 file when performing operations

I'm new to the HDF5 format, and I'm using h5py for these operations. I'm wondering what exactly happens when I have a very large data set and perform some kind of operation on it. For instance:
>>> import h5py
>>> import numpy as np
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100000000,), dtype=np.float64)
>>> dset[...] = np.linspace(0, 100, 100000000)
>>> myResult = f["mydataset"][:] * 15
# Graph myResult, or something
Is the entirety of myResult in memory now? Was dset also in memory? Is there a way to only read from the disk into memory part by part, like 100,000 points at a time or so, to avoid overflow (let's assume I might end up with data even far larger than the example, say, 50GB worth)? Is there a way to do so efficiently, and in a modular fashion (such that simply saying data * 100 automatically does the operation in chunks, and stores the results again in a new place)? Thanks in advance.
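A minimal sketch of the part-by-part idea described at the end of the question (slicing an h5py dataset reads only the requested slice from disk; the chunk size of 100,000 and the output dataset name are just examples):
import h5py

with h5py.File("mytestfile.hdf5", "r+") as f:
    dset = f["mydataset"]
    out = f.create_dataset("myresult", shape=dset.shape, dtype=dset.dtype)
    step = 100_000
    for start in range(0, dset.shape[0], step):
        stop = min(start + step, dset.shape[0])
        out[start:stop] = dset[start:stop] * 15   # only this slice is in memory at a time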

Fast way to read interleaved data?

I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it seems to always be a power of 2, though I don't think that's important.
So, if I have channels a, b, and c, sampled at dividers of 1, 2, and 4, my stream will look like:
a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...
For added fun, the channels can independently be floats or ints (though I know which for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big-endian and sometimes little-endian, though I know what I'm dealing with up front.
I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):
for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ...  # build format string from channels_to_sample
    data = struct.unpack(format_str, my_file.read( ... ))  # read and unpack the data
    # iterate over the data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num // ch.divider] = val
And it's slow -- a few seconds to read a 20MB file on my laptop. Profiler tells me I'm spending a bunch of time in Channel#samples_for() -- which makes sense; there's a bit of conditional logic there.
My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.
Update
Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 seconds to about 0.2 seconds, for a speedup of 50x.
The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.
First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then
N = lcm * (b_1/d_1 + ... + b_k/d_k)
will be the number of bytes after which the pattern of streams repeats. Once you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
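For the example stream in the question, for instance, with dividers 1, 2 and 4 and (assuming) 4-byte samples for every channel, lcm = 4 and N = 4 * (4/1 + 4/2 + 4/4) = 28 bytes per repeating block.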
You can now build the array of stream indices for the first N bytes by something similar to
import numpy

stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)
Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.
Now you can do
data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]
You possibly need to pad the data a bit at the end to make the reshape() work.
Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write
streams[0].dtype = ">i"
This won't change the data in the array in any way, just the way it is interpreted.
This may look a bit cryptic, but should be much better performance-wise.
Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:
for (chan, sample_data) in izip(iter_channels(), data):   # izip is itertools.izip (plain zip in Python 3)
    decoded_data = chan.decode(sample_data)
To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.
for i in itertools.count():
    for chan in channels:
        if i % chan.period == 0:
            yield chan
The grouper() recipe along with itertools.izip() should be of some help here.
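A rough sketch of how the whole one-pass read could look, assuming each channel object exposes a period (its divider), a struct format string fmt (e.g. '<f' or '>i') and a data list to append to; all of these attribute names are hypothetical:
import itertools
import struct

def iter_channels(channels):
    # emit each channel whenever the base clock is a multiple of its period
    for i in itertools.count():
        for chan in channels:
            if i % chan.period == 0:
                yield chan

def read_stream(my_file, channels):
    for chan in iter_channels(channels):
        size = struct.calcsize(chan.fmt)
        raw = my_file.read(size)
        if len(raw) < size:
            break                      # end of the stream
        (value,) = struct.unpack(chan.fmt, raw)
        chan.data.append(value)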
