Fastest way to compare two huge CSV files in Python (NumPy)

I am trying to find the intersecting subset between two pretty big CSV files of
phone numbers (one has 600k rows, the other 300 million). I am currently using pandas to open both files, converting the needed columns into 1D numpy arrays, and then using numpy's intersect1d to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)

I will give you a general solution with some Python pseudo code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how long the phone numbers are (how many digits they have).
Let's say the phone number is at most 10 digits long; then the max phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), thus we are going to use 9 999 999 999 bits to identify each number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
a number like 659-234-4567 would be addressed by bits[6592344567]
Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 bits / 8 / 1024 / 1024 = around 1.2 GB of memory.
Holding the resulting intersection at the end is cheap by comparison: at most 600k numbers will be stored => 64 bit * 600k = around 4.6 MB of raw values (a Python int is actually not stored that efficiently and may use much more), and if these are kept as strings you'll probably end up with even higher memory requirements.
Parsing a phone number string from the CSV file (line by line or with a buffered file reader), converting it to a number and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# Replace this function with one that opens the file and yields
# the next phone number found in it.
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # Should open file1 and get a generator of the numbers in it;
    # we use dummy data here.
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True
    # Open the second file and check whether each number is there.
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)
I used BitArray from the bitstring pip package; it needed around 2 seconds to initialize the entire bit array. Afterwards, scanning the files uses constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to build just a list. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset so that duplicates do not match again.
Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
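As a sketch of what the placeholder generator could look like when reading a real file (the one-number-per-line CSV layout is an assumption), something along these lines would do:
import csv

def numbers_from_csv(path):
    # Yield phone numbers as ints from a one-column CSV file, line by line.
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue
            digits = ''.join(ch for ch in row[0] if ch.isdigit())
            if digits:
                yield int(digits)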

Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
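A minimal sketch of that streaming approach, assuming one phone number per line and that dncTest.csv is the smaller (600k-row) file; the file names are just placeholders:
dnc_numbers = set()
with open('dncTest.csv') as small_file:
    for line in small_file:
        dnc_numbers.add(line.strip())

with open('phoneTest.csv') as big_file, open('matches.csv', 'w') as out:
    for line in big_file:
        number = line.strip()
        if number in dnc_numbers:   # constant-time set lookup
            out.write(number + '\n')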

Related

Python: How to sum two signed int16 arrays into one without overflow

I have several int16 streams in strings and I want to sum them together (without overflow) and return the result as an int16 string. The background is mixing several wave files into one stream.
decodeddata1 = numpy.fromstring(data, numpy.int16)
decodeddata2 = numpy.fromstring(data2, numpy.int16)
newdata = decodeddata1 + decodeddata2
return newdata.tostring()
Is there a way of doing this with numpy, or is there another library?
Processing each single value in Python is too slow and results in stutter.
The most important thing is performance, since this code is used in a callback method feeding the audio.
Edit: test input data:
a = np.int16([20000, 20000, -20000, -20000])
b = np.int16([10000, 20000, -10000, -20000])
print(a + b)  # -> [ 30000 -25536 -30000  25536]
but I want to keep the maximum levels:
[ 30000  40000 -30000 -40000]
The obvious consequence of mixing two signals, each with a dynamic range of -32768 <= x <= 32767, is a resulting signal with a range of -65536 <= x <= 65534, which requires 17 bits to represent.
To avoid clipping, you will need to gain-scale the inputs - the obvious way is to divide the sum (or both of the inputs) by 2.
numpy looks as though it should be quite fast for this - at least faster than Python's built-in variable-size integer type. If the additional arithmetic is a performance concern, you should reconsider your choice of language.
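A rough sketch of that approach (sum in a wider type, gain-scale by 2, then convert back to int16; the function and variable names are only illustrative):
import numpy as np

def mix_int16(data1, data2):
    # Interpret the raw byte strings as int16 samples and widen to int32
    # so the sum cannot overflow.
    a = np.frombuffer(data1, dtype=np.int16).astype(np.int32)
    b = np.frombuffer(data2, dtype=np.int16).astype(np.int32)
    mixed = (a + b) // 2          # gain-scale so the result fits into int16 again
    return mixed.astype(np.int16).tobytes()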

Historical database number formatting

Currently I am working with a historical database (in MS Access) containing passages of ships through the Sound (the strait between Denmark and Sweden).
I am having problems with the way the amounts of products on board ships were recorded. This generally takes one of the following forms:
12 1/15 (integer - space - fraction)
1/4 (fraction)
1 (integer)
I'd like to convert all these numbers to floats/decimals in order to do some calculations. There are some additional challenges, mainly caused by the lack of uniform input:
- not all rows have a value
- some rows have the value '-'; I'd like to skip these
- some rows contain '*' when a number or part of a number is missing; these can be skipped too
My first question is: is there a way I could do this conversion directly in Access SQL? I have not been able to find anything, but perhaps I overlooked something.
The second option I attempted is to export the table (called cargo), use Python to convert the values, and then import the table again. I have a function to convert the three standard formats:
from fractions import Fraction
import pandas
import numpy

def fracToString(number):
    conversionResult = float(sum(Fraction(s) for s in number.split()))
    return conversionResult

df = pandas.read_csv('cargo.csv', usecols=[0, 5], header=None, names=['id_passage', 'amount'])
df['amountDecimal'] = df['amount'].dropna().apply(fracToString)
This works for empty rows, however values containing '*' or '-' or other characters that the fracToString function can't handle raise a ValueError. Since these are just a couple of records out of over 4 million, they can be omitted. Is there a way to tell pandas.apply() to just skip to the next row if the fracToString function throws a ValueError?
Thank you in advance,
Alex
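One common way to handle this (a sketch, not an answer from the original thread) is to catch the exception inside the function and return NaN, which pandas simply treats as missing data:
from fractions import Fraction
import numpy as np

def fracToFloatSafe(value):
    # Same conversion as fracToString above, but unparseable values
    # ('-', '*', ...) become NaN instead of raising.
    try:
        return float(sum(Fraction(s) for s in value.split()))
    except (ValueError, ZeroDivisionError):
        return np.nan

df['amountDecimal'] = df['amount'].dropna().apply(fracToFloatSafe)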

Python MemoryError on large array

This is the python script that I'm trying to run:
import random

n = 50000000000  # 50 billion
b = [0]*n
for x in range(0, n):
    b[x] = random.randint(1, 899999)
... But the output I'm getting is:
E:\python\> python sort.py
Traceback (most recent call last):
File "E:\python\sort.py", line 8, in <module>
b = [0]*n
MemoryError
So, what do I do now?
The problem is the size of the list you are generating (which is 50 billion, not 5).
An int object instance takes 24 bytes (sys.getsizeof(int(899999)), with 899999 being the upper limit of your random numbers), so that list would take 50,000,000,000 * 24 bytes, which is about 1.09 TB.
In other words, to create such a list you would need at least 1118 GB of RAM in your computer.
I don't know what your use case is, but you should consider a different approach to whatever you are trying to solve: maybe define a generator, or simply don't store the numbers in memory and instead use each one directly inside the for loop.
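For example, a generator along these lines (a sketch using the same value range as the original code) produces the numbers one at a time without ever materialising the list:
import random

def random_numbers(count):
    # Yield `count` random integers one at a time instead of building a huge list.
    for _ in range(count):
        yield random.randint(1, 899999)

# Consume the stream without holding it in memory, e.g. as a running aggregate.
total = sum(random_numbers(1_000_000))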
Since other people already answered your question here's a quick tip when dealing with big numbers: you can use "_" to separate the digits of your numbers as you wish:
n = 50_000_000_000
is the same as
n = 50000000000
but the former is much easier on the eyes.
One other possibility is to increase your computer's virtual memory. It helped me in my code: I had a maximum of 3000 MB of virtual memory, and when I increased it to 5000 MB the memory error was gone.

Finding out who else is referring, big data

I have 50 million rows of data like:
referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003
The goal is to find all the cases that share incoming connections; a numerical example should help:
If we want to calculate the measure for 1001, we can see that it has an incoming connection from 1000, so we look at who else has an incoming connection from 1000, and that is 1002.
So the result would be [1002].
For 1002 we can see that 1000 and 1001 refer to it, so we look at who else they refer to; the result is [1001, 1000] (1000 refers to 1001, 1001 refers to 1000).
If it were smaller data, I would just store, for every referring id, a set of its outgoing connections, and then loop over the referred ids and take the union over all those that have incoming connections.
The problem is that this doesn't fit in memory.
I'm using csv to loop over the file and process the lines one at a time so as not to load it all into memory, even though I have 16 GB of RAM.
Does anyone have an idea how to handle it?
You should give pandas a try. It uses NumPy arrays to store the data, which can help to save memory: an integer takes 8 bytes instead of 24 in Python 2 or 28 in Python 3. If the numbers are small, you might be able to use np.int16 or np.int32 to reduce the size to 2 or 4 bytes per integer.
This solution seems to fit your description:
s = """referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003"""
import csv
import numpy as np
pd.read_csv(io.StringIO(s), dtype=np.int16)
# use: df = pd.read_csv('data.csv', dtype=np.int16)
by_refered = df.groupby('referred_id')['referring_id'].apply(frozenset)
by_refering = df.groupby('referring_id')['referred_id'].apply(frozenset)
with open('connections.csv', 'w') as fobj:
writer = csv.writer(fobj)
writer.writerow(['id', 'connections'])
for x in by_refered.index:
tmp = set()
for id_ in by_refered[x]:
tmp.update(by_refering[id_])
tmp.remove(x)
writer.writerow([x] + list(tmp))
Content of connections.csv:
id,connections
1000,1002
1001,1002
1002,1000,1001
1003
Depending on your data you might get away with this. If there are many repeated connections, the number of sets and their sizes may be small enough. Otherwise, you would need some chunked approach, as sketched below.
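A rough sketch of such a chunked variant (not part of the original answer): read the raw rows with pandas' chunksize and only keep the aggregated sets in memory; the output loop above can then iterate over by_refered.keys() instead of by_refered.index.
import collections
import pandas as pd

by_refering = collections.defaultdict(set)
by_refered = collections.defaultdict(set)

# Stream the 50M rows in manageable pieces; only the aggregated sets stay resident.
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    for referring, referred in zip(chunk['referring_id'], chunk['referred_id']):
        by_refering[referring].add(referred)
        by_refered[referred].add(referring)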

Get the number of zeros and ones of a binary number in Python

I am trying to solve a binary puzzle. My strategy is to transform the grid into zeros and ones, and I want to make sure that every row has the same number of 0s and 1s.
Is there any way to count how many 1s and 0s a number has without iterating over its digits?
What I am currently doing is:
def binary(num, length=4):
    return format(num, '#0{}b'.format(length + 2)).replace('0b', '')

n = binary(112, 8)
# '01110000'
and then
n.count('0')
n.count('1')
Is there any more computationally efficient (or mathematical) way of doing that?
What you're looking for is the Hamming weight of a number. In a lower-level language, you'd probably use a nifty SIMD within a register trick or a library function to compute this. In Python, the shortest and most efficient way is to just turn it into a binary string and count the '1's:
def ones(num):
    # Note that bin() is a built-in
    return bin(num).count('1')
You can get the number of zeros by subtracting ones(num) from the total number of digits.
def zeros(num, length):
    return length - ones(num)
Demonstration:
>>> bin(17)
'0b10001'
>>> # leading 0b doesn't affect the number of 1s
>>> ones(17)
2
>>> zeros(17, length=6)
4
If the length is moderate (say less than 20 bits), you can use a list as a lookup table.
It's only worth generating the list if you're doing a lot of lookups, but it seems you might be in this case.
e.g. for 16-bit tables of the zero and one counts, use this:
zeros = [format(n, '016b').count('0') for n in range(1<<16)]
ones = [format(n, '016b').count('1') for n in range(1<<16)]
20 bits still takes under a second to generate on this computer
Edit: this seems slightly faster:
zeros = [20 - bin(n).count('1') for n in range(1<<20)]
ones = [bin(n).count('1') for n in range(1<<20)]
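Looking up a count is then just a list index (using the 20-bit tables above):
>>> n = 0b10110010   # 178
>>> ones[n]
4
>>> zeros[n]         # zero count relative to the 20-bit width
16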
