Computing Hamming distances on a large data set - Python

I have an input file of about 10^5 rows.
Each row is a sequence of 24 bits, e.g.:
1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0
I need to compute the Hamming distance for every pair of rows.
Here's my first implementation using SciPy hamming function:
import csv
from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # Number of bits
    for nodeNum, node in enumerate(reader):
        nodes[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        distance = hamming(uBits, vBits) * b
        # Do stuff
The second implementation I came up with:
nodes[nodeNum] = sum(int(bit) * 2**power for power, bit in enumerate(node))
Here I only store each row's decimal value, but I then have to manually count the set bits resulting from each XOR:
def hamming(a, b):
    # Count the set bits of a ^ b (Brian Kernighan's trick:
    # each iteration clears the lowest set bit).
    N = a ^ b
    distance = 0
    while N:
        N &= N - 1
        distance += 1
    return distance
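(Side note, assuming a recent interpreter: on Python 3.10 or newer, integers expose a native popcount, so the whole helper reduces to one line:)
def hamming(a, b):
    # int.bit_count() is available from Python 3.10 onward.
    return (a ^ b).bit_count()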
How can I improve my code (both in terms of memory usage and runtime, ideally)?

This may be the fastest you can do (watch out: for your data size it would require allocating 74.5 GiB of memory):
import csv
import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.array(nodes)

# For 0/1 rows, 2 * inner(nodes - 0.5, 0.5 - nodes) equals d - n/2,
# so adding n/2 yields the Hamming distance d.
dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2
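If allocating the full matrix is not an option, a variation on the same trick is to produce the distance matrix in row blocks and consume each block as it is produced; this is only a sketch, assuming you can process the distances as a stream rather than keep them all:
import numpy as np

def pairwise_hamming_blocks(nodes, block=1024):
    """Yield (start, dists) where dists holds the Hamming distances
    between nodes[start:start+block] and all nodes."""
    nodes = np.asarray(nodes, dtype=np.float32)
    n_bits = nodes.shape[1]
    for start in range(0, len(nodes), block):
        chunk = nodes[start:start + block]
        # Same identity as above, restricted to one block of rows.
        dists = 2 * np.inner(chunk - 0.5, 0.5 - nodes) + n_bits / 2
        yield start, dists

# Usage: peak memory is ~block * n entries instead of n * n.
# for start, d in pairwise_hamming_blocks(nodes):
#     ...  # do stuff with d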
Just for fun, here is a 40X faster version in Julia:
using LoopVectorization, Tullio

function hamming!(nodes, dists)
    @tullio dists[i,j] = sum(nodes[i,k] ⊻ nodes[j,k])
end

n = 10^5
nodes = rand(Int8[0,1], n, 24)
dists = Matrix{Int8}(undef, n, n)

@time hamming!(nodes, dists)  # Run twice
# 1.886367 seconds (114 allocations: 6.594 KiB)
While we're at it, I invite you to enter the world of Julia. It offers speeds similar to C++ with a pleasant, Python-like syntax.

Why not just put the whole .csv into an array and then let scipy do all the work of computing pairwise distances?
import csv
import numpy as np
import scipy.spatial.distance

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.array(nodes)  # not strictly necessary

# 'hamming' returns the *fraction* of disagreeing positions,
# so multiply by 24 to get bit counts.
dists = scipy.spatial.distance.pdist(nodes, 'hamming')
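One usage note: pdist returns a condensed 1-D array holding the n*(n-1)/2 unique pair distances; scipy.spatial.distance.squareform expands it into a square matrix if you need plain two-index lookups:
from scipy.spatial.distance import squareform

square = squareform(dists) * 24  # back to bit counts
print(square[0, 1])              # Hamming distance between rows 0 and 1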

Related

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed. Works for first two loops

Let me start by saying that I know this error message has posts about it, but I'm not sure what's wrong with my code. The block of code works just fine for the first two loops, but then fails. I've even tried removing the first two loops' worth of data to rule out issues in the third loop, but no luck. I had it print out the unsorted temporary list, and it just prints an empty array for the third loop.
Sorry for the wall of comments in my code, but I'd rather have each line commented than cause confusion over what I'm trying to accomplish.
TL;DR: I'm trying to find and remove outliers from a list of data, but only for groups of entries that have the same number in column 0.
Pastebin with data
import numpy as np, csv, multiprocessing as mp, mysql.connector as msc, pandas as pd
import datetime

#Declare unsorted data array
d_us = []
#Declare temporary array for use in loop
tmp = []
#Declare sorted data array
d = []
#Declare Sum variable
tot = 0
#Declare Mean variable
m = 0
#declare sorted final array
sort = []
#Declare number of STDs
t = 1
#Declare Standard Deviation variable
std = 0
#Declare z-score variable
z_score = 0
#Timestamp for output files
nts = datetime.datetime.now().timestamp()
#Create output file
with open(f"calib_temp-{nts}.csv", 'w') as ctw:
    pass
#Read data from CSV
with open("test.csv", 'r', newline='') as drh:
    fr_rh = csv.reader(drh, delimiter=',')
    for row in fr_rh:
        #append data to unsorted array
        d_us.append([float(row[0]), float(row[1])])
#Sort array by first column
d = np.sort(d_us)
#Calculate the range of the data
l = round((d[-1][0] - d[0][0]) * 10)
#Declare the starting value
s = d[0][0]
#Declare the ending value
e = d[-1][0]
#Set the while loop counter
n = d[0][0]
#Iterate through data
while n <= e:
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0], row[1], diff])
    #Convert to numpy array
    tmp = np.array(tmp)
    #Sort numpy array
    sort = tmp[np.argsort(tmp[:, 2])]
    #Calculate sum of differences
    for row in tmp:
        tot = tot + row[2]
    #Calculate mean
    m = np.mean(tot)
    #Calculate Standard Deviation
    std = np.std(tmp[:, 2])
    #Calculate outliers and write to output file
    for y in tmp:
        z_score = (y[2] - m) / std
        if np.abs(z_score) > t:
            with open(f"calib_temp-{nts}.csv", 'a', newline='') as ct:
                c = csv.writer(ct, delimiter=',')
                c.writerow([y[0], y[1]])
    #Reset Variables
    tot = 0
    m = 0
    n = n + 0.1
    tmp = []
    std = 0
    z_score = 0
Do this before the loop:
#Create output file
ct = open(f"calib_temp-{nts}.csv", 'w')
c = csv.writer(ct, delimiter = ',')
Then change the loop to this. Note that I have moved your initializations to the top of the loop, so you don't need to initialize them twice. Note the if tmp: line, which solves the numpy exception.
#Iterate through data
while n <= e:
    tot = 0
    m = 0
    tmp = []
    std = 0
    z_score = 0
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0], row[1], diff])
    #Sort numpy array
    if tmp:
        #Convert to numpy array
        tmp = np.array(tmp)
        sort = tmp[np.argsort(tmp[:, 2])]
        #Calculate sum of differences
        for row in tmp:
            tot = tot + row[2]
        #Calculate mean
        m = np.mean(tot)
        #Calculate Standard Deviation
        std = np.std(tmp[:, 2])
        #Calculate outliers and write to output file
        for y in tmp:
            z_score = (y[2] - m) / std
            if np.abs(z_score) > t:
                c.writerow([y[0], y[1]])
    #Reset Variables
    n = n + 0.1
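As an aside, the whole per-group outlier pass can also be expressed with pandas groupby; this is a sketch, not the fix above, and the column names below are invented for illustration:
import pandas as pd

# Hypothetical names for the two CSV columns.
df = pd.read_csv("test.csv", header=None, names=["setpoint", "reading"])
df["diff"] = (df["setpoint"] - df["reading"]).round(1)

# z-score of the difference within each group sharing the same column-0 value.
# Note: pandas std() uses ddof=1, unlike np.std().
g = df.groupby("setpoint")["diff"]
z = (df["diff"] - g.transform("mean")) / g.transform("std")
outliers = df[z.abs() > 1]
outliers[["setpoint", "reading"]].to_csv("calib_temp.csv", index=False, header=False)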

How to efficiently read the array columns in the TSV file into a single npz file for each column?

I've a data file that looks like this:
58f0965a62d62099f5c0771d35dbc218 0.868632614612579 [0.028979932889342308, 0.004080114420503378, 0.03757167607545853] [-0.006008833646774292, -0.010409083217382431, 0.01565541699528694]
36f7859ce47417470bc28384694f0ac4 0.835115909576416 [0.026130573824048042, -0.00358427781611681, 0.06635218113660812] [-0.06970945745706558, 0.03816794604063034, 0.03491008281707764]
59f7d617bb662155b0d49ce3f27093ed 0.907200276851654 [0.009903069585561752, -0.009721670299768448, 0.0151780480518937] [-0.03264783322811127, 0.0035394825972616673, -0.05089104175567627]
where the columns are respectively
an md5 hash of the data point
a target float output
an array of floats that I want to read into a np.array object
another array of floats that I want to read into a np.array object
I've been reading the file like this to create a numpy array file for each of the two float-array columns:
import numpy as np
from tqdm import tqdm
import pandas as pd

lol = []
with open('data.tsv') as fin:
    for line in tqdm(fin):
        md5hash, score, vector1, vector2 = line.strip().split('\t')
        row = {'md5_hash': md5hash, 'score': float(score),
               'vector1': np.array(eval(vector1)),
               'vector2': np.array(eval(vector2))
              }
        lol.append(row)

df = pd.DataFrame(lol)

training_vector1 = np.array(list(df['vector1']))
# Save the training vectors.
np.save('vector1.npz', training_vector1)

training_vector2 = np.array(list(df['vector2']))
# Save the training vectors.
np.save('vector2.npz', training_vector2)
While this works for a small dataset, the actual dataset has many more floats in the arrays and close to 200 million rows. Here's a sample of 100 rows: https://gist.github.com/1f6f0b2501dc334db1e0038d36452f5d
How can I efficiently read the array columns in the TSV file into a single npz file for each column?
First, a note on the overall problem.
Any approach that loads 200M rows similar to the sample input you provided would require some 1.1 TB of memory (with 768-dimensional float64 vectors, each array alone is 200e6 × 768 × 8 bytes ≈ 1.1 TiB).
While this is possible, it is certainly not ideal.
Therefore, I would not recommend going forward with this, but would rather look into approaches specifically designed for handling large datasets, e.g. HDF5.
Having said that, the problem at hand is not particularly complex, but passing through pandas and eval() is probably neither desirable nor beneficial.
The same could be said for pre-processing with cut into CSV files that are only marginally simpler to read.
Assuming that np.save() will be equally fast regardless of how the array is produced, we could say that the following function replicates the processing in the OP well:
import numpy as np
import pandas as pd

def process_tsv_OP(filepath="100-translation.embedded-3.tsv"):
    lol = []
    with open(filepath, "r") as fin:
        for line in fin:
            md5hash, score, vector1, vector2 = line.strip().split('\t')
            row = {'md5_hash': md5hash, 'score': float(score),
                   'vector1': np.array(eval(vector1)),
                   'vector2': np.array(eval(vector2))
                  }
            lol.append(row)
    df = pd.DataFrame(lol)
    training_vector1 = np.array(list(df['vector1']))
    training_vector2 = np.array(list(df['vector2']))
    return training_vector1, training_vector2
This can be simplified by avoiding pandas and the "evil" eval() (and a number of copies around in memory):
def text2row(text):
    text = text[1:-1]
    return [float(x) for x in text.split(',')]

def process_tsv(filepath="100-translation.embedded-3.tsv"):
    with open(filepath, "r") as in_file:
        v1 = []
        v2 = []
        for line in in_file:
            _, _, text_r1, text_r2 = line.strip().split('\t')
            r1 = text2row(text_r1)
            r2 = text2row(text_r2)
            v1.append(r1)
            v2.append(r2)
    v1 = np.array(v1)
    v2 = np.array(v2)
    return v1, v2
It is easy to show that the two produce the same output:
def same_res(x, y):
    return all(np.allclose(i, j) for i, j in zip(x, y))

same_res(process_tsv(), process_tsv_OP())
# True
but with substantially different timings:
%timeit process_tsv_OP()
# 1 loop, best of 5: 300 ms per loop
%timeit process_tsv()
# 10 loops, best of 5: 86.1 ms per loop
(on the sample input file obtained with: wget https://gist.githubusercontent.com/alvations/1f6f0b2501dc334db1e0038d36452f5d/raw/ee31c052a4dbda131df182f0237dbe6e5197dff2/100-translation.embedded-3.tsv)
Preprocessing the input with cut does not seem to be that beneficial:
!time cut -f3 100-translation.embedded-3.tsv | rev | cut -c2- | rev | cut -c2- > vector1.csv
# real 0m0.184s
# user 0m0.102s
# sys 0m0.233s
!time cut -f4 100-translation.embedded-3.tsv | rev | cut -c2- | rev | cut -c2- > vector2.csv
# real 0m0.208s
# user 0m0.113s
# sys 0m0.279s
%timeit np.genfromtxt('vector1.csv', delimiter=','); np.genfromtxt('vector2.csv', delimiter=',')
# 1 loop, best of 5: 130 ms per loop
and, while some time may be saved by using pd.read_csv():
%timeit pd.read_csv('vector1.csv').to_numpy(); pd.read_csv('vector2.csv').to_numpy()
# 10 loops, best of 5: 85.7 ms per loop
this seems to be even slower than the original approach on the provided dataset (although cut itself may scale better for larger inputs).
If you really want to stick to the npy file format for this, you may at least wish to append to your output in blocks.
While this is not supported well with NumPy alone, you could use NpyAppendArray (see also here).
The modified process_tsv() would look like:
import os
from npy_append_array import NpyAppendArray

def process_tsv_append(
    in_filepath="100-translation.embedded-3.tsv",
    out1_filepath="out1.npy",
    out2_filepath="out2.npy",
    append_every=10,
):
    # clear output files
    for filepath in (out1_filepath, out2_filepath):
        if os.path.isfile(filepath):
            os.remove(filepath)
    with \
            open(in_filepath, "r") as in_file, \
            NpyAppendArray(out1_filepath) as npaa1, \
            NpyAppendArray(out2_filepath) as npaa2:
        v1 = []
        v2 = []
        for i, line in enumerate(in_file, 1):
            _, _, text_r1, text_r2 = line.strip().split("\t")
            r1 = text2row(text_r1)
            r2 = text2row(text_r2)
            v1.append(r1)
            v2.append(r2)
            if i % append_every == 0:
                npaa1.append(np.array(v1))
                npaa2.append(np.array(v2))
                v1 = []
                v2 = []
        if len(v1) > 0:  # assumes len(v1) == len(v2)
            npaa1.append(np.array(v1))
            npaa2.append(np.array(v2))

process_tsv_append()
v1 = np.load("out1.npy")
v2 = np.load("out2.npy")
same_res(process_tsv(), (v1, v2))
# True
All this can be sped up relatively blindly with Cython, but the speed-up seems marginal:
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np

cpdef text2row_cy(text):
    return [float(x) for x in text[1:-1].split(',')]

cpdef process_tsv_cy(filepath="100-translation.embedded-3.tsv"):
    with open(filepath, "r") as in_file:
        v1 = []
        v2 = []
        for line in in_file:
            _, _, text_r1, text_r2 = line.strip().split('\t')
            r1 = text2row_cy(text_r1)
            r2 = text2row_cy(text_r2)
            v1.append(r1)
            v2.append(r2)
    v1 = np.array(v1)
    v2 = np.array(v2)
    return v1, v2

print(same_res(process_tsv_cy(), process_tsv_OP()))
# True

%timeit process_tsv_cy()
# 10 loops, best of 5: 72.4 ms per loop
Similarly, pre-allocating the arrays does not seem to be beneficial:
def text2row_out(text, out):
    for i, x in enumerate(text[1:-1].split(',')):
        out[i] = float(x)

def process_tsv_alloc(filepath="100-translation.embedded-3.tsv"):
    with open(filepath, "r") as in_file:
        # num lines
        num_lines = in_file.read().count("\n")
        # num cols
        in_file.seek(0)
        line = next(in_file)
        _, _, text_r1, text_r2 = line.strip().split('\t')
        num_cols1 = len(text_r1.split(","))
        num_cols2 = len(text_r2.split(","))
        # populate arrays
        v1 = np.empty((num_lines, num_cols1))
        v2 = np.empty((num_lines, num_cols2))
        in_file.seek(0)
        for i, line in enumerate(in_file):
            _, _, text_r1, text_r2 = line.strip().split('\t')
            text2row_out(text_r1, v1[i])
            text2row_out(text_r2, v2[i])
    return v1, v2

print(same_res(process_tsv_alloc(), process_tsv_OP()))
%timeit process_tsv_alloc()
# 10 loops, best of 5: 110 ms per loop
A significant reduction in the running time can be obtained with Numba (and possibly with Cython too) by rewriting everything to be closer to C. In order to make our code compatible with Numba, and actually benefit from its acceleration, we need to make significant modifications:
open the file as bytes (no longer supporting UTF-8, which is not a significant issue for the problem at hand)
read and process the file in blocks, which needs to be sufficiently large, say in the order of 1M
write all string handling functions by hand, notably the string-to-float conversion
import numpy as np
import numba as nb

@nb.njit
def bytes2int(text):
    c_min = ord("0")
    c_max = ord("9")
    n = len(text)
    valid = n > 0
    # determine sign
    start = n - 1
    stop = -1
    sign = 1
    if valid:
        first = text[0]
        if first == ord("+"):
            stop = 0
        elif first == ord("-"):
            sign = -1
            stop = 0
    # parse rest
    number = 0
    j = 0
    for i in range(start, stop, -1):
        c = text[i]
        if c_min <= c <= c_max:
            number += (c - c_min) * 10 ** j
            j += 1
        else:
            valid = False
            break
    return sign * number if valid else None

@nb.njit
def bytes2float_helper(text):
    sep = ord(".")
    c_min = ord("0")
    c_max = ord("9")
    n = len(text)
    valid = n > 0
    # determine sign
    start = n - 1
    stop = -1
    sign = 1
    if valid:
        first = text[0]
        if first == ord("+"):
            stop = 0
        elif first == ord("-"):
            sign = -1
            stop = 0
    # parse rest
    sep_pos = 0
    number = 0
    j = 0
    for i in range(start, stop, -1):
        c = text[i]
        if c_min <= c <= c_max:
            number += (c - c_min) * 10 ** j
            j += 1
        elif c == sep and sep_pos == 0:
            sep_pos = j
        else:
            valid = False
            break
    return sign * number, sep_pos, valid

@nb.njit
def bytes2float(text):
    exp_chars = b"eE"
    exp_pos = -1
    for exp_char in exp_chars:
        for i, c in enumerate(text[::-1]):
            if c == exp_char:
                exp_pos = i
                break
        if exp_pos > -1:
            break
    if exp_pos > 0:
        exp_number = bytes2int(text[-exp_pos:])
        if exp_number is None:
            exp_number = 0
        number, sep_pos, valid = bytes2float_helper(text[:-exp_pos - 1])
        result = number / 10.0 ** (sep_pos - exp_number) if valid else None
    else:
        number, sep_pos, valid = bytes2float_helper(text)
        result = number / 10.0 ** sep_pos if valid else None
    return result

@nb.njit
def btrim(text):
    space = ord(" ")
    tab = ord("\t")
    nl = ord("\n")
    cr = ord("\r")
    start = 0
    stop = 0
    for c in text:
        if c == space or c == tab or c == nl or c == cr:
            start += 1
        else:
            break
    for c in text[::-1]:
        if c == space:
            stop += 1
        else:
            break
    if start == 0 and stop == 0:
        return text
    elif stop == 0:
        return text[start:]
    else:
        return text[start:-stop]

@nb.njit
def text2row_nb(text, sep, num_cols, out, curr_row):
    last_i = 0
    j = 0
    for i, c in enumerate(text):
        if c == sep:
            x = bytes2float(btrim(text[last_i:i]))
            out[curr_row, j] = x
            last_i = i + 2
            j += 1
    x = bytes2float(btrim(text[last_i:]))
    out[curr_row, j] = x

@nb.njit
def process_line(line, psep, sep, num_psep, num_cols1, num_cols2, out1, out2, curr_row):
    if len(line) > 0:
        psep_pos = np.empty(num_psep, dtype=np.int_)
        j = 0
        for i, char in enumerate(line):
            if char == psep:
                psep_pos[j] = i
                j += 1
        text2row_nb(line[psep_pos[-2] + 2:psep_pos[-1] - 1], sep, num_cols1, out1, curr_row)
        text2row_nb(line[psep_pos[-1] + 2:-1], sep, num_cols2, out2, curr_row)

@nb.njit
def decode_block(block, psep, sep, num_lines, num_cols1, num_cols2, out1, out2, curr_row):
    nl = ord("\n")
    last_i = 0
    i = j = 0
    for c in block:
        if c == nl:
            process_line(block[last_i:i], psep, sep, 3, num_cols1, num_cols2, out1, out2, curr_row)
            j += 1
            last_i = i
            curr_row += 1
            if j >= num_lines:
                break
        i += 1
    return block[i + 1:], curr_row

@nb.njit
def count_nl(block, start=0):
    nl = ord("\n")
    for c in block:
        if c == nl:
            start += 1
    return start

def process_tsv_block(filepath="100-translation.embedded-3.tsv", size=2 ** 18):
    with open(filepath, "rb") as in_file:
        # count newlines
        num_lines = 0
        while True:
            block = in_file.read(size)
            if block:
                num_lines = count_nl(block, num_lines)
            else:
                break
        # count num columns
        in_file.seek(0)
        line = next(in_file)
        _, _, text_r1, text_r2 = line.strip().split(b'\t')
        num_cols1 = len(text_r1.split(b","))
        num_cols2 = len(text_r2.split(b","))
        # fill output arrays
        v1 = np.empty((num_lines, num_cols1))
        v2 = np.empty((num_lines, num_cols2))
        in_file.seek(0)
        remainder = b""
        curr_row = 0
        while True:
            block = in_file.read(size)
            if block:
                block = remainder + block
                num_lines = count_nl(block)
                if num_lines > 0:
                    remainder, curr_row = decode_block(block, ord("\t"), ord(","), num_lines, num_cols1, num_cols2, v1, v2, curr_row)
                else:
                    remainder = block
            else:
                num_lines = count_nl(remainder)
                if num_lines > 0:
                    remainder, curr_row = decode_block(remainder, ord("\t"), ord(","), num_lines, num_cols1, num_cols2, v1, v2, curr_row)
                break
    return v1, v2
The prize for all this work is a mere ~2x speed up over process_tsv():
print(same_res(process_tsv_block(), process_tsv_OP()))
# True
%timeit process_tsv_block()
# 10 loops, best of 5: 48.8 ms per loop
Cut the 3rd column, remove the first and last square brackets
cut -f3 data.tsv | rev | cut -c2- | rev | cut -c2- > vector1.csv
Repeat the same for Vector 2
cut -f4 data.tsv | rev | cut -c2- | rev | cut -c2- > vector2.csv
Read each CSV into numpy in Python and save it to an npy file:
import numpy as np
np.save('vector1.npy', np.genfromtxt('vector1.csv', delimiter=','))
np.save('vector2.npy', np.genfromtxt('vector2.csv', delimiter=','))
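An aside, offered as an assumption worth verifying on your data: np.loadtxt skips genfromtxt's missing-value handling and is usually faster on clean numeric input like this:
import numpy as np
# loadtxt avoids genfromtxt's missing-value machinery; same output here.
np.save('vector1.npy', np.loadtxt('vector1.csv', delimiter=','))
np.save('vector2.npy', np.loadtxt('vector2.csv', delimiter=','))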
The other answers are good; the version below is a variation that uses dask. Since the original data is in text format, let's use the dask.bag API.
First, import modules and define a utility function:
from dask.array import from_delayed, from_npy_stack, to_npy_stack, vstack
from dask.bag import read_text
from numpy import array, nan, stack

def process_line(line):
    """Utility function adapted from the snippet in the question."""
    md5hash, score, vector1, vector2 = line.strip().split("\t")
    row = {
        "md5_hash": md5hash,
        "score": float(score),
        "vector1": array(eval(vector1)),
        "vector2": array(eval(vector2)),
    }
    return row
Next, create a bag:
bag = read_text("100-translation.embedded-3.tsv", blocksize="1mb").map(process_line)
Since the sample snippet is small, to simulate 'big data', let's pretend that we can load only '1mb' at once. This should create 3 partitions in the bag.
Next, isolate the vectors/arrays and convert them to dask.arrays:
# create delayed versions of the arrays
a1 = bag.pluck("vector1").map_partitions(stack).to_delayed()
a2 = bag.pluck("vector2").map_partitions(stack).to_delayed()

# convert the delayed objects to dask array
A1 = vstack(
    [from_delayed(a, shape=(nan, 768), dtype="float") for a in a1],
    allow_unknown_chunksizes=True,
)
A2 = vstack(
    [from_delayed(a, shape=(nan, 768), dtype="float") for a in a2],
    allow_unknown_chunksizes=True,
)
Now, we can save the arrays as npy stacks:
to_npy_stack("_A1", A1)
to_npy_stack("_A2", A2)
Note that this processing is not ideal, since the workers will pass over the data twice (once for each array), but with the current API I couldn't think of a better way.
Furthermore, note that the npy stacks preserve the 'unknown' chunks as metadata, even though all the relevant information was computed. This is something that could be improved in dask codebase, but for now the easiest fix is to load the data again, compute chunks, rechunk (to get nice, grid-like structure) and save again:
# rechunk into regular-sized format
# (note: rechunk returns a new array, so reassign the result)
A1 = from_npy_stack("_A1")
A1.compute_chunk_sizes()
A1 = A1.rechunk(chunks=(40, 768))
to_npy_stack("A1_final", A1)

# rechunk into regular-sized format
A2 = from_npy_stack("_A2")
A2.compute_chunk_sizes()
A2 = A2.rechunk(chunks=(40, 768))
to_npy_stack("A2_final", A2)
Of course, on the real dataset you'd want to use bigger chunks. And the final save operation does not have to be to numpy stacks; depending on your interest, this could now be stored as HDF5 or a zarr array.
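For instance, a minimal sketch of the HDF5 variant (assuming h5py is installed; the file and dataset names here are made up):
import dask.array as da

A1 = da.from_npy_stack("A1_final")
# dask.array.to_hdf5 streams the chunks into an HDF5 dataset
# (uses h5py under the hood).
da.to_hdf5("vectors.h5", "/vector1", A1)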
If the output format is changed to a raw binary file then the input file can be processed line by line without storing the complete result in RAM.
import numpy as np

fh_in = open('data.tsv')
fh_vec1 = open('vector1.bin', 'wb')
fh_vec2 = open('vector2.bin', 'wb')

linecount = 0
for line in fh_in:
    hash_, score, vec1, vec2 = line.strip().split('\t')
    np.fromstring(vec1.strip('[]'), sep=',').tofile(fh_vec1)
    np.fromstring(vec2.strip('[]'), sep=',').tofile(fh_vec2)
    linecount += 1
A raw binary file doesn't store any info about dtype, shape, or byte order.
For loading it back into an array you can use np.fromfile or np.memmap and then call .reshape(linecount, -1) on it.
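For example, a short sketch of the load-back step, assuming the vectors were written as float64 and linecount was recorded as above:
import numpy as np

# Raw .bin files carry no header, so dtype and shape must be supplied.
vec1 = np.memmap('vector1.bin', dtype=np.float64, mode='r').reshape(linecount, -1)
# or, fully in memory:
vec1_arr = np.fromfile('vector1.bin', dtype=np.float64).reshape(linecount, -1)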

How can I solve "Killed"?

I'm trying to plot clusters for my data, which is stored in a .data file, using the density peak clustering algorithm with this code, but the process gets killed: the file size is 8 GB and my RAM is 32 GB. How can I solve this problem, please?
The core problem is loading the whole file with this method:
def density_and_distance(self, distance_file, dc = None):
    print("Begin")
    distance, num, max_dis, min_dis = load_data(distance_file)
    print("end")
    if dc == None:
        dc = auto_select_dc(distance, num, max_dis, min_dis)
    rho = local_density(distance, num, dc)
    delta, nearest_neighbor = min_distance(distance, num, max_dis, rho)
    self.distance = distance
    self.rho = rho
    self.delta = delta
    self.nearest_neighbor = nearest_neighbor
    self.num = num
    self.dc = dc
    return rho, delta
I get the word Begin printed, then the process gets killed after some minutes.
The file contains lines like:
1 2 19.86
1 3 36.66
1 4 87.94
1 5 11.07
1 6 36.94
1 7 52.04
1 8 173.68
1 9 28.10
1 10 74.00
1 11 85.36
1 12 40.04
1 13 95.24
1 14 67.29
....
The method for reading the file is:
def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    with open(distance_file, 'r', encoding = 'utf-8') as infile:
        for line in infile:
            content = line.strip().split(' ')
            assert(len(content) == 3)
            idx1, idx2, dis = int(content[0]), int(content[1]), float(content[2])
            num = max(num, idx1, idx2)
            min_dis = min(min_dis, dis)
            max_dis = max(max_dis, dis)
            distance[(idx1, idx2)] = dis
            distance[(idx2, idx1)] = dis
    for i in range(1, num + 1):
        distance[(i, i)] = 0.0
    return distance, num, max_dis, min_dis
which I tried to change to:
import dask.dataframe as dd

def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    #with open(distance_file, 'r', encoding = 'utf-8') as infile:
    df_dd = dd.read_csv("ex3.csv")
    print("df_dd", df_dd.head())
    #for line in df_dd:
    #    content = df_dd.strip().split(' ')
    #    print(content)
    idx1, idx2, dis = df_dd.partitions[0], df_dd.partitions[1], df_dd.partitions[2]
    print("df_dd.partitions[0]", df_dd.partitions[0])
    num = max(num, idx1, idx2)
    min_dis = min(min_dis, dis)
    max_dis = max(max_dis, dis)
    distance[(idx1, idx2)] = dis
    distance[(idx2, idx1)] = dis
    for i in range(1, num + 1):
        distance[(i, i)] = 0.0
    return distance, num, max_dis, min_dis
You are using Python native integers and floats: these alone take tens of bytes for each actual number in your data (28 bytes for an integer).
If you simply use Numpy or Pandas for this, your memory consumption might be slashed by a factor of 4 or more, without further adjustments.
Your lines average about 10 bytes, so an 8 GB file should hold fewer than 800 million records; with 16-bit integers and 32-bit floats, your data might fit in 10 GB of memory. It is still a tight call, as the default pandas behavior is to copy everything on changes to a column. There are other options:
Since your code depends on indexing the rows as you've done there, you could just offload your data to an SQLite DB, and use in-SQLite indices instead of the dict you are using, as well as its min and max operators: this would offset memory usage, and SQLite would do its job with minimal fuss (see the sketch after this answer).
Another option would be to use "dask" instead of Pandas: it will take care of offloading data that would not fit in memory to disk.
TL;DR: the way your problem is arranged, going to SQLite is probably the route that requires the fewest changes to what you had in mind.
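As a rough illustration of the SQLite route (a sketch only; the file name, table name, and batch size below are made-up placeholders, untested against a real 8 GB input):
import sqlite3

con = sqlite3.connect("distances.db")
con.execute("CREATE TABLE IF NOT EXISTS distance (idx1 INTEGER, idx2 INTEGER, dis REAL)")

# Stream the file in; nothing is held in RAM beyond one batch.
with open("distances.data") as infile:
    batch = []
    for line in infile:
        i, j, d = line.split()
        batch.append((int(i), int(j), float(d)))
        if len(batch) >= 100_000:
            con.executemany("INSERT INTO distance VALUES (?, ?, ?)", batch)
            batch = []
    if batch:
        con.executemany("INSERT INTO distance VALUES (?, ?, ?)", batch)

con.execute("CREATE INDEX IF NOT EXISTS idx_pair ON distance (idx1, idx2)")
con.commit()

# min/max/num come from SQL instead of Python-side bookkeeping.
min_dis, max_dis = con.execute("SELECT MIN(dis), MAX(dis) FROM distance").fetchone()
num = max(con.execute("SELECT MAX(idx1), MAX(idx2) FROM distance").fetchone())

def get_distance(i, j):
    # Look up the symmetric pair instead of storing it twice.
    row = con.execute(
        "SELECT dis FROM distance WHERE (idx1=? AND idx2=?) OR (idx1=? AND idx2=?)",
        (i, j, j, i),
    ).fetchone()
    return row[0] if row else 0.0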

Calculating the standard deviation from numbers in Python

I'm trying to calculate the standard deviation from a bunch of numbers in a document.
Here's what I got so far:
with open("\\Users\\xxx\\python_courses\\1DV501\\assign3\\file_10000integers_B.txt", "r") as f:
    total2 = 0
    number_of_ints2 = 0
    deviation = 0.0
    variance = 0.0
    for line in f:
        for num in line.split(':'):
            total2 += int(num)
            number_of_ints2 += 1
    average = total2 / number_of_ints2
    for line in f:
        for num in line.split(":"):
            deviation += (int(num) - average) ** 2
But I'm completely stuck. I don't know how to do it. Math is not my strong suit, so this is turning out to be quite difficult.
Also, the document is mixed with negative and positive numbers, if that makes any difference.
You can use a few of the available libraries. For example, suppose I had data from somewhere:
>>> import random
>>> data = [random.randint(1,100) for _ in range(100)] # assume from your txt file
I could use statistics.stdev
>>> import statistics
>>> statistics.stdev(data)
28.453646514989956
or numpy.std
>>> import numpy as np
>>> np.std(data)
28.311020822287563
or scipy.stats.tstd
>>> import scipy.stats
>>> scipy.stats.tstd(data)
28.453646514989956
or if you want to roll your own
import math

def stddev(data):
    mean = sum(data) / len(data)
    return math.sqrt((1 / len(data)) * sum((i - mean) ** 2 for i in data))
>>> stddev(data)
28.311020822287563
Note that the slight difference in the computed values depends on whether you want the "sample" standard deviation or the "population" standard deviation; see here.
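To make the distinction concrete, a small self-contained check:
import statistics
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.stdev(data))    # sample stddev (divides by n-1): ~2.138
print(statistics.pstdev(data))   # population stddev (divides by n): 2.0
print(np.std(data))              # numpy defaults to population: 2.0
print(np.std(data, ddof=1))      # ddof=1 gives the sample version: ~2.138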
You can use the stdev function from the statistics module (see the official documentation).
Put your numbers in a list, then apply the function:
from statistics import stdev
mylist = [1,2,5,10,100]
std = stdev(mylist)
The problem is that you are iterating over the file twice, but you didn't reset the reader to the beginning of the file before the second loop. You can use f.seek(0) to do this.
total2 = 0
number_of_ints2 = 0
deviation = 0.0
variance = 0.0

with open("numbers.txt", "r") as f:
    for line in f:
        for num in line.split(':'):
            total2 += int(num)
            number_of_ints2 += 1
    average = total2 / number_of_ints2

    f.seek(0)  # Move back to the beginning of the file.
    for line in f:
        for num in line.split(":"):
            deviation += (int(num) - average) ** 2
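To finish the computation after the second loop, divide the accumulated sum by the count and take the square root; this is the population form of the standard deviation:
variance = deviation / number_of_ints2
std_dev = variance ** 0.5  # population standard deviation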

Getting data arrays from CSV with loops

I have a CSV that looks like this:
0.500187550,CPU1,7.93
0.500187550,CPU2,1.62
0.500187550,CPU3,7.93
0.500187550,CPU4,1.62
1.000445359,CPU1,9.96
1.000445359,CPU2,1.61
1.000445359,CPU3,9.96
1.000445359,CPU4,1.61
1.500674877,CPU1,9.94
1.500674877,CPU2,1.61
1.500674877,CPU3,9.94
1.500674877,CPU4,1.61
The first column is time, the second the CPU used and the third is energy.
As a final result I would like to have these arrays:
Time:
[0.500187550, 1.000445359, 1.500674877]
Energy (per CPU): e.g. CPU1
[7.93, 9.96, 9.94]
For parsing the CSV I'm using:
query = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
#Arrays global time and power:
for row in query:
x = row[0]
x = float(x)
x_array.append(x) #column 0 to array
y = row[2]
y = float(y)
y_array.append(y) #column 2 to array
print x_array
print y_array
This way I get all the data from time and energy into two arrays: x_array and y_array.
Then I order the arrays:
energy_core_ord_array = []
time_ord_array = []

#Dividing array into energy and time per core:
for i in range(number_cores[0]):
    e = 0 + i
    for j in range(len(x_array)/(int(number_cores[0]))):
        time_ord = x_array[e]
        time_ord_array.append(time_ord)
        energy_core_ord = y_array[e]
        energy_core_ord_array.append(energy_core_ord)
        e = e + int(number_cores[0])
And lastly, I cut the time array down to the length it should have:
final_time_ord_array = []
for i in range(len(x_array)/(int(number_cores[0]))):
    final_time_ord = time_ord_array[i]
    final_time_ord_array.append(final_time_ord)
Up to here, although the code is not elegant, it works.
The problem comes when I try to get the array for each core.
I get it for the first core, but when I try to iterate for the next one, I don't know how to do it, or how to store each array in a variable with a single name, for example.
final_energy_core_ord_array = []

#Trunk energy core array:
for i in range(len(x_array)/(int(number_cores[0]))):
    final_energy_core_ord = energy_core_ord_array[i]
    final_energy_core_ord_array.append(final_energy_core_ord)
So, using Pandas (a library for handling dataframes in Python), you can do something like this, which is much quicker than processing the CSV manually as you're doing:
import pandas as pd

csvfile = "C:/Users/Simon/Desktop/test.csv"
data = pd.read_csv(csvfile, header=None, names=['time', 'cpu', 'energy'])

times = list(pd.unique(data.time.ravel()))
print times

cpuList = data.groupby(['cpu'])
cpuEnergy = {}
for i in range(len(cpuList)):
    curCPU = 'CPU' + str(i + 1)
    cpuEnergy[curCPU] = list(cpuList.get_group('CPU' + str(i + 1))['energy'])
for k, v in cpuEnergy.items():
    print k, v
that will give the following as output:
[0.50018755000000004, 1.000445359, 1.5006748769999998]
CPU4 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU2 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU3 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
CPU1 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
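For what it's worth, the same grouping can be written more directly by iterating over the groupby object itself; the resulting cpuEnergy dict is identical:
# Each iteration yields (group_key, sub-DataFrame), so the CPU names
# come straight from the data rather than from 'CPU' + str(i+1).
cpuEnergy = {cpu: list(group['energy']) for cpu, group in data.groupby('cpu')}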
Finally I got an answer using globals... not a great idea, but it works. I'll leave it here in case someone finds it useful.
final_energy_core_ord_array = []

#Trunk energy core array:
a = 0
for j in range(number_cores[0]):
    for i in range(len(x_array)/(int(number_cores[0]))):
        final_energy_core_ord = energy_core_ord_array[a + i]
        final_energy_core_ord_array.append(final_energy_core_ord)
    globals()['core%s' % j] = final_energy_core_ord_array
    final_energy_core_ord_array = []
    a = a + 12

print 'Final time and cores:'
print final_time_ord_array
for j in range(number_cores[0]):
    print globals()['core%s' % j]
