Reading data from CSV into dataframe with multiple delimiters efficiently - python

I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.
I have made 5 attempts: among them, utilising the converters argument of pd.read_csv, using regex with engine='python', using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.
I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?
The below string can be used for benchmarking purposes.
# Python 3.7.0, Pandas 0.23.4
from io import StringIO
import numpy as np
import pandas as pd
import csv
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))
def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]})
    return df.astype(float)
def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')
def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
    return pd.DataFrame(L, dtype=float)
def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                           sep=',', header=None, usecols=[3, 4, 5])
Checks:
res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))
print(res1.head(3))
# 0 1 2
# 0 34.23 562.45 213.5432
# 1 56.23 63.45 625.2340
# 2 34.23 562.45 213.5432
assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))
Benchmarking results:
%timeit csv_reader_1(StringIO(x)) # 5.31 s per loop
%timeit csv_reader_2(StringIO(x)) # 6.69 s per loop
%timeit csv_reader_3(StringIO(x)) # 18.6 s per loop
%timeit csv_reader_4(StringIO(x)) # 5.68 s per loop
%timeit csv_reader_5(StringIO(x)) # 7.01 s per loop
%timeit pd.read_csv(StringIO(x)) # 1.65 s per loop
Update
I'm open to using command-line tools as a last resort. To that end, I have included such an answer. My hope is that there is a pure-Python or Pandas solution of comparable efficiency.

Use a command-line tool
By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read the result into Pandas. Pure Python or Pandas solutions do not come close in terms of efficiency.
Essentially, a tool written in C / C++ is likely to outperform Python-level string manipulation.
For example, using Find And Replace Text:
import os
os.chdir(r'C:\temp') # change directory location
os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas
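On Unix-like systems the same pre-pass can be done with sed before the Pandas read. A minimal sketch, not benchmarked here (note that BSD/macOS sed needs -i '' rather than -i):
import subprocess
import pandas as pd

# Replace ';' with ',' in place with GNU sed, then read only the numeric columns
subprocess.run(['sed', '-i', 's/;/,/g', 'file.csv'], check=True)
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)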

How about using a generator to do the replacement, and combining it with an appropriate wrapper to get a file-like object suitable for pandas?
import io
import pandas as pd
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)
def replacementgenerator(haystack, needle, replace):
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)
csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5], header=None)
Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.
This approach is functionally identical to the answer by Daniele, except that we replace values on the fly, as they are requested, instead of all in one go.
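An untested variant of the same idea: yield whole re-encoded lines instead of single characters. This keeps the on-the-fly replacement but with far fewer Python-level iterations; the function name is illustrative.
def replacementgenerator_lines(haystack, needle, replace):
    # Yield each line with the replacement applied, encoded to bytes
    for line in haystack.splitlines(keepends=True):
        yield line.replace(needle, replace).encode()

df = pd.read_csv(iterstream(replacementgenerator_lines(x, ";", ",")),
                 usecols=[3, 4, 5], header=None)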

If this is an option, substituting the character ; with , in the string is faster.
I have written the string x to a file test.dat.
def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])
The unicode() function was necessary to avoid a TypeError in Python 2.
Benchmarking:
%timeit csv_reader_2('test.dat') # 1.6 s per loop
%timeit csv_reader_4('test.dat') # 1.2 s per loop

A very fast option (3.51 s per loop in my benchmark): simply convert the StringIO to a str, replace ; with ,, and read the dataframe with sep=',':
def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                             sep=',', header=None)
    return reader
The benchmark:
%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop

Python has powerful features for manipulating data, but don't expect top performance from pure Python. When performance is needed, C and C++ are your friends.
Any fast library in Python is written in C/C++. It is quite easy to use C/C++ code in Python; have a look at the SWIG utility (http://www.swig.org/tutorial.html). You can write a C++ class containing fast utilities that you use in your Python code when needed.
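As a rough illustration of calling C from Python, here is a ctypes sketch (an alternative to the SWIG route described above), using libc's strlen as a stand-in for a custom C/C++ utility:
import ctypes
import ctypes.util

# Load the C standard library (lookup may fail on Windows; this is a Unix-oriented sketch)
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # -> 5
In practice you would compile your own hot loop into a shared library and load it the same way.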

In my environment (Ubuntu 16.04, 4 GB RAM, Python 3.5.2) the fastest method was (the prototypical[1]) csv_reader_5 (taken from U9-Forward's answer), which ran less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the character in the read() method:
class SingleCharReplacingFilter:

    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            if attr in ('read',):
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a
        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))
def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])
The result is a little better performance compared to reading the entire CSV file with no conversions:
In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[1] I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).
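An untested usage sketch with a real file instead of a StringIO; since the filter only intercepts read(), an ordinary text file handle should work the same way (the function name is illustrative):
def csv_reader_6_from_path(path):
    with open(path) as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])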

Related

Split a binary file into 4-byte sequences (python)

So, basically, I have a 256MB file which contains "int32" numbers written as 4-byte sequences, and I have to sort them into another file.
The thing I struggle with the most is how to read the file into an array of 4-byte sequences. At first I thought this method was slow because it reads elements one by one:
for i in range(numsPerFile):
    buffer = currFile.read(4)
    arr.append(buffer)
Then I made this:
buffer = currFile.read()
arr4 = []
for i in range(numsPerFile):
    arr4.append(bytes(buffer[i*4 : i*4+4]))
And it wasn't any better when I measured the time (both read 128000 numbers in ~0.8 sec on my pc). So, is there a faster method to do this?
Read and write tasks are often the slow part of an activity.
I have done some tests and removed the reading from file activity from the timings.
Test0 repeats the test you did, to calibrate it for my old, slow machine.
Test1 uses a list comprehension to build the list of four-byte chunks, the same as test0.
Test2 builds a list of integers. As you mentioned the data is int32, I've turned the four-byte chunks into integers to be used in Python.
Test3 uses the array library to unpack the bytes as int32, as suggested by the original poster. It uses frombytes rather than fromfile to keep it consistent with the other tests.
Test3 was the fastest by some margin, followed by test2 and test1; test0 was always the slowest on my machine.
I only created a 128 MB file for the test, but it should give you an idea.
This is the code I used for my testing:
import array
import time
from pathlib import Path
import secrets
import struct

tmp_txt = Path('/tmp/test.txt')

def gen_test():
    data = secrets.token_bytes(128_000_000)
    tmp_txt.write_bytes(data)
    print(f"Generated {len(data):,} bytes")

def test0(data):
    arr4 = []
    for i in range(int(len(data)/4)):
        arr4.append(bytes(data[i*4:i*4+4]))
    return arr4

def test1(data):
    return [data[i:i+4] for i in range(0, len(data), 4)]

def test2(data):
    return [_[0] for _ in struct.iter_unpack('>i', data)]

def test3(data):
    arr4 = array.array('i')
    arr4.frombytes(data)
    arr4.byteswap()
    return arr4

def run_test():
    raw_data = tmp_txt.read_bytes()
    all_test = [
        test0,
        test1,
        test2,
        test3,
    ]
    for idx, test in enumerate(all_test):
        print(f"test{idx}")
        n0 = time.perf_counter()
        x0 = test(raw_data)
        n1 = time.perf_counter()
        print(f"\tTest{idx} took {n1 - n0:.2f}")
        if isinstance(x0, array.array):
            print(f"\tdata[{x0.buffer_info()[1]:,}] = [{x0[0]}, ..., {x0[-1]}]")
        else:
            print(f"\tdata[{len(x0):,}] = [{x0[0]}, ..., {x0[-1]}]")

if __name__ == '__main__':
    gen_test()
    run_test()
This gave me the following transcript:
Generated 128,000,000 bytes
test0
    Test0 took 10.50
    data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test1
    Test1 took 4.40
    data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test2
    Test2 took 3.31
    data[32,000,000] = [-1625534641, ..., -1802579393]
test3
    Test3 took 0.41
    data[32,000,000] = [-1625534641, ..., -1802579393]
def bindata():
    array = []
    count = 0
    with open('file.txt', 'r') as f:
        while count < 64:
            f.seek(count)
            array.append(f.read(8))
            count = count + 8
    print(array)

bindata()
file.txt data: FFFFFFFFAAAAAAAABBBBBBBBCCCCCCCCDDDDDDDDAAAAAAAAEEEEEEEEFFFFFFFF
I think I must have misunderstood your question. I can create a test file hopefully similar to yours like this:
import numpy as np
# Make 256MB array
a = np.random.randint(0, 1000, int(256*1024*1024/4), dtype=np.uint32)
# Write to disk
a.tofile('BigBoy.bin')
And now time how long it takes to read it from disk
%timeit b = np.fromfile('BigBoy.bin', dtype=np.uint32)
That gives 33 milliseconds.
33.2 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check all the same
print(np.all(a==b))
True
Or, if for some reason, you don't like Numpy, you can use a struct:
import struct

d = open('BigBoy.bin', 'rb').read()
fmt = f'{len(d)//4}I'
data = struct.unpack(fmt, d)
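If the bytes are already in memory (as in the struct example above), np.frombuffer gives a zero-copy view without a second parse. A sketch, assuming the same native-endian uint32 layout used to write BigBoy.bin:
import numpy as np

d = open('BigBoy.bin', 'rb').read()
# Interpret the existing buffer as uint32 without copying it
data = np.frombuffer(d, dtype=np.uint32)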

Maximal efficiency to parse millions of json stored as a (very long) string

Objective:
I have thousands of data dumps whose format, after unzipping, is a long string containing 150K JSON objects separated by '\n'.
big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'
Each JSON contains dozens of useless keys like garbage, but my objective is only to sum the qty for each mineral.
result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
How to reproduce:
import random
minerals = ['gold', 'silver', 'adamantium']
big_string = str(
    '\n'.join([
        str({'mineral': random.choice(minerals),
             'qty': random.randint(1, 1000),
             'garbage': random.randint(1, 666),
             'other_garbage': random.randint(-10, 10)})
        for _ in range(150000)
    ])
)

def solution(big_string):
    # Show me your move
    return dict()  # or pd.DataFrame()
My current solution (which I find slower than expected):
1. Split the string on the '\n' separator, using a yield-based generator (see https://stackoverflow.com/a/9770397/4974431).
2. Load each line as JSON using the ujson library (supposed to be faster than the json standard lib).
3. Access only the values needed: 'mineral' and 'qty'.
4. Aggregate using pandas.
Which gives:
import ujson
import re
import pandas as pd

# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep="\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

def my_solution(big_string):
    useful_fields = ['mineral', 'qty']
    filtered_data = []
    for line in lines(big_string, sep="\n"):
        line = ujson.loads(line)
        filtered_data.append([line[field] for field in useful_fields])
    result = pd.DataFrame(filtered_data, columns=useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()
Any improvement, even by 25%, would be great because I have thousands to do !
I must confess: I'm going to use a library of mine - convtools (github). The idea:
- rely on iterating an io.StringIO, which splits lines by \n itself
- process as a stream, without additional allocations
import random
from io import StringIO

import ujson
from convtools import conversion as c

minerals = ["gold", "silver", "adamantium"]
big_string = str(
    "\n".join(
        [
            ujson.dumps(
                {
                    "mineral": random.choice(minerals),
                    "qty": random.randint(1, 1000),
                    "garbage": random.randint(1, 777),
                    "other_garbage": random.randint(-10, 10),
                }
            )
            for _ in range(150000)
        ]
    )
)

# define a conversion and generate an ad hoc converter
converter = (
    c.iter(c.this.pipe(ujson.loads))
    .pipe(
        c.group_by(c.item("mineral")).aggregate(
            {
                "mineral": c.item("mineral"),
                "qty": c.ReduceFuncs.Sum(c.item("qty")),
            }
        )
    )
    .gen_converter()
)
# let's check
"""
In [48]: converter(StringIO(big_string))
Out[48]:
[{'mineral': 'silver', 'qty': 24954551},
 {'mineral': 'adamantium', 'qty': 25048483},
 {'mineral': 'gold', 'qty': 24975201}]

In [50]: OPs_solution(big_string)
Out[50]:
      mineral       qty
0  adamantium  25048483
1        gold  24975201
2      silver  24954551
"""
Let's profile:
In [53]: %timeit OPs_solution(big_string)
339 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: %timeit converter(StringIO(big_string))
93.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
After looking at where the time is spent, it appears 90% of the total time is spent in the generator.
Changing it to the approach in https://stackoverflow.com/a/59071238/4974431 gave an 85% time improvement.
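For reference, a minimal sketch of that kind of change (not the code from the linked answer): split with splitlines() instead of the regex-based generator and aggregate with a plain dict. The helper name is illustrative.
from collections import defaultdict
import ujson

def sum_by_mineral(big_string):
    totals = defaultdict(int)
    # Each line is one JSON object; split on newlines directly
    for line in big_string.splitlines():
        record = ujson.loads(line)
        totals[record["mineral"]] += record["qty"]
    return dict(totals)

result = sum_by_mineral(big_string)  # e.g. {'gold': ..., 'silver': ..., 'adamantium': ...}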

Did I/O become slower since Python 2.7?

I'm currently having a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, merge the chunks. I just used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is way faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were there new features added?
import os

def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    # This is where most of the execution time is spent:
    with open(filepath) as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)
Execution times (multiple runs):
2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s
The complete code is on Github, along with a minimal complete version
Unicode
Yes, I know that Python 3 and Python 2 deal differently with strings. I tried opening the files in binary mode (rb / wb); see the "binary mode" comments. They are a tiny bit faster on a couple of runs. Still, Python 2.7 is WAY faster on all runs.
Try 1: Dictionary access
When I phrased this question, I thought that dictionary access might be a reason for this difference. However, I think the total execution time is way less for dictionary access than for I/O. Also, timeit did not show anything important:
import timeit
import numpy as np
durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)

mul = 10 ** -7
print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)
print("min = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max = {:0.1f} * 10^-7".format(np.max(durations) / mul))
Try 2: Copy time
As a simplified experiment, I tried to copy the 20GB file:
cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s
The Python stuff is generated by the following code.
My first thought was that the variance is quite high. So this could be the reason. But then, the variance of chunk_data execution time is also high, but the mean is noticeably lower for Python 2.7 than for Python 3.x. So it seems not to be an I/O scenario as simple as I tried here.
import time
import sys
import os

version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)

if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()

print("Python {}: {:0.0f}s".format(version, t1 - t0))
My System
Ubuntu 20.04
Thinkpad T460p
Python through pyenv
This is a combination of multiple effects, mostly the fact that Python 3 needs to perform unicode decoding/encoding when working in text mode and if working in binary mode it will send the data through dedicated buffered IO implementations.
First of all, using time.time to measure execution time uses the wall time and hence includes all sorts of Python unrelated things such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference with other processes that require the storage medium. That's why you are seeing these wild variations in timing results. Here are the results for my system, from seven consecutive runs for each version:
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6] # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4] # 615.84 +/- 15.09
Despite the large variation it seems that these results indeed indicate different timings as can be confirmed for example by a statistical test:
>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636
i.e. there's only a 2% chance that the timings emerged from the same distribution.
We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8] # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4] # 171.16 +/- 0.16
Now there's much less spread in the data since the timings reflect the Python process only.
This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large amount of lines in the input file (550_000_000) this amounts to about 97.7 nanoseconds per iteration.
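For reference, a minimal sketch of how the wall-time vs. process-time comparison might be set up (the workload call is a placeholder for the chunking loop above):
import time

t_wall0, t_cpu0 = time.time(), time.process_time()
chunk_data("numbers-large.txt", prefixes)  # placeholder: the I/O-heavy workload
t_wall1, t_cpu1 = time.time(), time.process_time()

print("wall time:    {:.1f} s".format(t_wall1 - t_wall0))  # includes OS caching and other processes
print("process time: {:.1f} s".format(t_cpu1 - t_cpu0))    # CPU time of this process only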
The first effect causing increased execution time is the use of unicode strings in Python 3. The binary data is read from the file, decoded and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so this doesn't introduce any encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources which are reflected in the wall time difference. For example we can measure the time it takes for a roundtrip from binary to unicode to binary:
In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:
py3_binary_mode = [200.6, 203.0, 207.2] # 203.60 +/- 2.73
That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:
In [2]: class C:
   ...:     def f(self): pass
   ...:

In [3]: x = C()

In [4]: %timeit x.f()
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit x
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Here %timeit x measures the additional overhead of resolving the global name x, hence the attribute lookup and function call take 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the above roundtrip data gives 162 - 2*64.4 == 33.2 nanoseconds.
Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the IO in Python 3 goes through the (quite complex) implementation of io.BufferedWriter.write, while in Python 2 the file.write method proceeds fairly straightforwardly to fwrite.
We can check the types of the file objects in both implementations:
$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>
$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>
Here we also need to note that the above timing results for Python 2 have been obtained by using text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol which results in additional work being performed also for strings (see also this question). If we switch to binary mode also for Python 2 then we obtain:
py2_binary_mode = [212.9, 213.9, 214.3] # 213.70 +/- 0.59
which is actually a bit larger than the Python 3 results (18.4 ns / iteration).
The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:
from __future__ import print_function
import timeit
N = 10**6
R = 7
results = timeit.repeat(
    "d[b'10'].write",
    setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))",  # requires file 'test' to exist
    repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)
This gives the following results for Python 2 and Python 3:
Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds
This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.
The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:
$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90

Efficient double for loop over large matrices

I have the following code which I need to run more than once. Currently, it takes too long. Is there an efficient way to write these two for loops?
ErrorEst = []
for i in range(len(embedingFea)):  # 17000
    temp = []
    for j in range(len(emedingEnt)):  # 15000
        if cooccurrenceCount[i][j] > 0:
            #print(coaccuranceCount[i][j]/ count_max)
            weighting_factor = np.min(
                [1.0,
                 math.pow(np.float32(cooccurrenceCount[i][j] / count_max), scaling_factor)])
            embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
            #tf.log(tf.to_float(self.__cooccurrence_count))
            log_cooccurrences = np.log(np.float32(cooccurrenceCount[i][j]))
            distance_expr = np.square(([
                embedding_product +
                focal_bias[i],
                context_bias[j],
                -(log_cooccurrences)]))
            single_losses = (weighting_factor * distance_expr)
            temp.append(single_losses)
    ErrorEst.append(np.sum(temp))
You can use Numba or Cython.
First of all, make sure to avoid lists wherever possible and write simple, readable code with explicit loops, like you would in C. All inputs and outputs should be only numpy arrays or scalars.
Your Code
import numpy as np
import numba as nb
import math

def your_func(embedingFea, emedingEnt, cooccurrenceCount, count_max, scaling_factor, focal_bias, context_bias):
    ErrorEst = []
    for i in range(len(embedingFea)):  # 17000
        temp = []
        for j in range(len(emedingEnt)):  # 15000
            if cooccurrenceCount[i][j] > 0:
                weighting_factor = np.min([1.0, math.pow(np.float32(cooccurrenceCount[i][j] / count_max), scaling_factor)])
                embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
                log_cooccurrences = np.log(np.float32(cooccurrenceCount[i][j]))
                distance_expr = np.square(([embedding_product + focal_bias[i], context_bias[j], -(log_cooccurrences)]))
                single_losses = (weighting_factor * distance_expr)
                temp.append(single_losses)
        ErrorEst.append(np.sum(temp))
    return ErrorEst
Numba Code
@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def your_func_2(embedingFea, emedingEnt, cooccurrenceCount, count_max, scaling_factor, focal_bias, context_bias):
    ErrorEst = np.empty((embedingFea.shape[0], 2))
    for i in nb.prange(embedingFea.shape[0]):
        temp_1 = 0.
        temp_2 = 0.
        for j in range(emedingEnt.shape[0]):
            if cooccurrenceCount[i, j] > 0:
                weighting_factor = (cooccurrenceCount[i, j] / count_max)**scaling_factor
                if weighting_factor > 1.:
                    weighting_factor = 1.
                embedding_product = emedingEnt[j] * embedingFea[i]
                log_cooccurrences = np.log(cooccurrenceCount[i, j])

                temp_1 += weighting_factor * (embedding_product + focal_bias[i])**2
                temp_1 += weighting_factor * (context_bias[j])**2
                temp_1 += weighting_factor * (log_cooccurrences)**2

                temp_2 += weighting_factor * (1. + focal_bias[i])**2
                temp_2 += weighting_factor * (context_bias[j])**2
                temp_2 += weighting_factor * (log_cooccurrences)**2

        ErrorEst[i, 0] = temp_1
        ErrorEst[i, 1] = temp_2
    return ErrorEst
Timings
embedingFea=np.random.rand(1700)+1
emedingEnt=np.random.rand(1500)+1
cooccurrenceCount=np.random.rand(1700,1500)+1
focal_bias=np.random.rand(1700)
context_bias=np.random.rand(1500)
count_max=100
scaling_factor=2.5
%timeit res_1=your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
1min 1s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
17.6 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you need to increase the performance of your code, you should write it in a low-level language like C and try to avoid the usage of floating point numbers.
Possible solution: Can we use C code in Python?
You could try using numba and wrapping your code with the @jit decorator. Usually the first execution needs to compile some stuff, and will thus not see much speedup, but subsequent iterations will be much faster.
You may need to put your loop in a function for this to work.
from numba import jit

@jit(nopython=True)
def my_double_loop(some, arguments):
    for i in range(len(embedingFea)):  # 17000
        temp = []
        for j in range(len(emedingEnt)):  # 15000
            # ...

Python, Dict to CSV: is there a faster way to do it?

I've written a simple method to write a dictionary to a CSV.
It works well, but I'm wondering if it can be improved in terms of speed (writing a CSV of 1000 rows takes 6 secs in my tests).
My question is: how to improve the speed of this code? (if possible)
Thank you in advance for your assistance.
def fast_writer(self, f_name, text_dict):
    try:
        start = timer()
        # Windows
        if os.name == "nt":
            with open(f_name, 'w', newline='') as self._csv_file:
                self._writer = csv.writer(self._csv_file)
                for self._key, self._value in text_dict.items():
                    self._writer.writerow([self._key, self._value])
        # Unix/Linux
        else:
            with open(f_name, 'w') as self._csv_file:
                self._writer = csv.writer(self._csv_file)
                for self._key, self._value in text_dict.items():
                    self._writer.writerow([self._key, self._value])
        end = timer()
        print("[FastWriter_time] ", end - start)
    except BaseException:
        print("[ERROR] Unable to write file on disk. Exit...")
        sys.exit()
If you're really just looking for a faster way to do this, pandas has such methods built-in, and pretty well optimized! Take the following code for example:
import numpy as np
import pandas as pd
# This is just to generate a dictionary with 1000 values:
data_dict = {'value':[i for i in np.random.randn(1000)]}
# This is to translate the dict to a dataframe, and then save it
df = pd.DataFrame(data_dict)
df.to_csv('test.csv')
Takes about 0.008 seconds to write the dictionary to a dataframe and write the dataframe to a csv on my machine
If you don't want to use pandas, get rid of all those variables being stored in self and make them local variables:
def fast_writer(self, f_name, text_dict):
    try:
        start = timer()
        newline = '' if os.name == "nt" else None
        with open(f_name, 'w', newline=newline) as csv_file:
            writer = csv.writer(csv_file)
            writer.writerows(text_dict.items())
        end = timer()
        print("[FastWriter_time] ", end - start)
    except BaseException as e:
        print("[ERROR] Unable to write file on disk. Exit...")
        print(e)
        sys.exit()
Also, use writer.writerows to write multiple rows at once.
On my machine this is faster than the pandas method, using the test data as defined by @sacul in their answer:
In [6]: %timeit fast_writer("test.csv", data_dict)
1.59 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit fast_writer_pd("test.csv", data_dict)
3.97 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The Writer object already has a method for writing a list of rows to the file; you don't need to iterate explicitly.
def fast_writer(self, f_name, text_dict):
    try:
        start = timer()
        with open(f_name, 'w', newline=None) as csv_file:
            writer = csv.writer(csv_file)
            writer.writerows(text_dict.items())
        end = timer()
        print("[FastWriter_time] ", end - start)
    except Exception:
        print("[ERROR] Unable to write file on disk. Exit...")
        sys.exit()
A few comments:
You don't need to sniff the operating system; newline=None uses the underlying system default.
If you are going to reallocate self._writer and self._csv_file on every call, they probably don't have to be instance attributes; they can just be local variables: writer = csv.writer(csv_file).
BaseException is far too broad; it's no better than a bare except clause. Use Exception, but consider catching only IOError and OSError instead (see the sketch below). Other exceptions may indicate bugs in your code, not legitimate I/O errors.
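For instance, a minimal sketch of the narrower exception handling suggested above (written as a standalone function; f_name and text_dict keep their meaning from the question):
import csv
import sys

def fast_writer(f_name, text_dict):
    try:
        with open(f_name, 'w', newline=None) as csv_file:
            writer = csv.writer(csv_file)
            writer.writerows(text_dict.items())
    except OSError as e:  # IOError is an alias of OSError in Python 3
        print("[ERROR] Unable to write file on disk:", e)
        sys.exit(1)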
