Dataframes the most efficient way to fill the column of dataframe

Dataframes the most efficient way to fill the column of dataframe - python

I'm trying to fill the dataframe column. I want to do it with ".iat" using loop. But traditional for loop is really slow and filling the column with 100 000 values using for loop is not efficient. List comprehension do it faster but create useless list which I wouldn't use. I also thought map method but it also creates useless map object. So I want to it similar to map but without creating any array,mapobject,etc. What is the fastest method for doing such a thing?

In terms of time efficiency, using numpy arrays seem to win when measured with %timeit (using vscode + ipython interactive terminal).
#%%
import pandas as pd
import random
import numpy as np
size = 1000000
def makelarge_random():
return pd.DataFrame(pd.Series((random.randint(1, 100) for i in range(size))))
def makelarge_constant():
return pd.DataFrame(pd.Series((55 for i in range(size))))
def makelarge_empty_numpy():
return pd.DataFrame(np.empty(size, dtype=pd.Int64Dtype))
def makeints_numpy():
return pd.DataFrame(np.arange(stop = size))
print(f"Time to instantiate dataframe with length {size} from various methods")
print(" --- ")
print(makelarge_random.__name__)
%timeit makelarge_random()
print(" --- ")
print(makelarge_constant.__name__)
%timeit makelarge_constant()
print(" --- ")
print(makelarge_empty_numpy.__name__)
%timeit makelarge_empty()
print(" --- ")
print(makeints_numpy.__name__)
%timeit makeints_numpy()
#%%
Output:
Time to instantiate dataframe with length 1000000 from various inputs
---
makelarge_random
550 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
---
makelarge_constant
139 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
---
makelarge_empty_numpy
10 ms ± 94.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
---
makeints_numpy
641 µs ± 31.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Related

Fastest way to check if any item in a list is a numeric type?

I have a Python loop that's expected to run billions of times, and as such it need to be as tightly optimized as possible.
One of the operations is checking if a list of ~50 items contains a float or an integer.
I know about the any() builtin method, but is it the fastest way to do this kind of checking?

This kind of question about how fast or slow something is can be tested for yourself using the timeit module, though it can be hard to know some different ways to test agains. Below I have tested several options and included the timings. Overall, for a 50 element list, checking types is very unlikely to be the bottleneck in a complex program
#initialize a list of integers to create a random list from
ch=[1,2,3,4,5,6,7]
#Fill a list with random integers, 5000 items in length just for a bigger test
arr=[random.choice(ch) for _ in range(5000)]
#Add a single string to the end for a worst case iterating scenario
arr.extend('a')
#check the end of the list for funsies
arr[-5:]
[3, 6, 7, 4, 'a']
#Check for stringiness with the OP-mentioned any() function
%timeit any(type(i)==str for i in arr)
2.52 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Since isinstance() is the more pythonic way of assertive type-checking, let's see if it makes a difference
%timeit any(isinstance(i, str) for i in arr)
2.05 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Define a function to make time checking easier
def check_list(a):
for i in a:
#stop iteration if a string is found
if isinstance(i, str):
return True
else:
return False
#Try out our function
%timeit check_list(arr)
711 ns ± 85.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#let's pretend booleans are numbers to math up a solution
%timeit sum(map(lambda x:isinstance(x, str), arr))>0
2.86 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#conversion to a set takes some time but reduces the number of items we need to check, so let's try it
%timeit any(type(i)==str for i in set(arr))
99.4 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Let's try our custom function with a set
%timeit check_list(set(arr))
115 µs ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def check_set(a):
#let's convert the list to a set inside the function to see what happens
for i in set(a):
if isinstance(i, str):
return True
else:
return False
%timeit check_set(arr)
94.7 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We have a winner on this synthetic problem, but more importantly we can see how we tested several different options

What's under the hood of numpy's 'mean' function such that it works faster than built in python methods?

I've been exploring the performance differences between numpy functions and the normal built-in functions of Python, and I want to know how numpy functions are so optimized such that there's almost a 100x speed up.
Below is some code that I wrote to highlight the execution time differences between numpy mean() and manual calculation of mean using sum() and len()
import numpy as np
import time
n = 10**7
a = np.random.randn(n)
start = time.perf_counter()
mean = sum(a)/len(a)
seconds1 = time.perf_counter()-start
start = time.perf_counter()
mean = np.mean(a)
seconds2 = time.perf_counter()-start
print("First method takes time {:.3f}s".format(seconds1))
print("Second method takes time {:.3f}s".format(seconds2))
Output:-
First method takes 1.687s
Second method takes 0.013s

Make a numpy array:
In [130]: a=np.arange(10000)
Apply the numpy sum function:
In [131]: timeit np.sum(a)
16.2 µs ± 22.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
mean is a bit slower, since it has to divide by the shape (and may do a few other tests):
In [132]: timeit np.mean(a)
34.9 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.sum actually delegates the action to the sum method of the array, so using that directly is a bit faster:
In [133]: timeit a.sum()
13.3 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Python sum isn't a bad function, but it iterates over its argument. Iterating (in Python code) on an array is slow:
In [134]: timeit sum(a)
1.16 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Converting the array to a list first saves time:
In [135]: timeit sum(a.tolist())
369 µs ± 7.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Better yet if we just time the list operation:
In [136]: %%timeit alist=a.tolist()
...: sum(alist)
57.2 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When working with numpy arrays, it is best to use its own methods (or numpy functions). Generally when using Python functions, it is better to use lists.
Using a numpy function on a list is slow, because it has to first convert the list to an array:
In [137]: %%timeit alist=a.tolist()
...: np.sum(alist)
795 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

What is a faster option to compare vales in pandas?

I am trying to structure a df for productivity at some point i need to verify if a id exist in list and give a indicator in function of that, but its too slow (something like 30 seg for df).
can you enlighten me on a better way to do it?
thats my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(i already try to use the colume like a serie but it do not work correctly)

To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
return df["id"].isin(old_data["id"])
def apply(df, old_data):
return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
def set_(df, old_data):
old = set(old_data['id'].values)
return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large dataframe tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Seems like the set method is a smidge faster than the isin method for a small dataframe. However that comparison radically flips for a much larger dataframe. Seems like in most cases the isin method is will be the best way to go. Then the apply method is always the slowest of the bunch regardless of dataframe size.

Improve Harmonic Mean efficiency in Pandas pivot_table

I'm applying harmonic mean from scipy.stats for aggfunc parameter in Pandas pivot_table but it is much slower than a simple mean by orders of magnitude.
I would like to know if this is excepted behavior or there is a way to turn this calculation more efficient as I need to do this calculation thousands of times.
I need to use harmonic mean but this is taking a huge amount of processing time.
I've tried using harmonic_mean from statistics form Python 3.6 but still the overhead is the same.
Thanks
import numpy as np
import pandas as pd
import statistics
data = pd.DataFrame({'value1':np.random.randint(1000,size=200000),
'value2':np.random.randint(24,size=200000),
'value3':np.random.rand(200000)+1,
'value4':np.random.randint(100000,size=200000)})
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=hmean)
1.74 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
1.9 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=np.mean)
37.4 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Single run for both functions
%timeit hmean(data.value3[:100])
155 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(data.value3[:100])
138 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I would recommend using multiprocessing.Pool, the code below has been tested for 20 million records, it is 3 times faster than the original, give it try please, for sure code still needs more improvements to answer your specific question about the slow performance of statistics.harmonic_mean.
note: you can get even better results for records > 100 M.
import time
import numpy as np
import pandas as pd
import statistics
import multiprocessing
data = pd.DataFrame({'value1':np.random.randint(1000,size=20000000),
'value2':np.random.randint(24,size=20000000),
'value3':np.random.rand(20000000)+1,
'value4':np.random.randint(100000,size=20000000)})
def chunk_pivot(data):
result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
return result
DataFrameDict=[]
for i in range(4):
print(i*250,i*250+250)
DataFrameDict.append(data[:][data.value1.between(i*250,i*250+249)])
def parallel_pivot(prcsr):
# 6 is a number of processes I've tested
p = multiprocessing.Pool(prcsr)
out_df=[]
for result in p.imap(chunk_pivot, DataFrameDict):
#print (result)
out_df.append(result)
return out_df
start =time.time()
dict_pivot=parallel_pivot(6)
multiprocessing_result=pd.concat(dict_pivot,axis=0)
#singleprocessing_result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
end = time.time()
print(end-start)

efficient python array to numpy array conversion

I get a big array (image with 12 Mpix) in the array format from the python standard lib.
Since I want to perform operations on those array, I wish to convert it to a numpy array.
I tried the following:
import numpy
import array
from datetime import datetime
test = array.array('d', [0]*12000000)
t = datetime.now()
numpy.array(test)
print datetime.now() - t
I get a result between one or two seconds: equivalent to a loop in python.
Is there a more efficient way of doing this conversion?

np.array(test) # 1.19s
np.fromiter(test, dtype=int) # 1.08s
np.frombuffer(test) # 459ns !!!

asarray(x) is almost always the best choice for any array-like object.
array and fromiter are slow because they perform a copy. Using asarray allows this copy to be elided:
>>> import array
>>> import numpy as np
>>> test = array.array('d', [0]*12000000)
# very slow - this makes multiple copies that grow each time
>>> %timeit np.fromiter(test, dtype=test.typecode)
626 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fast memory copy
>>> %timeit np.array(test)
63.5 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# which is equivalent to doing the fast construction followed by a copy
>>> %timeit np.asarray(test).copy()
63.4 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# so doing just the construction is way faster
>>> %timeit np.asarray(test)
1.73 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# marginally faster, but at the expense of verbosity and type safety if you
# get the wrong type
>>> %timeit np.frombuffer(test, dtype=test.typecode)
1.07 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dataframes the most efficient way to fill the column of dataframe - python

Related

Fastest way to check if any item in a list is a numeric type?

What's under the hood of numpy's 'mean' function such that it works faster than built in python methods?

What is a faster option to compare vales in pandas?

Improve Harmonic Mean efficiency in Pandas pivot_table

efficient python array to numpy array conversion

Categories

Resources