Python: faster way of counting occurrences in numpy arrays (large dataset)

I am new to Python. I have a numpy array whose size is 66049x1 (66049 rows and 1 column). The values are sorted from smallest to largest, are of float type, and some of them are repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X<=x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is below, but it is extremely slow, as it has to loop 66049x66049=4362470401 times. Is there any way to speed up this piece of code? Would using dictionaries perhaps help? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp=x3[i]
    for j in range(len(x3)):
        if (temp<=x3[j]):
            count[j]=count[j]+1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i,0]=x3[i]
    x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++
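For reference, a vectorized sketch of the same computation (a sketch only, with stand-in data in place of the question's csv): because x3 is already sorted, a single np.searchsorted call gives, for every element, the count of values less than or equal to it, turning the O(N^2) double loop into O(N log N).
import numpy as np
# stand-in for the sorted, rounded array x3 built above
x3 = np.round(np.sort(np.random.rand(66049)), 3)
# for each value, the number of elements <= that value
counts = np.searchsorted(x3, x3, side='right')
# (value, relative frequency) pairs, analogous to x4 above
x4 = np.column_stack((x3, np.round(counts / x3.shape[0], 3)))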

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script took a very short time (~2 s) to execute for a (100,000 x 1) array. I didn't time it precisely, but if you provide the time it took to run yours we can compare.
I used Counter from collections to count the number of occurrences; my experience with it has always been great (time-wise). I converted it into a DataFrame to plot and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
Edit 2
For a scatter() plot you must specify (x, y) explicitly. Also, calling df_p['cumu'] will result in a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()

You should use np.where and then take the length of the resulting array of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
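Equivalently, np.count_nonzero on the boolean mask gives the same number without materializing the index array; a small sketch with stand-in data for x3 and value:
import numpy as np
x3 = np.round(np.sort(np.random.rand(66049)), 3)  # stand-in for the question's array
value = x3[100]
# same result as len(np.where(x3 <= value)[0])
count = np.count_nonzero(x3 <= value)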

If efficiency matters, you can use the numpy function bincount, which needs integers:
import numpy as np
a=np.random.rand(66049).reshape((66049,1)).round(3)
z=np.bincount(np.int32(1000*a[:,0]))
It takes about 1 ms.
Regards.

# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)

Related

Successive calculation of coefficient of variation

I have a long list of values that is triple-indexed (i,j,t). For all i in I and j in J I have to extract all t values and calculate the coefficient of variation (cv) successively. The length of the cv list is len(I)*len(J). Then I plot the cv list and check whether the cv converges to some number.
Right now I am looping, which is rather inefficient (see example). Is there another possibility that avoids the loops?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iN = 10
jN = 10
tN = 20
I = range(iN)
J = range(jN)
T = range(tN)
idx = pd.MultiIndex.from_product([I,J,T])
data = np.random.normal(loc=1, size=iN*jN*tN)
df = pd.DataFrame(data, index=idx, columns=['value'])
values_lst = []
cv_lst = []
for i in I:
    for j in J:
        values_lst.extend(df.loc[(i,j,slice(None)), 'value'])
        sd = np.std(values_lst, ddof=1)
        mean = np.mean(values_lst)
        cv_lst.append(sd/mean)
plt.plot(cv_lst)
plt.show()
I'm posting this on the assumption that you can easily extract your data into an (i,j,t)-sized numpy array. In that case you can let numpy do its magic:
cv_list = np.std(data, axis=2, ddof=1)/np.mean(data, axis=2)
axis=2 means the mean or std is taken over your t axis. The result is still a 2D array that you can reshape or flatten as you like.
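Since the MultiIndex in the question is built with from_product, the values are already ordered by (i, j, t), so extracting that 3D array is a one-liner; a sketch under that assumption, reusing df, iN, jN and tN from the question:
# assumes df, iN, jN, tN as defined in the question
data_3d = df['value'].values.reshape(iN, jN, tN)
cv_2d = np.std(data_3d, axis=2, ddof=1) / np.mean(data_3d, axis=2)
cv_flat = cv_2d.flatten()  # one cv per (i, j) pair, as in cv_list above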

How to Write a plt.scatter(x, y) function in one line where y=function of x

I was plotting a scatter plot to show null values in a dataframe. As you can see, the plt.scatter() call is not expressive enough: the relation between list(range(0,1200)) and a is not clear unless you see the previous lines. Can plt.scatter(x, y) be written in a more explicit way, so that it is easy to understand how x and y are related? That is, if somebody only saw the plt.scatter(x, y) call, they would understand what it is about.
a = []
for i in range(0,1200):
    feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>i]
    a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)
On your x-axis you have a threshold number; on the y-axis you want to plot the number of columns in your DataFrame that have more than that number of null values.
Instead of your loop you can count the number of null values within each column and use numpy broadcasting ([:, None]) to compare against an array of thresholds. This allows you to specify an xarr of the numbers and then use that same array in the comparison.
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.NaN], (100,10)))
Code
# Range of 'x' values to consider
xarr = np.arange(0, 100)
plt.scatter(xarr, (df.isnull().sum().to_numpy()>xarr[:, None]).sum(axis=1))
ALollz's answer is good, but here's a less numpy-heavy alternative if that's your thing:
feature_null_counts = df.isnull().sum()
n_nulls = list(range(100))
features_with_n_nulls = [sum(feature_null_counts > n) for n in n_nulls]
plt.scatter(n_nulls, features_with_n_nulls)

Python 3.7 + numpy + pandas arrays: selecting data within a range

OK, I'm going to try to explain my problem. I have a csv file whose data is wavelength and amplitude; the image is included here.
CSV data
So, I want to select only the data between 500 nm and 800 nm (wave):
import pandas as pd
import numpy as np
excelfile=pd.read_csv('Files/660nm.csv');
excelfile.head();
wave = excelfile['Longitud'];
wave = np.array(wave);
X = excelfile['Amplitud'];
X = np.array(X);
wave = wave[(wave > 500) & (wave < 800)]
This does what I want in the first instance, but I want to extend the selection to the amplitude column (X) so that I have two arrays of the same dimensions. In my actual code I have to build an index to select the data in the amplitude array (X):
indices = np.arange(382,775,1)
X = np.take(X, indices)
But this is not best practice. If I can extend the first column's selection to the amplitude column, I don't have to make another array to index the X array and then check its length. Any idea about it?
Thanks.
As @ALollz pointed out, you shouldn't split the DataFrame up. Instead, just filter the whole DataFrame on wavelength. See the docs for DataFrame.loc.
import pandas as pd
import numpy as np
# some dummy data
excelfile = pd.DataFrame({'Longitud': np.random.random(100) * 1000,
                          'Amplitud': np.arange(100)})
wave = excelfile['Longitud']
excelfile_filtered = excelfile.loc[(wave > 500) & (wave < 800)]
X = excelfile_filtered['Amplitud'].values  # yields an array
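If the wavelengths are needed as an array as well, both columns can be pulled from the same filtered frame, so the two arrays stay aligned and have the same length (a sketch continuing the snippet above):
wave_filtered = excelfile_filtered['Longitud'].values
X = excelfile_filtered['Amplitud'].values  # same length as wave_filtered, row-aligned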

customizing np.fft.fft function in python

I wish to perform a Fourier transform of the function stress from 0 to infinity and extract the real and imaginary parts. I have the following code that does this using a numerical integration technique:
import numpy as np
from scipy.integrate import trapz
import fileinput
import sys,string
window = 200000 # length of the array I wish to transform (number of data points)
t = np.linspace(1,window,window)
freq = np.logspace(-5,2,window)
output = [0]*len(freq)
for index,f in enumerate(freq):
    visco = trapz(stress*np.exp(-1j*f*t),t)
    soln = visco*(1j*f)
    output[index] = soln
print 'f storage loss'
for i in range(len(freq)):
    print freq[i],output[i].real,output[i].imag
This gives me a nice transformation of my input data.
Now I have an array of size 2x10^6, and using the above technique is not feasible (computation time scales as O(N^2)), so I have turned to the built-in fft function in numpy.
There aren't too many arguments that you can specify to change this function, and so I'm finding it difficult to customize it to my needs.
So far I have
import numpy as np
import fileinput
import sys, string
np.set_printoptions(threshold='nan')
N = len(stress)
fvi = np.fft.fft(stress,n=N)
gprime = fvi.real
gdoubleprime = fvi.imag
for i in range(len(stress)):
    print gprime[i], gdoubleprime[i]
And it's not giving me accurate results.
The DFT in numpy is of the form A_k = sum_{m=0}^{n-1} a_m * exp(-2*pi*i*m*k/n) (http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.fft.html). How can I change it to the form I used in my first code, i.e. exp(-1j*freq*t) (freq is the frequency and t is the time, which have already been predefined)? Or is there some post-processing of the data that I have to do?
Thanks in advance for all your help.
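For context, the mapping between the FFT bin index k and the angular frequency in the exp(-1j*freq*t) form can be made explicit with np.fft.fftfreq; a minimal sketch, assuming the stress samples are uniformly spaced in time with spacing dt (dt and the stand-in stress array are assumptions, not taken from the question):
import numpy as np
dt = 1.0                        # assumed uniform spacing of the time samples
stress = np.random.rand(2000)   # stand-in for the actual stress data
fvi = np.fft.fft(stress)
omega = 2 * np.pi * np.fft.fftfreq(len(stress), d=dt)  # angular frequency of each bin
# fvi[k] approximates sum(stress * exp(-1j * omega[k] * t)) with t = n*dt,
# i.e. the FFT only evaluates these particular frequencies, not an arbitrary
# grid such as the logspace used in the trapezoid version.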

binning data in python with scipy/numpy

Is there a more efficient way to take an average of an array in pre-specified bins? For example, I have an array of numbers and an array of bin start and end positions, and I just want to take the mean in those bins. I have code that does it below, but I am wondering how it can be cut down and improved. Thanks.
from scipy import *
from numpy import *
def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val
data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []
n = 0
for n in range(0, len(bins)-1):
    b_start = bins[n]
    b_end = bins[n+1]
    binned_data.append(get_bin_mean(data, b_start, b_end))
print binned_data
It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)
The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]
Not sure why this thread got necroed, but here is a 2014-approved answer, which should be far faster:
import numpy as np
data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(np.int)
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print mean
The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)
I would add, also to answer the question "find mean bin values using histogram2d python", that scipy also has a function specifically designed to compute a two-dimensional binned statistic for one or more sets of data:
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets.
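A small sketch of the call signature of the N-dimensional variant, shown here for the same 2-D case (the column_stack simply packs the coordinates into the (N, D) sample array the function expects):
import numpy as np
from scipy.stats import binned_statistic_dd
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
# sample is an (N, D) array of coordinates; statistic defaults to 'mean'
bin_means = binned_statistic_dd(np.column_stack((x, y)), values, bins=10).statistic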
Another alternative is to use ufunc.at. This method applies the desired operation in place at the specified indices.
We can get the bin position of each data point using the searchsorted method.
Then we can use at to increment histogram by 1 at the positions given by bin_indexes, once for every occurrence of each index in bin_indexes.
np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)
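The same at-based pattern also covers the original bin-mean question by accumulating per-bin sums alongside the counts (a sketch continuing the snippet above):
sums = np.zeros_like(bins)
np.add.at(sums, bin_indexes, data)                          # per-bin sums, same indices
bin_means = sums / np.where(histogram == 0, 1, histogram)   # avoid division by zero in empty bins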
