How to directly create a categorical series in dask?

I would like to create a categorical dask Series based on a filter on another series. With pandas, I would do the following:
import numpy as np
import pandas as pd
x = pd.Series(np.random.random(10))
test = (x < 0.5).astype(int)
label = pd.Series(pd.Categorical.from_codes(test, categories=['a', 'b']))
If x is a dask Series, is there a way to create an equivalent label dask series without having to explicitly create the pandas series first (e.g., avoiding .compute() and from_pandas)?

Yes, everything you need is available in dask. For example:
import pandas as pd
import dask.array as da
import dask.dataframe as dd
r = da.random.random(1000000, chunks=(10000,))  # dask array
s = dd.from_array(r)                            # dask series
label = s.map_partitions(
    lambda d: pd.Series(pd.Categorical.from_codes(
        (d < 0.5).astype(int), categories=['a', 'b'])),
    meta='category')
(of course, replace your s with real data if you didn't really want random numbers)
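A quick sanity check (a sketch; head() only computes a handful of rows from the first partition):
print(label.dtype)   # category
print(label.head())  # values drawn from ['a', 'b'], depending on whether s < 0.5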

Related

xarray: best way to "insert" a time slice into a dataset or dataarray

I have a 3-dimensional xarray dataset with the dimensions x, y, and time. Assuming I know that there's a missing observation at timestep n, what would be the best way to insert a timeslice with no-data values?
Here's a working example:
import xarray as xr
import pandas as pd
x = xr.tutorial.load_dataset("air_temperature")
# assuming this is the missing point in time (currently not in the dataset)
missing = "2014-12-31T07:00:00"
# create an "empty" time slice with fillvalues
empty = xr.full_like(x.isel(time=0), -3000)
# fix the time coordinate of the timeslice
empty['time'] = pd.date_range(missing, periods=1)[0]
# before insertion
print(x.time[-5:].values)
# ['2014-12-30T18:00:00.000000000' '2014-12-31T00:00:00.000000000'
# '2014-12-31T06:00:00.000000000' '2014-12-31T12:00:00.000000000'
# '2014-12-31T18:00:00.000000000']
# concat and sort time
x2 = xr.concat([x, empty], "time").sortby("time")
# after insertion
print(x2.time[-5:].values)
# ['2014-12-31T00:00:00.000000000' '2014-12-31T06:00:00.000000000'
# '2014-12-31T07:00:00.000000000' '2014-12-31T12:00:00.000000000'
# '2014-12-31T18:00:00.000000000']
The example works fine, but I'm not sure if that's the best (or even the correct) approach.
My concerns are to use this with bigger datasets, and specifically with dask-array backed datasets.
Is there a better way to fill a missing 2d array?
Would it be better to use a dask-backed "fill array" when inserting into a dask-backed dataset?
You might consider using xarray's reindex method with a constant fill_value for this purpose:
import numpy as np
import xarray as xr
x = xr.tutorial.load_dataset("air_temperature")
missing_time = np.datetime64("2014-12-31T07:00:00")
missing_time_da = xr.DataArray([missing_time], dims=["time"], coords=[[missing_time]])
full_time = xr.concat([x.time, missing_time_da], dim="time")
full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
I think both your method and the reindex method will automatically use dask-backed arrays if x is dask-backed.
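If you want to verify that, a minimal sketch (assuming the full_time index built above; opening, rather than loading, the tutorial data with a chunks argument makes it dask-backed):
import xarray as xr
x = xr.tutorial.open_dataset("air_temperature", chunks={"time": 1000})
print(type(x.air.data))      # dask.array.core.Array
full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
print(type(full.air.data))   # still a dask array; nothing has been computed yet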

Python 3.7 + Numpy + pandas arrays: selecting data between a range

OK, I'm going to try to explain my problem. I have a CSV file whose data is wavelength and amplitude; an image of it is included here.
[image: CSV data]
So, I want to select only the data between 500 nm and 800 nm (wave):
import pandas as pd
import numpy as np
excelfile = pd.read_csv('Files/660nm.csv')
excelfile.head()
wave = excelfile['Longitud']
wave = np.array(wave)
X = excelfile['Amplitud']
X = np.array(X)
wave = wave[(wave > 500) & (wave < 800)]
This does what I want at first, but I want to extend the selection to the amplitude column (X) so that I end up with two arrays of the same length. In my actual code I have to build an index to select the data in the amplitude array (X):
indices = np.arange(382,775,1)
X = np.take(X, indices)
But this is not best practice: if I could extend the selection on the first column to the amplitude column, I wouldn't have to build another array to index X and then check the arrays' lengths. Any ideas?
Thanks.
As @ALollz pointed out, you shouldn't split the DataFrame up. Instead, just filter the whole DataFrame on wavelength; see the docs for DataFrame.loc:
import pandas as pd
import numpy as np
# some dummy data
excelfile = pd.DataFrame({'Longitud': np.random.random(100) * 1000,
                          'Amplitud': np.arange(100)})
wave = excelfile['Longitud']
excelfile_filtered = excelfile.loc[(wave > 500) & (wave < 800)]
X = excelfile_filtered['Amplitud'].values  # yields an array
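Equivalently, Series.between expresses the range filter in one step (a sketch on the same dummy frame; note that between is inclusive at both ends by default, unlike the strict </> comparisons above):
in_range = excelfile.loc[excelfile['Longitud'].between(500, 800)]
wave = in_range['Longitud'].values
X = in_range['Amplitud'].values  # same length as wave by construction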

dask DataFrame.assign blows up dask graph

So I have an issue with dask DataFrame.assign: I generate a lot of derived features from the main data and assign them to the main dataframe as new columns. After that, the dask graph for any set of columns is blown up. Here is a small example:
%pylab inline
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dot import dot_graph
df=pd.DataFrame({'x%s'%i:np.random.rand(20) for i in range(5)})
ddf = dd.from_pandas(df, npartitions=2)
dot_graph(ddf['x0'].dask)
here is the dask graph as expected
g=ddf.assign(y=ddf['x0']+ddf['x1'])
dot_graph(g['x0'].dask)
here the graph for the same column is exploded with irrelevant computations
Imagine I have lots and lots of spawned columns: the computation graph for any particular column then includes irrelevant computations for all the other columns. In my case I have len(ddf['someColumn'].dask) > 100000, so this becomes unusable quickly.
So my question is: can this issue be resolved? Are there any existing means to do this? If not, what direction should I look in to implement it?
Thanks!
Rather than continuously assigning new columns to the dask dataframe, you might want to build several dask series and then concat them all together at the end.
So instead of doing this:
df['x'] = df.w + 1
df['y'] = df.x * 10
df['z'] = df.y ** 2
Do this:
x = df.w + 1
y = x * 10
z = y ** 2
df = df.assign(x=x, y=y, z=z)
Or this:
dd.concat([df, x, y, z], axis=1)
This may still result in the same number of tasks in your graph, but it will probably result in fewer memory copies.
Alternatively, if all of your transformations are row-wise, then you can construct a pandas function and map it across all partitions:
def f(part):
    part = part.copy()
    part['x'] = part.w + 1
    part['y'] = part.x * 10
    part['z'] = part.y ** 2
    return part
df = df.map_partitions(f)
Also, while a million-node task graph is less than ideal, it should still be OK. I've seen larger graphs run comfortably.
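Applied to the toy example from the question, the map_partitions route looks roughly like this (a sketch, not part of the original answer; exact task counts depend on your dask version):
def add_features(part):
    part = part.copy()
    part['y'] = part['x0'] + part['x1']
    return part

ddf2 = ddf.map_partitions(add_features)
# the per-column graph stays small because the new column is built inside each partition
print(len(ddf2['x0'].__dask_graph__()))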

Pandas: Apply function to set of groups

I have the following problem:
Given a 2D dataframe whose first column holds values and whose second column gives the category of each point, I would like to run k-means on the per-category means and then, as a new column in the original dataframe, assign the centroid that each value's group mean is closest to.
I would like to do this using groupby.
More generally, my problem is that apply (to my knowledge) can only use functions that are defined on the individual groups (like mean()), whereas k-means needs information about all the groups. Is there a nicer way than converting everything to numpy arrays and working with those?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0, 100, size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A', 'B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means, k)
fig, ax = plt.subplots()
print(dictionary)
What I would like to get now is a new column in df that gives, for each row, the corresponding value in dictionary.
You can achieve this with the following:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
df = df.join(means_data_frame, rsuffix='_mean', on='A')
This will append two more columns to df, B_mean and cluster, giving each group's mean and the cluster that the group's mean is closest to, respectively.
If you really want to use apply, you can write a function that reads the cluster value from means_data_frame and assigns it to a new column in df.
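For example (a sketch using the variables defined above, not part of the original answer), you can skip apply entirely and map the labels back through the group key:
# means_data_frame is indexed by the group key 'A', so a plain map works
df['cluster'] = df['A'].map(means_data_frame['cluster'])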

Python: faster way of counting occurrences in numpy arrays (large dataset)

I am new to Python. I have a numpy.array whose size is 66049x1 (66049 rows and 1 column). The values are sorted from smallest to largest and are of float type, with some of them repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X <= x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is as follows, but it is extremely slow, as it has to loop 66049x66049 = 4362470401 times. Is there any way to speed up this piece of code? Would using dictionaries perhaps help? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp=x3[i]
    for j in range(len(x3)):
        if (temp<=x3[j]):
            count[j]=count[j]+1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i,0]=x3[i]
    x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script took a very short time to execute (~2 s) for a (100,000 x 1) array. I didn't time it precisely, but if you provide the time yours took we can compare.
I used Counter from collections to count the number of occurrences; my experience with it has always been great (time-wise). I converted the result into a DataFrame to plot it and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
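One caveat with the snippet above (an addition, not part of the original answer): Counter keys come back in first-seen order, so for a true cumulative distribution you would sort the index and normalize by the total count:
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()
df_p['cdf'] = df_p['data'].cumsum() / df_p['data'].sum()
df_p['cdf'].plot(kind='line')
plt.show()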
Edit 2
For a scatter() plot you must specify x and y explicitly. Also, df_p['cumu'] is a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
You can use np.where and then take the length of the resulting vector of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
If efficiency counts, you can use the numpy function bincount, which needs integers:
import numpy as np
a = np.random.rand(66049).reshape((66049, 1)).round(3)
z = np.bincount(np.int32(1000 * a[:, 0]))
It takes about 1 ms.
Regards.
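As an optional follow-up to the bincount approach (not part of the original answer), a cumsum over z gives the cumulative counts directly:
cdf = np.cumsum(z) / z.sum()  # empirical CDF over the 0.001-wide bins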
# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)
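Putting these ideas together for the original use case, np.unique with return_counts gives the whole empirical CDF in one vectorized pass (a sketch; x3 is the sorted, rounded array from the question):
import numpy as np
values, counts = np.unique(x3, return_counts=True)
ecdf = np.cumsum(counts) / x3.size  # fraction of samples <= each distinct value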
