Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:
set_index
set_partitions
rearrange_by_column
rearrange_by_column_tasks
SimpleShuffleLayer
When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.
is it the case that a single index can never be in two different partitions?
No, it's certainly allowed. Dask will even intend for this to happen. However, because of a bug in set_index, all the data will still end up in one partition.
An extreme example (every row is the same value except one):
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] + [1] * 20})
In [4]: ddf = dd.from_pandas(df, npartitions=10)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
Out[6]: (0, 0, 0, 0, 0, 0, 0, 1)
As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:
In [7]: import dask
In [8]: dask.compute(s.to_delayed()) # easy way to see the partitions separately
Out[8]:
([Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],)
This is because the code deciding which output partition a row belongs to doesn't consider duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", which is why all the data always ends up in the last partition.
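As a minimal illustration of that mechanism (a sketch of the idea only, not Dask's actual shuffle code): with duplicated divisions, searchsorted with side="right" skips past every repeated boundary, so all the duplicated index values map to the final partition.
import numpy as np

divisions = np.array([0, 0, 0, 0, 0, 0, 0, 1])   # the divisions from Out[6]
values = np.array([0, 0, 0, 1])                   # index values being shuffled

# side="right" returns the position after the last equal boundary, so every 0
# is placed past all seven 0-boundaries, i.e. into the last partition.
print(np.searchsorted(divisions, values, side="right"))  # [7 7 7 8]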
I'll update this answer when the issue is fixed.
Is it the case that a single index can never be in two different partitions?
IIUC, the answer for practical purposes is yes.
A dask dataframe will in general have multiple partitions and dask may or may not know about the index values associated with each partition (see Partitions). If dask does know which partition contains which index range, then this will be reflected in df.divisions output (if not, the result of this call will be None).
When running .set_index, dask will compute divisions and it seems that in determining the divisions it will require that divisions are sequential and unique (except for the last element). The relevant code is here.
So there are two potential follow-up questions: why not allow non-sequential divisions at all, and, as a special case of that, why not allow duplicate index values across partitions?
With regards to the first question: for smallish data it might be feasible to think about a design that allows non-sorted indexing, but you can imagine that a general non-sorted indexing won't scale well, since dask will need to store indexes for each partition somehow.
With regards to the second question: it seems that this should be possible, but it also seems that right now it's not implemented correctly. See the snippet below:
# use this to generate 10 indexed partitions
import pandas as pd
for user in range(10):
    df = pd.DataFrame({'user_col': [user//3]*100})
    df['user'] = df['user_col']
    df = df.set_index('user')
    df.index.name = 'user_index'
    df.to_parquet(f'test_{user}.parquet', index=True)
# now load them into a dask dataframe
import dask.dataframe as dd
ddf = dd.read_parquet('test_*.parquet')
# dask will know about the divisions
print(ddf.known_divisions) # True
# further evidence
print(ddf.divisions) # (0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3)
# this should show three partitions, but will show only one
print(ddf.loc[0].npartitions) # 1
I have just noticed that Dask's documentation for shuffle says
After this operation, rows with the same value of on will be in the same partition.
This seems to confirm my empirical observation.
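As a hedged sanity check of that documented guarantee, reusing the question's users data (this assumes dask's DataFrame.shuffle method, which operates on a column rather than the index):
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'user': [1] * 1000 + [2] * 1000 + [3] * 1000})
ddf = dd.from_pandas(df, npartitions=10)

shuffled = ddf.shuffle(on='user')
counts = shuffled.map_partitions(len).compute()
# Per the documentation quoted above, no user's 1000 rows should be split
# across partitions, so every non-empty count should be a multiple of 1000.
print(counts[counts > 0])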
In Pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] to return groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where I would obtain individual groups by df[key][variable] == some condition. Groups that satisfy the same condition but are not contiguous are treated as separate groups. If the condition was x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
I am attempting to do the same thing in xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work.
What I have been successful doing so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above: [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is, how do I cut these groups so that it becomes [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with groups [1,1,1,1,1,1,1], [2,2]. Is there a way to then groupby the groups on noncontiguous index values?
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas functionality, something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to get groups of contiguous values
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum() )
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
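Putting the pieces together, here is a hedged sketch that materializes each contiguous run, reusing dat from above (the .rename('group') is my addition so groupby has a named key):
group_ids = xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum().rename('group')
runs = [run.values for _, run in dat.groupby(group_ids)]
# runs -> [array([1, 1, 1, 1]), array([2, 2]), array([1, 1, 1, 1])]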
The xarray "How do I ..." page could use some recipes like this ("group contiguous values"); I suggest you contact the developers and have them added.
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the desire to subselect certain conditions; however, I was able to adapt the answer of smci, above, in the following way:
(1) create indexnumber variable:
df = Dataset(
    data_vars={
        'some_data': (('date'), some_data),
        'more_data': (('date'), more_data),
        'indexnumber': (('date'), arange(0, len(date_arr)))
    },
    coords={
        'date': date_arr
    }
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby(df2['sumcum'])
hope this helps anyone else out there looking to do this.
I'm trying to repartition my dask dataframe by city. I currently have over 1M rows but only 3 cities. So naturally, I expect to have 3 partitions based off of the parameter I included.
Code I'm using, directly from the Dask documentation site:
ddf_1 = ddf.set_index("City")
ddf_2 = ddf_1.repartition(divisions=list(ddf_1.index.unique().compute()))
I created a dummy DF below to help explain what I would like as a result. Below I have an imbalanced dataset based on City. I want to partition the DF based on the number of unique cities.
Ideal result:
However, after running the above code, I'm getting only two partitions, where each of the two partitions includes 2 unique indexes (i.e. cities). I can't figure out why, after explicitly indicating how dask should partition the DF, it results in 2 instead of 3 partitions. One thought is that maybe, since the DF is imbalanced, it ignored the 'divisions' parameter.
As explained in the docstring of set_index, len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. Therefore, if you want your Dask DataFrame to have 3 partitions, you need to pass a list of length 4 to divisions. Additionally, when you call set_index on a Dask DataFrame, it will repartition according to the arguments passed, so there is no need to call repartition immediately afterwards. I would recommend doing:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({
    'City': ['Miami'] * 4 + ['Chicago'] * 2 + ['Detroit'],
    'House_ID': [1, 2, 3, 4, 3, 4, 2],
    'House_Price': [100000, 500000, 400000, 300000, 250000, 135000, 269000]
})
ddf = dd.from_pandas(df, npartitions=2).set_index(
    'City', divisions=['Chicago', 'Detroit', 'Miami', 'Miami'])
Alternatively, you can let Dask pick the best partitioning based on memory use by changing the last line in the above snippet to ddf = dd.from_pandas(df, npartitions=2).set_index('City', npartitions='auto')
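For the explicit-divisions version above, a quick sanity check (a hedged sketch using dask's get_partition) confirms that each city ends up in its own partition:
print(ddf.npartitions)  # 3
for i in range(ddf.npartitions):
    print(ddf.get_partition(i).compute().index.unique().tolist())
# expected: ['Chicago'], then ['Detroit'], then ['Miami']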
I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]
data = np.stack(some_df['A'].values) #shape (10, 4, 6, 8)
processed = np.max(data, axis=1) # shape (10, 6, 8)
some_df['B'] = processed # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended; it is painful, slow, and makes later processing difficult.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df

def parse_series_into_np(df, col_name, shp):
    # can parse the shape from the col names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage, to put an ndarray into a DataFrame:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of ndarrays sharing the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
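And a hedged usage sketch of that reverse step, assuming the same d and full_rate_df as above:
# recover the original (n_samples, *shape) array for "name1"
name1_np = parse_series_into_np(full_rate_df, "name1", d["name1"].shape[1:])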
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time-series in the form of pandas dataframes, with one column holding same-length numpy arrays and the other columns containing information on meta-data with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I used it for the last 4 years on a regular basis. But this method can run into huge memory problems. In my case I came across these problems working with dataframes beyond 5 million rows and time-series with approx. 100 data points each.
The solution to these problems is extremely simple; since I did not find it anywhere, I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign this to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))
I have the following (time-series) data:
t = [5.13, 5.27, 5.40, 5.46, 190.99, 191.13, 191.267, 368.70, 368.83, 368.90, 368.93]
y = [17.17, 17.18, 17.014, 17.104, 16.981, 16.96, 16.85, 17.27, 17.66, 17.76, 18.01]
so, groups of data in short (time) intervals, separated cleanly by long time gaps.
I'm looking for a simple method that will intelligently average these together; sort of a 'Bayesian blocks' but for non-histogram data.
One could do a simple moving average, or numpy convolution, but I'm looking for something a bit smarter that will generalize to larger, similar, but not identical datasets.
It's easy with Pandas. First, construct a DataFrame:
import pandas as pd

df = pd.DataFrame({'t': t, 'y': y})
Then label the groups according to a time threshold:
groups = (df.t.diff() > 10).cumsum()
That gives you [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2], because cumsum() on a boolean array increments wherever the input is true.
Finally, use groupby():
df.groupby(groups).mean()
It gives you:
t y
t
0 5.315 17.117000
1 191.129 16.930333
2 368.840 17.675000
If you need plain NumPy arrays at the end, just tack on .t.values and .y.values.
If you don't know a priori what time threshold to use, I'm sure you can come up with some heuristic, perhaps involving simple statistics on df.t and df.t.diff().
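For example, one possible heuristic (my assumption, not something established above) is to set the threshold as a multiple of the typical gap:
gaps = df.t.diff().dropna()
threshold = 5 * gaps.median()          # well above the ~0.1 s steps, well below the ~180 s jumps
groups = (df.t.diff() > threshold).cumsum()
print(df.groupby(groups).mean())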
I have a CSV file containing data: (just the first ten rows of data are listed)
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first value in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
Error occurs as below:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas computes the number of columns when it reads the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so I guess 121 columns should be enough.
Further, how can I transform this into one-hot encoded format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
As for your second problem, there's an existing solution here that uses sklearn.OneHotEncoder. If you are looking to convert each column to a one hot encoding, you may use it.
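If the goal is a single multi-hot matrix (one row per line, one column per class), another option, not mentioned above, is sklearn's MultiLabelBinarizer applied to the raw lines. This sketch assumes the first field of each line is the row number and the remaining fields are the class labels:
from sklearn.preprocessing import MultiLabelBinarizer

with open('df0.txt') as f:
    label_lists = [[int(x) for x in line.strip().split(',')[1:]] for line in f if line.strip()]

mlb = MultiLabelBinarizer()                # pass classes=... to pin the full 120-label set if it is known
one_hot = mlb.fit_transform(label_lists)   # shape (n_rows, n_observed_classes), entries 0/1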
I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the feature matrix (with columns 1, 2, 3, 6):
1, 0, 1, 0
0, 1, 1, 1
This code achieves that, but it is surely not optimized (note that func must be defined before the loop that uses it):
def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify
    # the overlapping and non-overlapping columns.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

df = pd.read_csv(file, header=None, names=range(121), sep=',')
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k]))

for n, l in enumerate(one_hot):
    if n == 0:
        df = one_hot[n]
    else:
        df = func(df1=df, df2=one_hot[n])
From here you could feed it into sklearn's OneHotEncoder, as #cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder().fit_transform(df)  # returns a sparse matrix

import sys
sys.getsizeof(onehot)  # smaller than the dense pandas frame
sys.getsizeof(df)
I guess I'm unsure whether the assumptions I noted above are what you want done with your data; it seems perhaps they aren't. I thought that a given line in your csv was indicating the classes present for that row. I guess I'm a little unclear on it still.