In pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] to return groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where the individual groups are determined by df[key][variable] == some condition. Groups that satisfy the same condition but are not contiguous are treated as their own groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
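Concretely, with the example series and the condition x < 2, the pandas version looks like this (a small sketch of the same pattern, with made-up names):
import pandas as pd
s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])
mask = s < 2
# label contiguous runs of the selected index values, then group by the label
runs = s[mask].groupby(s[mask].index.to_series().diff().ne(1).cumsum())
print([list(g) for _, g in runs])  # [[1, 1, 1, 1], [1, 1, 1, 1]]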
I am attempting to do the same thing with the xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work.
What I have been successful doing so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is, how do I cut these groups out so that they become [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with groups [1,1,1,1,1,1,1,1], [2,2]. Is there a way to then split these groups on non-contiguous index values?
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas functionality with something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to get groups of contiguous values
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum() )
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
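For completeness, here is the same recipe consolidated into one runnable snippet that also iterates over the resulting groups (the name 'run' is just a label introduced here):
import xarray as xr
dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims='x')
# run label per element: 1 wherever the value changes from its left
# neighbour (with a leading 0 prepended), then cumsum
run_id = xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
for label, run in dat.groupby(run_id.rename('run')):
    print(label, run.values)
# 0 [1 1 1 1]
# 1 [2 2]
# 2 [1 1 1 1]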
The xarray "How do I ..." documentation page could use some recipes like this ("Group contiguous values"); I suggest you contact the maintainers and have them added.
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the desire to sub-select certain conditions; however, I was able to adapt smci's answer above in the following way:
(1) create indexnumber variable:
df = Dataset(
    data_vars={
        'some_data' : (('date'), some_data),
        'more_data' : (('date'), more_data),
        'indexnumber' : (('date'), arange(0, len(date_arr)))
    },
    coords={
        'date' : date_arr
    }
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby('sumcum')
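Putting the steps together on made-up data (a sketch, not my exact pipeline: the condition here is a placeholder more_data < 2, and the run labels are computed in NumPy to sidestep the off-by-one from diff(dim='date')):
import numpy as np
import pandas as pd
import xarray as xr
# made-up stand-ins for date_arr / some_data / more_data
date_arr = pd.date_range('2021-01-01', periods=10)
ds = xr.Dataset(
    data_vars={
        'some_data': ('date', np.arange(10.0)),
        'more_data': ('date', np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])),
        'indexnumber': ('date', np.arange(len(date_arr))),
    },
    coords={'date': date_arr},
)
# keep only the rows matching the condition
selected = ds.where(ds['more_data'] < 2, drop=True)
# label contiguous runs: a new run starts wherever the original integer index jumps
idx = selected['indexnumber'].values.astype(int)
selected['sumcum'] = ('date', np.concatenate([[0], np.cumsum(np.diff(idx) != 1)]))
# group by the run label
for label, grp in selected.groupby('sumcum'):
    print(label, grp['more_data'].values)
# 0 [1. 1. 1. 1.]
# 1 [1. 1. 1. 1.]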
hope this helps anyone else out there looking to do this.
Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:
set_index
set_partitions
rearrange_by_column
rearrange_by_column_tasks
SimpleShuffleLayer
When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.
is it the case that a single index can never be in two different partitions?
No, it's certainly allowed. Dask will even intend for this to happen. However, because of a bug in set_index, all the data will still end up in one partition.
An extreme example (every row is the same value except one):
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] + [1] * 20})
In [4]: ddf = dd.from_pandas(df, npartitions=10)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
Out[6]: (0, 0, 0, 0, 0, 0, 0, 1)
As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:
In [7]: import dask
In [8]: dask.compute(s.to_delayed()) # easy way to see the partitions separately
Out[8]:
([Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],)
This is because the code deciding which output partition a row belongs to doesn't consider duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", which is why all the data always ends up in the last partition.
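As a rough illustration of that searchsorted behaviour (this is not Dask's shuffle code, just the pandas call the explanation refers to, using the divisions from the example above):
import pandas as pd
# with duplicated divisions, searchsorted(side="right") jumps past every
# duplicate boundary, so all rows with that index value land in the same
# (last possible) output partition
divisions = pd.Series([0, 0, 0, 0, 0, 0, 0, 1])
print(divisions.searchsorted(0, side="right"))  # 7
print(divisions.searchsorted(1, side="right"))  # 8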
I'll update this answer when the issue is fixed.
Is it the case that a single index can never be in two different partitions?
IIUC, the answer for practical purposes is yes.
A dask dataframe will in general have multiple partitions and dask may or may not know about the index values associated with each partition (see Partitions). If dask does know which partition contains which index range, then this will be reflected in df.divisions output (if not, the result of this call will be None).
When running .set_index, dask will compute divisions and it seems that in determining the divisions it will require that divisions are sequential and unique (except for the last element). The relevant code is here.
So two potential follow-up questions: why not allow any non-sequential indexing and, as a special case of that, why not allow duplicate indexes across partitions.
With regards to the first question: for smallish data it might be feasible to think about a design that allows non-sorted indexing, but you can imagine that a general non-sorted indexing won't scale well, since dask will need to store indexes for each partition somehow.
With regards to the second question: it seems that this should be possible, but it also seems that right now it's not implemented correctly. See the snippet below:
# use this to generate 10 indexed partitions
import pandas as pd
for user in range(10):
df = pd.DataFrame({'user_col': [user//3]*100})
df['user'] = df['user_col']
df = df.set_index('user')
df.index.name = 'user_index'
df.to_parquet(f'test_{user}.parquet', index=True)
# now load them into a dask dataframe
import dask.dataframe as dd
ddf = dd.read_parquet('test_*.parquet')
# dask will know about the divisions
print(ddf.known_divisions) # True
# further evidence
print(ddf.divisions) # (0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3)
# this should show three partitions, but will show only one
print(ddf.loc[0].npartitions) # 1
I have just noticed that Dask's documentation for shuffle says
After this operation, rows with the same value of on will be in the same partition.
This seems to confirm my empirical observation.
I am dealing with pandas dataframes that have a uniform labelling scheme for the columns, but an arbitrary number of columns.
For example, consider the following df with columns col_names, a special subset of those columns, and filter criteria corresponding to the special columns. They're linked by index, or with a dictionary, take your pick.
col_names = ['col1','col2','col3','xyz.1','xyz.2','xyz.3']
relevant_col_names = ['xyz.1', 'xyz.2', 'xyz.3']
filter_criteria = [something1, something2, something3]
I want to get the subset of the dataframe where 'xyz.1' == something1 & 'xyz.2' == something2 & 'xyz.3' == something3.
I would normally do this by:
whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3)]
The issue is I need to be able to write that expression for an arbitrary number of 'xyz' columns without having to manually code the above expression.
For example, if the df happens to have 5 relevant columns like this:
['col1','col2','col3','xyz.1','xyz.2','xyz.3','xyz.4','xyz.5']
Then I would need to automatically write something like this:
whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3) & (df['xyz.4'] == something4) & (df['xyz.5'] == something5)]
Is there a way to write a boolean expression of arbitrary length based on a list or a dictionary (or something else), where I'm ANDing everything inside of it?
I don't know how to phrase this question for Google. It seems related to list and dictionary comprehension, expression synthesis, or something else. Even knowing the right way to phrase this question, how to tag it for Stack Overflow, and/or what I should search for would be very helpful. Thanks!
Assuming you have all the "somethings" in a list, you can do this:
import functools
import operator
import pandas as pd
df = pd.DataFrame.from_dict({
"abc": (1, 2, 3),
"xyz.1": (4, 2, 7),
"xyz.2": (8, 5, 5)
})
targets = [2, 5]
filtered_dfs = (df[f"xyz.{index + 1}"] == target for index, target in enumerate(targets))
filtered_df = df.loc[functools.reduce(operator.and_, filtered_dfs)]
print(filtered_df)
We construct each filter on the df in filtered_dfs by doing df[f"xyz.{index + 1}"] == target. For the first iteration through targets, this will be df["xyz.1"] == 2. For the second iteration, it will be df["xyz.2"] == 5.
We then combine all these filters with functools.reduce(operator.and_, filtered_dfs), which is like doing filtered_df_1 & filtered_df_2 & ....
We finally apply the filter to the dataframe through df.loc, which gives the rows that have a 2 in xyz.1 and 5 in xyz.2. Output is:
abc xyz.1 xyz.2
1 2 2 5
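As a side note, an equivalent way to build the same mask without functools (a sketch reusing the df and targets above) is to compare all the relevant columns at once:
# compare each xyz.* column against its target in one shot and keep rows
# where every comparison holds
cols = [f"xyz.{i + 1}" for i in range(len(targets))]
mask = df[cols].eq(targets).all(axis=1)
print(df.loc[mask])  # same single row as above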
I have a column with binary flag values and I'm trying to clean it up if there are mistakes. A mistake would be if a particular group has both 0's and 1's; my rule is that this column can only contain either 0's or 1's within a group. I'm trying to come up with an np.where() clause that tests whether a group's column contains more than one distinct value and whether the first value of that column in the group isn't 1. If the first value of the group isn't 1 and there's a mix of values, flip them all to 0 in that group.
Here's what I'm trying:
df['Flag'] = np.where((df.groupby('CombBitSeq')['Flag'].transform('std') != 0) & (df.groupby('CombBitSeq')['Flag'].nth(0) != 1), 0, df['Flag'])
The error I'm getting is this, and I'm not sure how the lengths of the two combined conditions are off by 1:
ValueError: operands could not be broadcast together with shapes (336661,) () (336660,)
If you want to get the first item per group and translate that throughout the entire dataframe, use groupby + transform + head, instead of nth:
df.groupby('CombBitSeq')['Flag'].transform('head', 1)
Your condition now becomes:
g = df.groupby('CombBitSeq')['Flag'] # let's compute this only once
df['Flag'] = np.where(
g.transform('std').ne(0) & g.transform('head', 1).ne(1), 0, df['Flag']
)
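For what it's worth, here is a self-contained check of the same idea on toy data, using transform('first') as a variant of transform('head', 1) (the data here is made up; 'first' likewise broadcasts each group's first value across all of that group's rows):
import numpy as np
import pandas as pd
# one clean group and one mixed group whose first Flag value isn't 1
df = pd.DataFrame({'CombBitSeq': ['a', 'a', 'b', 'b', 'b'],
                   'Flag':       [1,   1,   0,   1,   0]})
g = df.groupby('CombBitSeq')['Flag']
df['Flag'] = np.where(
    g.transform('std').ne(0) & g.transform('first').ne(1), 0, df['Flag']
)
print(df['Flag'].tolist())  # [1, 1, 0, 0, 0]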
I need to aggregate an array inside my DataFrame.
The DataFrame was created in this way
splitted.map(lambda x: Row(store=int(x[0]), date=parser.parse(x[1]), values=x[2:len(x)]))
values is an array.
I want to do something like this:
mean_by_week = sqlct.sql("SELECT store, SUM(values) from sells group by date, store")
But I get the following error:
AnalysisException: u"cannot resolve 'sum(values)' due to data type mismatch: function sum requires numeric types, not ArrayType(StringType,true); line 0 pos 0"
The array always has the same length within a run, but the length may change between runs; it is around 100.
How can I aggregate without going back to an RDD?
Matching dimensions or not, sum over an array<...> column is not meaningful, hence it is not implemented. You can try to restructure and aggregate element-wise:
from pyspark.sql.functions import col, array, size, sum as sum_
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6])]).toDF(["store", "values"])
n = df.select(size("values")).first()[0]
df.groupBy("store").agg(array(*[
sum_(col("values").getItem(i)) for i in range(n)]).alias("values"))
I'm using pandas groupby on my DataFrame df which has columns type, subtype, and 11 others. I'm then calling an apply with my combine_function (needs a better name) on the groups like:
grouped = df.groupby('type')
reduced = grouped.apply(combine_function)
where combine_function checks whether the group contains any element with the given subtype, say 1, and looks like:
def combine_function(group):
if 1 in group.subtype:
return aggregate_function(group)
else:
return group
The combine_function can then call an aggregate_function, which calculates summary statistics, stores them in the first row, and then reduces the group to that row. It looks like:
def aggregate_function(group):
first = group.first_valid_index()
group.value1[group.index == first] = group.value1.mean()
group.value2[group.index == first] = group.value2.max()
group.value3[group.index == first] = group.value3.std()
group = group[(group.index == first)]
return group
I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that seems to be related to a group that I don't want to aggregate having exactly 2 rows:
ValueError: Shape of passed values is (13,), indices imply (13, 5)
where, for example, the group sizes were:
In [4]: grouped.size()
Out[4]:
type
1 9288
3 7667
5 7604
11 2
dtype: int64
It processed the first three groups fine, and then gave the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)] (so I update but don't aggregate), or call my aggregate_function on all groups, it's fine.
Does anyone know the proper way to be doing this kind of aggregation of some groups but not others?
Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do it manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more normal usage would look like this:
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': np.mean, 'value2': np.max, 'value3': np.std}
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
Note: agg_condition is messy because (1) Python's built-in in operator checks against the index of a Series, not its values, and (2) the result has to be reduced to a scalar by any().
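To make the pattern concrete, here is a small self-contained sketch on made-up data (column names follow the question; the result mixes the aggregated groups, indexed by type, with the untouched rows, as the concat above implies):
import numpy as np
import pandas as pd
# made-up data: types 1 and 5 contain subtype 1 and get aggregated;
# type 3 does not and is kept as-is
df = pd.DataFrame({
    'type':    [1, 1, 1, 3, 3, 5, 5],
    'subtype': [1, 2, 2, 2, 2, 1, 3],
    'value1':  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    'value2':  [10, 20, 30, 40, 50, 60, 70],
    'value3':  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': np.mean, 'value2': np.max, 'value3': np.std}
# aggregate only the groups that contain subtype 1 ...
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
# ... and keep the remaining groups untouched
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
print(result)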