Aggregate an array in a DataFrame with a group by - Python

I need to aggregate an array inside my DataFrame.
The DataFrame was created this way:
splitted.map(lambda x: Row(store=int(x[0]), date=parser.parse(x[1]), values=x[2:len(x)]))
values is an array.
I want to do something like this:
mean_by_week = sqlct.sql("SELECT store, SUM(values) from sells group by date, store")
But I get the following error:
AnalysisException: u"cannot resolve 'sum(values)' due to data type mismatch: function sum requires numeric types, not ArrayType(StringType,true); line 0 pos 0"
The array always has the same dimension within a run, but the dimension may change between runs; it is around 100 elements long.
How can I aggregate without going back to an RDD?

Matching dimensions or not, sum over an array<> column is not meaningful and hence not implemented. You can restructure and aggregate instead:
from pyspark.sql.functions import col, array, size, sum as sum_

# Example data
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6])]).toDF(["store", "values"])

# All arrays have the same length, so read it off the first row
n = df.select(size("values")).first()[0]

# Build an array of per-position sums
df.groupBy("store").agg(array(*[
    sum_(col("values").getItem(i)) for i in range(n)
]).alias("values"))
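Since the error message shows the column is ArrayType(StringType), each element may also need a numeric cast before summing. A hedged sketch adapted to the original query, grouping by date and store as in the SQL (sells_df is a stand-in name for the DataFrame behind the sells table):

from pyspark.sql.functions import col, array, size, sum as sum_

n = sells_df.select(size("values")).first()[0]

# Group by date and store, summing the array element-wise after casting to double
mean_by_week = sells_df.groupBy("date", "store").agg(array(*[
    sum_(col("values").getItem(i).cast("double")) for i in range(n)
]).alias("values"))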

Related

Xarray: grouping by contiguous identical values

In Pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] into groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where I would obtain the individual groups via df[key][variable] == some condition. Groups that have the same value of some condition but aren't contiguous become their own groups. If the condition was x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
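For reference, a minimal pandas sketch of the same contiguous-run idea, grouping directly on value changes rather than on index gaps after filtering:

import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# A new group starts wherever the value changes
group_ids = s.ne(s.shift()).cumsum()

[grp.tolist() for _, grp in s.groupby(group_ids)]
# [[1, 1, 1, 1], [2, 2], [1, 1, 1, 1]]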
I am attempting to do the same thing with the xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work there.
What I have been successful doing so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is, how do I cut these groups so that they become [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with the groups [1,1,1,1,1,1,1,1], [2,2]. Is there a way to then group the groups on noncontiguous index values?
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas functionality with something like:
import pandas as pd
import xarray as xr

dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims='x')

# Use diff() to flag where the value changes between neighbouring elements
(dat.diff('x') != 0)

# ...prepend a leading 0 (pedantic syntax required by xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')

# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])

dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum())
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
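To materialize the groups as plain lists, one possible follow-up (a sketch; the label array is named explicitly here rather than relying on a default name):

labels = xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum().rename('group')
[grp.values.tolist() for _, grp in dat.groupby(labels)]
# [[1, 1, 1, 1], [2, 2], [1, 1, 1, 1]]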
The xarray "How do I..." documentation page could use some recipes like this ("group contiguous values"); I suggest you contact the developers and have them added.
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the desire to subselect certain conditions; however, I was able to adapt the answer of smci, above, in the following way:
(1) create an indexnumber variable:
df = xr.Dataset(
    data_vars={
        'some_data': (('date'), some_data),
        'more_data': (('date'), more_data),
        'indexnumber': (('date'), np.arange(0, len(date_arr)))
    },
    coords={
        'date': date_arr
    }
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby(df2['sumcum'])
Hope this helps anyone else out there looking to do this.

Filtering columns in pandas by length

I have a column in a dataframe that contains IATA codes (abbreviations) for airports (such as LAX, SFO, ...). However, if I analyze the column values a little more (column.unique()), I see that there are also 4-digit numbers in it.
How can I filter the column so that my DataFrame will only consist of rows containing a real airport code?
My idea was to filter by length (the airport code length is always 3, while the number length is always 4), but I don't know how to implement this idea.
array(['LFT', 'HYS', 'ELP', 'DVL', 'ISP', 'BUR', 'DAB', 'DAY', 'GRK',
'GJT', 'BMI', 'LBE', 'ASE', 'RKS', 'GUM', 'TVC', 'ALO', 'IMT',
...
10170, 11577, 14709, 14711, 12255, 10165, 10918, 15401, 13970,
15497, 12265, 14254, 10581, 12016, 11503, 13459, 14222, 14025,
'10333', '14222', '14025', '13502', '15497', '12265'], dtype=object)
You can use .str.len() on the column (after casting its values to str) to get each value's length, and pass the resulting boolean mask to df.loc:
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
Another possibility is to use a lambda expression:
df[df['IATA_Codes'].apply(lambda x: len(str(x)) == 3)]['IATA_Codes'].unique()
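For illustration, a small hedged sketch with made-up data showing the length filter in action:

import pandas as pd

df = pd.DataFrame({"IATA_Codes": ["LFT", "HYS", 10170, "14222", "ELP"]})

airports = df.loc[df["IATA_Codes"].astype(str).str.len() == 3]
print(airports["IATA_Codes"].unique())   # ['LFT' 'HYS' 'ELP']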

Pandas transform: assign result to each element of group

I am currently using pandas groupby and transform to calculate something for each group (once) and then assign the result to each row of the group.
If the result of the calculation is a scalar, it can be obtained like this:
df['some_col'] = df.groupby('id')['some_col'].transform(lambda x: process(x))
The problem is that the result of my calculation is a vector, and pandas tries to make an element-wise assignment of the result vector to the group (quote from the pandas docs):
The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
I could hard-code an external function that creates a group-sized list containing copies of the result (I'm currently on Python 3.6, so it's not possible to use an assignment expression inside the lambda):
def return_group(x):
    result = process(x)
    return [result for item in x]
But I think it's possible to solve this in a smarter way. Remember that the calculation has to be done only once per group.
Is it possible to force pd.transform to work with an array-like result of the lambda function the way it does with scalars (just copy it n times)?
I would be grateful for any advice.
P.S. I understand that it's possible to use a combination of apply and join to solve the original requirement, but a solution with transform has higher priority in my case.
Sometimes transform is a pain to work with. If that's not a problem for you, I'd suggest using groupby plus a left pd.merge, as in this example:
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                   "col": [1, 2, 3, 4, 5]})

# This returns a list for every group
grp = df.groupby("id")["col"]\
        .apply(lambda x: list(x))\
        .reset_index(name="out")

# Then merge it back onto the original df
df = pd.merge(df, grp, how="left")
And print(df) returns
   id  col        out
0   1    1     [1, 2]
1   1    2     [1, 2]
2   2    3  [3, 4, 5]
3   2    4  [3, 4, 5]
4   2    5  [3, 4, 5]
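An alternative sketch that also computes the list once per group and then assigns it back by key (map aligns on the group label, so this yields the same out column without a merge):

per_group = df.groupby("id")["col"].apply(list)   # one list per id
df["out"] = df["id"].map(per_group)               # broadcast back onto every row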

Concat gives Shape of passed values is X, indices imply Y

I have a dataframe (df); my goal is to add a new column ("grad") that corresponds to the gradient between the points sharing the same index.
First, I didn't find an easy way to do it using only pandas, so for now I use numpy + pandas. I have written a function to get the gradient for each row by group; it works, but it is not pretty and a bit wonky.
Second, I want to add the resulting pandas Series of numpy arrays to the df, but I don't know how to do so. I tried to stack them so I get a series of size (0, 9) (grouped_2), but when I use concat I get the following message: "ValueError: Shape of passed values is (48, 3), indices imply (16, 3)". According to a previous question I think having duplicate index values is the problem, but I can't modify the index of my first df.
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 1, 1, 1, 1, 2, 2, 2, 2],
                  data={'value': [1, 5, 8, 10, 12, 1, 2, 8, 2],
                        'diff_day': [-1, 0, 2, 3, 4, -2, -1, 0, 10]})

def grad(gr):
    val = gr['value']
    dif = gr['diff_day']
    return np.gradient(val, dif)

grouped_1 = df.groupby(level=0).apply(grad)
grouped_2 = pd.DataFrame(grouped_1.values.tolist(), index=grouped_1.index).stack().reset_index(drop=True)
result = pd.concat([df, grouped_2], axis=1)
My expected result was the following dataframe:
pd.DataFrame(index=[1, 1, 1, 1, 1, 2, 2, 2, 2],
             data={'value': [1, 5, 8, 10, 12, 1, 2, 8, 2],
                   'diff_day': [-1, 0, 2, 3, 4, -2, -1, 0, 10],
                   'grad': [4, 3.16, 1.83, 2, 2, 1, 3.5, 5.4, -0.6]})
Here's a simple way:
df['grad'] = np.gradient(df['value'], df['diff_day'])
To make your solution work, you can do:
result = pd.concat([df.reset_index(drop=True), grouped_2.reset_index(drop=True)], axis=1)
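If the gradient should be computed within each index group separately (as grouped_1 above does), a hedged per-group sketch along the same lines, assuming the rows are already ordered by the group index as in the example:

# Compute the gradient group by group and assign it back positionally,
# so duplicate index labels are not a problem
df['grad'] = np.concatenate([
    np.gradient(g['value'].to_numpy(), g['diff_day'].to_numpy())
    for _, g in df.groupby(level=0)
])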

Function does not reduce in Pandas

I have a problem with Pandas aggregate.
I have 4 columns with "int" type and one with a string.
I want the int columns to be summed and the string column reduced to its unique value.
I have used next function:
df = df.groupby(['Time', 'Id', 'Object', 'Alias', 'Type'], as_index=False).agg(lambda x: x.sum() if x.dtype == 'int' else x.unique())
But I'm getting the following error:
ValueError: Function does not reduce
I have found some similar questions, but each of them uses only one operation in the aggregate function, so I'm not sure how to apply that advice to my case.
unique is not an aggregating function. Aggregating functions reduce the dimension of the data, e.g. sum reduces a list of numbers to a single number, whereas unique reduces a list to a potentially smaller list.
my_series = pd.Series([1, 2, 1, 3]) # 1-D
my_series.sum() # 7, a scalar
my_series.unique() # [1, 2, 3] still 1-D
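One way to make the aggregation actually reduce is to give each column its own aggregating function. A sketch with hypothetical column names; the string column is reduced with "first", which is a true aggregation, on the assumption that it is constant within each group:

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 2],
    "a": [1, 2, 3],            # int columns to sum
    "b": [4, 5, 6],
    "label": ["x", "x", "y"],  # string column, constant within each group
})

out = df.groupby("Id", as_index=False).agg({"a": "sum", "b": "sum", "label": "first"})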
