I have a problem with Pandas aggregate.
I have 4 columns with int dtype and one with string dtype.
I want to sum the int columns and take the unique values of the string column.
I have used the following function:
df = df.groupby(['Time', 'Id', 'Object', 'Alias', 'Type'], as_index=False).agg(lambda x: x.sum() if x.dtype == 'int' else x.unique())
But I'm getting the following error:
ValueError: Function does not reduce
I have found some similar questions, but each of them uses only one operation in the aggregate function, so I'm not sure how to apply that advice in my case.
unique is not an aggregating function. Aggregating functions reduce the dimension of the data, e.g. sum reduces a list of numbers to a single number, whereas unique reduces a list to a potentially smaller list.
my_series = pd.Series([1, 2, 1, 3]) # 1-D
my_series.sum() # 7, a scalar
my_series.unique() # [1, 2, 3] still 1-D
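One way to keep a single agg call is to wrap the unique values in something that reduces to one object per group, e.g. a tuple. A minimal sketch with made-up column names (Value and Label stand in for your int and string columns):
import pandas as pd
df = pd.DataFrame({"Id": [1, 1, 2],
                   "Value": [10, 20, 30],      # int column -> summed
                   "Label": ["a", "b", "b"]})  # string column -> unique values
# tuple(...) turns the unique values into a single object, so the function reduces
out = df.groupby("Id", as_index=False).agg(
    lambda x: x.sum() if pd.api.types.is_numeric_dtype(x) else tuple(x.unique()))
print(out)   # Id=1 -> Value 30, Label ('a', 'b'); Id=2 -> Value 30, Label ('b',)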
I have a column in a DataFrame that contains IATA_Codes (abbreviations) for airports (such as LAX, SFO, ...). However, if I inspect the column values a little more (column.unique()), I see that there are also 4-digit numbers in it.
How can I filter the column so that my DataFrame only consists of rows containing a real airport code?
My idea was to filter by length (the airport code length is always 3, while the number length is always 4), but I don't know how to implement this.
array(['LFT', 'HYS', 'ELP', 'DVL', 'ISP', 'BUR', 'DAB', 'DAY', 'GRK',
'GJT', 'BMI', 'LBE', 'ASE', 'RKS', 'GUM', 'TVC', 'ALO', 'IMT',
...
10170, 11577, 14709, 14711, 12255, 10165, 10918, 15401, 13970,
15497, 12265, 14254, 10581, 12016, 11503, 13459, 14222, 14025,
'10333', '14222', '14025', '13502', '15497', '12265'], dtype=object)
You can use .str.len to get the length of each value (after casting to str, since some entries are integers) and pass the resulting boolean mask to df.loc:
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
Another possibility is to use a lambda expression:
df[df['IATA_Codes'].apply(lambda x : len(str(x))==3)]['IATA_Codes'].unique()
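For a quick check of either approach, a toy column mixing 3-letter codes with 4-digit IDs (values invented for illustration) behaves like this:
import pandas as pd
# toy data: 3-letter airport codes mixed with 4-digit numeric IDs
df = pd.DataFrame({"IATA_Codes": ["LFT", 10170, "HYS", "14222", 11577]})
# cast every value to str so ints and strings are measured the same way
mask = df["IATA_Codes"].astype(str).str.len() == 3
print(df[mask]["IATA_Codes"].unique())   # ['LFT' 'HYS']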
I am currently using pandas groupby and transform to calculate something for each group (once) and then assign the result to each row of the group.
If the result of calculations is scalar it can be obtained like:
df['some_col'] = df.groupby('id')['some_col'].transform(lambda x:process(x))
The problem is that the result of my calculation is a vector, and pandas tries to make an element-wise assignment of the result vector to the group (quote from the pandas docs):
The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
I could hard-code an external function that creates a group-sized list containing copies of the result (I'm currently on Python 3.6, so it's not possible to use assignment inside a lambda):
def return_group(x):
    result = process(x)
    return [result for item in x]
But I think it's possible to solve this in a smarter way. Remember that the calculation must be done only once per group.
Is it possible to force transform to work with an array-like result of the lambda function the same way as with scalars (just copy it n times)?
I would be grateful for any advice.
P.S. I understand that it's possible to use a combination of apply and join to solve the original requirement, but a solution with transform has higher priority in my case.
Sometimes transform is a pain to work with. If that's not a problem for you, I'd suggest using groupby + a left pd.merge, as in this example:
import pandas as pd
df = pd.DataFrame({"id":[1,1,2,2,2],
"col":[1,2,3,4,5]})
# this return a list for every group
grp = df.groupby("id")["col"]\
.apply(lambda x: list(x))\
.reset_index(name="out")
# Then you merge it to the original df
df = pd.merge(df, grp, how="left")
And print(df) returns
id col out
0 1 1 [1, 2]
1 1 2 [1, 2]
2 2 3 [3, 4, 5]
3 2 4 [3, 4, 5]
4 2 5 [3, 4, 5]
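If you would rather stay closer to a transform-style pattern, another option (a sketch, using list as a stand-in for your process function) is to compute the result once per group with apply and then map it back onto the id column:
# one result per group, computed once
per_group = df.groupby("id")["col"].apply(list)
# broadcast the group result to every row of that group
df["out"] = df["id"].map(per_group)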
I have the following 56 columns, filled with random numbers:
What I want is to add an extra column with the autocorrelation of columns 1-56 at a certain lag. So if the lag is 1, the outcome is 0.42; when the lag is 2, 0.06; and so on.
This is the code I use:
def autocorr(x, t):
    return np.corrcoef(np.array([x[0:len(x)-t], x[t:len(x)]]))
where, I presume, x is the dataframe and t is the lag.
However, when I try to add a column with the autocorrelation with lag = 1:
df["output"] = autocorr(df, 1)
I get the following error:
ValueError: cannot copy sequence with size 0 to array axis with dimension 56
What am I doing wrong, or is there an easier way to calculate the autocorrelation with a defined lag?
Appreciate the help
Steven
Update: I keep trying to adjust this, but I can't figure it out. Can anybody help?
I tried the following code:
def autocorr(x, t):
    return np.corrcoef(np.array([x[:len(x)-t], x[t:len(x)]]))
But this gives me the error:
File "", line 1
autocorr(df(axis=1,1))
^
SyntaxError: positional argument follows keyword argument
Looks to me like you mismatched the parentheses in your function call. If anything autocorr(df(axis=1, 1)) should be autocorr(df(axis=1), 1), but pd.DataFrame objects are not callable.
Does the pd.Series.autocorr(lag=1) function not achieve what you want?
import pandas as pd, numpy as np
series = pd.Series(np.random.randint(100, high=200, size=56))
print(series.autocorr(lag=1))
results in values similar to what you expected.
Update: Concerning your original problem:
Since you have one row, len(x) is 1 and x[0:len(x)-1] is an empty array!
Plus, in this case np.corrcoef returns a 2x2 matrix of the form [[1, C], [C, 1]]. Your autocorr function works when called this way:
df1 = df.copy(deep=True)
df1["output"] = autocorr(df.T[0], 1)[0, 1]
I would not append the result to the df as this would change the outcome of subsequent calculations.
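For completeness, a minimal sketch (with random integers standing in for the 56 columns from the question) of getting the lagged autocorrelation of that single row via pd.Series.autocorr:
import numpy as np
import pandas as pd
# stand-in for the 1 x 56 DataFrame of random numbers
df = pd.DataFrame(np.random.randint(100, 200, size=(1, 56)))
row = df.iloc[0]                 # the single row as a Series of 56 values
for lag in (1, 2):
    print(lag, row.autocorr(lag=lag))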
In Pandas, I can use .apply to apply functions to two columns. For example,
df = pd.DataFrame({'A':['a', 'a', 'a', 'b'], 'B':[3, 3, 2, 5], 'C':[2, 2, 2, 8]})
formula = lambda x: (x.B + x.C)**2
df.apply(formula, axis=1)
But notice that the results on the first two rows are the same, since all the inputs are the same. In a large dataset with complicated operations, these repeated calculations are likely to slow down my program. Is there a way I can program this to save time on the repeated calculations?
You can use a technique called memoization. For functions which accept hashable arguments, you can use the built-in functools.lru_cache.
from functools import lru_cache
@lru_cache(maxsize=None)
def cached_function(B, C):
    return (B + C)**2
def formula(x):
    return cached_function(x.B, x.C)
Notice that I had to pass the values through to the cached function for lru_cache to work correctly because Series objects aren't hashable.
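Used against the example frame from the question, that might look like:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b'], 'B': [3, 3, 2, 5], 'C': [2, 2, 2, 8]})
print(df.apply(formula, axis=1))       # 25, 25, 16, 169
print(cached_function.cache_info())    # one cache hit for the duplicated (3, 2) row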
You could use np.unique to create a copy of the dataframe consisting of only the unique rows, then do the calculation on those, and construct the full results.
For example:
import numpy as np
# convert to records for use with numpy
rec = df.to_records(index=False)
arr, ind = np.unique(rec, return_inverse=True)
# find dataframe of unique rows
df_small = pd.DataFrame(arr)
# Apply the formula & construct the full result
df_small.apply(formula, axis=1).iloc[ind].reset_index(drop=True)
Even faster than using apply here would be to use broadcasting: for example, simply compute
(df.B + df.C) ** 2
If this is still too slow, you can use this method on the de-duplicated dataframe, as above.
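Combining the two ideas, a rough sketch of broadcasting over the de-duplicated rows and expanding back to the original shape:
# broadcast on the unique rows only, then map the results back via the inverse index
rec = df.to_records(index=False)
arr, ind = np.unique(rec, return_inverse=True)
df_small = pd.DataFrame(arr)
result = ((df_small.B + df_small.C) ** 2).iloc[ind].reset_index(drop=True)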
I need to aggregate an array inside my DataFrame.
The DataFrame was created in this way
splitted.map(lambda x: Row(store=int(x[0]), date=parser.parse(x[1]), values=x[2:len(x)]))
values is an array.
I want to do something like this:
mean_by_week = sqlct.sql("SELECT store, SUM(values) from sells group by date, store")
But I get the following error:
AnalysisException: u"cannot resolve 'sum(values)' due to data type mismatch: function sum requires numeric types, not ArrayType(StringType,true); line 0 pos 0"
The array always has the same dimension within a run, but the dimension may change between runs; it is around 100 in length.
How can I aggregate without going down to an RDD?
Matching dimensions or not, sum for array<...> is not meaningful and hence not implemented. You can try to restructure and aggregate:
from pyspark.sql.functions import col, array, size, sum as sum_
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6])]).toDF(["store", "values"])
# number of array elements (assumed to be the same for every row)
n = df.select(size("values")).first()[0]
df.groupBy("store").agg(array(*[
    sum_(col("values").getItem(i)) for i in range(n)]).alias("values"))