GroupBy object in which entries can belong to several groups - python

Judging by the methods it supports, nothing seems to prevent labels of the original DataFrame/Series from occurring multiple times in a derived GroupBy object. Is it actually possible to, for example, construct a GroupBy object g from an iterable column like a in
>>> x
        a  b
0  [0, 1]  1
1  [1, 2]  2
such that g will represent a GroupBy object with one entry for each of the entries in a's values? That is, I get results like
>>> x.iterable_groupby('a').size()
a
0    1
1    2
2    1
>>> x.iterable_groupby('a').mean()
     b
0  1.0
1  1.5
2  2.0

You should reshape your DataFrame into a tidy dataset. The reshaping part is asked about frequently (1, 2, 3).
In a tidy dataset, each row should represent a single record. For that, you can create a 'grouper' column like this:
x['a'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('grouper')
Out:
   grouper
0        0
0        1
1        1
1        2
If you join this with the original DataFrame, it can be grouped as you like:
x['a'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('grouper').join(x).groupby('grouper').mean()
Out:
           b
grouper
0        1.0
1        1.5
2        2.0
The reshaping part is not very efficient, but as far as I know pandas does not offer a better method for it yet.
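If you are on a newer pandas version (0.25 or later), DataFrame.explode offers a more direct way to get one row per list element; a minimal sketch, assuming the same x as above:
import pandas as pd

x = pd.DataFrame({'a': [[0, 1], [1, 2]], 'b': [1, 2]})

# explode gives one row per list element, repeating the other columns
exploded = x.explode('a')
print(exploded.groupby('a').size())
print(exploded.groupby('a')['b'].mean())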

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DataFrame without any 'keys', just plain numbers on top. A smaller version, just to demonstrate the problem here, would be built from this input:
M = pd.DataFrame(np.random.rand(4, 5))
What I want to accomplish is to use another given DataFrame as a reference, structured like this
N = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 3, 4]})
...to extract values from the large DataFrame, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DataFrame, so that the expected output would look like this:
Large DF
          0         1         2         3         4
0  0.766275  0.910825  0.378541  0.775416  0.639854
1  0.505877  0.992284  0.720390  0.181921  0.501062
2  0.439243  0.416820  0.285719  0.100537  0.429576
3  0.243298  0.560427  0.162422  0.631224  0.033927
Small DF
   A  B
0  2  2
1  2  3
2  2  4
Expected Output:
   A  B  extracted values
0  2  2          0.285719
1  2  3          0.100537
2  2  4          0.429576
So far I've tried different versions of something like this
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DataFrames into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
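If the DataFrames are large, a vectorized alternative (my sketch, not part of the original answer) is to index the underlying numpy array with the 'A' and 'B' columns directly:
# numpy fancy indexing: pick M[row, col] for every (A, B) pair at once
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]
On older pandas versions, M.values can be used instead of M.to_numpy().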

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (without collapsing track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that refers to some function. transform will take either a function or a string that it maps to a known function; 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for passing numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')
0    3
1    3
2    3
3    3
4    1
5    1
6    1
7    1
Name: id, dtype: int64
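To get the new column the question asks for, assign the transform result back; a minimal sketch using the same demo df:
# one value per row: the number of unique ids in that row's track/type group
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')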

Python pandas groupby category and integer variables results in pandas last and tail difference

UPDATE:
Please download my full dataset here.
my datatype is:
>>> df.dtypes
increment       int64
spread        float64
SYM_ROOT     category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as to why my SYM_ROOT results in the gaps in increment being filled with NA, unless a groupby on a category and an integer column produces a balanced panel by default.
I noticed that the behaviour of df.groupby().last() is different from that of df.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT values there are gaps in between; for example, 4 could be missing.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me like the last() statement creates balanced time-series data and fills in the gaps with NaN, while the tail(1) statement doesn't. Is that correct?
Update:
Your column increment is a category.
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
       C
A B
1 1  1.0
  2  NaN
  3  NaN
2 1  NaN
  2  1.0
  3  1.0
When you use tail it will not make up the missing levels, since tail operates on the DataFrame rows rather than on single columns.
df.groupby(['A','B']).tail(1)
Out[593]:
   A  B  C
1  1  1  1
2  2  2  1
3  2  3  1
After changing it back using astype:
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
     C
A B
1 1  1
2 2  1
  3  1
It is actually a known issue on GitHub; the problem is mainly that groupby on categorical columns includes all category levels, not just the ones that actually occur.
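Not part of the original answer, but on newer pandas versions you can also pass observed=True to groupby to keep only the category combinations that actually occur; a minimal sketch with the toy df while B is still a category:
# with observed=True, unobserved category levels are not filled in with NaN
df.groupby(['A', 'B'], observed=True).last()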

Map each value in pandas series/dataframe to n>1 dimensions

I have a pandas series, and a function that takes a value in the series and returns a dataframe. Is there a way to apply the function to the series and collate the results in a natural way?
What I am really trying to do is to use pandas series/multiindex to keep track of the results in each step of my data analysis pipeline, where the multiindex holds the parameters used to get the values. For example, the series (s below) is the result of step 0 in my data analysis pipeline. In step 1, I want to try x more dimensions (2 below, thus the dataframe) and collate the results into another series.
Can we do better than the approach below, where the stack() calls seem a bit excessive? Would the xarray library be a good fit for my use case?
In [112]: s
Out[112]:
a    0
b    1
c    2
dtype: int64
In [113]: d = s.apply(lambda x: pd.DataFrame([[x,x*2],[x*3,x*4]]).stack()).stack().stack()
In [114]: d
Out[114]:
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    3
   1  0    2
      1    4
c  0  0    2
      1    6
   1  0    4
      1    8
dtype: int64
This should give you a Dataset of 2D arrays and align them for you. You may want to set the dimensions beforehand if you want them to be named a certain way or be a certain size.
xr.Dataset({k: func(v) for k, v in series.items()})
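A minimal runnable sketch of that idea (the names s and func follow the question; func here is just the illustrative 2x2 mapping):
import pandas as pd
import xarray as xr

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

def func(x):
    # illustrative step-1 function: one 2x2 DataFrame per input value
    return pd.DataFrame([[x, x * 2], [x * 3, x * 4]])

# one named 2D DataArray per key; xarray aligns them on their shared dimensions
ds = xr.Dataset({k: xr.DataArray(func(v)) for k, v in s.items()})
print(ds)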

Slice column in pandas DataFrame and average results

If I have a pandas DataFrame such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm fairly new to Python and coding, so I'm not sure whether this is possible.
Edit: I should also mention this is not for a class or anything; it's just something I'm doing on my own, and it will be on a very large dataset. I'm just using this as an example. Also, I would want each a and each b row to have its own value for the last-two average, so the new column has the same length as the others. So for the third line it would be the average of 2 and whatever the next a would be in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
  label  value
0     a    3.5
1     b    5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    # mean of the last two 'value' entries in this group
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
  label  tail_mean
0     a        3.5
1     b        5.0
>> df
  label  value  tail_mean
0     a      1        3.5
1     b      2        5.0
2     a      5        3.5
3     b      6        5.0
4     a      2        3.5
5     b      4        5.0
Now you have a DataFrame with just the results, if you need them, plus the column merged back into the main DataFrame. Someone else posted a more succinct way to get the results DataFrame; there's probably no reason to do it the longer way shown here unless you also need to perform more operations that you could do inside the same loop.
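For completeness (not from either original answer), the per-row column can also be added in one line with transform; a sketch assuming the same df as above:
# each row gets the mean of the last two values in its label group
df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())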
