Multi Column DDPLY/R function in Pandas/Python - python

I have the following statement in R
library(plyr)
filteredData <- ddply(data, .(ID1, ID2), businessrule)
I am trying to use Python and Pandas to duplicate the action.
I have tried...
data['judge'] = data.groupby(['ID1','ID2']).apply(lambda x: businessrule(x))
But this provides error...
incompatible index of inserted column with frame index

The error message can be reproduced with
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['ID1', 'ID2', 'val'])
df['new'] = df.groupby(['ID1', 'ID2']).apply(lambda x: x.values.sum())
# TypeError: incompatible index of inserted column with frame index
It is likely that your code raises an error for the same reason this toy example does.
The right-hand side is a Series with a 2-level MultiIndex:
ID1 ID2
0 1 3
3 4 12
6 7 21
9 10 30
dtype: int64
df['new'] = ... tells Pandas to assign this Series to a column in df.
But df has a single-level index:
ID1 ID2 val
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
Because the single-level index is incompatible with the 2-level MultiIndex, the
assignment fails. It is in general never correct to assign the result of
groupby/apply to a columns of df unless the columns or levels you group by
also happen to be valid index keys in the original DataFrame, df.
Instead, assign the Series to a new variable, just like what the R code does:
filteredData = data.groupby(['ID1','ID2']).apply(businessrule)
Note that lambda x: businessrule(x) can be replaced with businessrule.

Related

Stick the columns based on the one columns keeping ids

I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stick the values of all columns for each id and put them in one columns. For example, for id=1 I want to stick 2, 3 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question. Since I want to have the ids also.
Note2: I already use the stack and reset_index, and it can not help.
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5

Slice pandas dataframe using .loc with both index values and multiple column values, then set values

I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 1 5
6 2 6
17 3 7
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError, is that you're calling df.loc with arrays of 2 different sizes.
df.loc[rel_index] has a length of 3 whereas df['col1'].isin(relc1) has a length of 10.
You need the index results to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans.
You can achieve a similar array with the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17])
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically when slicing by index you would use df.iloc and your index would match the 0,1,2...etc. format.
Alternatively, you could first search by value - then assign the resulting dataframe to a new variable df2
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
#Rexovas explains it quite well, this is an alternative, where you can compute the filters on the index before assigning - it is a bit long, involves MultiIndex, but once you get your head around MultiIndex, should be intuitive:
(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)
col1 col2
5 1 5
6 2 6
17 3 7

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as multi-index where repeated indexes are grouped.
The indexing looks like such:
so I would like all the 112335586 index values would be grouped under the same in index.
I have looked at this question Create pandas dataframe by repeating one row with new multiindex but here the value can be index can be pre-defined but this is not possible as my dataframe is far too large to hard code this.
I also looked at at the multi-index documentation but this also pre-defines the value for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)

Python adding column to dataframe causes NaN

I have a series and df
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried create new series with correct idxs, but adding it to df still causes NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying unsuccessfully to match the index of the data frame df to the index of the subset series s.iloc[1:4]. When it can't find the 0 index in the series, it places a NaN value in df at that location. You can get around this by only keeping the values so it doesn't try to match on the index or resetting the index on the subset series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values since the original, unsubset series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the values call, you bypass the index matching which is producing the NaN. By resetting the index in the second option, you make the indices the same.

Transforming a column index in Python pandas

I have the following DataFrame:
index PUBLICO CLASSIFICACAO_PUBLICO
0 19 143643 1
1 34 111879 2
2 31 50382 3
3 9 49204 4
4 32 37541 5
5 4 36095 6
I need convert the index name column to index column.
For example:
index PUBLICO CLASSIFICACAO_PUBLICO
19 143643 1
34 111879 2
31 50382 3
9 49204 4
32 37541 5
4 36095 6
I try use df.set_index('index'), but it didn't work.
The column with the name index previously was the index column the DataFrame, but I used reset_index(); now I need to do the reverse.
The method set_index doesn't work inplace. So that you have to reassign your dataframe, or to pass the option inplace = True:
df = df.set_index('index')
or
df.set_index('index',inplace = True)
see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
You can try it this way:
df.set_index(df['index'], inplace=True)
This will set your index column as the index in your dataframe and your index column will still remain in your dataframe as well. Then, you can just drop that column.
df.drop('index', axis=1, inplace=True)

Categories