Finding common subindex values between two DataFrames - python

How would you find the common values of the subindices (column B in this example) between two dataframes, where index A = 'a'?
import pandas as pd

df = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
                   'Mi': [0, 0, 0, 0, 0, 0], 'A': ['a', 'a', 'a', 'a', 'b', 'b'],
                   'B': [1, 2, 2, 3, 4, 5]})
df = df.set_index(['A', 'B'])
Do Ri Mi
A B
a 1 0 0 0
2 0 0 0
2 0 0 0
3 0 0 0
b 4 0 0 0
5 0 0 0
df2 = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
                    'Mi': [0, 0, 0, 0, 0, 0], 'A': ['a', 'a', 'a', 'a', 'b', 'b'],
                    'B': [3, 3, 4, 6, 7, 8]})
df2 = df2.set_index(['A', 'B'])
Do Ri Mi
A B
a 3 0 0 0
3 0 0 0
4 0 0 0
6 0 0 0
b 7 0 0 0
8 0 0 0
Currently I have:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
df_a_b = df_a.index.levels[1].tolist()
df2_a_n = df2_a.index.levels[1].tolist()
set(df_a_b) & set(df2_a_n)
But this seems to take values from both where index A = 'a' and where A = 'b'.
I noticed that using loc['a'] or loc[['a']] results in different dfs. I'm not sure if this is related, but what is the significance of [['a']] vs ['a']?

For a single overlap, use set intersection after subsetting each DataFrame:
set(df.loc['a'].index) & set(df2.loc['a'].index)
#{3}
merge also works, but it is overkill for a single intersection. On the other hand, if you want to do all the intersections at once, use .merge + groupby:
#Single
df.loc['a'].merge(df2.loc['a'], left_index=True, right_index=True).index.unique()
#Int64Index([3], dtype='int64', name='B')
#All
df.merge(df2, on=['A', 'B']).reset_index().groupby('A').B.unique()
#A
#a [3]
#Name: B, dtype: object
To explain your error: you were finding the intersection of the levels, but what you want is the intersection of the level values. After subsetting with .loc[['a']], index.levels still contains every B value from the original index (unused levels are not pruned), which is why values from A = 'b' leaked into your intersection. Your current code should be changed to:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
# Get the level values actually present, not the (unpruned) levels
df_a_b = df_a.index.get_level_values(1).tolist()
df2_a_n = df2_a.index.get_level_values(1).tolist()
set(df_a_b) & set(df2_a_n)
#{3}
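To see the difference concretely, here is a small sketch using the df defined above (the printed values assume df was built and indexed exactly as shown):
sub = df.loc[['a']]
# .levels is not pruned by the subset, so it still lists every B value
print(sub.index.levels[1].tolist())              # [1, 2, 3, 4, 5]
# .get_level_values reflects only the rows actually selected
print(sub.index.get_level_values('B').tolist())  # [1, 2, 2, 3]
# On the loc question: df.loc['a'] drops the 'A' level, leaving a plain index of B values,
# while df.loc[['a']] keeps the full ('A', 'B') MultiIndex
print(df.loc['a'].index.tolist())    # [1, 2, 2, 3]
print(df.loc[['a']].index.tolist())  # [('a', 1), ('a', 2), ('a', 2), ('a', 3)]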

Related

Getting binary labels from a dataframe and a list of labels

Suppose I have the following list of labels,
labs = ['G1','G2','G3','G4','G5','G6','G7']
and also suppose that I have the following df:
group entity_label
0 0 G1
1 0 G2
3 1 G5
4 1 G1
5 2 G1
6 2 G2
7 2 G3
To produce the above df you can use:
df_test = pd.DataFrame({'group': [0, 0, 0, 1, 1, 2, 2, 2, 2],
                        'entity_label': ['G1', 'G2', 'G2', 'G5', 'G1', 'G1', 'G2', 'G3', 'G3']})
df_test = df_test.drop_duplicates(subset=['group', 'entity_label'], keep='first')
For each group I want to use a mapping over the list of labels and make a new dataframe with binary labels:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
Namely, for group 0 we have G1 and G2, hence the 1s in the table above, and so on. I wonder how one can do this?
One option, based on crosstab:
labs = ['G1','G2','G3','G4','G5','G6','G7']
(pd.crosstab(df_test['group'], df_test['entity_label'])
   .clip(upper=1)
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Variant, with get_dummies and groupby.max:
(pd.get_dummies(df_test['entity_label'])
   .groupby(df_test['group']).max()
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Output:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
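If the lists are going to feed a model, it may be convenient to stack them into a plain 2-D array afterwards. A small usage sketch (the name out is introduced here just to hold the result of either variant above):
import numpy as np
out = (pd.crosstab(df_test['group'], df_test['entity_label'])
         .clip(upper=1)
         .reindex(columns=labs, fill_value=0)
         .agg(list, axis=1)
         .reset_index(name='entity_label_binary'))
# Stack the per-group lists into an (n_groups, n_labels) matrix
X = np.vstack(out['entity_label_binary'].to_numpy())
print(X.shape)  # (3, 7)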

Count unique records based on conditional result of aggregate functions on multiple columns

My data looks like this:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
## df output
ID group attempts successes score
0 1 A 0 1 None
1 1 A 1 0 5
2 1 A 1 0 5
3 1 A 1 0 4
4 2 A 1 0 5
5 2 A 1 0 4
6 3 A 1 0 5
7 3 A 0 1 None
8 3 A 1 0 1
9 4 B 1 0 5
10 4 B 1 0 0
11 4 B 1 0 1
12 4 B 0 1 None
13 5 B 0 1 None
14 5 B 1 0 1
15 5 B 0 1 None
I'm trying to group by two columns (group, score) and count the number of unique IDs, after first identifying which (group, ID) pairs have at least 1 success across all score values. In other words, I only want to count an ID (once, as unique) in the aggregation if it has at least one associated success. I also only want to count each unique ID once per (group, ID) pair, regardless of how many attempts it contains (i.e. if there are 5 success counts, I only want to include 1).
The successes and attempts columns are binary (only 1 or 0). For example, for ID = 1, group = A, there is at least 1 success. Therefore, when counting the number of unique IDs per (group, score), I will include that ID.
I'd like the final output to look something like this so that I can calculate the ratio of unique successes to unique attempts for each (group, score) combination.
group score successes_count attempts_counts ratio
A 5 2 3 0.67
4 1 2 0.50
1 1 1 1.0
0 0 0 inf
B 5 1 1 1.0
4 0 0 inf
1 2 2 1.0
0 1 1 1.0
So far I've been able to run a pivot table to get sums per (group, ID) and identify those IDs that have at least 1 success. However, I'm not sure of the best way to use this to reach my desired final state.
p = pd.pivot_table(data=df_new,
values=['ID'],
index=['group', 'ID'],
columns=['successes', 'attempts'],
aggfunc={'ID': 'count'})
# p output
ID
successes 0 1
attempts 1 0
group ID
A 1 3.0 1.0
2 2.0 NaN
3 2.0 1.0
B 4 3.0 1.0
5 1.0 2.0
Let's try something like:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
# Groups With At least 1 Success
m = df.groupby('group')['successes'].transform('max').astype(bool)
# Filter out (copy to avoid a SettingWithCopyWarning on the assignments below)
df = df[m].copy()
# Replace 0 successes with NaNs
df['successes'] = df['successes'].replace(0, np.nan)
# FFill BFill each group so that any success will fill the group
df['successes'] = df.groupby(['ID', 'group'])['successes'] \
.apply(lambda s: s.ffill().bfill())
# Pivot then stack to make sure each group has all score values
# Sort and reset index
# Rename Columns
# fix types
p = df.drop_duplicates() \
.pivot_table(index='group',
columns='score',
values=['attempts', 'successes'],
aggfunc='sum',
fill_value=0) \
.stack() \
.sort_values(['group', 'score'], ascending=[True, False]) \
.reset_index() \
.rename(columns={'attempts': 'attempts_counts',
'successes': 'successes_count'}) \
.convert_dtypes()
# Calculate Ratio
p['ratio'] = p['successes_count'] / p['attempts_counts']
print(p)
Output:
group score attempts_counts successes_count ratio
0 A 5 3 2 0.666667
1 A 4 2 1 0.5
2 A 1 1 1 1.0
3 A 0 0 0 NaN
4 B 5 1 1 1.0
5 B 4 0 0 NaN
6 B 1 2 2 1.0
7 B 0 1 1 1.0
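Note the ratio column: plain division gives NaN for the 0/0 rows, while the desired output shows inf. If inf is really what is wanted there, one small tweak (not part of the original answer) is to replace those NaNs afterwards:
# Replace the NaN ratios that come from 0/0 with inf, to match the desired output
p['ratio'] = p['ratio'].fillna(np.inf)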

Can we create a matrix with neighbouring columns?

Suppose that I have the matrix:
df=
a b c d e
z 1 1 1 0 1
y 0 0 0 1 0
x 1 0 1 0 0
w 0 1 0 0 1
v 0 0 0 1 0
Now we want to find all neighbouring columns and put that in a matrix. For my example this gives the result:
result =
a b c d e
a 1 1 1 0 1
b 1 1 1 0 1
c 1 1 1 0 1
d 0 0 0 0 0
e 1 1 1 0 0
The logic is as follows: since columns a and b both have a 1 in row z, there should be a 1 in cells (a, b) and (b, a). One mutual row is sufficient.
Now I am doing it with a for-loop but it is too slow.
df = pd.DataFrame( [[1, 1, 1, 0, 1],
[0, 0, 0, 1, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 0, 1],
[0, 0, 0, 1, 0]],
index=['a', 'b', 'c', 'd', 'e' ],
columns=['a', 'b', 'c', 'd', 'e' ],
)
result = pd.DataFrame(False, index=df.columns, columns=df.columns)
for row in result.index:
    my_rows = df.index[df.loc[:, row].astype(bool)]
    my_neighbours = df.columns[df.loc[my_rows, :].any(axis=0)]
    result.loc[row, my_neighbours] = True
Does anyone have an alternative?
Perhaps you want something like this:
import pandas as pd
df = pd.DataFrame( [[1, 1, 1, 0, 1],
[0, 0, 0, 1, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 0, 1],
[0, 0, 0, 1, 0]],
index=['a', 'b', 'c', 'd', 'e' ],
columns=['a', 'b', 'c', 'd', 'e' ],
)
res = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e'])
for col in df.columns:
    res[col] = df.apply(lambda row: int(row[col] == 1 and row[row.name] == 1), axis=1)
But as mentioned in comments, notice that you have an error in your example output.
Also I had to guess what you want in some default cases.
res
Out[176]:
a b c d e
a 1 1 1 0 1
b 0 0 0 0 0
c 1 0 1 0 0
d 0 0 0 0 0
e 0 0 0 0 0
It is also possible to achieve the result with the one-liner below:
res2 = df.apply( lambda row: pd.Series([int(row[col]==1 and row[row.name]==1) for col in row.index], index=row.index), axis=1, result_type='expand' )
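If the goal is exactly the co-occurrence logic described in the question (a 1 wherever two columns share at least one row in which both are 1), a vectorized sketch using a matrix product may also help. Note that the diagonal comes out as 1 for every column that has at least one 1, which differs from the question's example output (already flagged above as containing an error):
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 0, 1],
                   [0, 0, 0, 1, 0],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 0, 1, 0]],
                  index=['z', 'y', 'x', 'w', 'v'],
                  columns=['a', 'b', 'c', 'd', 'e'])
# df.T @ df counts, for each pair of columns, the rows where both have a 1;
# turning any positive count into 1 gives the neighbour matrix
adjacency = (df.T @ df).gt(0).astype(int)
print(adjacency)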

Cumulative count reset on condition

I have a dataframe similar to this:
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
'col2': [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
'col3': [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
'desired': [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]})
I want to apply a rolling sum on col3 which resets when either of col1 or col2 changes, or when the previous value of col3 was zero.
Note that the count is offset by 1 cell. This means the desired value for a new (col1, col2) combination will always be zero.
The code below demonstrates the required logic. However, it takes nearly 4 minutes on the dataset below.
des = []
count = 0
for i in range(1, len(df)):
    des.append(count)
    if (df.iloc[i-1].col1 == df.iloc[i].col1) & \
       (df.iloc[i-1].col2 == df.iloc[i].col2) & \
       (df.iloc[i-1].col3 == 1):
        count += 1
    else:
        count = 0
des.append(count)
df['desired'] = des
A bigger dataset to test on: https://www.dropbox.com/s/hbafcq6hdkh4r9r/test.csv?dl=0
Use groupby with shift first and then count consecutive 1s:
a = df.groupby(['col1','col2'])['col3'].shift().fillna(0).eq(1)
b = a.cumsum()
df['desired'] = b-b.where(~a).ffill().fillna(0).astype(int)
print (df.head(20))
col1 col2 col3 desired
0 100055 1 1 0
1 100055 1 0 1
2 100055 1 0 0
3 100055 1 0 0
4 100055 1 0 0
5 100055 1 0 0
6 100055 1 0 0
7 100055 1 0 0
8 100055 1 0 0
9 100055 1 0 0
10 100055 1 1 0
11 100055 1 1 1
12 100055 1 0 2
13 100055 1 1 0
14 100055 1 1 1
15 100055 1 0 2
16 100055 1 0 0
17 100055 1 1 0
18 100055 1 0 1
19 100055 1 1 0
Since the OP wants a rolling count, that seems to indicate that they want to handle groups that may repeat at different places in the df, without lumping them all together (like a groupby would do).
From the problem statement, it sounds more like a cumulative operation with resets. That is a class of problems that is relatively easy to vectorize, and it doesn't involve groupby at all.
Here is a way using numpy:
def cum_count(df, groupcols, zerocol):
    a = df[groupcols].values
    c = df[zerocol].values
    # find indices to reset the cumulative count
    r0 = np.concatenate(([True], np.any(a[1:] != a[:-1], axis=1)))
    r1 = np.concatenate(([True], c[:-1] == 0))
    reset = np.nonzero(r0 + r1)[0]
    # offset: values to subtract at reset indices
    off = np.concatenate(([0], np.diff(reset)))
    # we add 1 everywhere except at reset indices
    delt = np.ones(df.shape[0], dtype=int)
    delt[reset] -= off
    return np.cumsum(delt) - 1
And here is a corner-case example:
df = pd.DataFrame([
['a', 1, 1, 0],
['a', 1, 1, 1],
['a', 1, 1, 2],
['b', 1, 1, 0],
['b', 1, 1, 1],
['b', 1, 1, 2],
['a', 1, 1, 0],
['a', 1, 1, 1],
['a', 1, 1, 2],
], columns='col1 col2 col3 desired'.split())
c = cum_count(df, 'col1 col2'.split(), 'col3')
assert np.all(c == df['desired'])
print(c)
# out
[0 1 2 0 1 2 0 1 2]
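As a quick sanity check, cum_count also reproduces the desired column of the small sample frame from the top of this question (a usage sketch; df_small is just a local name for a copy of that frame):
df_small = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                         'col2': [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
                         'col3': [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
                         'desired': [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]})
c = cum_count(df_small, ['col1', 'col2'], 'col3')
assert np.all(c == df_small['desired'])
print(c)
# [0 1 2 0 0 0 0 0 0 0 1 0 1]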

Splitting survey data with OneHotEncoder

I have a dataframe with results from a survey where there were options A-E and it was possible to select more than one option - a selection could be 'A' or 'A;C;D', etc.
I will be using the data for some machine learning and am looking to run it through OneHotEncoder to end up with the 5 columns with 1's and 0's.
An example of my initial survey data is :
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
I initially tried LabelEncoder but obviously ended up with a lot of features (rather than just the A-E).
You can also use MultiLabelBinarizer for this:
inputX = [element.split(';') for element in survey_data['Q1']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
transformedX = mlb.fit_transform(inputX)
#Out: transformedX
#array([[1, 0, 0, 0, 0, 0],
#       [0, 1, 0, 0, 0, 0],
#       [0, 0, 1, 0, 0, 0],
#       [1, 0, 0, 1, 0, 0],
#       [0, 0, 0, 1, 1, 0],
#       [0, 0, 0, 0, 0, 1]])
Here's one approach, using get_dummies:
import pandas as pd
# example data provided by OP
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
# split out rows with multiple chosen options into columns
tmp = survey_data.Q1.str.split(';').apply(pd.Series)
# one-hot encode columns with get_dummies, then overlay into one df
df = (pd.get_dummies(tmp[0])
        .add(pd.get_dummies(tmp[1]), fill_value=0)
        .astype(int))
print(df)
A B C D E F
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 1 0 0 1 0 0
4 0 0 0 1 1 0
5 0 0 0 0 0 1
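As a side note (not part of either answer above), pandas can also split and one-hot encode in a single step with Series.str.get_dummies, which may be convenient here:
import pandas as pd
survey_data = pd.DataFrame({'Q1': ['A', 'B', 'C', 'A;D', 'D;E', 'F']})
# str.get_dummies splits on the separator and one-hot encodes in one call
encoded = survey_data['Q1'].str.get_dummies(sep=';')
print(encoded)
#    A  B  C  D  E  F
# 0  1  0  0  0  0  0
# 1  0  1  0  0  0  0
# 2  0  0  1  0  0  0
# 3  1  0  0  1  0  0
# 4  0  0  0  1  1  0
# 5  0  0  0  0  0  1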
