Can we create a matrix with neighbouring columns? - python

Suppose that I have the matrix:
df=
a b c d e
z 1 1 1 0 1
y 0 0 0 1 0
x 1 0 1 0 0
w 0 1 0 0 1
v 0 0 0 1 0
Now we want to find all neighbouring columns and put that into a matrix. For my example this gives the result:
result =
a b c d e
a 1 1 1 0 1
b 1 1 1 0 1
c 1 1 1 0 1
d 0 0 0 0 0
e 1 1 1 0 0
The logic is as follows: since columns a and b both have a 1 in row z, there should be a 1 in the cells (a, b) and (b, a). One mutual row is sufficient.
Now I am doing it with a for-loop but it is too slow.
df = pd.DataFrame( [[1, 1, 1, 0, 1],
[0, 0, 0, 1, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 0, 1],
[0, 0, 0, 1, 0]],
index=['a', 'b', 'c', 'd', 'e' ],
columns=['a', 'b', 'c', 'd', 'e' ],
)
result = pd.DataFrame(False, index=df.columns, columns=df.columns)
for row in result.index:
    my_rows = df.index[df.loc[:, row] == 1]
    my_neighbours = df.columns[df.loc[my_rows, :].any(axis=0)]
    result.loc[row, my_neighbours] = True
Does anyone have an alternative?

Perhaps you want something like this:
import pandas as pd
df = pd.DataFrame( [[1, 1, 1, 0, 1],
[0, 0, 0, 1, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 0, 1],
[0, 0, 0, 1, 0]],
index=['a', 'b', 'c', 'd', 'e' ],
columns=['a', 'b', 'c', 'd', 'e' ],
)
res = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e'])
for col in df.columns:
    res[col] = df.apply(lambda row: int(row[col]==1 and row[row.name]==1), axis=1)
But as mentioned in the comments, note that there is an error in your example output.
I also had to guess what you want in some of the cases.
res
Out[176]:
a b c d e
a 1 1 1 0 1
b 0 0 0 0 0
c 1 0 1 0 0
d 0 0 0 0 0
e 0 0 0 0 0
It is also possible to achieve the result with the one-liner below:
res2 = df.apply( lambda row: pd.Series([int(row[col]==1 and row[row.name]==1) for col in row.index], index=row.index), axis=1, result_type='expand' )
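For completeness, a fully vectorized sketch based on the stated logic (this is not part of the original answer): a matrix product of the frame with its transpose counts, for every pair of columns, the rows in which both hold a 1, and any shared row makes the pair neighbours. Note that the diagonal comes out as 1 for every column that contains a 1, which differs from the example output (which, as noted above, contains errors).
# reusing the df defined above; entry (i, j) counts the rows where columns i and j are both 1
shared = df.T.dot(df)
result_vec = shared.gt(0).astype(int)
print(result_vec)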

Related

Select mode of each group and assign back to the original dataframe on a condition

I have a dataframe like this:
df = pd.DataFrame({
'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
})
And I want to create a new column in which each value represents the most frequent value in the svm_prediction column for that specific id_vector group. For example, if for id_vector 'A' the most frequent value in svm_prediction is 1, the new column will have all values 1 for that group; the same goes for 0. In case the frequency is equal for both values, put 2 in each row.
The result will be something like this where column new_label is added to df.
result = pd.DataFrame({
'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
'new_label': [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
})
I tried:
df.groupby('id_vector')['svm_prediction'].mode()
but an error is raised because the GroupBy object has no attribute mode.
I also tried:
df['average'] = df[['svm_prediction']].mode(axis=0)
which gives NaN values for every row except the first one.
Since the output you want has the same length as the original dataframe, groupby.transform seems more appropriate. Then you can pass a lambda into it that uses a ternary operator to assign the mode if it's a singleton and the length otherwise.
mode_f = lambda x: m[0] if len(m:=x.mode()) == 1 else len(m)
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_f)
We can use np.where to match the cases you want, together with the group mean: the mean equals 0.5 when zeros and ones are equally frequent, is above 0.5 when ones are more frequent, and below 0.5 when zeros are more frequent.
import numpy as np
df['new_label'] = df.groupby('id_vector').svm_prediction.transform(np.mean)
df['new_label'] = np.where(df.new_label > 0.5, 1, np.where(df.new_label < 0.5, 0, 2))
output
id_vector svm_prediction new_label
0 A 0 0
1 A 0 0
2 A 1 0
3 A 1 0
4 A 0 0
5 B 0 2
6 B 1 2
7 B 0 2
8 B 1 2
9 C 1 1
10 C 0 1
11 C 1 1
12 C 1 1
13 D 1 0
14 D 0 0
15 D 0 0
16 D 0 0
17 E 0 2
18 E 1 2
19 E 1 2
20 E 0 2
Well, I think you are looking for this:
result_final = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
print(result_final)
'''
id_vector
A 0
B [0, 1] # both values occur equally often
C 1
D 0
E [0, 1]
Name: svm_prediction, dtype: object
'''
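If you then want the per-row column the question asks for, one option (a sketch, reusing the df from the question) is to map this aggregate back onto id_vector and encode tied groups, which come back as arrays, as 2:
import numpy as np
import pandas as pd
modes = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
# a tied group yields an array; encode it as 2, otherwise keep the single mode
new_label = modes.apply(lambda m: 2 if isinstance(m, np.ndarray) else m)
df['new_label'] = df['id_vector'].map(new_label)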

Cumcount with reset in Python

I have the following problem. I need to compute a cumcount, but I would like to reset the counter whenever the series is interrupted.
See example:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives me the wrong output:
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?
Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output
col_1 col_2 seq
0 a 3 0
1 a 2 1
2 b 1 0
3 b 0 1
4 a -3 0
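The key piece is the shift-and-cumsum expression, which builds a run identifier that increases every time col_1 changes; cumcount then restarts at each run boundary. On the sample data:
import pandas as pd
df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]})
run_id = (df["col_1"] != df["col_1"].shift()).cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3] -- one id per consecutive run of col_1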
Note that since you are interested in runs (as in run-length encoding), itertools.groupby might be better suited for this task; consider the following example:
import itertools
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]
print(df)
output
col1 seq
0 a 0
1 a 1
2 b 0
3 b 1
4 a 0

Count unique records based on conditional result of aggregate functions on multiple columns

My data looks like this:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
## df output
ID group attempts successes score
0 1 A 0 1 None
1 1 A 1 0 5
2 1 A 1 0 5
3 1 A 1 0 4
4 2 A 1 0 5
5 2 A 1 0 4
6 3 A 1 0 5
7 3 A 0 1 None
8 3 A 1 0 1
9 4 B 1 0 5
10 4 B 1 0 0
11 4 B 1 0 1
12 4 B 0 1 None
13 5 B 0 1 None
14 5 B 1 0 1
15 5 B 0 1 None
I'm trying to group by two columns (group, score) and count the number of unique IDs, after first identifying which (group, ID) pairs have at least one success across all score values. In other words, I only want to count an ID (once, as unique) in the aggregation if it has at least one associated success. I also only want to count each (group, ID) pair once regardless of how many attempts it contains (i.e. if an ID has a sum of 5 successes, I only want to count 1).
The successes and attempts columns are binary (only 1 or 0). For example, for ID = 1, group = A, there is at least 1 success. Therefore, when counting the number of unique IDs per (group, score), I will include that ID.
I'd like the final output to look something like this so that I can calculate the ratio of unique successes to unique attempts for each (group, score) combination.
group score successes_count attempts_counts ratio
A 5 2 3 0.67
4 1 2 0.50
1 1 1 1.0
0 0 0 inf
B 5 1 1 1.0
4 0 0 inf
1 2 2 1.0
0 1 1 1.0
So far I've been able to run a pivot table that sums per (group, ID), to identify those IDs that have at least 1 success. However, I'm not sure of the best way to use this to reach my desired final state.
p = pd.pivot_table(data=df,
                   values=['ID'],
                   index=['group', 'ID'],
                   columns=['successes', 'attempts'],
                   aggfunc={'ID': 'count'})
# p output
ID
successes 0 1
attempts 1 0
group ID
A 1 3.0 1.0
2 2.0 NaN
3 2.0 1.0
B 4 3.0 1.0
5 1.0 2.0
Let's try something like:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5],
'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 0],
'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 1, 0, 1],
'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
0, 1, None, None, 1, None]})
# Groups With At least 1 Success
m = df.groupby('group')['successes'].transform('max').astype(bool)
# Filter Out
df = df[m]
# Replace 0 successes with NaNs
df['successes'] = df['successes'].replace(0, np.nan)
# FFill BFill each group so that any success will fill the group
df['successes'] = df.groupby(['ID', 'group'])['successes'] \
.apply(lambda s: s.ffill().bfill())
# Pivot then stack to make sure each group has all score values
# Sort and reset index
# Rename Columns
# fix types
p = df.drop_duplicates() \
      .pivot_table(index='group',
                   columns='score',
                   values=['attempts', 'successes'],
                   aggfunc='sum',
                   fill_value=0) \
      .stack() \
      .sort_values(['group', 'score'], ascending=[True, False]) \
      .reset_index() \
      .rename(columns={'attempts': 'attempts_counts',
                       'successes': 'successes_count'}) \
      .convert_dtypes()
# Calculate Ratio
p['ratio'] = p['successes_count'] / p['attempts_counts']
print(p)
Output:
group score attempts_counts successes_count ratio
0 A 5 3 2 0.666667
1 A 4 2 1 0.5
2 A 1 1 1 1.0
3 A 0 0 0 NaN
4 B 5 1 1 1.0
5 B 4 0 0 NaN
6 B 1 2 2 1.0
7 B 0 1 1 1.0
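A more compact route to the same counts is sketched below, starting again from the original df defined in the question (this is one possible reading of the question's logic, not the answer above; note it omits (group, score) pairs with no attempts at all, such as A/0 and B/4, unlike the pivot/stack approach):
import pandas as pd
# flag IDs that have at least one success anywhere, then count unique IDs per (group, score)
has_success = df.groupby(['group', 'ID'])['successes'].transform('max').eq(1)
attempted = df[df['attempts'].eq(1)].assign(id_has_success=has_success)
attempts_counts = attempted.groupby(['group', 'score'])['ID'].nunique()
successes_count = (attempted[attempted['id_has_success']]
                   .groupby(['group', 'score'])['ID'].nunique())
summary = pd.concat({'successes_count': successes_count,
                     'attempts_counts': attempts_counts}, axis=1).fillna(0).astype(int)
summary['ratio'] = summary['successes_count'] / summary['attempts_counts']
print(summary)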

Finding common subindex values with error?

How would you find the common values of the sub-index (column B in this example) between two dataframes, where index A == 'a'?
import pandas as pd
df = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
'Mi': [0, 0, 0, 0, 0, 0],'A':['a', 'a', 'a', 'a', 'b', 'b'],
'B': [1, 2, 2, 3, 4, 5]})
df = df.set_index(['A', 'B'])
Do Ri Mi
A B
a 1 0 0 0
2 0 0 0
2 0 0 0
3 0 0 0
b 4 0 0 0
5 0 0 0
df2 = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
'Mi': [0, 0, 0, 0, 0, 0], 'A':['a', 'a', 'a', 'a', 'b', 'b'],
'B': [3, 3, 4, 6, 7, 8]})
df2 = df2.set_index(['A', 'B'])
Do Ri Mi
A B
a 3 0 0 0
3 0 0 0
4 0 0 0
6 0 0 0
b 7 0 0 0
8 0 0 0
Currently I have:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
df_a_b = df_a.index.levels[1].tolist()
df2_a_n = df2_a.index.levels[1].tolist()
set(df_a_b) & set(df2_a_n)
But this seems to take values from both A == 'a' and A == 'b'.
I also noticed that loc['a'] and loc[['a']] give different dataframes. I'm not sure if this is related, but what is the significance of [['a']] vs ['a']?
For a single overlap, use set intersection after subsetting each DataFrame:
set(df.loc['a'].index) & set(df2.loc['a'].index)
#{3}
merge also works, but is overkill for a single intersection. On the other hand, if you want to do all the intersections at once, then use .merge + groupby
#Single
df.loc['a'].merge(df2.loc['a'], left_index=True, right_index=True).index.unique()
#Int64Index([3], dtype='int64', name='B')
#All
df.merge(df2, on=['A', 'B']).reset_index().groupby('A').B.unique()
#A
#a [3]
#Name: B, dtype: object
To explain your error, you were finding the intersection of the levels, but what you want is the intersection of the level values. Your current code should be changed to:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
# Get The Level Values, not the Level IDs
df_a_b = df_a.index.get_level_values(1).tolist()
df2_a_n = df2_a.index.get_level_values(1).tolist()
set(df_a_b) & set(df2_a_n)
#{3}
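On the side question about ['a'] vs [['a']]: selecting with a scalar label drops that level from the index, while selecting with a list keeps it, so the two results carry different indexes. A quick illustration on the indexed df:
print(df.loc['a'].index)    # Index of the B values only: [1, 2, 2, 3]
print(df.loc[['a']].index)  # MultiIndex with level A kept: [('a', 1), ('a', 2), ...]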

Cumulative count reset on condition

I have a dataframe similar to this:
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
'col2': [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
'col3': [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
'desired': [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]})
I want to apply a rolling sum on col3 which resets when either of col1 or col2 changes, or when the previous value of col3 was zero.
Note that the count is offset by 1 cell. This means the desired value for a new (col1, col2) combination will always be zero.
The code below demonstrates the required logic. However, it takes nearly 4 minutes on the dataset below.
des = []
count = 0
for i in range(1, len(df)):
    des.append(count)
    if (df.iloc[i-1].col1 == df.iloc[i].col1) & \
       (df.iloc[i-1].col2 == df.iloc[i].col2) & \
       (df.iloc[i-1].col3 == 1):
        count += 1
    else:
        count = 0
des.append(0)
df['desired'] = des
A bigger dataset to test on: https://www.dropbox.com/s/hbafcq6hdkh4r9r/test.csv?dl=0
Use groupby with shift first and then count the consecutive 1s:
a = df.groupby(['col1','col2'])['col3'].shift().fillna(0).eq(1)
b = a.cumsum()
df['desired'] = b-b.where(~a).ffill().fillna(0).astype(int)
print (df.head(20))
col1 col2 col3 desired
0 100055 1 1 0
1 100055 1 0 1
2 100055 1 0 0
3 100055 1 0 0
4 100055 1 0 0
5 100055 1 0 0
6 100055 1 0 0
7 100055 1 0 0
8 100055 1 0 0
9 100055 1 0 0
10 100055 1 1 0
11 100055 1 1 1
12 100055 1 0 2
13 100055 1 1 0
14 100055 1 1 1
15 100055 1 0 2
16 100055 1 0 0
17 100055 1 1 0
18 100055 1 0 1
19 100055 1 1 0
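To see why the where/ffill step works, here is roughly what the intermediate series look like on the small frame from the question (values written out by hand for illustration):
a = df.groupby(['col1', 'col2'])['col3'].shift().fillna(0).eq(1)
# a is True where the previous row of the same (col1, col2) group had col3 == 1
b = a.cumsum()
# b.tolist()                    -> [0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4]
# b.where(~a).ffill().fillna(0) -> [0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3]
# subtracting the second from the first restarts the running count at every False:
# desired                       -> [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]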
Since the OP wants a rolling count, that seems to indicate that they want to handle groups that may repeat at different places in the df, without lumping them all together (like a groupby would do).
From the problem statement, it sounds more like a cumulative operation with resets. That is a class of problems that is relatively easy to vectorize, and it doesn't involve groupby at all.
Here is a way using numpy:
import numpy as np

def cum_count(df, groupcols, zerocol):
    a = df[groupcols].values
    c = df[zerocol].values
    # find indices at which to reset the cumulative count
    r0 = np.concatenate(([True], np.any(a[1:] != a[:-1], axis=1)))
    r1 = np.concatenate(([True], c[:-1] == 0))
    reset = np.nonzero(r0 + r1)[0]
    # offset: values to subtract at the reset indices
    off = np.concatenate(([0], np.diff(reset)))
    # we add 1 everywhere except at reset indices
    delt = np.ones(df.shape[0], dtype=int)
    delt[reset] -= off
    return np.cumsum(delt) - 1
And here is a corner-case example:
df = pd.DataFrame([
['a', 1, 1, 0],
['a', 1, 1, 1],
['a', 1, 1, 2],
['b', 1, 1, 0],
['b', 1, 1, 1],
['b', 1, 1, 2],
['a', 1, 1, 0],
['a', 1, 1, 1],
['a', 1, 1, 2],
], columns='col1 col2 col3 desired'.split())
c = cum_count(df, 'col1 col2'.split(), 'col3')
assert np.all(c == df['desired'])
print(c)
# out
[0 1 2 0 1 2 0 1 2]
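Applying the same function to the frame from the original question also reproduces the desired column (a quick sanity check using the cum_count defined above):
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                   'col2': [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
                   'col3': [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]})
print(cum_count(df, ['col1', 'col2'], 'col3'))
# [0 1 2 0 0 0 0 0 0 0 1 0 1]  -- matches the 'desired' column from the question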
