Output zero counts in a data frame with size() in Python

I have a file that consists of three integer columns: A, B and C. Using Python, I would like to groupby() column 'A' and get the size() of each group, counting only rows whose value in column 'B' is greater than 4, 6 and 8 respectively. So I implemented the code below:
>>> import pandas as pd
>>>
>>> df = pd.read_csv("test.txt", sep="\t")
>>> df
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
>>>
>>> out1 = df[df['B'] > 4].groupby(['A']).size().reset_index()
>>> out1
A 0
0 1 1
1 2 2
>>> out2 = df[df['B'] > 6].groupby(['A']).size().reset_index()
>>> out2
A 0
0 2 1
>>> out3 = df[df['B'] > 8].groupby(['A']).size().reset_index()
>>> out3
Empty DataFrame
Columns: [A, 0]
Index: []
>>>
out1 is the output that I want. But for out2 and out3, how do I get a data frame similar to out1, with zeros filled in as below?
out2:
A 0
0 1 0
1 2 1
out3:
A 0
0 1 0
1 2 0
Thanks in advance.

The idea is to create a boolean mask, convert it to integers and aggregate with sum. Here it is necessary to group by the Series df['A'] rather than by the column name A:
out3 = (df['B'] > 8).astype(int).groupby(df['A']).sum().reset_index()
# alternative
# out3 = (df['B'] > 8).view('i1').groupby(df['A']).sum().reset_index()
print(out3)
A B
0 1 0
1 2 0
Another idea is to create a helper column, e.g. reassign B with the mask values, and then aggregate with sum:
out3 = df.assign(B = (df['B'] > 8).astype(int)).groupby('A')['B'].sum().reset_index()
print(out3)
A B
0 1 0
1 2 0
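A third option, not in the original answer, is to keep the size() approach from the question and reindex the result against all group keys, filling missing groups with zero (the count column is named 'count' here for clarity; the original output simply called it 0):

```python
import pandas as pd

# sample data matching the question
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 6, 6]})

# count rows with B > 8 per group, then reindex so every value of A
# appears, with 0 for groups that had no matching rows
out3 = (df[df['B'] > 8]
        .groupby('A').size()
        .reindex(df['A'].unique(), fill_value=0)
        .rename_axis('A').reset_index(name='count'))
```

This keeps the filter-then-count structure of the original code and only adds the reindex step to restore the empty groups.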

Related

Comparing the value of a column with the previous value of a new column using Apply in Python (Pandas)

I have a dataframe with these values in column A:
A = [0, 5, 1, 7, 0, 2, 1, 3, 0]
df = pd.DataFrame(A, columns=['A'])
A
0 0
1 5
2 1
3 7
4 0
5 2
6 1
7 3
8 0
I need to create a new column (called B) and populate it using the following conditions:
Condition 1: If the value of A is equal to 0 then, the value of B must be 0.
Condition 2: If the value of A is not 0 then I compare its value to the previous value of B. If A is higher than the previous value of B then I take A, otherwise I take B.
The result should be this:
A B
0 0 0
1 5 5
2 1 5
3 7 7
4 0 0
5 2 2
6 1 2
7 3 3
8 0 0
The dataset is huge and using loops would be too slow. I need to solve this without using loops or the pandas loc accessor. Could anyone help me solve this with the apply function? I have tried different things without success.
Thanks a lot.
One way to do this, I guess, could be the following:
def do_your_stuff(row):
    global value
    # fancy stuff here
    value = row["B"]
    [...]

value = df.iloc[0]['B']
df["C"] = df.apply(lambda row: do_your_stuff(row), axis=1)
Try this:
df['B'] = df['A'].shift()
df['B'] = df.apply(lambda x:0 if x.A == 0 else x.A if x.A > x.B else x.B, axis=1)
Use .shift() to shift your column one cell down and check whether the previous value is smaller and the current value is not 0. Then use .mask() to replace values with the previous one wherever the condition holds.
from io import StringIO
import pandas as pd

wt = StringIO("""A
0 0
1 2
2 3
3 1
4 2
5 7
6 0
""")
df = pd.read_csv(wt, sep=r'\s+')
df
A
0 0
1 2
2 3
3 1
4 2
5 7
6 0
def func(df, col):
    df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    if col == 'B':
        while ((df[col].shift(1) > df[col]) & (df[col] != 0)).any():
            df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    return df

(df.pipe(func, 'A').pipe(func, 'B'))
Output:
A B
0 0 0
1 2 2
2 3 3
3 1 3
4 2 3
5 7 7
6 0 0
Using the solution of Achille, I solved it this way:
import pandas as pd

A = [0, 2, 3, 0, 2, 7, 2, 3, 2, 20, 1, 0, 2, 5, 4, 3, 1]
df = pd.DataFrame(A, columns=['A'])
df['B'] = 0

def function(row):
    global value
    global prev
    if row['A'] == 0:
        value = 0
    elif row['A'] > value:
        value = row['A']
    else:
        value = prev
    prev = value
    return value

value = df.iloc[0]['B']
prev = value
df["B"] = df.apply(lambda row: function(row), axis=1)
df
output:
A B
0 0 0
1 2 2
2 3 3
3 0 0
4 2 2
5 7 7
6 2 7
7 3 7
8 2 7
9 20 20
10 1 20
11 0 0
12 2 2
13 5 5
14 4 5
15 3 5
16 1 5
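A loop-free alternative (not among the original answers): for the non-negative data in the question, the rule is equivalent to a cumulative maximum that resets at every zero. You can segment the column at the zeros and take cummax within each segment:

```python
import pandas as pd

A = [0, 2, 3, 0, 2, 7, 2, 3, 2, 20, 1, 0, 2, 5, 4, 3, 1]
df = pd.DataFrame({'A': A})

# each zero starts a new segment; the cumulative max within a segment
# keeps the largest value seen since the last reset, which matches
# "take A if it beats the previous B, else keep B"
segment = df['A'].eq(0).cumsum()
df['B'] = df['A'].groupby(segment).cummax()
```

This reproduces the output of the apply-based solution above without row-wise Python calls, so it scales much better on a huge dataset.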

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. When I instead have the column names as lists, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
.reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How can I get the occurrence matrix directly from the one-hot dataframe?
You can have some fun with matrix math!
import numpy as np

u = np.diag(np.ones(data.shape[1], dtype=bool))
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
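As a quick sanity check (not part of the original answer): this works because entry (i, j) of data.T.dot(data) counts the rows in which columns i and j are both 1, and masking with the negated identity matrix zeroes out the diagonal self-counts:

```python
import numpy as np
import pandas as pd

# the one-hot sample data from the question
data = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                     'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# boolean identity matrix; ~u zeroes the diagonal so a column is not
# counted as co-occurring with itself
u = np.diag(np.ones(data.shape[1], dtype=bool))
co = data.T.dot(data) * (~u)
```

The result is symmetric and matches the crosstab-based output above.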

Get substring in one column based on the value in another column

My dataframe looks like this:
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df
Length name
0 2 a12
1 1 b1
2 0 c
I would like to have a result like this:
Length name
0 2 a
1 1 b
2 0 c
With the code from Getting substring based on another column in a pandas dataframe:
test_df.apply(lambda x: x['name'][:-x['Length']], axis=1)
test_df
I got the same dataframe as before:
Length name
0 2 a12
1 1 b1
2 0 c
Modify your apply a bit, to slice with respect to len(x['name']):
def f(x):
    return x['name'][:len(x['name']) - x['Length']]

test_df.apply(f, axis=1)
0 a
1 b
2 c
dtype: object
Try this:
import pandas as pd
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df['name'] = test_df.apply(lambda x: x['name'][:len(x['name']) - x['Length']], axis=1)
test_df
This outputs what you intended:
Length name
0 2 a
1 1 b
2 0 c
One can use list functions for this:
outlist = list(map(lambda x, y: x[0:(len(x) - y)], test_df.name, test_df.Length))
test_df.name = outlist
print(test_df)
Output:
Length name
0 2 a
1 1 b
2 0 c
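Since apply with axis=1 can be slow on a large frame, a plain list comprehension over the two columns (an alternative, not from the original answers) does the same slicing; slicing from the left end also sidesteps the [:-0] pitfall in the question's attempt, where a Length of 0 would produce an empty string:

```python
import pandas as pd

test_df = pd.DataFrame({'name': ['a12', 'b1', 'c'], 'Length': [2, 1, 0]})

# drop the last `Length` characters of each name
test_df['name'] = [n[:len(n) - k]
                   for n, k in zip(test_df['name'], test_df['Length'])]
```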

Add a column results in difference of rows

Let's say I have a data frame:
A B
0 a b
1 c d
2 e f
and what I am aiming for is the difference between the first value of column A and each row's value
Like this:
A B Ic
0 a b (a-a)
1 c d (a-c)
2 e f (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed (it just yields zeros, since each value is subtracted from itself). Any help would be greatly appreciated.
Select the first value with loc (by index and column name) or with iat (by column name and position), and subtract it. The sample below uses numeric values in column A:
df['Ic'] = df.loc[0,'A'] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
Detail:
print(df.loc[0, 'A'])
4
print(df['A'].iat[0])
4
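Putting the iat variant together as a self-contained snippet (using the numeric sample values shown in the answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 1, 0], 'B': ['b', 'd', 'f']})

# subtract each value in A from the first value of A
df['Ic'] = df['A'].iat[0] - df['A']
```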

Drop pandas dataframe rows AND columns in a batch fashion based on value

Background: I have a matrix which represents the distance between two points. In this matrix both rows and columns are the data points. For example:
A B C
A 0 999 3
B 999 0 999
C 3 999 0
In this toy example let's say I want to drop C for some reason, because it is far away from any other point. So I first aggregate the count:
df["far_count"] = df[df == 999].count()
and then batch remove them:
df = df[df["far_count"] == 2]
In this example this looks a bit redundant but please imagine that I have many data points like this (say in the order of 10Ks)
The problem with the above batch removal is that I would like to remove rows and columns at the same time (instead of just rows), and it is unclear to me how to do so elegantly. A naive way is to get a list of such data points, put it in a loop, and then:
for item in list:
    df = df.drop(item, axis=1).drop(item, axis=0)
But I was wondering if there is a better way. (Bonus if we could skip the intermediate step far_count.)
import numpy as np
import pandas as pd

np.random.seed([3, 14159])
idx = pd.Index(list('ABCDE'))
a = np.random.randint(3, size=(5, 5))
df = pd.DataFrame(
    a.T.dot(a) * (1 - np.eye(5, dtype=int)),
    idx, idx)
df
A B C D E
A 0 4 2 4 2
B 4 0 1 5 2
C 2 1 0 2 6
D 4 5 2 0 3
E 2 2 6 3 0
l = ['A', 'C']
m = df.index.isin(l)
df.loc[~m, ~m]
B D E
B 0 5 2
D 5 0 3
E 2 3 0
For your specific case, because the array is symmetric you only need to check one dimension.
m = (df.values == 999).sum(0) == len(df) - 1
In [66]: x = pd.DataFrame(np.triu(df), df.index, df.columns)
In [67]: x
Out[67]:
A B C
A 0 999 3
B 0 0 999
C 0 0 0
In [68]: mask = x.ne(999).all(1) | x.ne(999).all(0)
In [69]: df.loc[mask, mask]
Out[69]:
A C
A 0 3
C 3 0
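Applying the symmetric-case mask from above to the toy matrix in the question ties the two pieces together (a runnable sketch):

```python
import numpy as np
import pandas as pd

# the toy distance matrix from the question
df = pd.DataFrame([[0, 999, 3], [999, 0, 999], [3, 999, 0]],
                  index=list('ABC'), columns=list('ABC'))

# a point is "far" if it is 999 away from every other point;
# symmetry means checking one axis is enough
m = (df.values == 999).sum(0) == len(df) - 1

# drop the far rows and columns in one shot
out = df.loc[~m, ~m]
```

Boolean masking with loc on both axes removes the matching rows and columns at the same time, with no loop and no intermediate far_count column.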
