Compare Dataframe column with numpy ndarray and update value in dataframe - python

Dataframe df:
    A   B   C   D   E
    1   2   4   6       # E is the column to be updated
   12  34   5  54
    4   8  12   4
    3   5   6   2
    5   7  11  27
numpy ndarray (shape (4, 1)):
npar = [[12]
        [ 6]
        [ 2]
        [27]]
I have the above dataframe df and the array npar. I want to compare the values of column D against the array npar: if a row's value of D is found anywhere in npar, I want to set column E to 1 for that row, else 0. Kindly suggest how I can do this with sample code.

You need isin, but first it is necessary to flatten the array with numpy.ravel, and last convert the boolean mask to integers - Trues become 1s and Falses become 0s:
df['E'] = df.D.isin(npar.ravel()).astype(int)
print (df)
A B C D E
0 1 2 4 6 1
1 12 34 5 54 0
2 4 8 12 4 0
3 3 5 6 2 1
4 5 7 11 27 1
Detail:
npar = np.array([[12],[6],[2],[27]])
print (npar)
[[12]
[ 6]
[ 2]
[27]]
print (npar.ravel())
[12 6 2 27]
print (df.D.isin(npar.ravel()))
0 True
1 False
2 False
3 True
4 True
Name: D, dtype: bool
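For reference, a minimal self-contained sketch of the whole solution (the values are taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 4, 3, 5],
                   'B': [2, 34, 8, 5, 7],
                   'C': [4, 5, 12, 6, 11],
                   'D': [6, 54, 4, 2, 27]})
npar = np.array([[12], [6], [2], [27]])

# ravel flattens the (4, 1) array to 1-D, isin builds the boolean mask,
# and astype(int) turns True/False into 1/0
df['E'] = df['D'].isin(npar.ravel()).astype(int)
print(df)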

Related

Subtract two timestamps from two different columns that are in consecutive rows

So I have a pandas dataframe like this
a b
0 1 3
1 7 8
2 11 3
3 9 1
And I want to subtract column a at the previous index from column b; for example, b[1] = 8 and a[0] = 1, so b[1] - a[0] = 7, which goes into c[1]:
a b c
0 1 3 -
1 7 8 7
2 11 3 -4
3 9 1 -10
How can I do it? Thanks for your time and help :)
Use Series.sub with values shifted by Series.shift:
df['c'] = df['b'].sub(df['a'].shift())
print (df)
a b c
0 1 3 NaN
1 7 8 7.0
2 11 3 -4.0
3 9 1 -10.0
For an integer result use the nullable integer NA by converting to Int64:
df['c'] = df['b'].sub(df['a'].shift(1)).astype('Int64')
print (df)
a b c
0 1 3 <NA>
1 7 8 7
2 11 3 -4
3 9 1 -10
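A minimal runnable sketch of both variants (data from the question):
import pandas as pd

df = pd.DataFrame({'a': [1, 7, 11, 9], 'b': [3, 8, 3, 1]})

# float result, with NaN in the first row
df['c'] = df['b'].sub(df['a'].shift())

# nullable-integer result, with <NA> in the first row
df['c'] = df['b'].sub(df['a'].shift()).astype('Int64')
print(df)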

Pandas how to output distinct values in column based on duplicate in another column

Here an example:
import pandas as pd
df = pd.DataFrame({
    'product': ['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
    'value':   ['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})
product value
0 1 a
1 1 a
2 1 a
3 2 a
4 2 a
5 2 b
6 3 a
7 3 b
8 3 a
9 4 b
10 4 b
11 4 b
12 5 a
13 5 a
14 5 a
I need to output:
1 a
4 b
5 a
Because for those products all 'value' entries are the same.
I'm sorry for my bad English.
I think you need this:
m = df.groupby('product')['value'].transform('nunique')
df.loc[m == 1].drop_duplicates().reset_index(drop=True)
Output
product value
0 1 a
1 4 b
2 5 a
Details
df.groupby('product')['value'].transform('nunique') returns a series as below
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where the numbers are the number of unique values in each group. Then we use df.loc to keep only the rows where this value is 1, i.e. the groups with a single unique value.
Then we drop duplicates, since you need only each group and its unique value.
If I understand your question correctly, this simple code is for you:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
product value
0 1 a
3 2 a
6 3 a
9 4 b
12 5 a
You can try this:
# transform broadcasts the per-group test back to every row, so the mask aligns with df
mask = df.groupby('product')['value'].transform(lambda x: x.nunique() == 1)
df = df[mask].drop_duplicates()
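For completeness, a runnable sketch of the transform-based approach (using the DataFrame from the question):
import pandas as pd

df = pd.DataFrame({
    'product': ['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
    'value':   ['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})

# keep only the products whose 'value' is constant, then one row per product
m = df.groupby('product')['value'].transform('nunique')
print(df.loc[m == 1].drop_duplicates().reset_index(drop=True))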

Filter pandas dataframe rows by multiple column values

I have a pandas dataframe containing rows with numbered columns:
1 2 3 4 5
a 0 0 0 0 1
b 1 1 2 1 9
c 2 2 2 2 2
d 5 5 5 5 5
e 8 9 9 9 9
How can I filter rows based on whether all values in a subset of columns are above or below a certain value?
So, for example: I want to remove all rows where the values in columns 1 to 3 are not all > 3. In the above, that would leave me with only rows d and e.
The columns I am filtering and the value I am checking against are both arguments.
I've tried a few things, this is the closest I've gotten:
df[df[range(1,3)]>3]
Any ideas?
I used loc and all in this function:
def filt(df, cols, thresh):
    return df.loc[(df[cols] > thresh).all(axis=1)]

filt(df, [1, 2, 3], 3)
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
You can achieve this without using apply:
In [73]:
df[(df.ix[:,0:3] > 3).all(axis=1)]
Out[73]:
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
So this slices the df to just the first three columns using ix; then we compare against the scalar 3 and call all(axis=1) to create a boolean series that masks the index.
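Note that .ix has since been removed from pandas (as of 1.0); a positional equivalent uses iloc:
# select the first three columns by position, then keep rows where all values exceed 3
df[(df.iloc[:, 0:3] > 3).all(axis=1)]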

Groupby returning full row for max occurs

How to get the full row of data for a groupby result?
df
a b c d e
0 a 25 12 1 20
1 a 15 1 1 1
2 b 12 1 1 1
3 n 25 2 3 3
In [4]: df = pd.read_clipboard()
In [5]: df.groupby('a')['b'].max()
Out[5]:
a
a 25
b 12
n 25
Name: b, dtype: int64
How to get the full row?
a b c d e
a 25 12 1 20
b 12 1 1 1
n 25 2 3 3
I tried filtering with df[df.e == df.groupby('a')['b'].max()], but the sizes are different :(
Original data (columns 0-9):
EVE00101  Trial  DRY RUN  PASS  1610071  1610071  Y  20140808  NaN  29
(columns 10-14):
FF1  ./ff1.sh  Event Validation Hive Tables  2015-11-30  9:40:34
groupby([1, 7])[14].max() gives me the result, but as a grouped series with columns 1 and 7 as the index; I want the corresponding full rows. The real data has 15,000 rows; only one sample row is shown here.
You can use argmax() :
In [287]: df.groupby('a', as_index=False).apply(lambda x: x.loc[x.b.argmax(),])
Out[287]:
a b c d e
0 a 25 12 1 20
1 b 12 1 1 1
2 n 25 2 3 3
This way you get the whole row for each group's maximum b, even when the other columns in that row are not themselves group maxima.
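In recent pandas versions Series.argmax returns a position rather than a label, so the label-based idiom for this pattern uses idxmax (a sketch, not the original answer's code):
# one row per group: the row that holds each group's maximum 'b'
df.loc[df.groupby('a')['b'].idxmax()]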
I'd overwrite the 'b' column using transform and then drop the duplicate 'a' row using drop_duplicates:
In [331]:
df['b'] = df.groupby('a')['b'].transform('max')
df
Out[331]:
a b c d e
0 a 25 12 1 20
1 a 25 1 1 1
2 b 12 1 1 1
3 n 25 2 3 3
In [332]:
df.drop_duplicates('a')
Out[332]:
a b c d e
0 a 25 12 1 20
2 b 12 1 1 1
3 n 25 2 3 3
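A compact runnable version of this transform-and-drop approach (data from the question):
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'n'],
                   'b': [25, 15, 12, 25],
                   'c': [12, 1, 1, 2],
                   'd': [1, 1, 1, 3],
                   'e': [20, 1, 1, 3]})

# overwrite 'b' with each group's maximum, then keep the first row per 'a'
df['b'] = df.groupby('a')['b'].transform('max')
print(df.drop_duplicates('a'))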

Summing over a DataFrame with two conditions and multiple values

I have a DataFrame x with three columns;
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values;
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns;
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (note: the expected results listed in your question don't quite line up with the given condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
Then write a function which will be used by DataFrame.apply() later:
In[101]: def cond_sum(x):
    ...:     return sum(df1['c'].ix[np.logical_and(df1['a'] < x.ix[0], df1['b'] > x.ix[0])])
Finally:
In[102]: df3 = df2.apply(cond_sum, axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
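A modern equivalent without the removed .ix accessor, using boolean indexing with .loc (a sketch, assuming the answer's data with t = 3 and 5):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# for each t, sum c over the rows where a < t < b
df2['sum_c'] = df2['t'].apply(lambda t: df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum())
print(df2)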
