pandas transform nunique on groupby object, dealing with NaN values - python

I have the following df,
  inv_id  cluster_id
0    793           2
1                  2
2    789           3
3    789           3
4                  4
5                  4
I'd like to group by cluster_id and check how many unique values each group has:
df['same_inv_id'] = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
but I'd like to set same_inv_id = False for any cluster that contains one or more empty/blank inv_id (including clusters that are entirely blank), so the result will look like:
  inv_id  cluster_id  same_inv_id
0    793           2        False
1                  2        False
2    789           3         True
3    789           3         True
4                  4        False
5                  4        False

IIUC, build the blank check first, then combine transform('all') with the nunique condition:
s1 = df.inv_id.ne('').groupby(df.cluster_id).transform('all')
s1
Out[432]:
0 False
1 False
2 True
3 True
4 False
5 False
Name: inv_id, dtype: bool
s2 = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
df['same_inv_id'] = s1 & s2
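For reference, a self-contained sketch of the whole approach, assuming the blanks are empty strings (if they are NaN instead, swap ne('') for notna(); nunique already ignores NaN):
import pandas as pd

# reproduction of the frame above, with blanks as empty strings
df = pd.DataFrame({'inv_id': ['793', '', '789', '789', '', ''],
                   'cluster_id': [2, 2, 3, 3, 4, 4]})

# True only for rows whose cluster contains no blank inv_id
s1 = df['inv_id'].ne('').groupby(df['cluster_id']).transform('all')
# True only for rows whose cluster has exactly one unique inv_id
s2 = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
df['same_inv_id'] = s1 & s2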

Related

How to get the number of times the same boolean occurs in two columns in Python

I have the following dataframe:
A B C D
0 True 5 True True
1 True 6 False False
2 False 5 True True
3 False 8 True False
4 True 2 True True
For the rows where column D is True, it should print how many times column A and column C are True.
Expected Output
A : 2
C : 3
You can filter rows by column D (a boolean mask) with DataFrame.loc, which also lets you select columns A and C by name; finally, count the True values with sum:
s = df.loc[df.D, ['A','C']].sum()
print (s)
A 2
C 3
dtype: int64
Details:
print (df.loc[df.D, ['A','C']])
A C
0 True True
2 False True
4 True True
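To print the result in the exact "name : count" format from the question, iterate the Series (a small follow-up sketch):
for name, count in s.items():
    print(name, ':', count)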

pandas groupby return a boolean vector

I have a time series dataframe where I would like to group the data and compare each value both to another cell in the same row and to the previous value.
The code below returns a vector for the whole dataframe, but if I try to group it, I get a dataframe with apply() and an error with agg or transform.
Sample data frame
df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 1, 2, 1], 'target': [100, 100, 100, 100, 10, 10, 10, 10, 50], 'val': [90, 80, 70, 4, 120, 6, 60, 8, 50]})
df
group target val
0 1 100 90
1 1 100 80
2 1 100 70
3 2 100 4
4 2 10 120
5 2 10 6
6 1 10 60
7 2 10 8
8 1 50 50
Here is my attempt at a function
def spike(df):
    high = df['val'] > df['target'] + 25
    rising = df['val'] > df['val'].shift()
    return high & rising

print(spike(df))
print(df.groupby('group').apply(spike))
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
8 False
dtype: bool
0 1 2 6 8
group
1 False False False False False
2 False True False False True
I was trying to get the second output to look like the first, except that row 6 should be False.
You are overthinking it:
shift = df.groupby('group')['val'].shift()
df['val'].gt(df['target']+25) & df['val'].gt(shift)
Output:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
dtype: bool
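To keep the result aligned with the original frame, you can assign it back as a column (a sketch using the df defined above):
shift = df.groupby('group')['val'].shift()
df['spike'] = df['val'].gt(df['target'] + 25) & df['val'].gt(shift)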

How to differentiate string and Alphanumeric?

df:
company_name product Id rating
0 matrix mobile Id456 2.5
1 ins-faq alpha1 Id956 3.5
2 metric5 sounds-B Id-356 2.5
3 ingsaf digital Id856 4star
4 matrix win11p Idklm 2.0
5 4567 mobile 596 3.5
df2:
Col_name Datatype
0 company_name String #(pure string)
1 Product String #(pure string)
2 Id Alpha-Numeric #(must contain at least 1 number and 1 letter)
3 rating Float or int
df is the main dataframe and df2 is the expected datatype information of main dataframe.
How can I check the values in every column and extract those with the wrong datatype?
output:
row_num col_name current_value expected_dtype
0 2 company_name metric5 string
1 5 company_name 4567 string
2 1 Product alpha1 string
3 4 Product win11p string
4 4 Id Idklm Alpha-Numeric
5 5 Id 596 Alpha-Numeric
6 3 rating 4star Float or int
For columns that cannot contain numbers, you can find the exceptions with:
In [5]: df['product'].str.contains(r'[0-9]')
Out[5]:
0 False
1 True
2 False
3 False
4 True
5 False
Name: product, dtype: bool
For Alpha-Numeric columns, identify compliant values with:
In [7]: df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)')
Out[7]:
0 True
1 True
2 True
3 True
4 False
5 False
Name: Id, dtype: bool
For int or float columns, find exceptions with:
In [8]: df['rating'].str.contains(r'[^0-9.+-]')
Out[8]:
0 False
1 False
2 False
3 True
4 False
5 False
That may be problematic: it won't catch values with multiple or misplaced plus, minus, or dot characters, like 9.4.1 or 6+3.-12. Instead you could use:
In [11]: def check(thing):
    ...:     try:
    ...:         float(thing)  # valid if the conversion succeeds
    ...:         return True
    ...:     except ValueError:
    ...:         return False
    ...:
In [12]: df['rating'].apply(check)
Out[12]:
0 True
1 True
2 True
3 False
4 True
5 True
Name: rating, dtype: bool
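Putting the three checks together, a minimal sketch that builds the requested violation report, assuming pandas is imported as pd, df holds the sample data above, and check is the float validator just defined:
# one boolean "violation" mask per column, built from the checks above
masks = {
    'company_name': df['company_name'].str.contains(r'[0-9]'),
    'product': df['product'].str.contains(r'[0-9]'),
    'Id': ~df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)'),
    'rating': ~df['rating'].apply(check),
}
expected = {'company_name': 'string', 'product': 'string',
            'Id': 'Alpha-Numeric', 'rating': 'Float or int'}

rows = [{'row_num': i, 'col_name': col,
         'current_value': df.at[i, col], 'expected_dtype': expected[col]}
        for col, mask in masks.items() for i in df.index[mask]]
print(pd.DataFrame(rows))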

Pandas incremental values when boolean changes

There is a dataframe containing a column in which boolean values alternate. I want to create an incremental series based on those boolean changes: increment only when the value differs from the previous one, and do it without a loop.
Example dataframe:
column
0 True
1 True
2 False
3 False
4 False
5 True
I want to get this:
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Compare the column with its shifted version using ne (not equal), then take the cumulative sum:
df['inc'] = df['column'].ne(df['column'].shift()).cumsum()
print (df)
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Detail:
print (df['column'].ne(df['column'].shift()))
0 True
1 False
2 True
3 False
4 False
5 True
Name: column, dtype: bool
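The run labels are also handy for aggregation, for example to get each run's value and length (a small sketch using the df above):
print(df.groupby('inc')['column'].agg(['first', 'size']))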

List columns in data frame that have missing values as '?'

List the names of the columns of a dataframe, along with the count of missing values, when missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:,0].str.split(',', n=-1, expand=True)
bridge_sep.columns = ['IDENTIF','RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES','CLEAR-G', 'T-OR-D',
'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data (a snippet; the full frame is actually 107 rows x 13 columns):
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the matches with sum (True counts as 1); then remove the zero counts with boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Finally, to get a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#compare with same length Series
#same index values like index/columns of DataFrame
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True
If your DataFrame is named df, try (df == '?').sum()
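An equivalent route is to recode '?' as NaN first and then use the standard missing-value tooling (a sketch against the bridge_sep frame from the question, assuming numpy is imported as np):
s = bridge_sep.replace('?', np.nan).isna().sum()
print(s[s != 0])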
