How to differentiate between string and alpha-numeric values? - python

df:

  company_name   product  Id      rating
0       matrix    mobile  Id456   2.5
1      ins-faq    alpha1  Id956   3.5
2      metric5  sounds-B  Id-356  2.5
3       ingsaf   digital  Id856   4star
4       matrix    win11p  Idklm   2.0
5         4567    mobile  596     3.5

df2:

  Col_name      Datatype
0 company_name  String         # (pure string)
1 Product       String         # (pure string)
2 Id            Alpha-Numeric  # (must contain at least 1 number and 1 alphabet)
3 rating        Float or int
df is the main dataframe and df2 holds the expected datatype of each column of the main dataframe.
How do I check every column's values and extract the ones with the wrong datatype?
output:

  row_num  col_name      current_value  expected_dtype
0       2  company_name  metric5        string
1       5  company_name  4567           string
2       1  Product       alpha1         string
3       4  Product       win11p         string
4       4  Id            Idklm          Alpha-Numeric
5       5  Id            596            Alpha-Numeric
6       3  rating        4star          Float or int

For columns that cannot contain numbers, you can find the exceptions with:
In [5]: df['product'].str.contains(r'[0-9]')
Out[5]:
0 False
1 True
2 False
3 False
4 True
5 False
Name: product, dtype: bool
For Alpha-Numeric columns, identify compliance with:
In [7]: df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)')
Out[7]:
0 True
1 True
2 True
3 True
4 False
5 False
Name: Id, dtype: bool
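Note that \D matches any non-digit, including punctuation, so a value like '-356' would also pass this check despite containing no letter. If the rule is strictly "at least one letter and at least one digit", a pair of lookaheads is tighter (a hedged alternative, not the original answer's regex):
df['Id'].str.contains(r'^(?=.*[A-Za-z])(?=.*\d)')
For this sample data it produces the same mask as Out[7].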
For int or float columns, find exceptions with:
In [8]: df['rating'].str.contains(r'[^0-9.+-]')
Out[8]:
0 False
1 False
2 False
3 True
4 False
5 False
That may be problematic: it won't catch values with multiple or misplaced plus, minus, or dot characters, like 9.4.1 or 6+3.-12. But you could use
In [11]: def check(thing):
    ...:     try:
    ...:         float(thing)
    ...:         return True
    ...:     except ValueError:
    ...:         return False
    ...:
In [12]: df['rating'].apply(check)
Out[12]:
0 True
1 True
2 True
3 False
4 True
5 True
Name: rating, dtype: bool
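To assemble the expected output table from these per-column masks, here is one sketch (assuming the column names shown in df above; the checks dict and the validity lambdas are illustrative, and check is the function from In [11]):

import pandas as pd

# map each column to its expected dtype label and a validity mask
checks = {
    'company_name': ('string', lambda s: ~s.str.contains(r'[0-9]')),
    'product':      ('string', lambda s: ~s.str.contains(r'[0-9]')),
    'Id':           ('Alpha-Numeric', lambda s: s.str.contains(r'(?:\d\D)|(?:\D\d)')),
    'rating':       ('Float or int', lambda s: s.apply(check)),
}

rows = []
for col, (expected, is_valid) in checks.items():
    # rows where the validity mask is False are the wrong-datatype values
    bad = df.loc[~is_valid(df[col].astype(str)), col]
    for row_num, value in bad.items():
        rows.append((row_num, col, value, expected))

result = pd.DataFrame(rows, columns=['row_num', 'col_name',
                                     'current_value', 'expected_dtype'])
print(result)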

Related

How to find minimum value in a column based on condition in another column of a dataframe?

I have a dataframe like below:
Number Req Response
0 3 6
1 5 0
2 33 4
3 15 3
4 12 2
I would like to identify the minimum 'Response' value before 'Req' is 15.
I tried the code below:
min_val = []
for row in range(len(df)):
    # if the next row of 'Req' contains 15, append the current row value of 'Response'
    if(df[row+1].loc[df[row+1]['Req'] == 15]):
        min_val.append(df['Response'].min())
    else:
        min_val.append(0)
I get 'invalid type comparison' error.
I expect the below output:
Min value of df['Response'] is: 0
If it's possible that the value 15 is not in the data, use a general solution:
df = df.reset_index(drop=True)
out = df.loc[df.Req.eq(15)[::-1].cumsum().ne(0), 'Response'].sort_values()
print (out)
1 0
3 3
2 4
0 6
Name: Response, dtype: int64
print (next(iter(out), 'no match'))
0
Details:
print (df.Req.eq(15))
0 False
1 False
2 False
3 True
4 False
Name: Req, dtype: bool
print (df.Req.eq(15)[::-1])
4 False
3 True
2 False
1 False
0 False
Name: Req, dtype: bool
print (df.Req.eq(15)[::-1].cumsum())
4 0
3 1
2 1
1 1
0 1
Name: Req, dtype: int32
print (df.Req.eq(15)[::-1].cumsum().ne(0))
4 False
3 True
2 True
1 True
0 True
Name: Req, dtype: bool
Test with not matched value:
print (df)
   Number  Req  Response
0       0    3         6
1       1    5         0
2       2   33         4
3       3  150         3
4       4   12         2
df = df.reset_index(drop=True)
out = df.loc[df.Req.eq(15)[::-1].cumsum().ne(0), 'Response'].sort_values()
print (out)
Series([], Name: Response, dtype: int64)
print (next(iter(out), 'no match'))
no match
One way could be using idxmax to find the first index where Req equals 15, then using the result to slice the dataframe and take the minimum Response:
df.loc[:df.Req.eq(15).idxmax(), 'Response'].min()
# 0
Where:
df.Req.eq(15)
0 False
1 False
2 False
3 True
4 False
Name: Req, dtype: bool
And the idxmax will return the index of the first True occurrence, in this case 3.
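One caveat with this approach: idxmax on an all-False mask still returns the first index, so if 15 never occurs it would silently take the minimum over just the first row. A small guard (sketch) avoids that:

mask = df.Req.eq(15)
# fall back explicitly when 15 is absent instead of relying on idxmax
result = df.loc[:mask.idxmax(), 'Response'].min() if mask.any() else 'no match'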

How to find rows where a particular column has decimal numbers using pandas?

I am writing a data quality script using pandas, where the script checks certain conditions on each column.
At the moment I need to find the rows that don't have a decimal or an actual number in a particular column. I am able to find the numbers when they are whole numbers, but the methods I have seen so far, i.e. isdigit(), isnumeric(), isdecimal() etc., fail to correctly identify when the number is a decimal number, e.g. 2.5 or 0.1245.
Following is some sample code & data:
>>> df = pd.DataFrame([
...     [np.nan, 'foo', 0],
...     [1, '', 1],
...     [-1.387326, np.nan, 2],
...     [0.814772, ' baz', ' '],
...     ["a", ' ', 4],
...     [" ", 'foo qux ', ' '],
... ], columns='A B C'.split(), dtype=str)
>>> df
           A         B  C
0        NaN       foo  0
1          1            1
2  -1.387326       NaN  2
3   0.814772       baz
4          a            4
5              foo qux
>>> df['A']
0          NaN
1            1
2    -1.387326
3     0.814772
4            a
5
Name: A, dtype: object
The following methods all fail to identify the decimal numbers:
df['A'].fillna('').str.isdigit()
df['A'].fillna('').str.isnumeric()
df['A'].fillna('').str.isdecimal()
0 False
1 True
2 False
3 False
4 False
5 False
Name: A, dtype: bool
So when I try the following, I only get 1 row:
>>> df[df['A'].fillna('').str.isdecimal()]
A B C
1 1 1
NB: I am using dtype=str to get the data without pandas interpreting/changing the values of the dtypes. The actual data could have spaces in column A; I will trim that out using replace(). I have kept the code simple here so as not to confuse things.
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then test with Series.notna:
print (pd.to_numeric(df['A'], errors='coerce').notna())
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
If you need True returned for missing values:
print (pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna())
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
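Since the goal is a data quality script, the same mask also extracts the offending rows themselves (a short sketch reusing the masks above):

# keep only rows whose 'A' is neither numeric nor missing
mask = pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna()
print(df[~mask])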
Another solution with custom function:
def test_numeric(x):
    try:
        float(x)
        return True
    except Exception:
        return False
print (df['A'].apply(test_numeric))
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
print (df['A'].fillna('').apply(test_numeric))
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
Alternatively, if you want to keep the string structure, you can check for a literal dot (the dot must be escaped or regex disabled, otherwise it matches any character):
df['A'].str.contains('.', regex=False, na=False)
0    False
1    False
2     True
3     True
4    False
5    False
Name: A, dtype: bool
The only risk in that case is that you would also flag words that happen to contain a dot, which is not what you want.

Pandas incremental values when boolean changes

There is a dataframe that contains a column in which boolean values alternate. I want to create an incremental value series based on those boolean changes, incrementing only when the boolean value differs from the previous one. I want to do this without a loop.
Example. Here's the dataframe:
column
0 True
1 True
2 False
3 False
4 False
5 True
I want to get this:
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Compare the column with its shifted version using ne (not equal), then take the cumulative sum:
df['inc'] = df['column'].ne(df['column'].shift()).cumsum()
print (df)
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Detail:
print (df['column'].ne(df['column'].shift()))
0 True
1 False
2 True
3 False
4 False
5 True
Name: column, dtype: bool
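The inc labels are then usable as ordinary group keys, for example to get each run's value and length (a usage sketch on the same df):

# one row per run: the run's boolean value and how long it lasted
print(df.groupby('inc')['column'].agg(['first', 'size']))
     first  size
inc
1     True     2
2    False     3
3     True     1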

List columns in data frame that have missing values as '?'

List the names of the columns of a data frame, along with the count of missing values, where missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:,0].str.split(',',-1,expand=True)
bridge_sep.columns = ['IDENTIF','RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES','CLEAR-G', 'T-OR-D',
'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet. It's actually [107 rows x 13 columns].
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the occurrences with sum (True values are counted as 1), then drop the zero counts by boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Last, for a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
# compare with a same-length Series whose index matches
# the index/columns of the DataFrame
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True
If your DataFrame is named df, try (df == '?').sum()
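An alternative is to convert the '?' markers to real NaN first and then use the usual missing-value tools; a sketch on the bridge_sep frame from the question:

import numpy as np

# turn the '?' markers into real NaN, then count with isna
bridge_sep = bridge_sep.replace('?', np.nan)
s = bridge_sep.isna().sum()
print(s[s != 0])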

pandas transform nunique on groupby object dealing with nan values

I have the following df,
inv_id  cluster_id
   793           2
                 2
   789           3
   789           3
                 4
                 4
I'd like to group by cluster_id and check how many unique values each group has,
df['same_inv_id'] = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
but I'd like to set same_inv_id = False for any cluster that contains one or more empty/blank inv_id values (including clusters that are entirely blank), so the result will look like:
inv_id  cluster_id  same_inv_id
   793           2        False
                 2        False
   789           3         True
   789           3         True
                 4        False
                 4        False
IIUC, build the condition, then use transform with all:
s1=df.inv_id.ne('').groupby(df.cluster_id).transform('all')
s1
Out[432]:
0 False
1 False
2 True
3 True
4 False
5 False
Name: inv_id, dtype: bool
s2 = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
df['same_inv_id'] = s1 & s2
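Putting it together, the combined mask reproduces the expected column (checked against the question's df):

print(df)
  inv_id  cluster_id  same_inv_id
0    793           2        False
1                  2        False
2    789           3         True
3    789           3         True
4                  4        False
5                  4        False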
