Check for numeric value in text column - python

Five columns (col1 - col5) in a 10-column dataframe (df) should be either blank or contain text values only. If any row in these 5 columns has an all-numeric value, I need to trigger an error. I wrote the following code to identify rows where the value is all-numeric in 'col1' (I will cycle through all 5 columns using the same code):
df2 = df[df['col1'].str.isnumeric()]
I get the following error: ValueError: cannot mask with array containing NA / NaN values
This is triggered because the blank values produce NaN instead of False. I can see this when I build the mask as a separate Series:
lst = df['col1'].str.isnumeric()
Any suggestions on how to solve this? Thanks

Try this to work around the NaN:
import pandas as pd

df = pd.DataFrame([{'col1': 1}, {'col1': 'a'}, {'col1': None}])

# astype(str) turns None into the string 'None' and numbers into digit strings,
# so isnumeric() returns a clean boolean Series with no NaN
lst = df['col1'].astype(str).str.isnumeric()
if lst.any():
    raise ValueError()
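If you'd rather not cast the column, you can keep .str.isnumeric() and just turn the NaN entries of the mask into False before indexing. A minimal sketch, assuming the five columns hold strings or blanks (the loop, column list, and error message are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': ['abc', '123', None]})

for col in ['col1']:  # extend to ['col1', ..., 'col5'] on the real frame
    # NaN mask entries come from blank cells; treat them as "not numeric"
    mask = df[col].str.isnumeric().fillna(False).astype(bool)
    if mask.any():
        raise ValueError(f"{col} has all-numeric values at rows {mask[mask].index.tolist()}")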

Here's a way to do it:
import string

df['flag'] = (df
              .applymap(lambda x: any(i in string.digits for i in x))
              .apply(lambda x: f'Fail: {",".join(df.columns[x].tolist())} is numeric', axis=1))
print(df)
   col1  col2                   flag
0     a  2.04  Fail: col2 is numeric
1  2.02     b  Fail: col1 is numeric
2     c     c       Fail:  is numeric
3     d     e       Fail:  is numeric
Explanation:
We iterate through each value of the dataframe, check whether any character in it is a digit, and return a boolean.
We then use that boolean row to subset the column names for the failure message.
Sample Data
df = pd.DataFrame({'col1': ['a', '2.02', 'c', 'd'],
                   'col2': ['2.04', 'b', 'c', 'e']})
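Note the lambda assumes every cell is a string: iterating a NaN (a float) raises TypeError. If blanks are possible, a guarded variant would be a sketch like:
import string

import pandas as pd

df = pd.DataFrame({'col1': ['a', '2.02', None, 'd'],
                   'col2': ['2.04', 'b', 'c', 'e']})

# Treat non-strings (NaN/None) as containing no digits instead of raising
has_digit = df.applymap(lambda x: isinstance(x, str)
                        and any(ch in string.digits for ch in x))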


How to select only string (non-numeric) columns when there are mixed type columns?

Suppose I have a data frame with three columns of dtypes float, int, and object:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
I need to replace the np.nan with None, which is an object (since None becomes NULL when imported into PostgreSQL).
df.replace({np.nan: None}, inplace=True)
I think (correct me if I'm wrong) None cannot be used in any NumPy/Pandas array except for arrays with dtype object. And so 'col1' above becomes an object column after replace. Now, if I wanted to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), which returns all object dtype columns, including 'col1'. I've been working around this by using this hacky solution:
# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack: after this, 'col1' becomes float again since None reverts to np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))
I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
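For what it's worth, a shorter spelling of the same hack would be infer_objects(), which soft-converts object columns holding only numbers and None back to numeric dtypes, leaving the true string columns as the only object ones (a sketch):
# Re-infer dtypes, then select the remaining object (string) columns
df.infer_objects().select_dtypes(include=object)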
Another solution
Based on Mayank Porwal's solution below:
# The list comprehension returns a boolean list
df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns.tolist()]]
Based on your sample df, you can do something like this:
After replacing np.nan with None, col1 becomes an object:
In [1413]: df.dtypes
Out[1413]:
col1 object
col2 int64
col3 object
dtype: object
To pick the columns which contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the column becomes all NaN using isna:
In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']
In [1424]: for i in cols:
...: if pd.to_numeric(df[i], errors='coerce').isna().all():
...: print(f'{i}: String col')
...: else:
...: print(f'{i}: number col')
...:
col1: number col
col3: String col
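Since the question asks for the subset itself rather than printed labels, the same check can drive the selection directly. A small sketch along the same lines, reusing df from above:
# Of the object columns, keep those where numeric coercion fails everywhere
string_cols = [c for c in df.select_dtypes('object').columns
               if pd.to_numeric(df[c], errors='coerce').isna().all()]
df_strings = df[string_cols]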
Reverse your two operations:
1. Extract the object columns and process them.
2. Convert NaN to None just before export to pgsql.
>>> df.dtypes
col1 float64
col2 int64
col3 object
dtype: object
# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))
# Step 2: replace nan by None
>>> df.replace({np.nan: None}, inplace=True)
>>> df
col1 col2 col3
0 1.0 3 THIS IS A TEXT COLUMN
1 2.0 4 THIS IS A TEXT COLUMN
2 None 5 THIS IS A TEXT COLUMN
3 5.0 4 THIS IS A TEXT COLUMN

Truncate string and replace with "X" Python Pandas DataFrame

I have a df such as:
d = {'col1': [11111111, 2222222]}
df = pd.DataFrame(data=d)
df
col1
0 11111111
1 2222222
I need to mask everything except the last four characters with "X", such that the new df would be
d = {'col1': ['XXXX1111', 'XXX2222']}
df = pd.DataFrame(data=d)
df
col1
0 XXXX1111
1 XXX2222
I'm still new to Python and have been able to, for example, slice the last four characters, but I have not been able to replace everything else with X's.
Also, the strings can be different lengths, so the number of X's depends on the length of each string. That in particular is what has given me trouble; if they were all the same length this would be much easier.
You can use .str.replace() with a regex and a callable replacement (regex=True is required when repl is a callable):
df.col1 = df.col1.astype(str).str.replace(
    # (.*) is greedy, so (.{4}) is pinned to the last four characters
    r"^(.*)(.{4})$",
    lambda g: "X" * len(g.group(1)) + g.group(2),
    regex=True,
)
print(df)
Prints:
col1
0 XXXX1111
1 XXX2222
df['col1'] = list(map(lambda l: 'X' * (l - 4), df['col1'].astype(str).apply(len))) + df['col1'].astype(str).str[-4:]
map() repeats 'X' n-4 times, where n is the length of each element in col1.
.str[-4:] takes the last 4 characters of the col1 column.
# print(df)
col1
0 XXXX1111
1 XXX2222
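If the line above is hard to read, an apply-based version of the same idea is a sketch like the following (max() guards strings shorter than four characters):
df['col1'] = df['col1'].astype(str).apply(lambda s: 'X' * max(len(s) - 4, 0) + s[-4:])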

Pandas mapping all, and a portion, of column value in another column

I am trying to match values, and portions of values, from one column against another and return a third value.
Essentially, I have two dataframes: df and df2. The first has a part number in 'col1'. The second has the part number, or a portion of it, in 'col1', and the value I want to put in df['col2'] in 'col2'.
import pandas as pd
df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
                            '2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
                    'col2': ['A', 'B', 'C', 'D']})
Of course this:
df['col1'].isin(df2['col1'])
Only covers exact matches, not partial ones:
df['col1'].isin(df2['col1'])
Out[27]:
0 True
1 True
2 True
3 False
4 False
5 False
Name: col1, dtype: bool
I tried:
df[df['col1'].str.contains(df2['col1'])]
but get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I also tried using a dictionary made from df2, with the same approaches as above and also with map(), with no luck.
The results for df I need would look like this:
col1 col2
'1-1-1' 'A'
'1-1-2' 'B'
'1-1-3' 'C'
'2-1-1' 'D'
'2-1-2' 'D'
'2-1-3' 'D'
I can't figure out how to get the 'D' value into 'col2' because df2['col1'] contains '2-1'--only a portion of the part number.
Any help would be greatly appreciated. Thank you in advance.
We can do str.findall:
s = df.col1.str.findall('|'.join(df2.col1.tolist())).str[0].map(df2.set_index('col1').col2)
df['New'] = s
df
col1 New
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
If your df and df2 have the specific format shown in the sample, another way is using a dict map with fillna, mapping from rsplit:
d = dict(df2[['col1', 'col2']].values)
df['col2'] = df.col1.map(d).fillna(df.col1.str.rsplit('-', n=1).str[0].map(d))
Out[1223]:
col1 col2
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
Otherwise, besides using findall as in Wen's solution, you may also use extract with the dict d from above:
df.col1.str.extract('('+'|'.join(df2.col1)+')')[0].map(d)
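One caveat with both findall and extract: the pattern is built straight from the part numbers, so regex metacharacters would need escaping, and a short part number that is a prefix of a longer one should come later in the alternation. A defensive sketch, reusing d from above:
import re

# Escape each part number and try longer ones first so prefixes don't win
pattern = '(' + '|'.join(sorted((re.escape(p) for p in df2.col1),
                                key=len, reverse=True)) + ')'
df['col2'] = df.col1.str.extract(pattern)[0].map(d)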

Excel Pandas Python question on IndexingError: can search for and remove columns containing certain words fine, but not rows

I am using this code
searchfor = ["s", 'John']
df = df[~df.iloc[1].astype(str).str.contains('|'.join(searchfor),na=False)]
This returns the error
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
However, this works fine when run as a column search:
df = df[~df.iloc[:,1].astype(str).str.contains('|'.join(searchfor), na=False)]
I am trying to remove a row based on whether the row contains a certain phrase.
To drop rows
Create a mask which returns True or False depending on whether a cell contains any of your strings:
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Then use .any(axis=1) to flag rows with at least one True, and keep the rows where no True was found via boolean indexing:
df_filtered = df[~mask.any(axis=1)]
To drop columns
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Use axis=0 instead of 1 to check each column:
columns_analysis = mask.any(axis=0)
Get the column labels that are True and drop them:
columns_to_drop = columns_analysis[columns_analysis].index.tolist()
df_filtered = df.drop(columns_to_drop, axis=1)
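Putting both halves together as a self-contained sketch (the sample frame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'col1': ['John', 'Mary'], 'col2': ['apple', 'pear']})
search_for = ['s', 'John']

# True wherever a cell contains any of the search strings
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))

rows_kept = df[~mask.any(axis=1)]          # drop matching rows
cols_kept = df.loc[:, ~mask.any(axis=0)]   # drop matching columns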
This is related to the way you are slicing your data.
In the first statement you are asking for the second row (df.iloc[1]; indexing starts at 0, so use 0 for the first row), while in the second case you are asking for the second column, and in your dataframe the rows and columns have different shapes. See this example:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [23, 23]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2  col3
0     1     3    23
1     2     4    23
First row:
df.iloc[0]
col1 1
col2 3
col3 23
Name: 0, dtype: int64
First column:
df.iloc[:, 0]
0    1
1    2
Name: col1, dtype: int64
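For the actual goal of dropping rows that contain a phrase, a row-wise version of the str.contains pattern from your question might look like this (a sketch):
searchfor = ["s", "John"]

# For each row, check whether any cell contains one of the search strings
mask = df.apply(lambda row: row.astype(str).str.contains('|'.join(searchfor), na=False).any(), axis=1)
df = df[~mask]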
Try this, and good luck.

How to apply pandas.DataFrame.replace on selected columns with inplace = True?

import pandas as pd
df = pd.DataFrame({
    'col1': [99, 99, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, 9]
})
col_list = ['col1', 'col2']
df[col_list].replace(99, 0, inplace=True)
This generates a Warning and leaves the dataframe unchanged.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I want to be able to apply the replace method on a subset of the columns specified by the user. I also want to use inplace = True to avoid making a copy of the dataframe, since it is huge. Any ideas on how this can be accomplished would be appreciated.
When you select the columns for replacement with df[col_list], a slice (a copy) of your dataframe is created. The copy is updated, but never written back into the original dataframe.
You should either replace one column at a time or use nested dictionary mapping:
df.replace(to_replace={'col1': {99: 0}, 'col2': {99: 0}},
           inplace=True)
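The one-column-at-a-time alternative mentioned above is just a loop; assigning the result back to df[col] keeps the change in the original frame (a sketch):
for col in col_list:
    df[col] = df[col].replace(99, 0)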
The nested dictionary for to_replace can be generated automatically:
d = {col : {99:0} for col in col_list}
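which you can then pass straight to replace, e.g.:
d = {col: {99: 0} for col in col_list}
df.replace(to_replace=d, inplace=True)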
You can use replace with loc. Here is a slightly modified version of your sample df:
d = {'col1':[99,99,9],'col2':[99,5,6],'col3':[7,None,99]}
df = pd.DataFrame(data=d)
col_list = ['col1','col2']
df.loc[:, col_list] = df.loc[:, col_list].replace(99,0)
You get
   col1  col2  col3
0     0     0   7.0
1     0     5   NaN
2     9     6  99.0
