Convert pandas series strings to numbers - python

The following series contains results as strings of lists, whose values are either PASS or FAIL.
Input:
result
"['PASS','FAIL']"
"['PASS','FAIL','PASS','FAIL']"
"['FAIL','FAIL']"
Output:
result
1
1
0
If a row has at least one PASS value, return 1; otherwise return 0.

If the values are lists, use the in operator:
df['result'] = [int('PASS' in x) for x in df['result']]
# alternative solution
df['result'] = df['result'].apply(lambda x: 'PASS' in x).astype(int)
If the values are strings, use Series.str.contains:
df['result'] = df['result'].str.contains('PASS').astype(int)

A simple and fast approach: use a regex with str.contains:
# if you want a robust check
df['result'] = df['result'].str.contains(r'\bPASS\b').astype(int)
# or, if you're sure the values only contain PASS/FAIL
df['result'] = df['result'].str.contains('PASS').astype(int)
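For reference, a minimal self-contained run on the sample data above (the column holds plain strings, so the string-based branch applies):

import pandas as pd

df = pd.DataFrame({'result': ["['PASS','FAIL']",
                              "['PASS','FAIL','PASS','FAIL']",
                              "['FAIL','FAIL']"]})

# substring check on the raw string representation of each list
df['result'] = df['result'].str.contains(r'\bPASS\b').astype(int)
print(df)
#    result
# 0       1
# 1       1
# 2       0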

Related

Division with the help of an if statement

I am trying to divide a combined number by 2 if both of the inputs are more than 0:
import pandas as pd

data = {'test': [1, 1, 0], 'test2': [0, 1, 0]}
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df
I am looking for a way (probably an if statement) to divide df['combined'] by 2 if both test and test2 have a value of 1.
I've tried this, but it gives an error:
if (df['test'] > 1) and (df['test2'] > 1):
    df['combined'] / 2
else:
    df['combined']
What is the best way to do this?
There are two problems of the same kind in your if statement. First, you should know that the result of (df['test'] > 1) is a pandas Series:
0    False
1    False
2    False
Name: test, dtype: bool
Neither if nor and can handle a pandas Series, because the truth value of a Series is ambiguous; that's why you see the error.
Instead, you can use np.where() to choose values based on a condition. Two fixes relative to the attempt above: test for values greater than 0 (a > 1 check never matches inputs of 1), and put the halved result in the True branch:
mask = (df['test'] > 0) & (df['test2'] > 0)
df['combined'] = np.where(mask, df['combined'] / 2, df['combined'])
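A complete runnable version of that fix, assuming "both inputs are more than 0" is the intended condition:

import numpy as np
import pandas as pd

df = pd.DataFrame({'test': [1, 1, 0], 'test2': [0, 1, 0]})
df['combined'] = df['test'] + df['test2']

# halve 'combined' only where both inputs are positive
mask = (df['test'] > 0) & (df['test2'] > 0)
df['combined'] = np.where(mask, df['combined'] / 2, df['combined'])
print(df)
#    test  test2  combined
# 0     1      0       1.0
# 1     1      1       1.0
# 2     0      0       0.0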

How can I add pandas "match" based on column list values and value in additional column?

I have a dataframe with a column of comma-separated identifiers called Multiple_IDs and a column called ID. I would like to create an additional column called "Match" that tells whether the ID is contained in the Multiple_IDs column or not. The output should be an additional column called Match containing True or False values. Here is some sample input data:
data = {'ID':[2128441, 2128447, 2128500], 'Multiple_IDs':["2128442, 2128443, 2128444, 2128441", "2128446, 2128447", "2128503, 2128508"]}
df = pd.DataFrame(data)
The Multiple_IDs column has dtype object.
Given the input above, the desired Match values are True, True, False.
I know I can achieve this using explode and then comparing the values, but I am wondering if there is something more elegant?
Use the in operator if it is possible to test without separating each ID:
df['Match'] = [str(x) in y for x, y in df[['ID','Multiple_IDs']].to_numpy()]
print (df)
ID Multiple_IDs Match
0 2128441 2128442, 2128443, 2128444, 2128441 True
1 2128447 2128446, 2128447 True
2 2128500 2128503, 2128508 False
Or:
df['Match'] = df.apply(lambda x: str(x['ID']) in x['Multiple_IDs'], axis=1)
print (df)
ID Multiple_IDs Match
0 2128441 2128442, 2128443, 2128444, 2128441 True
1 2128447 2128446, 2128447 True
2 2128500 2128503, 2128508 False
Another idea is to match against the split values:
df['Match'] = [str(x) in y.split(', ') for x, y in df[['ID','Multiple_IDs']].to_numpy()]
df['Match'] = df.apply(lambda x: str(x['ID']) in x['Multiple_IDs'].split(', '), axis=1)
What I would do:
s = pd.DataFrame(df.Multiple_IDs.str.split(', ').tolist(), index=df.index).eq(df.ID.astype(str), axis=0).any(axis=1)
Out[10]:
0 True
1 True
2 False
dtype: bool
df['Match']=s
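The explode route mentioned in the question also works; a sketch for comparison:

import pandas as pd

data = {'ID': [2128441, 2128447, 2128500],
        'Multiple_IDs': ["2128442, 2128443, 2128444, 2128441",
                         "2128446, 2128447",
                         "2128503, 2128508"]}
df = pd.DataFrame(data)

# one ID per row, compare, then collapse back per original row index
exploded = df.assign(Multiple_IDs=df['Multiple_IDs'].str.split(', ')).explode('Multiple_IDs')
df['Match'] = exploded['Multiple_IDs'].eq(exploded['ID'].astype(str)).groupby(level=0).any()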

How can I efficiently and idiomatically filter rows of a Pandas DataFrame based on multiple StringMethods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive using a string search; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations, so the following approach might be more efficient (note it assumes the number of interest is always the third dash-separated field):
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
One way would be to use a regex and an apply function. I find it easier to work with a regex in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= int(z.groups()[0]) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
Another play on regex:
# string starts with abc, greedy search,
# then look for either 1234- or 2345-,
# then a 4 digit number and whatever else after
# (note: this matches only the two boundary values, not the whole 1234-2345 range)
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
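For a self-contained version that follows the stated spec, here is a sketch that extracts the first "-digits-digits-" pair and compares it numerically (assuming that pair is always the number of interest):

import pandas as pd

df = pd.DataFrame({'col': [
    'abc:kk__LL-z12-1234-5678-kk__z',
    'def:kk_A_LL-z12-1234-5678-kk_ss_z',
    'abc:kk_AAA_LL-z12-5678-5678-keek_st_z',
    'abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x',
]})

# first number of the first '-num-num-' pair, compared as an integer
num = pd.to_numeric(df['col'].str.extract(r'-(\d+)-\d+-', expand=False))
out = df[df['col'].str.startswith('abc:') & num.between(1234, 2345)]
# keeps rows 0 and 3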

Find and replace substrings in a Pandas dataframe ignore case

df.replace('Number', 'NewWord', regex=True)
How do I replace Number, number, or NUMBER with NewWord?
Same as you'd do with the standard regex, using the i flag.
df = df.replace('(?i)Number', 'NewWord', regex=True)
Granted, df.replace is limiting in the sense that flags must be passed as part of the regex string rather than as a flags argument. If this were str.replace, you could have used case=False or flags=re.IGNORECASE.
Simply use case=False in str.replace.
Example:
df = pd.DataFrame({'col':['this is a Number', 'and another NuMBer', 'number']})
>>> df
col
0 this is a Number
1 and another NuMBer
2 number
df['col'] = df['col'].str.replace('Number', 'NewWord', case=False)
>>> df
col
0 this is a NewWord
1 and another NewWord
2 NewWord
[Edit]: If you are looking for the substring in multiple columns, you can select all columns with object dtype and apply the solution above to them. Example:
>>> df
col col2 col3
0 this is a Number numbernumbernumber 1
1 and another NuMBer x 2
2 number y 3
str_columns = df.select_dtypes('object').columns
df[str_columns] = (df[str_columns]
                   .apply(lambda x: x.str.replace('Number', 'NewWord', case=False)))
>>> df
col col2 col3
0 this is a NewWord NewWordNewWordNewWord 1
1 and another NewWord x 2
2 NewWord y 3
Brutish. This only works if the whole string is either 'Number' or 'NUMBER'. It will not replace those within a larger string. And of course, it is limited to just those two words.
df.replace(['Number', 'NUMBER'], 'NewWord')
More Brute Force
If it wasn't obvious enough, this is far inferior to #coldspeed's answer
import re
# assumes every cell is a string; re.sub would raise on non-string cells
df.applymap(lambda x: re.sub('number', 'NewWord', x, flags=re.IGNORECASE))
Or with a cue from #coldspeed's answer
df.applymap(lambda x: re.sub('(?i)number', 'NewWord', x))
This solution will work if the text you're looking to convert is in specific columns of your dataframe:
df['COL_n'] = df['COL_n'].str.lower()
df['COL_n'] = df['COL_n'].replace('number', 'NewWord', regex=True)

Is there a way in pandas to filter rows in a column that are contained in a string

There is a way to check if the string in the column contains another string:
df["column"].str.contains("mystring")
But I'm looking for the opposite: the column string being contained in another string, without using apply, which I assume is slower than the vectorised .contains:
df["column"].apply(lambda x: x in "mystring")
Update with data:
mystring = "abc"
df = pd.DataFrame({"A": ["ab", "az"]})
df
A
0 ab
1 az
I would like to show only "ab" because it is contained in mystring.
Only one option (jpp had it) - iterate with a list comprehension:
df[[r in mystring for r in df.A]]
A
0 ab
Or,
pd.DataFrame([r for r in df.A if r in mystring], columns=['A'])
A
0 ab
pd.Series([i in mystring for i in df.A])
Output:
0 True
1 False
dtype: bool
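Putting it together, an end-to-end sketch that builds the mask with the original index so it aligns when filtering:

import pandas as pd

mystring = "abc"
df = pd.DataFrame({"A": ["ab", "az"]})

# boolean mask: True where the cell value is a substring of mystring
mask = pd.Series([a in mystring for a in df.A], index=df.index)
print(df[mask])
#     A
# 0  ab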
