Cannot sum rows that match a regular expression in pandas / Python

I can find the number of rows in a column in a pandas dataframe that do NOT follow a pattern but not the number of rows that follow the very same pattern!
This works:
df.report_date.apply(lambda x: (not re.match(r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}', x))).sum()
This does not: removing 'not' does not tell me how many rows match; instead it raises a TypeError. Any idea why that would be the case?
df.report_date.apply(lambda x: (re.match(r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}', x))).sum()

df = pd.DataFrame(dict(
    report_date=[
        '2001-02-04',
        '2016-11-12',
        '1-1-1999',
        '02-28-2012',
        '1995-09-30'
    ]
))
df
regex = r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'
matches = df.report_date.str.match(regex)
print('does match: {}\ndoesn\'t match: {}'.format(
    matches.sum(),
    (~matches).sum()
))
does match: 3
doesn't match: 2
or
regex = r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'
df.groupby(df.report_date.str.match(regex)).size()
report_date
False 2
True 3
dtype: int64

The problem is that re.match does not return True when it matches; it returns a match object (and None when there is no match). Pandas cannot sum these objects because they are not numeric values. The reason the sum works with 'not' is that 'not' coerces the result to a plain boolean, and pandas can sum booleans by counting each True as 1.
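A minimal sketch, assuming the df above: coercing the match result with bool() makes the matching rows summable too, without inverting the logic:

import re
import pandas as pd

df = pd.DataFrame({'report_date': [
    '2001-02-04', '2016-11-12', '1-1-1999', '02-28-2012', '1995-09-30'
]})

# bool(<match object>) is True and bool(None) is False,
# so the resulting Series sums cleanly
matches = df.report_date.apply(
    lambda x: bool(re.match(r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}', x))
)
print(matches.sum())     # 3 (rows that match)
print((~matches).sum())  # 2 (rows that don't)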

Related

Convert pandas series strings to numbers

The following series contains results as strings of lists with values of either PASS or FAIL.
Input:
result
"['PASS','FAIL']"
"['PASS','FAIL','PASS','FAIL']"
"['FAIL','FAIL']"
Output:
result
1
1
0
If any row has at least one PASS value, return 1, else return 0.
If the values are real lists, use an in test:
df['result'] = [int('PASS' in x) for x in df['result']]
#alternative solution
df['result'] = df['result'].apply(lambda x: 'PASS' in x).astype(int)
If they are strings, use Series.str.contains:
df['result'] = df['result'].str.contains('PASS').astype(int)
A simple and fast approach: use a regex with str.contains:
# if your want a robust check
df['result'] = df['result'].str.contains(r'\bPASS\b').astype(int)
# or if you're sure there are only PASS/FAIL
df['result'] = df['result'].str.contains('PASS').astype(int)
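A minimal end-to-end sketch of the string case, assuming the example values above are stored as plain strings:

import pandas as pd

df = pd.DataFrame({'result': [
    "['PASS','FAIL']",
    "['PASS','FAIL','PASS','FAIL']",
    "['FAIL','FAIL']",
]})

# the substring test gives True/False, astype(int) turns that into 1/0
df['result'] = df['result'].str.contains(r'\bPASS\b').astype(int)
print(df['result'].tolist())  # [1, 1, 0]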

Replace '-' by 'E-' in dataframe cell IF the '-' is in the middle of a string

I have a huge dataframe composed of 7 columns.
Extract:
45589 664865.0 100000.0 7.62275 -.494 1.60149 100010
...
57205 718888.0 100000.0 8.218463 -1.405-3 1.75137 100010
...
55143 711827.0 100000.0 8.156107 9.8336-3 1.758051 100010
As these values come from an input file, they are currently all of string type and I would like to convert the whole dataframe to float through:
df = df.astype('float')
However, as you might have noticed on the extract, there are ' - ' hiding. Some represent the negative value of the whole number, such as -.494 and others represent a negative power, such as 9.8-3.
I need to replace the latter with 'E-' so Python understands it's a power and can convert the cell to a float type. Usually, I would use:
df = df.replace('-', 'E-', regex=True)
However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex
str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)
However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.
Is there a combination of the functions replace and re.sub I could use in order to replace only the '-' surrounded by other characters with 'E-'?
Thank you for your help!
You can use a regex lookbehind and lookahead to assert that the hyphen is in the middle of the string for replace, as follows:
df = df.replace(r'\s', '', regex=True) # remove any unwanted spaces
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)
Result:
print(df)
0 1 2 3 4 5 6
0 45589 664865.0 100000.0 7.62275 -.494 1.60149 100010
1 57205 718888.0 100000.0 8.218463 -1.405E-3 1.75137 100010
2 55143 711827.0 100000.0 8.156107 9.8336E-3 1.758051 100010
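With the hyphens normalized, the float conversion the question was after should then succeed. A compact sketch of the whole pipeline, assuming a single-column frame with the three problem values:

import pandas as pd

df = pd.DataFrame({'val': ['-.494', '-1.405-3', '9.8336-3']})
df = df.replace(r'\s', '', regex=True)               # remove any unwanted spaces
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)   # only mid-string hyphens
df = df.astype(float)
print(df['val'].tolist())  # [-0.494, -0.001405, 0.0098336]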
Regular expressions can be expensive; perhaps slice the string into the first character and the remaining characters, run replace on the remaining characters, then recombine with the first character. Haven't benchmarked this though! Something like this, applied with df.str_col.apply(f):
def f(x):
    first_part = x[0]  # keep a possible leading minus sign intact
    remaining_part = x[1:].replace('-', 'E-')
    return first_part + remaining_part

f('-1.23-4')  # '-1.23E-4'
Or applied element-wise over the whole frame (assuming the seven string columns are the only columns in your df, otherwise select the columns first):
df = df.applymap(lambda x: x[0] + x[1:].replace('-', 'E-'))
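For instance, with the three problem values in one column (a sketch; applymap assumes every cell is a string, and newer pandas spells it df.map):

import pandas as pd

df = pd.DataFrame({'val': ['-.494', '-1.405-3', '9.8336-3']})
df = df.applymap(lambda x: x[0] + x[1:].replace('-', 'E-'))
print(df['val'].tolist())  # ['-.494', '-1.405E-3', '9.8336E-3']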
I tried this example and worked:
import pandas as pd
df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')
You could use a capture group, as described in this thread, to select the number before the exponent, so that:
first: the match only occurs when the minus is preceded by digits,
and second: the match is replaced by those digits followed by E-. For example, 158-3 is replaced "dynamically" by the value 158 matched in group 1, via the expression \1 (group 1 content), and "statically" by E-.
This gives :
df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)
(You can verify it on a regex tester.)
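A quick check of that group-based replacement on the problem values (a sketch):

import pandas as pd

df = pd.DataFrame({'val': ['-.494', '-1.405-3', '9.8336-3']})
df.replace({r'(\d+)-': r'\1E-'}, inplace=True, regex=True)
print(df['val'].astype(float).tolist())  # [-0.494, -0.001405, 0.0098336]

The leading minus of -.494 survives because the pattern requires at least one digit before the hyphen.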

Issues with extracting substrings of a string in Python Pandas Dataframe

I have an expression like the following (one row of a column, say 'old_col', in a pandas dataframe; the top two rows of the column are shown):
abcd_6.9_uuu ghaha_12.8 _sksks
abcd_5.2_uuu ghaha_13.9 _sksks
I was trying to use str.extract on the dataframe to get the two floating-point numbers. However, I find two issues: only the first one is picked up (6.9 from the first row and 5.2 from the second row).
1. So how can I get all of them?
2. Also, how can I make the extract pattern general enough to pick up numbers with any number of digits (5.7 or 12.9 alike)?
I am using:
df['newcol'] = df['old_col'].str.extract('(_\d.\d)')
To get more than one digit,
df['col'].str.extract(r'(?P<col>_\d+\.\d+)')
    col
0  _6.9
1  _5.2
To get all occurrences, use str.extractall
df['col'].str.extractall(r'(?P<col>_\d+\.\d+)')
           col
  match
0 0       _6.9
  1      _12.8
1 0       _5.2
  1      _13.9
To assign back to df:
s = df['col'].str.extractall(r'(?P<col>_\d+\.\d+)')['col']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)
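Putting that together with the question's data (a sketch; the column is called old_col here to match the question, and the group name num is illustrative):

import pandas as pd

df = pd.DataFrame({'old_col': [
    'abcd_6.9_uuu ghaha_12.8 _sksks',
    'abcd_5.2_uuu ghaha_13.9 _sksks',
]})

s = df['old_col'].str.extractall(r'(?P<num>_\d+\.\d+)')['num']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)
print(df['new_col'].tolist())  # [['_6.9', '_12.8'], ['_5.2', '_13.9']]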
You can use Series.str.findall:
import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
# old_col newcol newcol_str
# 0 abcd_6.9_uuu ghaha_12.8 _sksks [6.9, 12.8] 6.9, 12.8
# 1 abcd_5.2_uuu ghaha_13.9 _sksks [5.2, 13.9] 5.2, 13.9
Regex details:
\d+(?:\.\d+)? - one or more digits followed with an optional occurrence of a . and one or more digits
\d+\.\d+ would match only float values where the . is obligatory between at least two digits.
Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists; with .str.join(', '), the newcol_str column contains strings with the found matches merged.
If you must check that the numbers occur between underscores, add underscores on both sides of the pattern and wrap the number-matching part in a capturing group:
.str.findall(r'_(\d+(?:\.\d+)?)_')
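Continuing from the df above, note that the double-anchored pattern is stricter: 12.8 and 13.9 are each followed by a space rather than an underscore, so they drop out:

df['between_underscores'] = df['old_col'].str.findall(r'_(\d+(?:\.\d+)?)_')
# 0    [6.9]
# 1    [5.2]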

Validate strings using regex in pandas

I need a bit of help.
I'm pretty new to Python (I use version 3.0 bundled with Anaconda) and I want to use a regex to validate/return a list of only the valid numbers that match a criterion (say \d{11} for 11 digits). I'm getting the list using pandas:
df = pd.DataFrame(columns=['phoneNumber','count'], data=[
    ['08034303939',11],
    ['08034382919',11],
    ['0802329292',10],
    ['09039292921',11]])
When I return all the items using
for row in df.iterrows():  # dataframe.iterrows() returns tuples
    print(row[1][0])
it returns all items without regex validation, but when I try to validate with this
for row in df.iterrows():  # dataframe.iterrows() returns tuples
    print(re.compile(r"\d{11}").search(row[1][0]).group())
it raises an AttributeError (since search returns None for values that don't match).
How can I work around this, or is there an easier way?
If you want to validate, you can use str.match and convert the result to a boolean mask with astype(bool):
x = df['phoneNumber'].str.match(r'\d{11}').astype(bool)
x
0 True
1 True
2 False
3 True
Name: phoneNumber, dtype: bool
You can use boolean indexing to return only rows with valid phone numbers.
df[x]
phoneNumber count
0 08034303939 11
1 08034382919 11
3 09039292921 11
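One caveat worth noting: str.match only anchors at the start of the string, so a value with more than 11 digits would still pass \d{11}. If the whole string must match, Series.str.fullmatch (available since pandas 1.1) is the stricter tool:

x = df['phoneNumber'].str.fullmatch(r'\d{11}')
df[x]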

Find String Pattern Match in Pandas Dataframe and Return Matched String

I have a dataframe column with variable comma-separated text, and I am just trying to extract the values that are found based on another list. So my dataframe looks like this:
col1 | col2
-----------
x | a,b
listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)
def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x
#also can use col2.str.contains(pattern) for same results
The above filtering works great, but instead of returning b when it finds the match it returns the whole pattern, such as a|b, instead of just b; I want to create another column with the value it finds, such as b.
Here is my final function, but I'm still getting UserWarning: This pattern has match groups. To actually get the groups, use str.extract. I wish I could solve this:
def matching_func(file1, file2):
    file1 = pd.read_csv(file1)
    file2 = pd.read_excel(file2, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(
        lambda x: re.search(pattern, x).group() if re.search(pattern, x) else None)
I think I understand how pandas extract works now, but I'm probably still rusty on regex. How do I create a pattern variable to use in the example below:
df[col1].str.extract('(word1|word2)')
Instead of hard-coding the words in the argument, I want to build a variable such as pattern = 'word1|word2', but that won't work because of the way the string is being created.
My final and preferred version, with a vectorised string method in pandas 0.13:
Using values from one column to extract from a second column:
df[col1].str.extract('({})'.format('|'.join(df[col2])))
You might like to use extract, or one of the other vectorised string methods:
In [11]: s = pd.Series(['a', 'a,b'])
In [12]: s.str.extract('([cdfb])')
Out[12]:
0 NaN
1 b
dtype: object
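A sketch of the pattern-as-variable version the question asks about; re.escape guards against regex metacharacters in the list, and expand=False keeps the result a Series (the names here are illustrative):

import re
import pandas as pd

df = pd.DataFrame({'col1': ['x'], 'col2': ['a,b']})
listformatch = ['c', 'd', 'f', 'b']

pattern = '({})'.format('|'.join(map(re.escape, listformatch)))
df['new_col'] = df['col2'].str.extract(pattern, expand=False)
print(df['new_col'].tolist())  # ['b']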
