Validate strings using regex in pandas - python

I need a bit of help.
I'm pretty new to Python (I use version 3.0 bundled with Anaconda) and I want to use regex to validate/return a list of only the valid numbers that match a criterion (say \d{11} for 11 digits). I'm getting the list using Pandas:
df = pd.DataFrame(columns=['phoneNumber','count'], data=[
    ['08034303939',11],
    ['08034382919',11],
    ['0802329292',10],
    ['09039292921',11]])
When I return all the items using
for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(row[1][0])
it returns all items without regex validation, but when I try to validate with this
for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(re.compile(r"\d{11}").search(row[1][0]).group())
it raises an AttributeError, since search() returns None for values that don't match.
How can I work around this, or is there an easier way?

If you want to validate, you can use str.match and convert to a boolean mask using .astype(bool):
x = df['phoneNumber'].str.match(r'\d{11}').astype(bool)
x
0 True
1 True
2 False
3 True
Name: phoneNumber, dtype: bool
You can use boolean indexing to return only rows with valid phone numbers.
df[x]
phoneNumber count
0 08034303939 11
1 08034382919 11
3 09039292921 11
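One caveat: str.match only anchors the pattern at the start of the string, so a longer run of digits would also pass. On pandas 1.0+, str.fullmatch requires the whole string to match; a minimal sketch (the 12-digit value is made up to show the difference):

```python
import pandas as pd

df = pd.DataFrame(columns=['phoneNumber', 'count'], data=[
    ['08034303939', 11],
    ['080343829190', 12],  # 12 digits: str.match(r'\d{11}') would still accept this
    ['0802329292', 10]])

# fullmatch anchors the pattern at both ends, so only exactly-11-digit values pass
mask = df['phoneNumber'].str.fullmatch(r'\d{11}')
print(df[mask])
```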

Find cell with value LIKE string in Pandas [duplicate]

This works (using Pandas 0.12 dev):
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with", since I was missing a bunch.
So per the Pandas doc as near as I could follow I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?
You can use the str.startswith Series method to give more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
It looks like at least one of the elements in your Series/column is a float, which doesn't have a startswith method, hence the AttributeError; the list comprehension should raise the same error...
To retrieve all the rows which start with the required string:
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows which contain the required string:
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]
You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
This assumes that your "SUBDIVISION" column is of the correct type (string).
This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')
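As an aside, the .str accessor is not supported by query's default numexpr engine; a sketch (with made-up column data) that selects the Python engine explicitly instead of relying on the .values workaround:

```python
import pandas as pd

# made-up data; only the SUBDIVISION values matter here
table = pd.DataFrame({'SUBDIVISION': ['INVERNESS A', 'INVERNESS B', 'OTHER'],
                      'LOT': [1, 2, 3]})

# .str methods are not supported by query's default numexpr engine,
# so select the Python engine explicitly
table2 = table.query('SUBDIVISION.str.startswith("INVERNESS")', engine='python')
print(table2)
```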

Pandas dataframe reports no matching string when the string is present

Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output, (A == "time").any() returns a Series with one entry per column, indicating whether that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
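Since the original goal was to check each string from a list, a small sketch (the words list is made up) that tests each word for exact, whole-string presence in the column:

```python
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
words = ["time", "history", "future"]  # made-up list of strings to look up

# exact (whole-string) presence test for each word in column 0
found = {w: bool((A[0] == w).any()) for w in words}
print(found)
```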

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 where I have 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3, Ordering_4 with True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == "Ordering_1"; when Ordering_2 == True, then Clean == "Ordering_2". Clean would then combine all the True values from Ordering_1, Ordering_2, Ordering_3 and Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone please be able to help me how to do this in python?
Universal solution if there are multiple Trues per rows - filter columns by DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
If there is only one True value per row, you can use the booleans of each column ("Ordering_1", "Ordering_2", etc.) together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use it to filter on the "True" rows, in this case only row 0:
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
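A runnable sketch of the filter/dot approach above, on a made-up frame where row 1 has two True flags at once:

```python
import pandas as pd

# made-up frame; row 1 has two True flags to show the multi-True case
df = pd.DataFrame({'Ordering_1': [True, False, False],
                   'Ordering_2': [False, True, False],
                   'Ordering_3': [False, False, True],
                   'Ordering_4': [False, True, False]})

ordering = df.filter(like='Ordering_')
# matrix-multiply the boolean block with the column names:
# True contributes its column name, False contributes an empty string
df['Clean'] = ordering.dot(ordering.columns + ',').str.strip(',')
print(df['Clean'])
```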

Looking at the first character of a string for every element in a list

I have a pandas dataframe with a column called 'picture'; that column has values that either start with a number or letter. What I'm trying to do is create a new column that checks whether or not the value starts with a letter or number, and populate that new column accordingly. I'm using np.where, and my code is below (raw_master is the dataframe, 'database' is the new column):
def iaps_or_naps(x):
    if x in ["1","2","3","4","5","6","7","8","9"]:
        return True
    else:
        return False
raw_master['database'] = np.where(iaps_or_naps(raw_master.picture[?][0])==True, 'IAPS', 'NAPS')
My issue is that if I just do raw_master.picture[0], that checks the value of the entire string, which is not what I need. I need the first character; however, if I do raw_master.picture[0][0], that will just evaluate to the first character of the first row for the whole dataframe. BTW, the question mark just means I'm not sure what to put there.
How can I get it so it takes the first character of the string for every row?
Thanks so much!
You don't need to write your own function for this. Take this small df as an example:
s = pd.DataFrame(['3asd', 'asd', '3423', 'a123'])
looks like:
0
0 3asd
1 asd
2 3423
3 a123
using a pandas builtin:
# checking first column, s[0], first letter, str[0], to see if it is digit.
# if so, assigning IAPS, if not, assigning NAPS
s['database'] = np.where(s[0].str[0].str.isdigit(), 'IAPS', 'NAPS')
output:
0 database
0 3asd IAPS
1 asd NAPS
2 3423 IAPS
3 a123 NAPS
Applying this to your dataframe:
raw_master['database'] = np.where(raw_master['picture'].str[0].str.isdigit(), 'IAPS', 'NAPS')
IIUC you can just test whether the first char is a number using pd.to_numeric (note the order of the outputs: a non-numeric first character should map to 'NAPS'):
np.where(pd.to_numeric(df['your_col'].str[0], errors='coerce').isnull(),
         'NAPS',  # first character is not a number
         'IAPS')  # first character is a number
You could use a mapping function such as apply, which iterates over each element in the column, this way accessing the first character with indexing [0]. Note that x is a plain string inside apply, so it has no .str accessor:
df['new_col'] = df['picture'].apply(lambda x: 'IAPS' if x[0].isdigit() else 'NAPS')
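A runnable sketch of the apply approach on the toy data from the earlier answer (the frame and column names are assumptions):

```python
import pandas as pd

# toy frame reusing the 'picture' column name from the question
raw_master = pd.DataFrame({'picture': ['3asd', 'asd', '3423', 'a123']})

# x inside apply is a plain Python string, so x[0].isdigit() is all we need
raw_master['database'] = raw_master['picture'].apply(
    lambda x: 'IAPS' if x[0].isdigit() else 'NAPS')
print(raw_master)
```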

Use Pandas string method 'contains' on a Series containing lists of strings

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use pandas string method split and a regex-pattern to split each row into its single sentences (which produces unnecessary empty list elements - any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into lists of strings, each element holding one sentence.
Now, my goal is to use the string method contains to check each element in each row seperately to match a specific regex pattern and create a new Series accordingly which stores the returned boolean values, each signalizing if the regex matched on at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the return is
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
You can use Python's built-in str.find() method:
>>> s.apply(lambda x : any((i for i in x if i.find('you') >= 0)))
0 False
1 True
2 False
dtype: bool
I guess s.str.contains('you') is not working because the elements of your series are not strings, but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
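Assuming each row is already a list of sentences (as after the split step above), a plain substring check with `in` avoids the .str accessor entirely:

```python
import pandas as pd

# rows as lists of sentences, as produced by the split step above
s = pd.Series([['This is a long text.', 'It has multiple sentences.'],
               ['Do you see?', 'More than one sentence!'],
               ['This one has only one sentence though.']])

# plain substring test per list element; no .str accessor needed
result = s.apply(lambda sentences: any('you' in sent for sent in sentences))
print(result)
```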
