Searching for a string that doesn't include certain characters - python

In column 'a' I have values which are numbers separated by a comma (ranging from 1 to 35). e.g. '1,6,7,3,5,15,6,25,30' and '5,6,7,33' '1,6,29,15'
In a new column 'b', I want the value to say 'yes' whenever the value in column 'a' contains 5 or one of its variations: ,5 (comma 5) or 5, (5 comma). However, I don't want values such as 15 or 25 included. Is there a way to include all combinations of 5 with a comma but not anything else?
df.loc[df['a'].str.contains(',5'), 'b'] = 'yes'
df.loc[df['a'].str.contains('5,'), 'b'] = 'yes'

I would suggest converting your comma-separated string into a list (see here for how: How to convert a string to a list in Python?).
Then you can check whether the search value (e.g. '5') exists in the list using in, e.g. inside a function:
if searchValue in arrayOfNumbers:
    return True
(Or you could use a conditional expression, Python's equivalent of the ternary operator: True if searchValue in arrayOfNumbers else False)

I would suggest something like:
import pandas as pd
# your dataframe
df = pd.DataFrame({'A': ['1,2,34,5,6', '32,2,4,67,5', '4,3,2,1,']})
# split on commas so that '5' only matches a whole value, not part of 15 or 25
df['B'] = df['A'].apply(lambda x: '5' in x.split(','))
This will add a column B to your dataframe containing True if 5 is there and False otherwise.
A B
0 1,2,34,5,6 True
1 32,2,4,67,5 True
2 4,3,2,1, False
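To get the 'yes' labels the question asks for without splitting, a regular expression that only accepts 5 as a whole comma-delimited value is one option. This is a minimal sketch rather than the answerers' method; the sample rows (including the extra '15,25' row) are made up for illustration.

```python
import pandas as pd

# Illustrative data; column names 'a' and 'b' follow the question.
df = pd.DataFrame({"a": ["1,6,7,3,5,15,6,25,30", "5,6,7,33", "1,6,29,15", "15,25"]})

# Match 5 only as a whole value: preceded by the start of the string or a comma,
# and followed by a comma or the end of the string.
has_five = df["a"].str.contains(r"(?:^|,)5(?:,|$)", regex=True)

df["b"] = ""
df.loc[has_five, "b"] = "yes"
print(df)
```

The non-capturing groups keep 15 and 25 from matching, because the character before their 5 is a digit rather than a comma.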

Related

Pandas dataframe reports no matching string when the string is present

Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output, (A == "time").any() returns a Series where each entry is a column and whether or not that column contained time. If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
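Note that .str.contains does substring matching, so a cell such as "timeline" would also match "time". Since the question asks about whole strings and a list of candidates, a sketch using Series.isin may be closer to what is wanted; the list name wanted and its contents are hypothetical.

```python
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
wanted = ["time", "timeline", "history"]  # hypothetical list of strings to look up

# isin compares whole values, so "timeline" is not reported as present.
# Indexing the result by the strings themselves makes it easy to read.
in_frame = pd.Series(wanted, index=wanted).isin(A[0])
print(in_frame)
```

Here A[0] selects the single column of the question's dataframe; with a named column, substitute that name.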

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 where I have 4 dichotomous variables called Ordering_1; Ordering_2, Ordering_3, Ordering_4 with True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == Ordering_1, when Ordering_2==True, then Clean == Ordering_2. Then Clean would be a combination of all the true values from Ordering_1; Ordering_2, Ordering_3, Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone please be able to help me how to do this in python?
General solution that also works if there are multiple True values per row: select the columns with DataFrame.filter, then use DataFrame.dot for matrix multiplication (here df is the original dataframe and df1 holds only the Ordering_ columns):
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
If there is only one "True" value per row, you can use each boolean column ("Ordering_1", "Ordering_2", etc.) as a row mask with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
Passing this Series to df1.loc selects only the "True" rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
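The remaining columns can be handled in a loop. The sketch below assumes the four column names from the question and invents sample data with exactly one True per row; if a row had several True values, the last matching column would overwrite the earlier ones.

```python
import pandas as pd

# Made-up data with one True per row, using the question's column names.
df1 = pd.DataFrame({
    "Ordering_1": [True, False, False, False],
    "Ordering_2": [False, True, False, False],
    "Ordering_3": [False, False, True, False],
    "Ordering_4": [False, False, False, True],
})

df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # Each boolean column is used directly as a row mask.
    df1.loc[df1[col], "clean"] = col

print(df1["clean"])
```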

In Python/Pandas, Check if a comma separated string contains any value in a list

I have a column in pandas dataframe that looks like this:
Code
----
ABC,DEF,XYZ
ABC,XYZ
...
...
CBA,FED,ABC
I'm trying to check if this series of comma separated string contains any string in my below list:
["UVW","XYZ"]
I know we can check a single value like "XYZ" in df["Code"], but how can we do it for a list of values in Python? Are there any special functions in pandas for this?
Use pd.Series.str.contains with regex=True:
Given Series, s and target list l:
s
0 ABC,DEF,XYZ
1 ABC,XYZ
2 CBA,FED,ABC
l = ["UVW","XYZ"]
s.str.contains('|'.join(l))
Output:
0 True
1 True
2 False
dtype: bool
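One caveat: the joined pattern matches substrings, so a code such as "XYZW" would also count as a hit for "XYZ". If whole codes must match, splitting each cell and comparing against a set is one way to do it. This is a sketch with an extra made-up row ("ABC,XYZW") added to show the difference.

```python
import pandas as pd

s = pd.Series(["ABC,DEF,XYZ", "ABC,XYZ", "CBA,FED,ABC", "ABC,XYZW"])
targets = {"UVW", "XYZ"}

# Split each cell into codes and test for any overlap with the target set.
exact = s.apply(lambda cell: not targets.isdisjoint(cell.split(",")))

# For comparison: substring matching also flags the "XYZW" row.
substring = s.str.contains("|".join(targets))

print(pd.DataFrame({"code": s, "exact": exact, "substring": substring}))
```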

pandas string to numeric

I have a set of data like
a b
0 type 1 True
1 type 2 False
How can I keep only the numerical part of column a and, at the same time, convert True to 1 and False to 0? Below is what I want.
a b
0 1 1
1 2 0
You can convert Booleans to integers as follows:
df['b'] = df.b.astype(int)
Depending on the nature of your text in Column A, you can do a few things:
a) Split on the space and take the second part (either string or int depending on your needs).
df['a'] = df.a.str.split().str[1] # Optional `.astype(int)`
b) Use regex to extract any digits (\d*) from the end of the string.
df['a'] = df.a.str.extract(r'(\d*)$') # Optional `.astype(int)`
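Putting both steps together on the question's data might look like the following sketch. It uses \d+ (at least one digit) and expand=False so that str.extract returns a Series; both are choices made here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"a": ["type 1", "type 2"], "b": [True, False]})

# Booleans become 1 / 0.
df["b"] = df["b"].astype(int)

# Pull the trailing digits out of column a and convert them to integers.
df["a"] = df["a"].str.extract(r"(\d+)$", expand=False).astype(int)

print(df)
```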

Use Pandas string method 'contains' on a Series containing lists of strings

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use pandas string method split and a regex-pattern to split each row into its single sentences (which produces unnecessary empty list elements - any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into lists of strings, each element holding one sentence.
Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern and create a new Series that stores the returned boolean values, each indicating whether the regex matched at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the return is
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
You can use Python's str.find() method, which returns -1 when the substring is absent:
>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0 False
1 True
2 False
dtype: bool
I guess s.str.contains('you') is not working because the elements of your series are not strings, but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
