Python: Can df['a'].str.contains() take multiple conditions? - python

I have 4 types of value in my df column A, examples shown below:
123
123/123/123/123/123
123,,123,,123
1234-1234-1234
I want the index of those values which do not have any type of separator in them.
I tried like this but failed to get results:
mask = df["A"].str.contains(',','/' na=False)
Any help would be appreciated

If possible, invert the logic - get all rows that are only numbers without any separator. Use ^\d+$ - ^ means start of string, \d+ means one or more digits and $ means end of string - together, only-numbers values:
mask = df["A"].str.contains(r'^\d+$', na=False)
print (mask)
0 True
1 False
2 False
3 False
Name: A, dtype: bool
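Since the goal is the index of the separator-free values, the mask above can be turned into index labels with df.index[mask] - a minimal sketch built from the sample values:

```python
import pandas as pd

df = pd.DataFrame({"A": ["123", "123/123/123/123/123",
                         "123,,123,,123", "1234-1234-1234"]})

# True only where the whole value is digits (no separators at all)
mask = df["A"].str.contains(r'^\d+$', na=False)

# index labels of the separator-free rows
print(df.index[mask].tolist())  # [0]
```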


Pandas: How to compare DateTime64 and Datetime [duplicate]

I have this excruciatingly annoying problem (I'm quite new to Python):
df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']
Why does col1[1] in col1 return False?
To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the matching rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check multiple values (logical or), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
And the docs say this about using in for testing membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
You are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False
I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on your pandas Series. So you probably want to do:
col1[1] in col1.values
because, as mentioned above, pandas looks through the index, and you need to specifically ask it to look at the values of the Series, not the index.
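As a minimal runnable sketch of the index-vs-values distinction:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']

# in on a Series tests membership in the index, so the string '2' is not found
print(col1[1] in col1)         # False
# .values exposes the underlying array, where membership checks the data itself
print(col1[1] in col1.values)  # True
```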

How to Split Pandas data for 2 decimals in object

Consider a dataframe in Pandas, where one of the many columns has data with TWO decimal points in it.
Like
13.343.00
12.345.00
98.765.00
How can one get a new column (float) where each value keeps only one decimal point, stripping the last part of 14.234(.00)?
Desired output should be a new column like
13.343
12.345
98.765
If the digits after the second period are not always 0s (and not always two digits), the following code is more robust:
df["col"] = df["col"].str.extract(r"(.+)\.[0-9]+").astype(float)
Use:
#remove last 3 values
df['col'] = df['col'].str[:-3].astype(float)
Or:
#get values before last .
df['col'] = df['col'].str.rsplit('.', n=1).str[0].astype(float)
Or:
#zero or more digits \d*, then \., then one or more digits \d+
df["col"] = df["col"].str.extract(r"(\d*\.\d+)").astype(float)
You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col = df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like with astype(float).
Note: rstrip('.00') strips any trailing '.' and '0' characters, so you should not use this if a value can consist of all 0s, e.g. 00.000.00 - use the second solution instead.
If the second decimal part is not always 0, use:
df.col.str.rsplit(".", n=1).str[0]
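A runnable sketch of the rsplit approach (n=1 means at most one split, counted from the right):

```python
import pandas as pd

df = pd.DataFrame({"col": ["13.343.00", "12.345.00", "98.765.00"]})

# rsplit from the right, at most one split: "13.343.00" -> ["13.343", "00"]
out = df["col"].str.rsplit(".", n=1).str[0].astype(float)
print(out.tolist())  # [13.343, 12.345, 98.765]
```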

cannot sum rows that match a regular expression in pandas / python

I can find the number of rows in a column in a pandas dataframe that do NOT follow a pattern but not the number of rows that follow the very same pattern!
This works:
df.report_date.apply(lambda x: (not re.match(r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}', x))).sum()
This does not: removing 'not' does not tell me how many rows match but raises a TypeError. Any idea why that would be the case?
df.report_date.apply(lambda x: (re.match(r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}', x))).sum()
df = pd.DataFrame(dict(
report_date=[
'2001-02-04',
'2016-11-12',
'1-1-1999',
'02-28-2012',
'1995-09-30'
]
))
df
regex = r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'
print('does match: {}\ndoesn\'t match: {}'.format(
    df.report_date.str.match(regex).sum(),
    (~df.report_date.str.match(regex)).sum()
))
does match: 3
doesn't match: 2
or
regex = r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'
df.groupby(df.report_date.str.match(regex)).size()
report_date
False 2
True 3
dtype: int64
The problem is that the match function does not return True when it matches - it returns a match object (and None when it does not match). Pandas cannot add match objects because they are not integer values. The reason you get a sum when using 'not' is that not converts the result to a boolean, and pandas can sum booleans (True counts as 1).
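Another way around the TypeError while sticking with apply: wrap the result of re.match in bool(), so None becomes False and a match object becomes True, which pandas can sum. A small sketch using the sample data above:

```python
import re
import pandas as pd

df = pd.DataFrame(dict(report_date=[
    '2001-02-04', '2016-11-12', '1-1-1999', '02-28-2012', '1995-09-30'
]))
regex = r'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'

# bool() maps a match object to True and None to False, so the Series sums cleanly
matches = df.report_date.apply(lambda x: bool(re.match(regex, x))).sum()
print(matches)  # 3
```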

pandas string to numeric

I have a set of data like
a b
0 type 1 True
1 type 2 False
How can I keep the numerical part of column a and convert True to 1 and False to 0 at the same time? Below is what I want.
a b
0 1 1
1 2 0
You can convert Booleans to integers as follows:
df['b'] = df.b.astype(int)
Depending on the nature of your text in Column A, you can do a few things:
a) Split on the space and take the second part (either string or int depending on your needs).
df['a'] = df.a.str.split().str[1] # Optional `.astype(int)`
b) Use regex to extract any digits (\d*) from the end of the string.
df['a'] = df.a.str.extract(r'(\d*)$') # Optional `.astype(int)`
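A combined sketch of both steps (assuming column a always ends in digits, as in the sample; expand=False makes extract return a Series rather than a DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'a': ['type 1', 'type 2'], 'b': [True, False]})

# keep the trailing digits of column a, and cast the booleans in b to 0/1
df['a'] = df['a'].str.extract(r'(\d+)$', expand=False).astype(int)
df['b'] = df['b'].astype(int)
print(df)
```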

Use Pandas string method 'contains' on a Series containing lists of strings

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use the pandas string method split with a regex pattern to split each row into its individual sentences (this produces unnecessary empty list elements - any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into lists of strings, each element holding one sentence.
Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series that stores the returned boolean values, each signaling whether the regex matched at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the return is
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
You can use the Python find() method:
>>> s.apply(lambda x : any((i for i in x if i.find('you') >= 0)))
0 False
1 True
2 False
dtype: bool
I guess s.str.contains('you') is not working because the elements of your Series are not strings but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
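Since the list elements are plain Python strings, a plain substring check with the in operator inside apply also works - a small sketch using the Series from the question:

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])
# split each row into a list of sentences (the capture group keeps the delimiters)
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

# True if any element of the list contains the substring
result = s.apply(lambda parts: any('you' in part for part in parts))
print(result.tolist())  # [False, True, False]
```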
