pandas string to numeric - python

I have a set of data like
a b
0 type 1 True
1 type 2 False
How can I keep only the numeric part of column a and, at the same time, convert True to 1 and False to 0 in column b? Below is what I want.
a b
0 1 1
1 2 0

You can convert Booleans to integers as follows:
df['b'] = df.b.astype(int)
Depending on the nature of your text in Column A, you can do a few things:
a) Split on the space and take the second part (either string or int depending on your needs).
df['a'] = df.a.str.split().str[1] # Optional `.astype(int)`
b) Use regex to extract the digits (\d+) at the end of the string.
df['a'] = df.a.str.extract(r'(\d+)$', expand=False) # Optional `.astype(int)`
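Putting both steps together on the sample data from the question gives a small runnable sketch. The DataFrame below is rebuilt by hand from the question, and `expand=False` is used on the assumption that a plain Series (rather than a one-column DataFrame) is wanted back from `str.extract`.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"a": ["type 1", "type 2"], "b": [True, False]})

# Booleans become 1/0 when cast to int.
df["b"] = df["b"].astype(int)

# Capture the run of digits at the end of each string, then cast to int.
# expand=False returns a Series for a single capture group.
df["a"] = df["a"].str.extract(r"(\d+)$", expand=False).astype(int)

print(df)
```

If some rows have no trailing digits, `str.extract` yields NaN for them and the final `astype(int)` would fail, so the cast is only safe when every row matches.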

How to keep dataframe rows containing any of a list of specific strings?

I have a dataframe with a column level
level
0 HH
1 FF
2 FF
3 C,NN-FRAC,W-PROC
4 C,D
...
8433 C,W-PROC
8434 C,D
8435 D
8436 C,Q
8437 C,HH
I would like to keep only the rows that contain at least one of these specific strings:
searchfor = ['W','W-OFFSH','W-ONSH','W-GB','W-PROC','W-NGTC','W-TRANS','W-UNSTG','W-LNGSTG','W-LNGIE','W-LDC','X','Y','LL','MM','MM – REF','MM – IMP','MM – EXP','NN','NN-FRAC','NN-LDC','OO']
which should give me (from the above extract):
level
1 C,NN-FRAC,W-PROC
2 C,W-PROC
I tried these two string filters, but neither gives me the expected result.
df = df[df['industrytype'].str.contains(searchfor)]
df = df[df['industrytype'].str.contains(','.join(searchfor))]
It is probably not behaving as expected because the values in the column are comma-separated. You can write a small function that splits each value on commas and checks whether any of the parts is in searchfor, then use the apply method to run that function over the column.
def has_match(x):
    # True if any comma-separated part of x is in searchfor
    for part in x.split(','):
        if part in searchfor:
            return True
    return False

df = df[df.industrytype.apply(has_match)]
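A self-contained sketch of the split-and-check approach follows. The sample rows and the shortened `searchfor` list are illustrative assumptions taken loosely from the question, and `has_wanted_code` is a hypothetical helper name.

```python
import pandas as pd

# Shortened, illustrative version of the list in the question.
searchfor = ["W-PROC", "NN-FRAC", "X"]

# A few rows in the same shape as the question's column.
df = pd.DataFrame(
    {"industrytype": ["HH", "FF", "C,NN-FRAC,W-PROC", "C,D", "C,W-PROC"]}
)

# A set makes each membership test cheap.
wanted = set(searchfor)

def has_wanted_code(cell):
    """Return True if any comma-separated code in `cell` is in `wanted`."""
    return any(code in wanted for code in cell.split(","))

filtered = df[df["industrytype"].apply(has_wanted_code)]
print(filtered)
```

Because each code is compared as a whole token, a short entry such as 'W' would not accidentally match inside 'W-PROC', which a substring search would do.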

Python: Can df['a'].str.contains() have multiple conditions?

I have 4 types of value in my df column A example shown below
123
123/123/123/123/123
123,,123,,123
1234-1234-1234
I want the index of those values which do not have any kind of separator in them.
I tried like this but failed to get results
mask = df["A"].str.contains(',','/' na=False)
Any help would be appreciated
If possible, invert the logic: instead of searching for separators, select the rows that contain only digits. Use the pattern ^\d+$, where ^ means start of string, \d+ means one or more digits and $ means end of string. Together they match values made up of digits only:
mask = df["A"].str.contains(r'^\d+$', na=False)
print (mask)
0 True
1 False
2 False
3 False
Name: A, dtype: bool
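Since the question asks for the index of the matching rows, the mask can be applied to `df.index`. This is a minimal sketch using the four sample values from the question; the name `idx` is just illustrative.

```python
import pandas as pd

# The four kinds of value listed in the question.
df = pd.DataFrame(
    {"A": ["123", "123/123/123/123/123", "123,,123,,123", "1234-1234-1234"]}
)

# True only where the whole string consists of digits.
mask = df["A"].str.contains(r"^\d+$", na=False)

# Index labels of the rows without any separator.
idx = df.index[mask]
print(idx.tolist())
```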

Searching for a string that doesn't include certain characters

In column 'a' I have values which are numbers separated by commas (ranging from 1 to 35), e.g. '1,6,7,3,5,15,6,25,30', '5,6,7,33' and '1,6,29,15'.
In a new column 'b', I want the value to say 'yes' whenever the value in column 'a' contains 5, including its variations ,5 (comma 5) or 5, (5 comma). However, I don't want values such as 15 or 25 to count. Is there a way to match all combinations of 5 with a comma but nothing else?
df.loc[df['a'].str.contains(',5'), 'b'] = 'yes'
df.loc[df['a'].str.contains('5,'), 'b'] = 'yes'
I would suggest converting your comma-separated string into a list (see here for how: How to convert a string to a list in Python?).
Then you can check if the search value (e.g. '5') exists in the list using in, e.g.:
found = search_value in list_of_numbers  # True if present, otherwise False
(Or you could try a ternary operator, however that's done in Python)
I would suggest something like:
import pandas as pd
# your dataframe
df = pd.DataFrame({'A': ['1,2,34,5,6', '32,2,4,67,5', '4,3,2,1,']})
# `in` already returns a bool, so no `True if ... else False` is needed
df['B'] = df['A'].apply(lambda x: '5' in x.split(','))
This will add a column B to your dataframe containing True if 5 is there and False otherwise.
A B
0 1,2,34,5,6 True
1 32,2,4,67,5 True
2 4,3,2,1, False
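To get the 'yes' label the question asks for, the boolean result can drive a `.loc` assignment. This is a sketch on the three example strings from the question; `has_five` is an illustrative name.

```python
import pandas as pd

# The three example values from the question.
df = pd.DataFrame({"a": ["1,6,7,3,5,15,6,25,30", "5,6,7,33", "1,6,29,15"]})

# Compare whole tokens, so '15' and '25' do not count as '5'.
has_five = df["a"].apply(lambda s: "5" in s.split(","))

# Rows without a standalone 5 are left as NaN in the new column.
df.loc[has_five, "b"] = "yes"
print(df)
```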

How to split pandas data with two decimal points in an object column

Consider a pandas dataframe in which one of the many columns (dtype object) holds values that contain two decimal points.
Like
13.343.00
12.345.00
98.765.00
How can I get a new float column in which each value keeps only one decimal point, stripping the last part, i.e. the (.00) in 14.234(.00)?
Desired output should be a new column like
13.343
12.345
98.765
If the digits after the second period are not always 0s (and not always two of them), the following code is more robust:
df["col"] = df["col"].str.extract(r"(.+)\.[0-9]+", expand=False).astype(float)
Use:
# remove the last 3 characters
df['col'] = df['col'].str[:-3].astype(float)
Or:
# keep everything before the last .
df['col'] = df['col'].str.rsplit('.', n=1).str[0].astype(float)
Or:
# zero or more digits (\d*), a literal dot (\.), then one or more digits (\d+)
df["col"] = df["col"].str.extract(r"(\d*\.\d+)", expand=False).astype(float)
You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col=df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
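You can convert the result to float afterwards with astype(float).
Note: rstrip('.00') does not remove the literal suffix '.00'; it removes every trailing '.' or '0' character. Do not use it when the part you want to keep can itself end in 0, for example 00.000.00 or 10.000.00 (the latter becomes 1); use the second solution instead.
If the second decimal part is not always 00, use:
df.col.str.rsplit(".", n=1).str[0]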
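The difference between the two approaches shows up on values that end in zeros or whose last part is not 00. The sample values below are made up for illustration and are not from the question.

```python
import pandas as pd

# Illustrative values: a normal case, a value ending in zeros,
# and a value whose last part is not "00".
df = pd.DataFrame({"col": ["13.343.00", "10.000.00", "98.765.12"]})

# rstrip treats its argument as a set of characters to remove.
stripped = df["col"].str.rstrip(".00")

# rsplit with n=1 cuts only at the last period.
split = df["col"].str.rsplit(".", n=1).str[0]

print(stripped.tolist())  # ['13.343', '1', '98.765.12']
print(split.tolist())     # ['13.343', '10.000', '98.765']

# Only the rsplit result converts cleanly to float for every row.
as_float = split.astype(float)
```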

Use Pandas string method 'contains' on a Series containing lists of strings

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use pandas string method split and a regex-pattern to split each row into its single sentences (which produces unnecessary empty list elements - any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into lists of strings, each element holding one sentence.
Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern and to create a new Series that stores the resulting boolean values, each indicating whether the regex matched at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the return is
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
You can use Python's str.find() method, which returns -1 when the substring is not found:
>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0 False
1 True
2 False
dtype: bool
I guess s.str.contains('you') is not working because the elements of your series are not strings, but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
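Since the question asks for a regex match, here is a sketch of two further options: a plain `apply` with `re.search`, and a variant based on `Series.explode` (available since pandas 0.25). The Series of lists is written out by hand rather than produced by the split in the question, and the variable names are illustrative.

```python
import re
import pandas as pd

# One list of sentences per row, written out by hand.
s = pd.Series([
    ["This is a long text.", "It has multiple sentences."],
    ["Do you see?", "More than one sentence!"],
    ["This one has only one sentence though."],
])

pattern = re.compile(r"you")

# Option 1: check every sentence in each list with re.search.
by_apply = s.apply(lambda sentences: any(pattern.search(t) for t in sentences))

# Option 2: one row per sentence, match, then collapse back per original row.
by_explode = (
    s.explode()
     .str.contains(r"you", na=False)
     .groupby(level=0)
     .any()
)

print(by_apply.tolist())
print(by_explode.tolist())
```

The explode variant keeps the matching inside pandas string methods, while the apply variant may be easier to read for small data.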
