Add character to column based on text condition using pandas - python

I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" that contains data like "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers that contain a point and end with a zero. In this example, that means turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure how to add the "M" at the beginning once I have "pz". The only way I know is a for loop, but I think there is a better way. Any suggestions would be great!

You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
Number
0 M1203.10
1 4221
2 3452.11
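For reference, here is a minimal, self-contained sketch of both approaches on the sample values from the question. The frame construction and the names `by_mask` and `by_replace` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"Number": ["1203.10", "4221", "3452.11"]})
pointzero = r"[0-9]+[.][0-9]+0$"

# Approach 1: boolean mask + .loc assignment (done on a copy to keep df intact)
by_mask = df.copy()
m = by_mask["Number"].str.match(pointzero)
by_mask.loc[m, "Number"] = "M" + by_mask.loc[m, "Number"]

# Approach 2: regex replacement with a capture group
by_replace = df["Number"].str.replace("(" + pointzero + ")", r"M\1", regex=True)

print(by_mask["Number"].tolist())
print(by_replace.tolist())
```

Both approaches should give the same result on this data; only the value with a point and a trailing zero gets the "M" prefix.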

You should provide a DataFrame or Series example along with the question.
Example:
s1 = pd.Series(["1203.10", "4221","3452.11"])
s1
0 1203.10
1 4221
2 3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object
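One detail worth keeping in mind when choosing between the answers: `str.contains` searches anywhere in the string, while `str.match` only matches from the start. The sketch below illustrates the difference; the value "ID 1203.10" is a made-up example and not from the question.

```python
import pandas as pd

# "ID 1203.10" is a hypothetical value with extra text before the number
s = pd.Series(["1203.10", "ID 1203.10", "4221"])
pattern = r"[0-9]+[.][0-9]+0$"

# contains: the pattern may start anywhere in the string
found_anywhere = s.str.contains(pattern).tolist()

# match: the pattern must start at the first character
found_at_start = s.str.match(pattern).tolist()

# fullmatch: the whole string must match, so the trailing $ is not needed
found_whole = s.str.fullmatch(r"[0-9]+[.][0-9]+0").tolist()

print(found_anywhere, found_at_start, found_whole)
```

If the column only ever holds plain numbers, the three give the same mask; they differ only when other text surrounds the number.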

Related

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 with 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3 and Ordering_4, holding True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. That is, when Ordering_1 == True, Clean should be "Ordering_1"; when Ordering_2 == True, Clean should be "Ordering_2". Clean would then be a combination of all the true values from Ordering_1, Ordering_2, Ordering_3 and Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone please be able to show me how to do this in Python?
A universal solution that also works if there are multiple True values per row: filter the columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
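If the matrix-multiplication trick feels opaque, a more explicit (and probably slower) row-wise alternative can be sketched as follows. This is not the answer's method but a swapped-in technique using `apply` with `str.join`; the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: row 3 has two True values, row 2 has one
df = pd.DataFrame({
    "Ordering_1": [True, False, False, True],
    "Ordering_2": [False, True, False, True],
    "Ordering_3": [False, False, False, False],
    "Ordering_4": [False, False, True, False],
})

flags = df.filter(like="Ordering_")

# For each row, keep the labels of the True entries and join them with commas
df["Clean"] = flags.apply(lambda row: ",".join(row[row].index), axis=1)

print(df["Clean"].tolist())
```

A row with no True value ends up with an empty string.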
If there is only one "True" value per row, you can use the booleans of each column ("Ordering_1", "Ordering_2", etc.) together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
You can pass this boolean Series to df1.loc to select the "True" rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
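The "remaining columns" step can be written as a short loop. This is a minimal sketch assuming exactly one True per row; the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical data with exactly one True per row
df1 = pd.DataFrame({
    "Ordering_1": [True, False, False, False],
    "Ordering_2": [False, True, False, False],
    "Ordering_3": [False, False, True, False],
    "Ordering_4": [False, False, False, True],
})

df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # Each boolean column acts as a row mask for .loc
    df1.loc[df1[col], "clean"] = col

print(df1["clean"].tolist())
```

Note that if a row had several True values, the last column in the loop would overwrite the earlier ones.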

String Modification On Pandas DataFrame Subset

I'm having a hard time updating a string value in a subset of a Pandas data frame.
In the field action, I am able to modify the action column using regular expressions with:
df['action'] = df.action.str.replace('([^a-z0-9\._]{2,})','')
However, if the string contains a specific word, I don't want to modify it, so I tried to only update a subset like this:
df[df['action'].str.contains('TIME')==False]['action'] = df[df['action'].str.contains('TIME')==False].action.str.replace('([^a-z0-9\._]{2,})','')
and also using .loc like:
df.loc('action',df.action.str.contains('TIME')==False) = df.loc('action',df.action.str.contains('TIME')==False).action.str.replace('([^a-z0-9\._]{2,})','')
but in both cases, nothing gets updated. Is there a better way to achieve this?
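You can do it with loc, but you had the arguments the wrong way around: the row indexer comes first and the column second, and loc takes [] rather than (). On recent pandas versions, str.replace also needs regex=True for the pattern to be treated as a regular expression.
mask_time = ~df['action'].str.contains('TIME') # same as df.action.str.contains('TIME')==False
df.loc[mask_time,'action'] = df.loc[mask_time,'action'].str.replace(r'([^a-z0-9\._]{2,})','', regex=True)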
example:
#dummy df
df = pd.DataFrame({'action': ['TIME 1', 'ABC 2']})
print (df)
action
0 TIME 1
1 ABC 2
The result after applying the method above:
action
0 TIME 1
1 2
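Put together as a self-contained sketch, using the dummy frame above. The likely reason the question's first attempt changed nothing is that `df[mask]['action'] = ...` assigns into a temporary copy (chained indexing), whereas a single `.loc[rows, column]` assignment writes to the original frame. The name `pattern` is introduced here for readability.

```python
import pandas as pd

df = pd.DataFrame({"action": ["TIME 1", "ABC 2"]})
pattern = r"[^a-z0-9._]{2,}"

# Rows that do NOT contain the protected word
mask_time = ~df["action"].str.contains("TIME")

# One .loc call selects rows and column together, so the assignment
# lands in df itself rather than in an intermediate copy
df.loc[mask_time, "action"] = (
    df.loc[mask_time, "action"].str.replace(pattern, "", regex=True)
)

print(df["action"].tolist())
```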
Try this, it should work (I found it here):
df.loc[df.action.str.contains('TIME')==False,'action'] = df.action.str.replace(r'([^a-z0-9\._]{2,})','', regex=True)

How can I efficiently and idiomatically filter rows of PandasDF based on multiple StringMethods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive using a string search; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations. So the following approach might be more efficient:
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
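A variation on the same idea, sketched under the question's stated assumption that the -num- parts are exactly 4 digits: instead of relying on a fixed split position, extract the first 4-digit group enclosed in hyphens. This is an illustrative alternative rather than the answer's code, and `first_num` is a name introduced here.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({"col": [
    "abc:kk__LL-z12-1234-5678-kk__z",
    "def:kk_A_LL-z12-1234-5678-kk_ss_z",
    "abc:kk_AAA_LL-z12-5678-5678-keek_st_z",
    "abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x",
]})

m1 = df["col"].str.startswith("abc:")

# First run of exactly four digits enclosed in hyphens, as a Series
first_num = df["col"].str.extract(r"-(\d{4})-", expand=False)
m2 = pd.to_numeric(first_num).between(1234, 2345)

dfn = df[m1 & m2]
print(dfn.index.tolist())
```

Because the position of the number is found by the pattern, this also copes with rows that have extra hyphen-separated segments before the number.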
One way would be to use regexp and apply function. I find it easier to play with regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= (int(z.groups()[0])) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
Another play on regex :
#string starts with abc,greedy search,
#then look for either 1234, or 2345,
#search on for 4 digit number and whatever else after
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
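Note that the lookbehind pattern tests for the two endpoint values only, so it is not a full range check. The sketch below compares it with a numeric range test on a made-up row whose number (2000) lies strictly inside the range; the helper `in_range` and the sample rows are assumptions for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"col": [
    "abc:kk__LL-z12-1234-5678-kk__z",
    "abc:kk__LL-z12-2000-5678-kk__z",  # hypothetical row, inside the range
    "abc:kk__LL-z12-5678-5678-kk__z",
]})

# Lookbehind version: only rows where 1234- or 2345- precedes a 4-digit number
lookbehind = r"(^abc.*(?<=1234-|2345-)\d{4}.*)"
by_lookbehind = df["col"].str.extract(lookbehind).dropna().index.tolist()

def in_range(text):
    """Return True if the first captured number is within 1234..2345."""
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", text)
    return bool(z) and 1234 <= int(z.group(1)) <= 2345

by_range = df[df["col"].apply(in_range)].index.tolist()

print(by_lookbehind, by_range)
```

On this data the lookbehind version misses the row containing 2000, while the numeric comparison keeps it.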

Creating series from a series of list using positional list member

I have a data frame with a string column. I need to create a new column containing the 3rd element of col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error.
Actually, I need to turn col1 into multiple columns after splitting on " ".
What is the correct way to do this?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
col
0 Lets say 2000 is greater than 5
You can split and use str accessor to get elements at different positions
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
col third fifth last
0 Lets say 2000 is greater than 5 2000 greater 5
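Since the question also asks about turning the column into multiple columns, `str.split` with `expand=True` may be closer to what is wanted. A minimal sketch on the same sample frame; the name `parts` is introduced here.

```python
import pandas as pd

df = pd.DataFrame({"col": ["Lets say 2000 is greater than 5"]})

# expand=True returns a DataFrame with one column per token (labelled 0, 1, 2, ...)
parts = df["col"].str.split(" ", expand=True)

# Individual positions can then be picked by column label
df["third"] = parts[2]

print(parts.shape)
print(df["third"].tolist())
```

Rows with fewer tokens than the longest row are padded with missing values.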
Another way is:
df["third"] = df['col1'].apply(lambda x: x.split()[2])

Getting substring based on another column in a pandas dataframe

Hi, is there a way to get a substring of a column based on another column?
import pandas as pd
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x
digit name
0 2 bernard
1 3 brenden
2 3 bern
What I would expect is something like:
for row in x.itertuples():
    print(row.name[:row.digit])
be
bre
ber
where the result is the substring of name based on digit.
I know that if I really wanted to, I could build a list with the itertuples function, but that does not seem right, and I always try to use a vectorized method.
Appreciate any feedback.
Use apply with axis=1 for row-wise with a lambda so you access each column for slicing:
In [68]:
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x.apply(lambda x: x['name'][:x['digit']], axis=1)
Out[68]:
0 be
1 bre
2 ber
dtype: object
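Because the slice length differs per row, this is not truly vectorized either way. A plain list comprehension over the two columns is a common alternative to `apply(axis=1)` and is usually faster; the sketch below assumes the same frame, and the column name `prefix` is made up for illustration.

```python
import pandas as pd

x = pd.DataFrame({"name": ["bernard", "brenden", "bern"], "digit": [2, 3, 3]})

# Pair each name with its digit and slice in plain Python
x["prefix"] = [name[:n] for name, n in zip(x["name"], x["digit"])]

print(x["prefix"].tolist())
```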
