Regular expression to delete the 0 after the dash sign - Python

I have a pandas dataframe column of numbers which all have a dash in between, for example :
"123-045"
I am wondering, is there any way to delete the zero after the dash sign, to turn the above example into
"123-45"
? And is it possible to apply this to the entire column?
I have used a for loop to check each digit after the dash sign, using Python string functions, but the number of rows is large and the for loop takes forever.

Try the Series.str.replace method with the regex (?<=-)0+ to remove zeros after - (regex=True is needed in recent pandas versions, where the default is a literal match):
df = pd.DataFrame({'a': ["123-045"]})
df
#          a
# 0  123-045
df.a.str.replace('(?<=-)0+', '', regex=True)
# 0    123-45
# Name: a, dtype: object

If s is your string, it can be as simple as this:
s = re.sub("-0+", "-", s)
Or with a pandas dataframe:
df = pd.DataFrame({'key': ["123-045"]})
print(df.key.str.replace("-0+", "-", regex=True))

Related

Add character to column based on text condition using pandas

I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" containing data like "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers which have a point and a zero at the end. In this example, that means turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure on how to add the "M" at the beginning after having "pz". The only way I know is using a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
Number
0 M1203.10
1 4221
2 3452.11
You should include a dataframe or Series example in the question so it is easy to answer. For example:
s1 = pd.Series(["1203.10", "4221", "3452.11"])
s1
# 0    1203.10
# 1       4221
# 2    3452.11
# dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object
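As another option (a sketch, assuming the values are strings like those above), str.replace with a capture group can prepend the M in a single vectorized call:

```python
import pandas as pd

s1 = pd.Series(["1203.10", "4221", "3452.11"])
# capture any number that has a decimal point and ends in 0, prefix it with M
out = s1.str.replace(r'^(\d+\.\d*0)$', r'M\1', regex=True)
print(out)
# 0    M1203.10
# 1        4221
# 2     3452.11
# dtype: object
```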

Looking at the first character of a string for every element in a list

I have a pandas dataframe with a column called 'picture'; that column has values that either start with a number or letter. What I'm trying to do is create a new column that checks whether or not the value starts with a letter or number, and populate that new column accordingly. I'm using np.where, and my code is below (raw_master is the dataframe, 'database' is the new column):
def iaps_or_naps(x):
    if x in ["1","2","3","4","5","6","7","8","9"]:
        return True
    else:
        return False

raw_master['database'] = np.where(iaps_or_naps(raw_master.picture[?][0])==True, 'IAPS', 'NAPS')
My issue is that if I just do raw_master.picture[0], that checks the value of the entire string, which is not what I need. I need the first character; however, if I do raw_master.picture[0][0], that will just evaluate to the first character of the first row for the whole dataframe. BTW, the question mark just means I'm not sure what to put there.
How can I get it so it takes the first character of the string for every row?
Thanks so much!
You don't need to write your own function for this. Take this small df as an example:
s = pd.DataFrame(['3asd', 'asd', '3423', 'a123'])
looks like:
0
0 3asd
1 asd
2 3423
3 a123
using a pandas builtin:
# checking first column, s[0], first letter, str[0], to see if it is digit.
# if so, assigning IAPS, if not, assigning NAPS
s['database'] = np.where(s[0].str[0].str.isdigit(), 'IAPS', 'NAPS')
output:
0 database
0 3asd IAPS
1 asd NAPS
2 3423 IAPS
3 a123 NAPS
Applying this to your dataframe:
raw_master['database'] = np.where(raw_master['picture'].str[0].str.isdigit(), 'IAPS', 'NAPS')
IIUC you can just test whether the first char is a number using pd.to_numeric:
np.where(pd.to_numeric(df['your_col'].str[0], errors='coerce').isnull(),
         'NAPS',   # first char is not a number
         'IAPS')   # first char is a number
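A quick check of this approach on sample data (the column name picture is taken from the question; note the order of the two labels, since isnull() is True when the first character is not a number):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'picture': ['3asd', 'asd', '3423', 'a123']})
# to_numeric turns any non-digit first character into NaN, so isnull()
# is True exactly for the rows that do NOT start with a number
df['database'] = np.where(
    pd.to_numeric(df['picture'].str[0], errors='coerce').isnull(),
    'NAPS', 'IAPS')
print(df)
#   picture database
# 0    3asd     IAPS
# 1     asd     NAPS
# 2    3423     IAPS
# 3    a123     NAPS
```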
You could use a mapping function such as apply, which iterates over each element in the column, accessing the first character with indexing [0]:
df['new_col'] = df['picture'].apply(lambda x: 'IAPS' if x[0].isdigit() else 'NAPS')

Processing Data by Datatype in Pandas

I have a DataFrame containing int and str data which I have to process through.
I would like to separate the text and the numerical values in each cell into separate columns, so that I can compute on the numerical data.
My columns are similar to this:
I have read about doing something like this with the apply and applymap functions, but I am new to pandas and can't design such a function. It should basically do:
def separator():
    if cell has str:
        Add str part to another column (Check column), leave int in place.
    else:
        Add 'NA' to Check column
You can do this using extract followed by to_numeric:
import pandas as pd
df = pd.DataFrame({'a_mrk4': ['042FP', '077', '079', '1234A-BC D..EF']})
df[['a_mrk4', 'Check']] = df['a_mrk4'].str.extract(r'(\d+)(.*)')
df['a_mrk4'] = pd.to_numeric(df['a_mrk4'])
print(df)
Output:
a_mrk4 Check
0 42 FP
1 77
2 79
3 1234 A-BC D..EF
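One caveat worth noting (a sketch with a made-up second row): rows containing no digits don't match the pattern, so extract returns NaN for both groups, and to_numeric simply passes the NaN through (making the column float):

```python
import pandas as pd

df = pd.DataFrame({'a_mrk4': ['042FP', 'no digits here']})
# rows with no digit match produce NaN in both extracted columns
df[['a_mrk4', 'Check']] = df['a_mrk4'].str.extract(r'(\d+)(.*)')
df['a_mrk4'] = pd.to_numeric(df['a_mrk4'])
print(df)
#    a_mrk4 Check
# 0    42.0    FP
# 1     NaN   NaN
```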
You can use regular expressions. Say you have a column target_col whose data follow the pattern digits+text; then you can use the following:
df.target_col.str.extractall(r'(\d+)(\w+)')
You can tweak the regex to match your specific needs.
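For instance (a sketch on invented data, using \D+ for the text part so repeated digit/text runs split correctly), extractall returns one row per match with a MultiIndex, which matters when a cell contains several runs:

```python
import pandas as pd

df = pd.DataFrame({'target_col': ['042FP', '12ab34cd']})
# one output row per (row, match) pair; columns 0 and 1 are the two groups
out = df['target_col'].str.extractall(r'(\d+)(\D+)')
print(out)
#           0   1
#   match
# 0 0     042  FP
# 1 0      12  ab
#   1      34  cd
```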

Python - Apply rsplit in DataFrame column using lambda

I've a dataframe with the following structure (3 columns):
DATE,QUOTE,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, KAGGLE
What I am trying to do is take a substring of the QUOTE column in order to generate a new column containing only the words after the last occurrence of 'TEST'.
My expected result:
DATE,QUOTE, SATUS, SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, DONE, KAGGLE
For that I'm trying with the following code:
import pandas as pd
df = pd.read_excel (filename)
split = lambda x: len(x['QUOTE'].rsplit('TEST',1)[0])
df["STATUS"] = df.apply(split, axis=1)
print(df["STATUS"].unique())
However I'm just printing numbers not 'DONE'.
What am I doing wrong?
Thanks!
In the definition of split you are using len, which returns the length of a sequence (an integer):
len([1, 'Done'])  # returns 2
You need to access the last index, for example:
df['STATUS'] = df.QUOTE.str.rsplit('TEST').str[-1]
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
Or if you want to use apply, just change the definition of split:
split = lambda x: x['QUOTE'].rsplit('TEST', 1)[-1]
df["STATUS"] = df.apply(split, axis=1)
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
Note that assigning a lambda expression to a name is considered bad practice (PEP 8 recommends using def instead).
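The same logic written with def, per PEP 8 style (the strip() call is an assumption added here to drop the leading space that rsplit leaves behind):

```python
import pandas as pd

df = pd.DataFrame({'QUOTE': ['1ºTEST/2ºTEST DONE']})

def status_after_last_test(row):
    # take whatever follows the last occurrence of 'TEST'
    return row['QUOTE'].rsplit('TEST', 1)[-1].strip()

df['STATUS'] = df.apply(status_after_last_test, axis=1)
print(df['STATUS'].iloc[0])  # DONE
```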

How to trim string from reverse in Pandas column

I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resultant value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
In the above dataframe column this gives the output "12tytyttyt", but my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\(          # an open bracket
([^()]+)    # one or more characters that are neither an open nor a close bracket
\)          # a close bracket
We can see how this is working if we take the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]
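If you only need the final bracketed value, a single extract with an end anchor avoids building the intermediate lists (a sketch; the trailing [^(]*$ ensures the matched pair of brackets is the last one in the string):

```python
import pandas as pd

df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
# match a bracketed group that has no further '(' after it, i.e. the last one;
# expand=False returns a Series rather than a one-column DataFrame
df['col'] = df['col'].str.extract(r'\(([^()]+)\)[^(]*$', expand=False)
print(df)
#          col
# 0  AS7878788
# 1  abjsdfvhg
```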
