Issues with extracting substrings of a string in Python Pandas Dataframe - python

I have an expression like ( one row of a column, say 'old_col' in pandas data frame) ( Shown the top two rows from a column of the dataframe )
abcd_6.9_uuu ghaha_12.8 _sksks
abcd_5.2_uuu ghaha_13.9 _sksks
I was trying to use the str.extract on the dataframe to get the two floating numbers. However I find two issues, only the first one is picked up( 6.9 from first row and 5.2 from second row )
1. So how can I do that?
2. Also how can I make the extract method general to pick numbers upto any digits ( 5.7or 12.9 irrespective)
I am using:
df['newcol'] = df['old_col'].str.extract('(_\d.\d)')

To get more than one digit,
df['col'].str.extract('(\_\d+\.\d+)')
col
0 _6.9
1 _15.9
To get all occurrences, use str.extractall
df['col'].str.extractall('(\_\d+\.\d+)')
col
match
0 0 _6.9
1 _12.8
1 0 _15.9
1 _13.9
To assign back to df:
s = df['col'].str.extractall('(\_\d+\.\d+)')['col']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)

You can use Series.str.findall:
import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
# old_col newcol newcol_str
# 0 abcd_6.9_uuu ghaha_12.8 _sksks [6.9, 12.8] 6.9, 12.8
# 1 abcd_5.2_uuu ghaha_13.9 _sksks [5.2, 13.9] 5.2, 13.9
Regex details:
\d+(?:\.\d+)? - one or more digits followed with an optional occurrence of a . and one or more digits
\d+\.\d+ would match only float values where the . is obligatory between at least two digits.
Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists, with .str.join(', '), the newcol_str column contains strings with found matches merged.
If you must check if the numbers occur between underscores add them on both sides of the pattern and wrap the number matching pattern with parentheses:
.str.findall(r'_(\d+(?:\.\d+)?)_')

Related

Substring a column in pandas

I have a dataframe like this
Index
Identifier
0
10769289.0
1
1082471174.0
The "Identifier column is a string column" and I need to remove the ".0"
I'm using the following code:
Dataframe["Identifier"] = Dataframe["Identifier"].replace(regex=['.0'],value='')
But I got this:
IndexIdentifier0769289182471174
As you can see it removed more than just the ".0". I also tried to use
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(".0", "")
but I got the same result.
The dot (.) in regex or in replace can indicate any character. Therefore you have to escape the decimal point. Otherwise it will replace any character followed by a zero. Which in your case would mean that it would replace the 10 at the beginning of 10769289.0 and 1082471174.0, as well as the .0 at the end of each number. By escaping the decimal point, it will only look for the following: .0 - which is what you intended.
import pandas as pd
# Create the dataframe as per the example
Dataframe = pd.DataFrame({"Index": [0,1], "Identifier": ['10769289.0', '1082471174.0']})
# Replace the decimal and the zero at the end of each Identifier.
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace("\.0", "")
# Print the dataframe
print(Dataframe)
OUTPUT:
Index Identifier
0 0 10769289
1 1 1082471174

Detect special values in one column

I am trying to detect values with some specific characters e.g(?,/ etc). Below you can see a small sample with some data.
import pandas as pd
import numpy as np
data = {
'artificial_number':['000100000','000010000','00001000/1','00001000?','0?00/10000'],
}
df1 = pd.DataFrame(data, columns = [
'artificial_number',])
Now I want to detect values with specific characters that are not numbers ('00001000/1','00001000?','0?00/10000') I tried with this lines below
import re
clean = re.sub(r'[^a-zA-Z0-9\._-]', '', df1['artificial_number'])
But this code is not working as I expected. So can anybody help me how to solve this problem ?
#replace the non-digit with an empty value
df1['artificial_number'].str.replace(r'([^\d])','', regex=True)
0 000100000
1 000010000
2 000010001
3 00001000
4 00010000
Name: artificial_number, dtype: object
if you like to list the column with non-digit values
df1.loc[df1['artificial_number'].str.extract(r'([^\d])')[0].notna()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
Assuming a number in your case is an integer, to find the values that have non-numbers, just count the number of numbers, and compare with length of string:
rows = [len(re.findall('[0-9]', s)) != len(s) for s in df1.artificial_number]
df1.loc[rows]
# artificial_number
#2 00001000/1
#3 00001000?
#4 0?00/10000
To detect which of the values aren't interpretable as numeric, you can also use str.isnumeric:
df1.loc[~df1.artificial_number.str.isnumeric()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
If all characters need to be digits (e.g. 10.0 should also be excluded), use str.isdigit:
df1.loc[~df1.artificial_number.str.isdigit()]
df1.iloc[0,0] = '000100000.0'
artificial_number
0 000100000.0
2 00001000/1
3 00001000?
4 0?00/10000

How to Split Pandas data for 2 decimals in object

Consider a dataframe in Pandas, where one of the many columns have data that has TWO decimals in the column.
Like
13.343.00
12.345.00
98.765.00
How can one get a new column (float) where values are stored in only 1 decimal format stripping that last part of 14.234(.00).
Desired output should be a new column like
13.343
12.345
98.765
If the digits after the second period are not always 0s (and not always two), the following code is more robust:
df["col"] = df["col"].str.extract("(.+)\.[0-9]+").astype(float)
Use:
#remove last 3 values
df['col'] = df['col'].str[:-3].astype(float)
Or:
#get values before last .
df['col'] = df['col'].str.rsplit('.', 1).str[0].astype(float)
Or:
#one or zero integer \d* \. and integer \d+ pattern
df["col"] = df["col"].str.extract("(\d*\.\d+)").astype(float)
You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col=df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like by astype(float)
Note : You should not use this if you have all 0s example: 00.000.00 instead use the second solution.
If the second decimal is not always 0 use:
df.col.str.rsplit(".",1).str[0]

Dropping row in pandas by index - have to add to index?

I have a pandas dataframe of 182 rows that comes from read_csv. The first column, sys_code, contains various alphanumeric codes. I want to drop ones that start with 'FB' (there are 14 of these). I loop through the dataframe, adding what I assume would be the index to a list, then try to drop by index using the list. But this doesn't work unless I add 18 to each index number.
Without adding 18, I get a list containing numbers from 84 - 97. When I try to drop the rows using this list for indexes, I get KeyError: '[84] not found in axis'. But when I add 18 to each number, it works fine, at least for this particular dataset. But why is this? Shouldn't i be the same as the index number?
fb = []
i = 0
df.reset_index(drop=True)
for x in df['sys_code']:
if x[:2] == 'FB':
fb.append(i+18) #works
fb.append(i) # doesn't work
i += 1
df.drop(fb, axis=0, inplace=True)
You could use Series.str.startswith. Here's an example:
df = pd.DataFrame({'col1':['some string', 'FBsomething', 'FB', 'etc']})
print(df)
col1
0 some string
1 FBsomething
2 FB
3 etc
You could remove those strings that do not start with FB using:
df[~df.col1.str.startswith('FB')]
col1
0 some string
3 etc

How to trim string from reverse in Pandas column

I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back,i.e my resultant value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
This in the above Dataframe column is giving the the output "12tytyttyt", however my intended output is "AS7878788"
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\( # an open bracket
([^\(^\)]+) # anything that isn't an open bracket or a close bracket for one or more characters
\) # a close bracket
We can see how this is working if we take the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]

Categories