I have the following DataFrame:
Initial
M.H..
T.H.
How can I remove the extra dot, so that M.H.. becomes M.H.?
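Use Series.str.replace with an escaped dot. In pandas 2.0 and later, pass regex=True, because the default is a literal replacement:
df['Initial'] = df['Initial'].str.replace(r'\.+', ".", regex=True)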
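A quick check of this pattern on plain strings, using the standard re module rather than pandas, next to a plain [:-1] slice for comparison. This is only a sketch; the helper names collapse_dots and drop_last are made up for illustration, and the sample values come from the question.

```python
import re

def collapse_dots(value: str) -> str:
    """Replace every run of one or more dots with a single dot."""
    return re.sub(r"\.+", ".", value)

def drop_last(value: str) -> str:
    """Remove the final character, whatever it is."""
    return value[:-1]

samples = ["M.H..", "T.H."]
collapsed = [collapse_dots(v) for v in samples]
sliced = [drop_last(v) for v in samples]

print(collapsed)  # ['M.H.', 'T.H.']
print(sliced)     # ['M.H.', 'T.H']
```

The regex leaves a value that is already correct untouched, while the slice also shortens T.H. to T.H, so slicing is only safe when every value is known to end in a doubled dot.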
You can also use string slicing, which drops the last character whatever it is:
string[:-1]
In your case,
"M.H.."[:-1]
How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there is one in exponent notation (4E+1).
However, when I run it I get the error in the title, which I presume is caused by an empty string.
What am I missing here to allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
This prints:
['5.11']
The problem is not only the number of groups, but also the fact that the last alternative in your regex is optional (note the ? right after it). Since Series.str.extract returns the first match, your regex matches and returns an empty string at the start of the string whenever the number is not at the very start.
It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# =>               0
# 0    5.110000e+00
# 1    5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
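The empty-match behaviour described in this answer can be reproduced with the standard re module, assuming re.search applies the same first-match logic as Series.str.extract. The sketch below uses the pattern from the question and one of the single-group patterns quoted above; the extra test strings are illustrative.

```python
import re

text = "EIV (5.11 gCO₂/t·nm)"

# Pattern from the question: the final alternative is optional,
# so the whole pattern can match nothing at position 0.
question_pattern = r"((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)"
first = re.search(question_pattern, text)
print(repr(first.group(1)), first.start())  # '' 0

# A single-group pattern that always requires at least one digit.
number_pattern = r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
values = [
    float(re.search(number_pattern, s).group(1))
    for s in (text, "EIV (5.11E+12 gCO₂/t·nm)", "4E+1")
]
print(values)  # [5.11, 5110000000000.0, 40.0]
```

Converting the empty string from the first pattern with float('') is what raises the conversion error; the second pattern never produces an empty match, so the conversion succeeds.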
If every value in your column has the same format as the one you posted, EIV (5.11 gCO₂/t·nm), this should work:
import pandas as pd
df['new_extracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11
I am trying to clean up this string "cha?ra\ncter num?\nber". I want it to remove "?" and "\n" without removing "n" when it is alone. I tried the following, but it doesn't work. Any advice appreciated!
data_doc='cha?ra\ncter num?\nber'
code={"?":"", "\n":""}
table=str.maketrans(code.keys())
data_doc.translate(table)
An even shorter way to do this could be to simply use replace
data_doc='cha?ra\ncter num?\nber'
data_doc = data_doc.replace('?','').replace('\n','')
Output:
character number
import re
data_doc='cha?ra\ncter num?\nber'
cleaned = re.sub(r"[\n?]", "", data_doc)
print(cleaned)
The output:
character number
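The str.translate approach from the question can also be made to work. When str.maketrans is called with a single argument it expects a dict, not dict keys; with three arguments, the third one lists the characters to delete. A minimal sketch using the string from the question:

```python
data_doc = "cha?ra\ncter num?\nber"

# The third argument of str.maketrans is the set of characters to delete.
table = str.maketrans("", "", "?\n")
cleaned = data_doc.translate(table)

print(cleaned)  # character number
```

Only the newline character itself is removed; the letter n is a different character and is left alone.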
I have some URLs, and for some of them I need to strip everything from the question mark (?) onward.
Ex. https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1
I need it to return https://www.yelp.com/biz/starbucks-san-leandro-4
How can I do that?
You can also use the .split() method.
The split() method splits a string into a list.
You can specify the separator; the default separator is any whitespace.
Syntax
string.split(separator, maxsplit)
data = 'https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1'
print(data.split('?')[0])
Output:
https://www.yelp.com/biz/starbucks-san-leandro-4
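You could use rfind and slice the string up to the returned index. Note that rfind returns -1 when the string contains no '?', in which case the slice would drop the last character instead: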
s = 'https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1'
s[:s.rfind('?')]
# 'https://www.yelp.com/biz/starbucks-san-leandro-4'
Go for a regular expression
import re
new_string = re.sub(r'\?.*$', '', your_string)
I would parse the URL and then rebuild it from the parts that you want to keep. For example, you can use urllib.parse.
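One way this parse-and-rebuild idea might look with urllib.parse is sketched below. The function name strip_query is made up for illustration, and dropping the fragment as well as the query is an assumption about what should be kept.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    # SplitResult is a named tuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query="", fragment=""))

url = "https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1"
print(strip_query(url))
# https://www.yelp.com/biz/starbucks-san-leandro-4
```

Unlike slicing on the position of '?', this leaves a URL that has no query string unchanged.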
I have a string that looks as follows:
word1||word2||word3||word4
What is the best way to remove the 'extra' | between words in the string without getting rid of both of them?
The end product needs to look like:
word1|word2|word3|word4
You can use replace
s = 'word1||word2||word3||word4'  # avoid naming a variable str, which shadows the built-in
print(s.replace('||', '|'))
#word1|word2|word3|word4
You can use a regex that matches one or more occurrences of the pattern:
import re
s='word1||word2||word3||word4'
re.sub(r'\|+', '|', s)
# 'word1|word2|word3|word4'
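The two approaches only differ when a separator is repeated more than twice. A short sketch with a made-up input that contains runs of three and four pipes:

```python
import re

s = "word1||word2|||word3||||word4"

# str.replace swaps each non-overlapping '||' for '|' in a single pass,
# so longer runs are shortened but not fully collapsed.
replaced = s.replace("||", "|")

# The regex treats a run of any length as one match.
collapsed = re.sub(r"\|+", "|", s)

print(replaced)   # word1|word2||word3||word4
print(collapsed)  # word1|word2|word3|word4
```

For input that only ever contains doubled pipes, as in the question, both give the same result.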
Simply use str.replace:
'word1||word2||word3||word4'.replace('||', '|')
The code below should address your requirement.
str1="word1||word2||word3||word4|word6"
str1 = str1.replace("||","|")
print (str1)
I have a similar question to this one: Pandas DataFrame: remove unwanted parts from strings in a column.
So I used:
temp_dataframe['PPI'] = temp_dataframe['PPI'].map(lambda x: x.lstrip('PPI/'))
Most of the items start with 'PPI/', but not all. It seems that when an item without the 'PPI/' prefix is encountered, I get this error:
AttributeError: 'float' object has no attribute 'lstrip'
Am I missing something here?
Use replace:
temp_dataframe['PPI'] = temp_dataframe['PPI'].replace('PPI/', '', regex=True)
or str.replace:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.replace('PPI/', '')
Use vectorised str.lstrip:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
It looks like you may have missing values, so you should mask those out or replace them:
temp_dataframe['PPI'] = temp_dataframe['PPI'].fillna('')
or
temp_dataframe.loc[temp_dataframe['PPI'].notnull(), 'PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
Maybe a better method is to filter using str.startswith, then split and take the string after the prefix you want to remove (na=False keeps missing values out of the boolean mask):
temp_dataframe.loc[temp_dataframe['PPI'].str.startswith('PPI/', na=False), 'PPI'] = temp_dataframe['PPI'].str.split('PPI/').str[1]
As @JonClements pointed out, lstrip('PPI/') strips any leading characters from the set 'P', 'I' and '/' rather than removing the prefix as a unit, which is what you're after.
Update
Another method is to pass a regex pattern that looks for the optional prefix and extracts all characters after it:
temp_dataframe['PPI'].str.extract('(?:PPI/)?(.*)', expand=False)
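The difference between stripping a character set and removing a prefix, and the effect of missing values, can be seen on plain Python values. This is a sketch only: strip_ppi is a made-up helper, the sample items are invented, and str.removeprefix requires Python 3.9 or later.

```python
import math

def strip_ppi(value):
    """Remove a leading 'PPI/' from strings; pass anything else (e.g. NaN) through."""
    if isinstance(value, str):
        return value.removeprefix("PPI/")
    return value

# lstrip treats its argument as a set of characters, so it can eat too much.
over_stripped = "PPI/PIPE".lstrip("PPI/")
print(over_stripped)  # E

items = ["PPI/PIPE", "PPI/abc", "xyz", float("nan")]
cleaned = [strip_ppi(v) for v in items]
print(cleaned[:3])  # ['PIPE', 'abc', 'xyz']
```

The float NaN is returned unchanged instead of raising AttributeError, so a function like this could be passed to Series.map in place of the lambda from the question.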