I am trying to use the following code to make replacements in a pandas DataFrame:
replacerscompanya = {',':'','.':'','-':'','ltd':'limited','&':'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya)
replacersaddress1a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a)
replacersaddress2a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address2A'] = df1['Address2A'].replace(replacersaddress2a)
It does not give me an error, but when I check the dataframe, no replacements have been made.
I had previously just used a number of lines like the one below to achieve the same result, but I was hoping to create something a bit simpler to adjust.
df1['CompanyA'] = df1['CompanyA'].str.replace('.','')
Any ideas as to what is going on here?
Thanks!
Escape . in the dictionary because it is a special regex character, and add the parameter regex=True so that replace performs substring (regex) replacement:
replacersaddress1a = {',':'', r'\.':'', '-':'', 'ltd':'limited', '&':'and', r'\brd\b':'road'}
df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a, regex=True)
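As a quick sanity check, here is a minimal, self-contained sketch of the fix (the sample company names are made up):

```python
import pandas as pd

# Hypothetical sample data standing in for the real df1.
df1 = pd.DataFrame({'CompanyA': ['acme, ltd.', 'foo & bar ltd']})

replacerscompanya = {',': '', r'\.': '', '-': '', 'ltd': 'limited', '&': 'and'}
# regex=True makes replace() treat the dict keys as regex patterns and
# do substring replacement instead of whole-cell matching.
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
print(df1['CompanyA'].tolist())  # → ['acme limited', 'foo and bar limited']
```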
I just started with Python, and now I need the following. I have this string:
1184-7380501-2023-183229
What I need is to trim this string and keep only the first few characters after the first hyphen. It should be as follows:
1184-738
How can I do this?
s = "1184-7380501-2023-183229"
print(s[:8])
Or perhaps
import re
pattern = re.compile(r'^\d+-...')
m = pattern.search(s)
print(m[0])
which accommodates variable length numeric prefixes.
You could (you can do this a lot of different ways) use partition() and join()...
"".join([token[:3] if idx == 2 else token for idx, token in enumerate("1184-7380501-2023-183229".partition("-"))])
I am searching for particular strings in the first column of a big file using str.contains(). Some cases are reported even if they only partially match the provided string. For example:
My file structure:
miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
When I run my search code:
DE_miRNAs = ['31-5p', '150-3p'] # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]
I am expecting to get only the second row:
miR-17-5p/31-5p,Gnp,9606,0.92
but I get both the first and second rows - 331-5p comes up in the result too, which it should not:
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.
Thank you.
Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(' + '|'.join(DE_miRNAs) + r')\b'
targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
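As a quick check with the sample rows from the question (this sketch uses a non-capturing group (?:...) to avoid pandas' "match groups" warning, and re.escape in case a search term ever contains regex metacharacters):

```python
import re
import pandas as pd

# Sample rows taken from the question, in place of my_file.csv.
targets = pd.DataFrame({
    'miRNA': ['miR-17-5p/331-5p', 'miR-17-5p/31-5p',
              'miR-17-5p/130-5p', 'miR-17-5p/30-5p'],
    'Gene': ['AAK1', 'Gnp', 'AAK1', 'Gnp'],
})

DE_miRNAs = ['31-5p', '150-3p']
# Word boundaries stop '31-5p' from matching inside '331-5p'.
regex = r'\b(?:' + '|'.join(map(re.escape, DE_miRNAs)) + r')\b'
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
print(new_df['miRNA'].tolist())  # → ['miR-17-5p/31-5p']
```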
contains is a function that takes a regex pattern as an argument. You should be more explicit about the regex pattern you are using.
In your case, I suggest you use /31-5p instead of 31-5p:
DE_miRNAs = ['31-5p', '150-3p'] # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one exponent, 4E+1.
However, when I run it I get the error in the title, which I presume is caused by an empty string.
What am I missing here that would allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
The problem is not only the number of groups, but also the fact that the last alternative in your regex is optional (note the ? added right after it). Since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the input whenever there is no real match at that position.
It is best to use one of the well-known patterns that match any number with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# => 0
# 0 5.110000e+00
# 1 5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
If your column consists of data in the same format as you have posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_extracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11
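To see why the original pattern can return an empty string in the first place, here is a small sketch with plain re, using the sample value from the question:

```python
import re

c = "EIV (5.11 gCO₂/t·nm)"

# The trailing '?' makes the last alternative optional, so the whole
# group can succeed with a zero-length match at position 0.
bad = re.search(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)', c)
print(repr(bad.group(1)))  # → ''

# A single pattern with no optional alternation avoids the problem.
good = re.search(r'(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)', c)
print(good.group(1))  # → 5.11
```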
I have a similar question to this one: Pandas DataFrame: remove unwanted parts from strings in a column.
So I used:
temp_dataframe['PPI'] = temp_dataframe['PPI'].map(lambda x: x.lstrip('PPI/'))
Most of the items start with 'PPI/', but not all. It seems that when an item without the 'PPI/' prefix is encountered, I get this error:
AttributeError: 'float' object has no attribute 'lstrip'
Am I missing something here?
use replace:
temp_dataframe['PPI'].replace('PPI/','',regex=True,inplace=True)
or the vectorised str.replace:
temp_dataframe['PPI'].str.replace('PPI/','')
use vectorised str.lstrip:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
it looks like you may have missing values so you should mask those out or replace them:
temp_dataframe['PPI'].fillna('', inplace=True)
or
temp_dataframe.loc[temp_dataframe['PPI'].notnull(), 'PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
Maybe a better method is to filter using str.startswith, then use split and access the string after the prefix you want to remove:
temp_dataframe.loc[temp_dataframe['PPI'].str.startswith('PPI/'), 'PPI'] = temp_dataframe['PPI'].str.split('PPI/').str[1]
As @JonClements pointed out, lstrip strips any of the characters in the set you pass rather than removing the prefix, which is what you're after.
update
Another method is to pass a regex pattern that looks for the optional prefix and extracts all characters after it:
temp_dataframe['PPI'].str.extract('(?:PPI/)?(.*)', expand=False)
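On pandas 1.4+, there is also a vectorised Series.str.removeprefix, which strips the exact leading string rather than a set of characters (a sketch with made-up sample values):

```python
import pandas as pd

# Hypothetical sample values; one has the prefix, one does not.
temp_dataframe = pd.DataFrame({'PPI': ['PPI/12345', '67890']})

# Unlike lstrip, removeprefix only removes the literal leading 'PPI/'.
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.removeprefix('PPI/')
print(temp_dataframe['PPI'].tolist())  # → ['12345', '67890']
```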
In Sikuli I get a multiline string from the clipboard like this...
Names = App.getClipboard();
So Names =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have used this regex to delete the first character if it is not in the \x00-\x7F hex range, is not a word character, or is a digit:
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But I am having trouble with the second regex, the one that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
So Sikuli can then recognize it as a list (I've found that Sikuli is able to recognize it even with the trailing comma and space at the end) by using...
NamesAsList = eval(Names)
How could I do this in Python? Is it necessary to use regex, or is there another way to do this in Python?
I have already done this using .NET regex; I just don't know how to do it in Python, and I have googled it with no result.
This is how I did it using .NET regex.
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks in advance.
A couple of one-liners. Your question isn't completely clear, but I am assuming you want to split a given string on newlines and then generate a list of strings, removing the first character of each if it's not alphanumeric. Here's how I'd go about it:
import re
r = re.compile(r'^[^a-zA-Z0-9]') # match a leading character that is not alphanumeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.split()]
# join this list with a comma (if that's required - otherwise you already have the list)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a new line ('\n'), then this will work:
NamesAsList = Names.split('\n')
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)
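Alternatively, if the goal is ultimately a Python list, the cleaned names can be split into one directly, with no quoting or eval() step needed (a sketch assuming Names already holds the cleaned multiline string):

```python
# Names as it would look after the first regex has cleaned it up.
Names = "corazona\nPebleo00\ncofriasd\npaflio"

# splitlines() turns the multiline string straight into a list.
NamesAsList = Names.splitlines()
print(NamesAsList)  # → ['corazona', 'Pebleo00', 'cofriasd', 'paflio']
```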