I'm trying to remove the square brackets from a list of latlongs called latlng, which is held in a dataframe (df) in the format shown in the example below:
latlng
[-1.4253128, 52.9015902]
I've got this but it returns NaN:
df['latitude'] = df['latlng'].str.replace('[','')
and produces this warning:
<ipython-input-59-34b588ef7f4b>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['latitude'] = df['latlng'].str.replace('[','')
It seems that if I write the file to a CSV and then use that file, the above works. But ideally, I want it all in one script.
Any help is appreciated!
I think you meant to write df['latitude'] = str(df['latlng']).replace('[','').
Yet, this only removes the opening bracket and will not return long or lat alone.
You might want to use regex to split the string, something along the lines of:
import re
splitted = re.match(r"\[(-?\d+\.\d+),\s*(-?\d+\.\d+)\]", "[-1.4253128, 52.9015902]")
print(splitted[1])
print(splitted[2])
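Applied to the dataframe column, a rough sketch of the same idea (assuming latlng is stored as strings, e.g. after a round trip through CSV) could use str.extract:
import pandas as pd

# minimal sketch, assuming latlng holds strings like '[-1.4253128, 52.9015902]'
df = pd.DataFrame({'latlng': ['[-1.4253128, 52.9015902]']})
coords = df['latlng'].str.extract(r'\[(-?\d+\.\d+),\s*(-?\d+\.\d+)\]')
df['latitude'] = coords[0].astype(float)    # first number, following the question's naming
df['longitude'] = coords[1].astype(float)   # second number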
Use indices:
df['latitude'] = df['latlng'].str[0]
df['longitude'] = df['latlng'].str[1]
Even if the values are stored as objects (lists), the .str[] accessor still lets you index into them.
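For example, a small sketch assuming the column really holds Python lists (which would also explain the NaN from str.replace):
import pandas as pd

# assuming latlng holds actual Python lists, not strings
df = pd.DataFrame({'latlng': [[-1.4253128, 52.9015902]]})
df['latitude'] = df['latlng'].str[0]     # -1.4253128
df['longitude'] = df['latlng'].str[1]    # 52.9015902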
I aim to write a function to apply to an entire dataframe. Each column is checked to see if it contains the currency symbol '$', and if so the symbol is removed.
Surprisingly, a case like:
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-01-10').strftime('%d-%m-%Y')
print(dates)
output:
Index(['01-01-2021', '02-01-2021', '03-01-2021', '04-01-2021', '05-01-2021', '06-01-2021', '07-01-2021', '08-01-2021', '09-01-2021', '10-01-2021'], dtype='object')
But when I do:
dates.str.contains('$').all()
It returns True. Why???
.contains() treats its argument as a regex (by default), not a plain string, and $ means the end of the line in regex (intuitively or not, every string has "the end"). To check for the literal symbol "$" you need to escape it:
dates.str.contains(r'\$').all()
Or you can use regex=False argument of the .contains():
dates.str.contains('$', regex=False).all()
Both options return False.
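For the original goal of stripping '$' from every column of a dataframe, a rough sketch (the column names and data here are made up for illustration) could be:
import pandas as pd

df = pd.DataFrame({'price': ['$10', '$25.50'], 'name': ['a', 'b']})   # hypothetical data
# only look at object (string) columns; pass regex=False (or escape the symbol)
for col in df.select_dtypes(include='object'):
    df[col] = df[col].str.replace('$', '', regex=False)
# df['price'] is now ['10', '25.50']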
I am searching for particular strings in the first column of a big file using str.contains(). Some cases are reported even if they only partially match the provided string. For example:
My file structure:
miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
When I run my search code:
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p'] # the actual list is much bigger

for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]
I am expecting to get only the second row:
miR-17-5p/31-5p,Gnp,9606,0.92
but I get both the first and second rows; 331-5p comes in the result too, which it should not:
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.
Thank you.
Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(' + '|'.join(DE_miRNAs) + r')\b'
targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
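A quick check against the rows from the question (built inline here instead of reading my_file.csv; a non-capturing group (?:...) is used so pandas does not warn about match groups in str.contains):
import pandas as pd

targets = pd.DataFrame({'miRNA': ['miR-17-5p/331-5p', 'miR-17-5p/31-5p',
                                  'miR-17-5p/130-5p', 'miR-17-5p/30-5p']})
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(?:' + '|'.join(DE_miRNAs) + r')\b'
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
# only miR-17-5p/31-5p is kept; 331-5p no longer matches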
contains is a function that takes a regex pattern as an argument. You should be more explicit about the regex pattern you are using.
In your case, I suggest you use /31-5p instead of 31-5p:
DE_miRNAs = ['31-5p', '150-3p'] # the actual list is much bigger

for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
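One thing to watch in both the question and this loop: new_df is overwritten on every iteration, so only the last miRNA's matches survive. A sketch that reads the file once and collects all matches might look like this:
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']
targets = pd.read_csv('my_file.csv')              # read once, outside the loop
frames = []
for miRNA in DE_miRNAs:
    frames.append(targets.loc[targets['miRNA'].str.contains('/' + miRNA, regex=False)])
new_df = pd.concat(frames).drop_duplicates()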
How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one exponent, 4E+1.
However, when I run it I get the error in the title, which I presume is due to an empty string.
What am I missing here to allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
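To apply the same pattern across the dataframe column from the question (a sketch; expand=False keeps the result as a Series so it can be assigned to a column), str.extract can be used:
import pandas as pd

df = pd.DataFrame({'col': ['EIV (5.11 gCO₂/t·nm)', 'EIV (4E+1 gCO₂/t·nm)']})   # sample rows
pattern = r'([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)'
df['new column'] = df['col'].str.extract(pattern, expand=False).astype(float)
# 5.11 and 40.0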
The problem is not only the number of groups, but the fact that the last alternative in your regex is optional (see the ? added right after it, and your regex demo). Since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the string when the match is not at the string start position.
It is best to use one of the well-known single-alternation patterns that match any number with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# => 0
# 0 5.110000e+00
# 1 5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
If your column consists of data in the same format as you have posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_extracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11
I am trying to use the following code to make replacements in a pandas dataframe:
replacerscompanya = {',':'','.':'','-':'','ltd':'limited','&':'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya)
replacersaddress1a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a)
replacersaddress2a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address2A'] = df1['Address2A'].replace(replacersaddress2a)
It does not give me an error, but when I check the dataframe, no replacements have been made.
I had previously just used a number of lines like the code below to achieve the same result, but I was hoping to create something a bit simpler to adjust.
df1['CompanyA'] = df1['CompanyA'].str.replace('.','')
Any ideas as to what is going on here?
Thanks!
Escape the . in the dictionary because it is a special regex character, and add the parameter regex=True for substring replacement and for replacing by regex patterns:
replacersaddress1a = {',': '', r'\.': '', '-': '', 'ltd': 'limited', '&': 'and', r'\brd\b': 'road'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
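Applied to the address columns from the question in the same way (a quick runnable check with made-up rows; the question's replacersaddress2a is identical to replacersaddress1a, so one dictionary is reused here):
import pandas as pd

df1 = pd.DataFrame({'Address1A': ['1 High rd, Anytown'],
                    'Address2A': ['Unit 2 - Low rd.']})   # hypothetical rows

replacersaddress1a = {',': '', r'\.': '', '-': '', 'ltd': 'limited', '&': 'and', r'\brd\b': 'road'}

df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a, regex=True)
df1['Address2A'] = df1['Address2A'].replace(replacersaddress1a, regex=True)
# '1 High rd, Anytown' -> '1 High road Anytown'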
I am new to python and I have a string that looks like this
Temp = "', '/1412311.2121\n"
My desired output is just the numbers and the decimal point itself, so I'm looking for
1412311.2121
as the output, i.e. trying to get rid of the ', '/ and \n in the string. I have tried Temp.strip("\n") and Temp.rstrip("\n") to remove the \n, but it still seems to remain in my string. Does anyone have any ideas? Thanks for your help.
Strings are immutable. string.strip() doesn't change the string; it's a function that returns a value. You need to do:
Temp = Temp.strip()
Note also that calling strip() without any parameters causes it to remove all whitespace characters, including \n
As stalk said, you can achieve your desired result by calling strip("', /\n") on Temp (note that the space needs to be in the character set too, since strip stops at the first character it doesn't find in the set).
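A quick check with the string from the question:
Temp = "', '/1412311.2121\n"
print(Temp.strip("', /\n"))   # 1412311.2121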
If the data are like you show (numbers wrapped on both sides by non-number characters), you can use a very simple regular expression:
import re

s = "', '/1412311.2121\n"
g = re.search('[0-9.]+', s)  # capture the inner number only
print(g.group(0))
I would use a regular expression to do this:
In [7]: import re
In [8]: s = "', '/1412311.2121\n"
In [9]: re.findall(r'([+-]?\d+(?:\.\d+)?(?:[eE][+-]\d+)?)', s)
Out[9]: ['1412311.2121']
This returns a list of all floating-point numbers found in the string.