I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the comma (,) with a dash (-). I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: let's look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced with value
So because the str values do not match exactly, no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
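Note that Series.replace will also do substring replacement if you pass regex=True, so the original call can be fixed with that one change; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'range': ['(2,30)', '(50,290)', '(400,1000)']})

# With regex=True, replace matches substrings instead of whole values
df['range'] = df['range'].replace(',', '-', regex=True)
print(df['range'].tolist())  # ['(2-30)', '(50-290)', '(400-1000)']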
For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built-in replace method available on the DataFrame object.
df.replace(',', '-', regex=True)
Source: Docs
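As a hedged sketch of what that looks like on a small made-up frame with more than one affected column:
import pandas as pd

df = pd.DataFrame({'range': ['(2,30)', '(50,290)'],
                   'other': ['a,b', 'c,d']})

# regex=True makes replace operate on substrings in every column at once
df = df.replace(',', '-', regex=True)
print(df)
#       range other
# 0    (2-30)   a-b
# 1  (50-290)   c-d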
If you only need to replace characters in one specific column, and regex=True and inplace=True have somehow both failed, this approach will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda acts as a small function that is applied to each entry, much like a for loop would be.
x here represents each individual entry in the current column.
The only things you need to change are "column_name", "characters_need_to_replace" and "new_characters".
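For instance, on a small made-up frame (the column name and values are placeholders):
import pandas as pd

data = pd.DataFrame({'range': ['(2,30)', '(50,290)']})

# each x is a single string from the column, so this is plain str.replace
data['range'] = data['range'].apply(lambda x: x.replace(',', '-'))
print(data['range'].tolist())  # ['(2-30)', '(50-290)']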
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
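A runnable sketch of the same idea, with a made-up string_col:
import re
import pandas as pd

df = pd.DataFrame({'string_col': ['a.b-c', '(d)e', 'f.g']})

chars_to_remove = ['.', '-', '(', ')']
# escape each character, then wrap them all in one character class
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
print(df['string_col'].tolist())  # ['abc', 'de', 'fg']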
Similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Note that inside apply, x is a plain string, so this is the standard library's str.replace, not the pandas .str accessor.
If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000']
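If you then want to work with the values as numbers, a possible follow-up is pd.to_numeric; a minimal sketch with the same made-up data:
import pandas as pd

data = pd.DataFrame({'Column_Name': ['$100,000', '$1,100,000']})

cleaned = data['Column_Name'].str.replace('[$,]', '', regex=True)
data['Column_Name'] = pd.to_numeric(cleaned)  # strings -> integers
print(data['Column_Name'].tolist())  # [100000, 1100000]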
Related
I am looking for a way to write this code concisely. It's for replacing certain characters in a Pandas DataFrame column.
df['age'] = ['[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)']
df['age'] = df['age'].str.replace('[', '')
df['age'] = df['age'].str.replace(')', '')
df['age'] = df['age'].str.replace('50-60', '50-59')
df['age'] = df['age'].str.replace('60-70', '60-69')
df['age'] = df['age'].str.replace('70-80', '70-79')
df['age'] = df['age'].str.replace('80-90', '80-89')
df['age'] = df['age'].str.replace('90-100', '90-99')
I tried this, but it didn't work; the strings in df['age'] were not replaced:
chars_to_replace = {
'[' : '',
')' : '',
'50-60' : '50-59',
'60-70' : '60-69',
'70-80' : '70-79',
'80-90' : '80-89',
'90-100': '90-99'
}
for key in chars_to_replace.keys():
    df['age'] = df['age'].replace(key, chars_to_replace[key])
UPDATE
As pointed out in the comments, I forgot str before replace. Adding it solved my problem, thank you!
Also, thank you tdelaney for that answer; I gave it a try and it works just as well. I am not yet familiar with regex substitutions, so I wasn't comfortable with the other options.
Use two passes of regex substitution.
In the first pass, match each pair of numbers separated by -, and decrement the second number.
In the second pass, remove any occurrences of [ and ).
By the way, did you mean to have spaces between each pair of numbers? As written, implicit string concatenation joins them together without spaces.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
def repl(m: re.Match):
    age1 = m.group(1)
    age2 = int(m.group(2)) - 1
    return f"{age1}-{age2}"
string = re.sub(r'(\d+)-(\d+)', repl, string)
string = re.sub(r'\[|\)', '', string)
print(string) # 70-7950-5960-6940-4980-8990-99
The repl function above can be condensed into a lambda:
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
Update: Actually, this can be done in one pass.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
string = re.sub(r'\[(\d+)-(\d+)\)', repl, string)
print(string) # 70-7950-5960-6940-4980-8990-99
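If the ages live in a DataFrame column rather than in one big string, the same one-pass idea carries over: Series.str.replace also accepts a callable replacement when regex=True. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'age': ['[70-80)', '[50-60)', '[90-100)']})

# the callable receives each match object, exactly as with re.sub
repl = lambda m: f"{m.group(1)}-{int(m.group(2)) - 1}"
df['age'] = df['age'].str.replace(r'\[(\d+)-(\d+)\)', repl, regex=True)
print(df['age'].tolist())  # ['70-79', '50-59', '90-99']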
Assuming these brackets are on all of the entries, you can slice them off and then replace the range strings. From the docs for pandas.Series.replace: pandas will replace the values from the dict without you needing to loop.
import pandas as pd
df = pd.DataFrame({
    "age": ['[70-80)', '[50-60)', '[60-70)', '[40-50)', '[80-90)', '[90-100)']})
ranges_to_replace = {
'50-60' : '50-59',
'60-70' : '60-69',
'70-80' : '70-79',
'80-90' : '80-89',
'90-100': '90-99'}
df['age'] = df['age'].str.slice(1,-1).replace(ranges_to_replace)
print(df)
Output
age
0 70-79
1 50-59
2 60-69
3 40-50
4 80-89
5 90-99
In addition to the previous response, if you want to apply the regex substitution to your dataframe, you can use the apply method from pandas. To do so, put the regex substitution into a function, then use apply:
import re

# repl is the replacement function defined in the previous answer
def replace_chars(chars):
    string = re.sub(r'(\d+)-(\d+)', repl, chars)
    string = re.sub(r'\[|\)', ' ', string)
    return string

df['age'] = df['age'].apply(replace_chars)
print(df)
which gives the following output:
age
0 70-79 50-59 60-69 40-49 80-89 90-99
By the way, here I put spaces between the age intervals. Hope this helps.
Change the last part to this:
# assumes a default RangeIndex; df.loc avoids chained-assignment issues
for i in range(len(df)):
    for x in chars_to_replace:
        df.loc[i, 'age'] = df.loc[i, 'age'].replace(x, chars_to_replace[x])
I have a dataset with lots of variation in format like this.
-0.002672945<120>
-0.077635566{600}
5.88365537e-005{500}
-0.116441565{1}
-4.549649974<29.448>
The endings of the values vary widely, and I need to remove all those stray brackets. The problem is that sometimes they are 3 characters long, sometimes 6, etc. I also cannot just take the first few characters, because there are numbers in scientific notation such as 8.645637e-007.
Is there a smart way to clean this kind of mess from the data?
The str.split function accepts a regex too:
df = pd.DataFrame({'Fruit': ['Banana', 'Banana', 'Carrot<x2>', 'Carrot{78}', 'Carrot<91'], 'Person': list('ABCDE')})
df.loc[:, 'Fruit'] = df['Fruit'].str.split(r'<|{', n=1, expand=True)[0]
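Applied to the data from the question (a sketch; the column name x is an assumption):
import pandas as pd

df = pd.DataFrame({'x': ['-0.002672945<120>', '-0.077635566{600}',
                         '5.88365537e-005{500}', '-4.549649974<29.448>']})

# split once on the first '<' or '{' and keep the part before it
df['x'] = df['x'].str.split(r'<|{', n=1, expand=True)[0]
print(df['x'].tolist())
# ['-0.002672945', '-0.077635566', '5.88365537e-005', '-4.549649974']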
>>> df = pd.DataFrame({"x": [
... "-0.002672945<120>",
... "-0.077635566{600}",
... "5.88365537e-005{500}",
... "-0.116441565{1}",
... "-4.549649974<29.448>",
... ]})
>>> df["x"].replace(r"[<{].+$", "", regex=True)
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: x, dtype: object
You can assign that result back into the df then.
Use a regular expression to clean those:
df[column].str.replace(r'[<\[{].+?[>\]}]$', '', regex=True)
Output:
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: column, dtype: object
Breakdown of the regex:
[<\[{] -- Character class; matches ONE of ANY of the characters between the `[` and `]` (the `\[` is just a literal `[`, escaped)
.+? -- "." matches any single character (except newline), "+" means one or more of the preceding token, and the trailing "?" makes the "+" lazy, matching as few characters as possible
[>\]}] -- Character class matching a closing `>`, `]` or `}`
$ -- Only match this stuff if it occurs at the VERY END of the string
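Since what remains is numeric text, a possible follow-up (assuming the column really is named column) is to convert it with pd.to_numeric:
import pandas as pd

df = pd.DataFrame({'column': ['-0.002672945<120>', '5.88365537e-005{500}']})

cleaned = df['column'].str.replace(r'[<\[{].+?[>\]}]$', '', regex=True)
df['column'] = pd.to_numeric(cleaned)  # scientific notation parses fine
print(df['column'].tolist())  # [-0.002672945, 5.88365537e-05]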
I have a csv that has a column called 'ra'.
This is the first 'ra' value the csv has: 8570.0 - I will use it as an example.
I need to remove '.0'.
So I've tried:
dtypes = {
'ra': 'str',
}
df['ra_csv'] = pd.DataFrame({'ra_csv': df['ra']}).replace('.0', '', regex=True).astype(str)
This code returns '85' instead of '8570'. It's replacing all the 0s, and somehow removed the number 7 as well.
How can I make it return '8570'? Thanks.
Option 1: use to_numeric to first convert the data to a numeric type, then convert to int:
df['ra_csv'] = pd.to_numeric(df['ra_csv']).astype(int)
Option 2: using str.replace
df['ra_csv'] = df['ra_csv'].str.replace(r'\..*', '', regex=True)
You get
ra_csv
0 8570
The regex pattern .0 has two matches in your string '8570.0', since . matches any character: '70' and '.0'.
Since you are using df.replace, setting regex=False wouldn't help either, because it checks for exact matches only.
From docs df.replace:
str: string exactly matching to_replace will be replaced with value
Possible fixes are to either fix your regex or use pd.Series.str.replace.
Fixing your regex:
df.replace(r'\.0', '', regex=True)
Using str.replace:
df['ra'].str.replace('.0', '', regex=False)
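One caveat worth noting: the pattern \.0 also matches a .0 in the middle of a value such as '8.05'. Anchoring it to the end of the string avoids that; a small sketch:
import pandas as pd

s = pd.Series(['8570.0', '8.05'])

# $ anchors the match, so only a trailing .0 is removed
print(s.str.replace(r'\.0$', '', regex=True).tolist())  # ['8570', '8.05']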
I need to manipulate strings in Python in this way:
I have strings with '>=', '=', '<=', '<', '>' in front of them, for example:
'>=1_2_3'
'<2_3_2'
what I want to achieve is splitting the strings to obtain, respectively:
'>=', '1_2_3'
'<', '2_3_2'
Basically, I need to split them starting from the first numeric character.
Is there a way to achieve this with regular expressions, without iterating over the string and checking whether each character is a digit or a '_'?
Thank you.
This will do:
re.split(r'(^[^\d]+)', string)[1:]
Example:
>>> re.split(r'(^[^\d]+)', '>=1_2_3')[1:]
['>=', '1_2_3']
>>> re.split(r'(^[^\d]+)', '<2_3_2')[1:]
['<', '2_3_2']
import re
strings = ['>=1_2_3','<2_3_2']
for s in strings:
    mat = re.match(r'([^\d]*)(\d.*)', s)
    print(mat.groups())
Outputs:
('>=', '1_2_3')
('<', '2_3_2')
This just groups everything up until the first digit in one group, then that first digit and everything after into a second.
You can access the individual groups with mat.group(1), mat.group(2)
You can split using this regex:
(?<=[<>=])(?=\d)
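Both the lookbehind and the lookahead are zero-width, so the split discards no characters. A quick check:
import re

for s in ['>=1_2_3', '<2_3_2']:
    print(re.split(r'(?<=[<>=])(?=\d)', s))
# ['>=', '1_2_3']
# ['<', '2_3_2']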
There's probably a better way, but you can split with a capture and then join the last two elements:
values = re.split(r'(\d)', '>=1_2_3', maxsplit=1)
values = [values[0], values[1] + values[2]]
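For example, the intermediate and final values look like this:
import re

values = re.split(r'(\d)', '>=1_2_3', maxsplit=1)
print(values)  # ['>=', '1', '_2_3']
values = [values[0], values[1] + values[2]]
print(values)  # ['>=', '1_2_3']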
I have some python code in which I am retrieving data from a database.
The column I am interested in is a URL which is in the format:
../xxxx/ggg.com
I need to find out if the string starts with two dots ('..').
If it does, I need to remove the two dots at the beginning of the string and then append another string to it.
And finally I have to generate an XML file.
This is my code:
xml.element("Count","%s" %(offercount))
for colm in offer:
xml.start("Offer")
xml.element("qqq","%s" %(colm[0]))
xml.element("aaaa","%s" %(colm[1]))
xml.element("tttt","%s" %(colm[2]))
xml.element("nnnnnn","%s" %(colm[3]))
xml.element("tttt","%s" %(colm[4]))----> This colm[4] is the string with ..
xml.end()
I am new to Python, please help me.
Thanks in advance.
You can keep it simple like this:
In [116]: colm = ['a', 'b', 'c', 'd', '..heythere']
In [117]: s = colm[4]
In [118]: if s.find('..') == 0:
     ...:     print("found .. at the start of string")
     ...:     x = s.replace('..', '!')
     ...:     print(x)
     ...:
found .. at the start of string
!heythere
Use a regular expression, e.g. re.sub(r'^\.\.', '', old_string). Regular expressions are a powerful way of matching strings. In the example above, the regular expression ^\.\. matches the start of a string (^) followed by two dots, which need to be escaped with \ since . on its own matches any character. A more complete example to do what I think you want:
import re
if re.match(r'^\.\.', old_string):
    new_string = old_string[2:] + append_string
See http://docs.python.org/2/library/re.html for more info on regular expressions.
I would recommend you utilize the built-in string handling functions startswith() and replace():
if col.startswith('..'):
    col = col.replace('..', '')
Note that replace removes every occurrence of '..', not just the leading one.
Or perhaps, if you simply wish to remove the two periods at the beginning of the string you could do something like this:
if col.startswith('..'):
    col = col[2:]
This of course is assuming that you have only two periods at the beginning and that you wish to simply remove those two periods from the string.
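Putting it together with the append step from the question (base_url is a made-up placeholder for the string you want to prepend):
base_url = "http://example.com"  # hypothetical prefix

col = "../xxxx/ggg.com"
if col.startswith('..'):
    col = base_url + col[2:]
print(col)  # http://example.com/xxxx/ggg.com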