I am looking for a way to write this code concisely. It replaces certain characters in a pandas DataFrame column.
df['age'] = ['[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)']
df['age'] = df['age'].str.replace('[', '')
df['age'] = df['age'].str.replace(')', '')
df['age'] = df['age'].str.replace('50-60', '50-59')
df['age'] = df['age'].str.replace('60-70', '60-69')
df['age'] = df['age'].str.replace('70-80', '70-79')
df['age'] = df['age'].str.replace('80-90', '80-89')
df['age'] = df['age'].str.replace('90-100', '90-99')
I tried this, but it didn't work, strings in df['age'] were not replaced:
chars_to_replace = {
    '[' : '',
    ')' : '',
    '50-60' : '50-59',
    '60-70' : '60-69',
    '70-80' : '70-79',
    '80-90' : '80-89',
    '90-100': '90-99'
}
for key in chars_to_replace.keys():
    df['age'] = df['age'].replace(key, chars_to_replace[key])
UPDATE
As pointed out in the comments, I did forget str before replace. Adding it solved my problem, thank you!
Also, thank you tdelaney for that answer, I gave it a try and it works just as well. I am not familiar with regex substitutions yet, so I wasn't comfortable with the other options.
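For reference, here is a minimal sketch of the dictionary loop with the missing .str accessor restored. The sample frame is made up for illustration, and it assumes a pandas version where str.replace accepts the regex keyword; regex=False is passed so that '[' and ')' are treated as literal characters rather than regex syntax.

```python
import pandas as pd

# Illustrative data; the real column would come from the existing DataFrame.
df = pd.DataFrame({"age": ["[70-80)", "[50-60)", "[90-100)"]})

chars_to_replace = {
    "[": "",
    ")": "",
    "50-60": "50-59",
    "70-80": "70-79",
    "90-100": "90-99",
}

# .str.replace does substring replacement on every element;
# regex=False keeps '[' and ')' from being parsed as a pattern.
for old, new in chars_to_replace.items():
    df["age"] = df["age"].str.replace(old, new, regex=False)

print(df["age"].tolist())
```

Because the brackets are removed first, the range keys then match the remaining text of each value.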
Use two passes of regex substitution.
In the first pass, match each pair of numbers separated by -, and decrement the second number.
In the second pass, remove any occurrences of [ and ).
By the way, did you mean to have spaces between each pair of numbers? Because as it is now, implicit string concatenation puts them together without spaces.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
def repl(m: re.Match):
    age1 = m.group(1)
    age2 = int(m.group(2)) - 1
    return f"{age1}-{age2}"
string = re.sub(r'(\d+)-(\d+)', repl, string)
string = re.sub(r'\[|\)', '', string)
print(string) # 70-7950-5960-6940-4980-8990-99
The repl function above can be condensed into a lambda:
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
Update: Actually, this can be done in one pass.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
string = re.sub(r'\[(\d+)-(\d+)\)', repl, string)
print(string) # 70-7950-5960-6940-4980-8990-99
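The same one-pass substitution can probably be applied to a DataFrame column directly, since Series.str.replace accepts a callable replacement when regex=True. The sketch below assumes the column holds one interval per row; the frame and the function name shrink_upper_bound are illustrative, not from the question.

```python
import re
import pandas as pd

# Illustrative frame with one interval per row.
df = pd.DataFrame({"age": ["[70-80)", "[50-60)", "[40-50)", "[90-100)"]})

def shrink_upper_bound(m: re.Match) -> str:
    # Keep the lower bound and decrement the upper bound by one.
    return f"{m.group(1)}-{int(m.group(2)) - 1}"

# With regex=True, the replacement may be a callable that receives
# each match object and returns the replacement string.
df["age"] = df["age"].str.replace(
    r"\[(\d+)-(\d+)\)", shrink_upper_bound, regex=True
)
print(df["age"].tolist())
```

This also covers intervals such as 40-50 without having to list every range by hand.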
Assuming these brackets are on all of the entries, you can slice them off and then replace the range strings. According to the docs for pandas.Series.replace, pandas replaces the values from the dict without you having to loop. Note that '40-50' is not in the dictionary, so it is left unchanged in the output.
import pandas as pd
df = pd.DataFrame({
"age":['[70-80)', '[50-60)', '[60-70)', '[40-50)', '[80-90)', '[90-100)']})
ranges_to_replace = {
'50-60' : '50-59',
'60-70' : '60-69',
'70-80' : '70-79',
'80-90' : '80-89',
'90-100': '90-99'}
df['age'] = df['age'].str.slice(1,-1).replace(ranges_to_replace)
print(df)
Output
age
0 70-79
1 50-59
2 60-69
3 40-50
4 80-89
5 90-99
In addition to the previous response, if you want to apply the regex substitution to your dataframe, you can use the apply method from pandas. To do so, put the regex substitution into a function (it reuses repl from the previous answer and needs import re), then pass that function to apply:
def replace_chars(chars):
    string = re.sub(r'(\d+)-(\d+)', repl, chars)
    string = re.sub(r'\[|\)', ' ', string)
    return string
df['age'] = df['age'].apply(replace_chars)
print(df)
which gives the following output:
age
0 70-79 50-59 60-69 40-49 80-89 90-99
By the way, here I put spaces between the age intervals. Hope this helps.
Change the last part to this:
for i in range(len(df)):
    for x in chars_to_replace:
        # assign through df.loc to avoid chained assignment
        df.loc[df.index[i], 'age'] = df.loc[df.index[i], 'age'].replace(x, chars_to_replace[x])
Here is my code:
import pandas as pd
import requests
url = "https://www.coingecko.com/en/coins/recently_added?page=1"
df = pd.read_html(requests.get(url).text, flavor="bs4")
df = pd.concat(df).drop(["Unnamed: 0", "Unnamed: 1"], axis=1)
df.to_csv("your_table.csv", index=False)
names = df["Coin"].str.rsplit(n=2).str[0].str.lower()
coins=names.replace(" ", "-")
print(coins)
The print still shows coins with spaces in their names. I want to replace the spaces with dashes (-). Can someone please help?
You can add a parameter regex=True to make it work:
coins = names.replace(" ", "-", regex=True)
The reason is that, without regex=True (the default is regex=False), .replace() only replaces values that match the search string exactly and in full:
From the official docs:
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
Therefore, unless a value consists of exactly one space and nothing else, nothing matches, and the code without regex=True changes nothing.
Alternatively, you can also use .str.replace() instead of .replace():
coins = names.str.replace(" ", "-")
.str.replace() does not require an exact match; it replaces matching substrings whether regex=True or regex=False.
Result:
print(coins)
0 safebreastinu
1 unicly-bored-ape-yacht-club-collection
2 bundle-dao
3 shibamax
4 swapp
...
95 apollo-space-token
96 futurov-governance-token
97 safemoon-inu
98 x
99 black-kishu-inu
Name: Coin, Length: 100, dtype: object
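To see the three behaviours side by side, here is a small sketch on a made-up Series (the values are illustrative, not the scraped coin names). One of the values is a single space so that the exact-match case has something to hit.

```python
import pandas as pd

# Illustrative values; the second one is exactly one space.
names = pd.Series(["bundle dao", " ", "black kishu inu"])

# Series.replace without regex: only whole-value matches are replaced.
exact = names.replace(" ", "-")

# Series.replace with regex=True: every space inside each value is replaced.
as_regex = names.replace(" ", "-", regex=True)

# Series.str.replace: substring replacement, here with a literal pattern.
via_str = names.str.replace(" ", "-", regex=False)

print(exact.tolist())
print(as_regex.tolist())
print(via_str.tolist())
```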
u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
All I need is the contents inside the parentheses.
If your problem is really just this simple, you don't need regex:
s[s.find("(")+1:s.find(")")]
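One caveat with the slicing approach: str.find returns -1 when the character is missing, which silently produces a wrong slice. A minimal sketch of a guarded version follows; the helper name between_parens is hypothetical.

```python
from typing import Optional

def between_parens(s: str) -> Optional[str]:
    """Return the text between the first '(' and the next ')', or None."""
    start = s.find("(")
    if start == -1:
        return None  # no opening parenthesis at all
    end = s.find(")", start + 1)
    if end == -1:
        return None  # opening parenthesis is never closed
    return s[start + 1:end]

print(between_parens("abcde(date='x',time='y')"))
print(between_parens("no parentheses here"))
```

Searching for ')' only after the position of '(' also avoids picking up a stray closing parenthesis that appears earlier in the string.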
Use re.search(r'\((.*?)\)',s).group(1):
>>> import re
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
>>> re.search(r'\((.*?)\)',s).group(1)
u"date='2/xc2/xb2',time='/case/test.png'"
If you want to find all occurrences:
>>> re.findall(r'\(.*?\)',s)
[u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
>>> re.findall(r'\((.*?)\)',s)
[u"date='2/xc2/xb2',time='/case/test.png'", u'eee']
Building on tkerwin's answer, if you happen to have nested parentheses like in
st = "sum((a+b)/(c+d))"
his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.
To fix that, you need to use rfind for the second part of the operation, so it would become
st[st.find("(")+1:st.rfind(")")]
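A quick side-by-side of the two slices on the example string, as a small sketch:

```python
st = "sum((a+b)/(c+d))"

# find(")") stops at the first closing parenthesis.
up_to_first_close = st[st.find("(") + 1:st.find(")")]

# rfind(")") searches from the right and stops at the last one.
up_to_last_close = st[st.find("(") + 1:st.rfind(")")]

print(up_to_first_close)
print(up_to_last_close)
```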
import re
fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
print(re.compile(r"\((.*)\)").search(fancy).group(1))
contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
if contents_re:
    print(contents_re.groupdict()['contents'])
No need to use regex.
Just use string slicing:
string="(tidtkdgkxkxlgxlhxl) ¥£%#_¥#_¥#_¥#"
print(string[string.find("(")+1:string.find(")")])
TheSoulkiller's answer is great. In my case, though, I needed to handle extra parentheses and extract only the word inside them. A very small change solves the problem:
>>> s=u'abcde((((a+b))))-((a*b))'
>>> re.findall(r'\((.*?)\)',s)
['(((a+b', '(a*b']
>>> re.findall(r'\(+(.*?)\)',s)
['a+b', 'a*b']
Here are several ways to extract strings between parentheses in pandas with the \(([^()]+)\) regex, which matches:
\( - a ( char
([^()]+) - then captures into Group 1 any one or more chars other than ( and )
\) - a ) char.
Extracting the first occurrence using Series.str.extract:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.extract(r'\(([^()]+)\)')
# => df['Values']
# 0 value 1
# Name: Values, dtype: object
Extracting (finding) all occurrences using Series.str.findall:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)')
# => df['Values']
# 0 [value 1, value 2]
# Name: Values, dtype: object
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)').str.join(', ')
# => df['Values']
# 0 value 1, value 2
# Name: Values, dtype: object
Note that .str.join(', ') is used to create a comma-separated string out of the resulting list of strings. You may adjust this separator for your scenario.
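If one row per match is more convenient than a list per row, Series.str.extractall might be a better fit. The sketch below reuses the same regex; the second row is added only to show that rows without a match do not appear in the result.

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["some text (value 1) and (value 2)", "no match here"]
})

# extractall returns one row per match, indexed by
# (original row label, match number); column 0 is capture group 1.
matches = df["Description"].str.extractall(r"\(([^()]+)\)")

print(matches[0].tolist())
print(matches.index.tolist())
```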
Test case:
s = "(rein<unint>(pBuf) +fsizeof(LOG_RECH))"
Result:
['pBuf', 'LOG_RECH', 'rein<unint>(pBuf) +fsizeof(LOG_RECH)']
Implementation:
def getParenthesesList(s):
    res = list()
    left = list()
    for i in range(len(s)):
        if s[i] == '(':
            left.append(i)
        if s[i] == ')':
            le = left.pop()
            res.append(s[le + 1:i])
    print(res)
    return res
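Note that the function above raises an IndexError on an unmatched ')' because it pops from an empty list. A minimal variant that skips unmatched closing parentheses could look like this; the name parenthesized_spans is hypothetical, and ignoring unbalanced input is an assumption about the desired behaviour.

```python
def parenthesized_spans(s):
    """Collect the text inside every balanced pair of parentheses."""
    stack = []   # indices of '(' that are still open
    spans = []   # contents, innermost pairs first
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            # Pair this ')' with the most recent unmatched '('.
            spans.append(s[stack.pop() + 1:i])
    return spans

print(parenthesized_spans("(rein<unint>(pBuf) +fsizeof(LOG_RECH))"))
```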
If I'm not missing something, a small fix to tkerwin's answer:
s[s.find("(")+1:s.rfind(")")]
The second find should be rfind, so that the search for the closing parenthesis starts from the end of the string.
I need to write my DataFrame to CSV, and some of the values start with characters from "+-= ", so I need to remove them first.
I tried to test by using a string:
test="+++++-= I love Mercedes-Benz"
while True:
    if test.startswith('+') or test.startswith('-') or test.startswith('=') or test.startswith(' '):
        test=test[1:]
        continue
    else:
        print(test)
        break
Output looks perfect:
I love Mercedes-Benz.
Now I want to do the same with a lambda on my DataFrame:
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df.loc[len(my_df)] = ["++++-= I love Mercedes-Benz", 4, "Love this"]
my_df.loc[len(my_df)] = ["=Looks so good!", 2, "5-year-old"]
my_df
my_df["A"]=my_df["A"].map(lambda x: x[1:] if x.startswith('=') else x)
print(my_df["A"])
I am not sure how to combine the four startswith checks ("-", "=", "+", " ") and repeat them until the string starts with the first letter or other character (sometimes it might be Japanese or Chinese).
expected final my_df:
A B C
0 I love Mercedes-Benz 4 Love this
1 Looks so good! 2 5-year-old
You can use str.lstrip to remove these leading characters (include the space in the character set so that it is stripped too):
my_df.A.str.lstrip('+-= ')
0 I love Mercedes-Benz
1 Looks so good!
Name: A, dtype: object
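Since lstrip only removes the listed characters from the left edge, it should also behave well with non-Latin text. A small sketch with made-up values (the Japanese sample string is illustrative):

```python
import pandas as pd

s = pd.Series([
    "++++-= I love Mercedes-Benz",
    "=Looks so good!",
    "-= 日本語のコメント",
])

# Strip any run of '+', '-', '=' or ' ' from the left side only;
# the hyphen inside "Mercedes-Benz" is untouched.
cleaned = s.str.lstrip("+-= ")
print(cleaned.tolist())
```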
One way to achieve it could be:
old = None
while old is None or not old.equals(my_df["A"]):
    old = my_df["A"].copy()
    my_df["A"] = my_df["A"].map(lambda x: x[1:] if x.startswith(("-", "=", "+", " ")) else x)
But I'd also like to point you to the strip() method for strings:
>>> test="+++++-= I love Mercedes-Benz"
>>> test.strip("+-=")
' I love Mercedes-Benz'
So your data cleaning can become simpler:
my_df["A"] = my_df["A"].str.strip("+=- ")
Just be careful, because strip removes the characters from both ends of the string. lstrip does the job on the left side only.
The function startswith accepts a tuple of prefixes:
while test.startswith(('+','-','=',' ')):
    test=test[1:]
You can't put a while loop in a lambda, but you don't need a lambda: just write a function and pass its name to map.
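A minimal sketch of that suggestion, using a named function with map. The function name drop_leading_symbols and the constant PREFIX_CHARS are made up for illustration.

```python
import pandas as pd

PREFIX_CHARS = ("+", "-", "=", " ")

def drop_leading_symbols(text: str) -> str:
    # Remove one leading character at a time while the string
    # still starts with any of the unwanted prefixes.
    while text.startswith(PREFIX_CHARS):
        text = text[1:]
    return text

my_df = pd.DataFrame({"A": ["++++-= I love Mercedes-Benz", "=Looks so good!"]})

# Pass the function itself (no parentheses) to map.
my_df["A"] = my_df["A"].map(drop_leading_symbols)
print(my_df["A"].tolist())
```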
As a lover of regex and possibly convoluted solutions, I will add this solution as well:
import re
my_df["A"]=my_df["A"].map(lambda x: re.sub(r'^[+\-=\s]*', '', x))
the regex reads:
^ from the beginning
[] any of the characters listed inside the brackets (the - is escaped so that it is not read as a range)
\s any whitespace
* zero or more
so this will match (and replace with nothing) all the characters from the beginning of the string that are in the square brackets
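The same anchored pattern can likely be used without map, through the vectorised Series.str.replace with regex=True. In this sketch the data is made up, and the third value is there to show that a leading digit is left alone.

```python
import pandas as pd

s = pd.Series(["++++-= I love Mercedes-Benz", "=Looks so good!", "5-year-old"])

# ^ anchors at the start; the class lists '+', an escaped '-', '=' and whitespace.
cleaned = s.str.replace(r"^[+\-=\s]+", "", regex=True)
print(cleaned.tolist())
```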
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
I want to replace the comma with a dash. I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
Because the string values do not match exactly, no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
Here we get an exact match on the second row, so the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
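A small sketch of the DataFrame-wide version, on a made-up frame with two string columns and one numeric column (the column names other and n are illustrative). Note that replace returns a new DataFrame here rather than modifying df.

```python
import pandas as pd

df = pd.DataFrame({
    "range": ["(2,30)", "(50,290)"],
    "other": ["(1,2)", "(3,4)"],
    "n": [1, 2],
})

# regex=True makes the pattern match inside each string value;
# the numeric column is left as it is.
result = df.replace(",", "-", regex=True)
print(result)
```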
If you only need to replace characters in one specific column, and regex=True and inplace=True both failed for some reason, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Here the lambda is a function that apply calls on each entry, much like the body of a for loop.
x represents each entry in the current column in turn.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns= data.columns.str.replace(' ','_',regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
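Similar to the answer by Nancy K, this works for me when the selection is a DataFrame (note the double brackets), so that each x is a whole column with a .str accessor:
data[["column_name"]] = data[["column_name"]].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))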
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]
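Keep in mind that the result of str.replace is still a column of strings. If numbers are needed, one option is pd.to_numeric, as in this sketch; it assumes every cleaned value is a valid integer literal.

```python
import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Inside a character class, '$' and ',' are literal characters.
cleaned = data["Column_Name"].str.replace("[$,]", "", regex=True)

# Convert the cleaned strings to integers.
amounts = pd.to_numeric(cleaned)
print(amounts.tolist())
```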