unable to replace white spaces with dash in python using string.replace(" ", "-") - python

Here is my code:
import pandas as pd
import requests
url = "https://www.coingecko.com/en/coins/recently_added?page=1"
df = pd.read_html(requests.get(url).text, flavor="bs4")
df = pd.concat(df).drop(["Unnamed: 0", "Unnamed: 1"], axis=1)
df.to_csv("your_table.csv", index=False)
names = df["Coin"].str.rsplit(n=2).str[0].str.lower()
coins = names.replace(" ", "-")
print(coins)
The print is still printing coins with spaces in their names. I want to replace the spaces with dashes (-). Can someone please help?

You can add a parameter regex=True to make it work:
coins = names.replace(" ", "-", regex=True)
The reason is that .replace() looks for an exact match when replacing strings without regex=True (the default is regex=False).
From the official docs:
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
Therefore, unless a value to replace consists of exactly one space and nothing else, there is no match for the string to replace, and hence no result from the code without regex=True.
Alternatively, you can also use .str.replace() instead of .replace():
coins = names.str.replace(" ", "-")
.str.replace() does not require an exact match: it replaces partial (substring) matches, whether regex=True or regex=False.
Result:
print(coins)
0 safebreastinu
1 unicly-bored-ape-yacht-club-collection
2 bundle-dao
3 shibamax
4 swapp
...
95 apollo-space-token
96 futurov-governance-token
97 safemoon-inu
98 x
99 black-kishu-inu
Name: Coin, Length: 100, dtype: object
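To see the exact-match behavior in isolation, here is a minimal sketch on a toy Series (not the coin data):
import pandas as pd

s = pd.Series(["foo bar", " "])
# Series.replace without regex=True only swaps whole values that match exactly,
# so only the second element (which is exactly one space) changes:
print(s.replace(" ", "-").tolist())      # ['foo bar', '-']
# .str.replace substitutes every occurrence inside each string:
print(s.str.replace(" ", "-").tolist())  # ['foo-bar', '-']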

Related

Can I use a dictionary in Python to replace multiple characters?

I am looking for a way to write this code concisely. It's for replacing certain characters in a pandas DataFrame column.
df['age'] = ['[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)']
df['age'] = df['age'].str.replace('[', '')
df['age'] = df['age'].str.replace(')', '')
df['age'] = df['age'].str.replace('50-60', '50-59')
df['age'] = df['age'].str.replace('60-70', '60-69')
df['age'] = df['age'].str.replace('70-80', '70-79')
df['age'] = df['age'].str.replace('80-90', '80-89')
df['age'] = df['age'].str.replace('90-100', '90-99')
I tried this, but it didn't work; the strings in df['age'] were not replaced:
chars_to_replace = {
    '['     : '',
    ')'     : '',
    '50-60' : '50-59',
    '60-70' : '60-69',
    '70-80' : '70-79',
    '80-90' : '80-89',
    '90-100': '90-99'
}
for key in chars_to_replace.keys():
    df['age'] = df['age'].replace(key, chars_to_replace[key])
UPDATE
As pointed out in the comments, I did forget str before replace. Adding it solved my problem, thank you!
Also, thank you tdelaney for that answer; I gave it a try and it works just as well. I am not familiar with regex substitutions yet, so I wasn't comfortable with the other options.
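For reference, a minimal sketch of the corrected loop on toy data; regex=False keeps '[' and ')' literal (in older pandas, .str.replace defaulted to regex=True, and a bare '[' would raise a regex error):
import pandas as pd

df = pd.DataFrame({'age': ['[70-80)', '[50-60)', '[90-100)']})
chars_to_replace = {
    '[': '',
    ')': '',
    '70-80': '70-79',
    '50-60': '50-59',
    '90-100': '90-99',
}
for key, value in chars_to_replace.items():
    # regex=False treats the brackets as plain characters
    df['age'] = df['age'].str.replace(key, value, regex=False)
print(df['age'].tolist())  # ['70-79', '50-59', '90-99']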
Use two passes of regex substitution.
In the first pass, match each pair of numbers separated by -, and decrement the second number.
In the second pass, remove any occurrences of [ and ).
By the way, did you mean to have spaces between each pair of numbers? Because as it is now, implicit string concatenation puts them together without spaces.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
def repl(m: re.Match):
    age1 = m.group(1)
    age2 = int(m.group(2)) - 1
    return f"{age1}-{age2}"
string = re.sub(r'(\d+)-(\d+)', repl, string)
string = re.sub(r'\[|\)', '', string)
print(string) # 70-7950-5960-6940-4980-8990-99
The repl function above can be condensed into a lambda:
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
Update: Actually, this can be done in one pass.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
string = re.sub(r'\[(\d+)-(\d+)\)', repl, string)
print(string) # 70-7950-5960-6940-4980-8990-99
Assuming these brackets are on all of the entries, you can slice them off and then replace the range strings. Per the pandas.Series.replace docs, pandas replaces the values from a dict without the need for you to loop.
import pandas as pd
df = pd.DataFrame({
    "age": ['[70-80)', '[50-60)', '[60-70)', '[40-50)', '[80-90)', '[90-100)']})
ranges_to_replace = {
    '50-60' : '50-59',
    '60-70' : '60-69',
    '70-80' : '70-79',
    '80-90' : '80-89',
    '90-100': '90-99'}
df['age'] = df['age'].str.slice(1,-1).replace(ranges_to_replace)
print(df)
Output
age
0 70-79
1 50-59
2 60-69
3 40-50
4 80-89
5 90-99
In addition to the previous response, if you want to apply the regex substitution to your DataFrame, you can use pandas' apply method. To do so, put the regex substitution into a function, then use apply:
def replace_chars(chars):
    # repl is the lambda defined in the previous answer; requires import re
    string = re.sub(r'(\d+)-(\d+)', repl, chars)
    string = re.sub(r'\[|\)', ' ', string)
    return string

df['age'] = df['age'].apply(replace_chars)
print(df)
which gives the following output:
age
0 70-79 50-59 60-69 40-49 80-89 90-99
By the way, here I put spaces between the ages intervals. Hope this helps.
Change the last part to this:
for i in range(len(df['age'])):
    for x in chars_to_replace:
        df['age'].iloc[i] = df['age'].iloc[i].replace(x, chars_to_replace[x])
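Note that assigning through df['age'].iloc[i] is chained indexing and may trigger SettingWithCopyWarning without actually updating the frame. A sketch of the same loop writing through the DataFrame directly (assuming a default RangeIndex):
for i in range(len(df)):
    value = df.at[i, 'age']
    for x in chars_to_replace:
        value = value.replace(x, chars_to_replace[x])
    df.at[i, 'age'] = value  # .at writes back into the DataFrame itself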

pandas regex look ahead and behind from a 1st occurrence of character

I have Python strings like the ones below:
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the following:
a) extract the characters that appear before and after the 1st dot
b) the keywords I want are always found after the last _ symbol
For example, for the 2nd input string I would like to get only PQRST.GHI as output. It comes after the last _ and before the 1st dot, plus the keyword right after that dot.
So, I tried the below
for s in strings:
    after_part = s.split('.')[1]
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)
Though this works, it is definitely not a nice and elegant solution. Is there a better way to write this, for example with a regex?
I expect my output to be as below. As you can see, we get the keywords before and after the 1st dot character:
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Try (regex101):
import re
strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")
for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can also do it directly on the column with str.extract (try the pattern on regex101):
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If some strings contain fewer than 2 dots and each returned string should still include one dot, add a ternary operator that splits (or not) depending on the number of dots in the string.
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']

Regex find keyword followed by N characters

I have a df column with URLs containing keywords followed by hash values, for example:
/someurl/env40d929fadbe746ecagjbf6c515d30686/end
/some/other/url/envlabel40d929fadbe746ecagjbf6c517t30686/envvar40d929fadbe746ecagjbf6c515d306r6
The goal is to replace env followed by a 32-character hash with {env}, envlabel followed by a 32-character hash with {envlabel}, and similarly for the others.
I am trying to use regex in replace method,
to_replace_key = ['env', 'envlabel', 'envvar']
for word in to_replace_key:
    df['URL'] = df['URL'].str.replace(f"{word}\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w", f'{{{word}Id}}', regex=True)
Challenges:
env also replaces the env prefix of envlabel and envvar
the keyword followed by the 32-char hash can appear between / characters or at the end of the line
Expected output:
/someurl/{env}/end
/some/other/url/{envlabel}/{envvar}
Thanks !!
You can use the regex (env(label|var)?)\w{32}, which captures env together with label or var when present (the ? makes that part optional). Replace the matched string with the first captured group, i.e. \\1, wrapped in curly braces:
df['URL'].str.replace(r'(env(label|var)?)\w{32}', '{\\1}', regex=True)
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Edit:
If you have a lot of items in the list, build the pattern instead:
pat = rf"({'|'.join(sorted(to_replace_key, key=len, reverse=True))})\w{{32}}"
df['URL'].str.replace(pat, '{\\1}', regex=True)
Output:
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Name: URL, dtype: object
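Sorting the keywords by length, longest first, is what stops env from shadowing envlabel inside the alternation. Printing the generated pattern makes this visible (same to_replace_key as above):
to_replace_key = ['env', 'envlabel', 'envvar']
pat = rf"({'|'.join(sorted(to_replace_key, key=len, reverse=True))})\w{{32}}"
print(pat)  # (envlabel|envvar|env)\w{32} -- longer alternatives are tried first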
Don't loop, use:
df['URL'] = df['URL'].str.replace(r'(envlabel|envvar|env)\w{32}', r'{\1}', regex=True)
output:
URL
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}

Using regular expressions to remove a string from a column

I am trying to remove a string from a column using regular expressions and replace.
Name
"George # ACkDk02gfe" sold
I want to remove " # ACkDk02gfe"
I have tried several different variations of the code below, but I can't seem to remove the string I want.
df['Name'] = df['Name'].str.replace('(\#\D+\"$)','')
The output should be
George sold
The ACkDk02gfe portion of the string is entirely random.
Let's try this using regex with | ("OR") and regex group:
df['Name'].str.replace('"|(\s#\s\w+)','', regex=True)
Output:
0 George sold
Name: Name, dtype: object
Updated
df['Name'].str.replace(r'"|(\s#\s\w*[-]?\w+)', '', regex=True)
Where df,
Name
0 "George # ACkDk02gfe" sold
1 "Mike # AisBcIy-rW" sold
Output:
0 George sold
1 Mike sold
Name: Name, dtype: object
Your pattern and syntax are wrong.
import pandas as pd

# set up the df
df = pd.DataFrame.from_dict(({'Name': '"George # ACkDk02gfe" sold'},))
# use a raw string for the pattern; regex=True is required in pandas 2.0+,
# where str.replace defaults to literal matching
df['Name'] = df['Name'].str.replace(r'^"(\w+)\s#.*?"', r'\1', regex=True)
I'll let someone else post a regex answer, but this could also be done with split. I don't know how consistent the data you are looking at is, but this would work for the provided string:
df['Name'] = df['Name'].str.split(' ').str[0].str[1:] + ' ' + df['Name'].str.split(' ').str[-1]
output:
George sold
This should do it for you.
Split the string on the chain: whitespace, #, whitespace, the text after #, and the whitespace after that text. This yields a list; joining its elements with .str.join(' ') turns it back into a single string separated by spaces:
df.Name=df.Name.str.split('\s\#\s\w+\s').str.join(' ')
0 George sold
To use re.sub() for the replacement, you need to import re and apply it element-wise, since re.sub() operates on single strings rather than on a whole column:
import re

Name
"George # ACkDk02gfe" sold

df['Name'] = df['Name'].apply(lambda s: re.sub(r'"(\w+)\s#[^"]*"', r'\1', s))
should work.
import re

ss = '"George # ACkDk02gfe" sold'
ss = re.sub('"', "", ss)          # (1) remove the literal quotes
ss = re.sub(r"#\s*\w+", "", ss)   # (2) remove '#' and the random token
ss = re.sub(r"\s+", " ", ss)      # (3) collapse whitespace runs
print(ss)  # George sold
Given that this is the general format of your code, here is the process step by step: (1) substitute the literal "; (2) substitute the regex #\s*\w+ (a literal # optionally followed by whitespace, then a word of one or more characters); (3) substitute runs of whitespace with a single space (note \s+ rather than \s*, which would also match the empty string between characters).
You can wrap this process in a function and simply apply it to a column. Hope it helps!
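A minimal sketch of such a wrapper, using the same three substitutions on a hypothetical Name column:
import re
import pandas as pd

def strip_hash(name):
    name = re.sub('"', '', name)          # (1) drop the literal quotes
    name = re.sub(r'#\s*\w+', '', name)   # (2) drop '#' and the random token
    return re.sub(r'\s+', ' ', name)      # (3) collapse whitespace runs

df = pd.DataFrame({'Name': ['"George # ACkDk02gfe" sold']})
df['Name'] = df['Name'].apply(strip_hash)
print(df['Name'][0])  # George sold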

applying replace strings lambda to all rows in python [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
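For example, on a toy two-column frame (hypothetical data):
import pandas as pd

df = pd.DataFrame({'a': ['(2,30)', '(50,290)'], 'b': ['1,2', '3,4']})
# regex=True makes DataFrame.replace substitute substrings in every column
print(df.replace(',', '-', regex=True))
#           a    b
# 0    (2-30)  1-2
# 1  (50-290)  3-4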
If you only need to replace characters in one specific column and regex=True / inplace=True somehow fail, this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda acts like a function applied in a loop: x stands for each entry of the column in turn.
The only things you need to change are "column_name", "characters_need_to_replace" and "new_characters".
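For example, with a hypothetical city column:
import pandas as pd

data = pd.DataFrame({'city': ['New York', 'San Francisco']})
# Replace every space in each entry with an underscore
data['city'] = data['city'].apply(lambda x: x.replace(' ', '_'))
print(data['city'].tolist())  # ['New_York', 'San_Francisco']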
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re

chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost the same as the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
(Inside apply, x is already a plain string, so it is x.replace, not x.str.replace.)
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
0     100000
1    1100000
Name: Column_Name, dtype: object
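The result is still a Series of strings; if you need numbers, pd.to_numeric can follow (a sketch):
import pandas as pd

data = pd.DataFrame({'Column_Name': ['$100,000', '$1,100,000']})
cleaned = data.Column_Name.str.replace('[$,]', '', regex=True)
print(pd.to_numeric(cleaned).tolist())  # [100000, 1100000]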
