Python Pandas Replace Special Character - python

For some reason, I cannot get this simple statement to work on the ñ. It seems to work on anything else but doesn't like that character. Any ideas?
DF['NAME']=DF['NAME'].str.replace("ñ","n")
Thanks

I'm assuming you're using Python 2.x here, so this is likely a Unicode problem. Don't worry, you're not alone: Unicode is tough in general and especially in Python 2, which is why Python 3 made every str a Unicode string by default.
If all you're concerned about is the ñ, you should decode from UTF-8 and then replace just that one character.
That would look something like the following:
DF['NAME'] = DF['NAME'].str.decode('utf-8').str.replace(u'\xf1', u'n')
As an example:
>>> "sureño".decode("utf-8").replace(u"\xf1", "n")
u'sureno'
If your string is already Unicode, then you can (and actually have to) skip the decode step:
>>> u"sureño".replace(u"\xf1", "n")
u'sureno'
Note here that u'\xf1' uses the hex escape for the character in question.
Update
I was informed in the comments that <>.str.replace is a pandas Series method, which I hadn't realized. In that case, something like the following might work, applying the decode and replace to each value in the column:
DF['NAME'] = DF['NAME'].map(lambda x: x.decode('utf-8').replace(u'\xf1', u'n'))
or something along those lines.
Another update
It actually just occurred to me that your issue may be as simple as the following:
DF['NAME']=DF['NAME'].str.replace(u"ñ","n")
Note how I've added the u in front of the string to make it unicode.
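For readers on Python 3, where every str is already Unicode, a minimal sketch of the same replacement follows. It assumes the value is a plain str and reuses the sample word from the example above; nothing in it is pandas-specific.

```python
# Python 3 sketch: str is Unicode, so no decode step or u prefix is needed.
word = "sureño"

# "\xf1" is the hex escape for ñ (U+00F1), so both spellings are the same string.
same_character = ("\xf1" == "ñ")

replaced = word.replace("\xf1", "n")
print(same_character, replaced)
```

If this works on a plain string but not on the column, the column values are probably bytes or were read with the wrong encoding.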

You can use the replace function to swap a special character for a value of your choice, as follows.
Say your dataframe is df and you need to do this in every column that holds strings; in my case I am replacing "\n". The isinstance check leaves non-string cells untouched:
df = df.applymap(lambda x: x.replace("\n", " ") if isinstance(x, str) else x)

Related

Cannot replace special characters in a Python pandas dataframe

I'm working with Python 3.5 in Windows. I have a dataframe where a 'titles' str type column contains titles of headlines, some of which have special characters such as â,€,˜.
I am trying to replace these with an empty string '' using the pandas replace methods. I have tried various iterations and nothing works. I am able to replace regular characters, but these special characters just don't seem to work.
The code runs without error, but the replacement simply does not occur, and instead the original title is returned. Below is what I have tried already. Any advice would be much appreciated.
df['clean_title'] = df['titles'].replace('€','',regex=True)
df['clean_titles'] = df['titles'].replace('€','')
df['clean_titles'] = df['titles'].str.replace('€','')
def clean_text(row):
    return re.sub('€','',str(row))
    # second variant, tried in place of the line above:
    # return str(row).replace('€','')
df['clean_title'] = df['titles'].apply(clean_text)
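We can only assume that by 'special' characters you mean non-ASCII characters.
To remove all non-ASCII characters in a pandas dataframe column, do the following:
df['clean_titles'] = df['titles'].str.replace(r'[^\x00-\x7f]', '', regex=True)
Note that this solution scales, as it works for any non-ASCII character. The regex=True argument is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default.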
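The same character class can be tried on plain strings before applying it to a column. This is a small sketch using only the standard library; the sample titles and the helper name strip_non_ascii are made up for illustration.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub("", text)

titles = ["Price rises to \u20ac5", "Plain title"]  # hypothetical sample data
clean_titles = [strip_non_ascii(t) for t in titles]
print(clean_titles)
```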
How to remove escape sequence character in dataframe
Data.
product,rating
pest,<br> test
mouse,/ mousetest
Solution (Scala code):
val finaldf = df.withColumn("rating", regexp_replace(col("rating"), "\\\\", "/"))
finaldf.show()

Removing special characters in a pandas dataframe

I have found information on how this could be done, but nothing has worked for me. I am trying to replace the special character 'ð'. I imported my data from a csv file and I used encoding='latin1' or else I kept getting errors. However, a simple DF['Column'].str.replace('ð', '') will not do the trick. I also tried decoding and using the hex value for that character which was recommended on another post, but that still won't work for me. Help is very much appreciated, and I am willing to post code if necessary.
Call str.encode followed by str.decode:
df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
If you want to do this for multiple columns, you can slice and call df.applymap:
df[col_list].applymap(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
Remember that these operations are not in-place. So, you'll have to assign those columns back to their rightful place.
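To see what the encode/decode round trip does to a single value, here is a minimal sketch on plain Python 3 strings. The helper name drop_non_ascii and the sample values are assumptions for illustration only.

```python
def drop_non_ascii(value):
    """Encode to UTF-8 bytes, then decode as ASCII, skipping bytes that do not fit."""
    return value.encode("utf-8").decode("ascii", "ignore")

column = ["\u00f0ata", "normal"]  # hypothetical column values; \u00f0 is the letter eth
cleaned = [drop_non_ascii(v) for v in column]
print(cleaned)
```

Every non-ASCII character becomes one or more bytes above 0x7F in UTF-8, so the 'ignore' handler drops the whole character rather than leaving stray bytes behind.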

Print raw string from variable? (not getting the answers)

I'm trying to find a way to print a string in raw form from a variable. For instance, if I add an environment variable to Windows for a path, which might look like 'C:\\Windows\Users\alexb\', I know I can do:
print(r'C:\\Windows\Users\alexb\')
But I cant put an r in front of a variable.... for instance:
test = 'C:\\Windows\Users\alexb\'
print(rtest)
Clearly would just try to print rtest.
I also know there's
test = 'C:\\Windows\Users\alexb\'
print(repr(test))
But this returns 'C:\\Windows\\Users\x07lexb'
as does
test = 'C:\\Windows\Users\alexb\'
print(test.encode('string-escape'))
So I'm wondering if there's any elegant way to make a variable holding that path print RAW, still using test? It would be nice if it was just
print(raw(test))
But its not
I had a similar problem and stumbled upon this question, and I now know, thanks to Nick Olson-Harris' answer, that the solution lies in changing the string.
Two ways of solving it:
Get the path you want using native python functions, e.g.:
test = os.getcwd() # In case the path in question is your current directory
print(repr(test))
This makes it platform independent and it now works with .encode. If this is an option for you, it's the more elegant solution.
If your string is not a path, define it in a way compatible with python strings, in this case by escaping your backslashes:
test = 'C:\\Windows\\Users\\alexb\\'
print(repr(test))
In general, to make a raw string out of a string variable, I use this:
string = "C:\\Windows\Users\alexb"
raw_string = r"{}".format(string)
output:
'C:\\\\Windows\\Users\\alexb'
You can't turn an existing string "raw". The r prefix on literals is understood by the parser; it tells it to ignore escape sequences in the string. However, once a string literal has been parsed, there's no difference between a raw string and a "regular" one. If you have a string that contains a newline, for instance, there's no way to tell at runtime whether that newline came from the escape sequence \n, from a literal newline in a triple-quoted string (perhaps even a raw one!), from calling chr(10), by reading it from a file, or whatever else you might be able to come up with. The actual string object constructed from any of those methods looks the same.
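A short sketch of that point, assuming Python 3.6 or later (for the rf prefix): the r prefix only affects how a literal is parsed, so passing an existing string through a raw template returns the same characters.

```python
escaped = "\n"        # escape sequence in a regular literal
from_chr = chr(10)    # the same character built at runtime
raw_literal = r"\n"   # raw literal: a backslash followed by the letter n

print(escaped == from_chr)   # the two newlines are indistinguishable
print(len(raw_literal))      # two characters, not one

# Formatting an existing string through a raw template leaves it unchanged.
value = "tab\there"
via_raw_fstring = rf"{value}"
via_raw_percent = r"%s" % value
print(via_raw_fstring == value, via_raw_percent == value)
```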
I know I'm late with this answer, but for people reading this, I found a much easier way of doing it:
myVariable = 'This string is supposed to be raw \'
print(r'%s' %myVariable)
Try this, depending on the type of output you want. Sometimes you may not need the single quotes around the printed string.
test = "qweqwe\n1212as\t121\\2asas"
print(repr(test)) # output: 'qweqwe\n1212as\t121\\2asas'
print( repr(test).strip("'")) # output: qweqwe\n1212as\t121\\2asas
Get rid of the escape characters before storing or manipulating the raw string:
You could change any backslashes of the path '\' to forward slashes '/' before storing them in a variable. The forward slashes don't need to be escaped:
>>> mypath = os.getcwd().replace('\\','/')
>>> os.path.exists(mypath)
True
>>>
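An alternative sketch for the forward-slash idea uses pathlib from the standard library. PureWindowsPath parses Windows-style paths on any platform; the path below is the example from the question, written as a raw literal so the backslashes survive.

```python
from pathlib import PureWindowsPath

# A raw literal keeps the backslashes exactly as typed.
win_path = PureWindowsPath(r"C:\Windows\Users\alexb")

# as_posix() renders the same path with forward slashes.
posix_style = win_path.as_posix()
print(posix_style)
```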
Just simply use r'string'. Hope this will help you as I see you haven't got your expected answer yet:
test = 'C:\\Windows\Users\alexb\'
rawtest = r'%s' %test
I have my variable assigned to big complex pattern string for using with re module and it is concatenated with few other strings and in the end I want to print it then copy and check on regex101.com.
But when I print it in the interactive mode I get double slash - '\\w'
As @Jimmynoarms said:
The solution for Python 3.x:
print(r'%s' % your_variable_pattern_str)
Your particular string won't work as typed because of the escape characters at the end \", won't allow it to close on the quotation.
Maybe I'm just wrong on that one because I'm still very new to python so if so please correct me but, changing it slightly to adjust for that, the repr() function will do the job of reproducing any string stored in a variable as a raw string.
You can do it two ways:
>>>print("C:\\Windows\Users\alexb\\")
C:\Windows\Users\alexb\
>>>print(r"C:\\Windows\Users\alexb\\")
C:\\Windows\Users\alexb\\
Store it in a variable:
test = "C:\\Windows\Users\alexb\\"
Use repr():
>>>print(repr(test))
'C:\\Windows\Users\alexb\\'
or string replacement with %r
print("%r" %test)
'C:\\Windows\Users\alexb\\'
The string will be reproduced with single quotes though so you would need to strip those off afterwards.
To turn a variable to raw str, just use
rf"{var}"
r is raw and f is f-str; put them together and boom it works.
Replace backslashes with forward slashes using one of the below:
re.sub(r"\\", "/", x)
re.sub("\\\\", "/", x)
This does the trick
>>> repr(string)[1:-1]
Here is the proof
>>> repr("\n")[1:-1] == r"\n"
True
And it can be easily extrapolated into a function if need be
>>> raw = lambda string: repr(string)[1:-1]
>>> raw("\n")
'\\n'
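On Python 3 the 'string-escape' codec from the question no longer exists. A comparable sketch uses the unicode_escape codec instead of slicing repr(); the helper name to_escaped is made up here. Note that this codec also doubles existing backslashes and writes non-ASCII characters as \x or \u escapes.

```python
def to_escaped(text):
    """Return text with control characters written out as backslash escapes."""
    return text.encode("unicode_escape").decode("ascii")

sample = "line1\nline2\tend"
escaped_sample = to_escaped(sample)
print(escaped_sample)
```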
I wrote a small function; it works for me:
def conv(strng):
    # Turn each control character back into its two-character escape sequence
    k=strng
    k=k.replace('\a','\\a')
    k=k.replace('\b','\\b')
    k=k.replace('\f','\\f')
    k=k.replace('\n','\\n')
    k=k.replace('\r','\\r')
    k=k.replace('\t','\\t')
    k=k.replace('\v','\\v')
    return k
Here is a straightforward solution.
address = 'C:\Windows\Users\local'
directory ="r'"+ address +"'"
print(directory)
"r'C:\\Windows\\Users\\local'"

Python conversion from u'\\u795d\\u798f' to u'\u795d\u798f'

How to convert from u'\\u795d\\u798f' to u'\u795d\u798f'?
I'm quite confused....
u'\u795d\u798f' is 祝福 in Chinese.
Thanks~
UPDATE:
I'm sorry, I didn't know how to express it at the beginning. Now, my problem is :I got u'\\u795d\\u798f' and I want it to be u'\u795d\u798f'.
From your title (but not the question text) it looks like the problem is that the backslashes in the strings are escaped (i.e., you have \\u795d and you want \u795d). There are several questions on this issue (like Process escape sequences in a string in Python).
In python 2, you can do:
>>> u'\\u795d\\u798f'.decode('unicode_escape')
u'\u795d\u798f'
Applying the print statement to this should print the Chinese characters.
The python 3 equivalent is:
>>> bytes('\\u795d\\u798f','utf-8').decode('unicode_escape')
'祝福'
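The Python 3 conversion can be wrapped in a small helper, sketched below. The name unescape_unicode is hypothetical, and the sketch assumes the input holds only ASCII characters plus \u escapes; encode("ascii") raises an error otherwise, which avoids the silent mangling that unicode_escape can cause on non-ASCII input.

```python
def unescape_unicode(text):
    """Turn literal backslash-u sequences in text into the characters they name."""
    return text.encode("ascii").decode("unicode_escape")

escaped = "\\u795d\\u798f"   # 12 characters: backslash, u, four hex digits, twice
decoded = unescape_unicode(escaped)
print(len(escaped), len(decoded), decoded)
```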

Python efficient mass replacing of unknown characters

I ported a PHP4 + MySQL4 based project to a Django 1.1 project, and it mixes up some letters.
What is the best (most efficient) way to replace them in this fashion?
The problem for me is that I cannot get the values for those letters. Is there an online tool to do that?
I have a text field with various letters and I want to replace them in this fashion:
àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ
I had a similar case where I had to clean up the text, so I used this:
def clean(string):
    # Keep printable characters plus tab (9), line feed (10) and carriage return (13)
    return ''.join([c for c in string if ord(c) > 31 or ord(c) in [9, 10, 13]] )
Update: I managed to extract the Unicode values by looking at the Django debug messages (replace_from: replace_to):
{'\xe0':'\u0105', '\xe8':'\u010d', '\xe6':'\u0119', '\xeb':'\u0117', '\xe1':'\u012f',
'\xf0':'\u0161', '\xf8':'\u0173', '\xfb':'\u016b', '\xfe':'\u017e',
'\xc0':'\u0104', '\xc8':'\u010c', '\xc6':'\u0118', '\xcb':'\u0116', '\xc1':'\u012e',
'\xd0':'\u0160', '\xd8':'\u0172', '\xdb':'\u016a', '\xde':'\u017d'}
So the main problem remains: doing the replacement.
Try the str.replace() method - should work with unicode strings.
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Make sure your old and new strings are of type Unicode
(that applies to your input data as well).
Find out what your input (non-unicode) string is supposed to be encoded in.
For example, it may be in latin1 encoding.
Use the builtin str.decode() method to create a Unicode version of your data,
and feed that to str.replace().
>>> unioldchars = oldchars.decode("latin1")
>>> newdata = data.replace(unioldchars, newchars)
I'd do it myself. The built-in replace function is of little use if you want multiple, efficient replacements.
Give this a look: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/
EDIT: WAIT, you wanted to do the replacement client-side, like in the text-box?
string.translate(s, table[, deletechars])
Delete all characters from s that are in deletechars (if
present), and then translate the characters using table, which must be
a 256-character string giving the translation for each character value,
indexed by its ordinal. If table is None, then only the character deletion
step is performed.
See also http://docs.python.org/library/string.html#string.maketrans
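The string.translate signature quoted above is the Python 2 byte-string API with a 256-character table. On Python 3, a comparable single-pass replacement can be sketched with str.maketrans and str.translate, as below. The mapping follows the letter pairs given in the question (with ø mapped to ų, U+0173); the names LETTER_MAP and fix_letters are made up for illustration.

```python
# Mapping from the mis-decoded Latin-1 letters to the intended Lithuanian letters.
LETTER_MAP = {
    '\xe0': '\u0105', '\xe8': '\u010d', '\xe6': '\u0119', '\xeb': '\u0117',
    '\xe1': '\u012f', '\xf0': '\u0161', '\xf8': '\u0173', '\xfb': '\u016b',
    '\xfe': '\u017e',
    '\xc0': '\u0104', '\xc8': '\u010c', '\xc6': '\u0118', '\xcb': '\u0116',
    '\xc1': '\u012e', '\xd0': '\u0160', '\xd8': '\u0172', '\xdb': '\u016a',
    '\xde': '\u017d',
}

# str.maketrans turns the one-character keys into a code point lookup table.
TABLE = str.maketrans(LETTER_MAP)

def fix_letters(text):
    """Replace every mapped letter in a single pass over the text."""
    return text.translate(TABLE)

print(fix_letters("\xfeuvis"))
```

Because translate walks the string once and looks each character up in the table, the cost does not grow with the number of replacement pairs the way chained replace() calls do.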
