Removing special characters in a pandas dataframe - python

I have found information on how this could be done, but nothing has worked for me. I am trying to replace the special character 'ð'. I imported my data from a csv file and I used encoding='latin1' or else I kept getting errors. However, a simple DF['Column'].str.replace('ð', '') will not do the trick. I also tried decoding and using the hex value for that character which was recommended on another post, but that still won't work for me. Help is very much appreciated, and I am willing to post code if necessary.

Call str.encode followed by str.decode:
df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
If you want to do this for multiple columns, you can slice and call df.applymap:
df[col_list].applymap(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
Remember that these operations are not in-place. So, you'll have to assign those columns back to their rightful place.
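For example, a minimal sketch of assigning the results back (col_list here is a placeholder for your own list of string columns):
df['YourCol'] = df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
df[col_list] = df[col_list].applymap(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))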

Related

Trimming a string from Oracle

I'm attempting, poorly, to compare two lists of filenames and provide a list of the differences. I have read and attempted the numerous examples available.
Before you close this as a Duplicate, or Already Answered, please read further.
In Python, I'm making a call via FTP to obtain a list of files. This returns an array as expected. Next, I'm calling Oracle to retrieve a list of filenames.
This returns an array of filenames, but the cx_Oracle module returns them as ('filename',). When I do a compare, it compares filename against ('filename',) and fails. I need to remove the single quotes, parentheses, and comma prior to the compare, but I'm having zero luck.
I'm attempting to trim these extra characters out of the string using slicing, but this is failing.
for file in resultset:
print file
print file[2:10]
This returns (), rather than filename as expected.
I have tried slicing the string, I have attempted to use awk, but the metacharacters are making this difficult.
If anyone can provide guidance, I would greatly appreciate it.
I have spent hours on something that should have taken minutes.
Thank you, Allan
So first of all, if you want to remove characters from both sides of a string, you can use lstrip and rstrip.
>>> "('filename')".lstrip("('")
'filename')'
>>> "('filename')".rstrip("')")
"('filename"
But I suspect that the return data is actually a tuple and not a string
>>> ('filename',)
('filename',)
>>> type(('filename',))
<class 'tuple'>
>>> ('filename',)[2:20]
()
Which will allow you to index the string out directly:
('filename',)[0]
'filename'
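So rather than stripping characters out of a string, index each row tuple. A minimal sketch (ftp_files is a placeholder for the list you got back from the FTP call):
# cx_Oracle rows are one-element tuples; take the first column from each
oracle_files = [row[0] for row in resultset]
# now both sides are plain strings and can be compared directly
missing = set(ftp_files) - set(oracle_files)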

Python Pandas Replace Special Character

For some reason, I cannot get this simple statement to work on the ñ. It seems to work on anything else but doesn't like that character. Any ideas?
DF['NAME']=DF['NAME'].str.replace("ñ","n")
Thanks
I'm assuming you're using Python 2.x here, and this is likely a Unicode problem. Don't worry, you're not alone: Unicode is really tough in general, and especially so in Python 2, which is why it was made the default string type in Python 3.
If all you're concerned about is the ñ, you should decode in UTF-8, and then just replace the one character.
That would look something like the following:
DF['name'] = DF['name'].str.decode('utf-8').str.replace(u'\xf1', 'n')
As an example:
>>> "sureño".decode("utf-8").replace(u"\xf1", "n")
u'sureno'
If your string is already Unicode, then you can (and actually have to) skip the decode step:
>>> u"sureño".replace(u"\xf1", "n")
u'sureno'
Note here that u'\xf1' uses the hex escape for the character in question.
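You can confirm the code point yourself with ord:
>>> hex(ord(u'ñ'))
'0xf1'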
Update
I was informed in the comments that <>.str.replace is a pandas Series method, which I hadn't realized. Note that the .str accessor is needed before replace as well; a bare Series.replace only matches whole cell values, not substrings. An equivalent approach is to map a plain-string function over the column:
DF['name'] = DF['name'].map(lambda x: x.decode('utf-8').replace(u'\xf1', 'n'))
or something along those lines.
Another update
It actually just occurred to me that your issue may be as simple as the following:
DF['NAME']=DF['NAME'].str.replace(u"ñ","n")
Note how I've added the u in front of the string to make it unicode.
You can use the replace function to swap a special character for a different value of your choice in the following way.
If your dataframe is df and you need to apply it to every column that holds strings (in my case I am doing it for "\n"):
df = df.applymap(lambda x: x.replace("\n", " ") if isinstance(x, str) else x)  # skip non-string cells

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to Unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it back in? Is there a way I can refer to it as a code, perhaps? I can't get Python to tell me what it actually IS. All I ever see, if I print it or call unicode(special_eol), is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
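If you want a human-readable name rather than an escape code, the unicodedata module can name the character for you. As an example (U+2028 LINE SEPARATOR is a common culprit in PDF extractions, though yours may be something else):
>>> import unicodedata
>>> print unicodedata.name(u'\u2028')
LINE SEPARATOR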

Unexpected read_csv result with \W+ separator

I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried to read the file line by line, create a Series from row.split(), and append the Series to a DataFrame, but that appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being a letter, digit, or underscore); see the re docs. Hence the strange results. I think you meant to use whitespace: '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but using a regex in read_csv can sometimes be troublesome. If you want to specify a literal separator, use " " instead. But I'd advise first trying Andy Hayden's suggestion: delim_whitespace=True. It works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html
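A rough sketch of both options (the file name and header list are placeholders for your own):
import pandas as pd

header = ['col1', 'col2', 'col3']
# r'\s+' matches runs of whitespace, including the leading spaces
data = pd.read_csv('file.txt', names=header, sep=r'\s+')
# equivalently, let pandas split on whitespace itself
data = pd.read_csv('file.txt', names=header, delim_whitespace=True)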

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC, however this now results in a quote mark around every field and I don't know why.
If I try one of the other quoting options like QUOTE_MINIMAL, I end up with an error message regarding the date value, 2008-01-09, not being a float.
I have tried creating a dialect and adding the quoting on the csv reader and writer, but nothing I have tried results in an exact match to the original data.
Has anyone had this same problem and found a solution?
When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader turns every row it reads into a list of strings (if you read the documentation carefully, you'll see that a reader does not perform automatic data type conversion!).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes... because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
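A rough sketch of that conversion step (coerce is a hypothetical helper; adjust it to your actual column types):
def coerce(field):
    # try to restore numeric types so QUOTE_NONNUMERIC leaves them unquoted
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

converted = [coerce(f) for f in row]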
Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.
Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
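For instance, a small Python 3 sketch of the default QUOTE_MINIMAL behaviour (the sample row is made up):
import csv, io

row = ['BYRNESS RAW', 'HAS,COMMA', '15']
buf = io.StringIO()
csv.writer(buf).writerow(row)  # default quoting=csv.QUOTE_MINIMAL
print(buf.getvalue())          # BYRNESS RAW,"HAS,COMMA",15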
