How to remove special characters in json data python - python

I am reading a set of data from a json file. Content of the json file looks like:
"Address":"4820 ALCOA AVE� ",
"City":"VERNON� "
As you can see, it contains a special character � and whitespace at the end. When I read this JSON data, it comes through as:
'address': '4820 ALCOA AVE� '
'city': 'VERNON� '
I can remove the whitespace easily, but I am not sure how to remove the �. I do not have direct access to the JSON file, so I cannot edit it, and even if I had access, it would take a couple of hours to edit the file. Is there any way in Python to remove these special characters? Please help. Thanks

You can use a regular expression that keeps only printable ASCII characters:
import re
address = re.sub(r"[^\x20-\x7E]", "", "4820 ALCOA AVE� ")
print(address)

Looks like somewhere upstream wasn't handling character encoding properly and ended up with replacement characters... You may need to keep an eye out in case it mangled more important parts of the text (eg. accented characters, non-English letters, emoji).
For the immediate problem, you can load the JSON data with the utf-8 encoding, then strip the character '\ufffd'.
value = value.strip().strip('\ufffd')
If the replacement characters also appear in the middle (and you want to delete them), you might want to use replace() instead.
value = value.replace('\ufffd', '').strip()
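The replace() approach can be applied to every string in the loaded JSON at once. The following is a minimal sketch, assuming the data is a nested structure of dicts, lists and strings; the helper name clean and the inline sample are illustrative and not part of the original data file.

```python
import json

REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character shown as "�"


def clean(value):
    """Recursively strip U+FFFD and surrounding whitespace from all strings."""
    if isinstance(value, str):
        return value.replace(REPLACEMENT_CHAR, "").strip()
    if isinstance(value, list):
        return [clean(item) for item in value]
    if isinstance(value, dict):
        return {key: clean(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


# Inline sample standing in for the JSON file from the question.
raw = '{"Address": "4820 ALCOA AVE\ufffd ", "City": "VERNON\ufffd "}'
record = clean(json.loads(raw))
print(record)
```

When reading from a real file, opening it with encoding='utf-8' and passing the file object to json.load should give the same kind of structure to feed into clean().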

Related

Add special characters in csv pandas python

While writing strings containing certain special characters, such as
Töölönlahdenkatu
using to_csv from pandas, the result in the csv looks like
T%C3%B6%C3%B6l%C3%B6nlahdenkatu
How do we get to write the text of string as it is? This is my to_csv command
df.to_csv(csv_path,index=False,encoding='utf8')
I have even tried
df.to_csv(csv_path,index=False,encoding='utf-8')
df.to_csv(csv_path,index=False,encoding='utf-8-sig')
and still no success. There are other characters replaced with random symbols
'-' to –
Is there a workaround?
What you're trying to do is remove German umlauts and Spanish tildes. There is an easy solution for that.
import unicodedata
data = u'Töölönlahdenkatu Adiós Pequeño'
english = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
print(english)
output : b'Toolonlahdenkatu Adios Pequeno'
Let me know if it works or if there are any edge cases.
Characters like ö are not stored in a CSV the same way English letters are: in UTF-8 each one takes more than one byte. The "random symbols" appear when a program such as Excel reads those bytes with the wrong encoding; a plain-text editor that assumes UTF-8 (VS Code, for instance) will show the characters correctly.
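As an aside on the output shown in the question: T%C3%B6%C3%B6l%C3%B6nlahdenkatu is what URL percent-encoding of UTF-8 bytes looks like. That suggests (this is an assumption, since the question does not show where the data comes from) that the strings were already percent-encoded before to_csv was called. If so, a minimal sketch of decoding them with the standard library:

```python
from urllib.parse import unquote

# Value exactly as it appeared in the CSV from the question.
encoded = "T%C3%B6%C3%B6l%C3%B6nlahdenkatu"

# unquote() decodes %XX escapes and interprets the bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)
```

If that matches the situation, applying unquote to the affected column before writing the CSV may be enough.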

How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

I am trying to get a string to use in the Google Geocoding API. I've checked a lot of threads but I am still facing the problem and I don't understand how to solve it.
I need addresse1 to be a string without any special characters. addresse1 is, for example: "32 rue d'Athènes Paris France".
addresse1= collect.replace(' ','+').replace('\n','')
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
Here I got a string without any accents... Oh no... It is not a string but bytes. So I did what was suggested and decoded it:
addresse1=addresse1.decode('utf-8')
But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? Or what don't I understand about Unicode? Or is there a better solution?
Thanks,
Stéphane.
With a third-party package, unidecode:
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
is there a better solution?
It depends what you are trying to do.
If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.
The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.
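The category-M approach mentioned above can be sketched as follows. This is only an illustration under the assumption that removing combining marks is all that is wanted; the function name strip_marks is made up for the example.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (Unicode categories Mn, Mc, Me) after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("32 rue d'Athènes Paris France"))
```

Unlike encode('ascii', 'ignore'), this leaves other non-ASCII characters (for example CJK text) in place.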
ETA:
url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode; the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:
params = dict(
    address=addresse1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)
(In Python 3 these are urllib.parse.quote and urllib.parse.urlencode, and they automatically choose UTF-8, so you don't have to encode manually there.)
data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)
json.loads accepts byte strings (from Python 3.6 onward), so you should be safe to omit the UTF-8 decode. In any case, json.load will read directly from a file-like object, so you shouldn't have to load the data into a string at all:
data3 = json.load(urllib.request.urlopen(url2))
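For Python 3, the URL-building step might look like the sketch below. The https:// scheme and the YOUR_API_KEY placeholder are assumptions added for illustration; no request is sent.

```python
from urllib.parse import urlencode

# Host and path as quoted in the question; the scheme is assumed.
base_url = "https://maps.googleapis.com/maps/api/geocode/json"

params = {
    "address": "32 rue d'Athènes Paris France",
    "key": "YOUR_API_KEY",  # hypothetical placeholder, not a real key
}

# urlencode() percent-encodes each value as UTF-8 and joins the pairs with '&'.
url2 = base_url + "?" + urlencode(params)
print(url2)
```

The accented è survives as %C3%A8 in the query string, so there is no need to strip it beforehand.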
Generally, there are two approaches: (1) regular expressions and (2) str.translate.
1) regular expressions
Decompose string and replace characters from the Unicode block \u0300-\u036f:
import unicodedata
import re
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)
It removes accents, circumflex, diaeresis, and so on:
pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι
For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.
2) str.translate
First, create a replacement table (case-sensitive) and then apply it.
repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word.translate(repl)
Multi-character replacements are made as follows:
repl = {
    ord("æ"): "ae",
    ord("œ"): "oe",
}
word.translate(repl)
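The two approaches can also be combined. The sketch below is one possible arrangement, not the answer's own code: translate handles ligatures that have no Unicode decomposition, and the NFD plus regex step removes the remaining combining marks. The names LIGATURES and remove_accents are illustrative.

```python
import re
import unicodedata

# Characters that NFD does not decompose need an explicit mapping.
LIGATURES = str.maketrans({"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"})


def remove_accents(word):
    word = word.translate(LIGATURES)
    word = unicodedata.normalize("NFD", word)
    # Drop everything in the Combining Diacritical Marks block.
    return re.sub("[\u0300-\u036f]", "", word)


print(remove_accents("pingüino"))
print(remove_accents("cœur"))
```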
I had a similar problem where I was generating tags that users might have to type with their phone.
Without using third-party packages, you can simplify bobince's answer above:
import unicodedata
collect = "32 rue d'Athènes Paris France"
unicode_collect = unicodedata.normalize('NFD', collect)
address1 = unicode_collect.encode('ascii', 'ignore').decode('utf-8')
address1:
"32 rue d'Athenes Paris France"
You can use the translate() method of Python strings.
Here's an example adapted from tutorialspoint.com (the original targets Python 2; in Python 3, maketrans is a static method of str and print is a function):
#!/usr/bin/python3
intab = "aeiou"
outtab = "12345"
trantab = str.maketrans(intab, outtab)  # map each vowel to a digit
text = "this is string example....wow!!!"  # avoid shadowing the built-in str
print(text.translate(trantab))
This outputs:
th3s 3s str3ng 2x1mpl2....w4w!!!
So you can define what characters you wish to replace more easily than with replace()

How to find non-ascii characters in file using Regular Expression Python

I have a string of characters that includes [a-z] as well as á,ü,ó,ñ,å,... and so on. Currently I am using regular expressions to get every line in a file that includes these characters.
Sample of spanishList.txt:
adan
celular
tomás
justo
tom
átomo
camara
rosa
avion
Python code (charactersToSearch comes from Flask's @application.route('/<charactersToSearch>')):
print (charactersToSearch)
#'átdsmjfnueó'
...
#encode
charactersToSearch = charactersToSearch.encode('utf-8')
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word))
...
When I do this, I am expecting to get the words in the text file that include the characters in charactersToSearch. It works perfectly for words without special characters:
...
#after doing further searching for other conditions, return list of found words.
return '<br />'.join(sorted(set(word for (word, path) in solve())))
>>> adan
>>> justo
>>> tom
Only problem is that it ignores all words in the file that aren't ASCII. I should also be getting tomás and átomo.
I've tried encode, UTF-8, using ur'[...], but I haven't been able to get it to work for all characters. The file and the program (# -*- coding: utf-8 -*-) are in utf-8 as well.
A different tack
I'm not sure how to fix it in your current workflow, so I'll suggest a different route.
This regex will match characters that are neither white-space characters nor letters in the extended ASCII range, such as A and é. In other words, if one of your words contains a weird character that is not part of this set, the regex will match.
(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S
Of course this will also match punctuation, but I'm assuming that we're only looking at words in an unpunctuated list. otherwise, excluding punctuation is not too hard.
As I see it, your challenge is to define your set.
In Python, you can do something like:
import re
if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
    pass  # Successful match: subject contains a character outside the set
else:
    pass  # Match attempt failed: every character is in the set
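To see what the pattern flags, here is a small Python 3 sketch run over a made-up word list (the list is hypothetical; only the pattern comes from the answer).

```python
import re

# Pattern copied from the answer: any non-space character that is not an
# accepted letter in a-z or À-ÿ.
ODD_CHARACTER = re.compile(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S")

words = ["tomás", "átomo", "tom", "tom#", "日本"]  # hypothetical sample

# Keep the words that contain at least one character outside the set.
flagged = [word for word in words if ODD_CHARACTER.search(word)]
print(flagged)
```

Words made only of a-z and accented Latin-1 letters pass through; anything else is flagged.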
I feel your pain. Dealing with Unicode in Python 2.x is a headache.
The problem with that input is that Python sees "á" as the raw byte string '\xc3\xa1' (its UTF-8 encoding) instead of the Unicode character u'\xe1'. So you're going to need to sanitize the input before passing the string into your regex.
To change a raw byte string to a Unicode string:
char = "á"
## In Python 2, repr(char) shows the raw bytes '\xc3\xa1',
## which is probably what the regex is not registering.
new_unicode_string = char.decode('utf-8')  # u'\xe1'
Decoding has to happen before the string reaches the regex, but decode does it in one step, so there is no need to walk over the bytes by hand.
Dunno though, not an expert.
I used something similar to this to isolate the special-character words when I parsed Wiktionary, which was a wicked mess. As far as I can tell you're going to have to comb through that to clean it up anyway, so you may as well just:
for word in file:
    try:
        word.encode('UTF-8')
    except UnicodeDecodeError:
        your_list_of_special_char_words.append(word)
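The loop above relies on Python 2's implicit ASCII decoding of byte strings. In Python 3 a comparable check could be sketched like this, with an inline list standing in for the file and an illustrative helper name is_ascii:

```python
words = ["adan", "celular", "tomás", "justo", "tom", "átomo"]


def is_ascii(word):
    """Return True if every character of word can be encoded as ASCII."""
    try:
        word.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


special_char_words = [word for word in words if not is_ascii(word)]
print(special_char_words)
```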
Hope this helped, and good luck!
On further research found this post:
Bytes in a unicode Python string
I was able to figure out the issue. After getting the string from the Flask app route, encode it (otherwise it gives you an error), then decode charactersToSearch and each word in the file.
charactersToSearch = charactersToSearch.encode('utf-8')
Then decode it as UTF-8. If you leave the previous line out, it gives you an error.
UNIOnlyAlphabet = charactersToSearch.decode('UTF-8')
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match
Lastly, when reading the UTF-8 file and using query, don't forget to decode each word in the file.
words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))
That should do it. Now the results show regular and special characters.
justo
tomás
átomo
adan
tom
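In Python 3, where str is already Unicode, the encode/decode round trip is not needed. The following sketch shows the same query; the character set is a hypothetical value chosen to reproduce the result above, and an in-memory buffer stands in for open('spanishList.txt', encoding='utf-8').

```python
import io
import re

characters_to_search = "adnjustomáó"  # hypothetical route parameter

# Stand-in for the UTF-8 word file from the question.
word_file = io.StringIO(
    "adan\ncelular\ntomás\njusto\ntom\nátomo\ncamara\nrosa\navion\n"
)

# re.escape guards against characters such as ']' or '^' in user input.
query = re.compile("[" + re.escape(characters_to_search) + "]{2,}$").match

words = set(word.rstrip("\n") for word in word_file if query(word))
print(sorted(words))
```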

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, then the regexes are removing the öäå, right? I mean, the only thing that should be left of the hex char after the regex is x, but there is no x in place of öäå on my page, so this little theory may not be correct. Anyway, whether it's right or wrong, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
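The snippet above is Python 2. In Python 3, str patterns are Unicode-aware by default, so \w matches öäå without any flag. A short sketch with a made-up sample string:

```python
import re

location = "Malmö, Västerås!"  # hypothetical sample text

# No re.U needed: for str patterns, \w already covers Unicode letters.
cleaned = re.sub(r"[^\w]+", "", location)
print(cleaned)
```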
It would help if you could dump the strings before and after each step.
Check your value of re.UNICODE first, see this

Writing Escape Characters to a Csv File in Python

I'm using the csv module in python and escape characters keep messing up my csv's. For example, if I had the following:
import csv
rowWriter = csv.writer(open('bike.csv', 'w'), delimiter = ",")
text1 = "I like to \n ride my bike"
text2 = "pumpkin sauce"
rowWriter.writerow([text1, text2])
rowWriter.writerow(['chicken','wings'])
I would like my csv to look like:
I like to \n ride my bike,pumpkin sauce
chicken,wings
But instead it turns out as
I like to
ride my bike,pumpkin sauce
chicken,wings
I've tried combinations of quoting, doublequote, escapechar and other parameters of the csv module, but I can't seem to make it work. Does anyone know what's up with this?
*Note - I'm also using codecs encode("utf-8"), so text1 really looks like "I like to \n ride my bike".encode("utf-8")
The problem is not with writing them to the file. The problem is that \n is a line break when inside '' or "". What you really want is either 'I like to \\n ride my bike' or r'I like to \n ride my bike' (notice the r prefix).
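When the text arrives at runtime rather than as a literal, the r prefix is not available, and the newline has to be escaped explicitly. A minimal Python 3 sketch, writing to an in-memory buffer instead of bike.csv so it can be run anywhere:

```python
import csv
import io

text1 = "I like to \n ride my bike"  # contains a real line break
text2 = "pumpkin sauce"

# Escape backslashes first so the transformation stays reversible,
# then turn the line break into the two characters '\' and 'n'.
escaped = text1.replace("\\", "\\\\").replace("\n", "\\n")

buffer = io.StringIO()
row_writer = csv.writer(buffer, delimiter=",")
row_writer.writerow([escaped, text2])
row_writer.writerow(["chicken", "wings"])

print(buffer.getvalue())
```

Whatever reads the file later would need to reverse the same escaping to recover the original line break.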
Firstly, it is not obvious why you want r"\n" (two bytes) to appear in your file instead of "\n" (one byte). What is the consumer of the output file meant to do? Use ast.literal_eval() on each input field? If your actual data contains any of (non-ASCII characters, apostrophes, quotes), then I'd be very wary of serialising it using repr().
Secondly, you have misreported either your code or your output (or both). The code that you show actually produces:
"I like to
ride my bike",pumpkin sauce
chicken,wings
Thirdly, about your "I like to \n ride my bike".encode("utf-8"): str_object.encode("utf-8") is absolutely pointless if str_object contains only ASCII bytes -- it does nothing. Otherwise it raises an exception.
Fourthly, this comment:
I don't need to call encode anymore, now that I'm using the raw
string. There are a lot of unicode characters in the text that I am
using, so before I started using the raw string I was using encode so
that csv could read the unicode text
doesn't make any sense -- as I've said, "ascii string".encode('utf8') is pointless.
Consider taking a step or two backwards and explaining what you are really trying to do: where does your data come from, what's in it, and, most importantly, what is the process that reads the file going to do with it?
