Extracting words using NLTK from German text - Python

I am trying to extract words from a German document. When I use the following method, as described in the NLTK tutorial, I fail to get the words with language-specific special characters.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))
What should I do to get the list of words in the document?
An example with nltk.tokenize.WordPunctTokenizer() for the German phrase "Veränderungen über einen Walzer" looks like this:
In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")
Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']
In this example "ä" is treated as a delimiter, even though "ü" is not.

Call PlaintextCorpusReader with the parameter encoding='utf-8':
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
Edit: I see... you have two separate problems here:
First problem (tokenization): When you test with a German string literal, you think you are entering unicode. In fact you are telling Python to take the bytes between the quotes and convert them into a unicode string, and your bytes are being misinterpreted. Fix: add the following line at the very top of your source file.
# -*- coding: utf-8 -*-
All of a sudden your constants will be seen and tokenized correctly:
german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
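The output should then look roughly like this (note the real umlaut code points \xe4 and \xfc instead of the broken \xc3 bytes from before):
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']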
Second problem: It turns out that Text() does not use unicode! If you
pass it a unicode string, it will try to convert it to a pure-ascii
string, which of course fails on non-ascii input. Ugh.
Solution: My recommendation would be to avoid using nltk.Text entirely, and work with the corpus readers directly. (This is in general a good idea: See nltk.Text's own documentation).
But if you must use nltk.Text with German data, here's how: Read your
data properly so it can be tokenized, but then "encode" your unicode back to a list of str. For German, it's
probably safest to just use the Latin-1 encoding, but utf-8 seems to work
too.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
# Convert unicode to utf8-encoded str
coded = [ tok.encode('utf-8') for tok in ptcr.words(DocumentName) ]
words = nltk.Text(coded)
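With the tokens re-encoded as byte strings, the usual nltk.Text conveniences should work again, for example (assuming the token 'Walzer' actually occurs in the document):
words.concordance('Walzer')   # keyword-in-context display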

Take a look at http://text-processing.com/demo/tokenize/
I'm not sure your text is getting the right encoding, since WordPunctTokenizer in the demo handles the words fine. And so does PunktWordTokenizer.

You might try a simple regular expression. The following suffices if you want just the words; it will swallow all punctuation:
>>> import re
>>> re.findall("\w+", "Veränderungen über einen Walzer.".decode("utf-8"), re.U)
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']
Note that re.U changes the meaning of \w in the RE: with it, \w is based on the Unicode character database rather than just ASCII, so the umlauts are treated as word characters. (re.L is the flag that depends on the current locale.)
Also note that "Veränderungen über einen Walzer".decode("utf-8") and u"Veränderungen über einen Walzer" are not necessarily the same string; it depends on how your source encoding is declared (see the coding-declaration fix above).

Related

Special characters in an XML file with Python 2.7

I have several strings like this:
"Programa Directrices de Gesti\xc3\xb3n Tur\xc3\xadstica"
That I should store in an xml file in this way
<content><![CDATA[Programa Directrices de Gestión Turística]]></content>
I use this code:
from xml.dom import minidom
data_cdata = doc.createCDATASection(text)
cdv = doc.createElement(tag)
cdv.appendChild(data_cdata)
root.appendChild(cdv)
doc.appendChild(root)
but the output is:
<content><![CDATA["Programa Directrices de Gesti\xc3\xb3n Tur\xc3\xadstica]]></content>
How can I do this?
(Sorry for my English.)
Python does not represent characters outside of the ASCII range the way you would like here. The escape sequences \xc3\xb3 and \xc3\xad are the hexadecimal byte values of the UTF-8 encoding of each character: ó and í.
It seems your code does not translate the special characters very well. Instead of writing the actual ó and í, it writes their byte representations: \xc3\xb3 and \xc3\xad. I know nothing of the library you use, but I would look at the appendChild call for a quick fix regarding the translation. If you cannot find one, you could perhaps iterate over the text and turn the special characters into regular letters ("ó" into "o").
I wish I could be of more help :).
Good luck,
Jesper
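For reference, a minimal sketch of the likely fix, assuming Python 2 and UTF-8 byte strings (the element names below are placeholders): decode the text to unicode before handing it to minidom, then serialize with an explicit encoding.
# -*- coding: utf-8 -*-
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement('root')            # placeholder root element
doc.appendChild(root)

raw = "Programa Directrices de Gesti\xc3\xb3n Tur\xc3\xadstica"   # UTF-8 bytes
text = raw.decode('utf-8')                                         # -> unicode

cdv = doc.createElement('content')
cdv.appendChild(doc.createCDATASection(text))
root.appendChild(cdv)

print doc.toxml(encoding='utf-8')
# ...<content><![CDATA[Programa Directrices de Gestión Turística]]></content>...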

How to remove accents in Python 3.5 and get a string, with unicodedata or other solutions?

I am trying to get a string to use in the Google geocoding API. I've checked a lot of threads, but I am still facing a problem and I don't understand how to solve it.
I need addresse1 to be a string without any special characters. Addresse1 is for example: "32 rue d'Athènes Paris France".
addresse1= collect.replace(' ','+').replace('\n','')
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
Here I get a string without any accents... oh no, it is not a str but bytes. So I did what was suggested and decoded:
addresse1=addresse1.decode('utf-8')
But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? What don't I understand about unicode? Or is there a better solution?
Thanks,
Stéphane.
With the third-party package unidecode:
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
is there a better solution?
It depends what you are trying to do.
If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M (see the sketch below).
If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.
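A minimal sketch of that first option in Python 3 (the helper name strip_marks is just illustrative):
import unicodedata

def strip_marks(s):
    # NFKD-decompose, then drop every character in a Mark category (Mn, Mc, Me).
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.category(c).startswith('M'))

print(strip_marks("32 rue d'Athènes Paris France"))   # 32 rue d'Athenes Paris France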
The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.
ETA:
url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode; the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:
params = dict(
    address=address1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)
(In Python 3 it's urllib.parse.quote and urllib.parse.urlencode, and they automatically choose UTF-8 so you don't have to encode manually there.)
data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)
json.loads also accepts UTF-8 byte strings (in Python 2, and in Python 3.6+), so you can often omit the decode. In any case, json.load will read directly from a file-like object, so you shouldn't have to load the data into a string at all:
data3 = json.load(urllib.request.urlopen(url2))
Generally, there are two approaches: (1) regular expressions and (2) str.translate.
1) regular expressions
Decompose string and replace characters from the Unicode block \u0300-\u036f:
import unicodedata
import re
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)
It removes accents, circumflex, diaeresis, and so on:
pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι
For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.
2) str.translate
First, create a replacement table (case-sensitive), and then apply it.
repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word.translate(repl)
Multi-character replacements are made as follows:
repl = {
    ord("æ"): "ae",
    ord("œ"): "oe",
}
word.translate(repl)
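For example, applying such a table (the sample words are just illustrative):
>>> repl = {ord("æ"): "ae", ord("œ"): "oe", ord("ü"): "u"}
>>> "pingüino œuf".translate(repl)
'pinguino oeuf'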
I had a similar problem where I was generating tags that users might have to type with their phone.
Without using third-party packages, you can simplify bobince's answer above:
collect = "32 rue d'Athènes Paris France"
unicode_collect = unicodedata.normalize('NFD', collect)
address1 = unicode_collect.encode('ascii', 'ignore').decode('utf-8')
address1:
"32 rue d'Athenes Paris France"
You can use the translate() method from python.
Here's an example adapted from tutorialspoint.com (note that it is Python 2 code):
#!/usr/bin/python
from string import maketrans   # Required to call the maketrans function.

intab = "aeiou"
outtab = "12345"
trantab = maketrans(intab, outtab)

s = "this is string example....wow!!!"
print s.translate(trantab)
This outputs:
th3s 3s str3ng 2x1mpl2....w4w!!!
So you can define which characters you wish to replace more easily than with replace().
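In Python 3 (which the question targets), str.maketrans is a built-in str method and the same idea applies directly to accented characters; a quick sketch covering only the characters in the question's example:
trantab = str.maketrans("èéàâç", "eeaac")
print("32 rue d'Athènes Paris France".translate(trantab))
# 32 rue d'Athenes Paris France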

How to find non-ASCII characters in a file using regular expressions in Python

I have a string of characters that includes [a-z] as well as á,ü,ó,ñ,å,... and so on. Currently I am using regular expressions to get every line in a file that includes these characters.
Sample of spanishList.txt:
adan
celular
tomás
justo
tom
átomo
camara
rosa
avion
Python code (charactersToSearch comes from flask #application.route('/<charactersToSearch>')):
print (charactersToSearch)
#'átdsmjfnueó'
...
#encode
charactersToSearch = charactersToSearch.encode('utf-8')
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word))
...
When I do this, I am expecting to get the words in the text file that include the characters in charactersToSearch. It works perfectly for words without special characters:
...
#after doing further searching for other conditions, return list of found words.
return '<br />'.join(sorted(set(word for (word, path) in solve())))
>>> adan
>>> justo
>>> tom
Only problem is that it ignores all words in the file that aren't ASCII. I should also be getting tomás and átomo.
I've tried encode, UTF-8, using ur'[...], but I haven't been able to get it to work for all characters. The file and the program (# -*- coding: utf-8 -*-) are in utf-8 as well.
A different tack
I'm not sure how to fix it in your current workflow, so I'll suggest a different route.
This regex will match characters that are neither white-space characters nor letters in the extended ASCII range, such as A and é. In other words, if one of your words contains a weird character that is not part of this set, the regex will match.
(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S
Of course this will also match punctuation, but I'm assuming that we're only looking at words in an unpunctuated list. Otherwise, excluding punctuation is not too hard.
As I see it, your challenge is to define your set.
In Python, you can do something like this:
if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
    pass   # Successful match
else:
    pass   # Match attempt failed
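A quick illustration of what the pattern flags, with a few made-up sample words (Python 2, assuming the usual # -*- coding: utf-8 -*- declaration at the top of the file):
import re

for subject in [u"adan", u"tomás", u"curaçao!"]:
    print subject, bool(re.search(ur"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject))
# adan False       (plain ASCII letters only)
# tomás False      (accented letters are inside the extended set)
# curaçao! True    (the "!" falls outside the letter set)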
I feel your pain. Dealing with Unicode in Python 2.x is a headache.
The problem with that input is that Python sees "á" as the raw byte string '\xc3\xa1' instead of the unicode character u'\xe1'. So you're going to need to sanitize the input before passing the string into your regex.
To change a raw byte string to a unicode string:
char = "á"
## print char yields the infamous, and in python unparsable, "\xc3\xa1",
## which is probably what the regex is not registering.
bytes_in_string = [byte for byte in char]
string = ''.join([str(hex(ord(byte))).strip('0x') for byte in bytes_in_string])
new_unicode_string = unichr(int(string, 16))
There's probably a better way, because this is a lot of operations to get something ready for regex, which I think is supposed to be faster in some way than iterating & 'if/else'ing.
Dunno though, not an expert.
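For what it's worth, under the same assumption of a UTF-8 byte string in Python 2, a single decode call does the whole conversion:
char = "á"                      # the raw bytes '\xc3\xa1'
u_char = char.decode('utf-8')   # u'\xe1', i.e. the single character á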
I used something similar to this to isolate the special-char words when I parsed wiktionary, which was a wicked mess. As far as I can tell you're going to have to comb through that to clean it up anyway, so you may as well just:
for word in file:
    try:
        word.encode('UTF-8')
    except UnicodeDecodeError:
        your_list_of_special_char_words.append(word)
Hope this helped, and good luck!
On further research I found this post:
Bytes in a unicode Python string
I was able to figure out the issue. After getting the string from the Flask app route, encode it (otherwise it gives you an error), and then decode charactersToSearch and each word in the file.
charactersToSearch = charactersToSearch.encode('utf-8')
Then decode it as UTF-8. If you leave the previous line out, it gives you an error.
UNIOnlyAlphabet = charactersToSearch.decode('UTF-8')
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match
Lastly, when reading the UTF-8 file and using query, don't forget to decode each word in the file.
words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))
That should do it. Now the results show regular and special characters.
justo
tomás
átomo
adan
tom

Python regular expression with UTF-8 issue

I have a file which includes many lines of plain UTF-8 text, such as the line below (it's Chinese).
PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08
The file itself is saved in UTF-8 format; the file name is xx.txt.
Here is my Python code (the environment is Python 2.7):
#coding: utf-8
import re

pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()
The problem is that I get no results.
I want to get the decimal string out of 交易金额:0.01元, which here is 0.01.
Why doesn't this code work? Can anyone explain it to me? I have no clue whatsoever.
There are several issues with your code. First, you should use re.compile(ur'<unicode string>') so the pattern itself is unicode. It is also nice to add the re.UNICODE flag (not sure if it is really needed here, though). Next, you will still not receive a match, since \d+ doesn't handle decimals, just a series of digits; you should use \d+\.?\d+ instead (you want a number, probably a dot, and a number). Example code:
#coding: utf-8
import re

text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08"
pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE)
print pattern.search(text).group(1)
You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
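To make the difference concrete with the text above:
print pattern.match(text)             # None: the line does not start with 交易金额
print pattern.search(text).group(1)   # 0.01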
If you use utf-8, you can use flags=re.LOCALE
#coding: utf-8
import re

pattern = re.compile(r'交易金额:(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    match = pattern.match(line)
For more details, see re.LOCALE. With this approach there is no need to convert the UTF-8 bytes to unicode.

What's a good way to replace international characters with their base Latin counterparts using Python?

Say I have the string "blöt träbåt", which has a few a and o with umlaut and ring above. I want it to become "blot trabat" as simply as possible. I've done some digging and found the following method:
import unicodedata
unicode_string = unicodedata.normalize('NFKD', unicode(string))
This will give me the string in unicode format with the international characters split into base letter and combining character (\u0308 for umlauts.) Now to get this back to an ASCII string I could do ascii_string = unicode_string.encode('ASCII', 'ignore') and it'll just ignore the combining characters, resulting in the string "blot trabat".
The question here is: is there a better way to do this? It feels like a roundabout way, and I was thinking there might be something I don't know about. I could of course wrap it up in a helper function, but I'd rather check if this doesn't exist in Python already.
It would be better if you created an explicit table, and then used the unicode.translate method. The advantage would be that transliteration is more precise, e.g. transliterating "ö" to "oe" and "ß" to "ss", as should be done in German.
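A minimal sketch of such a table in Python 2 (the mappings shown are only examples and would need extending for real data):
# -*- coding: utf-8 -*-
table = {
    ord(u'ä'): u'ae', ord(u'ö'): u'oe', ord(u'ü'): u'ue',
    ord(u'ß'): u'ss', ord(u'å'): u'a',
}
print u"blöt träbåt".translate(table)
# bloet traebat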
There are several transliteration packages on PyPI: translitcodec, Unidecode, and trans.
