I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.
In the link below, two different modules are described to remove the accents:
What is the best way to remove accents in a Python unicode string?
The modules described are unicode and unicodedata. To me it's not clear what the differences are between the two and a comparison is hard, because I don't have many rows with accents and I don't know what accents might be replaced and which ones are not.
Therefore, I would like to know what the differences are between the two and which one is recommended to use.
There is only one module here: unicodedata, which exposes the Unicode database, i.e. the names and properties of Unicode code points.
unicode was a built-in function in Python 2. It only converted strings to unicode strings, so it just handled the encoding and had no need to store all that data. In Python 3 all strings are Unicode (with some particularities); only the encoding now has to be specified explicitly.
In that answer you see only import unicodedata, so there is only one module. To remove accents you need not just the Unicode code points but also information about the type of each code point (whether it is a combining character), so you need unicodedata.
Maybe you mean unidecode. This is a separate module, outside the standard library. It can be useful in some cases. The module is simple and produces only ASCII output. That may be fine in some cases, but it can cause problems outside the Latin writing system.
On the other hand, unicodedata does nothing for you by itself. You have to understand Unicode and apply the right filter function (and perhaps know how other languages work).
So it depends on the case, and maybe you just need one of the other slug functions (to create strings that need no escaping). When working with languages, take care not to overdo things (you may build an offensive word).
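As a rough illustration of the unicodedata route, here is a minimal sketch. The function name strip_accents is made up for this example, and the approach (decompose, drop combining marks, recompose) is one common filter, not the only reasonable one.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks from text (hypothetical helper, one possible filter)."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so unrelated characters are not left decomposed.
    return unicodedata.normalize("NFC", kept)

print(strip_accents("Cr\u00e8me br\u00fbl\u00e9e"))  # Creme brulee
print(strip_accents("\u0141\u00f3d\u017a"))          # the L with stroke is kept
```

Letters that have no decomposition (such as the Polish L with stroke, or the German sharp s) pass through unchanged, which is the kind of case where unidecode behaves differently because it always maps to ASCII.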
Today, and on several other occasions, I received an error like this:
{TypeError}ufunc subtract cannot use operands with types dtype('<M8[us]') and dtype('O').
On other days, I'd want to do some printf type command and be at a loss for which character stood for some obtuse data type (e.g. signed octal value).
I always had a hard time finding the definitions of what I now know are called "type codes" or "Array-protocol type strings" in the first example, not to be confused with "printf-style String Formatting conversion characters" in the latter case. They are single characters in string literal quotes, so Googling them is a mess, and I could not search for synonyms of a term I didn't know. Maybe I'm just bad at RegEx and can't navigate man pages well enough, but I wanted to put up a possibly self-answered question in order to tag a bunch of synonyms for the things I was trying to find, which in the end turned out to be type codes. I knew I was looking for python or numpy data types, and was scouring the internet for dtype('<M8[us]') for the longest time, so I thought I'd help those who end up in a similar situation by providing a would-be online bookmark.
I had already read about various data types and this syntax in the past from various sources, knowing about the little-endian symbol '<', that '8' had something to do with the size, but would change depending on the dtype, but I had no idea what 'M' or '[us]' was defining. In my late night stupidity I looked over the numpy and python docs, but both for an earlier version than I had in my current env, and it looks like this 'M' did not appear until recently so I was left thinking all the tables in the docs were non-exhaustive and there was some other Unix or C based definition of all these type codes (which I still have not ruled out, but assume this is not the case now that I've found 'M' in my current Numpy version doc).
I will put the various resources that I've located regarding these various type codes in python and associated libraries here, but I'm sure there are plenty more, so would welcome others' additions/edits. I'll add all my links as an answer, and who knows, if others also found themselves in this situation, maybe I'll make a type code cheat sheet or something as a general resource online somewhere. Anyways, I think they'd be helpful to gather in a place tagged by a bunch of keywords that I was using trying to find them, to no avail like: python numpy data type shorthand definitions, python numpy dtype abbreviations, python array dtype codes, etc. If you have any other words that came to your mind when labeling these un-googleable terms, feel free to edit and add.
General notes:
Make sure you are reading the doc for the right version of python, numpy, etc.
The codes used depend on the use case (i.e. numpy array-protocol type strings are different than those used to define the types in general python arrays)
Even worse, some of the same characters are used to mean different things depending on the use case ('b' and 'B' for example if you compare numpy and python arrays, or 'd' if comparing python printf and array codes).
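To make the last note concrete, here is a small standard-library sketch showing the same letter meaning different things in the array module and in printf-style formatting. The variable names are only for illustration.

```python
from array import array

# In the array module, 'd' is a C double (floating point).
doubles = array("d", [1.5, 2.5])

# In printf-style formatting, 'd' is a signed decimal integer.
as_decimal = "%d" % 42

# 'o' is the octal conversion mentioned in the question.
as_octal = "%o" % 8

# In the array module, 'b' is a signed char and 'B' an unsigned char.
signed = array("b", [-1])
unsigned = array("B", [255])

print(doubles.typecode, as_decimal, as_octal, signed[0], unsigned[0])
```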
Numpy 1.17: Array-protocol type strings and the 'M' type
Python 3.8.0: printf conversion types
Python 3.8.0: Array type codes. Edit: This class is not used often, but I wanted it here for comparative and exhaustive reference.
Python 3.8.0 string formatter "mini language" syntax, aka "presentation types"
I won't reiterate the docs, even though my answer is primarily links, since I don't expect the docs to go down anytime soon. But for the main point of how I got here: 'M' stands for a datetime type in numpy, and '[us]' means microsecond resolution.
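For anyone who wants to decode such a string without hunting through the docs, numpy can do it for you. This is a minimal sketch that assumes numpy is installed; it only inspects the dtype and does not try to reproduce the original error.

```python
import numpy as np

# Parse the array-protocol type string from the error message.
dt = np.dtype("<M8[us]")

print(dt.kind)               # single-character kind code
print(dt.itemsize)           # size in bytes
print(dt.name)               # long, readable name
print(np.datetime_data(dt))  # (unit, count) of the datetime resolution
```

The readable name printed here is usually a much better search term than the short code.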
Python3 has unicode strings (str) and bytes. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really, we can use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages? John von Neumann himself once asked "why would you want more than machine language?"
It is the same with str and bytes. The type str, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface as a sequence of characters, so that we can manipulate text character-by-character without:
us having to write all the encoding and decoding logic ourselves, or
bloating the string interface with multiple sets of iterators, like each_byte and each_char.
As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
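A short sketch of what the two views look like in practice. The sample word and variable names are only for illustration.

```python
text = "na\u00efve"           # five characters, one of them non-ASCII
data = text.encode("utf-8")   # the same text as a byte sequence

# str is a sequence of characters; bytes is a sequence of integers 0..255.
print(len(text), len(data))
print(text[2], data[2])

# Converting back is a single call, as long as the encoding is known.
round_trip = data.decode("utf-8")
```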
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file, and I passed the "r" flag, the text would be read in the current locale by default, which would be a string, while passing the "rb" flag would create a series of bytes. Indexing the data is entirely different, and methods that take a str may be unsure of whether I am using bytes or a str. This gets worse since for ASCII data the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string with the optimal encoding for the data to preserve O(1) indexing (Latin-1, UCS-2, or UCS-4, depending on the data used, I believe).
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes"; however, str was treated as text by many users.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes and str forces coders to be conscious of their difference, to short-term dissatisfaction, but better code long-term. It's a conscious choice after years of experience: that forcing you to be conscious of the difference immediately will save you days in a debugger later.
Byte strings with different encodings are incompatible with each other, but until Python 3 there was nothing in the language to remind you of this fact. It turns out that mixing different character encodings is a surprisingly common problem in today's world, leading to far too many bugs.
Also it's often just easier to work with whole characters, without having to worry that you just modified a byte that accidentally rendered your 4-byte character into an invalid sequence.
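Here is a small sketch of the kind of mix-up described above, and of how Python 3 refuses to paper over it. The sample word is arbitrary.

```python
word = "caf\u00e9"
latin1 = word.encode("latin-1")   # b'caf\xe9'
utf8 = word.encode("utf-8")       # b'caf\xc3\xa9'

# Same text, different bytes: the two byte strings are not interchangeable.
same_bytes = latin1 == utf8

# Decoding with the wrong codec either produces mojibake...
mojibake = utf8.decode("latin-1")

# ...or fails outright.
try:
    latin1.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# Python 3 also refuses to concatenate text and bytes implicitly.
try:
    "abc" + b"def"
    mixing_failed = False
except TypeError:
    mixing_failed = True
```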
There are at least two reasons:
the str type has an important property "one element = one character".
the str type does not depend on encoding.
Just imagine how you would implement a simple operation like reversing a string (rword = word[::-1]) if word were a bytestring with some encoding.
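A quick sketch of that reversal, using UTF-8 as the assumed encoding for the byte string:

```python
word = "caf\u00e9"
rword = word[::-1]            # reverses characters

encoded = word.encode("utf-8")
reversed_bytes = encoded[::-1]  # reverses bytes, splitting the 2-byte character

try:
    reversed_bytes.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # The continuation byte now comes before its lead byte.
    still_valid = False
```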
In Python (2.7 and above, probably other versions too), it is possible to create a string that is centered by doing something like this:
'{:^10}'.format('abc')
The meaning of 'centered' is pretty clear when the total number of padding characters is even, but what about when it is odd?
When I print the above in vanilla CPython (and IPython), I get
'   abc    '
This appears to put the extra pad character on the right. However, the docs do not explicitly mention a spec for this behavior. Is the behavior of the centering format specifier in the presence of an odd number of padding characters specified somewhere, or is it an implementation detail that is not to be relied on?
You should be able to rely on this. I don't know that it is documented anywhere, but the standard python test suite asserts that the extra space is added on the right. Since test is part of the standard library, it's a good starting point for other python implementations and they'll be aiming for compliance with the reference implementation wherever possible.
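For reference, this is what CPython produces for a few odd-padding cases. It is a sketch of the observed behavior, not a statement about what the language reference guarantees.

```python
# 7 padding characters: 3 on the left, 4 on the right.
a = "{:^10}".format("abc")

# 3 padding characters: 1 on the left, 2 on the right.
b = "{:^6}".format("abc")

# A visible fill character makes the split easier to see.
c = "{:*^8}".format("abc")

print(repr(a), repr(b), repr(c))
```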
All the ascii characters can be represented by utf-8 (the first seven bits of space). Using exclusively utf-8 could simplify string handling greatly. Granted utf-8 is not a fixed length format and therefore has certain performance penalties with respect to ascii but I have the feeling python normally goes for pythonic before performance.
My question: Has it ever been addressed why python3 implements strings this way instead of utf-8 exclusively? Thereby not representing it as a bitstream with different representations but always with the utf-8 encoding.
I'm not looking for personal opinions from SO users but for PEPs or a transcript from the dictator addressing this very point.
From PEP 393:
Rationale
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).
One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.
For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced. If representations do share
data, it is also possible to omit structure fields, reducing the base
size of string objects.
If it is not clear from the above text:
We want the representation of most strings to be space efficient
We want efficient indexing whenever possible
We want to be compatible with all systems and provide all of Unicode on all systems
The result is that any single internal representation would fail at least one of these constraints.
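The effect of the flexible representation can be observed from Python itself. This sketch assumes CPython 3.3 or later, where PEP 393 is implemented; the helper name bytes_per_char is made up for this example, and other implementations may report different numbers.

```python
import sys

def bytes_per_char(ch):
    """Storage growth when one more copy of ch is appended (hypothetical helper)."""
    # Comparing two lengths cancels out the fixed object header.
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

print(bytes_per_char("a"))           # ASCII
print(bytes_per_char("\u00e9"))      # Latin-1 range
print(bytes_per_char("\u20ac"))      # rest of the BMP
print(bytes_per_char("\U0001f600"))  # outside the BMP
```

Each string pays only for the widest character it contains, and indexing stays O(1) because every character within one string has the same width.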
I have a set of strings. I would like to extract a regular expression that matches all these strings. Further, it should match preferably only these and not many others.
Is there an existing python module that does this?
www.google.com
www.googlemail.com/hello/hey
www.google.com/hello/hey
Then, the extracted regex could be www\.google(mail)?\.com(/hello/hey)?
(This also matches www.googlemail.com but I guess I need to live with it)
My motivation for this is in a machine learning setting. I would like to extract a regular expression that "best" represents all these strings.
I understand that regexes like
(www.google.com)|(www.googlemail.com/hello/hey)|(www.google.com/hello/hey) or
www.google(mail.com/hello/hey)|(.com)|(/hello/hey) would be right given my specification, because they match no other urls other than the given ones. But such a regex will become very large if there are large number of strings in the set.
There's a little Perl library that was designed to do this. I know you're using Python, but if it's a very large list of strings, you can fork off a Perl subprocess now and then (or copy the algorithm if you're sufficiently motivated).
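If you would rather stay in Python, one simple idea is to merge the strings into a prefix trie and emit a regex from it. The sketch below is only an illustration of that idea and is not claimed to be the algorithm the Perl library uses; build_trie and trie_to_regex are made-up names. It shares common prefixes only (not suffixes), and it recurses once per character, so very long strings would hit the recursion limit.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie into a pattern that matches exactly the stored words."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word ends at this node, everything after it is optional.
    return group + "?" if ends_here else group

urls = ["www.google.com",
        "www.googlemail.com/hello/hey",
        "www.google.com/hello/hey"]
pattern = re.compile(trie_to_regex(build_trie(urls)))
print(pattern.pattern)
```

Unlike the hand-written www\.google(mail)?\.com(/hello/hey)?, a pattern built this way does not match www.googlemail.com when used with fullmatch, because it accepts only the strings that were put into the trie.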