Python: encoding issues - comparing two strings with different encoding

I have some problems with encoding in Python. I've searched for an answer for a couple of hours now and still have no luck.
I am currently working in a Jupyter notebook with pandas DataFrames.
Long story short - in a DataFrame column I have different strings, each a single letter of the alphabet. I wanted to apply a function to this column that converts the letters to numbers based on a specific key, but I got an error every time I tried. When I dug into the reason behind this, I realised that I have two strings that both look like 'T', yet they are not equal:
string1.encode() = b'T'
string2.encode() = b'\xd0\xa2'
How can I standardize/encode/decode/modify all the strings so they share the same encoding, letting me compare them and operate on them? What is the easiest way to achieve that?
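For context, b'\xd0\xa2' is the UTF-8 encoding of the Cyrillic capital letter Te (U+0422), which renders identically to the Latin 'T'. A minimal sketch of one way to normalize such lookalikes before comparing, assuming the offending characters are Cyrillic (the mapping table below is an assumption and would need to cover whichever letters actually appear in the data):

import pandas as pd

# Hypothetical table mapping Cyrillic lookalike letters to their Latin twins
LOOKALIKES = str.maketrans({'\u0422': 'T', '\u0410': 'A', '\u0415': 'E',
                            '\u041e': 'O', '\u0420': 'P', '\u0421': 'C'})

df = pd.DataFrame({'letter': ['T', '\u0422', 'A']})      # example data
df['letter'] = df['letter'].str.translate(LOOKALIKES)    # normalise the column

print(df['letter'].map({'T': 1, 'A': 2}))                # the key now applies to every row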

Related

Obtain strings that share a common part in a CSV

Basically, I need to obtain the strings in a DataFrame (read from a CSV) that have one part in common - the last two characters. I am using Python.
I need some help with that.
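A minimal sketch of one way to do this, assuming the file is data.csv and the column is called name (both hypothetical): take the last two characters of each value and group on them.

import pandas as pd

df = pd.read_csv('data.csv')           # hypothetical file and column names
df['suffix'] = df['name'].str[-2:]     # last two characters of each string

for suffix, group in df.groupby('suffix'):
    print(suffix, group['name'].tolist())   # strings sharing the same ending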

Easiest way to get character into ASCII-number into binary format in Python

I know this question has already been answered in a huge variety of questions. Although I have read a lot here on Stack Overflow, I can't find the best way to achieve the following:
I parse through a string and get a list of different substrings. Then I want to go through each substring and replace the characters with their binary representation.
However, I don't know how to avoid losing the leading zeroes. I could use this, but I really don't know how to get rid of the binary indicator in front. The goal is to just get a huge array of ones and zeroes.
>>> format(14, '#010b')
'0b00001110'
You can use the built-in function bin(n), replacing n with your number, to get the bit representation of that number. If you want to remove the "0b" before the bits, you can do bin(n)[2:] using string slicing.
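Note that bin() drops leading zeroes, which is exactly what the question wants to keep. A short sketch using format with an explicit width, which avoids both the "0b" prefix and the lost zeroes (the 8-bit width is an assumption):

text = "Hi"
bits = ''.join(format(ord(c), '08b') for c in text)   # 8 bits per character, no "0b" prefix
print(bits)   # 0100100001101001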

Is there a better way to collect alternating symbols from a string and combine them in Python?

I just started learning Python and came across slicing in Python 3. I'm working on a problem where I need to collect the alternating symbols from a given string, then combine them with the symbols that remain after the alternating ones are removed, and finally join both parts.
I tried the approach below. Is there a way to make this code faster?
The given string is 1n2a3m4e5i$s and the expected output is nameis,12345$.
str="1n2a3m4e5i$s" # this is given to you
str1=str[0::2]
str2=str[1::2]
print(str2+","+str1) # this is required at the end
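For what it's worth, extended slicing is already one of the fastest pure-Python ways to do this; a compact equivalent of the same logic is:

s = "1n2a3m4e5i$s"
print(",".join((s[1::2], s[0::2])))   # nameis,12345$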

Python Idle column limitation

I am trying to use IDLE to work with some data, and my problem is that when there are too many columns, some of them are omitted from the output and replaced with dots. Is there a way to increase the limit set by the IDLE IDE? I have seen sentdex using IDLE with up to 11 columns, all of them displayed, hence my question.
Thank you very much for your responses.
What type are you printing? Some types have a reduced representation that their str conversion produces by default, to avoid flooding the terminal. You can get some of them to produce their full representation by applying repr to them and then printing the result.
This doesn't work for dataframes. They have their own adjustable row and column limits. See Pretty-print an entire Pandas Series / DataFrame
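For pandas, those limits can be raised through display options; a minimal sketch using the standard pandas option names:

import pandas as pd

pd.set_option('display.max_columns', None)   # show every column instead of eliding with ...
pd.set_option('display.max_rows', None)      # show every row
pd.set_option('display.width', None)         # let pandas auto-detect the output width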

Mapping ID's to Names and Removing Duplicates Algorithmic Development

I have a column in a CSV file that contains names, where any cell in that column could be a slightly misspelled version of another cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from each cell if that value occurs more than 2 times in the column as a prefix or suffix. I then compared each row to the next after sorting the column alphabetically.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they are not detected as such. These cases are:
a) Not exact matches (and not off just by capitalization)
b) Not caught by comparing the stem (without prefix and without suffix) to its alphabetical neighbour
My Current Questions
1) Does anyone have experience mapping IDs to names from all over the world (so accents, Unicode and all that are an issue here too, although I managed to solve most of those Unicode issues) and have good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I can see why they are duplicates. In one instance there is a period in the middle of a value that is not present in its period-free sibling cell. Is one good strategy simply to create an extra column and output the cell values that I suspect of being duplicates, based on the few instances where the cause is clear?
3) How do I check myself? One way is to flag the maximum number of potential duplicates and look over all of them manually. Unfortunately, the size of our dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove the international characters, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance, e.g. the Fuzzy package at http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
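A minimal sketch of that approach using only the standard library; difflib's SequenceMatcher stands in here for a dedicated Soundex/Levenshtein library, and whatever match threshold you apply to the ratio is your own assumption:

import unicodedata
from difflib import SequenceMatcher

def transliterate(name):
    # Decompose accented characters, drop the combining marks, then lowercase
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(c for c in decomposed if not unicodedata.combining(c)).lower()

def similarity(a, b):
    # Ratio in [0, 1]; higher means the two names are more alike
    return SequenceMatcher(None, transliterate(a), transliterate(b)).ratio()

print(similarity("Nike Inc.", "Nike inc"))   # close to 1.0
print(similarity("Nike", "Adidas"))          # much lower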
