How to encode categorical values in Python

Given a vocabulary ["NY", "LA", "GA"],
how can one encode it in such a way that it becomes:
"NY" = 100
"LA" = 010
"GA" = 001
So if I do a lookup on "NY GA", I get 101

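You can use numpy.in1d (recent NumPy versions also provide np.isin, which gives the same result for 1-D input):
>>> import numpy as np
>>> xs = np.array(["NY", "LA", "GA"])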
>>> ''.join('1' if f else '0' for f in np.in1d(xs, 'NY GA'.split(' ')))
'101'
or:
>>> ''.join(np.where(np.in1d(xs, 'NY GA'.split(' ')), '1', '0'))
'101'

vocab = ["NY", "LA", "GA"]
categorystring = '0' * len(vocab)
selectedVocabs = 'NY GA'
for sel in selectedVocabs.split():
    # strings are immutable, so go through a list to set one position
    categorystring = list(categorystring)
    categorystring[vocab.index(sel)] = '1'
    categorystring = ''.join(categorystring)
This is the end result of my own testing. It turns out Python doesn't support string item assignment; somehow I thought it did.
Personally I think behzad's solution is better: numpy does a better job and is faster.
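Since strings do not support item assignment, one way around it is to keep the flags in a list for the whole loop and join them only once at the end. The following is a minimal sketch of that idea, not code from the answer; the function name encode_selection is made up for illustration, and it assumes the selected words are separated by whitespace.

```python
def encode_selection(vocab, selected_words):
    """Set one flag per selected word, then join the flags into a string once."""
    flags = ['0'] * len(vocab)          # lists are mutable, unlike str
    for word in selected_words.split():
        flags[vocab.index(word)] = '1'  # raises ValueError for unknown words
    return ''.join(flags)


vocab = ["NY", "LA", "GA"]
print(encode_selection(vocab, "NY GA"))  # prints 101
print(encode_selection(vocab, "LA"))     # prints 010
```

This does the list conversion once per call instead of once per selected word.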

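Or you can map each word to a power of ten:
vocabulary = ["NY", "LA", "GA"]
i = pow(10, len(vocabulary) - 1)
dictVocab = dict()
for word in vocabulary:
    dictVocab[word] = i
    i //= 10  # integer division keeps the codes as ints in Python 3
yourStr = "NY LA"
result = 0
for word in yourStr.split():
    result += dictVocab[word]
# result == 110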

Another solution using numpy. It looks like you're trying to binary encode a dictionary, so the code below feels natural to me.
import numpy as np
def to_binary_representation(your_str="NY LA"):
    xs = np.array(["NY", "LA", "GA"])
    ys = 2**np.arange(3)[::-1]  # bit weights: [4, 2, 1]
    lookup_table = dict(zip(xs, ys))
    return bin(np.sum([lookup_table[k] for k in your_str.split()]))
NumPy is not required for this, but it is probably faster when you have large arrays to work on. Without it, np.sum can be replaced by the builtin sum, and xs and ys by plain Python lists.
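The non-numpy variant described above could look roughly like this. It is a sketch rather than the answerer's code: the vocab parameter is an added assumption so that the vocabulary is not hard-coded, and the bit weights are computed from each word's position.

```python
def to_binary_representation(your_str="NY LA", vocab=("NY", "LA", "GA")):
    """Pure-Python variant: sum per-word bit weights and format as binary."""
    n = len(vocab)
    # weights 4, 2, 1 for a three-word vocabulary, highest bit first
    lookup_table = {word: 2 ** (n - 1 - pos) for pos, word in enumerate(vocab)}
    return bin(sum(lookup_table[k] for k in your_str.split()))


print(to_binary_representation("NY LA"))  # prints 0b110
print(to_binary_representation("NY GA"))  # prints 0b101
```

Note that bin() keeps the 0b prefix and drops leading zeros, so "GA" alone comes back as 0b1.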

To create a lookup dictionary, reverse the vocabulary, enumerate it, and take the power of 2:
>>> vocabulary = ["NY", "LA", "GA"]
>>> d = dict((word, 2 ** i) for i, word in enumerate(reversed(vocabulary)))
>>> d
{'NY': 4, 'GA': 1, 'LA': 2}
To query the dictionary (d.items() works in both Python 2 and Python 3):
>>> query = "NY GA"
>>> sum(code for word, code in d.items() if word in query.split())
5
If you want it formatted to binary:
>>> '{0:b}'.format(5)
'101'
edit: if you want a 'one liner':
>>> '{0:b}'.format(
...     sum(2 ** i
...         for i, word in enumerate(reversed(vocabulary))
...         if word in query.split()))
'101'
edit2: if you want padding, for example with six 'bits':
>>> '{0:06b}'.format(5)
'000101'
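If the padding width should follow the size of the vocabulary instead of being hard-coded, the width can be passed as a nested format field. A small sketch, assuming the same vocabulary and query format as above (the function name encode_padded is illustrative only):

```python
def encode_padded(vocabulary, query):
    """Binary string with one digit per vocabulary word, zero-padded on the left."""
    words = set(query.split())
    value = sum(2 ** i
                for i, word in enumerate(reversed(vocabulary))
                if word in words)
    # {1} supplies the width, so this is '{0:03b}' for a three-word vocabulary
    return '{0:0{1}b}'.format(value, len(vocabulary))


print(encode_padded(["NY", "LA", "GA"], "NY GA"))  # prints 101
print(encode_padded(["NY", "LA", "GA"], "GA"))     # prints 001
```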

Related

Python: how to find arrays that have a certain element efficiently

Given several lists (an element may appear in more than one list) and a string, I want to find the names of all lists that contain the string.
I could simply go through all the lists with if statements, but I feel there is a more efficient way to do so.
Any suggestions and advice would be appreciated. Thank you.
Example of the simple method I came up with:
arrayA = ['1','2','3','4','5']
arrayB = ['3','4','5']
arrayC = ['1','3','5']
arrayD = ['7']
def findArrays(givenString):
    foundArrays = []
    if givenString in arrayA:
        foundArrays.append('arrayA')
    if givenString in arrayB:
        foundArrays.append('arrayB')
    if givenString in arrayC:
        foundArrays.append('arrayC')
    if givenString in arrayD:
        foundArrays.append('arrayD')
    return foundArrays

Lookup in a list is not very efficient; a set is much better.
Let's define your data like
data = {  # a dict of sets
    "a": {1, 2, 3, 4, 5},
    "b": {3, 4, 5},
    "c": {1, 3, 5},
    "d": {7}
}
then we can search like
search_for = 3 # for example
in_which = {label for label,values in data.items() if search_for in values}
# -> in_which = {'a', 'b', 'c'}
If you are going to repeat this often, it may be worth pre-processing your data like
from collections import defaultdict
lookup = defaultdict(set)
for label, values in data.items():
    for v in values:
        lookup[v].add(label)
Now you can simply
in_which = lookup[search_for] # -> {'a', 'b', 'c'}
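One detail worth knowing about the pre-processed version: indexing a defaultdict with a missing key inserts an empty set for that key. The sketch below builds the same inverted index but queries it with .get so that lookups leave it unchanged. build_lookup is a hypothetical helper name, and the data matches the example above.

```python
from collections import defaultdict


def build_lookup(data):
    """Invert {label: set_of_values} into {value: set_of_labels}."""
    lookup = defaultdict(set)
    for label, values in data.items():
        for v in values:
            lookup[v].add(label)
    return lookup


data = {"a": {1, 2, 3, 4, 5}, "b": {3, 4, 5}, "c": {1, 3, 5}, "d": {7}}
lookup = build_lookup(data)

print(sorted(lookup.get(3, set())))  # prints ['a', 'b', 'c']
print(sorted(lookup.get(9, set())))  # prints [] and does not add key 9
```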

The simple one-liner is:
result = [lst for lst in [arrayA, arrayB, arrayC, arrayD] if givenString in lst]
or if you prefer a more functional style:
result = list(filter(lambda lst: givenString in lst, [arrayA, arrayB, arrayC, arrayD]))
Note that neither of these gives you the NAME of the list. You shouldn't ever need to know that, though.

Array names?
You could try something like this with eval(), although using eval() is evil:
arrayA = [1,2,3,4,5,'x']
arrayB = [3,4,5]
arrayC = [1,3,5]
arrayD = [7,'x']
foundArrays = []
array_names = ['arrayA', 'arrayB', 'arrayC', 'arrayD']
givenString = 'x'
result = [arr for arr in array_names if givenString in eval(arr)]
print(result)
['arrayA', 'arrayD']
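If the names really are needed, a dictionary keyed by name avoids eval() entirely. This is only a sketch: the dictionary name arrays and the helper find_names are assumptions for illustration, using the same data as above.

```python
arrays = {
    'arrayA': [1, 2, 3, 4, 5, 'x'],
    'arrayB': [3, 4, 5],
    'arrayC': [1, 3, 5],
    'arrayD': [7, 'x'],
}


def find_names(arrays, wanted):
    """Return the names of all lists that contain the wanted element."""
    return [name for name, values in arrays.items() if wanted in values]


print(find_names(arrays, 'x'))  # prints ['arrayA', 'arrayD']
```

The names come back in insertion order on Python 3.7 and later, where dicts preserve it.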

python distinguish number and string solution

I'm new to Python and trying to separate the digits from the letters in a string.
For example:
Input: 111aa111aa
Output : Number: 111111 , String : aaaa

Here is your answer.
For the numbers:
import re
x = '111aa111aa'
num = ''.join(re.findall(r'\d+', x))
For the letters:
import re
x = '111aa111aa'
alphabets = ''.join(re.findall(r'[a-zA-Z]', x))

You can use built-in string methods such as isdigit() and isalpha():
>>> x = '111aa111aa'
>>> number = ''.join([i for i in x if i.isdigit()])
>>> number
'111111'
>>> string = ''.join([i for i in x if i.isalpha()])
>>> string
'aaaa'
Or you can use a regex here:
>>> import re
>>> x = '111aa111aa'
>>> numbers = ''.join(re.findall(r'\d+', x))
>>> numbers
'111111'
>>> string = ''.join(re.findall(r'[a-zA-Z]', x))
>>> string
'aaaa'
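The two approaches agree on the example input, but they are not strictly equivalent: in Python 3, str.isalpha() accepts any Unicode letter, while the pattern [a-zA-Z] only matches ASCII letters. A short sketch of the difference, using a made-up input string:

```python
import re

text = 'caf\u00e9 42'  # 'café 42': the é is a letter outside a-z
letters_by_method = ''.join(ch for ch in text if ch.isalpha())
letters_by_regex = ''.join(re.findall(r'[a-zA-Z]', text))

print(len(letters_by_method))  # prints 4, the é is kept
print(len(letters_by_regex))   # prints 3, the é is dropped
```

For plain ASCII input such as '111aa111aa' either approach gives the same result.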

>>> my_string = '111aa111aa'
>>> ''.join(filter(str.isdigit, my_string))
'111111'
>>> ''.join(filter(str.isalpha, my_string))
'aaaa'

Try isalpha for strings and isdigit for numbers:
In [45]: a = '111aa111aa'
In [47]: ''.join([i for i in a if i.isalpha()])
Out[47]: 'aaaa'
In [48]: ''.join([i for i in a if i.isdigit()])
Out[48]: '111111'
OR (in Python 3, filter returns an iterator, so join the results):
In [18]: strings, numbers = ''.join(filter(str.isalpha, a)), ''.join(filter(str.isdigit, a))
In [19]: print(strings, numbers)
aaaa 111111

As you mentioned you are new to Python, most of the presented approaches using str.join with list comprehensions or functional styles are quite sufficient. Alternatively, I present some options using dictionaries that can help organize data, starting from basic to intermediate examples with arguably increasing intricacy.
Basic Alternative
s = "111aa111aa"
# Dictionary
d = {"Number": "", "String": ""}
for char in s:
    if char.isdigit():
        d["Number"] += char
    elif char.isalpha():
        d["String"] += char
d
# {'Number': '111111', 'String': 'aaaa'}
d["Number"]  # access by key
# '111111'
import collections as ct
# Default Dictionary
dd = ct.defaultdict(str)
for char in s:
    if char.isdigit():
        dd["Number"] += char
    elif char.isalpha():
        dd["String"] += char
dd

How to make json decimal round to three digits

I have the following list of lists:
x = [["foo",3.923239],["bar",1.22333]]
What I want to do is round the numeric values to three decimal places in the JSON string, yielding
myjsonfinal = '[["foo", 3.923], ["bar", 1.223]]'
I tried this but it failed:
import json
print(json.dumps(x))
Ideally this should be fast, because I need to deal with ~1000 items and then load the result on the web.

Try round()
import json
x = [["foo",3.923239],["bar",1.22333]]
json.dumps([[s, round(i, 3)] for s, i in x])
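For data nested more deeply than a flat list of pairs, the same idea can be applied recursively before serializing. A minimal sketch, assuming only lists, dicts and floats need handling; round_floats is a hypothetical helper, not part of the json module.

```python
import json


def round_floats(obj, ndigits=3):
    """Return a copy of obj with every float rounded to ndigits places."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, list):
        return [round_floats(item, ndigits) for item in obj]
    if isinstance(obj, dict):
        return {key: round_floats(value, ndigits) for key, value in obj.items()}
    return obj  # strings, ints, None, etc. pass through unchanged


x = [["foo", 3.923239], ["bar", 1.22333]]
print(json.dumps(round_floats(x)))  # prints [["foo", 3.923], ["bar", 1.223]]
```

The original list is left untouched, since the helper builds new containers.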

@neversaint, try this:
x = [["foo", 3.923239], ["bar", 1.22333]]
for i, j in enumerate(x):
    x[i][1] = round(j[1], 3)
print(x)
Output:
[['foo', 3.923], ['bar', 1.223]]
Cheers!!

How do I assign each variable in a list a number, and then add the numbers up for the same variables?

For example, if ZZAZAAZ is input, the sum for A would be 14 (since it appears at positions 3, 5 and 6), while the sum for Z would be 14 (1 + 2 + 4 + 7).
How would I do that?

You can use a generator expression within sum :
>>> s='ZZAZAAZ'
>>> sum(i for i,j in enumerate(s,1) if j=='A')
14

You could do this for all the elements in s. It computes the sum for each element in a single pass over the string s, so it is linear in the length of s.
>>> s = 'ZZAZAAZ'
>>> d = {}
>>> for i, item in enumerate(s):
...     d[item] = d.get(item, 0) + i + 1
...
>>> print(d)
{'Z': 14, 'A': 14}

Furthering Kasra's idea of using enumerate, if you wanted a dictionary containing these sums you could use a dictionary comprehension, and iterate over the set of unique characters, like so:
>>> s = 'ZZAZAAZ'
>>> {let:sum(a for a,b in enumerate(s,1) if b==let) for let in set(s)}
{'Z': 14, 'A': 14}
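To make the arithmetic from the question visible, the positions can be kept alongside their sums. This is a sketch under the same 1-based numbering as the question; position_sums is a made-up name.

```python
from collections import defaultdict


def position_sums(s):
    """Map each character to (1-based positions, sum of those positions)."""
    positions = defaultdict(list)
    for pos, ch in enumerate(s, 1):
        positions[ch].append(pos)
    return {ch: (found, sum(found)) for ch, found in positions.items()}


result = position_sums('ZZAZAAZ')
print(result['A'])  # prints ([3, 5, 6], 14)
print(result['Z'])  # prints ([1, 2, 4, 7], 14)
```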

Single integer to multiple integer translation in Python

I'm trying to translate a single integer input to a multiple integer output, and am currently using the maketrans function. For instance,
intab3 = "abcdefg"
outtab3 = "ABCDEFG"
trantab3 = maketrans(intab3, outtab3)
is the most basic version of what I'm doing. What I'd like to be able to do is have the input be a single letter and the output be multiple letters. So something like:
intab4 = "abc"
outtab = "yes,no,maybe"
but commas and quotation marks don't work.
It keeps saying :
ValueError: maketrans arguments must have same length
Is there a better function I should be using? Thanks,

You can use a dict here:
>>> dic = {"a":"yes", "b":"no", "c":"maybe"}
>>> strs = "abcd"
>>> "".join(dic.get(x,x) for x in strs)
'yesnomaybed'

In Python 3, the str.translate method was improved so that a table can map a character to a whole string, so this just works:
>>> intab4 = "abc"
>>> outtab = "yes,no,maybe"
>>> d = {ord(k): v for k, v in zip(intab4, outtab.split(','))}
>>> print(d)
{97: 'yes', 98: 'no', 99: 'maybe'}
>>> 'abcdefg'.translate(d)
'yesnomaybedefg'
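In Python 3, str.maketrans also accepts a single dict whose keys are single characters, which saves building the ord() mapping by hand. A short sketch using the mapping from the question:

```python
# keys must be single characters; values may be strings of any length
table = str.maketrans({"a": "yes", "b": "no", "c": "maybe"})

print("abcdefg".translate(table))  # prints yesnomaybedefg
```

Characters without an entry in the table, such as d through g here, pass through unchanged.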
