More efficient solution? Dictionary as sparse vector - python

I have two dictionaries that I use as sparse vectors:
dict1 = {'a': 1, 'b': 4}
dict2 = {'a': 2, 'c': 2}
I wrote my own __add__ function to get this desired result:
dict1 = {'a': 3, 'b': 4, 'c': 2}
It is important that I keep the association between the strings 'a', 'b' and 'c' and their corresponding values; just making sure that I add up the correct dimensions is not enough. I will also receive many more, previously unknown strings with values, which at the moment I simply add to my dictionary.
Now my question: is there a more efficient data structure out there? I looked at NumPy's arrays and SciPy's sparse matrices, but as far as I understand they are not really of any help here, or am I just not seeing the solution?
I could keep keys and values in separate arrays, but I don't think any existing function will just give me the desired result.
dict1_keys = np.array(['a', 'b'])
dict1_values = np.array([1, 4])
dict2_keys = np.array(['a', 'c'])
dict2_values = np.array([2, 2])
# is there anything that will efficiently produce the following?
dict1_keys = np.array(['a', 'b', 'c'])
dict1_values = np.array([3, 4, 2])
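For the keys/values arrays above, a plain NumPy sketch (using np.unique and np.add.at; it assumes a sorted key order is acceptable) could produce that merge:
import numpy as np

keys = np.concatenate([dict1_keys, dict2_keys])
values = np.concatenate([dict1_values, dict2_values])
merged_keys, inverse = np.unique(keys, return_inverse=True)
merged_values = np.zeros(len(merged_keys), dtype=values.dtype)
np.add.at(merged_values, inverse, values)   # sums values of duplicate keys in place
# merged_keys   -> array(['a', 'b', 'c'], dtype='<U1')
# merged_values -> array([3, 4, 2])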

Perhaps pandas is what you're looking for:
import numpy
import pandas

d1 = pandas.DataFrame(numpy.array([1, 4]), index=['a', 'b'], dtype="int32")
d2 = pandas.DataFrame(numpy.array([2, 2]), index=['a', 'c'], dtype="int32")
d1.add(d2, fill_value=0)
result:
   0
a  3
b  4
c  2

@sirfz's pandas approach could be a one-liner using pd.Series:
>>> pd.Series(dict1).add(pd.Series(dict2), fill_value=0)
a 3.0
b 4.0
c 2.0
Or, if your API requires dicts:
>>> dict(pd.Series(dict1).add(pd.Series(dict2), fill_value=0))
{'a': 3.0, 'b': 4.0, 'c': 2.0}
Plus, this should handle mixed inputs of dicts or Series, or even scipy sparse matrix rows and sklearn Vectorizer output (sparse vectors/mappings).
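If you need plain ints back (the add with fill_value=0 returns floats), one small tweak is to cast before converting:
>>> dict(pd.Series(dict1).add(pd.Series(dict2), fill_value=0).astype(int))
{'a': 3, 'b': 4, 'c': 2}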

Related

Pandas: Looking to avoid a for loop when creating a nested dictionary

Here is my data:
df:
id sub_id
A 1
A 2
B 3
B 4
and I have the following array:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict

sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
    sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
Assuming the arrays variable has the same number of rows as the DataFrame:
df['value'] = arrays
Convert into a dictionary by grouping:
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
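A variant that avoids the per-group lambda inside apply is a plain dict comprehension over the groupby object; whether it is noticeably faster on 100M rows would need measuring, so treat it as a sketch:
sub_id_array_dict = {
    key: dict(zip(grp['sub_id'], grp['value']))
    for key, grp in df.groupby('id')
}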

Python Pandas: multiple aggregations -> list of values

I have a DataFrame which contains the results of multiple aggregation functions applied to multiple columns, for example:
import numpy as np
import pandas as pd

bar = pd.DataFrame([
    {'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
    {'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1}
])
bar.groupby('grp').agg([np.mean, np.std])
        a                  b
     mean       std     mean       std
grp
0       2  1.414214      5.0  4.242641
1       3  1.414214      3.5  2.121320
I want to combine the aggregation results into lists (or tuples):
grp  a              b
0    [2, 1.414214]  [5.0, 4.242641]
1    [3, 1.414214]  [3.5, 2.121320]
What would be the proper way to do this?
Thanks in advance!
If you have to use lists in columns, you can:
In [60]: bar.groupby('grp').agg(lambda x: [x.mean(), x.std()])
Out[60]:
a b
grp
0 [2.0, 1.4142135623730951] [5.0, 4.242640687119285]
1 [3.0, 1.4142135623730951] [3.5, 2.1213203435596424]
That said, storing data like this is not recommended in pandas.
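If you do need the lists anyway, another possible sketch builds them from the two-level agg result above (it assumes the MultiIndex columns produced by agg(['mean', 'std'])):
agg = bar.groupby('grp').agg(['mean', 'std'])
# each cell becomes [mean, std] for that column and group
listed = pd.DataFrame(
    {col: agg[col].values.tolist() for col in agg.columns.levels[0]},
    index=agg.index,
)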
What would be the proper way to do this?
There is no proper way. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in series is not recommended is that you lose all the vectorised functionality that comes with numeric series backed by NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, and you lose the associated memory and performance benefits.
See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.

Put array into DataFrame as single element

I have a dict like this:
import numpy as np

dic = {}
dic['A'] = 1
dic['B'] = np.array([1, 2, 3])
dic['C'] = np.array([1, 2, 3, 4])
dic['D'] = np.array([6, 7])
Then I tried to put them into a DataFrame (I may also insert more rows later, and the array length for each element may vary). For various reasons, I want to keep each array as a single object in its column, so that when printed it looks like:
A      B        C           D
1      [1,2,3]  [1,2,3,4]   [6,7]
...
[2,3]  [7,8]    [5,6,7,2]   4
When I try to do this with:
pd.DataFrame.from_dict(dic)
I always get the error: ValueError: arrays must all be same length
Is there any way to keep each entire array as a single element? Sometimes I also have single scalar values, as shown above.
I am not sure why you need the input to be a dictionary, but if you pass the elements as NumPy arrays, pandas fills the missing values with NaN:
pd.DataFrame([np.array([1,2,3]), np.array([1,2,3,4]), np.array([6,7])], columns=['A','B','C','D'])
Output:
   A  B    C    D
0  1  2  3.0  NaN
1  1  2  3.0  4.0
2  6  7  NaN  NaN
IIUC, this should work:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.array([2, 3])],
                   "B": [np.array([1, 2, 3]), np.array([7, 8])],
                   "C": [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 2])],
                   "D": [np.array([6, 7]), 4]})
So df.to_dict() returns
{'A': {0: 1, 1: array([2, 3])},
'B': {0: array([1, 2, 3]), 1: array([7, 8])},
'C': {0: array([1, 2, 3, 4]), 1: array([5, 6, 7, 2])},
'D': {0: array([6, 7]), 1: 4}}
UPDATE
If you want to save the result to a file, consider using lists instead of NumPy arrays with a delimiter such as ';', or convert the arrays to strings if you want to maintain this shape.
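For the original dic with unequal-length arrays, a minimal sketch that keeps each array intact as a single cell is to wrap every value in a one-element Series (this gives one row; further rows could be appended the same way):
import numpy as np
import pandas as pd

dic = {'A': 1, 'B': np.array([1, 2, 3]), 'C': np.array([1, 2, 3, 4]), 'D': np.array([6, 7])}
# wrapping each value in a one-element Series sidesteps the "arrays must all be same length" check
df = pd.DataFrame({k: pd.Series([v]) for k, v in dic.items()})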

Calculate average of list values grouped by second list

I didn't know how to better express myself in the title. Basically what I have is two lists:
a = ['A','B','A','C','D','C','A',...]
b = [2,4,8,3,5,2,1,...]
a and b have the same length; each entry in b is a value related to the letter at the same position in a.
Now I would like to calculate the average value in b for each letter in a, so at the end I would have:
a = ['A','B','C','D',...]
b = [3.67, 4, 2.5, 5,...]
Is there a standard implementation for this in python?
You can first perform a group by. We can do this for instance with a defaultdict:
from collections import defaultdict

col = defaultdict(list)
for ai, bi in zip(a, b):
    col[ai].append(bi)
Now the dictionary col will look like:
>>> col
defaultdict(<class 'list'>, {'C': [3, 2], 'B': [4], 'D': [5], 'A': [2, 8, 1]})
and now we can calculate the average of all elements in the dictionary for instance like:
>>> {key:sum(vals)/len(vals) for key,vals in col.items()}
{'C': 2.5, 'B': 4.0, 'D': 5.0, 'A': 3.6666666666666665}
You can also convert it to two tuples by using zip:
a,b = zip(*[(key,sum(vals)/len(vals)) for key,vals in col.items()])
resulting in:
>>> a,b = zip(*[(key,sum(vals)/len(vals)) for key,vals in col.items()])
>>> a
('C', 'B', 'D', 'A')
>>> b
(2.5, 4.0, 5.0, 3.6666666666666665)
If you want to generate lists instead, you can convert them to lists:
a,b = map(list,zip(*[(key,sum(vals)/len(vals)) for key,vals in col.items()]))
This results in:
>>> a,b = map(list,zip(*[(key,sum(vals)/len(vals)) for key,vals in col.items()]))
>>> a
['C', 'B', 'D', 'A']
>>> b
[2.5, 4.0, 5.0, 3.6666666666666665]
I believe a cleaner way to do this would be to simply use a pandas groupby:
import pandas as pd

data = pd.DataFrame(b, index=a)
means = data.groupby(data.index)[0].mean()
a, b = list(means.index), list(means)
You can use numpy as follows:
>>> import numpy as np
>>> array_a = np.array(a)
>>> array_b = np.array(b)
>>> avrg_of_a = np.average(array_b[array_a == 'A'])
>>> avrg_of_a
3.6666666666666665
>>> avrg_of_b = np.average(array_b[array_a == 'B'])
>>> avrg_of_b
4.0
You can generate a list with a list comprehension: [np.average(array_b[array_a == item]) for item in np.unique(array_a)]
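A fully vectorised alternative is a sketch with np.unique(..., return_inverse=True) and np.bincount, which avoids the Python-level loop over the unique letters:
uniq, inverse = np.unique(array_a, return_inverse=True)
averages = np.bincount(inverse, weights=array_b) / np.bincount(inverse)
# uniq     -> array(['A', 'B', 'C', 'D'], dtype='<U1')
# averages -> array([3.66666667, 4.        , 2.5       , 5.        ])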

How should I transform multiple key/value columns in a scikit-learn pipeline?

I'd like to build a sklearn pipeline to transform data that contains multiple key/value pairs:
import pandas as pd
D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)
Output:
  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3
DictVectorizer seems appropriate but I'm struggling with transforming multiple key/value columns present on each row into a suitable dict for processing.
DictVectorizer seems amenable to input like this:
row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
# This is the output structure that I need:
print(data)
yielding:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
Then it will transform into an array like this:
DictVectorizer( sparse=False ).fit_transform(data)
Final output:
array([[ 1., 2., 0.],
[ 0., 2., 3.]])
What would be a suitable custom transformer to transform multiple key/value pairs as shown above?
I don't know about a special transformer but you could use a simple list comprehension:
>>> data = [{row['k1']:row['v1'], row['k2']:row['v2']} for index, row in D.iterrows()]
>>> data
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
From here you could use a dict vectorizer like this:
>>> v = sklearn.feature_extraction.DictVectorizer(sparse=False)
>>> X = v.fit_transform(data)
>>> print(X)
[[ 1. 2. 0.]
[ 0. 2. 3.]]
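If you also need to know which output column corresponds to which key, the vectorizer keeps the learned feature names (get_feature_names_out in scikit-learn 1.0+, get_feature_names in older versions):
>>> list(v.get_feature_names_out())
['a', 'b', 'c']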
Building on Mike's answer (which is definitely more elegant than my original one), you can use the same logic of pairs of columns and avoid having to specify each pair with the following:
[dict((row[i-1],row[i]) for i in np.arange(1,len(D.columns),2)) for index, row in D.iterrows() ]
This yields the following:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
Note: This assumes that the pairs are organized like in your example (k1,v1,k2,v2, etc) and that there are an even number of columns.
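To package that logic as an actual pipeline step, a sketch of a custom transformer could look like the following (KeyValueToDict is a hypothetical name, not an existing sklearn class):
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns=['k1', 'v1', 'k2', 'v2'])

class KeyValueToDict(BaseEstimator, TransformerMixin):
    """Turn alternating key/value columns (k1, v1, k2, v2, ...) into one dict per row."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        ncols = X.shape[1]
        return [
            {row[i]: row[i + 1] for i in range(0, ncols, 2)}
            for row in X.itertuples(index=False)
        ]

pipeline = make_pipeline(KeyValueToDict(), DictVectorizer(sparse=False))
X = pipeline.fit_transform(D)   # array([[1., 2., 0.], [0., 2., 3.]])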
