Put array into DataFrame as single element - python

Guys,
I have an dict like this :
dic = {}
dic['A'] = 1
dic['B'] = np.array([1,2,3])
dic['C'] = np.array([1,2,3,4])
dic['D'] = np.array([6,7])
Then I tried to put them into a DataFrame (also may insert more lines later, but the array length for each element may be variable), for some reasons, I want to keep them as a entire object for each columns, when print, it looks like:
A B C D
1 [1,2,3] [1,2,3,4] [6,7]
......
[2,3] [7,8] [5,6,7,2] 4
When I am trying to do this by :
pd.DataFrame.from_dict(dic)
I always get the error : ValueError: arrays must all be same length
Do I have anyway to keep the entire array as single element, however, some times I do have some single value as well ?

I am not sure why you required input as the dictionary. but if you pass elements as numpy array it converts missing values with NaN.
pd.DataFrame([np.array([1,2,3]),np.array([1,2,3,4]),np.array([6,7])],columns=['A','B','C','D'])
Output:-
A B C D
0 1 2 3.0 NaN
1 1 2 3.0 4.0
2 6 7 NaN NaN

IIUC this should work
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1, np.array([2,3])],
"B":[np.array([1,2,3]), np.array([7,8])],
"C":[np.array([1,2,3,4]), np.array([5,6,7,2])],
"D":[np.array([6,7]), 4]})
So df.to_dict() returns
{'A': {0: 1, 1: array([2, 3])},
'B': {0: array([1, 2, 3]), 1: array([7, 8])},
'C': {0: array([1, 2, 3, 4]), 1: array([5, 6, 7, 2])},
'D': {0: array([6, 7]), 1: 4}}
UPDATE
If you want to save to file you should consider to use lists instead of numpy arrays and use delimiter=';'

convert arrays to strings by if you want to maintain this shape.

Related

DataFrame from list of string dicts

So I have a list where each entry looks something like this:
"{'A': 1, 'B': 2, 'C': 3}"
I am trying to get a dataframe that looks like this
A B C
0 1 2 3
1 4 5 6
2 7 8 9
But I'm having trouble converting the format into something that can be read into a DataFrame. I know that pandas should automatically convert dicts into dataframes, but since my list elements are surrounded by quotes, it's getting confused and giving me
0
0 {'A': 1, 'B': 2, 'C': 3}
...
I've tried using using json, concat'ing a list of dataframes, and so on, but to no avail.
eval is not safe. Check this comparison.
Instead use ast.literal_eval:
Assuming this to be your list:
In [572]: l = ["{'A': 1, 'B': 2, 'C': 3}", "{'A': 4, 'B': 5, 'C': 6}"]
In [584]: import ast
In [587]: df = pd.DataFrame([ast.literal_eval(i) for i in l])
In [588]: df
Out[588]:
A B C
0 1 2 3
1 4 5 6
I would agree with the SomeDude that eval will work like this
pd.DataFrame([eval(s) for s in l])
BUT, if any user entered data is going into these strings, you should never use eval. Instead, you can convert the single quotes to double quotes and use the following syntax from the json package. This is much safer.
json.loads(u'{"A": 1, "B": 2, "C": 3}')
Use eval before reading it in dataframe:
pd.DataFrame([eval(s) for s in l])
Or better use ast.literal_eval as #Mayank Porwal's answer says.
Or use json.loads but after making sure its valid json
Take your list of strings, turn it into a list of dictionaries, then construct the data frame using pd.DataFrame.from_records.
>>> l = ["{'A': 1, 'B': 2, 'C': 3}"]
>>> pd.DataFrame.from_records(eval(s) for s in l)
A B C
0 1 2 3
Check that your input data doesn't include Python code, however, because eval is just to evaluate the input. Using it on something web-facing or something like that would be a severe security flaw.
You can try
lst = ["{'A': 1, 'B': 2, 'C': 3}", "{'A': 1, 'B': 2, 'C': 3}"]
df = pd.DataFrame(map(eval, lst))
# or
df = pd.DataFrame(lst)[0].apply(eval).apply(pd.Series)
# or
df = pd.DataFrame(lst)[0].apply(lambda x: pd.Series(eval(x)))
print(df)
A B C
0 1 2 3
1 1 2 3

Pandas: Looking to avoid a for loop when creating a nested dictionary

Here is my data:
df:
id sub_id
A 1
A 2
B 3
B 4
and I have the following array:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict
sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
Assuming the arrays variable has same number of rows as in the Dataframe,
df['value'] = arrays
Convert into dictionary by grouping
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}

Extracting item from a list stored as string in pandas dataframe

I have the following pandas dataframe
data = [{'a': 1, 'b': '[2,3,4,5,6' }, {'a': 10, 'b': '[54,3,40,5'}]
test = pd.DataFrame(data)
display(test)
a b
0 1 [2,3,4,5,6
1 10 [54,3,40,5
I want to list the number in column b, but as the list has the [ only at the beginning, doesnt allow me to create the list, I'm trying to remove the "[" so I can extract the numbers, but I keep getting errors, what I'm doing wrong?
This is how the numbers are stored
test.iloc[1,1]
'[54,3,40,5'
And this is what I've tried to remove the "[".
test.iloc[0,1].replace("[",'', regex=True).to_list()
test.iloc[0,1].str.replace("[\]\[]", "")
What i want to achieve is to have b as a proper list so i can apply other functions.
a b
0 1 [2,3,4,5,6]
1 10 [54,3,40,5]
To make your 'b' column a list you can first delete the open squared bracket at the beginning, and then use the split method on each element of your 'b' column
test['b'] = test['b'].str.replace('[', '').map(lambda x: x.split(','))
test
# a b
# 0 1 [2, 3, 4, 5, 6]
# 1 10 [54, 3, 40, 5]
try it:
def func(col):
return eval(col+']')
test['b'] = test['b'].apply(func)
import pandas as pd
data = [{'a': 1, 'b': '[2,3,4,5,6' }, {'a': 10, 'b': '[54,3,40,5'}]
test = pd.DataFrame(data)
print(test['b'][0][1:])
for i in range(len(test['b'])):
test['b'][i] = test['b'][i][1:]

Map index of numpy matrix

How should I map indices of a numpy matrix?
For example:
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6]]
The row/column indices are 0, 1, 2.
So:
>>> mx[0,0]
5
Let s say I need to map these indices, converting 0, 1, 2 into, e.g. 10, 'A', 'B' in the way that:
mx[10,10] #returns 5
mx[10,'A'] #returns 6 and so on..
I can just set a dict and use it to access the elements, but I would like to know if it is possible to do something like what I just described.
I would suggest using pandas dataframe with the index and columns using the new mapping for row and col indexing respectively for ease in indexing. It allows us to select a single element or an entire row or column with the familiar colon operator.
Consider a generic (non-square 4x3 shaped matrix) -
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6],[4,5,2]])
Consider the mappings for rows and columns -
row_idx = [10, 'A', 'B','C']
col_idx = [10, 'A', 'B']
Let's take a look on the workflow with the given sample -
# Get data into dataframe with given mappings
In [57]: import pandas as pd
In [58]: df = pd.DataFrame(mx,index=row_idx, columns=col_idx)
# Here's how dataframe data looks like
In [60]: df
Out[60]:
10 A B
10 5 6 2
A 3 3 7
B 0 1 6
C 4 5 2
# Get one scalar element
In [61]: df.loc['C',10]
Out[61]: 4
# Get one entire col
In [63]: df.loc[:,10].values
Out[63]: array([5, 3, 0, 4])
# Get one entire row
In [65]: df.loc['A'].values
Out[65]: array([3, 3, 7])
And best of all we are not making any extra copies as the dataframe and its slices are still indexing into the original matrix/array memory space -
In [98]: np.shares_memory(mx,df.loc[:,10].values)
Out[98]: True
Try this:
import numpy as np
A = np.array(((1,2),(3,4),(50,100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)
print(B['ID'])
You can use the __getitem__ and __setitem__ special methods and create a new class as shown.
Store the index map as a dictionary in an instance variable self.index_map.
import numpy as np
class Matrix(np.matrix):
def __init__(self, lis):
self.matrix = np.matrix(lis)
self.index_map = {}
def setIndexMap(self, index_map):
self.index_map = index_map
def getIndex(self, key):
if type(key) is slice:
return key
elif key not in self.index_map.keys():
return key
else:
return self.index_map[key]
def __getitem__(self, idx):
return self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])]
def __setitem__(self, idx, value):
self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])] = value
Usage:
Creating a matrix.
>>> mx = Matrix([[5,6,2],[3,3,7],[0,1,6]])
>>> mx
Matrix([[5, 6, 2],
[3, 3, 7],
[0, 1, 6]])
Defining the Index Map.
>>> mx.setIndexMap({10:0, 'A':1, 'B':2})
Different ways to index the matrix.
>>> mx[0,0]
5
>>> mx[10,10]
5
>>> mx[10,'A']
6
It also handles slicing as shown.
>>> mx[1:3, 1:3]
matrix([[3, 7],
[1, 6]])

More efficient solution? Dictionary as sparse vector

I have two dictionaries that I use as sparse vectors:
dict1 = {'a': 1, 'b': 4}
dict2 = {'a': 2, 'c': 2}
I wrote my own __add__ function to get this desired result:
dict1 = {'a': 3, 'b': 4, 'c': 2}
It is important that I know the strings 'a', 'b' and 'c' for each corresponding value. Just making sure that I add up the correct dimensions is not enough. I will also get many more, previously unknown strings with some values that I just add to my dictionary at the moment.
Now my question: Is there a more efficient data structure out there? I looked at Numpy's arrays and Scipy's sparse matrixes but as far as I understand they are not really of any help here or am I just not seeing the solution?
I could keep keys and values in separate arrays but I don't think I can just use any already existing function to get the desired result.
dict1_keys = np.array([a, b])
dict1_values = np.array([1, 4])
dict2_keys = np.array([a, c])
dict2_values = np.array([2, 2])
# is there anything that will efficiently produce the following?
dict1_keys = np.array([a, b, c])
dict1_values = np.array([3, 4, 2])
Perhaps pandas is what you're looking for:
d1 = pandas.DataFrame(numpy.array([1, 4]), index=['a', 'b'], dtype="int32")
d2 = pandas.DataFrame(numpy.array([2, 2]), index=['a', 'c'], dtype="int32")
d1.add(d2, fill_value=0)
result:
0
a 3
b 4
c 2
#sirfz's Pandas approach could be a one-liner using pandas Series:
>>> pd.Series(dict1).add(pd.Series(dict2), fill_value=0)
a 3.0
b 4.0
c 2.0
Or if your API required dicts
>>> dict(pd.Series(dict1).add(pd.Series(dict2), fill_value=0))
{'a': 3.0, 'b': 4.0, 'c': 2.0}
Plus, this should handle mixed inputs of dicts or Seriess or even scipy sparse matrix rows and sklearn Vectorizer output (sparse vectors/mappings)

Categories