Nested dictionary of lists to dataframe - python

I have a dictionary like this:
{'a': {'col_1': [1, 2], 'col_2': ['a', 'b']},
'b': {'col_1': [3, 4], 'col_2': ['c', 'd']}}
When I try to convert this to a dataframe, I get this:
col_1 col_2
a [1, 2] [a, b]
b [3, 4] [c, d]
But what I need is this:
col_1 col_2
a 1 a
2 b
b 3 c
4 d
How can I get this format? Should I perhaps change my input format as well?
Thanks for help=)

You can use pd.DataFrame.from_dict setting orient='index' so the dictionary keys are set as the dataframe's indices, and then explode all columns by applying pd.Series.explode:
pd.DataFrame.from_dict(d, orient='index').apply(pd.Series.explode)
col_1 col_2
a 1 a
a 2 b
b 3 c
b 4 d
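As a self-contained sketch of the approach above: the name d follows the answer, while the final infer_objects() call is an addition here, on the assumption that you want numeric dtypes back (explode leaves the columns as object dtype).

```python
import pandas as pd

d = {'a': {'col_1': [1, 2], 'col_2': ['a', 'b']},
     'b': {'col_1': [3, 4], 'col_2': ['c', 'd']}}

# Outer keys become the index; each cell initially holds a list.
wide = pd.DataFrame.from_dict(d, orient='index')

# Explode every column; this relies on all lists in a row having equal length.
long = wide.apply(pd.Series.explode)

# explode() yields object columns; infer_objects() restores int64 where possible.
long = long.infer_objects()
print(long)
```

The index keeps the repeated outer keys (a, a, b, b), matching the output shown in the answer.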

You could run a generator expression and pass it to pandas concat. The expression iterates over the items of the outer dictionary, whose values are themselves dictionaries:
pd.concat(pd.DataFrame(entry).assign(key=key) for key,entry in data.items()).set_index('key')
col_1 col_2
key
a 1 a
a 2 b
b 3 c
b 4 d
update:
Still uses concatenation; no need to assign key to individual dataframes:
(pd.concat([pd.DataFrame(entry)
            for key, entry in data.items()],
           keys=data)
   .droplevel(-1))
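A closely related variant, sketched here as an assumption rather than as part of the answer: pd.concat also accepts a mapping, in which case the dictionary keys become the outer index level without a separate keys argument. The names stacked and result are illustrative.

```python
import pandas as pd

data = {'a': {'col_1': [1, 2], 'col_2': ['a', 'b']},
        'b': {'col_1': [3, 4], 'col_2': ['c', 'd']}}

# Passing a dict to concat uses its keys as the outer index level.
stacked = pd.concat({key: pd.DataFrame(entry) for key, entry in data.items()})

# Drop the inner positional level (0, 1) to keep only the outer key.
result = stacked.droplevel(-1)
print(result)
```

Skipping the droplevel step keeps a two-level index, which may be useful if the position within each list matters.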

Related

How to map values to a DataFrame with multiple columns as keys?

I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Next, I grouped df2 by A and B to average C:
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to add a column new_C to df1, filled in where the columns A and B match:
A B new_C
0 3 1 1.0
1 2 2 2.0
2 1 3 2.0
3 0 4 12.5
where new_C is the average of C for every (A, B) pair in df2.
Note that A and B don't have to be keys of the dataframe (i.e. they aren't unique identifiers, which is why I originally wanted to map with a dictionary, but that failed with multiple keys).
How would I go about that?
Thank you for looking into it with me!
I found a solution to this. Selecting column C before calling to_dict() makes the dictionary keys the (A, B) tuples:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
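You can also use np.vectorize, passing the two columns as separate arguments and looking up each (A, B) tuple:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map.get((a, b), np.nan))(df1['A'], df1['B'])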
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a "vectorized" routine. However, they might be fast enough for one's purposes and might prove more readable in places.
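For comparison, here is a sketch of a merge-based alternative that is not taken from the answers above. It assumes the same df1 and df2 as in the question; the names means and merged are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2],
                    'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# One row per (A, B) pair with the mean of C; as_index=False keeps A and B as columns.
means = (df2.groupby(['A', 'B'], as_index=False)['C'].mean()
            .rename(columns={'C': 'new_c'}))

# A left merge keeps every row of df1; unmatched pairs get NaN in new_c.
merged = df1.merge(means, on=['A', 'B'], how='left')
print(merged)
```

Like the MultiIndex approach, this leaves NaN for pairs that do not occur in df2, and it returns a new frame rather than modifying df1 in place.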

Multiply all rows in a Pandas DataFrame by dictionary

Say I have a DataFrame, call it one, like this:
non_multiply_col col_1 col_2
A Name 1 3
and a dict, call it two, like this:
{'col_1': 4, 'col_2': 5}
Is there a way to multiply all rows of one by the values in two, for the columns named by two's keys, so that the result would be:
non_multiply_col col_1 col_2
A Name 4 15
I tried using multiply, but I'm not really looking to join on anything specific. Maybe I'm not understanding how to use multiply correctly.
Thanks
mul/multiply works fine if the dictionary is converted to a Series:
d = {'col_1': 4, 'col_2': 5}
df.mul(pd.Series(d), axis=1)
# col_1 col_2
#0 4 15
In case you have more columns in the data frame than the dictionary:
df = pd.DataFrame([{'col_1': 1, 'col_2': 3, 'col_3': 4}])
d = {'col_1': 4, 'col_2': 5}
cols_to_update = list(d.keys())  # a plain list is the safest column selector in Python 3
# multiply the selected columns and update
df[cols_to_update] = df[cols_to_update].mul(pd.Series(d), axis=1)[cols_to_update]
df
# col_1 col_2 col_3
#0 4 15 4
I happened to find that this works as well, though I am not sure whether there are any caveats to this usage:
df[d.keys()] *= pd.Series(d)
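Putting the pieces together for the frame in the question, which includes a non-numeric column: a minimal sketch, using the question's names one and two and a hypothetical helper variable cols.

```python
import pandas as pd

one = pd.DataFrame({'non_multiply_col': ['A Name'], 'col_1': [1], 'col_2': [3]})
two = {'col_1': 4, 'col_2': 5}

cols = list(two)  # columns named by the dict keys

# Multiply only those columns; the Series index aligns with the column labels.
one[cols] = one[cols].mul(pd.Series(two), axis=1)
print(one)
```

Restricting the operation to cols keeps the string column out of the multiplication entirely.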

set a multi index in the dataframe constructor using the data-list provided to the constructor

I know that by using set_index I can convert an existing column into a dataframe index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as an index (instead of becoming a column)?
Right now I initialize a DataFrame using data records, then I use set_index to make the column into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
c d
ab
11 2 1
12 2 2
Instead i get:
c d
a 2 1
b 2 2
You can use MultiIndex.from_tuples:
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
print (pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a','b')))
MultiIndex(levels=[[1], [1, 2]],
           labels=[[0, 0], [0, 1]],
           names=['a', 'b'])
df = pd.DataFrame(d,
                  index=pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
                                                  names=('a','b')),
                  columns=('c', 'd'))
print (df)
c d
a b
1 1 2 1
2 2 2
You can just chain a set_index call onto the constructor, without specifying the index and columns params:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2
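One more option, not from the answers above: DataFrame.from_records takes an index argument that can name fields of the input records. A minimal sketch, assuming the records are a list of dicts as in the question (the name records is illustrative):

```python
import pandas as pd

records = [{'a': 1, 'b': 1, 'c': 2, 'd': 1},
           {'a': 1, 'b': 2, 'c': 2, 'd': 2}]

# Fields listed in index= are pulled out of the data and used as index levels.
df = pd.DataFrame.from_records(records, index=['a', 'b'])
print(df)
```

This comes closest to setting the index in a single construction call, although chaining set_index is arguably just as readable.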

Python : make DataFrame

When I create a DataFrame from a list, an error occurs.
My code is :
a=[1,2,3,4,5]
b=['a','b','c','d','e']
df=pd.DataFrame(a,columns=[b])
I want this dataframe output :
a b c d e
1 2 3 4 5
The error is assert(len(items) == len(values)).
What should I do? I hope to solve this problem.
The constructor is strict about the shape of the data it is given. You can pass just the data, transpose it so the values form a single row, and then overwrite the column names:
In [166]:
a=[1,2,3,4,5]
b=['a','b','c','d','e']
df=pd.DataFrame(data=a).T
df.columns=b
df
Out[166]:
a b c d e
0 1 2 3 4 5
Another method is to build a dict that maps each column name to a one-element list, using zip and a list comprehension:
In [170]:
df=pd.DataFrame(dict(zip(b,[[x] for x in a])))
df
Out[170]:
a b c d e
0 1 2 3 4 5
The intermediate dict looks like this:
In [169]:
dict(zip(b,[[x] for x in a]))
Out[169]:
{'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [5]}
You are actually passing the columns parameter as [['a','b','c','d','e']]. It needs to be a single list, not a list of lists.
Also, when you pass a as the data, you are creating 5 rows for 1 column. Instead, pass [a], which creates 1 row and 5 columns.
Try:
df=pd.DataFrame([a],columns=b)
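The shape difference described above can be checked directly. This is a small sketch; the names as_column and as_row are illustrative.

```python
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['a', 'b', 'c', 'd', 'e']

# A flat list is read as one column of five rows.
as_column = pd.DataFrame(a)
print(as_column.shape)

# Wrapping the list makes it a single row, so the five column names fit.
as_row = pd.DataFrame([a], columns=b)
print(as_row)
```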

DataFrame from dictionary

Sorry if this is a duplicate, but I couldn't find a solution on the internet...
I have some dictionary
{'a':1, 'b':2, 'c':3}
Now I want to construct a pandas DataFrame whose column names are the keys and whose values are the dict values. It should be a DataFrame with only one row.
a b c
1 2 3
In other topics I found only solutions where both the keys and the values become columns in the new DataFrame.
There are some caveats here: if you just pass the dict to the DataFrame constructor, it will raise an error:
ValueError: If using all scalar values, you must must pass an index
To get around that you can pass an index which will work:
In [139]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame(temp, index=[0])
Out[139]:
a b c
0 1 2 3
Ideally your values should be iterable, so a list or array like:
In [141]:
temp = {'a':[1],'b':[2],'c':[3]}
pd.DataFrame(temp)
Out[141]:
a b c
0 1 2 3
Thanks to @joris for pointing out that if you wrap the dict in a list then you don't have to pass an index to the constructor:
In [142]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame([temp])
Out[142]:
a b c
0 1 2 3
For flexibility, you can also use pd.DataFrame.from_dict with orient='index'. This works whether your dictionary values are scalars or lists.
Note the final transpose step, which can be performed via df.T or df.transpose().
temp1 = {'a': 1, 'b': 2, 'c': 3}
temp2 = {'a': [1, 2], 'b':[2, 3], 'c':[3, 4]}
print(pd.DataFrame.from_dict(temp1, orient='index').T)
a b c
0 1 2 3
print(pd.DataFrame.from_dict(temp2, orient='index').T)
a b c
0 1 2 3
1 2 3 4
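One difference between these approaches shows up when the values have mixed types. The sketch below uses a hypothetical dict, mixed, that is not part of the question, to compare the resulting dtypes.

```python
import pandas as pd

mixed = {'a': 1, 'b': 'x', 'c': 3.5}  # hypothetical dict with mixed value types

# Wrapping in a list builds one row and infers a dtype per column.
wrapped = pd.DataFrame([mixed])

# from_dict(..., orient='index') first stacks all values into one column,
# so after transposing every column is object dtype.
transposed = pd.DataFrame.from_dict(mixed, orient='index').T

print(wrapped.dtypes)
print(transposed.dtypes)
```

If the dtypes matter downstream, wrapping the dict in a list may be the simpler choice; calling infer_objects() on the transposed frame is another way to recover them.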
