Prevent Pandas from unpacking a tuple when creating a dataframe from dict - python

When creating a DataFrame in Pandas from a dictionary, a tuple is automatically expanded, i.e.
import pandas
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pandas.DataFrame.from_dict(d)
print(df)
returns
a b c
0 1 2 3
1 1 2 4
Apart from converting the tuple to string first, is there any way to prevent this from happening? I would want the result to be
a b c
0 1 2 (3, 4)

Try adding [] so that the value for key c is a list containing the tuple:
import pandas
d = {'a': 1, 'b': 2, 'c': [(3,4)]}
df = pandas.DataFrame.from_dict(d)
print(df)
a b c
0 1 2 (3, 4)

Pass param orient='index' and transpose the result so it doesn't broadcast the scalar values:
In [13]:
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pd.DataFrame.from_dict(d, orient='index').T
df
Out[13]:
a c b
0 1 (3, 4) 2
To handle the situation where the first dict entry is a tuple, you'd need to enclose all the dict values into a list so it's iterable:
In [20]:
d = {'a': (5,6), 'b': 2, 'c': 1}
d1 = dict(zip(d.keys(), [[x] for x in d.values()]))
pd.DataFrame.from_dict(d1, orient='index').T
Out[23]:
a b c
0 (5, 6) 2 1
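Another option, offered here as a sketch rather than something taken from the answers above, is to wrap the whole dictionary in a list so that pandas reads it as a single record. The tuple then stays in one cell, wherever it appears in the dict.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': (3, 4)}

# A list of dicts is read as a list of rows, so every value
# (including the tuple) lands in a single cell of that row.
df = pd.DataFrame([d])
print(df)
```

This avoids both the transpose and the need to wrap each value individually.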

Related

What to pass to aggfunc in pandas pivot table for summing up counters

I have a data table with two columns "A" and "B", and the elements in column "B" are counters. For example,
from collections import Counter
import pandas as pd
c = Counter(a=4, b=2)
df = pd.DataFrame({"A": ["group1", "group1", "group1", "group2", "group2"],
"B": [c, c, c, c, c]})
I would like to create a pivot table, where I group over element values in column "A" and aggregate over column "B" by adding up the counters. I wonder what should I pass to aggfunc?
This is what I have tried, but sadly it does not work:
pt = pd.pivot_table(df, index = ['A'], values = ['B'], aggfunc = ['+'])
Any suggestions?
My expected output is
table
group1 Counter(a=12, b=6) # i.e., c+c+c
group2 Counter(a=8, b=4) # i.e., c+c
It's just sum.
>>> df.groupby('A')['B'].sum()
A
group1 {'a': 12, 'b': 6}
group2 {'a': 8, 'b': 4}
Name: B, dtype: object
Two notes:
Putting dictionaries into dataframe columns is usually not a good practice. I would use two columns to hold the value for 'a' and 'b' respectively.
"B": [c, c, c, c, c] initializes each element of column 'B' with the same counter object.
Demo:
>>> df.loc[0, 'B']['a'] = 100
>>> df
Out[9]:
A B
0 group1 {'a': 100, 'b': 2}
1 group1 {'a': 100, 'b': 2}
2 group1 {'a': 100, 'b': 2}
3 group2 {'a': 100, 'b': 2}
4 group2 {'a': 100, 'b': 2}
You might want "B": [c.copy() for _ in range(5)] - if you want to keep your original design at all, that is.
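If the groupby sum does not behave as expected on object columns in your pandas version, a plain-Python fallback can be sketched as follows. The name totals is only illustrative, and the result is an ordinary dict rather than a pivot table.

```python
from collections import Counter

import pandas as pd

# Independent Counter objects, one per row.
df = pd.DataFrame({
    "A": ["group1", "group1", "group1", "group2", "group2"],
    "B": [Counter(a=4, b=2) for _ in range(5)],
})

# Iterating over the grouped column yields (group name, Series of Counters).
# sum() with an empty Counter as the start value adds them via Counter.__add__.
totals = {name: sum(group, Counter()) for name, group in df.groupby("A")["B"]}
print(totals)
```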

How to convert list of nested dictionary to pandas DataFrame?

I have some data containing nested dictionaries like below:
mylist = [{"a": 1, "b": {"c": 2, "d":3}}, {"a": 3, "b": {"c": 4, "d":3}}]
If we convert it to pandas DataFrame,
import pandas as pd
result_dataframe = pd.DataFrame(mylist)
print(result_dataframe)
It will output:
a b
0 1 {'c': 2, 'd': 3}
1 3 {'c': 4, 'd': 3}
I want to convert the list of dictionaries and ignore the key of the nested dictionary. My code is below:
new_dataframe = result_dataframe.drop(columns=["b"])
b_dict_list = [document["b"] for document in mylist]
b_df = pd.DataFrame(b_dict_list)
frames = [new_dataframe, b_df]
total_frame = pd.concat(frames, axis=1)
The total_frame is what I want:
a c d
0 1 2 3
1 3 4 3
But I think my code is a little complicated. Is there any simple way to deal with this problem? Thank you.
I had a similar problem to this one. I used pd.json_normalize(x) and it worked. The only difference is that the column names of the data frame will look a little different.
mylist = [{"a": 1, "b": {"c": 2, "d":3}}, {"a": 3, "b": {"c": 4, "d":3}}]
df = pd.json_normalize(mylist)
print(df)
Output:
a b.c b.d
0 1 2 3
1 3 4 3
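If the prefixed column names are unwanted, one possible follow-up is to keep only the part of each name after the last dot. This is a sketch that assumes the shortened names do not collide with each other.

```python
import pandas as pd

mylist = [{"a": 1, "b": {"c": 2, "d": 3}}, {"a": 3, "b": {"c": 4, "d": 3}}]

df = pd.json_normalize(mylist)

# Keep only the text after the last "." so that "b.c" becomes "c".
# Assumption: no two shortened names end up identical.
df.columns = [name.split(".")[-1] for name in df.columns]
print(df)
```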
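Use a list comprehension with pop to extract the value of b and merge the dictionaries. Note that pop mutates the dictionaries in mylist, and on newer Python versions x may be unpacked before the pop runs, leaving the b key in the result; the non-mutating variants further down avoid both issues: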
a = [{**x, **x.pop('b')} for x in mylist]
print (a)
[{'a': 1, 'c': 2, 'd': 3}, {'a': 3, 'c': 4, 'd': 3}]
result_dataframe = pd.DataFrame(a)
print(result_dataframe)
a c d
0 1 2 3
1 3 4 3
Another solution, thanks to @Sandeep Kadapa:
a = [{'a': x['a'], **x['b']} for x in mylist]
#alternative
a = [{'a': x['a'], **x.get('b')} for x in mylist]
Or by applying pd.Series() to your method:
mylist = [{"a": 1, "b": {"c": 2, "d":3}}, {"a": 3, "b": {"c": 4, "d":3}}]
result_dataframe = pd.DataFrame(mylist)
result_dataframe.drop(columns='b').join(result_dataframe['b'].apply(pd.Series))
a c d
0 1 2 3
1 3 4 3
I prefer to write a function that accepts your mylist, flattens it one nested level down, and returns a dictionary. This has the added advantage that you don't need to know in advance which keys (like b) hold nested dictionaries, so the function works for all nested keys one level down.
mylist = [{"a": 1, "b": {"c": 2, "d":3}}, {"a": 3, "b": {"c": 4, "d":3}}]
import pandas as pd
def dropnested(alist):
    outputdict = {}
    for dic in alist:
        for key, value in dic.items():
            if isinstance(value, dict):
                # Nested dict: promote its keys to the top level.
                for k2, v2 in value.items():
                    outputdict[k2] = outputdict.get(k2, []) + [v2]
            else:
                outputdict[key] = outputdict.get(key, []) + [value]
    return outputdict
df = pd.DataFrame.from_dict(dropnested(mylist))
print (df)
# a c d
#0 1 2 3
#1 3 4 3
If you try:
mylist = [{"a": 1, "b": {"c": 2, "d":3}, "g": {"e": 2, "f":3}},
{"a": 3, "z": {"c": 4, "d":3}, "e": {"e": 2, "f":3}}]
df = pd.DataFrame.from_dict(dropnested(mylist))
print (df)
# a c d e f
#0 1 2 3 2 3
#1 3 4 3 2 3
We can see here that it flattens the keys b, g, z, and e without issue, as opposed to having to name each nested key to convert.

How to get the n most frequent values from each column in pandas

I know how to get most frequent value of each column in dataframe using "mode". For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the "n" most frequent values of each column of a dataframe. For example, for the dataframe above, I would like the following output for n=2:
A
0 2
1 1
Any pointer ?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
Use value_counts and select the first N index values. Because value_counts works on one column at a time, you need apply or a dict comprehension with the DataFrame constructor. Wrapping each result in a Series is necessary for a more general solution, since a column may have fewer than N distinct values, e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1]})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x:pd.Series( df[x].value_counts().index[:N]) for x in df.columns})
print (df)
A B
0 2 1.0
1 1 NaN
For more general solution select only numeric columns first by select_dtypes:
import numpy as np
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1],
'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
A B
0 2 1.0
1 1 NaN
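The dict-comprehension approach can be wrapped in a small helper. The function name top_n_values is hypothetical, and this is only a sketch; the order of values with equal counts is not something it tries to control.

```python
import pandas as pd

def top_n_values(frame, n):
    """Return the n most frequent values of every column, most frequent first."""
    return pd.DataFrame({
        # Wrapping in a Series lets shorter columns be padded with NaN.
        col: pd.Series(frame[col].value_counts().index[:n])
        for col in frame.columns
    })

data = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                     'B': [1, 1, 1, 1, 1, 1]})
result = top_n_values(data, 2)
print(result)
```

Column B has only one distinct value, so its second row is NaN and the column is shown as float.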

What to do when pandas column renaming creates column name duplicates

Why doesn't a pandas.DataFrame object complain when I rename a column if the new column name already exists?
This makes referencing the new column in the future return a pandas.DataFrame as opposed to a pandas.Series, which can cause further errors.
Secondly, is there a suggested way to handle such a situation?
Example:
import pandas as pd
df = pd.DataFrame( {'A' : ['foo','bar'] ,'B' : ['bar','foo'] } )
df.B.map( {'bar':'foo','foo':'bar'} )
# 0 foo
# 1 bar
# Name: B, dtype: object
df.rename(columns={'A':'B'},inplace=True)
Now, the following will fail:
df.B.map( {'bar':'foo','foo':'bar'} )
#AttributeError: 'DataFrame' object has no attribute 'map'
Let's say you had a dictionary mapping old columns to new column names. When renaming your DataFrame, you could use a dictionary comprehension to test if the new value v is already in the DataFrame:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
d = {'a': 'B', 'b': 'B'}
df.rename(columns={k: v for k, v in d.items() if v not in df}, inplace=True)
>>> df
a B
0 1 3
1 2 4
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
d = {'a': 'b'}
df.rename(columns={k: v for k, v in d.items() if v not in df}, inplace=True)
>>> df
a b
0 1 3
1 2 4
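The filter above only compares new names against columns that already exist, so two old names mapped to the same new name would still collide. A stricter variant, sketched here with a hypothetical helper rename_checked, computes the full list of resulting names first and raises if any name would appear twice.

```python
import pandas as pd

def rename_checked(frame, mapping):
    """Rename columns, raising instead of silently creating duplicate names."""
    new_names = [mapping.get(col, col) for col in frame.columns]
    duplicates = {name for name in new_names if new_names.count(name) > 1}
    if duplicates:
        raise ValueError(f"rename would duplicate column names: {duplicates}")
    return frame.rename(columns=mapping)

df = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['bar', 'foo']})
renamed = rename_checked(df, {'A': 'C'})   # fine, no clash

try:
    rename_checked(df, {'A': 'B'})         # would give two columns named B
    clash_detected = False
except ValueError:
    clash_detected = True
```

Whether to raise, skip the clashing entry, or add a suffix is a design choice; raising is shown only because it makes the problem visible early.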

Get a Dictionary by applying function to pandas Series

I would like to apply a function to a dataframe and receive a single dictionary as a result. pandas.apply gives me a Series of dicts, and so currently I have to combine keys from each. I'll use an example to illustrate.
I have a pandas dataframe like so.
In [20]: df
Out[20]:
0 1
0 2.025745 a
1 -1.840914 b
2 -0.428811 c
3 0.718237 d
4 0.079593 e
I have some function that returns a dictionary. For this example I'm using a toy lambda function lambda x: {x: ord(x)} that returns a dictionary.
In [22]: what_i_get = df[1].apply(lambda x: {x: ord(x)})
In [23]: what_i_get
Out[23]:
0 {'a': 97}
1 {'b': 98}
2 {'c': 99}
3 {'d': 100}
4 {'e': 101}
Name: 1
apply() gives me a series of dictionaries, but what I want is a single dictionary.
I could create it with something like this:
In [41]: what_i_want = {}
In [42]: for elem in what_i_get:
....:     for k, v in elem.items():
....:         what_i_want[k] = v
....:
In [43]: what_i_want
Out[43]: {'a': 97, 'b': 98, 'c': 99, 'd': 100, 'e': 101}
But it seems I should be able to get what I want more directly.
Instead of returning a dict from your function, just return the mapped value, then create one dict outside the mapping operation:
>>> d
Stuff
0 a
1 b
2 c
3 d
>>> dict(zip(d.Stuff, d.Stuff.map(ord)))
{'a': 97, 'b': 98, 'c': 99, 'd': 100}
Cutting out the items() middle-man:
what_i_want = {}
for elem in what_i_get:
    what_i_want.update(elem)
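The same merge can also be written as a single dict comprehension. The sketch below rebuilds a small stand-in for the question's data; the variable letters is an assumption made for the example.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd', 'e'])
what_i_get = letters.apply(lambda x: {x: ord(x)})

# Flatten the Series of single-entry dicts into one dict.
# If a key repeats, the later entry overwrites the earlier one.
what_i_want = {k: v for elem in what_i_get for k, v in elem.items()}
print(what_i_want)
```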
