Suppose I have the following list in Python:
[
{'id': [1,2,3]},
{'name': [4,3,2]},
{'age': [9,0,1]},
]
How would I load this into a pandas DataFrame? Usually I build the DataFrame from a dict, but it's important for me to maintain the column order.
The final data should look like this:
id  name  age
 1     4    9
 2     3    0
 3     2    1
You can construct a single dictionary and feed it to pd.DataFrame. To guarantee that the column ordering is preserved, use collections.OrderedDict:
import pandas as pd
from collections import OrderedDict
L = [{'id': [1,2,3]},
     {'name': [4,3,2]},
     {'age': [9,0,1]}]
df = pd.DataFrame(OrderedDict([(k, v) for d in L for k, v in d.items()]))
print(df)
   id  name  age
0   1     4    9
1   2     3    0
2   3     2    1
With Python 3.7+, dictionaries are insertion-ordered, so you can use a regular dict:
df = pd.DataFrame({k: v for d in L for k, v in d.items()})
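As a quick end-to-end check, here is a minimal, self-contained sketch of the plain-dict version (reusing the list from the question):

import pandas as pd

L = [{'id': [1, 2, 3]},
     {'name': [4, 3, 2]},
     {'age': [9, 0, 1]}]

# flatten the list of single-key dicts into one dict; on Python 3.7+
# the insertion order of the list is preserved in the columns
df = pd.DataFrame({k: v for d in L for k, v in d.items()})
print(df)
#    id  name  age
# 0   1     4    9
# 1   2     3    0
# 2   3     2    1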
Or merge the list of dictionaries into a single dict and convert the result to a DataFrame (original_data is the list from the question):

merged_data = {}
for d in original_data:
    merged_data.update(d)

df = pd.DataFrame.from_dict(merged_data)
df = df[['id', 'name', 'age']]
print(df)
#    id  name  age
# 0   1     4    9
# 1   2     3    0
# 2   3     2    1

To me this is clearer and more readable.
A little hacky, but does
pd.concat([pd.DataFrame(d_) for d_ in d], axis=1)
work?
(assuming d = your_list)
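It does work: each single-key dict becomes a one-column DataFrame on the default 0..2 index, so concatenating along axis=1 aligns them and keeps the columns in list order. A minimal sketch on the question's data:

import pandas as pd

d = [{'id': [1, 2, 3]},
     {'name': [4, 3, 2]},
     {'age': [9, 0, 1]}]

# one-column frames share the default index, so axis=1 concat lines them up
df = pd.concat([pd.DataFrame(d_) for d_ in d], axis=1)
print(df)
#    id  name  age
# 0   1     4    9
# 1   2     3    0
# 2   3     2    1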
I have a few DataFrames:

A
id  A  B  C
 1  2  2  2
 2  3  3  3

B
id  A  B  C
 1  5  5  5
 2  6  6  6
 3  8  8  8
 4  0  0  0

C
id  A  B  C
 1  6  6  6
I need to find the length of each DataFrame and store it in a list:

search_list = ["A", "B", "C"]

Taking a reference from the previous post, is there a way to loop over this list to do something like:

my_list = []
for i in search_list:
    my_list.append(len(search_list[i]))
Desired output:
len_df = [{'A': 2},
          {'B': 4},
          {'C': 1}]
You can loop over the DataFrames themselves if they are iterable, i.e. stored in a list or other similar container. And since the DataFrames don't carry the names you want for them, you need a separate list/container of their names such as search_list, which you mentioned.
If df_a, df_b, and df_c are the names of the DataFrame variables and search_list is a list of their names, then:
df_list = [df_a, df_b, df_c]
len_list = [{search_list[i] : df.shape[0]} for i, df in enumerate(df_list)]
But if you want to keep those DataFrame names together with the DataFrames themselves in your code for further use, it might be reasonable to initially organize those DataFrames not in a list, but in a dictionary with their names:
df_dict = { 'A': df_a, 'B': df_b, 'C': df_c }
len_list = [{key : df.shape[0]} for key, df in df_dict.items()]
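For instance, with hypothetical stand-ins matching the question's frames, the dict version produces the desired output:

import pandas as pd

# stand-ins for the question's A, B, C frames (2, 4 and 1 rows)
df_a = pd.DataFrame({'id': [1, 2], 'A': [2, 3], 'B': [2, 3], 'C': [2, 3]})
df_b = pd.DataFrame({'id': [1, 2, 3, 4], 'A': [5, 6, 8, 0],
                     'B': [5, 6, 8, 0], 'C': [5, 6, 8, 0]})
df_c = pd.DataFrame({'id': [1], 'A': [6], 'B': [6], 'C': [6]})

df_dict = {'A': df_a, 'B': df_b, 'C': df_c}
print([{key: df.shape[0]} for key, df in df_dict.items()])
# [{'A': 2}, {'B': 4}, {'C': 1}]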
They need to be in a list or dict in the first place if the intention is to loop through them.
Here is how you would do this:
import pandas as pd
# 3 dicts
a = {'a': [1, 2, 3], 'b': [4, 5, 6]}
b = {'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7]}
c = {'a': [1, 2, 3, 4, 5], 'b': [4, 5, 6, 7, 8]}

# 3 dataframes
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
df3 = pd.DataFrame(c)

# dict of dataframes
dict_of_df = {'a': df1, 'b': df2, 'c': df3}

# put all lengths into a new dict
df_lengths = {}
for k, v in dict_of_df.items():
    df_lengths[k] = len(v)
print(df_lengths)
And this is the result:
{'a': 3, 'b': 4, 'c': 5}
I have a pandas DataFrame and a dictionary whose values are lists. The values in the lists are unique across all keys. I want to add a new column to my DataFrame based on which dictionary key contains each value. E.g. suppose I have a DataFrame like this:
import pandas as pd
data = {'a': 1, 'b': 2, 'c': 2, 'd': 4, 'e': 7}
df = pd.DataFrame.from_dict(data, orient='index', columns=['col2'])
df = df.reset_index().rename(columns={'index': 'col1'})
df
  col1  col2
0    a     1
1    b     2
2    c     2
3    d     4
4    e     7
Now I also have dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     2    x
3    d     4    z
4    e     7    z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
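For reference, here is what the two intermediate objects look like with my_dict from the question: the exploded Series holds one list element per row, and swapping its index and values gives the reverse lookup used by map:

print(tmp)
# x    a
# x    c
# y    b
# z    d
# z    e
# dtype: object

print(pd.Series(tmp.index, tmp))
# a    x
# c    x
# b    y
# d    z
# e    z
# dtype: object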
Another option (starts similar to above) but instead of map, merge:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     2    x
3    d     4    z
4    e     7    z
I am reading data from a CSV file like this:
import pandas as pd
data_1=pd.read_csv("sample.csv")
data_1.head(10)
It has two columns :
ID detail
1 [{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, {'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]
The detail column is not JSON; it is a string representation of a list of dicts, and I want to flatten it so the result looks something like this:
ID  a  b     c     d
 1  1  1.85  aaaa   6
 1  2  3.89  bbbb  10
I always get a, b, c, d in the detail column, and I want to move the final result to a SQL table. Can someone please help me solve this?
Use a dictionary comprehension with ast.literal_eval to convert each string repr to a list of dicts and build a DataFrame from it, then use concat and convert the first level of the MultiIndex to the ID column:

import ast
d = {i: pd.DataFrame(ast.literal_eval(s)) for i, s in df[['ID', 'detail']].to_numpy()}
# for older pandas versions use .values
# d = {i: pd.DataFrame(ast.literal_eval(s)) for i, s in df[['ID', 'detail']].values}
df = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print(df)

   ID  a     b     c   d
0   1  1  1.85  aaaa   6
1   1  2  3.89  bbbb  10
Or use a list comprehension with DataFrame.assign to attach the ID column; the only extra step needed is reordering the columns (moving the last column to the front):

import ast
L = [pd.DataFrame(ast.literal_eval(s)).assign(ID=i) for i, s in df[['ID', 'detail']].to_numpy()]
# for older pandas versions use .values
# L = [pd.DataFrame(ast.literal_eval(s)).assign(ID=i) for i, s in df[['ID', 'detail']].values]
df = pd.concat(L, ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print(df)

   ID  a     b     c   d
0   1  1  1.85  aaaa   6
1   1  2  3.89  bbbb  10
EDIT: for two ID columns, change the second solution to:

d = [pd.DataFrame(ast.literal_eval(s)).assign(ID1=i1, ID2=i2) for i1, i2, s in df[['ID1', 'ID2', 'detail']].to_numpy()]
df = pd.concat(d, ignore_index=True)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
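A minimal sketch of the two-ID variant, using a hypothetical frame with ID1, ID2 and the same string-encoded detail column:

import ast
import pandas as pd

df = pd.DataFrame({'ID1': [1], 'ID2': [7],
                   'detail': ["[{'a': 1, 'b': 1.85}, {'a': 2, 'b': 3.89}]"]})

# parse each row's string into a list of dicts, keep both IDs as columns
d = [pd.DataFrame(ast.literal_eval(s)).assign(ID1=i1, ID2=i2)
     for i1, i2, s in df[['ID1', 'ID2', 'detail']].to_numpy()]
df = pd.concat(d, ignore_index=True)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
print(df)
#    ID1  ID2  a     b
# 0    1    7  1  1.85
# 1    1    7  2  3.89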
I have a DataFrame of 1000 rows and 10 columns. I want to add 20 columns, each containing a single repeated value (what I call a default value), so my final df would be 1000 rows and 30 columns. I know that I can do it 20 times by writing:

df['column 11'] = 'default value'
df['column 12'] = 'default value 2'

But I would like to do it in a cleaner way. I have a dict of {'column label': 'default value'} pairs. How can I do so? I've tried df.insert and pd.concat but couldn't find my way through.
One way to do so:
df_len = len(df)
new_df = pd.DataFrame({col: [val] * df_len for col, val in your_dict.items()})
df = pd.concat((df, new_df), axis=1)
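A quick self-contained sketch with a hypothetical defaults dict (this assumes df has the default RangeIndex; otherwise build new_df with index=df.index so concat aligns the rows):

import pandas as pd

df = pd.DataFrame({'a': range(3)})
your_dict = {'column 11': 'default value', 'column 12': 'default value 2'}

# one constant-valued list per new column, all of the frame's length
new_df = pd.DataFrame({col: [val] * len(df) for col, val in your_dict.items()})
df = pd.concat((df, new_df), axis=1)
print(df)
#    a      column 11        column 12
# 0  0  default value  default value 2
# 1  1  default value  default value 2
# 2  2  default value  default value 2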
Generally, if the dictionary keys (the new column names) may contain spaces, use the DataFrame constructor with DataFrame.join:
df = pd.DataFrame({'a': range(5)})
print(df)
   a
0  0
1  1
2  2
3  3
4  4
d = {'A 11': 's', 'A 12': 'c'}
df = df.join(pd.DataFrame(d, index=df.index))
print(df)
   a A 11 A 12
0  0    s    c
1  1    s    c
2  2    s    c
3  3    s    c
4  4    s    c
If the names contain no spaces and are valid Python identifiers, it is possible to use DataFrame.assign:
d = {'A11': 's', 'A12': 'c'}
df = df.assign(**d)
print(df)
   a A11 A12
0  0   s   c
1  1   s   c
2  2   s   c
3  3   s   c
4  4   s   c
Another solution is to loop over the dictionary and assign:

for k, v in d.items():
    df[k] = v
I have a pandas dataframe:
      0   1
0  john  14
1  jack   2
2  emma   6
3  john  23
4  john  53
5  jack  43
that is really large (1+ GB). I want to split the DataFrame by name and execute code on each of the resulting DataFrames. This is my code, which works:
df.sort_values(by=[0], inplace=True)  # df.sort(columns=[0]) in older pandas
df.set_index(keys=[0], drop=False, inplace=True)
names = df[0].unique().tolist()
for name in names:
    name_df = df.loc[df[0] == name]
    do_stuff(name_df)
However it runs really slow. Is there a faster way to accomplish this task?
Here is a dictionary comprehension example that simply sums each group's values, grouped on name (the columns here are labeled with the strings '0' and '1'; adjust to your actual column labels):
>>> {k: gb['1'].sum() for k, gb in df.groupby('0')}
{'emma': 6, 'jack': 45, 'john': 90}
For something more complicated, you can create a function and then apply it to each group:

def foo(s):
    s = (s + 1) * 2  # elementwise transform
    return s.sum()   # aggregate the group to a single value

{k: foo(g['1']) for k, g in df.groupby('0')}
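A runnable sketch reconstructing the question's data (with string column labels '0' and '1' to match the snippets above):

import pandas as pd

df = pd.DataFrame({'0': ['john', 'jack', 'emma', 'john', 'john', 'jack'],
                   '1': [14, 2, 6, 23, 53, 43]})

def foo(s):
    s = (s + 1) * 2
    return s.sum()

# one (name -> aggregated value) entry per group
print({k: foo(g['1']) for k, g in df.groupby('0')})
# {'emma': 14, 'jack': 94, 'john': 186}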