Parse a dict from a CSV file in Python

I am reading data from a CSV file like:
import pandas as pd
data_1=pd.read_csv("sample.csv")
data_1.head(10)
It has two columns:
ID detail
1 [{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, {'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]
The detail column is not JSON but a dict, and I want to flatten it so the result looks like this:
ID a b c d
1 1 1.85 aaaa 6
1 2 3.89 bbbb 10
The detail column always contains the keys a, b, c, d, and I want to move the final results to a SQL table.
Can someone please help me solve this?

Use a dictionary comprehension with ast.literal_eval to convert the string reprs to lists of dicts and build a DataFrame from each, then use concat and convert the first level of the MultiIndex to an ID column:
import ast
d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].to_numpy()}
# for older pandas versions use .values
# d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].values}
df = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
Or use a list comprehension with DataFrame.assign for the ID column; the only extra step is reordering the columns to move the last column first:
import ast
L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].to_numpy()]
# for older pandas versions use .values
#L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].values]
df = pd.concat(L, ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
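In newer pandas (0.25+ for explode, 1.0+ for top-level json_normalize) a sketch using DataFrame.explode plus pd.json_normalize can achieve the same flattening; the inline sample frame here stands in for the CSV data:

```python
import ast
import pandas as pd

# sample frame standing in for the CSV data from the question
df = pd.DataFrame({
    'ID': [1],
    'detail': ["[{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, "
               "{'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]"],
})

# parse the string repr into real lists of dicts, explode to one dict per row,
# then expand each dict into columns with json_normalize
s = df.assign(detail=df['detail'].map(ast.literal_eval)).explode('detail')
out = pd.concat(
    [s[['ID']].reset_index(drop=True), pd.json_normalize(s['detail'].tolist())],
    axis=1,
)
print(out)
```

This avoids building one DataFrame per row, which can matter when the CSV has many rows.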
EDIT:
For two ID columns, change the second solution:
d = [pd.DataFrame(ast.literal_eval(d)).assign(ID1=i1, ID2=i2) for i1, i2, d in df[['ID1','ID2','detail']].to_numpy()]
df = pd.concat(d)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
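The question also asks about moving the flattened result into a SQL table; a minimal sketch using DataFrame.to_sql, where the table name details and the in-memory SQLite database are assumptions for illustration:

```python
import ast
import sqlite3
import pandas as pd

# sample frame standing in for the CSV data from the question
df = pd.DataFrame({
    'ID': [1],
    'detail': ["[{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, "
               "{'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]"],
})

# flatten as in the second solution above
L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i)
     for i, d in df[['ID', 'detail']].to_numpy()]
flat = pd.concat(L, ignore_index=True)

# write to a SQL table; to_sql accepts a sqlite3 connection directly,
# for other databases pass a SQLAlchemy engine instead
con = sqlite3.connect(':memory:')
flat.to_sql('details', con, index=False, if_exists='replace')
print(pd.read_sql('SELECT * FROM details', con))
```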

Related

Add length of multiple dataframe and store it in a list

I have few dfs:
A
id A B C
1 2 2 2
2 3 3 3
B
id A B C
1 5 5 5
2 6 6 6
3 8 8 8
4 0 0 0
C
id A B C
1 6 6 6
I need to find the length of each df and store it in a list:
search_list = ["A", "B", "C"]
I took the reference from the previous post. Is there a way to loop over this list to do something like:
my_list=[]
for i in search_list:
    my_list.append(len(search_list[i]))
Desired output:
len_df =
[{'A': 2},
{'B': 4},
{'C': 1}]
You can loop over the DataFrames themselves if they are "loopable"/iterable, i.e. stored in a list or similar container. And since the DataFrames don't carry the names you want for them, you need a separate list/container of names like search_list, which you mentioned.
If df_a, df_b, and df_c are the names of the DataFrame variables and search_list is a list of their names, then:
df_list = [df_a, df_b, df_c]
len_list = [{search_list[i] : df.shape[0]} for i, df in enumerate(df_list)]
But if you want to keep those DataFrame names together with the DataFrames themselves in your code for further use, it might be reasonable to initially organize those DataFrames not in a list, but in a dictionary with their names:
df_dict = { 'A': df_a, 'B': df_b, 'C': df_c }
len_list = [{key : df.shape[0]} for key, df in df_dict.items()]
They need to be in a list or dict in the first place if the intention is to loop through them.
Here is how you would do this:
import pandas as pd
# 3 dicts
a = {'a':[1,2,3], 'b':[4,5,6]}
b = {'a':[1,2,3,4], 'b':[4,5,6,7]}
c = {'a':[1,2,3,4,5], 'b':[4,5,6,7,8]}
# 3 dataframes
df1=pd.DataFrame(a)
df2=pd.DataFrame(b)
df3=pd.DataFrame(c)
# dict or dataframes
dict_of_df = {'a':df1, 'b':df2, 'c':df3}
# put all lengths into a new dict
df_lengths = {}
for k, v in dict_of_df.items():
    df_lengths[k] = len(v)
print(df_lengths)
And this is the result:
{'a': 3, 'b': 4, 'c': 5}
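If the exact desired output from the question (a list of one-entry dicts) is needed, either answer's result converts with one comprehension; the sample frames below are stand-ins for df_a, df_b, df_c with the shapes from the question:

```python
import pandas as pd

# sample frames standing in for df_a, df_b, df_c (row counts from the question)
df_a = pd.DataFrame({'A': [2, 3], 'B': [2, 3], 'C': [2, 3]})
df_b = pd.DataFrame({'A': [5, 6, 8, 0], 'B': [5, 6, 8, 0], 'C': [5, 6, 8, 0]})
df_c = pd.DataFrame({'A': [6], 'B': [6], 'C': [6]})

df_dict = {'A': df_a, 'B': df_b, 'C': df_c}

# one {name: length} dict per frame, matching the desired len_df output
len_df = [{name: len(frame)} for name, frame in df_dict.items()]
print(len_df)  # [{'A': 2}, {'B': 4}, {'C': 1}]
```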

remove string from json row

I'm taking several columns from a data frame and adding them to a new column.
A B C
1 3 6
1 2 4
4 5 0
df['D'] = df.apply(lambda x: x[['C', 'B']].to_json(), axis=1)
I'm then creating a new data frame that locates the unique instances of df['A']:
df2 = pd.DataFrame({'A': df.A.unique()})
Finally, I'm creating a new column in df2 that lists the values of df['B'] and df['C']:
df2['E'] = [list(set(df['D'].loc[df['A'] == x['A']]))
for _, x in df2.iterrows()]
but this keeps each object as a string:
A B C D
1 3 6 ['{"B":"3","C":6"}', '{"B":"2","C":4"}']
Furthermore, when I dump this to JSON:
payload = json.dumps(data)
I get this result:
["{\"B\":\"3\",\"C\":"6"}", "{\"B\":\"2\",\"C\":"\4"}"]
but I'm ultimately looking to remove the string on the objects and have this as the output:
[{"B":"3","C":"6"}, {"B":"2","C":"4"}]
Any guidance will be greatly appreciated.
In your case, use groupby with to_dict:
out = df.groupby('A').apply(lambda x : x[['B','C']].to_dict('records')).to_frame('E').reset_index()
out
Out[198]:
A E
0 1 [{'B': 3, 'C': 6}, {'B': 2, 'C': 4}]
1 4 [{'B': 5, 'C': 0}]
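With this approach the cells hold real dicts rather than JSON strings, so json.dumps on a cell produces the un-escaped output the question asks for; a quick check using the sample frame from the question:

```python
import json
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 2, 5], 'C': [6, 4, 0]})

out = (df.groupby('A')
         .apply(lambda x: x[['B', 'C']].to_dict('records'))
         .to_frame('E')
         .reset_index())

# cells are lists of dicts, not strings, so dumping gives plain JSON objects
payload = json.dumps(out.loc[0, 'E'])
print(payload)  # [{"B": 3, "C": 6}, {"B": 2, "C": 4}]
```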

Assigning column names while creating dataframe results in nan values

I have a list of dict which is being converted to a dataframe. When I attempt to pass the columns argument the output values are all nan.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
This happens because, when you pass a list of dictionaries, the DataFrame constructor creates the column names from the dictionary keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass the columns parameter with values that do not all exist among the dictionary keys, the existing keys are filtered and, for the missing ones, columns of missing values are created, in the order given in the columns list:
# changed order works, because keys a, b exist in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# a is kept, d is filled with missing values - key d is not in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# b is kept, c is filled with missing values - key c is not in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# a, b are kept; c, d are filled with missing values - those keys are not in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename them or set new ones, as in your second code sample.
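A rename chained onto the constructor keeps it to one expression; a minimal sketch using the list from the question:

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# build from the dict keys first, then rename, so no values are lost
df = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})
print(df)
#    c  d
# 0  1  2
# 1  3  4
```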

Load a dataframe from json array in order

Suppose I have the following array in python:
[
{'id': [1,2,3]},
{'name': [4,3,2]},
{'age': [9,0,1]},
]
How would I load this into a pandas dataframe? Usually I do pd.DataFrame from a dict, but it's important for me to maintain the column order.
The final data should look like this:
id name age
1 4 9
2 3 0
3 2 1
You can construct a single dictionary and then feed to pd.DataFrame. To guarantee column ordering is preserved, use collections.OrderedDict:
from collections import OrderedDict
L = [{'id': [1,2,3]},
{'name': [4,3,2]},
{'age': [9,0,1]}]
df = pd.DataFrame(OrderedDict([(k, v) for d in L for k, v in d.items()]))
print(df)
id name age
0 1 4 9
1 2 3 0
2 3 2 1
With Python 3.7+ dictionaries are insertion ordered, so you can use a regular dict:
df = pd.DataFrame({k: v for d in L for k, v in d.items()})
Or merge the list of dictionaries (source) and convert the result to a dataframe:
merged_data = {}
for d in L:
    merged_data.update(d)
df = pd.DataFrame.from_dict(merged_data)
df = df[['id', 'name', 'age']]
print(df)
# id name age
# 0 1 4 9
# 1 2 3 0
# 2 3 2 1
For me it's more clear and readable.
A little hacky, but does
pd.concat([pd.DataFrame(d_) for d_ in d], axis=1)
work?
(assuming d = your_list)
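For the record, that concat approach does work and preserves the column order, since each one-key dict becomes a single-column frame and they share the same default index; a quick check:

```python
import pandas as pd

d = [{'id': [1, 2, 3]},
     {'name': [4, 3, 2]},
     {'age': [9, 0, 1]}]

# each one-key dict becomes a single-column frame; concat joins them on the
# shared default index, keeping the list order as the column order
df = pd.concat([pd.DataFrame(d_) for d_ in d], axis=1)
print(df)
#    id  name  age
# 0   1     4    9
# 1   2     3    0
# 2   3     2    1
```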

DataFrame from dictionary

Sorry if this is a duplicate, but I didn't find the solution on the internet...
I have some dictionary
{'a':1, 'b':2, 'c':3}
Now I want to construct a pandas DataFrame with column names corresponding to the keys and values corresponding to the values. It should be a DataFrame with only one row.
a b c
1 2 3
In other topics I only found solutions where both keys and values become columns in the new DataFrame.
You have some caveats here, if you just pass the dict to the DataFrame constructor then it will raise an error:
ValueError: If using all scalar values, you must pass an index
To get around that you can pass an index which will work:
In [139]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame(temp, index=[0])
Out[139]:
a b c
0 1 2 3
Ideally your values should be iterable, so a list or array like:
In [141]:
temp = {'a':[1],'b':[2],'c':[3]}
pd.DataFrame(temp)
Out[141]:
a b c
0 1 2 3
Thanks to @joris for pointing out that if you wrap the dict in a list then you don't have to pass an index to the constructor:
In [142]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame([temp])
Out[142]:
a b c
0 1 2 3
For flexibility, you can also use pd.DataFrame.from_dict with orient='index'. This works whether your dictionary values are scalars or lists.
Note the final transpose step, which can be performed via df.T or df.transpose().
temp1 = {'a': 1, 'b': 2, 'c': 3}
temp2 = {'a': [1, 2], 'b':[2, 3], 'c':[3, 4]}
print(pd.DataFrame.from_dict(temp1, orient='index').T)
a b c
0 1 2 3
print(pd.DataFrame.from_dict(temp2, orient='index').T)
a b c
0 1 2 3
1 2 3 4
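Another option for the scalar case is to go through a Series and transpose, which also keeps the key order; a small sketch:

```python
import pandas as pd

temp = {'a': 1, 'b': 2, 'c': 3}

# Series maps keys to an index; to_frame().T turns that index into columns
df = pd.Series(temp).to_frame().T
print(df)
#    a  b  c
# 0  1  2  3
```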
