I'm taking several columns from a data frame and combining them into a new column.
A B C
1 3 6
1 2 4
4 5 0
df['D'] = df.apply(lambda x: x[['C', 'B']].to_json(), axis=1)
I'm then creating a new data frame that locates the unique instances of df['A']:
df2 = pd.DataFrame({'A': df.A.unique()})
finally, I'm creating a new column in df2 that lists the values of df['B'] and df['C']:
df2['E'] = [list(set(df['D'].loc[df['A'] == x['A']]))
            for _, x in df2.iterrows()]
but this leaves each object as a string:
A B C D
1 3 6 ['{"B":"3","C":"6"}', '{"B":"2","C":"4"}']
furthermore, when I dump this with json.dumps:
payload = json.dumps(data)
I get this result:
["{\"B\":\"3\",\"C\":\"6\"}", "{\"B\":\"2\",\"C\":\"4\"}"]
but I'm ultimately looking to remove the string on the objects and have this as the output:
[{"B":"3","C":"6"}, {"B":"2","C":"4"}]
Any guidance will be greatly appreciated.
In your case, use groupby with to_dict('records'):
out = df.groupby('A').apply(lambda x : x[['B','C']].to_dict('records')).to_frame('E').reset_index()
out
Out[198]:
A E
0 1 [{'B': 3, 'C': 6}, {'B': 2, 'C': 4}]
1 4 [{'B': 5, 'C': 0}]
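For reference, here is a self-contained sketch of that approach, rebuilding the sample frame from the question. Note the serialized values come out as numbers rather than quoted strings, because B and C are integer columns:

```python
import json
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 2, 5], 'C': [6, 4, 0]})

# Group rows by A and collect each group's B/C pairs as a list of dicts
out = (df.groupby('A')
         .apply(lambda x: x[['B', 'C']].to_dict('records'))
         .to_frame('E')
         .reset_index())

# Dumping a cell now yields real JSON objects, not doubly-quoted strings
payload = json.dumps(out.loc[0, 'E'])
print(payload)  # [{"B": 3, "C": 6}, {"B": 2, "C": 4}]
```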
I have a few DataFrames:
A
id A B C
1 2 2 2
2 3 3 3
B
id A B C
1 5 5 5
2 6 6 6
3 8 8 8
4 0 0 0
C
id A B C
1 6 6 6
I need to find the length of each df and store it in a list:
search_list = ["A", "B", "C"]
I took the reference from the previous post. Is there a way to loop over this list to do something like:
my_list = []
for i in search_list:
    my_list.append(len(search_list[i]))
Desired output:
len_df =
[{'A': 2},
{'B': 4},
{'C': 1}]
You can loop over the DataFrames themselves if they are iterable, meaning they are stored in a list or other similar container. And since the DataFrames don't carry the names you want for them, you need a separate list of names like search_list, which you mentioned.
If df_a, df_b, and df_c are the names of the DataFrame variables and search_list is a list of their names, then:
df_list = [df_a, df_b, df_c]
len_list = [{search_list[i] : df.shape[0]} for i, df in enumerate(df_list)]
But if you want to keep those DataFrame names together with the DataFrames themselves in your code for further use, it might be reasonable to initially organize those DataFrames not in a list, but in a dictionary with their names:
df_dict = { 'A': df_a, 'B': df_b, 'C': df_c }
len_list = [{key : df.shape[0]} for key, df in df_dict.items()]
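A small runnable sketch of both variants, using toy frames sized to match the row counts in the question:

```python
import pandas as pd

# Toy frames whose row counts match the question (2, 4 and 1 rows)
df_a = pd.DataFrame({'A': [2, 3], 'B': [2, 3], 'C': [2, 3]})
df_b = pd.DataFrame({'A': [5, 6, 8, 0], 'B': [5, 6, 8, 0], 'C': [5, 6, 8, 0]})
df_c = pd.DataFrame({'A': [6], 'B': [6], 'C': [6]})

search_list = ['A', 'B', 'C']

# Variant 1: parallel lists of names and frames
df_list = [df_a, df_b, df_c]
len_df = [{search_list[i]: df.shape[0]} for i, df in enumerate(df_list)]
print(len_df)  # [{'A': 2}, {'B': 4}, {'C': 1}]

# Variant 2: keep names and frames together in one dict
df_dict = {'A': df_a, 'B': df_b, 'C': df_c}
len_df = [{key: df.shape[0]} for key, df in df_dict.items()]
print(len_df)  # [{'A': 2}, {'B': 4}, {'C': 1}]
```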
They need to be in a list or dict in the first place if the intention is to loop through them.
Here is how you would do this:
import pandas as pd
# 3 dicts
a = {'a':[1,2,3], 'b':[4,5,6]}
b = {'a':[1,2,3,4], 'b':[4,5,6,7]}
c = {'a':[1,2,3,4,5], 'b':[4,5,6,7,8]}
# 3 dataframes
df1=pd.DataFrame(a)
df2=pd.DataFrame(b)
df3=pd.DataFrame(c)
# dict of dataframes
dict_of_df = {'a':df1, 'b':df2, 'c':df3}
# put all lengths into a new dict
df_lengths = {}
for k, v in dict_of_df.items():
    df_lengths[k] = len(v)
print(df_lengths)
And this is the result:
{'a': 3, 'b': 4, 'c': 5}
I have a list of dicts which is being converted to a DataFrame. When I attempt to pass the columns argument, the output values are all NaN.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
Because when you pass a list of dictionaries, the DataFrame constructor creates the column names from the dictionaries' keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass the columns parameter, the constructor keeps only the listed names, in the given order; any listed name that does not exist among the dictionaries' keys becomes a column filled with missing values:
# changed order works, because keys a and b each exist in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# a kept, d filled with missing values - d is not a key in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# b kept, c filled with missing values - c is not a key in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# a and b kept, c and d filled with missing values - c and d are not keys in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename them or set new ones, as in your second snippet.
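Both renaming routes can be sketched side by side; `rename` takes an explicit old-name to new-name mapping, while assigning `df.columns` replaces the labels positionally:

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# Option 1: build first, then assign new labels positionally
df = pd.DataFrame(l)
df.columns = ['c', 'd']

# Option 2: rename with an explicit old-name -> new-name mapping
df2 = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})

print(df2)
#    c  d
# 0  1  2
# 1  3  4
```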
I am reading a data from the csv file like :
import pandas as pd
data_1=pd.read_csv("sample.csv")
data_1.head(10)
It has two columns :
ID detail
1 [{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, {'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]
the detail column is not JSON but a dict, and I want to flatten it so the result looks something like this:
ID a b c d
1 1 1.85 aaaa 6
1 2 3.89 bbbb 10
I always get a, b, c, d in the detail column, and I want to move the final result to a SQL table.
Can someone please help me solve this?
Use a dictionary comprehension with ast.literal_eval to convert the string reprs to lists of dicts and build a DataFrame from each, then use concat and move the first level of the MultiIndex into the ID column:
import ast
d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].to_numpy()}
# for older pandas versions use .values
# d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].values}
df = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
Or use a list comprehension with DataFrame.assign for the ID column; the only extra step is reordering the columns - last column to first:
import ast
L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].to_numpy()]
# for older pandas versions use .values
#L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].values]
df = pd.concat(L, ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
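The first approach can be run end to end with a frame built to match the question's sample (the detail strings are assumed to be valid Python literals, which is why `ast.literal_eval` is used rather than `json.loads`):

```python
import ast
import pandas as pd

# Sample frame: 'detail' holds the string repr of a list of dicts, as read from CSV
df = pd.DataFrame({
    'ID': [1],
    'detail': ["[{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, "
               "{'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]"],
})

# Parse each string into a list of dicts and expand it to a small DataFrame
d = {i: pd.DataFrame(ast.literal_eval(s)) for i, s in df[['ID', 'detail']].to_numpy()}

# Stack the per-ID frames and turn the outer index level into the ID column
out = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print(out)  # two rows, columns ID, a, b, c, d
```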
EDIT:
For 2 IDs change second solution:
d = [pd.DataFrame(ast.literal_eval(d)).assign(ID1=i1, ID2=i2) for i1, i2, d in df[['ID1','ID2','detail']].to_numpy()]
df = pd.concat(d)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is as below:
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4}]
1 1 [{'a':5 ,'b':6, 'c':7, 'd':8}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12}]
If I want to keep only the 'a' and 'c' keys/values in options, how can I modify the DataFrame? The expected result would be:
Dataframe df:
id options
0 0 [{'a':1 ,'c':3}]
1 1 [{'a':5 ,'c':7}]
2 2 [{'a':9 ,'c':11}]
You could just apply a function that modifies the entries:
>>> df.options = df.options.apply(lambda x: [{k: v for k, v in x[0].items() if k in ('a', 'c')}])
>>> df
id options
0 0 [{'a': 1, 'c': 3}]
1 1 [{'a': 5, 'c': 7}]
2 2 [{'a': 9, 'c': 11}]
Here's an example using list comprehension to create new dictionaries with only certain keys:
df['options'] = [[{'a': x[0]['a'], 'c': x[0]['c']}] for x in df['options']]
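Both answers can be reproduced with the sample data from the question; the `apply` route is the more forgiving of the two, since the dict comprehension simply skips keys that are absent:

```python
import pandas as pd

# Rebuild the sample frame: each options cell is a one-element list of dicts
df = pd.DataFrame({
    'id': [0, 1, 2],
    'options': [[{'a': 1, 'b': 2, 'c': 3, 'd': 4}],
                [{'a': 5, 'b': 6, 'c': 7, 'd': 8}],
                [{'a': 9, 'b': 10, 'c': 11, 'd': 12}]],
})

# Keep only the 'a' and 'c' keys of the dict inside each list
df['options'] = df['options'].apply(
    lambda x: [{k: v for k, v in x[0].items() if k in ('a', 'c')}])
print(df)  # each options cell now holds only the 'a' and 'c' keys
```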
I have a DataFrame like this:
a b
A 1 0
B 0 1
and I have an array ["A","B","C"].
From these, I want to create a new DataFrame like this:
a b
A 1 0
B 0 1
C NaN NaN
How can I do this?
Assuming I understand what you're after (setting aside weird duplicated-index cases), one way is to use loc to index into your frame:
>>> df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
>>> arr = ["A", "B", "C"]
>>> df
a b
A 1 0
B 0 1
>>> df.loc[arr]
a b
A 1 0
B 0 1
C NaN NaN
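One caveat: in pandas 1.0 and later, indexing with a list that includes labels missing from the index raises a KeyError, so on current versions `reindex` is the supported way to get the same result:

```python
import pandas as pd

df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})

# reindex keeps existing rows and inserts NaN rows for missing labels;
# the int columns become float because NaN is introduced
out = df.reindex(["A", "B", "C"])
print(out)
#      a    b
# A  1.0  0.0
# B  0.0  1.0
# C  NaN  NaN
```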
Create a DataFrame with only index=['C'] and concat:
df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
df = pd.concat([df, pd.DataFrame(index=['C'])])