I'm taking several columns from a data frame and combining them into a new column.
A B C
1 3 6
1 2 4
4 5 0
df['D'] = df.apply(lambda x: x[['C', 'B']].to_json(), axis=1)
I'm then creating a new data frame that locates the unique instances of df['A']:
df2 = pd.DataFrame({'A': df.A.unique()})
finally, I'm creating a new column in df2 that lists the values of df['B'] and df['C']:
df2['E'] = [list(set(df['D'].loc[df['A'] == x['A']]))
            for _, x in df2.iterrows()]
but this leaves each object as a string:
A B C D
1 3 6 ['{"B":"3","C":"6"}', '{"B":"2","C":"4"}']
furthermore, when I dump this with json.dumps:
payload = json.dumps(data)
I get this result:
["{\"B\":\"3\",\"C\":\"6\"}", "{\"B\":\"2\",\"C\":\"4\"}"]
but I'm ultimately looking to remove the string on the objects and have this as the output:
[{"B":"3","C":"6"}, {"B":"2","C":"4"}]
Any guidance will be greatly appreciated.
In your case, use groupby with to_dict('records'):
out = df.groupby('A').apply(lambda x : x[['B','C']].to_dict('records')).to_frame('E').reset_index()
out
Out[198]:
A E
0 1 [{'B': 3, 'C': 6}, {'B': 2, 'C': 4}]
1 4 [{'B': 5, 'C': 0}]
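For reference, here is a self-contained sketch of that approach, rebuilding the sample frame from the question. Note the serialized values come out as numbers rather than quoted strings, because B and C are integer columns:

```python
import json
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 2, 5], 'C': [6, 4, 0]})

# Group rows by A and collect each group's B/C pairs as a list of dicts
out = (df.groupby('A')
         .apply(lambda x: x[['B', 'C']].to_dict('records'))
         .to_frame('E')
         .reset_index())

# Dumping a cell now yields real JSON objects, not doubly-quoted strings
payload = json.dumps(out.loc[0, 'E'])
print(payload)  # [{"B": 3, "C": 6}, {"B": 2, "C": 4}]
```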
I have a few DataFrames:
A
id A B C
1 2 2 2
2 3 3 3
B
id A B C
1 5 5 5
2 6 6 6
3 8 8 8
4 0 0 0
C
id A B C
1 6 6 6
I need to find the length of each df and store it in a list:
search_list = ["A", "B", "C"]
I took the reference from the previous post. Is there a way to loop over this list to do something like:
my_list = []
for i in search_list:
    my_list.append(len(search_list[i]))
Desired output:
len_df =
[{'A': 2},
{'B': 4},
{'C': 1}]
You can loop over the DataFrames themselves if they are iterable, meaning they are stored in a list or other similar container. And since the DataFrames don't carry the names you want for them, you need a separate list of names like search_list, which you mentioned.
If df_a, df_b, and df_c are the names of the DataFrame variables and search_list is a list of their names, then:
df_list = [df_a, df_b, df_c]
len_list = [{search_list[i] : df.shape[0]} for i, df in enumerate(df_list)]
But if you want to keep those DataFrame names together with the DataFrames themselves in your code for further use, it might be reasonable to initially organize those DataFrames not in a list, but in a dictionary with their names:
df_dict = { 'A': df_a, 'B': df_b, 'C': df_c }
len_list = [{key : df.shape[0]} for key, df in df_dict.items()]
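A small runnable sketch of both variants, using toy frames sized to match the row counts in the question:

```python
import pandas as pd

# Toy frames whose row counts match the question (2, 4 and 1 rows)
df_a = pd.DataFrame({'A': [2, 3], 'B': [2, 3], 'C': [2, 3]})
df_b = pd.DataFrame({'A': [5, 6, 8, 0], 'B': [5, 6, 8, 0], 'C': [5, 6, 8, 0]})
df_c = pd.DataFrame({'A': [6], 'B': [6], 'C': [6]})

search_list = ['A', 'B', 'C']

# Variant 1: parallel lists of names and frames
df_list = [df_a, df_b, df_c]
len_df = [{search_list[i]: df.shape[0]} for i, df in enumerate(df_list)]
print(len_df)  # [{'A': 2}, {'B': 4}, {'C': 1}]

# Variant 2: keep names and frames together in one dict
df_dict = {'A': df_a, 'B': df_b, 'C': df_c}
len_df = [{key: df.shape[0]} for key, df in df_dict.items()]
print(len_df)  # [{'A': 2}, {'B': 4}, {'C': 1}]
```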
They need to be in a list or dict in the first place if the intention is to loop through them.
Here is how you would do this:
import pandas as pd
# 3 dicts
a = {'a':[1,2,3], 'b':[4,5,6]}
b = {'a':[1,2,3,4], 'b':[4,5,6,7]}
c = {'a':[1,2,3,4,5], 'b':[4,5,6,7,8]}
# 3 dataframes
df1=pd.DataFrame(a)
df2=pd.DataFrame(b)
df3=pd.DataFrame(c)
# dict of dataframes
dict_of_df = {'a':df1, 'b':df2, 'c':df3}
# put all lengths into a new dict
df_lengths = {}
for k, v in dict_of_df.items():
    df_lengths[k] = len(v)
print(df_lengths)
And this is the result:
{'a': 3, 'b': 4, 'c': 5}
I have a list of dicts which is being converted to a DataFrame. When I attempt to pass the columns argument, the output values are all NaN.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
Because when you pass a list of dictionaries, the DataFrame constructor creates the column names from the dictionaries' keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass the columns parameter, the constructor keeps only the listed names, in the given order; any listed name that does not exist among the dictionaries' keys becomes a column filled with missing values:
# changed order works, because keys a and b each exist in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# a kept, d filled with missing values - d is not a key in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# b kept, c filled with missing values - c is not a key in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# a and b kept, c and d filled with missing values - c and d are not keys in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename them or set new ones, as in your second snippet.
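Both renaming routes can be sketched side by side; `rename` takes an explicit old-name to new-name mapping, while assigning `df.columns` replaces the labels positionally:

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# Option 1: build first, then assign new labels positionally
df = pd.DataFrame(l)
df.columns = ['c', 'd']

# Option 2: rename with an explicit old-name -> new-name mapping
df2 = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})

print(df2)
#    c  d
# 0  1  2
# 1  3  4
```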
I am reading a data from the csv file like :
import pandas as pd
data_1=pd.read_csv("sample.csv")
data_1.head(10)
It has two columns :
ID detail
1 [{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, {'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]
the detail column is not JSON but a dict, and I want to flatten it so the result looks something like this:
ID a b c d
1 1 1.85 aaaa 6
1 2 3.89 bbbb 10
I always get a, b, c, d in the detail column, and I want to move the final result to a SQL table.
Can someone please help me solve this?
Use a dictionary comprehension with ast.literal_eval to convert the string reprs to lists of dicts and build a DataFrame from each, then use concat and move the first level of the MultiIndex into the ID column:
import ast
d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].to_numpy()}
# for older pandas versions use .values
# d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].values}
df = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
Or use a list comprehension with DataFrame.assign for the ID column; the only extra step is reordering the columns - last column to first:
import ast
L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].to_numpy()]
# for older pandas versions use .values
#L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].values]
df = pd.concat(L, ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
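The first approach can be run end to end with a frame built to match the question's sample (the detail strings are assumed to be valid Python literals, which is why `ast.literal_eval` is used rather than `json.loads`):

```python
import ast
import pandas as pd

# Sample frame: 'detail' holds the string repr of a list of dicts, as read from CSV
df = pd.DataFrame({
    'ID': [1],
    'detail': ["[{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, "
               "{'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]"],
})

# Parse each string into a list of dicts and expand it to a small DataFrame
d = {i: pd.DataFrame(ast.literal_eval(s)) for i, s in df[['ID', 'detail']].to_numpy()}

# Stack the per-ID frames and turn the outer index level into the ID column
out = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print(out)  # two rows, columns ID, a, b, c, d
```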
EDIT:
For 2 IDs change second solution:
d = [pd.DataFrame(ast.literal_eval(d)).assign(ID1=i1, ID2=i2) for i1, i2, d in df[['ID1','ID2','detail']].to_numpy()]
df = pd.concat(d)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is as below:
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4}]
1 1 [{'a':5 ,'b':6, 'c':7, 'd':8}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12}]
If I want to keep only the 'a' and 'c' keys/values in options, how can I modify the DataFrame? The expected result would be:
Dataframe df:
id options
0 0 [{'a':1 ,'c':3}]
1 1 [{'a':5 ,'c':7}]
2 2 [{'a':9 ,'c':11}]
You could just apply a function that modifies the entries:
>>> df.options = df.options.apply(lambda x: [{k: v for k, v in x[0].items() if k in ('a', 'c')}])
>>> df
id options
0 0 [{'a': 1, 'c': 3}]
1 1 [{'a': 5, 'c': 7}]
2 2 [{'a': 9, 'c': 11}]
Here's an example using list comprehension to create new dictionaries with only certain keys:
df['options'] = [[{'a': x[0]['a'], 'c': x[0]['c']}] for x in df['options']]
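Both answers can be reproduced with the sample data from the question; the `apply` route is the more forgiving of the two, since the dict comprehension simply skips keys that are absent:

```python
import pandas as pd

# Rebuild the sample frame: each options cell is a one-element list of dicts
df = pd.DataFrame({
    'id': [0, 1, 2],
    'options': [[{'a': 1, 'b': 2, 'c': 3, 'd': 4}],
                [{'a': 5, 'b': 6, 'c': 7, 'd': 8}],
                [{'a': 9, 'b': 10, 'c': 11, 'd': 12}]],
})

# Keep only the 'a' and 'c' keys of the dict inside each list
df['options'] = df['options'].apply(
    lambda x: [{k: v for k, v in x[0].items() if k in ('a', 'c')}])
print(df)  # each options cell now holds only the 'a' and 'c' keys
```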
I have a DataFrame like this:
a b
A 1 0
B 0 1
and I have an array ["A","B","C"].
From these, I want to create a new DataFrame like this:
a b
A 1 0
B 0 1
C NaN NaN
How can I do this?
Assuming I understand what you're after (setting aside weird duplicated-index cases), one way is to use loc to index into your frame:
>>> df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
>>> arr = ["A", "B", "C"]
>>> df
a b
A 1 0
B 0 1
>>> df.loc[arr]
a b
A 1 0
B 0 1
C NaN NaN
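One caveat: in pandas 1.0 and later, indexing with a list that includes labels missing from the index raises a KeyError, so on current versions `reindex` is the supported way to get the same result:

```python
import pandas as pd

df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})

# reindex keeps existing rows and inserts NaN rows for missing labels;
# the int columns become float because NaN is introduced
out = df.reindex(["A", "B", "C"])
print(out)
#      a    b
# A  1.0  0.0
# B  0.0  1.0
# C  NaN  NaN
```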
Create a DataFrame with only index=['C'] and concat:
df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
df = pd.concat([df, pd.DataFrame(index=['C'])])