How to modify dictionaries in a DataFrame? - python

How can I modify a list value inside dataframes? I am trying to adjust data received by JSON and the DataFrame is as below:
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4}]
1 1 [{'a':5 ,'b':6, 'c':7, 'd':8}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12}]
If I want to keep only the 'a' and 'c' key/value pairs in options, how can I modify the DataFrame? The expected result would be:
Dataframe df:
id options
0 0 [{'a':1 ,'c':3}]
1 1 [{'a':5 ,'c':7}]
2 2 [{'a':9 ,'c':11}]

You could just apply a function that modifies the entries:
>>> df.options = df.options.apply(lambda x: [{k: v for k, v in x[0].items() if k in ('a', 'c')}])
>>> df
id options
0 0 [{'a': 1, 'c': 3}]
1 1 [{'a': 5, 'c': 7}]
2 2 [{'a': 9, 'c': 11}]

Here's an example using list comprehension to create new dictionaries with only certain keys:
df['options'] = [[{'a': x[0]['a'], 'c': x[0]['c']}] for x in df['options']]
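Both snippets above read only the first dictionary of each list, and the list comprehension raises a KeyError if 'a' or 'c' is missing. A minimal sketch of a more tolerant variant, assuming the same df as in the question; the helper name keep_keys is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2],
    'options': [
        [{'a': 1, 'b': 2, 'c': 3, 'd': 4}],
        [{'a': 5, 'b': 6, 'c': 7, 'd': 8}],
        [{'a': 9, 'b': 10, 'c': 11, 'd': 12}],
    ],
})

def keep_keys(dicts, keys=('a', 'c')):
    # Filter every dictionary in the list, skipping keys that are absent
    return [{k: d[k] for k in keys if k in d} for d in dicts]

df['options'] = df['options'].apply(keep_keys)
print(df)
```

Because the helper walks the whole list, it also handles rows whose options list holds several dictionaries.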

Related

Add length of multiple dataframe and store it in a list

I have few dfs:
A
id A B C
1 2 2 2
2 3 3 3
B
id A B C
1 5 5 5
2 6 6 6
3 8 8 8
4 0 0 0
C
id A B C
1 6 6 6
I need to find the length of each df and store it in a list:
search_list = ["A", "B", "C"]
I took the reference from the previous post. Is there a way to loop over this list to do something like:
my_list = []
for i in search_list:
    my_list.append(len(search_list[i]))
Desired output:
len_df =
[{'A': 2},
{'B': 4},
{'C': 1}]
You can loop over the DataFrames themselves if they are iterable, meaning they are in a list or a similar container. And since the DataFrames don't carry the names you want for them, you need a separate list of names, like the search_list you mentioned.
If df_a, df_b, and df_c are the names of the DataFrame variables and search_list is a list of their names, then:
df_list = [df_a, df_b, df_c]
len_list = [{search_list[i] : df.shape[0]} for i, df in enumerate(df_list)]
But if you want to keep those DataFrame names together with the DataFrames themselves in your code for further use, it might be reasonable to initially organize those DataFrames not in a list, but in a dictionary with their names:
df_dict = { 'A': df_a, 'B': df_b, 'C': df_c }
len_list = [{key : df.shape[0]} for key, df in df_dict.items()]
They need to be in a list or dict in the first place if the intention is to loop through them.
Here is how you would do this:
import pandas as pd
# 3 dicts
a = {'a':[1,2,3], 'b':[4,5,6]}
b = {'a':[1,2,3,4], 'b':[4,5,6,7]}
c = {'a':[1,2,3,4,5], 'b':[4,5,6,7,8]}
# 3 dataframes
df1=pd.DataFrame(a)
df2=pd.DataFrame(b)
df3=pd.DataFrame(c)
# dict of dataframes
dict_of_df = {'a':df1, 'b':df2, 'c':df3}
# put all lengths into a new dict
df_lengths = {}
for k, v in dict_of_df.items():
    df_lengths[k] = len(v)
print(df_lengths)
And this is the result:
{'a': 3, 'b': 4, 'c': 5}

remove string from json row

I'm taking several columns from a data frame and adding them to a new column.
A B C
1 3 6
1 2 4
4 5 0
df['D'] = df.apply(lambda x: x[['C', 'B']].to_json(), axis=1)
I'm then creating a new data frame that locates the unique instances of df['A']:
df2 = pd.DataFrame({'A': df.A.unique()})
Finally, I'm creating a new column in df2 that lists the values of df['B'] and df['C']:
df2['E'] = [list(set(df['D'].loc[df['A'] == x['A']]))
            for _, x in df2.iterrows()]
but this is stringing each object:
A B C D
1 3 6 ['{"B":"3","C":6"}', '{"B":"2","C":4"}']
furthermore, when I dump this in JSON I get:
payload = json.dumps(data)
I get this result:
["{\"B\":\"3\",\"C\":"6"}", "{\"B\":\"2\",\"C\":"\4"}"]
but I'm ultimately looking to remove the string on the objects and have this as the output:
[{"B":"3","C":"6"}, {"B":"2","C":"4"}]
Any guidance will be greatly appreciated.
In your case, use groupby with to_dict:
out = df.groupby('A').apply(lambda x : x[['B','C']].to_dict('records')).to_frame('E').reset_index()
out
Out[198]:
A E
0 1 [{'B': 3, 'C': 6}, {'B': 2, 'C': 4}]
1 4 [{'B': 5, 'C': 0}]
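To get the JSON payload the question asks for, the grouped records can be passed straight to json.dumps. A minimal sketch, assuming the three-row frame from the question; the name records_by_a is hypothetical, and a dictionary comprehension over the groups stands in for the groupby/apply call above:

```python
import json
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [3, 2, 5], 'C': [6, 4, 0]})

# One list of {'B': ..., 'C': ...} records per distinct value of A
records_by_a = {
    int(a): group[['B', 'C']].to_dict('records')
    for a, group in df.groupby('A')
}

# The records are real dictionaries, so json.dumps emits JSON objects
# rather than quoted strings
payload = json.dumps(records_by_a[1])
print(payload)
```

The escaped quotes in the question came from calling to_json on each row first, which stores strings; keeping the rows as dictionaries until the final json.dumps avoids the double encoding.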

Assigning column names while creating dataframe results in nan values

I have a list of dict which is being converted to a dataframe. When I attempt to pass the columns argument the output values are all nan.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
Because when you pass a list of dictionaries, the DataFrame constructor creates the column names from the dictionary keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you also pass the columns parameter, it selects columns by dictionary key, in the order given in the list. Any name that is not a key in at least one dictionary becomes a column of missing values:
# reordering works, because 'a' and 'b' are keys in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# 'a' is kept; 'd' is filled with missing values because it is not a key in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# 'b' is kept; 'c' is filled with missing values because it is not a key in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# 'a' and 'b' are kept; 'c' and 'd' are filled with missing values because they are not keys in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename the columns or assign new names, as in your second snippet.
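A minimal sketch of the renaming route, assuming the same list l as above. DataFrame.rename maps old names to new ones explicitly, so it does not depend on column order:

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# Build the frame from the dictionary keys first, then rename the columns
df = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})
print(df)
```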

How to get a Series of json/dictionaries from pandas Dataframe groupby object

I have a DataFrame with more than 2 columns (Col1, Col2, etc.), and I want to generate a Series whose index is Col1 and whose values are dictionaries, where each key is a Col2 value and each dictionary value is the number of occurrences of the pair (Col1, Col2).
Let's say the dataframe is something like this:
Col1 Col2 Col3 ...
0 A b ...
1 B e ...
2 A a ...
3 C a ...
4 A b ...
5 B c ...
6 A e ...
7 B c ...
The output I want is:
A {'a':1,'b':2,'e':1}
B {'c':2,'e':1}
C {'a':1}
I managed to do it with this loop:
for t in my_df['Col1'].unique():
    my_series.loc[t] = my_df[my_df['Col1'] == t].groupby('Col2').size().to_json()
but I was wondering if there is a way to do it more efficiently with pandas methods, without iterating.
I was also trying groupby with two indexes:
my_df.groupby(['Col1','Col2']).size()
>
Col1 Col2
A a 1
b 2
e 1
B c 2
e 1
C a 1
but can't find the next step to convert the result to the Series of dict as shown above
A defaultdict is what you need:
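import collections
import pprint

resul = collections.defaultdict(dict)
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for (col1, col2), count in my_df.groupby(['Col1','Col2']).size().items():
    # int() converts the NumPy integer to a plain Python int
    resul[col1][col2] = int(count)
pprint.pprint(resul)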
gives as expected:
defaultdict(<class 'dict'>,
{'A': {'a': 1, 'b': 2, 'e': 1},
'B': {'c': 2, 'e': 1},
'C': {'a': 1}})
If you want to get rid of the defaultdict and want instead a plain dict:
resul = dict(resul)
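If the result should be a pandas Series, as the question asks, the two-level size() output can be split per Col1 value. A minimal sketch, assuming the sample data from the question (Col3 omitted); it still loops, but only once per Col1 group over the already aggregated counts:

```python
import pandas as pd

my_df = pd.DataFrame({
    'Col1': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'B'],
    'Col2': ['b', 'e', 'a', 'a', 'b', 'c', 'e', 'c'],
})

# Count every (Col1, Col2) pair; the result has a two-level index
counts = my_df.groupby(['Col1', 'Col2']).size()

# For each Col1 value, drop that index level and turn the remaining
# Col2 -> count pairs into a dictionary
result = pd.Series({
    col1: group.droplevel('Col1').to_dict()
    for col1, group in counts.groupby(level='Col1')
})
print(result)
```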

How to use Pandas to replace column entries in DataFrame and create dictionary new-old values

I have a file which contains data as the following:
x y
z w
a b
a x
w y
I want to create a file using the following replacement dictionary, which assigns each string a unique number determined by the order in which strings first appear in the file, reading left to right and top to bottom (note that the dictionary has to be created; it is not supplied):
{'x':1, 'y':2, 'z':3, 'w':4 , 'a':5, 'b':6}
and the output file would be:
1 2
3 4
5 6
5 1
4 2
Is there any efficient way to create both the processed file and the dictionary with Pandas?
I thought of creating the dictionary in the following policy:
_counter = 0
def counter():
    global _counter
    _counter += 1
    return _counter
replacements_dict = collections.defaultdict(counter)
You can use factorize with the MultiIndex Series created by stack, then unstack, and finally write to a file with to_csv:
df = pd.read_csv(file, sep=r"\s+", header=None)
print (df)
0 1
0 x y
1 z w
2 a b
3 a x
4 w y
s = df.stack()
fact = pd.factorize(s)
#indexing is necessary
d = dict(zip(fact[1].values[fact[0]], fact[0] + 1))
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}
For new file:
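# write the mapping as comma-separated key,value lines without a header row
pd.Series(d).to_csv('dict.csv', header=False)
# read it back as a Series and convert to dict (the squeeze= argument of read_csv was removed in pandas 2.0)
d = pd.read_csv('dict.csv', index_col=0, header=None).squeeze('columns').to_dict()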
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}
df = pd.Series(fact[0] + 1, index=s.index).unstack()
print (df)
0 1
0 1 2
1 3 4
2 5 6
3 5 1
4 4 2
df.to_csv('out', index=False, header=None)
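The same mapping can also be built without stack and unstack. A minimal sketch, assuming the five rows from the question are already loaded into df; the names mapping and encoded are hypothetical:

```python
import pandas as pd

df = pd.DataFrame([['x', 'y'], ['z', 'w'], ['a', 'b'], ['a', 'x'], ['w', 'y']])

# ravel() flattens row by row, so pd.unique sees the values
# left to right, top to bottom, and keeps first-appearance order
uniques = pd.unique(df.to_numpy().ravel())
mapping = {value: number for number, value in enumerate(uniques, start=1)}

# Replace every string with its number, column by column
encoded = df.apply(lambda col: col.map(mapping))
print(mapping)
print(encoded)
```

Here mapping plays the role of d above, and encoded can be written out with to_csv in the same way.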
I assume you want the dictionary d to assign numbers to keys in the order in which the keys first appear, reading row by row:
d={'col1':['x', 'y', 'a', 'a', 'w'], 'col2':['z','w','b','x','y']}
df=pd.DataFrame(d)
print(df)
Output:
col1 col2
0 x z
1 y w
2 a b
3 a x
4 w y
=================================
Using itertools:
import itertools
# flatten the DataFrame row by row into a single list
raw_list = list(itertools.chain.from_iterable(df.values.tolist()))
d = {}
counter = 1
for k in raw_list:
    # assign the next number only the first time a key is seen
    if k not in d:
        d[k] = counter
        counter += 1
then:
d
Output:
{'a': 5, 'b': 6, 'w': 4, 'x': 1, 'y': 3, 'z': 2}
I hope it helps!
===========================================
Using factorize:
s = df.stack()
codes, uniques = pd.factorize(s)
# uniques are in order of first appearance, so position + 1 is the replacement number
d = {key: i + 1 for i, key in enumerate(uniques)}
