How to save pandas DataFrame's rows as JSON strings? - python

I have a pandas DataFrame df and I convert each row to JSON string as follows:
df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
df_as_json = df.to_json(orient='records')
Then I want to iterate over the JSON strings (rows) of df_as_json and make further processing as follows:
for json_document in df_as_json.split('\n'):
    jdict = json.loads(json_document)
    # ...
The problem is that df_as_json.split('\n') does not really split df_as_json into separate JSON strings.
How can I do what I need?

To get each row of the dataframe as a dict, you can use pandas.DataFrame.to_dict():
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
for jdict in df.to_dict(orient='records'):
    print(jdict)
Results:
{'A': -0.81155648424969018, 'B': 0.54051722275060621, 'C': 2.1858014972680886, 'D': -0.92089743800379931}
{'A': -0.051650790117511704, 'B': -0.79176498452586563, 'C': -0.9181773278020231, 'D': 1.1698955805545324}
{'A': -0.59790963665018559, 'B': -0.63673166723131003, 'C': 1.0493603533698836, 'D': 1.0027811601157812}
{'A': -0.20909149867564752, 'B': -1.8022674158328837, 'C': 1.0849019267782165, 'D': 1.2203116471260997}
{'A': 0.33798033123267207, 'B': 0.13927004774974402, 'C': 1.6671536830551967, 'D': 0.29193412587056755}
{'A': -0.079327003827824386, 'B': 0.58625181818942929, 'C': -0.42365912798153349, 'D': -0.69644626255641828}
{'A': 0.33849577559616656, 'B': -0.42955248285258169, 'C': 0.070860788937864225, 'D': 1.4971679265264808}
{'A': 1.3411846077264038, 'B': -0.20189961315847924, 'C': 1.6294881274421233, 'D': 1.1168181183218009}
{'A': 0.61028134135655399, 'B': 0.48445766812257018, 'C': -0.31117315672299928, 'D': -1.7986688463810827}
{'A': 0.9181074339928279, 'B': 0.84151139156427757, 'C': -1.111794854210024, 'D': -0.7131446510569609}

Starting from pandas v0.19, you can call to_json with lines=True to save your data as a JSON Lines file.
df.to_json('file.json', orient='records', lines=True)
This eliminates the loop over records that a to_dict-based solution would need in order to save each one.
The first 5 lines of file.json look like this:
{"A":0.0162261253,"B":0.8770884013,"C":0.1577913843,"D":-0.3097990255}
{"A":-1.2870077735,"B":-0.1610902061,"C":-0.2426829569,"D":-0.3247587907}
{"A":-0.7743891125,"B":-0.9487264737,"C":1.6366125588,"D":0.2943377348}
{"A":1.5128287075,"B":-0.389437321,"C":0.4841038875,"D":0.5315466818}
{"A":-0.1455759399,"B":1.0205229385,"C":0.6776108196,"D":0.832060379}
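Because each line of a JSON Lines file is a complete JSON document, the loop from the question works once the data is in this format. A minimal sketch using only the standard library; the two sample lines are copied from the output above and stand in for the text read from file.json:

```python
import json

# Two lines copied from the sample output above; they stand in for
# the text you would get from reading file.json.
json_lines_text = (
    '{"A":0.0162261253,"B":0.8770884013,"C":0.1577913843,"D":-0.3097990255}\n'
    '{"A":-1.2870077735,"B":-0.1610902061,"C":-0.2426829569,"D":-0.3247587907}\n'
)

# splitlines() copes with the trailing newline, so no empty string is parsed.
records = [json.loads(line) for line in json_lines_text.splitlines()]

for jdict in records:
    print(sorted(jdict), jdict["A"])
```

Splitting on '\n' by hand would leave an empty final string that json.loads rejects, which is why splitlines() is used here.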

Another way is to call to_json on each row while iterating over the DataFrame (here named dataset):
input_data = [row.to_json() for index, row in dataset.iterrows()]
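If the JSON strings are needed in memory rather than in a file, one option is to omit the path so that to_json returns a string, and then split that string into lines. This is a sketch that assumes pandas 0.19 or later and uses a small made-up DataFrame instead of the random data from the question:

```python
import json
import pandas as pd

# Small illustrative frame (made-up values, not the question's random data).
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# With no path argument, to_json returns the JSON text instead of writing a file.
json_rows = df.to_json(orient="records", lines=True).splitlines()

for json_document in json_rows:
    jdict = json.loads(json_document)
    print(jdict)
```

This keeps the loop from the question almost unchanged; only the way the string is produced and split differs.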

Related

Updating a nested dictionary whose root keys match the index of a certain dataframe with said dataframe’s values

I have a nested dict that is uniform throughout (i.e. each 2nd level dict will have the same keys).
{
'0': {'a': 1, 'b': 2},
'1': {'a': 3, 'b': 4},
'2': {'a': 5, 'b': 6},
}
and the following data frame
   c
0  9
1  6
2  4
Is there a way (without for loops) to update/map the dict/key-values such that I get
{
'0': {'a': 1, 'b': 2, 'c': 9},
'1': {'a': 3, 'b': 4, 'c': 6},
'2': {'a': 5, 'b': 6, 'c': 4},
}
Try this
# input
my_dict = {
'0': {'a': 1, 'b': 2},
'1': {'a': 3, 'b': 4},
'2': {'a': 5, 'b': 6},
}
my_df = pd.DataFrame({'c': [9, 6, 4]})
# build df from my_dict
df1 = pd.DataFrame.from_dict(my_dict, orient='index')
# append my_df as a column to df1
df1['c'] = my_df.values
# get dictionary
df1.to_dict('index')
But a simple loop is much more efficient here. I tested on a sample with 1 million entries and the loop is 2x faster. [1]
for d, c in zip(my_dict.values(), my_df['c']):
    d['c'] = c
my_dict
{'0': {'a': 1, 'b': 2, 'c': 9},
 '1': {'a': 3, 'b': 4, 'c': 6},
 '2': {'a': 5, 'b': 6, 'c': 4}}
[1]: Constructing a dataframe is expensive, so unless you want a dataframe (and possibly do other computations later), it's not worth constructing one for a task such as this one.
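The zip loop pairs entries purely by position. If the dict keys and the DataFrame index might not be in the same order, matching by key is safer. This is a sketch under the assumption that the column has first been turned into a plain mapping, for example with my_df['c'].to_dict(); the literal c_by_index below is a hypothetical stand-in for that result:

```python
my_dict = {
    '0': {'a': 1, 'b': 2},
    '1': {'a': 3, 'b': 4},
    '2': {'a': 5, 'b': 6},
}

# Hypothetical stand-in for my_df['c'].to_dict(): index label -> value.
# Deliberately out of order to show that position does not matter here.
c_by_index = {2: 4, 0: 9, 1: 6}

for index_label, value in c_by_index.items():
    # The dict uses string keys while the index labels are integers.
    my_dict[str(index_label)]['c'] = value

print(my_dict)
```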

Find the smallest three values in a nested dictionary

I have a nested dictionary (sample_dict). For each day, I need to find the three smallest values (in ascending order) and store the result in a new dictionary.
The sample_dict is as follows:
sample_dict ={ '2020-12-22': {'A': 0.0650,'B': 0.2920, 'C': 0.0780, 'D': 1.28008, 'G': 3.122},
'2020-12-23': {'B': 0.3670, 'C': 0.4890, 'G':1.34235, 'H': 0.227731},
'2020-12-24': {'A': 0.3630, 'B': 0.3960, 'C': 0.0950, 'Z':0.3735},
'2020-12-25': {'C': 0.8366, 'B': 0.4840},
'2020-12-26': {'Y': 5.366}}
The final dictionary (result) should contain only the three smallest entries for each date.
Can someone suggest a solution using for loops?
Let's use heapq.nsmallest inside a loop to select the smallest 3 items per subdict:
from operator import itemgetter
import heapq

for k, v in sample_dict.items():
    # Look ma, no `sorted`
    sample_dict[k] = dict(heapq.nsmallest(3, v.items(), key=itemgetter(1)))
print(sample_dict)
# {'2020-12-22': {'A': 0.065, 'C': 0.078, 'B': 0.292},
# '2020-12-23': {'H': 0.227731, 'B': 0.367, 'C': 0.489},
# '2020-12-24': {'C': 0.095, 'A': 0.363, 'Z': 0.3735},
# '2020-12-25': {'B': 0.484, 'C': 0.8366},
# '2020-12-26': {'Y': 5.366}}
This is pretty fast because it does not need to fully sort each sub-dictionary's items, and it updates sample_dict in place.
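If the original sample_dict should stay untouched, the same heapq.nsmallest call also fits in a dictionary comprehension that builds a new dictionary. A minimal sketch using the data from the question (the name result is just a choice for this example):

```python
import heapq
from operator import itemgetter

sample_dict = {
    '2020-12-22': {'A': 0.0650, 'B': 0.2920, 'C': 0.0780, 'D': 1.28008, 'G': 3.122},
    '2020-12-23': {'B': 0.3670, 'C': 0.4890, 'G': 1.34235, 'H': 0.227731},
    '2020-12-24': {'A': 0.3630, 'B': 0.3960, 'C': 0.0950, 'Z': 0.3735},
    '2020-12-25': {'C': 0.8366, 'B': 0.4840},
    '2020-12-26': {'Y': 5.366},
}

# Build a new dict; each inner dict keeps at most the 3 smallest values,
# inserted in ascending order of value.
result = {
    day: dict(heapq.nsmallest(3, values.items(), key=itemgetter(1)))
    for day, values in sample_dict.items()
}

print(result['2020-12-23'])
```

Days with fewer than three entries are returned whole, since nsmallest simply yields everything it has.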
Try using this dictionary comprehension, which takes the three smallest values for each day and then orders them by key:
print({k: dict(sorted(sorted(v.items(), key=lambda x: x[1])[:3], key=lambda x: x[0])) for k, v in sample_dict.items()})
Output:
{'2020-12-22': {'A': 0.065, 'B': 0.292, 'C': 0.078}, '2020-12-23': {'B': 0.367, 'C': 0.489, 'H': 0.227731}, '2020-12-24': {'A': 0.363, 'C': 0.095, 'Z': 0.3735}, '2020-12-25': {'B': 0.484, 'C': 0.8366}, '2020-12-26': {'Y': 5.366}}
This should work for your purposes.
sample_dict = {'2020-12-22': {'A': 0.0650, 'B': 0.2920, 'C': 0.0780, 'D': 1.28008, 'G': 3.122},
'2020-12-23': {'B': 0.3670, 'C': 0.4890, 'G':1.34235, 'H': 0.227731},
'2020-12-24': {'A': 0.3630, 'B': 0.3960, 'C': 0.0950, 'Z':0.3735},
'2020-12-25': {'C': 0.8366, 'B': 0.4840},
'2020-12-26': {'Y': 5.366}}
results_dict = {day[0]:{sample[0]:sample[1] for sample in sorted(day[1].items(), key=lambda e: e[1])[:3]} for day in sample_dict.items()}
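# Output (keys shown in alphabetical order, e.g. as pprint displays them; a plain print on Python 3.7+ lists each day's entries in ascending order of value)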
{'2020-12-22': {'A': 0.065, 'B': 0.292, 'C': 0.078},
'2020-12-23': {'B': 0.367, 'C': 0.489, 'H': 0.227731},
'2020-12-24': {'A': 0.363, 'C': 0.095, 'Z': 0.3735},
'2020-12-25': {'B': 0.484, 'C': 0.8366},
'2020-12-26': {'Y': 5.366}}

ParameterGrid splits the string instead of combination

I'm trying to get the parameter grid for model selection. Following the example in the scikit-learn documentation for ParameterGrid, we have this:
param_grid = {'a': [1, 2], 'b': [True, False]}
list(ParameterGrid(param_grid)) == (
[{'a': 1, 'b': True}, {'a': 1, 'b': False},
{'a': 2, 'b': True}, {'a': 2, 'b': False}])
But what I want is to pass only one value, without using list notation ([]), like this:
param_grid = {'a': [1, 2], 'b': 'True', 'c': 'something'}
But then list(ParameterGrid(param_grid)) splits the strings into characters instead of creating the two expected combinations. Result:
{'a': 1, 'b': 'T', 'c': 's'}
{'a': 1, 'b': 'T', 'c': 'o'}
{'a': 1, 'b': 'T', 'c': 'm'}
The question is: is it required to put all items in list format, or am I missing something?
Yes, you need to use the [] notation, because ParameterGrid expects each value to be an iterable of candidate settings. So when you set b as
'b': 'True'
it iterates over the characters of the string 'True', hence you are getting combinations with 'T', 'r', 'u' and 'e'.
To fix this, wrap each single value in a list:
param_grid = {'a': [1, 2], 'b': [True], 'c': ['something']}
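The splitting happens because a Python string is itself an iterable of characters. The sketch below illustrates this without scikit-learn: expand_grid is a hypothetical stand-in built on itertools.product, not ParameterGrid's actual implementation, and as_lists is a hypothetical helper that wraps bare values in a list before the grid is expanded.

```python
from itertools import product

def expand_grid(grid):
    """Hypothetical stand-in for a parameter grid: every combination of values."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

def as_lists(grid):
    """Hypothetical helper: wrap strings and other scalars in a one-element list."""
    return {
        key: list(value) if isinstance(value, (list, tuple)) else [value]
        for key, value in grid.items()
    }

raw_grid = {'a': [1, 2], 'b': 'True', 'c': 'something'}

# Iterating a string yields its characters, which is what causes the split.
print(list('True'))

split_up = expand_grid(raw_grid)          # strings treated as character sequences
fixed = expand_grid(as_lists(raw_grid))   # strings kept whole
print(len(split_up), len(fixed))
print(fixed)
```

Normalizing the grid up front, as as_lists does here, is one way to keep single-valued parameters readable in a config while still handing lists to the grid.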

Python How to add all values in a 2d Dictionary and return a single dictionary with summed values?

Part of the program I am developing has a 2D dict of length n.
Dictionary Example:
test_dict = {
0: {'A': 2, 'B': 1, 'C': 5},
1: {'A': 3, 'B': 1, 'C': 2},
2: {'A': 1, 'B': 1, 'C': 1},
3: {'A': 4, 'B': 2, 'C': 5}
}
All of the dictionaries have the same keys but different values. I need to sum the values for each key to get the desired outcome shown below.
I have tried to merge the dictionaries using the following:
new_dict = {}
for k, v in test_dict.items():
    new_dict.setdefault(k, []).append(v)
I also tried using:
new_dict = {**test_dict[0], **test_dict[1], **test_dict[2], **test_dict[3]}
Unfortunately, I have not had any luck in getting the desired outcome.
Desired Outcome: outcome = {'A': 10, 'B': 5, 'C': 13}
How can I add all the values into a single dictionary?
Solution using pandas
Convert your dict to pandas.DataFrame and then do summation on columns and convert it back to dict.
import pandas as pd
df = pd.DataFrame.from_dict(test_dict, orient='index')
print(df.sum().to_dict())
Output:
{'A': 10, 'B': 5, 'C': 13}
Alternate solution
Use collections.Counter, whose update method adds the values of matching keys instead of replacing them:
from collections import Counter

d = Counter()
for v in test_dict.values():
    d.update(v)
print(d)
# Counter({'C': 13, 'A': 10, 'B': 5})
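Since every inner dictionary has the same keys, the sum can also be written without extra imports. A small sketch; it assumes, as the question states, that the keys are identical in every inner dictionary:

```python
test_dict = {
    0: {'A': 2, 'B': 1, 'C': 5},
    1: {'A': 3, 'B': 1, 'C': 2},
    2: {'A': 1, 'B': 1, 'C': 1},
    3: {'A': 4, 'B': 2, 'C': 5},
}

# Take the key set from the first inner dict (assumed identical in all of them).
keys = next(iter(test_dict.values())).keys()

# For each key, add up that key's value across all inner dicts.
outcome = {key: sum(inner[key] for inner in test_dict.values()) for key in keys}
print(outcome)
```

Unlike the Counter version, this produces a plain dict directly, but it raises KeyError if an inner dictionary is missing one of the keys.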

Get dictionary contains in list if key and value exists

How can I get the complete dictionary from a list of dictionaries, after first checking that a given key/value pair exists in it?
test = [{'a': 'hello' , 'b': 'world', 'c': 1},
{'a': 'crawler', 'b': 'space', 'c': 5},
{'a': 'jhon' , 'b': 'doe' , 'c': 8}]
When I try to make it conditional like this
if any((d['c'] is 8) for d in test):
the value is only True or False, but I want the result to be a dictionary like
{'a': 'jhon', 'b': 'doe', 'c': 8}
and likewise, if I do
if any((d['a'] is 'crawler') for d in test):
the result should be:
{'a': 'crawler', 'b': 'space', 'c': 5}
Thanks in advance.
is tests for identity, not equality: it checks whether two names refer to the same object, not whether their values are equal. It can therefore return False for equal values, so use == to check for equality.
As for your question, you can use a list comprehension or filter instead of any (here data is the list of dictionaries, called test in the question):
>>> [dct for dct in data if dct["a"] == "crawler"]
>>> list(filter(lambda dct: dct["a"] == "crawler", data))
The result is a list containing the matching dictionaries. You can take element [0] if you expect exactly one match.
Use a list comprehension:
data = [{'a': 'hello' , 'b': 'world', 'c': 1},
{'a': 'crawler', 'b': 'space', 'c': 5},
{'a': 'jhon' , 'b': 'doe' , 'c': 8}]
print([d for d in data if d["c"] == 8])
# [{'a': 'jhon', 'b': 'doe', 'c': 8}]
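When only the first matching dictionary is wanted, next with a generator expression stops at the first hit and can return a default when nothing matches. A sketch; find_first is a hypothetical helper name, not a built-in:

```python
data = [{'a': 'hello', 'b': 'world', 'c': 1},
        {'a': 'crawler', 'b': 'space', 'c': 5},
        {'a': 'jhon', 'b': 'doe', 'c': 8}]

def find_first(dicts, key, value):
    """Return the first dict that has `key` set to `value`, or None if none does."""
    return next((d for d in dicts if key in d and d[key] == value), None)

print(find_first(data, 'c', 8))
print(find_first(data, 'a', 'crawler'))
print(find_first(data, 'a', 'missing'))
```

Returning None for a miss avoids the IndexError that taking [0] of an empty result list would raise.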
