Access a json column with pandas - python

I have a csv file where one column is json. I want to be able to access the information in the json column but I can't figure it out.
My csv file is like
id, "letter", "json"
1,"a","{""add"": 2}"
2,"b","{""sub"": 5}"
3,"c","{""add"": {""sub"": 4}}"
I'm reading in the file like
test = pd.read_csv(filename)
df = pd.DataFrame(test)
I'd like to be able to get all the rows that have "sub" in the json column and ultimately be able to get the values for those keys.

Here's one approach, which uses the read_csv converters argument to parse the json column into dictionaries as the file is read. Then use apply to select on the json field keys in each row. CustomParser taken from this answer.
EDIT
Updated to look two levels deep, and takes variable target parameter (so it can be "add" or "sub", as needed). This solution won't handle an arbitrary number of levels, though.
import pandas as pd

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

df = pd.read_csv('test.csv', converters={'json': CustomParser})

def check_keys(json, target):
    if target in json:
        return True
    for key in json:
        if isinstance(json[key], dict):
            if target in json[key]:
                return True
    return False

print(df.loc[df.json.apply(check_keys, args=('sub',))])
   id letter                 json
1   2      b           {'sub': 5}
2   3      c  {'add': {'sub': 4}}
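If you also need the values for those keys (the question's ultimate goal), a recursive getter in the same spirit as check_keys could work. This is a sketch, not part of the original answer; the sub_value column name is just for illustration:

def get_key(json, target):
    # return the value stored under target, searching nested dicts;
    # None if the key is absent
    if target in json:
        return json[target]
    for key in json:
        if isinstance(json[key], dict):
            found = get_key(json[key], target)
            if found is not None:
                return found
    return None

# e.g. a new column holding each row's "sub" value (None where missing)
df['sub_value'] = df.json.apply(get_key, args=('sub',))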

When you read the file in, the json field will still be of str type; you can use ast.literal_eval to convert each string to a dictionary, and then use the apply method to check whether any cell contains the key add:
from ast import literal_eval
df["json"] = df["json"].apply(literal_eval)
df[df["json"].apply(lambda d: "add" in d)]
#   id letter                 json
# 0  1      a           {'add': 2}
# 2  3      c  {'add': {'sub': 4}}
In case you want to check nested keys:
def check_add(d):
    if "add" in d:
        return True
    for k in d:
        if isinstance(d[k], dict):
            if check_add(d[k]):
                return True
    return False

df[df["json"].apply(check_add)]
#   id letter                 json
# 0  1      a           {'add': 2}
# 2  3      c  {'add': {'sub': 4}}
This doesn't recurse into nested values other than dictionaries; if you need that, the implementation should be similar, based on your data. A list-aware variant is sketched below.
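For instance, a variant that also descends into lists (a sketch; adjust to the shapes that actually occur in your data):

def check_add_deep(d):
    # recurse through dicts and lists looking for the key "add"
    if isinstance(d, dict):
        if "add" in d:
            return True
        return any(check_add_deep(v) for v in d.values())
    if isinstance(d, list):
        return any(check_add_deep(v) for v in d)
    return False

df[df["json"].apply(check_add_deep)]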

Related

How to efficiently create a dict with key/value having the value as number of occurrences for a given key?

I'm currently doing it like this:
dict_map = dict()
for car in data_frame["cars"]:
    if car in dict_map:
        dict_map.update({car: dict_map.get(car) + 1})
    else:
        dict_map.update({car: 1})
return dict_map
Is there any other way to do it in a more efficient way or using less code?
This is actually plenty efficient, just unidiomatic. Don't use .update here, and there's no need for the if-else.
dict_map = {}
for car in data_frame['cars']:
    dict_map[car] = dict_map.get(car, 0) + 1
But this is such a common use-case, the standard library includes collections.Counter which is just a dict subclass specialized for this very thing, and you can get this using
import collections
dict_map = collections.Counter(data_frame["cars"])
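As a bonus, Counter can also hand back the counts sorted by frequency:

>>> collections.Counter(data_frame["cars"]).most_common()
[('a', 3), ('c', 2), ('b', 1)]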
However since you are using pandas, you should use the built-in pandas API first.
>>> data_frame = pd.DataFrame(dict(cars=['a','b','c','a','a','c']))
>>> data_frame['cars'].value_counts()
a    3
c    2
b    1
Name: cars, dtype: int64
>>> data_frame['cars'].value_counts().to_dict()
{'a': 3, 'c': 2, 'b': 1}
from collections import Counter
dict_map = dict(Counter(data_frame["cars"]))

How to extract part of JSON file in python?

I have been trying to extract only certain data from a JSON file using python.
[{"id":"1", "user":"a"},
{"id":"2", "user":"b"]},
{"id":"2", "user":"c"}]
I want only the "user" data as the output.
What you have pasted is not valid JSON; there is a stray ] inside the second object. Once that is removed, you have a JSON array of objects, which Python treats as a list of dictionaries.
If you have a single JSON object this:
>>> this = {
...     "id": 1,
...     "user": "a"
... }
you can simply do
>>> this.get("user")
'a'
JSON data should be treated as dictionaries or lists that contain nested lists or dictionaries. In order to address the items/values within it, you must give it the same treatment.
json = [{"id": "1", "user": "a"}, {"id": "2", "user": "b"}, {"id": "2", "user": "c"}]
for i in range(len(json)):
    print(json[i]['id'])
Output:
1
2
2
If you wish to have these values stored somewhere you can try creating a dictionary which will append these values:
import pandas as pd

json = [{"id": "1", "user": "a"}, {"id": "2", "user": "b"}, {"id": "2", "user": "c"}]
support_dict = {'id': [], 'user': []}
for i in range(len(json)):
    support_dict['id'].append(json[i]['id'])
    support_dict['user'].append(json[i]['user'])

df = pd.DataFrame(support_dict)
print(df)
Output:
  id user
0  1    a
1  2    b
2  2    c
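As a side note, pandas can build that same frame directly from the list of dictionaries, so the intermediate support_dict isn't strictly necessary:

import pandas as pd

json = [{"id": "1", "user": "a"}, {"id": "2", "user": "b"}, {"id": "2", "user": "c"}]
df = pd.DataFrame(json)  # pandas maps each dict to one row
print(df)
#   id user
# 0  1    a
# 1  2    b
# 2  2    c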
As for your example, you have a list [] of three dictionaries {}:
x=[{"id":"1", "user":"a"}, {"id":"2", "user":"b"}, {"id":"2", "user":"c"}]
Each dictionary contains keys and values. If you want to print every value in each dictionary, you can try:
for item in x:
    for key, value in item.items():
        print(value)
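And since the question only asks for the "user" data, a list comprehension returns exactly that:

users = [item["user"] for item in x]
print(users)  # ['a', 'b', 'c']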

Modify the values in a dictionary and return the key and the modified value

I created a dictionary with pandas and I'm trying to get only the values.
a              b
hello_friend   HELLO<by>
hi_friend      HI<byby>
good_friend    GOOD<bybyby>
I would like to get the list of values, apply multiple methods only on it and at the end return the key and the modified values
def open_pandas():
    df = pandas.read_csv('table.csv', encoding='utf-8')
    dico = df.groupby('a')['b'].apply(list).to_dict()
    return dico

def methods_values(dico):
    removes = b.str.replace(r'<.*>', '')
    b_lower = removes.astype(str).str.lower()
    b_list = dico.to_dict('b')
    # here, I'm going to apply a clustering on the values
    return dico_with_modified_values
I need the two functions (but my second function is not working), and my desired output is:
{"hello_friend": ['hello'],"hi_friend": ['hi'], "good_friend": ['good']}
Is this possible?
I think you need to first process column b of the DataFrame and then convert it to a dictionary of lists:
df = pandas.read_csv('table.csv', encoding = 'utf-8')
df['b'] = df['b'].str.replace(r'<.*>', '').str.lower()
dico = df.groupby('a')['b'].apply(list).to_dict()
print(dico)
{'good_friend': ['good'], 'hello_friend': ['hello'], 'hi_friend': ['hi']}
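If you want to keep the two-function layout from the question, the same logic splits naturally. A sketch (regex=True is passed explicitly because newer pandas versions no longer treat the pattern as a regex by default):

import pandas

def open_pandas():
    # just load the raw table
    return pandas.read_csv('table.csv', encoding='utf-8')

def methods_values(df):
    # strip the <...> tags, lowercase, then build the dict of lists
    df['b'] = df['b'].str.replace(r'<.*>', '', regex=True).str.lower()
    return df.groupby('a')['b'].apply(list).to_dict()

dico = methods_values(open_pandas())
print(dico)
# {'good_friend': ['good'], 'hello_friend': ['hello'], 'hi_friend': ['hi']}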

Extracting value from JSON column very slow

I've got a CSV with a bunch of data. One of the columns, ExtraParams, contains a JSON object. I want to extract a value using a specific key, but it's taking quite a while to get through the 60,000-something rows in the CSV. Can it be sped up?
counter = 0  # just to see where I'm at
order_data['NewColumn'] = ''
for row in range(len(total_data)):
    s = total_data['ExtraParams'][row]
    try:
        data = json.loads(s)
        new_data = data['NewColumn']
        counter += 1
        print(counter)
        order_data['NewColumn'][row] = new_data
    except:
        print('NewColumn not in row')
I use a try-except because a few of the rows have what I assume is messed up JSON, as they crash the program with a "expecting delimiter ','" error.
When I say "slow" I mean ~30 minutes for 60,000 rows.
EDIT: It might be worth noting that each JSON contains about 35 key/value pairs.
You could use something like pandas and make use of the apply method. For some simple sample data in test.csv
Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}
You could use something like
In [1]: import pandas as pd
In [2]: import json
In [3]: df = pd.read_csv("test.csv")
In [4]: df.ExtraParams.apply(json.loads)
Out[4]:
0    {'dog': 10}
1     {'dog': 5}
2     {'dog': 6}
Name: ExtraParams, dtype: object
If you need to extract a field from the json, assuming the field is present in each row you can write a lambda function like
In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]:
0    10
1     5
2     6
Name: ExtraParams, dtype: int64
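Since some of your rows hold broken JSON, you can keep the try/except but move it into the function you apply, so bad rows become missing values instead of aborting the loop. A sketch using the column names from the question:

import json

def safe_extract(s, key='NewColumn'):
    # parse one JSON string and pull out one key;
    # None (shown as NaN by pandas) for broken JSON or missing keys
    try:
        return json.loads(s)[key]
    except (ValueError, KeyError, TypeError):
        return None

total_data['NewColumn'] = total_data['ExtraParams'].apply(safe_extract)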

Read data in chunks and keep one row for each ID in Python

Imagine we have a big file with rows as follows
ID value string
1 105 abc
1 98 edg
1 100 aoafsk
2 160 oemd
2 150 adsf
...
Say the file is named file.txt and is tab-separated.
I want to keep the largest value for each ID. The expected output is
ID value string
1 105 abc
2 160 oemd
...
How can I read it by chunks and process the data? If I read the data in chunks, how can I make sure at the end of each chunk the records are complete for each ID?
Keep track of the data in a dictionary of this format:
data = {
    ID: [value, 'string'],
}
As you read each line from the file, see if that ID is already in the dict. If not, add it; if it is, and the value you just read is bigger, replace the stored entry.
At the end, your dict holds the biggest value (and its string) for every ID.
# init to empty dict
data = {}

# open the input file
with open('file.txt', 'r') as fp:
    # skip the header row (ID value string)
    next(fp)
    # read each line
    for line in fp:
        # grab ID, value, string
        item_id, item_value, item_string = line.split()
        # convert ID and value to integers
        item_id = int(item_id)
        item_value = int(item_value)
        # if ID is not in the dict at all, or if the value we just read
        # is bigger, use the current values
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

for item_id in data:
    print(item_id, data[item_id][0], data[item_id][1])
Dictionaries don't enforce any specific ordering of their contents, so at the end of your program when you get the data back out of the dict, it might not be in the same order as the original file (i.e. you might see ID 2 first, followed by ID 1).
If this matters to you, you can use an OrderedDict, which retains the original insertion order of the elements. (On Python 3.7+, plain dicts also preserve insertion order.)
(Did you have something specific in mind when you said "read by chunks"? If you meant a specific number of bytes, then you might run into issues if a chunk boundary happens to fall in the middle of a word...)
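If you did mean fixed-size chunks, pandas' read_csv accepts a chunksize argument; here is a sketch (assuming the tab-separated file.txt with an ID/value/string header, as in the question) that merges per-chunk maxima, so it doesn't matter where a chunk boundary falls:

import pandas as pd

best = {}  # ID -> (value, string)
for chunk in pd.read_csv('file.txt', sep='\t', chunksize=10000):
    # index of the max value per ID within this chunk
    idx = chunk.groupby('ID')['value'].idxmax()
    for _, row in chunk.loc[idx].iterrows():
        # keep the larger of this chunk's max and any earlier chunk's max
        if row['ID'] not in best or row['value'] > best[row['ID']][0]:
            best[row['ID']] = (row['value'], row['string'])

for item_id, (value, string) in best.items():
    print(item_id, value, string)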
Code
import csv
import itertools as it

with open("test.csv") as f:
    reader = csv.DictReader(f, delimiter=" ")               # 1
    for k, g in it.groupby(reader, lambda d: d["ID"]):      # 2
        print(max(g, key=lambda d: float(d["value"])))      # 3

# {'value': '105', 'string': 'abc', 'ID': '1'}
# {'value': '160', 'string': 'oemd', 'ID': '2'}
Details
The with block ensures the file f is safely opened and closed. The file object is iterable, which lets you loop over it or, better, hand it to itertools.
For each line of f, csv.DictReader splits the data and keeps the header-row information as the keys of a dictionary, e.g. [{'value': '105', 'string': 'abc', 'ID': '1'}, ...].
This iterable is passed to groupby, which chunks all of the data by ID. Note that groupby only groups consecutive rows, which is fine here because the file is already ordered by ID. See this post for more details on how groupby works.
The max() builtin, combined with a key function, returns the dict with the largest "value" in each group. See this tutorial for more details on the max() function.
