I need to convert a JSON dictionary to a Pandas DataFrame, but the embedding is tripping me up.
Here is basically what the JSON dict looks like.
{
"report": "{'name':{'data':[{'key 1':'value 1','key 2':'value 2'},{'key 1':'value 1','key 2':'value 2'}]}}"
}
In the DataFrame, I want the keys to be the column headers and values in the rows below them.
The extra layer of embedding is throwing me off somewhat from all the usual methods of doing this.
One tricky part is that 'name' will change each time I get this JSON dict, so I can't use an exact string value for 'name'.
Your JSON looks a bit odd. The value under "report" looks like a Python dict converted to a string, so you can use ast.literal_eval (from the standard library) to convert it to a real dict, and then use pd.json_normalize to get it into DataFrame form:
import ast
import pandas as pd
j = ...
parsed_json = ast.literal_eval(j['report'])
# list(parsed_json)[0] is the single top-level key, whatever its name is
df = pd.json_normalize(parsed_json, record_path=[list(parsed_json)[0], 'data'])
Output:
>>> df
key 1 key 2
0 value 1 value 2
1 value 1 value 2
The error suggests that you're trying to index a string (the value under "report" is a string) using another string.
You just need ast.literal_eval to parse the string and the DataFrame constructor. If the "name" is unknown, you can iterate over dict.values() after you parse the string.
import ast
import pandas as pd
out = pd.DataFrame(y for x in ast.literal_eval(your_data['report']).values() for y in x['data'])
Output:
key 1 key 2
0 value 1 value 2
1 value 1 value 2
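Both answers above can be put together into a self-contained sketch. The payload is copied from the question; the variable names (payload, report_name, body) are made up for illustration, and the code assumes the parsed dict has exactly one top-level key.

```python
import ast
import pandas as pd

# Sample payload copied from the question; the top-level key ('name' here)
# is assumed to be unknown in advance.
payload = {
    "report": "{'name':{'data':[{'key 1':'value 1','key 2':'value 2'},"
              "{'key 1':'value 1','key 2':'value 2'}]}}"
}

# The value under "report" is a string holding a Python literal, so parse it first.
parsed = ast.literal_eval(payload["report"])

# Assumption: exactly one top-level key. The unpacking raises ValueError otherwise.
(report_name, body), = parsed.items()

# A list of flat dicts maps directly onto rows, with the keys as column headers.
df = pd.DataFrame(body["data"])
print(report_name)
print(df)
```

If the string turns out to be real JSON (double-quoted keys and values), json.loads would be the natural replacement for ast.literal_eval.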
Related
I have a dataframe as below,
df = pd.DataFrame({'URL_domains':[['wa.me','t.co','goo.gl','fb.com'],['tinyurl.com','bit.ly'],['test.in']]})
Here the column URL_domains has 3 observations, each holding a list of domains.
I would like to know the length of each observation's URL domain list, computed as:
df['len_of_url_list'] = df['URL_domains'].map(len)
and output as:
That works fine in the case above.
In my case, however, these list observations are stored as strings, as below:
When I execute the code above on the string-typed URL domains, I get the output below:
How can I convert the data type from string to list here in pandas?
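Use ast.literal_eval, because eval is bad practice (it executes arbitrary code):
import ast
df['len_of_url_list'] = df['URL_domains'].map(ast.literal_eval).map(len)
If the strings come from a trusted source, eval also converts the column to real lists:
df["URL_domains"] = df["URL_domains"].apply(eval)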
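As a rough end-to-end sketch of the conversion, assuming every cell is either a string representation of a list or already a list (the helper name to_list is made up for this example):

```python
import ast
import pandas as pd

# The lists from the question, stored as strings (as they often are after a CSV round trip).
df = pd.DataFrame({
    "URL_domains": [
        "['wa.me','t.co','goo.gl','fb.com']",
        "['tinyurl.com','bit.ly']",
        "['test.in']",
    ]
})

def to_list(value):
    """Parse a string into a list; leave values that are not strings untouched."""
    return ast.literal_eval(value) if isinstance(value, str) else value

# Convert the column once, then compute the lengths on real lists.
df["URL_domains"] = df["URL_domains"].map(to_list)
df["len_of_url_list"] = df["URL_domains"].map(len)
print(df)
```

The isinstance check makes the conversion safe to run twice, which a bare literal_eval is not, since it only accepts strings.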
I have a column whose values look like {'duration': 0, 'is_incoming': False}
I want to fetch 0 and False out of this. How do I split it using Python (Pandas)?
I tried - data["additional_data"] = data["additional_data"].apply(lambda x :",".join(x.split(":")[:-1]))
I want two columns Duration and Incoming_Time
How do I do this?
You can try converting those strings to actual dicts:
from ast import literal_eval
Finally:
out=pd.DataFrame(df['additional_data'].astype(str).map(literal_eval).tolist())
Now if you print out you will get your expected output
If needed use join() method:
df=df.join(out)
Now if you print df you will get your expected result
If your column additional_data contains real dict / json, you can directly use the string accessor .str[] to get the dict values by keys, as follows:
data['Duration'] = data['additional_data'].str['duration']
data['Incoming_Time'] = data['additional_data'].str['is_incoming']
If your column additional_data contains strings of dict (enclosing dict with a pair of single quotes or double quotes), you need to convert the string to dict first, by:
from ast import literal_eval
data['Duration'] = data['additional_data'].map(literal_eval).str['duration']
data['Incoming_Time'] = data['additional_data'].map(literal_eval).str['is_incoming']
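For reference, here is a small sketch that parses the column only once and expands it into the two requested columns. The second sample row is invented for illustration, and the code assumes every cell is a string containing a dict literal.

```python
from ast import literal_eval

import pandas as pd

# First row taken from the question; the second row is a made-up example.
data = pd.DataFrame({
    "additional_data": [
        "{'duration': 0, 'is_incoming': False}",
        "{'duration': 12, 'is_incoming': True}",
    ]
})

# Parse each string into a real dict (assumes every cell is a dict literal).
parsed = data["additional_data"].map(literal_eval)

# Expand the dicts into columns, keeping the original index so join() aligns rows.
expanded = pd.DataFrame(parsed.tolist(), index=data.index)
expanded = expanded.rename(columns={"duration": "Duration", "is_incoming": "Incoming_Time"})

data = data.join(expanded)
print(data[["Duration", "Incoming_Time"]])
```

Parsing once and expanding avoids calling literal_eval separately for every key.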
I have some data that I put into a pandas DataFrame. Inside cell [0,5] I have a list of times that I want to access and print out.
Dataframe:
GAME_A PROCESSING_SPEED
yellow_selected 19
red_selected 0
yellow_total 19
red_total 60
counters [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]
Code:
import pandas as pd
df = pd.read_csv('data.csv', sep = '>')
print(df.iloc[0])
proc_speed = df.iat[0,5]
print(proc_speed[2])
When I try to print the 3rd time in the list I get .. I tried to use a for loop to print the times but instead I get this. How can I access specific values from the list? How would I print out the 3rd time, 0.883263?
[
0
.
8
4
9
9
9
8
,
1
.
0
6
6
...
This happens because with the way you are loading the data, the column 'PROCESSING_SPEED' is read as an object type, therefore, all elements of that series are considered strings (i.e., in this case proc_speed = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]", which is exactly the string the loop is printing character by character).
Before printing the values you desire to display (from that cell), one should convert the string to a list of numbers, for example:
proc_speed = df.iat[4,1]
proc_speed = [float(s) for s in proc_speed[1:-1].split(',')]
for num in proc_speed:
    print(num)
Where proc_speed[1:-1].split(',') takes the string containing the list, except for the brackets at the beginning and end, and splits it according to the commas separating values.
In general, we have to be careful when loading columns with varying or ambiguous data types, as Pandas could have trouble parsing them correctly or in the way we want/expect it to be.
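One way to avoid the string problem at load time is the converters argument of pd.read_csv, which applies a function to each raw cell of a column. The sketch below assumes the file has two '>'-separated columns laid out like the table in the question (the actual data.csv is not shown, so the inline text is a stand-in).

```python
import ast
import io

import pandas as pd

# Stand-in for data.csv; the two-column layout is an assumption based on the question.
csv_text = (
    "GAME_A>PROCESSING_SPEED\n"
    "yellow_selected>19\n"
    "red_selected>0\n"
    "yellow_total>19\n"
    "red_total>60\n"
    "counters>[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]\n"
)

# literal_eval turns "19" into the int 19 and "[...]" into a list of floats.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=">",
    converters={"PROCESSING_SPEED": ast.literal_eval},
)

# Select the cell by label rather than by hard-coded position.
times = df.loc[df["GAME_A"] == "counters", "PROCESSING_SPEED"].iat[0]
third_time = times[2]
print(third_time)
```

Because the cell now holds a real list, indexing returns a number instead of a single character.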
You can simply call proc_speed[index] once this variable holds a list. Here is a working example; note that my call to df.iat uses different indexes:
import pandas as pd
d = {'GAME_A':['yellow_selected', 'red_selected', 'yellow_total', 'red_total', 'counters'],'PROCESSING_SPEED':[19,0,19,60,[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]]}
df = pd.DataFrame(d)
proc_speed = df.iat[4, 1]
for i in proc_speed:
    print(i)
0.849998
1.066601
0.883263
0.91658
0.96668
proc_speed[1]
1.066601
proc_speed[3]
0.91658
You can convert with apply, it's easier than splitting, and converts your ints to ints:
pd.read_clipboard(sep=r"(?!\s+(?<=,\s))\s+")['PROCESSING_SPEED'].apply(eval)[4][2]
# 0.883263
I want to convert a python dictionary to a pandas DataFrame, but as the dictionary values are not of the same length, when I do:
recomm = pd.DataFrame(recommendation.items(),columns=['id','recId1','recId2','recId3','recId4','recId5'])
I get:
6 columns passed, passed data had 2 columns
which means that one of the provided values is of length 2.
To correct it, I did:
for key in recommendation.keys():
    while True:
        l1 = recommendation[key]
        l1.append(0)
        recommendation[key] = l1
        if len(l1) < 5:
            break
But I still get the error when converting to DF.
I checked the dictionary as follow:
for key in recommendation.keys():
    if len(recommendation[key]) != 5:
        print key
and discovered that 0 was added to those of length 5 too, which means some of the values now have length 6.
e.g. a dictionary value:
[12899423, 12907790, 12443129, 12558006, 12880407, 0]
How can I correct the while loop so that it ONLY adds 0 to the list of values if the length of the list is < 5?
And is there a better way to convert the dictionary to a pandas DataFrame?
Dictionary keys are: int and str.
You could use the following:
In Python 2.7, use iteritems(), as it returns an iterator over the dictionary; in Python 3.x, use items(), which behaves the same way here.
import numpy as np
import pandas as pd
# Your dictionary
d = dict(A=np.array([1, 2]), B=np.array([1, 2, 3, 4]))
# Wrapping each value in a Series lets pandas align values of different lengths
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})  # d.iteritems() in Python 2.7
It will fill the dataframe with NaN values for the missing entries, then you just have to call the fillna function :
df.fillna(0,inplace=True)
Your missing data will now be filled with zeros
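To get the layout asked for in the question (one row per id, five recommendation columns), a possible sketch is to pad each list to length 5 and build the frame with orient='index'. The sample dictionary is made up; only the column names come from the question.

```python
import pandas as pd

# Made-up sample with int and str keys and lists of different lengths.
recommendation = {
    101: [12899423, 12907790],
    "abc": [12443129, 12558006, 12880407, 12899423, 12907790],
}

n_cols = 5
rec_columns = ["recId1", "recId2", "recId3", "recId4", "recId5"]

# Pad with zeros only when a list is shorter than n_cols; a negative
# repeat count yields an empty list, so full lists are left unchanged.
padded = {
    key: list(values) + [0] * (n_cols - len(values))
    for key, values in recommendation.items()
}

# orient='index' turns each dictionary key into a row label.
recomm = pd.DataFrame.from_dict(padded, orient="index", columns=rec_columns)
recomm = recomm.rename_axis("id").reset_index()
print(recomm)
```

The original loop appended before checking the length, so every list received at least one extra 0; checking first, as in `while len(l1) < 5: l1.append(0)`, avoids that.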
I am trying to figure out the best way of creating tuples with the format:
(x:y) from 2 columns in a dataframe and then use column a of the dataframe as the key of the tuple
key data_1 data_2
0 14303 24.75 25.03
1 12009 25.00 25.07
2 14303 24.99 25.15
3 12009 24.62 24.77
The resulting dictionary
{14303 24.38:24.61 24:99:25:15
12009 24.62:24.77 25.00:25.07 }
I have tried to use iterrows and enumerate but was wondering if there is a more efficient way to achieve it
I think you wanted to append the (data_1, data_2) tuple as a value for the given key. This solution uses iterrows(), which I acknowledge you said you already use. If this is not what you are looking for, please post your code and exactly the output you want. I don't know if there is a native method in pandas to do this.
# df is the dataframe
from collections import defaultdict
sample_dict = defaultdict(list)
for line in df.iterrows():
    # line is an (index, row) pair; line[1] is the row as a Series
    k = line[1].iloc[0] # key
    d_tuple = (line[1].iloc[1], line[1].iloc[2]) # (data_1, data_2)
    sample_dict[k].append(d_tuple)
sample_dict is therefore:
defaultdict(list,
{12009.0: [(25.0, 25.07), (24.620000000000001, 24.77)],
14303.0: [(24.75, 25.030000000000001),
(24.989999999999998, 25.149999999999999)]})
sample_dict[12009] is therefore:
[(25.0, 25.07), (24.620000000000001, 24.77)]
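As a possible alternative to iterrows(), zipping the three columns avoids building a Series per row and keeps the keys as integers (the float keys in the output above come from iterrows() upcasting mixed int/float rows). This is a sketch using the sample data from the question; the name pairs_by_key is made up.

```python
from collections import defaultdict

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "key": [14303, 12009, 14303, 12009],
    "data_1": [24.75, 25.00, 24.99, 24.62],
    "data_2": [25.03, 25.07, 25.15, 24.77],
})

pairs_by_key = defaultdict(list)

# zip walks the columns in parallel, one (key, data_1, data_2) triple per row.
for key, first, second in zip(df["key"], df["data_1"], df["data_2"]):
    pairs_by_key[key].append((first, second))

print(dict(pairs_by_key))
```

Each key maps to a list of (data_1, data_2) tuples in row order.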
Update:
You might take a look at this thread too:
https://stackoverflow.com/a/24368660/4938264