I have a .JSON file which is around 3GB. I would like to read this JSON data and load it into pandas data frames. Below is what I have done so far.
Step 1: Read JSON file
import pandas as pd
with open('MyFile.json', 'r') as f:
    data = f.readlines()
Step 2: take just one element, since the data is huge and I want to see what it looks like
cp = data[0:1]
print(cp)
['{"reviewerID": "AO94DHGC771SJ", "asin": "0528881469", "reviewerName": "amazdnu", "helpful": [0, 0], "reviewText": "some review text...", "overall": 5.0, "summary": "Gotta have GPS!", "unixReviewTime": 1370131200, "reviewTime": "06 2, 2013"}\n']
Step 3: remove the newline ('\n') characters
ix = 0
while ix < len(data):
    data[ix] = data[ix].rstrip("\n")
    ix += 1
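Each line read this way is itself one JSON document, so instead of stripping newlines by hand you can pass every line to json.loads, which tolerates surrounding whitespace. A minimal sketch with a shortened sample line:

```python
import json

# Shortened stand-in for one element of `data` (from readlines);
# each element is one JSON object followed by a newline.
line = '{"reviewerID": "AO94DHGC771SJ", "asin": "0528881469", "overall": 5.0}\n'

record = json.loads(line)  # the trailing "\n" is ignored by the parser
print(record["reviewerID"])
```

Applied over the whole file, `[json.loads(line) for line in data]` gives a list of dicts ready for the DataFrame constructor.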
Questions:
Why is this JSON data a string? Am I making a mistake?
How do I convert it into a dictionary?
What have I tried?
I tried b = dict(zip(data[0::2], data[1::2])),
but got 'dict' object is not callable.
I also tried joining, but that did not work.
Can anyone please help me? Thanks!
Why haven't you tried pandas.read_json?
import pandas as pd
df = pd.read_json('MyFile.json')
Works for the example you posted!
In[82]: i = '{"reviewerID": "AO94DHGC771SJ", "asin": "0528881469", "reviewerName": "amazdnu", "helpful": [0, 0], "reviewText": "some review text...", "overall": 5.0, "summary": "Gotta have GPS!", "unixReviewTime": 1370131200, "reviewTime": "06 2, 2013"}'
In[83]: pd.read_json(i)
Out[83]:
asin helpful overall reviewText reviewTime reviewerID reviewerName summary unixReviewTime
0 528881469 0 5 some review text... 06 2, 2013 AO94DHGC771SJ amazdnu Gotta have GPS! 1370131200
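For a 3GB file like the one in the question, the same idea scales if the file is newline-delimited JSON (one object per line, as the readlines output suggests): lines=True parses each line separately and chunksize streams the file instead of loading it whole. A sketch with a small in-memory stand-in:

```python
import json
import pandas as pd
from io import StringIO

# Small newline-delimited JSON sample standing in for MyFile.json.
ndjson = "\n".join(json.dumps({"reviewerID": f"id{i}", "overall": 5.0})
                   for i in range(10))

# lines=True: one JSON object per line; chunksize: iterate in pieces
# rather than materializing the whole file at once.
chunks = pd.read_json(StringIO(ndjson), lines=True, chunksize=4)
df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

For the real file, pass the path instead of the StringIO object and pick a chunksize that fits your memory budget.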
Related
I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"})
All the dataframe fields are ASCII strings, and the frame is the output of a SQL query (pd.read_sql_query), so the line above that creates the dataframe may not be quite right.
And I wish the final JSON output to be in the form
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": [
"4700A1/305",
"4700A1/312"
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": "4700A1/305, 4700A1/312"
}]
The problem might be the structure of the dataframe, but it is not clear to me how to reshape it to produce the required output.
The JSON line is:
df.to_json(orient='records', indent=2)
Isn't parsing Sections into a list the only thing you need to do?
import pandas as pd
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
{
"Survey":"001_220816080015",
"BCD":"001_220816080015.bcd",
"Sections":[
"4700A1\/305",
"4700A1\/312"
]
}
]
The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]
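Serializing that structure with the standard json module then yields the requested shape; a small sketch repeating the split above (json.dumps, unlike DataFrame.to_json, leaves the / characters unescaped):

```python
import json

input_data = {'Survey': "001_220816080015",
              'BCD': "001_220816080015.bcd",
              'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]

# json.dumps does not escape "/", so the output matches the target format.
text = json.dumps(nested_json, indent=2)
print(text)
```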
I'm fairly new to dealing with .txt files that contain a dictionary. I'm trying to use pd.read_csv to create a dataframe in pandas, but I get the error: Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I found the root of the problem: the file is hard to parse because each row contains a dict whose key-value pairs are separated by commas, which is also the delimiter.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this CSV file and create new columns based on the key-value pairs? In addition, what if other rows contain more key-value pairs, for example more than 11 keys within the dictionary?
Is there an efficient way to create a df from the example above?
My code when trying to read it as CSV:
df = pd.read_csv('store.txt', header=None)
I tried to import json and use a converter, but it did not work; I had converted all the commas to a |:
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
In addition, I also tried to use:
import pandas as pd
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, quotechar="'")
I also thought about adding a quote at the beginning and end of each dictionary to make it a string, but finding the closing brackets seemed too tedious.
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):
from io import StringIO
import re
import json
import pandas as pd

with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")
df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(lambda x: json.loads(x)).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your csv isn't valid JSON: "code":001. It should be either "code":"001" or "code":1.
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...
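If the nested info dicts should also become columns of their own, pd.json_normalize can flatten them; a minimal sketch on records shaped like the parsed report column (sample values invented for illustration):

```python
import pandas as pd

# Records shaped like the parsed "report" column above.
reports = [
    {"Source": "online", "info": {"haircolor": "black", "age": 53}},
    {"Source": "letter", "info": {"haircolor": "red", "age": 22}},
]

# json_normalize flattens nested dicts into dotted column names
# such as "info.haircolor" and "info.age".
flat = pd.json_normalize(reports)
print(flat.columns.tolist())
```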
I am trying to convert a list of objects which have been queried using SQLAlchemy.
The issue is that the process takes around 18-20 seconds to loop, process, and send the data to the frontend. The data is around 5 million rows, which is way too slow to put into production.
Here is an example of what I using.
test = [
{"id": 5, "date":"2022-01-01 00:00:00"},
{"id": 5, "date": "2022-01-01 00:00:00"},
{"id": 5, "date": "2022-01-01 00:00:00"},
{"id": 5, "date": "2022-01-01 00:00:00"},
]
test_dict = {}
for i in test:
    if i["id"] not in test_dict:
        test_dict[i["id"]] = []
    test_dict[i["id"]].append(i["date"].isoformat())
The expected output should be e.g.
[
{5: [date, date, date, date, date]},
{6: [date]}
]
I totally understand this is not working code and I am not looking to fix it. I just wrote this on the fly but my main focus is what to do with the for loop to speed the process up.
Thank you everyone for your help.
Thank you everyone for your answers so far.
Providing more info: the data needs to be sent to the frontend, which then renders it on a graph. This data is updated around every minute or so and can also be requested between two time ranges. These time ranges are always a minimum of 35 days, so the rows returned are always a minimum of 5 million or so. 20 seconds for a graph to load is too slow for the end user. The for loop is the cause of this bottleneck, but it would be nice to get it down to at least 5 seconds.
Thank you
Extra info:
Processing database-side is unfortunately not an option for this. The data must be converted to the correct format inside the API. For example, concatenating the data into the correct format or converting it to JSON during the query isn't an option.
If I understood correctly, you can use pandas dataframes:
test = [
{"id": 5, "date":"2022-01-01 00:00:00"},
{"id": 5, "date": "2022-02-01 00:00:00"},
{"id": 5, "date": "2022-03-01 00:00:00"},
{"id": 5, "date": "2022-04-01 00:00:00"},
{"id": 6, "date": "2022-05-01 00:00:00"},
]
import pandas as pd
df = pd.DataFrame.from_dict(test)
res = df.groupby("id").agg(list)
print(res)
Output :
date
id
5 [2022-01-01 00:00:00, 2022-02-01 00:00:00, 2022-03-01 00:00:00, 2022-04-01 00:00:00]
6 [2022-05-01 00:00:00]
And if you want it as a dict, you can use res.to_dict().
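If you'd rather avoid pandas, the same grouping can be done in plain Python with collections.defaultdict, which drops the membership test from the original loop; a sketch on sample data like the question's:

```python
from collections import defaultdict

test = [
    {"id": 5, "date": "2022-01-01 00:00:00"},
    {"id": 5, "date": "2022-02-01 00:00:00"},
    {"id": 6, "date": "2022-05-01 00:00:00"},
]

# defaultdict(list) creates the empty list on first access,
# so no "if key not in dict" check is needed per row.
grouped = defaultdict(list)
for row in test:
    grouped[row["id"]].append(row["date"])

print(dict(grouped))
```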
You should probably not send 5 million objects to the frontend.
Usually we use pagination, filters, and sort elements.
Then if you are really willing to do so, the fastest way would probably be to cache your data, for instance by creating and maintaining a json file on your server that the clients would download.
I want to change the JSON structure to generate the expected output.
I don't want to achieve it with Python and Pandas.
Any idea how to change the JSON format so that I can get the output from pd.read_json(json_str) directly?
Thanks
Current output of dataframe
json_str='''{
"2013-03-20_change_in_real_gdp":{
"2013":{
"upper_end_of_central_tendency":"2.8",
"lower_end_of_range":"2.0"
},
"2014":{
"upper_end_of_central_tendency":"3.4",
"lower_end_of_range":"2.6"
}
},
"2012-04-25_change_in_real_gdp":{
"2013":{
"upper_end_of_central_tendency":"7.7",
"lower_end_of_range":"7.0"
},
"2014":{
"upper_end_of_central_tendency":"7.4",
"lower_end_of_range":"6.3"
}
}
}'''
pd.read_json(json_str)
This is the expected output from dataframe
Working backwards, you can create your dataframe and then convert it to JSON to see the expected format. When you try to convert it to JSON, you'll get an error because the measure index values are not unique. After resetting the index, you'll get the following JSON.
import json
df = pd.DataFrame({2013: [2.8, 2, 7.7, 7],
2014: [3.4, 2.6, 7.4, 6.3],
'source': ['2013-03-20_change_in_real_gdp',
'2013-03-20_change_in_real_gdp',
'2012-04-25_change_in_real_gdp',
'2012-04-25_change_in_real_gdp']},
index=['upper_end_of_central_tendency',
'lower_end_of_range',
'upper_end_of_central_tendency',
'lower_end_of_range'])
df.index.name = 'measure'
>>> df.reset_index().to_json()
'{"measure": {"0":"upper_end_of_central_tendency",
"1":"lower_end_of_range",
"2":"upper_end_of_central_tendency",
"3":"lower_end_of_range"},
"2013": {"0":2.8,"1":2.0,"2":7.7,"3":7.0},
"2014": {"0":3.4,"1":2.6,"2":7.4,"3":6.3},
"source": {"0":"2013-03-20_change_in_real_gdp",
"1":"2013-03-20_change_in_real_gdp",
"2":"2012-04-25_change_in_real_gdp",
"3":"2012-04-25_change_in_real_gdp"}}'"""
This is the first time I have used Stack Overflow to ask a question. My English is poor, so if I accidentally offend you with my wording, please don't mind.
I have a JSON file (access.json) in a format like:
[
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },
{u'IP': u'aaaa2', u'Domain': u'bbbb2', u'Time': u'cccc2', ..... },
{u'IP': u'aaaa3', u'Domain': u'bbbb3', u'Time': u'cccc3', ..... },
{u'IP': u'aaaa4', u'Domain': u'bbbb4', u'Time': u'cccc4', ..... },
{ ....... },
{ ....... }
]
When I use:
ipython
import pandas as pd
data = pd.read_json('./access.json')
it returns:
ValueError: Expected object or value
This is the result I want:
[out]
IP Domain Time ...
0 aaaa1 bbbb1 cccc1 ...
1 aaaa2 bbbb2 cccc2 ...
2 aaaa3 bbbb3 cccc3 ...
3 aaaa4 bbbb4 cccc4 ...
...and so on
What should I do to achieve this? Thank you for answering!
This isn't valid JSON, which is why read_json won't parse it.
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },
should be
{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1", ..... },
You could smash this (the entire file) with a regular expression to find these, for example:
In [11]: line
Out[11]: "{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'},"
In [12]: re.sub("(?<=[\{ ,])u'|'(?=[:,\}])", '"', line)
Out[12]: '{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"},'
Note: this will get tripped up by some strings, so use with caution.
A better "solution" would be to ensure you had valid json in the first place... It looks like this has come from python's str/unicode/repr rather than json.dumps.
Note: json.dumps produces valid json, so can be read by read_json.
In [21]: repr({u'IP': u'aaa'})
Out[21]: "{u'IP': u'aaa'}"
In [22]: json.dumps({u'IP': u'aaa'})
Out[22]: '{"IP": "aaa"}'
If someone else created this "json", then complain! It's not json.
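Putting the regex and read_json together, here is a sketch of the whole repair on an in-memory stand-in for the file (with a real file, the same substitution would be applied to f.read()):

```python
import re
import pandas as pd
from io import StringIO

# Stand-in for the contents of access.json: Python repr, not valid JSON.
raw = "[{u'IP': u'aaaa1', u'Domain': u'bbbb1'}, {u'IP': u'aaaa2', u'Domain': u'bbbb2'}]"

# Replace u'...' quoting with plain "..." so the text becomes valid JSON.
# Caveat from above: strings containing quotes/braces can trip this up.
fixed = re.sub(r"(?<=[\{ ,])u'|'(?=[:,\}])", '"', raw)

df = pd.read_json(StringIO(fixed))
print(df)
```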
This is not JSON; it is the repr of a list of Python dictionaries. You can use ast.literal_eval() to get the actual list from the file and pass it to the DataFrame constructor:
from ast import literal_eval
import pandas as pd

with open('./access.log2.json') as f:
    data = literal_eval(f.read())

df = pd.DataFrame(data)
print(df)
Output for the example data you've provided:
Domain IP Time
0 bbbb1 aaaa1 cccc1
1 bbbb2 aaaa2 cccc2
2 bbbb3 aaaa3 cccc3
3 bbbb4 aaaa4 cccc4
You can also use
pd.read_json("{json_file_name}", orient='records')
assuming that the JSON data is in list format as shown in the question.