I have a .csv file containing data about movies and I'm trying to reformat it as a JSON file to use in MongoDB. So I loaded the csv file into a pandas DataFrame and then used the to_json method to write it back.
Here is how one row of the DataFrame looks:
In [43]: result.iloc[0]
Out[43]:
title Avatar
release_date 2009
cast [{"cast_id": 242, "character": "Jake Sully", "...
crew [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object
but when pandas writes it back, it becomes like this:
{ "title":"Avatar",
"release_date":"2009",
"cast":"[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"credit_id\": \"5602a8a7c3a3685532001c9a\", \"gender\": 2,...]",
"crew":"[{\"credit_id\": \"52fe48009251416c750aca23\", \"department\": \"Editing\", \"gender\": 0, \"id\": 1721,...]"
}
As you can see, 'cast' and 'crew' are lists, and they have tons of redundant backslashes. These backslashes appear in the MongoDB collections and make it impossible to extract data from these two fields.
How can I solve this problem other than replacing \" with "?
P.S.1: this is how I save the DataFrame as JSON:
result.to_json('result.json', orient='records', lines=True)
UPDATE 1:
Apparently pandas is doing just fine and the problem is caused by the original csv files.
Here is what they look like:
movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3},...]"
I tried to replace "" with " (and I really wanted to avoid this hack):
sed -i 's/\"\"/\"/g'
And of course it caused problems in some lines of data when reading it as csv again:
ParserError: Error tokenizing data. C error: Expected 1501 fields in line 4, saw 1513
So we can conclude it's not safe to do such blind replacement. Any idea?
P.S.2: I'm using kaggle's 5000 movie dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
I had the same issue; the solution is in three steps, plus an optional fourth:
1- Build the DataFrame from csv, or in my case from xlsx:
excel_df = pd.read_excel(dataset, sheet_name=my_sheet_name)
2- Convert to JSON (pass date_format='iso' if you have dates in your data):
json_str = excel_df.to_json(orient='records', date_format='iso')
3- The most important thing: json.loads. This is it!
parsed = json.loads(json_str)
4- (Optional) Write or send the JSON file, for example writing locally:
with open(out, 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))
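Putting the steps together, here is a minimal end-to-end sketch (dataset.xlsx, Sheet1 and out.json are placeholder names, not from the original post):
import json
import pandas as pd

# placeholder input file and sheet name - adjust to your data
excel_df = pd.read_excel('dataset.xlsx', sheet_name='Sheet1')

# serialize, then re-parse so the records become Python objects again
json_str = excel_df.to_json(orient='records', date_format='iso')
parsed = json.loads(json_str)

# write the result locally
with open('out.json', 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))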
More info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
Pandas is escaping the " character because it thinks the values in the json columns are text. To get the desired behaviour, simply parse the values in the json columns as JSON.
Let the file data.csv have the following content (with quotes escaped):
# data.csv
movie_id,title,cast
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}]"
Read this into a dataframe, then apply the json.loads function and write it out to a file as JSON:
import json
import pandas as pd

df = pd.read_csv('data.csv')
df.cast = df.cast.apply(json.loads)
df.to_json('data.json', orient='records', lines=True)
The output is properly formatted JSON (extra newlines added by me):
# data.json
{"movie_id":19995,
"title":"Avatar",
"cast":[{"cast_id":242,"character":"Jake Sully","credit_id":"5602a8a7c3a3685532001c9a","gender":2,"id":65731,"name":"Sam Worthington","order":0},
{"cast_id":3,"character":"Neytiri","credit_id":"52fe48009251416c750ac9cb","gender":1,"id":8691,"name":"Zoe Saldana","order":1},
{"cast_id":25,"character":"Dr. Grace Augustine","credit_id":"52fe48009251416c750aca39","gender":1,"id":10205,"name":"Sigourney Weaver","order":2},
{"cast_id":4,"character":"Col. Quaritch","credit_id":"52fe48009251416c750ac9cf","gender":2,"id":32747,"name":"Stephen Lang","order":3}]
}
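A variation on the same idea is to do the parsing at read time, via the converters argument of read_csv, instead of with .apply afterwards:
import json
import pandas as pd

# parse the embedded JSON in the 'cast' column while reading the csv
df = pd.read_csv('data.csv', converters={'cast': json.loads})
df.to_json('data.json', orient='records', lines=True)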
Related
How can I get JSON output from pandas where the rows are separated by newlines? For example, if I have a dataframe like:
import pandas as pd
data = [{'a': 1, 'b': 2},
{'a': 3, 'b': 4}]
df = pd.DataFrame(data)
print("First:\n", df.to_json(orient="records"))
print("Second:\n", df.to_json(orient="records", lines=True))
Output:
First:
[{"a":1,"b":2},{"a":3,"b":4}]
Second:
{"a":1,"b":2}
{"a":3,"b":4}
But I really want an output like so:
[{"a":1,"b":2},
{"a":3,"b":4}]
or
[
{"a":1,"b":2},
{"a":3,"b":4}
]
I really just want each record to be separated by a new line, but still be valid JSON that can be read. I know I can use to_json with lines=True and just split by newline then .join, but I'm wondering if there is a more straightforward/faster solution using just pandas.
Here you go:
import json
list(json.loads(df.to_json(orient="index")).values())
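That returns a Python list of dicts. To write it back out in the newline-separated but still-valid-JSON shape the question asks for, one small sketch is to serialize each record yourself:
import json

records = list(json.loads(df.to_json(orient="index")).values())
# one record per line, commas between records, valid JSON overall
print("[\n" + ",\n".join(json.dumps(r, separators=(',', ':')) for r in records) + "\n]")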
Use the indent parameter:
import pandas as pd
data = [{'a': 1, 'b': 2},
{'a': 3, 'b': 4}]
df = pd.DataFrame(data)
print(df.to_json(orient="records", indent=1))
#output :
[
{
"a":1,
"b":2
},
{
"a":3,
"b":4
}
]
Why don't you just add the brackets? Note that lines=True separates the rows with newlines only, so commas have to be inserted as well to keep the result valid JSON:
rows = df.to_json(orient='records', lines=True).rstrip('\n').replace('\n', ',\n')
print(f"First:\n[{rows}]")
print(f"Second:\n[\n{rows}\n]")
I am working with SQLAlchemy and want to fetch data from the database and convert it into JSON format.
I have the code below:
db_string = "postgres://user:pwd#10.**.**.***:####/demo_db"
Base = declarative_base()
db = create_engine(db_string)
record = db.execute("SELECT name, columndata, gridname, ownerid, issystem, ispublic, isactive FROM col.layout WHERE (ispublic=1 AND isactive=1) OR (isactive=1 AND ispublic=1 AND ownerid=ownerid);")
for row in record:
result.append(row)
print(result)
Data is coming in this format:
[('layout-1', {'theme': 'blue', 'sorting': 'price_down', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RealTimeGrid', 1, 0, 1, 1), ('layout-2', {'theme': 'orange', 'sorting': 'price_up', 'filtering': ['FX Rate', 'Start Price']}, 'RealBalancing Grid', 2, 0, 1, 1), ('layout-3', {'theme': 'red', 'sorting': 'mv_price', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RT', 3, 0, 1, 1)]
But I am facing a lot of issues converting the above result into JSON format. Please suggest.
Your data is basically a list of tuples. For example, the first tuple looks like this:
('layout-3',
{'filtering': ['Sub Strategye', 'PM Strategy'],
'sorting': 'mv_price',
'theme': 'red'},
'RT',
3,
0,
1,
1)
If you want to convert the whole data as-is to JSON, you can use the json module's dumps function:
import json
jsn_data = json.dumps(data)
Your list of tuples is converted to JSON:
[["layout-1", {"theme": "blue", "sorting": "price_down", "filtering": ["Sub Strategye", "PM Strategy"]}, "RealTimeGrid", 1, 0, 1, 1], ["layout-2", {"theme": "orange", "sorting": "price_up", "filtering": ["FX Rate", "Start Price"]}, "RealBalancing Grid", 2, 0, 1, 1], ["layout-3", {"theme": "red", "sorting": "mv_price", "filtering": ["Sub Strategye", "PM Strategy"]}, "RT", 3, 0, 1, 1]]
But if you need the JSON as key-value pairs, you first need to convert each result row into a Python dictionary and then use json.dumps(dictionary_var), as sketched below.
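For example, a minimal sketch (the column names are copied from the SELECT in the question):
import json

columns = ['name', 'columndata', 'gridname', 'ownerid', 'issystem', 'ispublic', 'isactive']

# pair each row's values with the column names, then dump the list of dicts
dict_rows = [dict(zip(columns, row)) for row in data]
json_data = json.dumps(dict_rows, indent=4)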
What you want to accomplish is called "serialization".
You can follow Sudhanshu Patel's answer if you just want to dump JSON into the response.
However, if you intend to build a more sophisticated application, consider using a serialization library: you'll be able to load request data into the db, check that input data is in the right format, and send responses in a standardised format.
Check these libraries:
Marshmallow
Python's own Pickle
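For illustration, a minimal Marshmallow sketch (the field names and types are assumptions based on the query in the question):
from marshmallow import Schema, fields

class LayoutSchema(Schema):
    name = fields.Str()
    columndata = fields.Dict()
    gridname = fields.Str()
    ownerid = fields.Int()
    issystem = fields.Int()
    ispublic = fields.Int()
    isactive = fields.Int()

# dump a list of row dicts (built as in the previous answer) to a JSON string
schema = LayoutSchema(many=True)
json_str = schema.dumps(dict_rows)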
I am fairly new to Python and have several nested JSON files I need to convert to CSV's.
The structure of these is the following:
{'Z': {'#SchemaVersion': 9,
'FD': [{'FDRecord': [{'NewCase': {'#TdrRecVer': 5,
'CaseLabel': '',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5815D34615C15690936B822714009468',
'MsecTime': 5012,
'RecId': 4},
'UniqueCaseId': '5389F346136315497325122714009468'}},
{'NewCase': {'#TdrRecVer': 5,
'CaseLabel': '',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5819D346166610458312622714009468',
'MsecTime': 9459,
'RecId': 4},
'UniqueCaseId': '5819F346148627009653284714009468'}},
{'AnnotationEvt': {'#EvtName': 'New',
'#TdrRecVer': 1,
'DevEvtCode': 13,
'Payload': '0 0 0 0',
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5899D34616BC1000938B824538219968',
'MsecTime': 7853,
'RecId': 8},
'TreatmentSummary': 1,
'XidCode': '0000000B'}},
{'TrendRpt': {'#TdrRecVer': 9,
'CntId': 0,
'DevEvtCode': 30,
'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
'ElapsedTime': 0,
'GUID': '5819C34616781004852225698409468',
'MsecTime': 4052,
'RecId': 12}, ...
My problem is that most examples online show how to read in a very small JSON file and write it out to a CSV by explicitly stating the keys or field names when creating the CSV. My files are far too large for this, some of them being over 40MB.
I tried following another person's example from online (below), but did not succeed:
import json
import csv
with open('path-where-json-located') as file:
    data_parse = json.load(file)

data = data_parse['Z']
data_writer = open('path-for-csv', 'w')
csvwriter = csv.writer(data_writer)

count = 0
for i in data:
    if count == 0:
        header = i.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(i.values())
data_writer.close()
When I run this, I get the following error:
AttributeError: 'str' object has no attribute 'keys'
I understand that for some reason it is treating the key I want to pull as a string object, but I do not know how to get around this and correctly parse this into a csv.
I have a .JSON file which is around 3GB. I would like to read this JSON data and load it into pandas data frames. Below is what I have done so far.
Step 1: Read the JSON file
import pandas as pd
with open('MyFile.json', 'r') as f:
data = f.readlines()
Step 2: take just one record, since the data is huge and I want to see what it looks like
cp = data[0:1]
print(cp)
['{"reviewerID": "AO94DHGC771SJ", "asin": "0528881469", "reviewerName": "amazdnu", "helpful": [0, 0], "reviewText": "some review text...", "overall": 5.0, "summary": "Gotta have GPS!", "unixReviewTime": 1370131200, "reviewTime": "06 2, 2013"}\n']
Step 3: remove the newline ('\n') characters:
ix = 0
while ix < len(data):
    data[ix] = data[ix].rstrip("\n")
    ix += 1
Questions:
Why is this JSON data a string? Am I making any mistakes?
How do I convert it into a dictionary?
What have I tried?
I tried b = dict(zip(data[0::2], data[1::2])),
but got: 'dict' object is not callable
I also tried joining, but that did not work
Can any one please help me? Thanks!
Why haven't you tried pandas.read_json? Since each record sits on its own line, pass lines=True:
import pandas as pd
df = pd.read_json('MyFile.json', lines=True)
It works for the example you posted!
In[82]: i = '{"reviewerID": "AO94DHGC771SJ", "asin": "0528881469", "reviewerName": "amazdnu", "helpful": [0, 0], "reviewText": "some review text...", "overall": 5.0, "summary": "Gotta have GPS!", "unixReviewTime": 1370131200, "reviewTime": "06 2, 2013"}'
In[83]: pd.read_json(i)
Out[83]:
asin helpful overall reviewText reviewTime reviewerID reviewerName summary unixReviewTime
0 528881469 0 5 some review text... 06 2, 2013 AO94DHGC771SJ amazdnu Gotta have GPS! 1370131200
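For a file of this size (3GB) you may not want to load everything at once. Since the records are line-delimited, read_json also accepts chunksize together with lines=True, which returns an iterator of DataFrames (a sketch; the chunk size is arbitrary):
import pandas as pd

reader = pd.read_json('MyFile.json', lines=True, chunksize=100000)
for chunk in reader:
    # chunk is an ordinary DataFrame; replace print with your own handling
    print(chunk.shape)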
I have successfully created some data bar plots with the Python module XlsxWriter via its conditional_format method.
However, is it possible to specify the fill pattern of a conditional format within XlsxWriter, or in Python generally?
I tried the following code, which didn't work:
myformat = workbook.add_format()
myformat.set_pattern(1)
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar',
'min_value': 0,
'min_type': 'num',
'max_value': 110,
'max_type': 'num',
'bar_color': '#C00000',
'format': myformat})
It is possible to set the pattern of the format for some conditional formats with XlsxWriter.
However, as far as I know, it isn't possible, in Excel, to set a pattern type for data_bar conditional formats.
You could do it with a cell format:
import xlsxwriter
workbook = xlsxwriter.Workbook('hello_world3.xlsx')
worksheet = workbook.add_worksheet()
myformat = workbook.add_format()
myformat.set_pattern(1)
myformat.set_bg_color('#C00000')
worksheet.conditional_format(0, 4, 0, 4, {'type': 'cell',
'criteria': 'between',
'minimum': 0,
'maximum': 110,
'format': myformat})
worksheet.write(0, 4, 50)
workbook.close()
If, on the other hand, you are looking for a non-gradient data_bar fill, that isn't currently supported.
For newer versions, you only need to set the 'data_bar_2010' property to True:
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar', 'data_bar_2010': True})
See the XlsxWriter documentation on conditional formats for more details.
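For example, combining this with the options from the earlier answer gives a non-gradient (solid) data bar (a sketch; bar_solid is the Excel-2010-only solid-fill option):
import xlsxwriter

workbook = xlsxwriter.Workbook('data_bar_2010.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 4, 50)

# Excel 2010-style data bar with a solid fill instead of a gradient
worksheet.conditional_format(0, 4, 0, 4, {'type': 'data_bar',
                                          'data_bar_2010': True,
                                          'bar_solid': True,
                                          'bar_color': '#C00000'})
workbook.close()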