I have a big CSV file (approx. 1 GB) that I want to convert to a JSON file in the following way:
The CSV file has the following structure:
header: tid;inkey;outkey;value
values:
tid1;inkey1;outkey1;value1
tid1;inkey2;outkey2;value2
tid2;inkey2;outkey3;value2
tid2;inkey4;outkey3;value2
etc.
The idea is to convert this CSV to a JSON file with the following structure, basically grouping everything by "tid":
{
    "tid1": {
        "inkeys": ["inkey1", "inkey2"],
        "outkeys": ["outkey1", "outkey2"]
    }
}
I can imagine how to do it with normal Python dicts and lists, but my problem is also the huge amount of data I have to process. I suppose pandas can help here, but I am still very confused by this tool.
I think this should be straightforward to do with standard Python data structures such as defaultdict. Unless you have very limited memory, I see no reason why a 1 GB file would be problematic with this approach.
Something like (did not test):
from collections import defaultdict
import csv
import json

out_data = defaultdict(lambda: {"inkeys": [], "outkeys": [], "values": []})

with open("your-file.csv") as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip the header row
    for line in reader:
        tid, inkey, outkey, value = line
        out_data[tid]["inkeys"].append(inkey)
        out_data[tid]["outkeys"].append(outkey)
        out_data[tid]["values"].append(value)

print(json.dumps(out_data))
There might be a faster or more memory efficient way to do it with Pandas or others, but simplicity and zero dependencies go a long way.
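One hedged tweak to the sketch above: stream the result to disk with json.dump instead of printing it, so the serialized JSON never has to exist as one giant in-memory string (the grouped dict itself still lives in RAM):
# Replace the final print with a streaming write (file name is a placeholder)
with open("out.json", "w") as out:
    json.dump(out_data, out)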
First you need to use pandas to read your CSV into a dataframe. Say the CSV is saved in a file called my_file.csv; since it is semicolon-delimited, you call
import pandas as pd
my_df = pd.read_csv('my_file.csv', sep=';')
Then you need to convert this dataframe to the form that you specified. The following call groups the rows by tid and builds a dict keyed by tid (note that each value comes out as a JSON string of that group's records, not a nested dict):
my_json = dict(my_df.set_index('tid').groupby(level=0).apply(lambda x: x.to_json(orient='records')))
Now you can export it to a json file if you want
import json
with open('my_json.json', 'w') as outfile:
    json.dump(my_json, outfile)
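If the 1 GB file is too large to load comfortably in one go, a hedged alternative sketch (not part of the original answer) is to let read_csv yield the file in chunks and accumulate the per-tid lists incrementally; the column names are taken from the question's header:
import pandas as pd
from collections import defaultdict

groups = defaultdict(lambda: {"inkeys": [], "outkeys": []})
# chunksize turns read_csv into an iterator of smaller dataframes
for chunk in pd.read_csv('my_file.csv', sep=';', chunksize=100_000):
    for tid, g in chunk.groupby('tid'):
        groups[tid]["inkeys"].extend(g['inkey'].tolist())
        groups[tid]["outkeys"].extend(g['outkey'].tolist())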
You can use Pandas with groupby and a dictionary comprehension:
from io import StringIO
import pandas as pd
mystr = StringIO("""tid1;inkey1;outkey1;value1
tid1;inkey2;outkey2;value2
tid2;inkey2;outkey3;value2
tid2;inkey4;outkey3;value2""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr, sep=';', header=None, names=['tid', 'inkeys', 'outkeys', 'values'])
# group by the tid column
grouper = df.groupby('tid')
# nested dictionary comprehension with selected columns
res = {k: {col: v[col].tolist() for col in ('inkeys', 'outkeys')} for k, v in grouper}
print(res)
{'tid1': {'inkeys': ['inkey1', 'inkey2'], 'outkeys': ['outkey1', 'outkey2']},
 'tid2': {'inkeys': ['inkey2', 'inkey4'], 'outkeys': ['outkey3', 'outkey3']}}
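To finish the job from the original question, the resulting dict can then be written out as JSON (a small addition to this answer):
import json

with open('out.json', 'w') as f:
    json.dump(res, f, indent=4)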
Similar to the other defaultdict() answer:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

with open('file.txt') as in_file:
    next(in_file)  # skip the header row, if your file has one
    for line in in_file:
        tid, inkey, outkey, value = line.strip().split(';')
        d[tid]['inkeys'].append(inkey)
        d[tid]['outkeys'].append(outkey)
        d[tid]['values'].append(value)
I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the JSON file and make a dataframe where each key-value pair becomes a row, with headers "Sentence" and "Label". I tried using lines=True but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
You can use:
import json
import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'Sentence', 0: 'Label'})
Easy way that I remember
import pandas as pd
import json
with open("./data.json", "r") as f:
data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
.T\
.reset_index()\
.rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
It's a little dirty, but as you probably noticed, lines=True isn't sufficient on its own, so the above transposes the result so that you have
(index)    0
Text1      4
Text2      1
TextN    123
Resetting the index then moves the row labels over to a column named "index", and the final rename gives the columns their proper names.
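For the sample data above, the end result should look like:
  Sentence  Label
0    Text1      4
1    Text2      1
2    TextN    123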
I'm writing a very small Pandas dataframe to a JSON file. In fact, the Dataframe has only one row with two columns.
To build the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({'date': '2020-10-05', 'ppm': 411.1}, orient='index').T
print(df)
prints
date ppm
0 2020-10-05 411.1
The desired json output is as follows:
{
"date": "2020-10-05",
"ppm": 411.1
}
but when writing the json with pandas, I can only print it as an array with one element, like so:
[
{
"date":"2020-10-05",
"ppm":411.1
}
]
I've currently hacked my code to convert the Dataframe to a dict, and then use the json module to write the file.
import json
data = df.to_dict(orient='records')
data = data[0] # keep the only element
with open('data.json', 'w') as fp:
    json.dump(data, fp, indent=2)
Is there a native way with pandas' .to_json() to keep the only dictionary item if there is only one?
I am currently using .to_json() like this, which incorrectly prints the array with one dictionary item.
df.to_json('data.json', orient='index', indent=2)
Python 3.8.6
Pandas 1.1.3
If you want to export only one row, use iloc:
print (df.iloc[0].to_dict())
#{'date': '2020-10-05', 'ppm': 411.1}
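As a follow-up (my addition, not part of the original answer): df.iloc[0] is a Series, and Series.to_json serializes it as a plain JSON object, which matches the desired output without the wrapping array:
# Writes {"date":"2020-10-05","ppm":411.1} straight to the file
df.iloc[0].to_json('data.json', indent=2)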
I have a CSV file (image attached) and need to take the CSV file and create a dictionary of lists in the format "{method},{number},{orbital_period},{mass},{distance},{year}".
So far I have code :
import csv
with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)

print(inputm)
but my output is coming out like ['Radial Velocity', '1', '269.3', '7.1', '77.4', '2006']
when I want it to look like:
"Radial Velocity": {"number": [1, 1, 1], "orbital_period": [269.3, 874.774, 763.0], "mass": [7.1, 2.21, 2.6], "distance": [77.4, 56.95, 19.84], "year": [2006.0, 2008.0, 2011.0]},
"Transit": {"number": [1, 1, 1], "orbital_period": [1.5089557, 1.7429935, 4.2568], "mass": [], "distance": [200.0, 680.0], "year": [2008.0, 2008.0, 2008.0]}
Any ideas on how I can alter my code?
Hey SKR01, welcome to Stack Overflow!
I would suggest working with the pandas library. It is meant for table-like content such as what you have there. What you are looking for is a groupby on your #method column.
import pandas as pd
def remove_index(row):
    d = row._asdict()
    del d["Index"]
    return d

df = pd.read_csv("https://docs.google.com/uc?export=download&id=1PnQzoefx-IiB3D5BKVOrcawoVFLIPVXQ")

{row.Index: remove_index(row) for row in df.groupby('#method').aggregate(list).itertuples()}
The only thing that remains is removing the nan values from the resulting dict.
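A hedged sketch of that remaining step, assuming the comprehension's result was stored in a variable named result (the name is mine):
import math

# Drop float NaN entries from every per-column list
cleaned = {method: {col: [v for v in vals
                          if not (isinstance(v, float) and math.isnan(v))]
                    for col, vals in cols.items()}
           for method, cols in result.items()}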
If you don't want to use Pandas, maybe something like this is what you're looking for:
import csv
with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)

header = inputm.pop(0)
del header[0]  # you probably don't want "#method" as a column

# create and populate the final dictionary
data = {}
for row in inputm:
    if row[0] not in data:
        data[row[0]] = {h: [] for h in header}
    for i, h in enumerate(header):
        data[row[0]][h].append(row[i + 1])

print(data)
This is a bit complex, and I'm questioning why you want the data this way, but this should get you the output format you want without requiring any external libraries like Pandas.
import csv
with open('exoplanets.csv') as input_file:
    rows = list(csv.DictReader(input_file))

# Create the data structure
methods = {d["#method"]: {} for d in rows}

# Get a list of fields, trimming off the method column
fields = list(rows[0])[1:]

# Fill in the data structure
for method in methods:
    methods[method] = {
        # Null-trimmed version of the list comprehension:
        # f: [r[f] for r in rows if r["#method"] == method and r[f]]
        f: [r[f] for r in rows if r["#method"] == method]
        for f
        in fields
    }
Note: This could be one multi-tiered list/dict comprehension, but I've broken it apart for clarity.
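If you also want numbers rather than strings (as in the desired output), a hedged tweak is to convert each non-empty value while building the lists, e.g. replace the inner comprehension with:
f: [float(r[f]) for r in rows if r["#method"] == method and r[f]]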
I am trying to import a (very) large JSON file (3.3M rows, 1k columns) that has multiple nested JSONs within it. Some of these nested JSONs are doubly nested. I have found two ways to import the JSON file into a dataframe; however, I can't get the imported JSON to be flattened and converted to strings at the same time.
The codes I am using are:
# 1: Import directly and convert to string
import pandas as pd

def Data_IMP(path):
    with open(path) as Data:
        Data_IMP = pd.read_json(Data, dtype=str)
        Data_IMP = Data_IMP.replace("nan", "", regex=True)
    return Data_IMP
The issue with the above is that it doesn't flatten the json file fully.
# 2: Import json and normalise
import json
from pandas import json_normalize

def Data_IMP(path):
    with open(path) as Data:
        d = json.load(Data)
        Data_IMP = json_normalize(d)
    return Data_IMP
The above script flattens out the json, but lets Python decide on the dtype for each column.
Is there a way to combine these approaches, so that the JSON file is flattened and all columns are read as strings?
I found a solution that worked and was able to both import and flatten the JSONs, as well as convert all text to strings.
import json
import numpy as np
from pandas import json_normalize

# Function to import data from ARIC json file to dataframe
def Data_IMP(path):
    with open(path) as Data:
        d = json.load(Data)
        Data_IMP = json_normalize(d)
    return Data_IMP

# --------------------------------------------------------------------------------------------------------- #

# Function to cleanse Data file
def Data_Cleanse(Data_IMP):
    Data_Cleanse = Data_IMP.replace(np.nan, '', regex=True)
    Data_Cleanse = Data_Cleanse.astype(str)
    return Data_Cleanse
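Used together, the two functions chain like this (the file name is a placeholder):
df = Data_Cleanse(Data_IMP("my_file.json"))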
The dictionary looks like the following.
res = {'Qatar': ['68.61994212', '59.03245947', '55.10905996'],
       'Burundi': ['0.051012487', '0.048311391', '0.046681908'],
       'Japan': ['9.605144835', '9.247590692', '9.542878595']}
I want to get rid of the quotes and brackets (' and [ ]) in my CSV file.
I want the output CSV to be:
Qatar 68.61994212 59.03245947 55.10905996
Burundi 0.051012487 0.048311391 0.046681908
Japan 9.605144835 9.247590692 9.542878595
Try the code below. The reason you are getting '[]' is that you are writing the dict's value as-is, which is a list. Instead you need to build a flat row from the key and the list's values, then write that row.
import csv

res = {'Qatar': ['68.61994212', '59.03245947', '55.10905996'],
       'Burundi': ['0.051012487', '0.048311391', '0.046681908'],
       'Japan': ['9.605144835', '9.247590692', '9.542878595']}

with open('./result.csv', 'w', newline='') as res_file:
    csv_writer = csv.writer(res_file)
    for k, v in res.items():
        # prepend the key to its list of values before writing the row
        csv_writer.writerow([k] + v)
OUTPUT:
The contents of the file (result.csv) are as below:
Burundi,0.051012487,0.048311391,0.046681908
Japan,9.605144835,9.247590692,9.542878595
Qatar,68.61994212,59.03245947,55.10905996
Aside from Jay's answer, if you are allowed to use pandas then you can use pandas' to_csv function to make the CSV in a couple of lines:
import pandas as pd

df = pd.DataFrame(res)
# transpose so each country becomes a row, and skip the header row
df.T.to_csv('my_result.csv', header=False)
Try this:
[(k,) + tuple(res[k]) for k in res]
You will get a list of tuples like this, which you can write to a CSV file:
[('Burundi', '0.051012487', '0.048311391', '0.046681908'), ('Japan', '9.605144835', '9.247590692', '9.542878595'), ('Qatar', '68.61994212', '59.03245947', '55.10905996')]
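A minimal sketch of that write step (the file name is mine):
import csv

rows = [(k,) + tuple(res[k]) for k in res]
with open('result.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)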
Pandas will do it:
import pandas as pd

res = {'Qatar': ['68.61994212', '59.03245947', '55.10905996'],
       'Burundi': ['0.051012487', '0.048311391', '0.046681908'],
       'Japan': ['9.605144835', '9.247590692', '9.542878595']}

df = pd.DataFrame.from_dict(res, orient='index')
df.to_csv('res.csv', header=False)
Be sure to use orient='index' when creating the dataframe so that each key becomes a row in the CSV:
Qatar,68.61994212,59.03245947,55.10905996
Burundi,0.051012487,0.048311391,0.046681908
Japan,9.605144835,9.247590692,9.542878595