I am new here and need some help with writing to a JSON file:
I have a DataFrame with the below values, which is created by reading an Excel file.
I need to write this to a JSON file with the data nested under a "details" object.
Output:
A similar task is considered in the question:
Converting Excel into JSON using Python
Different approaches are possible to solve this problem.
I hope it works for your case.
import pandas as pd
import json
df = pd.read_excel('./TfidfVectorizer_sklearn.xlsx')
df.to_json('new_file1.json', orient='records') # excel to json
# read json and then append details to it
with open('./new_file1.json', 'r') as json_file:
    a = {}
    data = json.load(json_file)
    a['details'] = data

# write new json with details in it
with open("./new_file1.json", "w") as jsonFile:
    json.dump(a, jsonFile)
JSON Output:
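One of the other possible approaches is to build the wrapped structure in a single pass, without re-reading the intermediate file. A minimal sketch, assuming the same DataFrame (the output file name is illustrative):
import pandas as pd
import json

df = pd.read_excel('./TfidfVectorizer_sklearn.xlsx')

# convert the records to plain Python objects, then nest them under 'details'
records = json.loads(df.to_json(orient='records'))
wrapped = {'details': records}

with open('new_file2.json', 'w') as json_file:  # illustrative output name
    json.dump(wrapped, json_file)
This produces the same nested structure but only writes the file once.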
I need to open a gzipped file, that has a parquet file inside with some data. I am having so much trouble trying to print/read what is inside the file. I tried the following:
import gzip
with gzip.open("myFile.parquet.gzip", "rb") as f:
    data = f.read()
This does not seem to work, as I get an error that my file is not a gz file. Thanks!
You can use the read_parquet function from the pandas module:
Install pandas and pyarrow:
pip install pandas pyarrow
Use read_parquet, which returns a DataFrame:
import pandas as pd
data = pd.read_parquet("myFile.parquet.gzip")
print(data.count()) # example of operation on the returned DataFrame
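The reason gzip.open fails is that a .parquet.gzip file is normally a Parquet file whose column data is gzip-compressed internally, not a gzip archive, so the Parquet reader performs the decompression itself. If you prefer to work with pyarrow directly, a minimal sketch using the same file name:
import pyarrow.parquet as pq

table = pq.read_table("myFile.parquet.gzip")  # decompression is handled by the reader
df = table.to_pandas()
print(df.head())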
I have a very large CSV file (~12 GB) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the h5py library, while also lowering the total file size by setting the field/index types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using pandas' own methods, following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7 GB). The field types are float64 and int64.
I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing the affected lines using sed. Then I tried out the following code:
import pandas as pd
import h5py
PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each column.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype. The array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np
PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f8', 'f8', 'f8', 'i4', 'i4', 'i4')  # use 'f4' instead of 'f8' to store the positions as float32
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print (csv_data.dtype.names)
print (csv_data['posX'])
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
    # no explicit close() needed; the with statement closes the file
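The snippet above reads the entire CSV into memory with genfromtxt. Since the question also asks for chunked reading to avoid memory errors, here is a minimal sketch of a chunked variant (not from the original answer): it uses pandas.read_csv with chunksize and appends each chunk into a resizable h5py dataset with float32/int32 fields. The file names and the chunk size are illustrative.
import h5py
import numpy as np
import pandas as pd

CSV_path = 'myfile.csv'      # illustrative input path
H5_path = 'myfile.h5'        # illustrative output path
chunk_rows = 1_000_000       # rows per chunk; tune to the available memory

# compound dtype: float32 for the positions, int32 for the IDs and clockTime
out_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File(H5_path, 'w') as h5f:
    # resizable dataset so chunks can be appended as they are read
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,), dtype=out_dtype)
    for chunk in pd.read_csv(CSV_path, chunksize=chunk_rows):
        rec = np.empty(len(chunk), dtype=out_dtype)
        for name in out_dtype.names:
            rec[name] = chunk[name].to_numpy()   # casts to float32/int32
        start = dset.shape[0]
        dset.resize(start + len(rec), axis=0)
        dset[start:] = rec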
I have used the following code to parse a binary file using numpy. After reading the binary data, I want to save it somewhere because I want to use the extracted data for another purpose. I am trying to use np.save, but it is not saving anywhere. How do I save this data?
import numpy as np
with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)
print(data)
np.save(filenum, data)
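For reference, np.save expects a file path (or an open file object) as its first argument; in the snippet above, filenum would have to be such a path. A minimal sketch, with an illustrative output name:
import numpy as np

with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)

# 'parsed_data.npy' is an illustrative name; np.save adds the .npy extension
# automatically if it is missing
np.save('parsed_data.npy', data)

# the array can later be restored with np.load
restored = np.load('parsed_data.npy')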
I am quite sure that my ARFF files are correct; to check, I downloaded different files from the web and successfully opened them in Weka.
But I want to use my data in Python, so I typed:
import arff
data = arff.load('file_path','rb')
It always returns an error message: Invalid layout of the ARFF file, at line 1.
Why did this happen, and what should I do to make it right?
If you change your code as shown below, it'll work.
import arff
data = arff.load(open('file_path'))
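Assuming the arff package here is liac-arff, load returns a dictionary; a minimal sketch of accessing its parts:
import arff

with open('file_path') as f:
    data = arff.load(f)

print(data['relation'])    # relation name
print(data['attributes'])  # list of (name, type) pairs
rows = data['data']        # list of data rows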
Using scipy, we can load ARFF data in Python:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')
df = pd.DataFrame(data[0])
df.head()
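One thing to keep in mind with scipy.io.arff: loadarff returns a (data, meta) tuple, and nominal attributes come back as byte strings, so they usually need decoding before further use. A minimal sketch (the decoding step is illustrative):
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff('dataset.arff')
df = pd.DataFrame(data)

# decode byte-string (nominal) columns into regular strings
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.decode('utf-8')

print(df.head())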