I have a trajectory file from a molecular simulation written in netCDF format. I would like to convert this file to .csv format so that I can run further Python-based analysis of the proximity between molecules. The trajectory file contains the 3D Cartesian coordinates of all 6500 atoms in my simulation at each time step.
I have used the script below to convert this netCDF file to a .csv file using the netCDF4 and pandas modules:
import netCDF4
import pandas as pd
fp='TEST4_simulate1.traj'
dataset = netCDF4.Dataset(fp, mode='r')
cols = list(dataset.variables.keys())
list_dataset = []
for c in cols:
    list_dataset.append(list(dataset.variables[c][:]))
#print(list_dataset)
df_dataset = pd.DataFrame(list_dataset)
df_dataset = df_dataset.T
df_dataset.columns = cols
df_dataset.to_csv("file_path.csv", index = False)
A small selection of the output .csv file is given below. Notice that an ellipsis appears between the first three and last three sets of atomic coordinates.
time,spatial,coordinates
12.0,b'x',"[[ 33.332325 -147.24976 -107.131 ]
[ 34.240444 -147.80115 -107.4043 ]
[ 33.640083 -146.47362 -106.41945 ]
...
[ 70.31757 -16.499006 -186.13313 ]
[ 98.310844 65.95696 76.43664 ]
[ 84.08772 52.676186 145.48856 ]]"
How can I modify this code so that the entirety of my atomic coordinates are written to my .csv file?
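For reference, here is a minimal sketch of one possible approach, assuming the 'coordinates' variable has shape (n_frames, n_atoms, 3) and 'time' has shape (n_frames,). The ellipsis in the .csv above is numpy's truncated string representation of each per-frame array being written into a single cell, so the idea is to flatten the coordinates into one row per atom per frame instead:
import netCDF4
import numpy as np
import pandas as pd

dataset = netCDF4.Dataset('TEST4_simulate1.traj', mode='r')
coords = np.asarray(dataset.variables['coordinates'][:])  # assumed shape (n_frames, n_atoms, 3)
times = np.asarray(dataset.variables['time'][:])          # assumed shape (n_frames,)
n_frames, n_atoms, _ = coords.shape

# long format: one row per atom per frame, with explicit x/y/z columns
df = pd.DataFrame(coords.reshape(-1, 3), columns=['x', 'y', 'z'])
df.insert(0, 'atom', np.tile(np.arange(n_atoms), n_frames))
df.insert(0, 'time', np.repeat(times, n_atoms))
df.to_csv('trajectory_full.csv', index=False)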
Related
I am currently working on a project where I need to collect coordinates and transfer them to a csv file. I am using the k-means algorithm to find the coordinates (the centroids of a larger coordinate collection). The output is a list of coordinates. At first I wanted to simply copy it to an Excel file, but that did not work as well as I wanted it to.
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_excel("centroid coordinaten excel lijst.xlsx")
df.head(n=16)
plt.scatter(df.X,df.Y)
km = KMeans(n_clusters=200)
print(km)
y_predict = km.fit_predict(df[['X','Y']])
print(y_predict)
df['cluster'] = y_predict
kmc = km.cluster_centers_
print(kmc)
#The output kmc is the list with coordinates and it looks like this:
[[ 4963621.73063468 52320928.30284858]
[ 4981357.33667335 52293627.08917835]
[ 4974134.37538941 52313274.21495327]
[ 4945992.84398977 52304446.43606138]
[ 4986701.53977273 52317701.43831169]
[ 4993362.9143898 52296985.49271403]
[ 4949408.06109325 52320541.97963558]
[ 4966756.82872596 52301871.5655048 ]
[ 4980845.77591313 52324669.94175716]
[ 4970904.14472671 52292401.47190146]]
Is there anybody who knows how to convert the 'kmc' output into a csv file?
Thanks in advance!
You could use the csv library as follows:
import csv
kmc = [
    [4963621.73063468, 52320928.30284858],
    [4981357.33667335, 52293627.08917835],
    [4974134.37538941, 52313274.21495327],
    [4945992.84398977, 52304446.43606138],
    [4986701.53977273, 52317701.43831169],
    [4993362.9143898, 52296985.49271403],
    [4949408.06109325, 52320541.97963558],
    [4966756.82872596, 52301871.5655048],
    [4980845.77591313, 52324669.94175716],
    [4970904.14472671, 52292401.47190146],
]
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['lat', 'long'])
    csv_output.writerows(kmc)
Giving you output.csv containing:
lat,long
4963621.73063468,52320928.30284858
4981357.33667335,52293627.08917835
4974134.37538941,52313274.21495327
4945992.84398977,52304446.43606138
4986701.53977273,52317701.43831169
4993362.9143898,52296985.49271403
4949408.06109325,52320541.97963558
4966756.82872596,52301871.5655048
4980845.77591313,52324669.94175716
4970904.14472671,52292401.47190146
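As a side note, since km.cluster_centers_ is already a NumPy array, numpy.savetxt is a one-line alternative sketch (the 'lat,long' header is just a placeholder for whatever your columns really are):
import numpy as np
np.savetxt('output.csv', kmc, delimiter=',', header='lat,long', comments='')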
I suggest you put a full path to your output file to ensure you have write permission. Or as suggested, use sudo. Alternatively, you could add the following to the top:
import os
os.chdir(os.path.dirname(os.path.abspath(__file__)))
This ensures the output will be in the same folder as your script.
Good day everyone.
I was hoping someone here could help me with a bit of a problem. I've run an experiment where data was gathered from 6 separate sensors simultaneously, and the data was then exported to a single shared txt file. Now I need to import the data into Python to analyze it.
I know I can do this by copy&pasting the data output from each sensor into a separate document and then importing those in a loop - but that is a lot of work and introduces a high potential for human error.
But is there a way of using readline to read specific lines and port them into a pandas DataFrame? There is fixed header spacing, and fixed line spacing between each sensor.
I tried:
import pandas as pd

f = open('OR0024622_auto3200.txt')
lines = f.readlines()
base = 83
sensorlines = 6400
Sensor=[]
Sensor = lines[base:sensorlines+base]
df_sens = pd.DataFrame(Sensor)
df_sens
but the output isn't very useful:
(screenshot of the raw DataFrame output)
Here's the file I am importing:
link.
Any suggestions?
Looks like tab-separated data.
use
>>> df = pd.read_csv('OR0024622_auto3200.txt', delimiter=r'\t', skiprows=83, header=None, nrows=38955-84)
>>> df.tail()
0 1 2
38686 6397 3.1980000000e+003 9.28819e-009
38687 6398 3.1985000000e+003 9.41507e-009
38688 6399 3.1990000000e+003 1.11703e-008
38689 6400 3.1995000000e+003 9.64276e-009
38690 6401 3.2000000000e+003 8.92203e-009
>>> df.head()
0 1 2
0 1 0.0000000000e+000 6.62579e+000
1 2 5.0000000000e-001 3.31289e+000
2 3 1.0000000000e+000 2.62362e-011
3 4 1.5000000000e+000 1.51130e-011
4 5 2.0000000000e+000 8.35723e-012
abhilb's answer is to the point and correct, but there is a lot to be said regarding loading/reading files. A quick browser search will take you a long way (I encourage you to read up on this!), but I'll add a few details here:
If you want to load multiple files that match a pattern you can do so iteratively via glob:
import pandas as pd
from glob import glob as gg
filePattern = "/path/to/file/*.txt"
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
This will load each file one-by-one. What if you want to put all data into a single dataframe? Do this:
masterDF = pd.DataFrame()
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
    masterDF = pd.concat([masterDF, df], axis=0)
This works great for pandas, but what if you want to read into a numpy array?
import numpy as np
# using previous imports
base = 83
sensorlines = 6400
# create an empty array that has three columns
masterArray = np.full((0, 3), np.nan)
for fileName in gg(filePattern):
    # open the file (NOTE: this does not read the file, just puts it in a buffer)
    with open(fileName, "r") as tmp:
        # now read the file and split it on the newline character (could also be "\r\n")
        # you now have a list of strings
        data = tmp.read().split("\n")
        # keep only the "data" portion of the file
        data = data[base:sensorlines + base]
        # convert the list of strings to an array of floats
        # here, I use a "list comprehension" for speed and simplicity
        data = np.array([r.split("\t") for r in data]).astype(float)
        # stack the new data onto the master array
        masterArray = np.vstack([masterArray, data])
Opening a file via the "with open(fileName, "r")" syntax is handy because Python automatically closes the file when you are done. If you don't use "with" then you must manually close the file (e.g. tmp.close()).
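For comparison, a rough equivalent without "with" looks like this; the close has to be done by hand, and a try/finally guards against an exception leaving the file open:
tmp = open(fileName, "r")
try:
    data = tmp.read().split("\n")
finally:
    tmp.close()  # must be called manually when not using "with"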
These are just some starting points to get you on your way. Feel free to ask for clarification.
I have a CSV file of tab-separated data with headers and data of different types which I would like to convert into a dictionary of vectors. Eventually I would like to convert the dictionary into numpy arrays, and store them in some binary format for fast retrieval by different scripts. This is a large file with approximately 700k records and 16 columns. The following is a sample:
"answer_option" "value" "fcast_date" "expertise"
"a" 0.8 "2013-07-08" 3
"b" 0.2 "2013-07-08" 3
I have started implementing this with the DictReader class, which I'm just learning about.
import csv
with open( "filename.tab", 'r') as records:
reader = csv.DictReader( records, dialect='excel-tab' )
row = list( reader )
n = len( row )
d = {}
keys = list( row[0] )
for key in keys :
a = []
for i in range(n):
a.append( row[i][key] )
d [key] = a
which gives the result
{'answer_option': ['a', 'b'],
'value': ['0.8', '0.2'],
'fcast_date': ['2013-07-08', '2013-07-08'],
'expertise': ['3', '3']}
Besides the small nuisance of having to clean the enclosing quotation characters from the values, I thought that perhaps there is something ready-made. I'm also wondering if there is anything that extracts directly from the file into numpy vectors, since I do not necessarily need to transform my data into dictionaries.
I took a look at SciPy.org, and a search for CSV also points to HDF5 and genfromtxt, but I haven't dived into those suggestions yet. Ideally I would like to store the data in a fast-to-load format, so that it is simple to load from other scripts with a single command and all vectors are made available, the same way it is possible in Matlab/Octave. Suggestions are appreciated.
EDIT: the data are tab-separated with strings enclosed in quotation marks.
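(For reference, a minimal sketch of one ready-made route, assuming the file really is tab-separated: pandas strips the enclosing quotes itself via its default quotechar, and numpy's savez/load gives a fast binary store keyed by column name. The file names below are placeholders.)
import numpy as np
import pandas as pd

df = pd.read_csv('filename.tab', sep='\t')               # quoted fields are unquoted automatically
vectors = {col: df[col].to_numpy() for col in df.columns}

# one binary file holding every column, reloadable with a single call
np.savez('filename.npz', **vectors)
loaded = np.load('filename.npz', allow_pickle=True)      # allow_pickle is needed for the string columns
values = loaded['value']                                  # e.g. the 'value' column as a numpy array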
The code below will read the csv into a Pandas data frame and remove the quotes:
import pandas as pd
import csv
import io
with open('data_with_quotes.csv') as f_input:
    data = [next(csv.reader(io.StringIO(line.replace('"', '')))) for line in f_input]

df = pd.DataFrame(data[1:], columns=data[0])
print(df)
answer_option value fcast_date expertise
0 a 0.8 2013-07-08 3
1 b 0.2 2013-07-08 3
You can easily convert the data to a numpy array using df.values:
array([['a', '0.8', '2013-07-08', '3'],
['b', '0.2', '2013-07-08', '3']], dtype=object)
To save the data in a binary format, I recommend using HDF5:
import h5py
# note: this assumes the DataFrame converts cleanly to a homogeneous numpy array;
# object/string columns may need an explicit string dtype in h5py
with h5py.File('file.hdf5', 'w') as f:
    dset = f.create_dataset('default', data=df)
To load the data, use the following:
with h5py.File('file.hdf5', 'r') as f:
    data = f['default']
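One usage detail worth noting: f['default'] is only a handle into the open file, so slice it while the file is still open if you want the values in memory as a NumPy array after the with block ends:
with h5py.File('file.hdf5', 'r') as f:
    data = f['default'][:]  # copy the dataset into a numpy array before the file closes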
You can also use Pandas to save and load the data in binary format:
# Save the data
df.to_hdf('data.h5', key='df', mode='w')
# Load the data
df = pd.read_hdf('data.h5', 'df')
I have video files in three directories (Train, Development, Test) and there are two .csv files (the features are in one .csv file, but the labels and file names are in the other .csv file). I'm trying to make a dataframe by selecting some columns from the two .csv files to feed to a Keras CNN model.
Below is what I tried in order to read some features from the 1st .csv file, but I don't know if this is the right way to do it:
import io
import numpy as np
import pandas as pd

def read_data(file_path):
    data = pd.read_csv(file_path, sep=',')
    data = data.iloc[0:1560, [0, 680, 681, 682, 683, 684]]
    return data

df = read_data('dev_001.csv')
print(df.shape)
df.head()
data_frame = pd.DataFrame(df)
(output: the extracted features saved in data_frame)
I now want to insert this data frame into one cell, so that I have something like this:
Table:
Features                                        | Subject_id        | labels
Extracted features from .csv file (data_frame)  | ID from .csv file | Label from .csv file
where I need all the features from the .csv file in the first cell. How can this be done?
I tried this:
new_df = pd.DataFrame({'Features': [temp_data_frame]})
print(temp_data_frame.shape)
print(new_df.shape)
(1560, 35)
(1, 1)
But it only takes the first row from temp_data_frame.
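For what it's worth, a rough sketch of the one-row table described above, keeping the whole features frame in a single object cell; the Subject_id and labels values here are placeholders, not taken from the real .csv files:
new_df = pd.DataFrame({'Features': [temp_data_frame],       # whole DataFrame stored in one cell
                       'Subject_id': ['placeholder_id'],     # hypothetical id value
                       'labels': ['placeholder_label']})     # hypothetical label value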
I have many large .csv files that I want to convert to .nc (i.e. netCDF) files using xarray. However, I found that saving the .nc files takes a very long time, and the resulting .nc files are much larger (4x to 12x larger) than the original .csv files.
Below is sample code showing how the same data produces a .nc file that is about 4 times larger than when saved as .csv:
import pandas as pd
import xarray as xr
import numpy as np
import os
# Create pandas DataFrame
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100000,5)),
columns=['a', 'b', 'c', 'd', 'e'])
# Make 'e' a column of strings
df['e'] = df['e'].astype(str)
# Save to csv
df.to_csv('df.csv')
# Convert to an xarray's Dataset
ds = xr.Dataset.from_dataframe(df)
# Save NetCDF file
ds.to_netcdf('ds.nc')
# Compute stats
stats1 = os.stat('df.csv')
stats2 = os.stat('ds.nc')
print('csv=',str(stats1.st_size))
print('nc =',str(stats2.st_size))
print('nc/csv=',str(stats2.st_size/stats1.st_size))
The result:
>>> csv = 1688902 bytes
>>> nc = 6432441 bytes
>>> nc/csv = 3.8086526038811015
As you can see, the .nc file is about 4 times larger than the .csv file.
I found this post suggesting that changing from type 'string' to type 'char' drastically reduces file size, but how do I do this in xarray?
Also, note that even with all data as integers (i.e. commenting out df['e'] = df['e'].astype(str)), the resulting .nc file is still 50% larger than the .csv.
Am I missing a compression setting? ...or something else?
I found an answer to my own question...
1. Enable compression for each variable.
2. For column e, specify that the dtype is "character" (i.e. S1).
Before saving the .nc file, add the following code:
encoding = {'a':{'zlib':True},
'b':{'zlib':True},
'c':{'zlib':True},
'd':{'zlib':True},
'e':{'zlib':True, 'dtype':'S1'}}
ds.to_netcdf('ds.nc',format='NETCDF4',engine='netcdf4',encoding=encoding)
The new results are:
>>> csv = 1688902 bytes
>>> nc = 1066182 bytes
>>> nc/csv = 0.6312870729029867
Note that it still takes a bit of time to save the .nc file.
As you only use values from 0 to 9, one byte per value is sufficient to store the data in the CSV file. xarray uses int64 (8 bytes) by default for integers.
To tell xarray to use 1-byte integers, you can use this:
ds.to_netcdf('ds2.nc',encoding = {'a':{'dtype': 'int8'},
'b':{'dtype': 'int8'}, 'c':{'dtype': 'int8'},
'd':{'dtype': 'int8'}, 'e':{'dtype': 'S1'}})
The resulting file is 1307618 bytes. Compression will reduce the file size even more, especially for non-random data :-)
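For what it's worth, the two ideas (small integer dtypes and zlib compression) can be combined in one encoding dict, roughly like this:
encoding = {v: {'dtype': 'int8', 'zlib': True, 'complevel': 4} for v in ['a', 'b', 'c', 'd']}
encoding['e'] = {'dtype': 'S1', 'zlib': True}
ds.to_netcdf('ds3.nc', format='NETCDF4', engine='netcdf4', encoding=encoding)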