I am trying to write a 100x6 numpy array to a data file using
# print array to file
data = open("store.dat", "w")
for i in xrange(len(array)):
    data.write(str(array[i]) + "\n")
but when the file is written, each line is automatically split after 68 characters, i.e. after 5 columns of the 6-column array. For example, I want to see
[ 35.47842918 21.82382715 3.18277209 0.38992263 1.17862342 0.46170848]
as one of the lines, but am getting
[ 35.47842918 21.82382715 3.18277209 0.38992263 1.17862342
0.46170848]
I've narrowed it down to a problem with str(array[i]), which decides to insert a line break on its own.
Secondly, is there a better way to go about this? I'm very new to Python and know little about writing it properly. Ultimately, I'm writing out a simulation of stars to later be read by a module that will render the coordinates in VPython. I thought that, to avoid the problems of real-time rendering, I could just pass a file once the simulation is complete. Is this inefficient? Should I be passing an object instead?
Thank you
Perhaps it would be more convenient to write it using numpy.save instead?
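For instance, a minimal sketch (assuming array is the 100x6 simulation array from the question); numpy.save writes a binary .npy file, so no text line-wrapping is involved, and numpy.load reads it back unchanged:
import numpy as np

np.save("store.npy", array)    # array is the 100x6 simulation array from the question
loaded = np.load("store.npy")  # e.g. later, in the VPython rendering module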
Alternatively, you can also use:
numpy.array_str(array[i], max_line_width=1000000)
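For example, a minimal sketch of the original loop using that call (array is assumed to be the 100x6 array from the question), so each row stays on one line:
import numpy

data = open("store.dat", "w")
for i in xrange(len(array)):
    # max_line_width keeps the whole row on a single line
    data.write(numpy.array_str(array[i], max_line_width=1000000) + "\n")
data.close()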
Hello, I was given an h5 file with an xdmf file associated with it. The visualization looks like this. Here the color is just the feature ID. I wanted to add some data to the h5 file to be able to visualize it in ParaView. The newly added data does not appear in ParaView, although it is clearly there when using HDFView. The data I'm trying to add are the datasets titled engineering stress and true stress. The only difference I noticed is that the number of attributes for these is zero while it's 5 for the rest, but I don't know what to do with that information.
Here's the code I currently have set up:
import h5py
import numpy as np
from tqdm import tqdm

nf_product = h5py.File(filename, "a")
e_princ = np.empty((280, 150, 280, 3))
t_princ = e_princ
for i in tqdm(range(grain_count)):
    a = np.where(feature_ID == i+1)
    e_princ[a,0] = eng_stress[i,0]
    e_princ[a,1] = eng_stress[i,1]
    e_princ[a,2] = eng_stress[i,2]
    t_princ[a,0] = true_stress[i,0]
    t_princ[a,1] = true_stress[i,1]
    t_princ[a,2] = true_stress[i,2]
EngineeringStress = nf_product.create_dataset('DataContainers/nfHEDM/CellData/EngineeringStressPrinciple', data=np.float32(e_princ))
TrueStress = nf_product.create_dataset('DataContainers/nfHEDM/CellData/TrueStressPrinciple', data=np.float32(t_princ))
I am new to using h5 and xdmf files, so I may be going about this entirely wrong, but the way I understand it an xdmf file acts as a pointer to the data in the h5 file, so I can't understand why the new data doesn't appear in ParaView.
First, did you close the file with nf_product.close()? If not, new datasets may not have been flushed from memory. You may also need to flush the buffers with nf_product.flush(). Better, use the Python with / as: file context manager and it is done automatically.
Next, you can simply use data=e_princ (and data=t_princ); there is no need to cast a numpy array to a numpy array.
Finally, verify the values in e_princ and t_princ. I think they will be the same, because t_princ = e_princ makes both names reference the same numpy object. You need to create t_princ as its own empty array, the same as e_princ. Also, the arrays have 4 dimensions, and you only give 2 indices when you populate them with [a,0]. Be sure that works as expected.
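Putting those points together, a minimal sketch of how the snippet could look (this assumes feature_ID is a (280, 150, 280) array of grain labels, and that filename, grain_count, eng_stress and true_stress are defined as in the question):
import h5py
import numpy as np
from tqdm import tqdm

# separate arrays, so e_princ and t_princ no longer alias each other;
# float32 to match the np.float32 cast in the question
e_princ = np.empty((280, 150, 280, 3), dtype=np.float32)
t_princ = np.empty((280, 150, 280, 3), dtype=np.float32)

for i in tqdm(range(grain_count)):
    mask = (feature_ID == i+1)           # boolean mask over the assumed 3-D grid
    e_princ[mask, :] = eng_stress[i, :]  # broadcast the 3 stress components into the last axis
    t_princ[mask, :] = true_stress[i, :]

# the context manager flushes and closes the file automatically
with h5py.File(filename, "a") as nf_product:
    nf_product.create_dataset('DataContainers/nfHEDM/CellData/EngineeringStressPrinciple', data=e_princ)
    nf_product.create_dataset('DataContainers/nfHEDM/CellData/TrueStressPrinciple', data=t_princ)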
I have a script that loops through ~335k filenames, opens the fits-tables from the filenames, performs a few operations on the tables and writes the results to a file. In the beginning the loop goes relatively fast, but with time it consumes more and more RAM (and CPU resources, I guess) and the script also gets slower. I would like to know how I can improve the performance / make the code quicker. E.g., is there a better way to write to the output file (open the output file once and do everything within a with-open block vs. opening the file every time I want to write to it)? Is there a better way to loop? Can I dump memory that I don't need anymore?
My script looks like that:
#spectral is a package for manipulation of spectral data
from spectral import *
import numpy as np
from astropy.table import Table  # Table.read(..., format='fits') below assumes astropy tables

# I use this dictionary to store functions that I don't want to regenerate each time I need them.
# Regenerating them every time would be more time consuming, I figured out.
lam_resample_dic = {}

with open("/home/bla/Downloads/output.txt", "ab") as f:
    for fname, ind in zip(list_of_fnames, range(len(list_of_fnames))):
        data_s = Table.read('/home/nestor/Downloads/all_eBoss_QSO/'+fname, format='fits')
        # lam_str_identifier is just the dic-key I need for finding the corresponding BandResampler function from below
        lam_str_identifier = ''.join([str(x) for x in data_s['LOGLAM'].data.astype(str)])
        if lam_str_identifier not in lam_resample_dic:
            # BandResampler is the function I avoid recreating every time;
            # I do it only if necessary - when lam_str_identifier indicates a unique new set of data
            resample = BandResampler(centers1=10**data_s['LOGLAM'], centers2=df_jpas["Filter.wavelength"].values, fwhm2=df_jpas["Filter.width"].values)
            lam_resample_dic[lam_str_identifier] = resample
            photo_spec = np.around(resample(data_s['FLUX']), 4)
        else:
            photo_spec = np.around(lam_resample_dic[lam_str_identifier](data_s['FLUX']), 4)
        np.savetxt(f, [photo_spec], delimiter=',', fmt='%1.4f')
        # this is just to keep track of the progress of the loop
        if ind % 1000 == 0:
            print('num of files processed so far:', ind)
Thanks for any suggestions!
I have a jupyter notebook where I run the same simulation using many different combinations of parameters (essentially, to simulate different versions of environment and their effect on the results). Let's say that the result of each run is an image and a 2d array of all relevant metrics for my system. I want to be able to keep the images in notebook, but save the arrays all in one place, so that I can work with them later on if needed.
Ideally I would save them into an external file with the following format:
'Experiment environment version i' (or some other description)
2d array
and every time I would run a new simulation (a new cell) the results would be added into this file until I close it.
Any ideas how to end up with such external summary file?
If you have excel available to you then you could use pandas to write the results to a spreadsheet (or you could use pandas to write to a csv). See the documentation here, but essentially you would do the following when appending and/or using a new sheet:
import pandas as pd

for i in results:
    with pd.ExcelWriter('results.xlsx', mode='a') as writer:
        df.to_excel(writer, sheet_name='Result' + str(i))
You will need to have your array in dataframe 'df', there are lots of tutorials on how to put an array into pandas.
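As a rough sketch of that last step (a hypothetical results dict mapping a description to a 2-D metrics array is assumed here, as is the openpyxl engine for appending to an existing workbook):
import numpy as np
import pandas as pd

# hypothetical results: description -> 2-D array of metrics from one simulation run
results = {"environment_v1": np.random.rand(5, 3), "environment_v2": np.random.rand(5, 3)}

for idx, (name, arr) in enumerate(results.items()):
    mode = "w" if idx == 0 else "a"   # first write creates the workbook, later writes append sheets
    with pd.ExcelWriter("results.xlsx", mode=mode, engine="openpyxl") as writer:
        pd.DataFrame(arr).to_excel(writer, sheet_name=name)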
After a bit of trial and error, here is a general answer for how to write to txt (without pandas; otherwise see #jaybeesea's answer):
with open("filename.txt", "a+") as f:
f.write("Comment 1 \n")
f.write("%s \n" %np.array2string(array, separator=' , '))
Every time you run it, it adds to the file "f".
I have a text file which is a huge set of data (around 9 GB). I have arranged the file as 244 x 3089987 with data delimited by tabs. I would like to load this text file in Matlab as a matrix. Here is what I have tried, and I have been unsuccessful (my Matlab hangs).
fread = fopen('merge.txt','r');
formatString = repmat('%f',244,3089987);
C = textscan(fread,formatString);
Am I doing something wrong or is my approach wrong? If this is easily possible in Python, could someone please suggest accordingly.
If you read the documentation for textscan you will see that you can define an input argument N so that:
textscan reads file data using the formatSpec N times, where N is a
positive integer. To read additional data from the file after N
cycles, call textscan again using the original fileID. If you resume a
text scan of a file by calling textscan with the same file identifier
(fileID), then textscan automatically resumes reading at the point
where it terminated the last read.
You can also pass a blank formatSpec to textscan in order to read in an arbitrary number of columns. This is how dlmread, a wrapper for textscan, operates.
For example:
fID = fopen('test.txt');
chunksize = 10; % Number of lines to read for each iteration
while ~feof(fID) % Iterate until we reach the end of the file
    datachunk = textscan(fID, '', chunksize, 'Delimiter', '\t', 'CollectOutput', true);
    datachunk = datachunk{1}; % Pull data out of cell array. Can take time for large arrays
    % Do calculations
end
fclose(fID);
This will read in 10 line chunks until you reach the end of the file.
If you have enough RAM to store the data (a 244 x 3089987 array of double is just over 6 gigs) then you can do:
mydata = textscan(fID, '', 'Delimiter', '\t', 'CollectOutput', true);
mydata = mydata{1}; % Pull data out of cell array. Can take time for large arrays
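Since the question also asks about a Python route: a minimal sketch of the same chunked idea using pandas (assuming the file is plain tab-delimited numeric data; passing chunksize makes read_csv yield blocks of rows instead of loading the whole 9 GB at once):
import pandas as pd

reader = pd.read_csv('merge.txt', sep='\t', header=None, chunksize=10)
for chunk in reader:
    data = chunk.to_numpy()  # block of up to 10 rows as a numpy array
    # Do calculations on this block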
try:
A = importdata('merge.txt', '\t');
http://es.mathworks.com/help/matlab/ref/importdata.html
and if the rows are not delimited by '\n':
[C, remaining] = vec2mat(A, 244)
http://es.mathworks.com/help/comm/ref/vec2mat.html
Another option in recent MATLAB releases is to use datastore. This has the advantage of being designed to allow you to page through the data, rather than read the whole lot at once. It can generally deduce all the formatting stuff.
http://www.mathworks.com/help/matlab/import_export/read-and-analyze-data-in-a-tabulartextdatastore.html
I'm surprised this is even trying to run; when I try something similar, textscan throws an error.
If you really want to use textscan you only need the format for each row, so you can replace 244 in your code with 1 and it should work. Edit: having read your comment, note that the first element is the number of columns, so you should do formatString = repmat('%f',1, 244);. Also, you can apparently just leave the format empty ('') and it will work.
However, Matlab has several text import functions of which textscan is rarely the easiest way to do something.
In this case I would probably use dlmread, which handles any delimited numerical data. You want something like:
C=dlmread('merge.txt', '\t');
Also, as you are trying to load 9 GB of data I assume you have enough memory; you'll probably get an out of memory error if you don't, but it is something to consider.
I want to read in a tab separated value file and turn it into a numpy array. The file has 3 lines. It looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple line of code:
import csv
import numpy as np

job_posts = []
with open('/home/me/my_tsv.tsv') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        job_posts.append(line)
new = np.array(job_posts)
print new.shape
And because of that pesky " character, the shape of my fancy new numpy array is
(2,4)
That's not right! So, the solution is to include the argument, quoting, in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is
(3,4)
as I hoped.
Now comes the real problem -- in all actuality, I have a 700,000 X 10 .tsv file, with long fields. I can read the file into Python with no problems, just like in the above situation. But, when I get to the step where I create new = np.array(job_posts), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot simultaneously have these two objects in memory -- the Python list of lists and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar...yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?
So far, I've found no analogy to the quoting=csv.QUOTE_NONE option available anywhere in the standard ways to read in a tsv file using numpy.
This is a tough little problem. I thought about iteratively building the numpy array during the reading-in process, but I can't figure it out.
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print nparray.shape
and got
(3,0)
If anyone has any advice I'd be greatly appreciative. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.