How to Import multiple CSV files then make a Master Table? - python

I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)'. I have repeated this experiment 15 times, changing a specific parameter (collision energy) each time. As a result, I have 15 CSV files and would like to align/join them over the same range of m/z values with the same interval. Due to the instrument's thresholding rules, certain m/z values were not recorded, so the files cannot simply be exported into Excel and copy/pasted. The data looks a bit like the tables posted below.
Dataset 1: x   | y          Dataset 2: x   | y
           ----+---                    ----+---
           0.0 | 5                     0.0 | 2
           0.5 | 3                     0.5 | 6
           2.0 | 7                     1.0 | 9
           3.0 | 1                     2.5 | 1
                                       3.0 | 4
Using matlab I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesn't matter to me, as I can change it later; however, I would like the table to continue like this:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in MATLAB so that it is at least semi-automated? I posted earlier describing a similar scenario, but it turned out that approach could not do what I need. I must admit that I do not have a programmer's mind, so I have been struggling with this problem quite a bit.
PS- Is this problem best executed in Matlab or Python?

I don't know or use MATLAB, so my answer is purely Python based. I think Python and MATLAB should be equally well suited to reading CSV files and generating a master table.
Please consider this answer more as pointer to how to address the problem in python.
In python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can read natively a large set of file formats including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')
master_table = pd.concat([df, df2])
Pandas further allows you to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
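For the specific master table in the question (one counts column per file, joined on the shared m/z values), an outer merge on the key column may be closer than a plain concat. A rough sketch, assuming each CSV has XThompson and YCounts columns and that the 15 files sit in the working directory:
import glob
import pandas as pd

master = None
for i, path in enumerate(sorted(glob.glob('*.csv')), start=1):
    df = pd.read_csv(path)                                    # expects XThompson, YCounts columns
    df = df.rename(columns={'YCounts': 'YCounts_{}'.format(i)})
    master = df if master is None else master.merge(df, on='XThompson', how='outer')

master = master.sort_values('XThompson').fillna(0)            # missing m/z values become 0 counts
master.to_csv('master_table.csv', index=False)
The fillna(0) step mirrors the NaN-to-zero replacement in the MATLAB snippet above.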

The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This first step is not needed (it just recreates your example tables); I copied dataset2 to create dataset3 for illustration. You could use readtable to import your data, i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge the tables using outerjoin. You could use a loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as an Excel sheet using writetable:
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');

I managed to create a solution to my problem based on learning from everyone's input and taking an online MATLAB course. I am not a natural coder, so my script is not as elegant as those from the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.
% Create a matrix with the input x range then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange)
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});

files = dir('*.csv');
for q = 1:length(files)
    % Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
    data{q} = csvread(files(q).name,2,1);
    data1 = data(q);
    data2 = cell2mat(data1);
    data3 = array2table(data2,'VariableNames',{'XAxis','YAxis'});
    % Join the datarange table and the intensity table to obtain an evenly spaced m/z range
    data3 = outerjoin(datarange,data3,'MergeKeys',true);
    data3.YAxis(isnan(data3.YAxis)) = 0;
    data3.XAxis = round(data3.XAxis,1);
    % Remove duplicate values
    data4 = sortrows(data3,[1 -2]);
    [~, idx] = unique(data4.XAxis);
    data4 = data4(idx,:);
    % Save the file as the same name in CSV without underscores or dashes
    filename = files(q).name;
    filename = strrep(filename,'_','');
    filename = strrep(filename,'-','');
    filename = strrep(filename,'.csv','');
    writetable(data4,filename,'FileType','text');
    clear data data1 data2 data3 data4 filename
end
clear

HDF5 tagging datasets to events in other datasets

I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.
Imagine I am measuring temperature over time, and then every 10 degrees increase in temperature I sample a micro at 200khz, I want to be able to tag the large burst of micro data to a timestamp in the time-series data. Maybe even in the form of a figure.
I was trying to do this with region references, but am struggling to find an elegant solution, and I'm finding myself juggling between the pandas HDFStore and h5py; it just feels messy.
Initially I thought I would be able to make separate datasets from the burst-data then use reference or links to timestamps in the time-series data. But no luck so far.
Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!
How did you use region references? I assume you had an array of references, alternating between ranges of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
Virtual Datasets might be a more elegant solution...but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation. HDF5/h5py handles everything under the covers.
To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:
1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (one virtual source per file/dataset combination).
3. Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
4. Repeat steps 2 and 3 for all sources (or slices of sources).
Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)
There are at least 3 other SO questions and answers on this topic:
h5py, enums, and VirtualLayout
h5py error reading virtual dataset into NumPy array
How to combine multiple hdf5 files into one file and dataset?
Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.
import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i,0] = time
    #temp = 70.+ 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')
arr[:,1] = 70.+ 100.*arr[:,0]
#print(arr)

with h5py.File('SO_72654160.h5','w') as h5f:
    h5f.create_dataset('data_log',data=arr)

n_bursts = 4
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1,n_bursts):
    arr = np.zeros((burst_ntimes-1,2))
    for i in range(1,burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1,0] = time
        #temp = 70.+ 100.*t
    arr[:,1] = 70.+ 100.*arr[:,0]

    with h5py.File('SO_72654160.h5','a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}',data=arr)
Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)
source_file = 'SO_72654160.h5'
a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]
print(f'Total data rows in source = {a0}')

# alternate getting data from
# dataset: data_log, get rows 0-11, 11-21, 21-31
# datasets: burst_log_01, burst_log_02, etc (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2), dtype=float)

# Map virtual dataset to logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2
layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2
layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2

# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
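Once the virtual dataset exists, reading it back is plain slice notation; a short sketch using the file and row layout from the example above:
import h5py

# Read the virtual dataset back like any ordinary dataset; h5py resolves the
# mapped sources transparently. Row ranges follow the layout defined above.
with h5py.File('SO_72654160_VDS.h5', 'r') as h5f:
    vdata = h5f['vdata']
    print(vdata.shape)            # (61, 2): 31 logged rows plus 3 bursts of 10
    first_burst = vdata[11:21]    # rows mapped from 'burst_log_01'
    print(first_burst[:, 0])      # time stamps of the first burst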

ways to improve efficiency of Python script

I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes), and 12 files corresponding to DNA reads. I have a Python script that searches for reads overlapping each gene's coordinates and stores the counts in a dictionary. I then use this dictionary to create a pandas dataframe and save it as a CSV. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}

#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)

#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)

#find read counts
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)

df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
If I understood correctly, you are trying to match coordinates of genes between different files; I believe the best option would be to use something like a KDTree partitioning algorithm.
You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as 'coordinates':
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame() # this is your dataframe with DNA data
rna = pd.DataFrame() # Same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist, 0):
        # Now that you have the index of the matching row in DNA data
        # you can extract information using the index and do whatever
        # you want with it
        dna_gene_data = dna.loc[idx[0][0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely easy to work with, at the cost of being inefficient! Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with).
General advice (a sketch applying these points to the counting loop in the question follows below):
- Try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict).
- Try to use methods of the dataframe, or parts of it, to do the work (such as .apply() and .map()-like methods).
- Whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e., there may be only 10 columns but 10,000 rows; go over the columns).
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
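As a concrete illustration of that advice applied to the counting loop in the question, here is a rough sketch only; it ignores the chromosome column, just as the original loop does, and the file and column names are assumed from the sample data:
import numpy as np
import pandas as pd

# Load both files into dataframes, then count reads per gene interval with two
# binary searches on a sorted array instead of a 12-million-row Python loop per gene.
genes = pd.read_csv("top500_R8_7-CON-2-RNA.gtf", sep="\t", header=None,
                    names=["gene", "chrom", "start", "stop", "fpkm"])
reads = pd.read_csv("D2_7-CON-2-6.bed", sep="\t", header=None,
                    names=["chrom", "start", "stop", "name", "cov", "strand"])

read_starts = np.sort(reads["start"].to_numpy())
lo = np.searchsorted(read_starts, genes["start"].to_numpy(), side="right")
hi = np.searchsorted(read_starts, genes["stop"].to_numpy(), side="left")
genes["read_count"] = hi - lo   # reads whose start lies strictly inside each gene

genes[["gene", "fpkm", "read_count"]].to_csv("counts_forHeatmap.csv", index=False)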
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a built-in one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
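A minimal sketch of inspecting the saved profile with the standard-library pstats module (the file name matches the command above):
import pstats

# Print the 20 call paths with the largest cumulative time from the profile
# written by the cProfile command above.
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)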

Save each Excel-spreadsheet-row with header in separate .txt-file (saved as a parameter-sample to be read by simulation programs)

I'm a building energy simulation modeller with an Excel question about enabling automated large-scale simulations using parameter samples (generated using Monte Carlo). I have the following question about saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Does anybody have an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question in Excel. Solutions in Python are also welcome if that's easier for you.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to make this as simple as possible. I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install the xlrd package, in addition to pandas)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive approach would be to loop over all items. Use string formatting, which is best explained over here, and put the strings together the way you need them:
outputs = {}
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open. I'll just name the files by the index of the row, but this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique, like the date and time or the name of the file you read, or increment the file name across multiple runs of the script, for example like this.
All together we get:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')

file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    file = open('test_{:03}.txt'.format(file_count), "w")
    file.write(s)
    file.close()
    file_count += 1
Note that this is probably not the most elegant way and that there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive approach that you can tweak yourself easily.
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long

    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x

    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file then this python script will do what you want.
with open('data.csv') as file:
    data_list = [l.rstrip('\n').split(',') for l in file]

counter = 1
for x in range(1, len(data_list)):
    output_file_name = str(counter) + '.txt'
    with open(output_file_name, 'w') as file:
        for x in range(len(data_list[counter])):
            print(x)
            output_string = data_list[0][x] + '=' + data_list[counter][x] + ';\n'
            file.write(output_string)
    counter += 1

Loop through netcdf files and run calculations - Python or R

This is my first time using netCDF and I'm trying to wrap my head around working with it.
I have multiple version 3 netcdf files (NOAA NARR air.2m daily averages for an entire year). Each file spans a year between 1979 - 2012. They are 349 x 277 grids with approximately a 32km resolution. Data was downloaded from here.
The dimension is time (hours since 1/1/1800) and my variable of interest is air. I need to calculate accumulated days with a temperature < 0. For example
Day 1 = +4 degrees, accumulated days = 0
Day 2 = -1 degrees, accumulated days = 1
Day 3 = -2 degrees, accumulated days = 2
Day 4 = -4 degrees, accumulated days = 3
Day 5 = +2 degrees, accumulated days = 0
Day 6 = -3 degrees, accumulated days = 1
I need to store this data in a new netCDF file. I am familiar with Python and somewhat with R. What is the best way to loop through each day, check the previous day's value, and, based on that, output a value to a new netCDF file with the exact same dimensions and variable... or perhaps just add another variable to the original netCDF file with the output I'm looking for?
Is it best to leave all the files separate or combine them? I combined them with ncrcat and it worked fine, but the file is 2.3gb.
Thanks for the input.
My current progress in python:
import numpy
import netCDF4
#Change my working DIR
f = netCDF4.Dataset('air7912.nc', 'r')
for a in f.variables:
    print(a)
#output =
lat
long
x
y
Lambert_Conformal
time
time_bnds
air
f.variables['air'][1, 1, 1]
#Output
298.37473
To help me understand this better: what type of data structure am I working with? Is ['air'] the key in the above example, and are [1,1,1] also keys used to get the value 298.37473? How can I then loop through [1,1,1]?
You can use the very nice MFDataset feature in netCDF4 to treat a bunch of files as one aggregated file, without the need to use ncrcat. So your code would look like this:
from pylab import *
import netCDF4
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
f.variables.keys()
atemp = f.variables['air']
print atemp
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny,nx),dtype=int)
for i in xrange(ntimes):
    cold_days += atemp[i,:,:].data-273.15 < 0
pcolormesh(cold_days)
colorbar()
And here's one way to write the file (there might be easier ways):
# create NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc','w',clobber=True)
nco.createDimension('x',nx)
nco.createDimension('y',ny)
cold_days_v = nco.createVariable('cold_days', 'i4', ( 'y', 'x'))
cold_days_v.units='days'
cold_days_v.long_name='total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'
lono = nco.createVariable('lon','f4',('y','x'))
lato = nco.createVariable('lat','f4',('y','x'))
xo = nco.createVariable('x','f4',('x'))
yo = nco.createVariable('y','f4',('y'))
lco = nco.createVariable('Lambert_Conformal','i4')
# copy all the variable attributes from original file
for var in ['lon','lat','x','y','Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var],att,getattr(f.variables[var],att))
# copy variable data for lon,lat,x and y
lono[:]=f.variables['lon'][:]
lato[:]=f.variables['lat'][:]
xo[:]=f.variables['x'][:]
yo[:]=f.variables['y'][:]
# write the cold_days data
cold_days_v[:,:]=cold_days
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco,att,getattr(f,att))
nco.Conventions='CF-1.6'
nco.close()
If I try looking at the resulting file in the Unidata NetCDF-Java Tools-UI GUI, it seems to be okay.
Also note that here I just downloaded two of the datasets for testing, so I used
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
as an example. For all the data, you could use
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.????.nc')
or
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.*.nc')
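On the data-structure question at the end of the post: f.variables is a dict-like mapping from variable names to netCDF4 Variable objects, and the [1, 1, 1] part is ordinary array indexing (here [time, y, x]), not a set of keys. A small sketch, assuming the same 'air' variable as above:
import netCDF4

# f.variables['air'] is a netCDF4 Variable indexed like a numpy array;
# each air[t, :, :] slice is a 2-D (y, x) grid for one time step.
f = netCDF4.Dataset('air7912.nc', 'r')
air = f.variables['air']
print(air.dimensions, air.shape)      # e.g. ('time', 'y', 'x') and the grid size

for t in range(air.shape[0]):         # loop over the time axis
    frame = air[t, :, :]              # 2-D array of temperatures in Kelvin
    print(t, frame.min(), frame.max())
f.close()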
Here is an R solution.
infiles <- list.files("data", pattern = "nc", full.names = TRUE, include.dirs = TRUE)
outfile <- "data/air.colddays.nc"
library(raster)
r <- raster::stack(infiles)
r <- sum((r - 273.15) < 0)
plot(r)
I know this is rather late for this thread from 2013, but I just want to point out that the accepted solution doesn't answer the exact question posed. The question seems to want the length of each continuous period of temperatures below zero (note that in the question the counter resets if the temperature exceeds zero), which can be important for climate applications (e.g. for farming), whereas the accepted solution only gives the total number of days in a year that the temperature is below zero. If the total is really what mkmitchell wants (it has been accepted as the answer), then it can be done from the command line in cdo without having to worry about netCDF input/output:
cdo timsum -lec,273.15 in.nc out.nc
so a looped script would be:
files=`ls *.nc` # pick up all the netcdf files in a directory
for file in $files ; do
    # I use 273.15 since, from the question, T seems to be in Kelvin
    cdo timsum -lec,273.15 $file ${file%???}_numdays.nc
done
If you then want the total number over the whole period, you can then cat the _numdays files instead, which are much smaller:
cdo cat *_numdays.nc total.nc
cdo timsum total.nc total_below_zero.nc
But again, the question seems to want accumulated days per event, which is different, but not provided by the accepted answer.
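For the per-event accumulation the question actually describes (a counter that resets whenever the temperature rises above zero), here is a rough Python sketch along the lines of the accepted answer; the file pattern is a placeholder, and the 'air' variable in Kelvin is assumed as above:
import numpy as np
import netCDF4

# Running count of consecutive below-zero days for every grid point,
# resetting to zero whenever the temperature goes above freezing.
f = netCDF4.MFDataset('air.2m.*.nc')
atemp = f.variables['air']
ntimes, ny, nx = atemp.shape

accum = np.zeros((ny, nx), dtype=int)       # current run length per grid point
max_run = np.zeros((ny, nx), dtype=int)     # longest run seen so far

for i in range(ntimes):
    below = (atemp[i, :, :] - 273.15) < 0
    accum = np.where(below, accum + 1, 0)   # extend the run or reset it
    max_run = np.maximum(max_run, accum)
f.close()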

Trying to parse text files in python for data analysis

I do a lot of data analysis in perl and I am trying to replicate this work in python using pandas, numpy, matplotlib, etc.
The general workflow goes as follows:
1) glob all the files in a directory
2) parse the files because they have metadata
3) use regex to isolate relevant lines in a given file (They usually begin with a tag such as 'LOOPS')
4) split the lines that match the tag and load data into hashes
5) do some data analysis
6) make some plots
Here is a sample of what I typically do in perl:
print"Reading File:\n"; # gets data
foreach my $vol ($SmallV, $LargeV) {
my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
my #files = <$base_name*>; # globs for file names
foreach my $f (#files) { # loops through matching files
print"... $f\n";
my #split = split(/_/, $f);
my $beta = $split[4];
if (!grep{$_ eq $beta} #{$Beta{$vol}}) { # constructs Beta hash
push(#{$Beta{$vol}}, $split[4]);
}
open(IN, "<", "$f") or die "cannot open < $f: $!"; # reads in the file
chomp(my #in = <IN>);
close IN;
my #lines = grep{$_=~/^LOOPS/} #in; # greps for lines with the header LOOPS
foreach my $l (#lines) { # loops through matched lines
my #split = split(/\s+/, $l); # splits matched lines
push(#{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
if (!grep{$_ eq $split[1]} #smearingt) {# fills the smearing time array
push(#smearingt, $split[1]);
}
if (!grep{$_ eq $split[4]} #{$block{$vol}}) {# fills the number of blockings
push(#{$block{$vol}}, $split[4]);
}
}
}
foreach my $beta (#{$Beta{$vol}}) {
foreach my $loop (0,1,2,3,4) { # loops over observables
foreach my $b (#{$block{$vol}}) { # beta values
foreach my $t (#smearingt) { # and smearing times
$avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(#{$val{$vol}{$beta}{$t}{$loop}{$b}}); # to find statistics
$err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(#{$val{$vol}{$beta}{$t}{$loop}{$b}});
}
}
}
}
}
print"File Read in Complete!\n";
My hope is to load this data into a hierarchically indexed data structure, with the indices of the perl hash becoming indices of my Python data structure. Every example of pandas data structures I have come across so far has been highly contrived, where the whole structure (indices and values) was assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know what mass, beta, sizes, etc. are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable, and I will have to parse through them using regex, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages and standard deviations, perform mathematical operations, and plot the data.
Typical data has a header that is an unknown number of lines long but the stuff I care about looks like this:
Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
What I think I want (I preface this with 'think' because I am new to Python and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:
volume  mass  beta  observable  t  value
1224    0.0   5.6   0           0  1.234
                                1  1.490
                                2  1.222
                    1           0  1.234
                                1  1.234
2448    0.0   5.7   0           1  1.234
and so on like this: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical
For those of you who don't understand the perl:
The meat and potatoes of what I need is this:
push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
What I have here is a hash called 'val'. This is a hash of arrays. I believe in Python speak this would be a dict of lists. Here each thing that looks like '{$something}' is a key in the hash 'val', and I am appending the value stored in the variable $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
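In Python terms, a rough sketch of the same structure (the key and value numbers below are just illustrative, taken from the sample data above) would be a dict of lists keyed by tuples, which also maps naturally onto the pandas MultiIndex I am after:
from collections import defaultdict
import pandas as pd

val = defaultdict(list)
# push(@{$val{1224}{5.6}{0}{0}{2}}, 1.7800178); roughly becomes:
val[(1224, 5.6, 0, 0, 2)].append(1.7800178)
val[(1224, 5.6, 0, 1, 2)].append(0.84488326)

# A dict keyed by tuples converts directly to a hierarchically indexed Series:
series = pd.Series({k: sum(v) / len(v) for k, v in val.items()})   # e.g. averages
series.index.names = ['volume', 'beta', 'smearing_t', 'loop', 'block']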
==========
UPDATE
I have come up with the following code which results in this error:
Traceback (most recent call last):
File "wflow_2lattice_matching.py", line 39, in <module>
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined
Code:
#!/usr/bin/python
from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy

flavor = 4
mass = 0.0

vol = []
b = []
m_t = []
w_t = []
val = []

#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
    filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
    for filename in filelist:
        print 'Reading filename: '+filename
        f = open(filename, 'r')
        junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
        ftext = f.readlines()
        for line in ftext:
            if re.match('^WFLOW.*', line):
                line=line.strip()
                junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
                vol.append(v)
                b.append(beta)
                m_t.append(mont_t)
                w_t.append(smear_t)
                val.append(wilson_flow)
zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
You are getting the following:
NameError: name 'MultiIndex' is not defined
because you are not importing MultiIndex directly when you import Series and DataFrame.
You have -
from pandas import Series, DataFrame
You need -
from pandas import Series, DataFrame, MultiIndex
or you can instead refer to MultiIndex using pd.MultiIndex since you are importing pandas as pd
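For reference, a minimal sketch of the corrected tail of the script, using the pd.MultiIndex spelling and assuming the lists (vol, b, m_t, w_t, val) built in the question's loop; note it zips the b list rather than the beta string from the last filename, and passes four separate level names, since both look like typos in the posted code:
import pandas as pd

zipped = list(zip(vol, b, m_t, w_t))   # b holds the beta values collected in the loop
index = pd.MultiIndex.from_tuples(
    zipped, names=['volume', 'beta', 'montecarlo_time', 'smearing_time'])
data = pd.Series(val, index=index)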
Hopefully this helps you get started?
import sys, os

def regex_match(line):
    return 'LOOPS' in line

my_hash = {}
for fd in os.listdir(sys.argv[1]):               # for each file in this directory
    for line in open(sys.argv[1] + '/' + fd):    # get each line of the file
        if regex_match(line):                    # if it's a line I want
            fields = line.rstrip('\n').split('\t')  # get the data I want
            my_hash[fields[1]] = fields[2]       # store the data

for key in my_hash:  # data science can go here?
    do_something(key, my_hash[key] * 12)
    # plots
P.S. Make the first line
#!/usr/bin/python
(or whatever which python returns) so the script can run as an executable.
To glob your files, use the built-in glob module in Python.
To read your csv files after globbing them, the read_csv function that you can import using from pandas.io.parsers import read_csv will help you do that.
As for the MultiIndex feature of the pandas DataFrame that you instantiate after using read_csv, you can then use it to organize your data and slice it any way you want.
3 pertinent links for your reference.
Understanding MultiIndex dataframes in pandas - understanding MultiIndex and Benefits of panda's multiindex?
Using glob in a directory to grab and manipulate your files - extract values/renaming filename in python
