I'm a new python user. I'm trying to use obspy to create xml for a seismic array. I downloaded the template found at https://docs.obspy.org/tutorial/code_snippets/stationxml_file_from_scratch.html.
import obspy
from obspy.core.inventory import Inventory, Network, Station, Channel, Site
from obspy.clients.nrl import NRL
# We'll first create all the various objects. These strongly follow the
# hierarchy of StationXML files.
inv = Inventory(
# We'll add networks later.
networks=[],
# The source should be the id whoever create the file.
source="ObsPy-Tutorial")
net = Network(
# This is the network code according to the SEED standard.
code="XX",
# A list of stations. We'll add one later.
stations=[],
description="A test stations.",
# Start-and end dates are optional.
start_date=obspy.UTCDateTime(2016, 1, 2))
sta = Station(
# This is the station code according to the SEED standard.
code="ABC",
latitude=1.0,
longitude=2.0,
elevation=345.0,
creation_date=obspy.UTCDateTime(2016, 1, 2),
site=Site(name="First station"),
code="DEF",
latitude=10.0,
longitude=20.0,
elevation=3450.0,
creation_date=obspy.UTCDateTime(2016, 1, 3),
site=Site(name="Second station"))
cha = Channel(
# This is the channel code according to the SEED standard.
code="HHZ",
# This is the location code according to the SEED standard.
location_code="",
# Note that these coordinates can differ from the station coordinates.
latitude=1.0,
longitude=2.0,
elevation=345.0,
depth=10.0,
azimuth=0.0,
dip=-90.0,
sample_rate=200)
# By default this accesses the NRL online. Offline copies of the NRL can
# also be used instead
nrl = NRL()
# The contents of the NRL can be explored interactively in a Python prompt,
# see API documentation of NRL submodule:
# http://docs.obspy.org/packages/obspy.clients.nrl.html
# Here we assume that the end point of data logger and sensor are already
# known:
response = nrl.get_response( # doctest: +SKIP
sensor_keys=['Streckeisen', 'STS-1', '360 seconds'],
datalogger_keys=['REF TEK', 'RT 130 & 130-SMA', '1', '200'])
# Now tie it all together.
cha.response = response
sta.channels.append(cha)
net.stations.append(sta)
inv.networks.append(net)
# And finally write it to a StationXML file. We also force a validation against
# the StationXML schema to ensure it produces a valid StationXML file.
#
# Note that it is also possible to serialize to any of the other inventory
# output formats ObsPy supports.
inv.write("station.xml", format="stationxml", validate=True)
I'm stuck on a silly question: how can I add another station in sta? Something like
code="DEF",
latitude=10.0,
longitude=20.0,
elevation=3450.0,
creation_date=obspy.UTCDateTime(2016, 1, 3),
site=Site(name="Second station"))
I'm using Spyder 5.3.3.
Thank you for your help!
You can make another Station object, for example sta2 = Station(code=...), and then add it to the inventory using net.stations.append(sta2) :
import obspy
from obspy.core.inventory import Inventory, Network, Station, Channel, Site
from obspy.clients.nrl import NRL
# We'll first create all the various objects. These strongly follow the
# hierarchy of StationXML files.
inv = Inventory(
# We'll add networks later.
networks=[],
# The source should be the id whoever create the file.
source="ObsPy-Tutorial")
net = Network(
# This is the network code according to the SEED standard.
code="XX",
# A list of stations. We'll add one later.
stations=[],
description="A test stations.",
# Start-and end dates are optional.
start_date=obspy.UTCDateTime(2016, 1, 2))
sta = Station(
# This is the station code according to the SEED standard.
code="ABC",
latitude=1.0,
longitude=2.0,
elevation=345.0,
creation_date=obspy.UTCDateTime(2016, 1, 2),
site=Site(name="First station"))
# Second station
sta2 = Station(
code="DEF",
latitude=10.0,
longitude=20.0,
elevation=3450.0,
creation_date=obspy.UTCDateTime(2016, 1, 3),
site=Site(name="Second station"))
cha = Channel(
# This is the channel code according to the SEED standard.
code="HHZ",
# This is the location code according to the SEED standard.
location_code="",
# Note that these coordinates can differ from the station coordinates.
latitude=1.0,
longitude=2.0,
elevation=345.0,
depth=10.0,
azimuth=0.0,
dip=-90.0,
sample_rate=200)
# By default this accesses the NRL online. Offline copies of the NRL can
# also be used instead
nrl = NRL()
# The contents of the NRL can be explored interactively in a Python prompt,
# see API documentation of NRL submodule:
# http://docs.obspy.org/packages/obspy.clients.nrl.html
# Here we assume that the end point of data logger and sensor are already
# known:
response = nrl.get_response( # doctest: +SKIP
sensor_keys=['Streckeisen', 'STS-1', '360 seconds'],
datalogger_keys=['REF TEK', 'RT 130 & 130-SMA', '1', '200'])
# Now tie it all together.
cha.response = response
sta.channels.append(cha)
net.stations.append(sta)
# add second station to network
net.stations.append(sta2)
inv.networks.append(net)
# And finally write it to a StationXML file. We also force a validation against
# the StationXML schema to ensure it produces a valid StationXML file.
# Note that it is also possible to serialize to any of the other inventory
# output formats ObsPy supports.
inv.write("station.xml", format="stationxml", validate=True)
You can double check that the new station is there with print(inv) (I find XML files rather hard to read sometimes):
Inventory created at 2022-10-14T00:33:59.433461Z
Created by: ObsPy 1.3.0
https://www.obspy.org
Sending institution: ObsPy-Tutorial
Contains:
Networks (1):
XX
Stations (2):
XX.ABC (First station)
XX.DEF (Second station)
Channels (1):
XX.ABC..HHZ
You can use a similar method to add channels to a station using sta.channels.append(new channel). Cheers!
Related
I am currently processing medical data in Tensorflow. The problem I have is in regards to attaching labels to the Data.
The structure of my directory is as follows: There is one big folder that contains 3 subfolders for every patient (with 24 patients, so 72 Folders) in those folders is the data.
Now the Problem is that Patients are divided into two groups, group 1 and group 0(Response to treatment, or no response).
My objective is to train the Model to figure out which group a patient is supposed to be in.
I am following This Tutorial on how to create a dataset from all these files.
But in the Example the label is part of the path. This is not possible here (The University does not want to change the structure of the directory)
Here is my example code "functions" is my python script for all functions and settings is a python script that is supposed to create global variables:
settings.json:
Is my json file for all the settings. Yes I have two files named settings, I'm going to rename those at some point. "labels" is a List of the labels for each subfolder. (first three are patient 1, second are the three for patient 2 etc.)
"labels": [1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0.....(too long for here)
"datasetpath" is just the path to my data.
settings.py:
import functions
def init():
global settings
settings = functions.initjson()
global a
a = 1
functions.py:
import settings
import tensorflow as tf
def process_path(file_path):
parts = tf.strings.split(file_path, '/')
#label = settings.settings["labels"][settings.a]
label = settings.a #To test if it ticks up at all
settings.a = settings.a +1
return tf.io.read_file(file_path), label
as you can see
The actual Script:
datasetpath = settings.settings["datasetpath"]
file_paths = functions.getfilepaths(datasetpath,verbose=True)
setpath = pathlib.Path(datasetpath)
list_ds = tf.data.Dataset.list_files(str(setpath/'*/*'))
labeled_ds = list_ds.map(functions.process_path)
for f in list_ds.take(5):
print(f.numpy())
for image_raw, label_text in labeled_ds.take(3):
print(repr(image_raw.numpy()[:100]))
print()
print(label_text.numpy())
So I cannot just take a part of the path a s a label how they did it.
I can't compare the Patient name inside the path with a normal List (see Here)
I tried working with a global variable that ticks up and accesses a list, but it does not tick up.
And I'm fresh out of ideas.
Does anyone have an Idea on how to solve this?
So I solved it myself now.
It's not ideal, but what I did was going through all Paths and comparing them to the "labels" List.
Here is the function I used.
def PD_PET_list():
datasetpath = settings["Path to PET"]
labels = settings["labels"]
#if settings.verbose: print("Path to Dataset for label Creation:",datasetpath)
setpath = pathlib.Path(datasetpath)
#if settings.debug: print(setpath)
pat = sorted(glob.glob(str(setpath / '*')))
global pd
pd = []
for i in range(len(labels)):
# print(pat[i])
# print(labels[i])
if labels[i] == 1:
#print(pat[i])
#print(labels[i])
parts = tf.strings.split(pat[i], '/')
pd.append(parts[-1])
Afterwards I just checked if the Name of the Folder is within that list (Looping through the list "If x in y" did not work sadly.
I am running a model evaluation protocol for Modeller. It evaluates every model and writes its result to a separate file. However I have to run it for every model and write to a single file.
This is the original code:
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
# read model file
mdl = complete_pdb(env, 'TvLDH.B99990001.pdb')
# Assess all atoms with DOPE:
s = selection(mdl)
s.assess_dope(output='ENERGY_PROFILE NO_REPORT', file='TvLDH.profile',
normalize_profile=True, smoothing_window=15)
I added a loop to evaluate every model in a single run, however I am creating several files (one for each model) and I want is to print all evaluations in a single file
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
#My loop starts here
for i in range (1,1001):
number=str(i)
if i<10:
name='000'+number
else:
if i<100:
name='00'+number
else:
if i<1000:
name='0'+number
else:
name='1000'
# read model file
mdl = complete_pdb(env, 'TcP5CDH.B9999'+name+'.pdb')
# Assess all atoms with DOPE: this is the assesment that i want to print in the same file
s = selection(mdl)
savename='TcP5CDH.B9999'+name+'.profile'
s.assess_dope(output='ENERGY_PROFILE NO_REPORT',
file=savename,
normalize_profile=True, smoothing_window=15)
As I am new to programming, any help will be very helpful!
Welcome :-) Looks like you're very close. Let's introduce you to using a python function and the .format() statement.
Your original has a comment line # read model file, which looks like it could be a function, so let's try that. It could look something like this.
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
# I'm assuming this can be done just once
# and re-used for all your model files...
# (if not, the env stuff should go inside the
# read_model_file() function.
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
def read_model_file(file_name):
print('--- read_model_file(file_name='+file_name+') ---')
mdl = complete_pdb(env, file_name)
# Assess all atoms with DOPE:
s = selection(mdl)
output_file = file_name+'.profile'
s.assess_dope(
output='ENERGY_PROFILE NO_REPORT',
file=output_file,
normalize_profile=True,
smoothing_window=15)
for i in range(1,1001):
file_name = 'TcP5CDH.B9999{:04d}pdb'.format(i)
read_model_file(file_name)
Using .format() we can get ride of the multiple if-statement checks for 10? 100? 1000?
Basically .format() replaces {} curly braces with the argument(s).
It can be pretty complex but you don't need to digetst all of it.
Example:
'Hello {}!'.format('world') yields Hello world!. The {:04d} stuff uses formatting, basically that says "Please make a 4-character wide digit-substring and zero-fill it, so you should get '0001', ..., '0999', '1000'.
Just {:4d} (no leading zero) would give you space padded results (e.g. ' 1', ..., ' 999', '1000'.
Here's a little more on the zero-fill: Display number with leading zeros
I would like to do a FVA only for selected reactions, in my case on transport reactions between compartments (e.g. between the cytosol and mitochondrion). I know that I can use selected_reactions in doFVA like this:
import cbmpy as cbm
mod = cbm.CBRead.readSBML3FBC('iMM904.xml.gz')
cbm.doFVA(mod, selected_reactions=['R_FORtm', 'R_CO2tm'])
Is there a way to get the entire list of transport reactions, not only the two I manually added? I thought about
selecting the reactions based on their ending tm but that fails for 'R_ORNt3m' (and probably other reactions, too).
I want to share this model with others. What is the best way of storing the information in the SBML file?
Currently, I would store the information in the reaction annotation as in
this answer. For example
mod.getReaction('R_FORtm').setAnnotation('FVA', 'yes')
which could be parsed.
There is no built-in function for this kind of task. As you already mentioned, relying on the IDs is generally not a good idea as those can differ between different databases, models and groups (e.g. if someone decided just to enumerate reactions from r1 till rn and or metabolites from m1 till mm, filtering based on IDs fails). Instead, one can make use of the compartment field of the species. In CBMPy you can access a species' compartment by doing
import cbmpy as cbm
import pandas as pd
mod = cbm.CBRead.readSBML3FBC('iMM904.xml.gz')
mod.getSpecies('M_atp_c').getCompartmentId()
# will return 'c'
# run a FBA
cbm.doFBA(mod)
This can be used to find all fluxes between compartments as one can check for each reaction in which compartment their reagents are located. A possible implementation could look as follows:
def get_fluxes_associated_with_compartments(model_object, compartments, return_values=True):
# check whether provided compartment IDs are valid
if not isinstance(compartments, (list, set) or not set(compartments).issubset(model_object.getCompartmentIds())):
raise ValueError("Please provide valid compartment IDs as a list!")
else:
compartments = set(compartments)
# all reactions in the model
model_reactions = model_object.getReactionIds()
# check whether provided compartments are identical with the ones of the reagents of a reaction
return_reaction_ids = [ri for ri in model_reactions if compartments == set(si.getCompartmentId() for si in
model_object.getReaction(ri).getSpeciesObj())]
# return reaction along with its value
if return_values:
return {ri: model_object.getReaction(ri).getValue() for ri in return_reaction_ids}
# return only a list with reaction IDs
return return_reaction_ids
So you pass your model object and a list of compartments and then for each reaction it is checked whether there is at least one reagent located in the specified compartments.
In your case you would use it as follows:
# compartment IDs for mitochondria and cytosol
comps = ['c', 'm']
# you only want the reaction IDs; remove the ', return_values=False' part if you also want the corresponding values
trans_cyt_mit = get_fluxes_associated_with_compartments(mod, ['c', 'm'], return_values=False)
The list trans_cyt_mit will then contain all desired reaction IDs (also the two you specified in your question) which you can then pass to the doFVA function.
About the second part of your question. I highly recommend to store those reactions in a group rather than using annotation:
# create an empty group
mod.createGroup('group_trans_cyt_mit')
# get the group object so that we can manipulate it
cyt_mit = mod.getGroup('group_trans_cyt_mit')
# we can only add objects to a group so we get the reaction object for each transport reaction
reaction_objects = [mod.getReaction(ri) for ri in trans_cyt_mit]
# add all the reaction objects to the group
cyt_mit.addMember(reaction_objects)
When you now export the model, e.g. by using
cbm.CBWrite.writeSBML3FBCV2(mod, 'iMM904_with_groups.xml')
this group will be stored as well in SBML. If a colleague reads the SBML again, he/she can then easily run a FVA for the same reactions by accessing the group members which is far easier than parsing the annotation:
# do an FVA; fva_res: Reaction, Reduced Costs, Variability Min, Variability Max, abs(Max-Min), MinStatus, MaxStatus
fva_res, rea_names = cbm.doFVA(mod, selected_reactions=mod.getGroup('group_trans_cyt_mit').getMemberIDs())
fva_dict = dict(zip(rea_names, fva_res.tolist()))
# store results in a dataframe which makes the selection of reactions easier
fva_df = pd.DataFrame.from_dict(fva_dict, orient='index')
fva_df = fva_df.rename({0: "flux_value", 1: "reduced_cost_unscaled", 2: "variability_min", 3: "variability_max",
4: "abs_diff_var", 5: "min_status", 6: "max_status"}, axis='columns')
Now you can easily query the dataframe and find the flexible and not flexible reactions within your group:
# filter the reactions with flexibility
fva_flex = fva_df.query("abs_diff_var > 10 ** (-4)")
# filter the reactions that are not flexible
fva_not_flex = fva_df.query("abs_diff_var <= 10 ** (-4)")
I am trying to create my own tool to use in ArcMap but keep running into a problem. I want to create a buffer (which I can do) and then clip the points that fall within the buffer. The problem I run into is that I cannot figure out how to use the buffer as the input feature for the clip section of my tool.
import arcpy
import os
from arcpyimmport env
env.workspace = "C:/LabData"
arcpy.env.overwriteOutput = True
In_lake = arcpy.GetParameterAsText(0)
Out_Buff = arcpy.GetParameterAsText(1)
Buffer_Distance = arcpy.GetParameterAstext(2)
in_cities = arcpy.GetParameterAsText(3)
cliped_cities = GetParameterAsText(4)
New_Table = arcpy.GetParameterAsText(5)
Join_Input = arcpy.GetParameteAsText(6)
# step 1 create a buffer around the lakes
arcpy.Buffer_analysis(In_Lake, Out_Buff, Buffer_Distance)
# Step 2 Clip all cities that fall within the buffer
arcpy.Clip_analysis( in_cities,out_Buff, clipped_cities)
# Step 3
arcpy.Statistics_analysis(clipped_cities, New_Table, statistics_fields,\
'Population SUM', 'CNTRY_NAME')
# Step 5
arcpy.AddField_management (New_Table, 'Country', 'TEXT')
]1
Check carefully that your variable names match -- Python and ArcPy are case sensitive.
In_Lake = arcpy.GetParameterAsText(0) ## was In_lake
Out_Buff = arcpy.GetParameterAsText(1)
Buffer_Distance = arcpy.GetParameterAstext(2)
in_cities = arcpy.GetParameterAsText(3)
clipped_cities = GetParameterAsText(4) ## was cliped_cities
New_Table = arcpy.GetParameterAsText(5)
Join_Input = arcpy.GetParameteAsText(6)
# step 1 create a buffer around the lakes
arcpy.Buffer_analysis(In_Lake, Out_Buff, Buffer_Distance)
# Step 2 Clip all cities that fall within the buffer
arcpy.Clip_analysis(in_cities, Out_Buff, clipped_cities) ## was out_Buff
Unless you want to keep the lake buffer, it doesn't necessarily need to be an input parameter specified by the user. Consider instead using the in_memory workspace -- just be aware any data in it will be deleted once the tool execution is completed.
Out_Buff = r'in_memory\lakeBuffer'
A similar strategy can be used for any intermediate feature class or table that you don't really care about. However, it's sometimes useful to have those intermediate results around to verify that your tool is doing what you expect at every step.
The Problem
I'm doing time-series analysis. Measured data comes from the sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as separate files in hour chunks. Data is saved to an HDF5 file using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.
The full data set is now multiple TB, far too large to load into memory.
Some of my analysis requires me to iterative over the full data set. For analysis that requires me to grab chunks of data, I can see a path forward through creating a generator method. I'm a bit uncertain of how to proceed with analysis that requires a continuous time series.
Example
For example, let's say I'm looking to find and categorize transients using some moving window process (e.g. wavelet analysis) or apply a FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
Request
I would love to:
Keep the memory footprint low by loading data as necessary.
Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].
I'm using scientific python (Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my work flow and I'm still unfamiliar with all of its capabilities.
I've looked over related stackexchange threads and didn't see anything that exactly addressed my issue.
EDIT: FINAL SOLUTION.
Based upon the helpful hints I built a iterator that steps over files and returns chunks of arbitrary size---a moving window that hopefully handles file boundaries with grace. I've added the option of padding the front and back of each of the windows with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and then remove the overlaps at the end. This, I hope, gives me continuity.
I haven't yet implemented __getitem__ but it's on my list of things to do.
Here's the final code. A few details are omitted for brevity.
class FolderContainer(readdata.DataContainer):
def __init__(self,startdir):
readdata.DataContainer.__init__(self,startdir)
self.filelist = None
self.fs = None
self.nsamples_hour = None
# Build the file list
self._build_filelist(startdir)
def _build_filelist(self,startdir):
"""
Populate the filelist dictionary with active files and their associated
file date (YYYY,MM,DD) and hour.
Each entry in 'filelist' has the form (abs. path : datetime) where the
datetime object contains the complete date and hour information.
"""
print('Building file list....',end='')
# Use the full file path instead of a relative path so that we don't
# run into problems if we change the current working directory.
filelist = { os.path.abspath(f):self._datetime_from_fname(f)
for f in os.listdir(startdir)
if fnmatch.fnmatch(f,'NODE*.h5')}
# If we haven't found any files, raise an error
if not filelist:
msg = "Input directory does not contain Illionix h5 files."
raise IOError(msg)
# Filelist is a ordered dictionary. Sort before saving.
self.filelist = OrderedDict(sorted(filelist.items(),
key=lambda t: t[0]))
print('done')
def _datetime_from_fname(self,fname):
"""
Return the year, month, day, and hour from a filename as a datetime
object
"""
# Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
# take only the date parts. Convert the year form YY to YYYY.
(year,month,day,hour) = [int(d) for d in re.split('-|\.',fname)[1:-1]]
year+=2000
return datetime.datetime(year,month,day,hour)
def chunk(self,tstart,dt,**kwargs):
"""
Generator expression from returning consecutive chunks of data with
overlaps from the entire set of Illionix data files.
Parameters
----------
Arguments:
tstart: UTC start time [provided as a datetime or date string]
dt: Chunk size [integer number of samples]
Keyword arguments:
tend: UTC end time [provided as a datetime or date string].
frontpad: Padding in front of sample [integer number of samples].
backpad: Padding in back of sample [integer number of samples]
Yields:
chunk: generator expression
"""
# PARSE INPUT ARGUMENTS
# Ensure 'tstart' is a datetime object.
tstart = self._to_datetime(tstart)
# Find the offset, in samples, of the starting position of the window
# in the first data file
tstart_samples = self._to_samples(tstart)
# Convert dt to samples. Because dt is a timedelta object, we can't use
# '_to_samples' for conversion.
if isinstance(dt,int):
dt_samples = dt
elif isinstance(dt,datetime.timedelta):
dt_samples = np.int64((dt.day*24*3600 + dt.seconds +
dt.microseconds*1000) * self.fs)
else:
# FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
# below when EPD pushes the update.
t = self._parse_date_str(dt)
dt_samples = np.int64((t.minute*60 + t.second) * self.fs)
# Read keyword arguments. 'tend' defaults to the end of the last file
# if a time is not provided.
default_tend = self.filelist.values()[-1] + datetime.timedelta(hours=1)
tend = self._to_datetime(kwargs.get('tend',default_tend))
tend_samples = self._to_samples(tend)
frontpad = kwargs.get('frontpad',0)
backpad = kwargs.get('backpad',0)
# CREATE FILE LIST
# Build the the list of data files we will iterative over based upon
# the start and stop times.
print('Pruning file list...',end='')
tstart_floor = datetime.datetime(tstart.year,tstart.month,tstart.day,
tstart.hour)
filelist_pruned = OrderedDict([(k,v) for k,v in self.filelist.items()
if v >= tstart_floor and v <= tend])
print('done.')
# Check to ensure that we're not missing files by enforcing that there
# is exactly an hour offset between all files.
if not all([dt == datetime.timedelta(hours=1)
for dt in np.diff(np.array(filelist_pruned.values()))]):
raise readdata.DataIntegrityError("Hour gap(s) detected in data")
# MOVING WINDOW GENERATOR ALGORITHM
# Keep two files open, the current file and the next in line (que file)
fname_generator = self._file_iterator(filelist_pruned)
fname_current = fname_generator.next()
fname_next = fname_generator.next()
# Iterate over all the files. 'lastfile' indicates when we're
# processing the last file in the que.
lastfile = False
i = tstart_samples
while True:
with tables.openFile(fname_current) as fcurrent, \
tables.openFile(fname_next) as fnext:
# Point to the data
data_current = fcurrent.getNode('/data/voltage/raw')
data_next = fnext.getNode('/data/voltage/raw')
# Process all data windows associated with the current pair of
# files. Avoid unnecessary file access operations as we moving
# the sliding window.
while True:
# Conditionals that depend on if our slice is:
# (1) completely into the next hour
# (2) partially spills into the next hour
# (3) completely in the current hour.
if i - backpad >= self.nsamples_hour:
# If we're already on our last file in the processing
# que, we can't continue to the next. Exit. Generator
# is finished.
if lastfile:
raise GeneratorExit
# Advance the active and que file names.
fname_current = fname_next
try:
fname_next = fname_generator.next()
except GeneratorExit:
# We've reached the end of our file processing que.
# Indicate this is the last file so that if we try
# to pull data across the next file boundary, we'll
# exit.
lastfile = True
# Our data slice has completely moved into the next
# hour.
i-=self.nsamples_hour
# Return the data
yield data_next[i-backpad:i+dt_samples+frontpad]
# Move window by amount dt
i+=dt_samples
# We've completely moved on the the next pair of files.
# Move to the outer scope to grab the next set of
# files.
break
elif i + dt_samples + frontpad >= self.nsamples_hour:
if lastfile:
raise GeneratorExit
# Slice spills over into the next hour
yield np.r_[data_current[i-backpad:],
data_next[:i+dt_samples+frontpad-self.nsamples_hour]]
i+=dt_samples
else:
if lastfile:
# Exit once our slice crosses the boundary of the
# last file.
if i + dt_samples + frontpad > tend_samples:
raise GeneratorExit
# Slice is completely within the current hour
yield data_current[i-backpad:i+dt_samples+frontpad]
i+=dt_samples
def _to_samples(self,input_time):
"""Convert input time, if not in samples, to samples"""
if isinstance(input_time,int):
# Input time is already in samples
return input_time
elif isinstance(input_time,datetime.datetime):
# Input time is a datetime object
return self.fs * (input_time.minute * 60 + input_time.second)
else:
raise ValueError("Invalid input 'tstart' parameter")
def _to_datetime(self,input_time):
"""Return the passed time as a datetime object"""
if isinstance(input_time,datetime.datetime):
converted_time = input_time
elif isinstance(input_time,str):
converted_time = self._parse_date_str(input_time)
else:
raise TypeError("A datetime object or string date/time were "
"expected")
return converted_time
def _file_iterator(self,filelist):
"""Generator for iterating over file names."""
for fname in filelist:
yield fname
#Sean here's my 2c
Take a look at this issue here which I created a while back. This is essentially what you are trying to do. This is a bit non-trivial.
Without knowing more details, I would offer a couple of suggestions:
HDFStore CAN read in a standard CArray type of format, see here
You can easily create a 'Series' like object that has nice properties of a) knowing where each file is and its extents, and uses __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this might be a very nice abstraction, and you can then dispatch operations.
e.g.
class OutOfCoreSeries(object):
def __init__(self, dir):
.... load a list of the files in the dir where you have them ...
def __getitem__(self, key):
.... map the selection key (say its a slice, which 'time1:time2' resolves) ...
.... to the files that make it up .... , then return a new Series that only
.... those file pointers ....
def apply(self, func, **kwargs):
""" apply a function to the files """
results = []
for f in self.files:
results.append(func(self.read_file(f)))
return Results(results)
This can very easily get quite complicated. For instance, if you apply an operation that does a reduction that you can fit in memory, Results can simpley be a pandas.Series (or Frame). Hoever,
you may be doing a transformation which necessitates you writing out a new set of transformed data files. If you so, then you have to handle this.
Several more suggestions:
You may want to hold onto your data in possibly multiple ways that may be useful. For instance you say that you are saving multiple values in a 1-hour slice. It may be that you can split these 1-hour files instead into a file for each variable you are saving but save a much longer slice that then becomes memory readable.
You might want to resample the data to lower frequencies, and work on these, loading the data in a particular slice as needed for more detailed work.
You might want to create a dataset that is queryable across time, e.g. say high-low peaks at varying frequencies, e.g. maybe using the Table format see here
Thus you may have multiple variations of the same data. Disk space is usually much cheaper/easier to manage than main memory. It makes a lot of sense to take advantage of that.