Error in reading and writing files in Python

I am trying to convert files from one format to another in Python. The current format is DAQ (data acquisition format), which is read in first. Then I use the undaqTools module to write the files to hdf5 format.
import glob
ctnames = glob.glob('*.daq')
Following are the few filenames (there are 100 in total):
ctnames
['Cars_20160601_01.daq',
'Cars_20160601_02.daq',
'Cars_20160601_03.daq',
'Cars_20160601_04.daq',
'Cars_20160601_05.daq',
'Cars_20160601_06.daq',
'Cars_20160601_07.daq',
.
.
.
Importing undaqTools:
from undaqTools import Daq
Reading the DAQ files and writing to hdf5:
for n in ctnames:
    x = daq.read(n)
    daq.write_hd5(x)
Following is the error I got:
C:\Anaconda3\envs\py27\lib\site-packages\undaqtools-0.2.3-py2.7.egg\undaqTools\daq.py:405: RuntimeWarning: Failed loading file on frame 46970. (stopped reading file)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-17-6fe7a8c9496d> in <module>()
1 for n in ctnames:
----> 2 x = daq.read(n)
3 daq.write_hd5(x)
C:\Anaconda3\envs\py27\lib\site-packages\undaqtools-0.2.3-py2.7.egg\undaqTools\daq.pyc in read_daq(self, filename, elemlist, loaddata, process_dynobjs, interpolate_missing_frames)
272
273 if loaddata:
--> 274 self._loaddata()
275 self._unwrap_lane_deviation()
276
C:\Anaconda3\envs\py27\lib\site-packages\undaqtools-0.2.3-py2.7.egg\undaqTools\daq.pyc in _loaddata(self)
449 assert tmpdata[name].shape[0] == frame.frame.shape[0]
450 else:
--> 451 assert tmpdata[name].shape[1] == frame.frame.shape[0]
452
453 # cast as Element objects
AssertionError:
Questions
I have 2 questions:
1. How do I know which of the 100 files is throwing the error?
2. How do I skip a file if it throws the error?

Wrap the read() call in a try/except block. If you get an exception, print the current filename and skip to the next one.
for n in ctnames:
    try:
        x = daq.read(n)
    except AssertionError:
        print 'Could not process file %s. Skipping.' % n
        continue
    daq.write_hd5(x)
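If you also want a record of which of the 100 files failed (your first question), a small variant of the same loop can collect them in a list. This is just a sketch, assuming the same daq object used above:
failed = []
for n in ctnames:
    try:
        x = daq.read(n)
    except AssertionError:
        # remember the filename that could not be read, then move on
        print 'Could not process file %s. Skipping.' % n
        failed.append(n)
        continue
    daq.write_hd5(x)

print 'Files that failed: %s' % failed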

Related

How to run batch sentiment inference with PyABSA

I followed this tutorial to do sentiment inference. I could run the code with a list of sentences (in the section Aspect Sentiment Inference of the Colab Notebook). However, I don't know how to modify the following code (in the section Batch Sentiment Inference) to infer sentiment for a file of my own (containing just 2 lines, each of which has two sentences).
# inference_sets = ABSADatasetList.Phone  # original code
inference_sets = 'test.dat.apc'  # my own file; I want to infer sentiment for each sentence in it
results = sent_classifier.batch_infer(target_file=inference_sets,
                                      print_result=True,
                                      save_result=True,
                                      ignore_error=False,
                                      )
Running the modified code caused the following error
RuntimeError Traceback (most recent call last)
Input In [56], in <cell line: 2>()
1 test = 'test.dat.apc'
----> 2 results = sent_classifier.batch_infer(target_file=test,
3 print_result=False,
4 save_result=True,
5 ignore_error=False,
6 )
File ~\Anaconda3\envs\spacy\lib\site-packages\pyabsa-1.16.15-py3.9.egg\pyabsa\core\apc\prediction\sentiment_classifier.py:197, in SentimentClassifier.batch_infer(self, target_file, print_result, save_result, ignore_error, clear_input_samples)
193 self.clear_input_samples()
195 save_path = os.path.join(os.getcwd(), 'apc_inference.result.json')
--> 197 target_file = detect_infer_dataset(target_file, task='apc')
198 if not target_file:
199 raise FileNotFoundError('Can not find inference datasets!')
File ~\Anaconda3\envs\spacy\lib\site-packages\pyabsa-1.16.15-py3.9.egg\pyabsa\functional\dataset\dataset_manager.py:302, in detect_infer_dataset(dataset_path, task)
300 if os.path.isdir(dataset_path.dataset_name):
301 print('No inference set found from: {}, unrecognized files: {}'.format(dataset_path, ', '.join(os.listdir(dataset_path.dataset_name))))
--> 302 raise RuntimeError(
303 'Fail to locate dataset: {}. If you are using your own dataset, you may need rename your dataset according to {}'.format(
304 dataset_path,
305 'https://github.com/yangheng95/ABSADatasets#important-rename-your-dataset-filename-before-use-it-in-pyabsa')
306 )
307 if len(dataset_path) > 1:
308 print(colored('Please DO NOT mix datasets with different sentiment labels for training & inference !', 'yellow'))
RuntimeError: Fail to locate dataset: ['test.dat.apc']. If you are using your own dataset, you may need rename your dataset according to https://github.com/yangheng95/ABSADatasets#important-rename-your-dataset-filename-before-use-it-in-pyabsa
What did I do wrong?
You need to rename your dataset file so that its name ends with .inference, as described in the ABSADatasets link from the error message.
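A minimal sketch of that idea (the copied filename is just an illustration; the .inference suffix follows the naming rule from the link above):
import shutil

# PyABSA detects inference files by name, so make a copy whose name ends in '.inference'
src = 'test.dat.apc'
dst = 'test.dat.apc.inference'   # hypothetical renamed copy
shutil.copyfile(src, dst)

results = sent_classifier.batch_infer(target_file=dst,
                                      print_result=True,
                                      save_result=True,
                                      ignore_error=False,
                                      )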

File is not showing when applying rasterio.open()

Here is my code
import os
import rasterio as rs

refPath = '/Users/admin/Downloads/Landsat8/'
ext = '_NDWI.tif'
for file in sorted(os.listdir(refPath)):
    if file.endswith(ext):
        print(file)
        ndwiopen = rs.open(file)
        ndwiread = ndwiopen.read(1)
Here is the error
2014_NDWI.tif
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
File rasterio/_base.pyx:302, in rasterio._base.DatasetBase.__init__()
File rasterio/_base.pyx:213, in rasterio._base.open_dataset()
File rasterio/_err.pyx:217, in rasterio._err.exc_wrap_pointer()
CPLE_OpenFailedError: 2014_NDWI.tif: No such file or directory
During handling of the above exception, another exception occurred:
RasterioIOError Traceback (most recent call last)
Input In [104], in <cell line: 33>()
34 if file.endswith(ext):
35 print(file)
---> 36 ndwiopen = rs.open(file)
38 ndwiread = ndwiopen.read(1)
39 plt.figure(figsize = (20, 15))
File /Applications/anaconda3/lib/python3.9/site-packages/rasterio/env.py:442, in ensure_env_with_credentials.<locals>.wrapper(*args, **kwds)
439 session = DummySession()
441 with env_ctor(session=session):
--> 442 return f(*args, **kwds)
File /Applications/anaconda3/lib/python3.9/site-packages/rasterio/__init__.py:277, in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
274 path = _parse_path(raw_dataset_path)
276 if mode == "r":
--> 277 dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
278 elif mode == "r+":
279 dataset = get_writer_for_path(path, driver=driver)(
280 path, mode, driver=driver, sharing=sharing, **kwargs
281 )
File rasterio/_base.pyx:304, in rasterio._base.DatasetBase.__init__()
RasterioIOError: 2014_NDWI.tif: No such file or directory
As shown, the filename is printed in the output, but the file cannot be opened by rasterio (imported as rs).
I can't understand what is missing in the script.
Unsure if this is your exact problem, but I rammed my head against this same error for 5-10 hours before I realized that the '.tif' file I was trying to read had an extension in all caps, as in '.TIF'. This is apparently the default for the Landsat 8 image bands I was working with.
I was doing similar string concatenation, and my string would result in 'filename.tif' instead of the correct 'filename.TIF', so rasterio was unable to find the file. It was really frustrating, so I figured I would share how I solved it since you have not yet received any replies, even though I cannot know if this was your issue. When I searched this error, this post was one of the first and closest matches but was unanswered, so I am posting in case anyone with my issue stumbles across it as well (or for myself when I inevitably forget in a few months how I solved this).
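A minimal sketch of that idea, checking the extension case-insensitively; it also opens the file via its full path, since rs.open(file) with a bare filename is resolved against the current working directory rather than refPath:
import os
import rasterio as rs

refPath = '/Users/admin/Downloads/Landsat8/'
ext = '_ndwi.tif'

for file in sorted(os.listdir(refPath)):
    # compare in lower case so both '.tif' and the Landsat 8 default '.TIF' match
    if file.lower().endswith(ext):
        print(file)
        # open using the full path so rasterio finds the file regardless of the cwd
        with rs.open(os.path.join(refPath, file)) as ndwiopen:
            ndwiread = ndwiopen.read(1)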

Extracting chains from an electron microscopy structure

I need to extract single chains from a structure file in cif format as available from the PDB. I've read several related questions, such as this and this. The proposed solution indeed works well if the chain ID is an integer or a single character. If applied to a structure such as 6KMW to extract chain aA it raises the error TypeError: %c requires int or char. Full code used to reproduce the error and output included below.
from Bio.PDB import PDBList, PDBIO, FastMMCIFParser, Select

class ChainSelect(Select):
    def __init__(self, chain):
        self.chain = chain

    def accept_chain(self, chain):
        if chain.get_id() == self.chain:
            return 1
        else:
            return 0

pdbl = PDBList()
io = PDBIO()
parser = FastMMCIFParser(QUIET=True)

pdbl.retrieve_pdb_file('6kmw', pdir='.', file_format='mmCif')
structure = parser.get_structure('6kmw', '6kmw.cif')
io.set_structure(structure)
io.save('6kmw_aA.pdb', ChainSelect('aA'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-095b98a12800> in <module>
18 structure = parser.get_structure('6kmw', '6kmw.cif')
19 io.set_structure(structure)
---> 20 io.save('6kmw_aA.pdb', ChainSelect('aA'))
~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in save(self, file, select, write_end, preserve_atom_numbering)
368 )
369
--> 370 s = get_atom_line(
371 atom,
372 hetfield,
~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in _get_atom_line(self, atom, hetfield, segid, atom_number, resname, resseq, icode, chain_id, charge)
227 charge,
228 )
--> 229 return _ATOM_FORMAT_STRING % args
230
231 else:
TypeError: %c requires int or char
Is anyone aware of a Biopython functionality to achieve the result? Preferably one that doesn't rely on parsing the entire file by custom functions.
I think what you are trying to achieve is just impossible. Effectively you want to convert a cif file to a pdb file; it does not matter that you want to reduce the protein structure to a single chain in the process.
The PDB format is a file format from the last century. (I know how widely it is still used today...) It is column oriented and only allows one character for the chain id. This is the reason you cannot download a PDB file for protein 6KMW; see the tooltip at https://www.rcsb.org/structure/6KMW: "PDB format files are not available for large structures". In your case "large" means proteins with so many chains that they need two-character chain ids.
You cannot store two characters as the chain name in a PDB file.
You have two options now:
Rename the chain "aA" and save the file in PDB format
Don't use the PDB format as your file format but stick to cif
This snippet renames the chain and stores the structure as a pdb file:
[...]
io.set_structure(structure)
for model in structure:
    for chain in model:
        if chain.get_id() == "A":
            chain.id = "_"
            print("renamed chain A to _")
        if chain.get_id() == "aA":
            chain.id = "A"
            print("renamed chain aA to A")
io.save('6kmw_aA.pdb', ChainSelect('A'))
This snippet stores only chain 'aA' in mmCIF format:
from Bio.PDB.mmcifio import MMCIFIO
io = MMCIFIO()
io.set_structure(structure)
io.save("6kmw_aA.cif", ChainSelect('aA'))

How to handle BigTable Scan InvalidChunk exceptions?

I am trying to scan BigTable data where some rows are 'dirty', but this fails depending on the scan, causing (serialization?) InvalidChunk exceptions.
The code is as follows:
from google.cloud import bigtable
from google.cloud import happybase

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)

for key, row in table.scan(limit=5000):  # BOOM!
    pass
Leaving out some columns, limiting the number of rows, or specifying the start and stop keys allows the scan to succeed.
I cannot detect which values are problematic from the stacktrace; it varies across columns, and the scan just fails. This makes it problematic to clean the data at the source.
When I leverage the python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):
ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0
I can confirm this with the HBase shell using the rowkey (which I got from self._row.row_key).
So the question becomes: how can a BigTable scan filter out columns which have an undefined / empty / null value?
I get the same problem from both google cloud APIs that return generators which internally stream data as chunks over gRPC:
google.cloud.happybase.table.Table#scan()
google.cloud.bigtable.table.Table#read_rows().consume_all()
The abbreviated stacktrace is as follows:
---------------------------------------------------------------------------
InvalidChunk Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
1 row_gen = table.scan(limit=n)
2 rows = []
----> 3 for kvp in row_gen:
4 pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
391 while True:
392 try:
--> 393 partial_rows_data.consume_next()
394 for row_key in sorted(rows_dict):
395 curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
273 for chunk in response.chunks:
274
--> 275 self._validate_chunk(chunk)
276
277 if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
388 self._validate_chunk_new_row(chunk)
389 if self.state == self.ROW_IN_PROGRESS:
--> 390 self._validate_chunk_row_in_progress(chunk)
391 if self.state == self.CELL_IN_PROGRESS:
392 self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
368 self._validate_chunk_status(chunk)
369 if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370 _raise_if(not chunk.timestamp_micros or not chunk.value)
371 _raise_if(chunk.row_key and
372 chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
439 """Helper for validation methods."""
440 if predicate:
--> 441 raise InvalidChunk(*args)
InvalidChunk:
Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk?
(try ... except won't work around the generator, which is in the google cloud API row_data PartialRowsData class.)
Also, can you show me code to chunk-stream a table scan in BigTable?
HappyBase batch_size and scan_batching don't seem to be supported.
This was likely due to this bug: https://github.com/googleapis/google-cloud-python/issues/2980
The bug has been fixed, so this should no longer be an issue.
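If you still see InvalidChunk, it is worth confirming that the installed client actually includes that fix. A quick check, assuming the package was installed under the name google-cloud-bigtable:
import pkg_resources

# print the installed client version; releases made after the linked issue was
# closed contain the corrected chunk-validation logic
print(pkg_resources.get_distribution('google-cloud-bigtable').version)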

Why can't the source code of conjugate in Numpy be found using the inspect module?

I want to see the implementation of the conjugate function used in Numpy, so I tried the following:
import numpy as np
import inspect
inspect.getsource(np.conjugate)
However, I received the following error message stating that <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object. Can someone explain why?
In [8]: inspect.getsource(np.conjugate)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-821ecfb71e08> in <module>()
----> 1 inspect.getsource(np.conj)
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsource(object)
699 or code object. The source code is returned as a single string. An
700 IOError is raised if the source code cannot be retrieved."""
--> 701 lines, lnum = getsourcelines(object)
702 return string.join(lines, '')
703
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcelines(object)
688 original source file the first line of code was found. An IOError is
689 raised if the source code cannot be retrieved."""
--> 690 lines, lnum = findsource(object)
691
692 if ismodule(object): return lines, 0
/Users/duanlx/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.pyc in findsource(object)
149 FIXED version with which we monkeypatch the stdlib to work around a bug."""
150
--> 151 file = getsourcefile(object) or getfile(object)
152 # If the object is a frame, then trying to get the globals dict from its
153 # module won't work. Instead, the frame object itself has the globals
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcefile(object)
442 Return None if no way can be identified to get the source.
443 """
--> 444 filename = getfile(object)
445 if string.lower(filename[-4:]) in ('.pyc', '.pyo'):
446 filename = filename[:-4] + '.py'
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getfile(object)
418 return object.co_filename
419 raise TypeError('{!r} is not a module, class, method, '
--> 420 'function, traceback, frame, or code object'.format(object))
421
422 ModuleInfo = namedtuple('ModuleInfo', 'name suffix mode module_type')
TypeError: <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object
Thanks!
Numpy's ufuncs, such as conjugate, are written in C for speed. inspect.getsource can only show the source of functions that are written in Python.
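A quick way to see the difference is to compare the compiled ufunc with a Numpy function that is defined in Python (np.atleast_1d is just one example of the latter):
import numpy as np
import inspect

# conjugate is a compiled ufunc, so there is no Python source file to read
print(type(np.conjugate))              # <class 'numpy.ufunc'>

# functions implemented in Python can be inspected as usual
print(inspect.getsource(np.atleast_1d)[:300])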
