great_expectations create datasource of csv files on ADLS Gen2 - python

I want to run great_expectations test suites against csv files in my ADLS Gen2. On my ADLS, I have a container called "data" in which I have a file at mypath/test/mydata.csv. I use an InferredAssetAzureDataConnector. I was able to create and test/validate the data source configuration, but I believe there is a "silent" issue that was not caught.
The problem is that I cannot create a test suite based on this data source. When I run great_expectations suite new,
I select (3) to create the suite with the profiler, and then
select my newly created datasource, and then
instead of showing me the available files at the data source, it crashes with the following error (see below for full stacktrace):
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'
When I execute this with a local data source (InferredAssetFilesystemDataConnector), it shows me the available files at the data source for selection in the CLI.
Does the error mean it cannot find the csv file on the ADLS and thus has no names to show? How do I fix this?
My Code to create the data source:
import great_expectations as ge
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
context = ge.get_context()
datasource_name = "my_datasource_name"
example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
    class_name: SparkDFExecutionEngine
    azure_options:
        account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
        credential: <ACCOUNT-KEY>
data_connectors:
    default_inferred_data_connector_name:
        class_name: InferredAssetAzureDataConnector
        azure_options:
            account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
            credential: <ACCOUNT-KEY>
        container: data
        name_starts_with: mypath/test
        default_regex:
            group_names:
                - data_asset_name
            pattern: (.csv)
    default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        assets:
            my_runtime_asset_name:
                batch_identifiers:
                    - runtime_batch_identifier_name
"""
print(example_yaml)
# Test the yml:
context.test_yaml_config(yaml_config=example_yaml)
The output after creating the data source via the Jupyter Notebook:
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is Datasource
Successfully instantiated Datasource
ExecutionEngine class name: SparkDFExecutionEngine
Data Connectors:
default_inferred_data_connector_name : InferredAssetAzureDataConnector
Available data_asset_names (0 of 0):
Unmatched data_references (0 of 0):[]
default_runtime_data_connector_name:RuntimeDataConnector
default_runtime_data_connector_name : RuntimeDataConnector
Available data_asset_names (1 of 1):
my_runtime_asset_name (0 of 0): []
Unmatched data_references (0 of 0):[]
<great_expectations.datasource.new_datasource.Datasource at 0x1cdc9e01f70>
Full error stack:
Traceback (most recent call last):
File "c:\coding\python38\lib\runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\coding\python38\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\coding\myrepo\venv\Scripts\great_expectations.exe\__main__.py", line 7, in <module>
File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\cli.py", line 190, in main
cli()
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1053, in main
rv = self.invoke(ctx)
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "C:\coding\myrepo\venv\lib\site-packages\click\decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 151, in suite_new
_suite_new_workflow(
File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 335, in _suite_new_workflow
raise e
File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 279, in _suite_new_workflow
toolkit.add_citation_with_batch_request(
File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\toolkit.py", line 1020, in add_citation_with_batch_request
and BatchRequest(**batch_request)
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'

I had a mistake in my regex ... with the following pattern it works flawlessly:
default_regex:
    group_names:
        - data_asset_name
    pattern: (.*\.csv)
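For context on why the original pattern produced zero assets: the data connector applies the regex to each blob path it lists, roughly as an anchored match (a simplified sketch below using plain re; the blob path is the example file from the question). `(.csv)` only matches one arbitrary character followed by the literal `csv` at the start of the path, so no path ever matches and no data_asset_name is captured, which is what later crashes the CLI with the missing data_asset_name error.
import re

blob_path = "mypath/test/mydata.csv"

# Original pattern: one arbitrary character followed by "csv", anchored at the
# start of the path -- never matches a real blob path, so 0 data assets are found.
print(re.match(r"(.csv)", blob_path))        # None

# Corrected pattern: capture the whole path ending in ".csv" as data_asset_name.
m = re.match(r"(.*\.csv)", blob_path)
print(m.group(1))                            # mypath/test/mydata.csv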

_gdbm.error: Database needs recovery -- after running out of storage while fetching api data

I'm unfamiliar with this kind of error and haven't found anything helpful online. This is my last hope, since I don't know where else to go with it. I have already tried reinstalling all the libraries and setting up a new venv; I don't trust myself to go much further than that on my own.
The code triggering the error:
from wetterdienst import DWDObservationData, TimeResolution  # import path for TimeResolution may differ by version

observations_daily = DWDObservationData(
    station_ids=station_ids_d,        # station id list defined earlier in the script
    parameter=params_daily,           # parameter list defined earlier in the script
    time_resolution=TimeResolution.DAILY,
    start_date="2015-01-01",
    end_date="2020-10-10",
    tidy_data=True,
    humanize_column_names=True,
)

# observations_hourly is built the same way with the hourly resolution (not shown)
for df in observations_hourly.collect_data():
    name = str(df.STATION_ID.iloc[0]).strip(".0")
    df.to_csv('./data/hourly/{}.csv'.format(name))
    print('{} done'.format(name))
API is found here: https://github.com/earthobservations/wetterdienst
Error:
Traceback (most recent call last):
File "/Users/sashakaun/PycharmProjects/wetter2.0/main.py", line 83, in <module>
for df in observations_hourly.collect_data():
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/api.py", line 178, in collect_data
df_parameter = self._collect_parameter_from_station(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/api.py", line 243, in _collect_parameter_from_station
df_period = collect_climate_observations_data(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 82, in collect_climate_observations_data
filenames_and_files = download_climate_observations_data_parallel(remote_files)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 106, in download_climate_observations_data_parallel
return list(zip(remote_files, files_in_bytes))
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 124, in _download_climate_observations_data
return BytesIO(__download_climate_observations_data(remote_file=remote_file))
File "<decorator-gen-2>", line 2, in __download_climate_observations_data
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 1356, in get_or_create_for_user_func
return self.get_or_create(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 954, in get_or_create
with Lock(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 185, in __enter__
return self._enter()
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 94, in _enter
generated = self._enter_create(value, createdtime)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 178, in _enter_create
return self.creator()
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 920, in gen_value
self.backend.set(key, value)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/backends/file.py", line 239, in set
dbm[key] = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
_gdbm.error: Database needs recovery
Thanks a lot!!
A GDBM file has been corrupted. You need to use gdbmtool to recover the database. Install gdbmtool then run
gdbmtool FILENAME
Where FILENAME is the name of the GDBM database. A prompt will appear, then you can enter
gdbmtool> recover summary
If the database can be recovered it will display a summary of the recovery results, eg:
Recovery succeeded.
Keys recovered: 6870650, failed: 5, duplicate: 0
Buckets recovered: 64830, failed: 2

Setup of Flask architecture for machine learning pipeline

I want to set up a machine learning pipeline that is callable by Flask, but I am facing some issues. These are the docs I have read so far:
https://exploreflask.com/en/latest/views.html#view-decorators
https://flask.palletsprojects.com/en/1.1.x/api/#flask.Flask
Let me explain the pipeline I have in mind:
pull a dataframe from a PostgreSQL database
encode said dataframe to make it ready for most algorithms
split up the data
feed to a pipeline and determine accuracy
store the model in a pickle file
What is working so far:
All parts are working as a regular script
I can just slap all the steps into one huge flask file with one decorator and it would run as well (my emergency solution)
The File Structure
The encoder script:
# Flask main thread
# makes flask start this part as application and not as module
app = Flask('encoder_module')

@app.route('/df_encoder')
def df_encoder(rng=4):
    # encoding stuff
    return df
The Pipeline script (random forest regressor here)
app = Flask('pipeline_module')

@app.route('/pipeline_rfr')
def pipeline_rfr():
    # pipeline stuff
    return grid_search_rfr
The pickle module:
app = Flask('pickle_module')

@app.route('/store_reg_pickle')
def store_pickle():
    """
    Usage of a Pickle Model - Storage of a trained Model
    """
    model = grid_search_rfr
    # specify file name in letter strings
    model_file = "regression_model"
    with open(model_file, mode='wb') as m_f:
        pickle.dump(model, m_f)
    print(f"Model saved in: {os.getcwd()}")
    return model_file
The Main Flask File
# packages
from flask import Flask
from encoder_main_thread import df_encoder
from rfr_pipeline_function import pipeline_rfr
from pickle_call import store_pickle

app = Flask(__name__.split('.')[0])

@app.route('/regression_pipe')
@df_encoder
@pipeline_rfr
@store_reg_pickle
def regression_pipe():
    return 'pipeline done'
The problem is that the return value of the encoder cannot be a DataFrame; a Flask view can only return a string, tuple, etc. Is there a workaround for this?
What I actually want is a seamless handoff of the DataFrame to the pipeline, with the trained model eventually stored as a pickle file saved in the folder.
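To make the intent concrete, here is roughly what the "emergency solution" I mentioned above looks like: everything called as plain Python functions inside a single route, with only a string returned to Flask (a sketch, not the actual file; the argument wiring between the steps is omitted here):
from flask import Flask

# modules/functions from the scripts above
from encoder_main_thread import df_encoder
from rfr_pipeline_function import pipeline_rfr
from pickle_call import store_pickle

app = Flask(__name__)

@app.route('/regression_pipe')
def regression_pipe():
    # Chain the steps as plain Python calls so the DataFrame and the fitted
    # model stay in memory between steps; only a string goes back to Flask.
    df = df_encoder()        # encoded DataFrame
    model = pipeline_rfr()   # fitted model (argument passing omitted in this sketch)
    store_pickle()           # writes the pickle file to the working directory
    return 'pipeline done'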
Separately, with the multi-file setup above, for some reason it cannot detect the pickle import and throws the following error:
Use a production WSGI server instead.
* Debug mode: off
Traceback (most recent call last):
File "C:\ANACONDA3\Scripts\flask-script.py", line 9, in <module>
sys.exit(main())
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 967, in main
cli.main(args=sys.argv[1:], prog_name="python -m flask" if as_module else None)
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 586, in main
return super(FlaskGroup, self).main(*args, **kwargs)
File "C:\ANACONDA3\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\ANACONDA3\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\ANACONDA3\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\ANACONDA3\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\ANACONDA3\lib\site-packages\click\decorators.py", line 73, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "C:\ANACONDA3\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 848, in run_command
app = DispatchingApp(info.load_app, use_eager_loading=eager_loading)
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 305, in __init__
self._load_unlocked()
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 330, in _load_unlocked
self._app = rv = self.loader()
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 388, in load_app
app = locate_app(self, import_name, name)
File "C:\ANACONDA3\lib\site-packages\flask\cli.py", line 240, in locate_app
__import__(module_name)
File "C:\Users\bill-\OneDrive\Dokumente\Docs Bill\TA_files\functions_scripts_storage\flask_test\flask_regression_pipeline.py", line 18, in <module>
@store_reg_pickle
NameError: name 'store_reg_pickle' is not defined
If you wish, I could upload the entire scripts, but that is a lot to look through; since everything works as one long regular piece of code, the mistake must be somewhere in my Flask setup.

RuntimeValueProviderError when creating a google cloud dataflow template with Apache Beam python

I can't stage a cloud dataflow template with python 3.7. It fails on the one parametrized argument with apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible
Staging the template with python 2.7 works fine.
I have tried running dataflow jobs with 3.7 and they work fine. Only the template staging is broken.
Is python 3.7 still not supported in dataflow templates or did the syntax for staging in python 3 change?
Here is the pipeline piece
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class WordcountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input',
            default='gs://dataflow-samples/shakespeare/kinglear.txt',
            help='Path of the file to read from',
            dest="input")


def main(argv=None):
    options = PipelineOptions(flags=argv)
    setup_options = options.view_as(SetupOptions)
    wordcount_options = options.view_as(WordcountOptions)
    with beam.Pipeline(options=setup_options) as p:
        lines = p | 'read' >> ReadFromText(wordcount_options.input)


if __name__ == '__main__':
    main()
Here is the full repo with the staging scripts https://github.com/firemuzzy/dataflow-templates-bug-python3
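For context, staging a template is essentially the normal pipeline invocation with --template_location set; something like the following (project, region, and bucket are placeholders, and the exact flags in the repo's staging scripts may differ):
python run_pipeline.py \
    --runner DataflowRunner \
    --project <PROJECT-ID> \
    --region <REGION> \
    --temp_location gs://<BUCKET>/temp \
    --staging_location gs://<BUCKET>/staging \
    --template_location gs://<BUCKET>/templates/wordcount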
There was a previous similar issue, but I'm not sure how related it is, since that one was about Python 2.7, whereas my template stages fine in 2.7 and only fails in 3.7:
How to create Google Cloud Dataflow Wordcount custom template in Python?
**** Stack Trace ****
Traceback (most recent call last):
File "run_pipeline.py", line 44, in <module>
main()
File "run_pipeline.py", line 41, in main
lines = p | 'read' >> ReadFromText(wordcount_options.input)
File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 906, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __ror__
result = p.apply(self, pvalueish, label)
File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 490, in apply
return self.apply(transform, pvalueish)
File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
return m(transform, input, options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
return transform.expand(input)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/textio.py", line 542, in expand
return pvalue.pipeline | Read(self._source)
File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __ror__
result = p.apply(self, pvalueish, label)
File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
return m(transform, input, options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1020, in apply_Read
return self.apply_PTransform(transform, pbegin, options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
return transform.expand(input)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 863, in expand
return pbegin | _SDFBoundedSourceWrapper(self.source)
File "/usr/local/lib/python3.7/site-packages/apache_beam/pvalue.py", line 113, in __or__
return self.pipeline.apply(ptransform, self)
File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
return m(transform, input, options)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
return transform.expand(input)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1543, in expand
| core.ParDo(self._create_sdf_bounded_source_dofn()))
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1517, in _create_sdf_bounded_source_dofn
estimated_size = source.estimate_size()
File "/usr/local/lib/python3.7/site-packages/apache_beam/options/value_provider.py", line 136, in _f
raise error.RuntimeValueProviderError('%s not accessible' % obj)
apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible
Unfortunately, it looks like templates are broken on Apache Beam's Python SDK 2.18.0.
For now, the solution is to avoid Beam 2.18.0: in your requirements/dependencies, pin apache-beam[gcp]<2.18.0 or apache-beam[gcp]>2.18.0.
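In a requirements.txt this could look like the following, for example (2.17.0 is just one release other than 2.18.0; any other version should do):
# requirements.txt -- any released version except 2.18.0
apache-beam[gcp]==2.17.0
# or, once a newer release is available:
# apache-beam[gcp]>=2.19.0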

msrest.exceptions.ValidationError: Parameter 'resource_group_name' must conform to the following pattern: '^[-\\w\\._\\(\\)]+$'

sathish@Azure:~/quickstart/python-docs-hello-world$ az webapp up -n appname
The command failed with an unexpected error. Here is the traceback:
Parameter 'resource_group_name' must conform to the following pattern: '^[-\\w\\._\\(\\)]+$'.
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 326, in execute
raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 384, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy))
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 375, in _run_job
cmd_copy.exception_handler(ex)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/appservice/commands.py", line 54, in _polish_bad_errors
raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 354, in _run_job
result = cmd_copy(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 145, in __call__
return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/__init__.py", line 451, in default_command_handler
return op(**command_args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/appservice/custom.py", line 2313, in webapp_up
_create_new_rg = should_create_new_rg(cmd, default_rg, rg_name, is_linux)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/appservice/_create_util.py", line 282, in should_create_new_rg
elif (_check_resource_group_exists(cmd, rg_name) and
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/appservice/_create_util.py", line 86, in _check_resource_group_exists
return rcf.resource_groups.check_existence(rg_name)
File "/opt/az/lib/python3.6/site-packages/azure/mgmt/resource/resources/v2018_05_01/operations/resource_groups_operations.py", line 61, in check_existence
'resourceGroupName': self._serialize.url("resource_group_name", resource_group_name, 'str', max_length=90, min_length=1, pattern=r'^[-\w\._\(\)]+$'),
File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 592, in url
data = self.validate(data, name, required=True, **kwargs)
File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 672, in validate
raise ValidationError(key, name, value)
msrest.exceptions.ValidationError: Parameter 'resource_group_name' must conform to the following pattern: '^[-\\w\\._\\(\\)]+$'.
To open an issue, please run: 'az feedback'
I had this problem, but resolved it by removing the quotes around the resource group name. Silly mistake on my behalf, but it may be the issue here.
E.g. use az group create -n my-groupname-rg -l northeurope
rather than az group create -n 'my-groupname-rg' -l 'northeurope'
For the resource group name, you should take a look at the naming rules and restrictions.
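As a quick illustration of why quoting trips the validation, you can check a name against the exact pattern from the error message (a standalone check, not part of the Azure CLI):
import re

# Pattern taken verbatim from the msrest validation error.
rg_pattern = re.compile(r'^[-\w\._\(\)]+$')

print(bool(rg_pattern.match("my-groupname-rg")))      # True  -> accepted
print(bool(rg_pattern.match("'my-groupname-rg'")))    # False -> quotes are not allowed characters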

Strange error adding to Whoosh index

Can anyone help me with this strange error I'm getting when adding a new document to a Whoosh index?
Here's the code:
def add_to_index(self, doc):
    ix = index.open_dir(self.index_dir)
    writer = AsyncWriter(ix)  # use async writer to prevent write lock errors
    writer.add_document(**self.get_doc_args(doc))
    writer.commit()

def get_doc_args(self, doc):
    return {
        'id': u"" + str(doc['id']),
        'org': doc['org__id'],
        'created': doc['created_date'],
        'date': doc['received_date'],
        'from_addr': doc['from_addr'],
        'subject': doc['subject'],
        'body': doc['messagebody__cleaned_message']
    }
I get the following error:
TypeError('ord() expected a character, but string of length 0 found',)
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/celery/execute/trace.py", line 36, in trace
return cls(states.SUCCESS, retval=fun(*args, **kwargs))
File "/usr/local/lib/python2.6/dist-packages/celery/app/task/__init__.py", line 232, in __call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/celery/app/__init__.py", line 172, in run
return fun(*args, **kwargs)
File "/mnt/deploy/prod/chorus/src/chorus/../chorus/search/__init__.py", line 131, in index_message
MessageSearcher().add_to_index(message)
File "/mnt/deploy/prod/chorus/src/chorus/../chorus/search/__init__.py", line 29, in add_to_index
writer.commit()
File "/usr/local/lib/python2.6/dist-packages/whoosh/writing.py", line 423, in commit
self.writer.commit(*args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filewriting.py", line 501, in commit
new_segments = mergetype(self, self.segments)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filewriting.py", line 78, in MERGE_SMALL
reader = SegmentReader(writer.storage, writer.schema, seg)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filereading.py", line 63, in __init__
self.termsindex = TermIndexReader(tf)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 590, in __init__
super(TermIndexReader, self).__init__(dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 502, in __init__
OrderedHashReader.__init__(self, dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 379, in __init__
HashReader.__init__(self, dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 187, in __init__
self.hashtype = dbfile.read_byte()
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/structfile.py", line 219, in read_byte
return ord(self.file.read(1))
Strangely, the exact same code using a standard writer (i.e. not AsyncWriter) works just fine. What am I missing here? Note that in production I have to use AsyncWriter in order to avoid LockErrors.
This error is caused by some kind of index corruption. In my case, the machine crashed for another reason while the index was being rebuilt.
You can solve it by deleting the contents of the whoosh_index folder completely and rebuilding the index.
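A minimal sketch of that rebuild, assuming the index lives in a whoosh_index directory and you still have the schema and source documents available (my_schema and all_documents are placeholders):
import os
import shutil

from whoosh import index

# WARNING: this discards the corrupt index entirely.
shutil.rmtree("whoosh_index", ignore_errors=True)
os.makedirs("whoosh_index")

# Re-create the index from the existing schema and re-add every document.
ix = index.create_in("whoosh_index", schema=my_schema)
writer = ix.writer()
for doc in all_documents:
    writer.add_document(**doc)
writer.commit()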
Ended up finding a solution; it's called Solr :-)
