Use Great Expectations to validate pandas DataFrame with existing suite JSON - python

I'm using the Great Expectations python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a great expectations suite based on a .csv file version of the data (call this file ge_suite.json).
GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.
I've tried following the answer to this SO question, with code that looks like this:
import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext
context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")
My datasources section of my great_expectations.yml file looks like this:
datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
When I run the batch = context.get_batch(...) line in Python, I get the following error:
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'
I'm assuming that I need to add something to the definition of the datasource in the great_expectations.yml file to fix this. Or, could it be a versioning issue? I'm not sure. I looked around for a while in the online documentation and didn't find an answer. How do I achieve the "GOAL" (defined above) and get past this error?

If you want to validate an in-memory pandas DataFrame, you can reference the following two pages for information on how to do that:
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe/
To give a concrete example in code though, you can do something like this:
import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
df = pd.read_pickle('/path/to/my/df.pkl')

suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": batch_id},
)

# context.run_checkpoint method looks for checkpoint file on disk. Create one...
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, "expectation_suite_name": suite_name}],
)
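To check the outcome afterwards, the object returned by run_checkpoint can be inspected; a minimal sketch (the success attribute and build_data_docs call below reflect the 0.14.x API as I understand it, so treat them as an assumption):

# Did every expectation in ge_suite pass for this DataFrame?
print(result.success)
# Optionally rebuild Data Docs so this run shows up in the local HTML report.
context.build_data_docs()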

Related

Take my data from my computer and Verify that data is not stolen

As a Python programmer, I recently received a small project to edit and add some functions to (the project is Python/Django).
But while I was working on it, I noticed something unusual: the presence of some Python libraries (hashlib and others) that can take my data (Gmail accounts, passwords, Chrome bookmarks, ...) from the computer.
This script is an example from the code.
import hashlib
import logging
import re

import pandas as pd

from file import reader

logger = logging.getLogger(__name__)


def load_data(user_id, project_id, columns: dict):
    group_files = {}
    df = pd.DataFrame(None)
    for id_, column in columns.items():
        group = column['group']
        if group not in group_files:
            df_file = reader.load_file(user_id, project_id, group)
            group_files[group] = df_file
        if group_files[group] is not None:
            if column['content'] in group_files[group].columns:
                df[column['content']] = group_files[group][column['content']]
    return df


def get_hash(string: str):
    return hashlib.md5(string.encode()).hexdigest()[:5]
My question is: How can I know if they are taking my data from the computer or not?
Thanks in Advance.

Import a JSON project wise, so it loads just once

I have a Python project that performs a JSON validation against a specific schema.
It will run as a Transform step in GCP Dataflow, so it's very important that all dependencies are gathered before the run to avoid downloading the same file again and again.
The schema is placed in a separated Git repository.
The nature of the Transformer is that you receive a single record in your class and you work with it. The typical flow is that you load the JSON Schema, you validate the record against it, and then you do stuff with the invalid and with the valid records. Loading the schema this way means that I download the schema from the repo for every record, and there could be hundreds of thousands of them.
The code gets "cloned" onto the workers, which then work more or less independently.
Inspired by the way Python loads the requirements at the beginning (one single time) and using them as imports, I thought I could add the repository (where the JSON schema lives) as a Python requirement, and then simply use it in my Python code. But of course, it's a JSON, not a Python module to be imported. How can it work?
An example would be something like:
requirements.txt
git+git://github.com/path/to/json/schema#41b95ec
dataflow_transformer.py
import apache_beam as beam
import the_downloaded_schema
from jsonschema import validate


class Verifier(beam.DoFn):
    def process(self, record: dict):
        validate(instance=record, schema=the_downloaded_schema)
        # ... more stuff
        yield record


class Transformer(beam.PTransform):
    def expand(self, record):
        return (
            record
            | "Verify Schema" >> beam.ParDo(Verifier())
        )
You can load the JSON schema once and use it as a side input.
An example:
import json
import requests
import apache_beam as beam  # added: needed for the pipeline operations below

json_current = 'https://covidtracking.com/api/v1/states/current.json'

def get_json_schema(url):
    with requests.Session() as session:
        schema = json.loads(session.get(url).text)
    return schema

schema_json = get_json_schema(json_current)

def feed_schema(data, schema):
    yield {'record': data, 'schema': schema[0]}

# `p` is assumed to be an existing beam.Pipeline
schema = p | beam.Create([schema_json])
data = p | beam.Create(range(10))
data_with_schema = data | beam.FlatMap(feed_schema, schema=beam.pvalue.AsSingleton(schema))
# Now do your schema validation
Just a demonstration of what the data_with_schema PCollection looks like:
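Roughly, each element pairs one input record with the first entry of the downloaded JSON, so printing with data_with_schema | beam.Map(print) would show something like (the exact fields depend on the live JSON, so treat them as placeholders):

{'record': 0, 'schema': {...first entry of the downloaded JSON...}}
{'record': 1, 'schema': {...first entry of the downloaded JSON...}}
...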
Why don't you just use a class for loading your resources that uses a cache in order to prevent double loading? Something along the lines of:
import os


class JsonLoader:
    def __init__(self):
        self.cache = set()

    def load(self, filename):  # `import` is a reserved word in Python, so the method needs a different name
        filename = os.path.abspath(filename)  # os.path.abspath, not os.path.absname
        if filename not in self.cache:
            self._load_json(filename)
            self.cache.add(filename)

    def _load_json(self, filename):
        ...
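If the resources are keyed by filename alone, a lighter-weight sketch of the same caching idea (not from the original answer) is to let functools.lru_cache remember the parsed result:

import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_schema(filename: str) -> dict:
    # First call reads and parses the file; repeat calls with the same
    # filename return the cached object without touching the disk.
    with open(filename) as f:
        return json.load(f)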

How to create a gdal.Dataset or xarray.Dataset object from a django.contrib.gis.gdal.GDALRaster object?

I am working on a Django project in which I'm trying to get all the raster data from my Database.
Here is my model in models.py
from django.contrib.gis.db import models


class RasterWithName(models.Model):
    raster = models.RasterField()
    name = models.TextField()
Here is the method I use to get all the rows from my database in Django's shell.
First I run python manage.py shell and then run the code below, one line at a time:
all_objects = RasterWithName.objects.all()
first_object_in_database = all_objects[0]
print(first_object_in_database)
It prints:
RasterWithName object (1)
Additionally, running the below line,
print(type(first_object_in_database))
prints:
<class 'geo.models.RasterWithName'>
I then run the two lines below:
raster = first_object_in_database.raster
print(type(raster))
Which prints:
<class 'django.contrib.gis.gdal.raster.source.GDALRaster'>
How can I convert this GDALRaster object into a more known object like a gdal Dataset( that can be imported like this: from osgeo.gdal import Dataset) or xarray Dataset (that can be imported like this: from xarray import Dataset)?
EDIT #1:
Thanks to Val, here is a working solution:
from osgeo import gdal

all_objects = RasterWithName.objects.all()
first_object_in_database = all_objects[0]
my_raster = first_object_in_database.raster
gdal_raster = gdal.Open(my_raster.name)  # GDALRaster.name holds the source path in this case
print(type(gdal_raster))
which prints:
<class 'osgeo.gdal.Dataset'>
However, I don't think this would be very optimal since it simply opens the file from its path on my local storage.
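If re-opening the file from disk is the concern, one alternative (just a sketch, assuming numpy is installed so that each GDALBand.data() call returns an array) is to build an xarray object straight from the band data of the Django GDALRaster:

import numpy as np
import xarray as xr

# Stack every band of the Django GDALRaster into a (band, y, x) array.
band_stack = np.stack([band.data() for band in my_raster.bands])
xr_raster = xr.DataArray(
    band_stack,
    dims=("band", "y", "x"),
    attrs={
        "geotransform": my_raster.geotransform,   # origin, pixel size, skew
        "crs_wkt": my_raster.srs.wkt if my_raster.srs else None,
    },
)
print(type(xr_raster), xr_raster.shape)

Calling xr_raster.to_dataset(name="raster") then yields an xarray.Dataset if that exact type is needed.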

Python package with sample datasets but deferred download?

I have a data analysis tool that I made a Python package for and I'd like to include some sample datasets, but I don't want to include all the datasets directly in the Python package because it will bloat the size and slow down install for people who don't use them.
The behavior I want is that when a sample dataset is referenced, it automatically gets downloaded from a URL and saved locally in the package; the next time it is used, the local version is read instead of re-downloading it. This caching should persist permanently for my package, not only for the duration of the Python session.
How can I do this?
I ended up making a folder under AppData using the appdirs package.
datasets.py
import os

import pandas as pd
from pandasgui.utility import get_logger
from appdirs import user_data_dir
from tqdm import tqdm

logger = get_logger(__name__)

__all__ = ["all_datasets",
           "country_indicators",
           "us_shooting_incidents",
           "diamonds",
           "pokemon",
           "anscombe",
           "attention",
           "car_crashes",
           "dots",
           "exercise",
           "flights",
           "fmri",
           "gammas",
           "geyser",
           "iris",
           "mpg",
           "penguins",
           "planets",
           "tips",
           "titanic",
           "gapminder",
           "stockdata"]

dataset_names = [x for x in __all__ if x != "all_datasets"]
all_datasets = {}
root_data_dir = os.path.join(user_data_dir(), "pandasgui", "dataset_files")

# Open local data CSVs if they all exist
if all([os.path.exists(os.path.join(root_data_dir, f"{name}.csv")) for name in dataset_names]):
    for name in dataset_names:
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)

# Download data if it doesn't exist locally
else:
    os.makedirs(root_data_dir, exist_ok=True)
    logger.info(f"Downloading PandasGui sample datasets into {root_data_dir}...")
    pbar = tqdm(dataset_names, bar_format='{percentage:3.0f}% {bar} | {desc}')
    for name in pbar:
        pbar.set_description(f"{name}.csv")
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
        else:
            all_datasets[name] = pd.read_csv(
                os.path.join("https://raw.githubusercontent.com/adamerose/datasets/master/",
                             f"{name}.csv"))
            all_datasets[name].to_csv(data_path, index=False)

# Add the datasets to globals so they can be imported like `from pandasgui.datasets import iris`
for name in all_datasets.keys():
    globals()[name] = all_datasets[name]
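With a module like the above in place (the dataset names come from the __all__ list), usage matches the comment at the end: the first import triggers the one-time download, later imports read the cached CSVs.

from pandasgui.datasets import iris, titanic

print(iris.head())
print(titanic.shape)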

How to know and instantiate only one class implemented in a Python module dynamically

Suppose in "./data_writers/excel_data_writer.py", I have:
from generic_data_writer import GenericDataWriter


class ExcelDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.sheet_name = config.get('sheetname')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_excel(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sheet_name=self.sheet_name,
            index=self.index)
In "./data_writers/csv_data_writer.py", I have:
from generic_data_writer import GenericDataWriter


class CSVDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.delimiter = config.get('delimiter')
        self.encoding = config.get('encoding')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_csv(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sep=self.delimiter,
            encoding=self.encoding,
            index=self.index)
In "./datawriters/generic_data_writer.py", I have:
import os


class GenericDataWriter:
    def __init__(self, config):
        self.output_folder = config.get('output_folder')
        self.output_file_name = config.get('output_file')
        self.output_file_path_and_name = os.path.join(self.output_folder, self.output_file_name)
        self.index = config.get('include_index')  # whether to include the index column from the pandas DataFrame in the output file
Suppose I have a JSON config file that has a key-value pair like this:
{
    "__comment__": "Here, user can provide the path and python file name of the custom data writer module she wants to use.",
    "custom_data_writer_module": "./data_writers/excel_data_writer.py",
    "there_are_more_key_value_pairs_in_this_JSON_config_file": "for other input parameters"
}
In "main.py", I want to import the data writer module based on the custom_data_writer_module provided in the JSON config file above. So I wrote this:
import os
import importlib


def main():
    # Do other things to read and process data
    data_writer_class_file = config.get('custom_data_writer_module')
    data_writer_module = importlib.import_module(
        os.path.splitext(os.path.split(data_writer_class_file)[1])[0])
    dw = data_writer_module.what_should_this_be?  # <=== Here, what should I do to instantiate the right specific data writer (Excel or CSV) class instance?
    for df in dataframes_to_write_to_output_file:
        dw.write_data(df)


if __name__ == "__main__":
    main()
As I asked in the code above, I want to know if there's a way to retrieve and instantiate the class defined in a Python module assuming that there is ONLY ONE class defined in the module. Or if there is a better way to refactor my code (using some sort of pattern) without changing the structure of JSON config file described above, I'd like to learn from Python experts on StackOverflow. Thank you in advance for your suggestions!
You can do this easily with vars:
cls1, = [v for k, v in vars(data_writer_module).items()
         if isinstance(v, type)]
dw = cls1(config)
The comma enforces that exactly one class is found. If the module is allowed to do anything like from collections import deque (or even foo=str), you might need to filter based on v.__module__.
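A sketch of that __module__-based filtering (assuming "defined in the module" means the class's __module__ equals the imported module's __name__):

import inspect

# Keep only classes actually defined in data_writer_module, ignoring names
# it merely imported (e.g. `from collections import deque`).
candidates = [v for v in vars(data_writer_module).values()
              if inspect.isclass(v) and v.__module__ == data_writer_module.__name__]
cls1, = candidates  # still fails unless exactly one class remains
dw = cls1(config)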
