I have a data analysis tool that I've packaged as a Python package, and I'd like to include some sample datasets. However, I don't want to bundle all the datasets directly in the package, because that would bloat its size and slow down installation for people who don't use them.
The behavior I want is: when a sample dataset is first referenced, it is automatically downloaded from a URL and saved locally for the package, and on subsequent uses the local copy is read instead of re-downloading it. This cache should persist permanently for my package, not just for the duration of the Python session.
How can I do this?
I ended up making a folder under AppData using the appdirs package and caching the downloaded CSVs there:
datasets.py
import os
import pandas as pd
from pandasgui.utility import get_logger
from appdirs import user_data_dir
from tqdm import tqdm
logger = get_logger(__name__)
__all__ = ["all_datasets",
"country_indicators",
"us_shooting_incidents",
"diamonds",
"pokemon",
"anscombe",
"attention",
"car_crashes",
"dots",
"exercise",
"flights",
"fmri",
"gammas",
"geyser",
"iris",
"mpg",
"penguins",
"planets",
"tips",
"titanic",
"gapminder",
"stockdata"]
dataset_names = [x for x in __all__ if x != "all_datasets"]
all_datasets = {}
root_data_dir = os.path.join(user_data_dir(), "pandasgui", "dataset_files")
# Open local data CSVs if they all exist
if all(os.path.exists(os.path.join(root_data_dir, f"{name}.csv")) for name in dataset_names):
    for name in dataset_names:
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
# Download data if it doesn't exist locally
else:
    os.makedirs(root_data_dir, exist_ok=True)
    logger.info(f"Downloading PandasGui sample datasets into {root_data_dir}...")
    pbar = tqdm(dataset_names, bar_format='{percentage:3.0f}% {bar} | {desc}')
    for name in pbar:
        pbar.set_description(f"{name}.csv")
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
        else:
            # Download the CSV from GitHub, then cache it locally for next time
            all_datasets[name] = pd.read_csv(
                "https://raw.githubusercontent.com/adamerose/datasets/master/"
                f"{name}.csv")
            all_datasets[name].to_csv(data_path, index=False)

# Add the datasets to globals so they can be imported like `from pandasgui.datasets import iris`
for name in all_datasets.keys():
    globals()[name] = all_datasets[name]
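With that module in place, the first import triggers the download and every later import reads the cached CSVs from the appdirs folder, e.g.:

# First run: downloads the CSVs into the user data dir and caches them.
# Later runs: loads the cached files from disk, no network needed.
from pandasgui.datasets import iris, titanic

print(iris.head())
print(titanic.shape)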
I'm using the Great Expectations Python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a Great Expectations suite based on a .csv file version of the data (call this file ge_suite.json).
GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.
I've tried following the answer to this SO question with code that looks like this:
import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext
context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")
My datasources section of my great_expectations.yml file looks like this:
datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
When I run the batch = context.get_batch(...) command in Python I get the following error:
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'
I'm assuming that I need to add something to the definition of the datasource in the great_expectations.yml file to fix this. Or, could it be a versioning issue? I'm not sure. I looked around for a while in the online documentation and didn't find an answer. How do I achieve the "GOAL" (defined above) and get past this error?
If you want to validate an in-memory pandas DataFrame, you can reference the following two pages for information on how to do that:
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe/
To give a concrete example in code though, you can do something like this:
import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest
context = ge.get_context()
df = pd.read_pickle('/path/to/my/df.pkl')
suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": batch_id},
)

# The context.run_checkpoint method looks for a checkpoint file on disk. Create one...
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, "expectation_suite_name": suite_name}],
)
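After the checkpoint runs you can inspect the outcome. A small addition on top of the snippet above, assuming the CheckpointResult object returned by the 0.14.x releases (which exposes an overall success flag):

# True if every expectation in ge_suite passed against the in-memory DataFrame
print(result.success)

# Rebuild the Data Docs so the validation details can be browsed in a browser
context.build_data_docs()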
Imagine that we have the following database structure, with the data stored in Python files ready to be imported:
data_base/
    foo_data/
        rev_1.py
        rev_2.py
    bar_data/
        rev_1.py
        rev_2.py
        rev_3.py
In my main script, I would like to import the last revision of the data available in the folder. For example, instead of doing this:
from data_base.foo_data.rev_2 import foofoo
from data_base.bar_data.rev_3 import barbar
I want to call a method:
import_from_db(path='data_base.foo_data', attr='foofoo', rev='last')
import_from_db(path='data_base.bar_data', attr='barbar', rev='last')
I could take a relative path to the database and use glob.glob to search for the last revision, but for that I would need to know the path to the data_base folder, which complicates things (imagine that the parent folder of data_base is on sys.path, so the from data_base.*** import statements work).
Is there an efficient way to maybe retrieve a full path knowing only part of it (data_base.foo_data)? Other ideas?
I think it's better to just install the last version, but going along with your flow, you may use getattr on the module:
from data_base import foo_data

# Assumes the rev_* modules are exposed as attributes of the foo_data package
# (e.g. imported in its __init__.py); revision numbering starts at 1.
i = 1
your_module = None
while True:
    try:
        your_module = getattr(foo_data, f'rev_{i}')
    except AttributeError:
        break
    i += 1
# Now your_module is the latest rev (or None if no revision was found)
@JohnDoriaN's idea led me to a quite simple solution:
import glob
import os


def import_from_db(import_path, attr, rev_id=None):
    """Import `attr` from a revision module under `import_path` into the global namespace."""
    # Get all the module/folder names
    dir_list = import_path.split('.')
    # Import the parent package (e.g. `from data_base import foo_data`)
    exec(f"from {'.'.join(dir_list[:-1])} import {dir_list[-1]}")
    db_parent = locals()[dir_list[-1]]  # relies on exec/locals() interplay; works in CPython
    # Get an absolute path corresponding to the db_parent folder
    # (works for both regular and namespace packages)
    abs_path = list(db_parent.__path__)[0]
    rev_path = os.path.join(abs_path, 'rev_*.py')
    # Sort so that rev_names[-1] is the last revision
    # (lexicographic sort: note that rev_10 would sort before rev_2)
    rev_names = sorted(os.path.basename(x) for x in glob.glob(rev_path))
    if rev_id is None:
        revision = rev_names[-1]
    else:
        revision = rev_names[rev_id]
    revision = revision.split('.')[0]
    # Import the requested attribute into the global namespace
    exec(f'from {import_path}.{revision} import {attr}', globals())
Some explanations:
Apparently (I didn't know this), we can import a folder as a module; such a package module has a __path__ attribute (found using the built-in dir() function).
glob.glob lets us search for files in a directory using shell-style wildcards (not full regular expressions).
Using exec without explicit namespace arguments imports only into the local namespace (the namespace of the method), so it doesn't pollute the global namespace.
Using exec with globals() allows us to import into the global namespace.
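For reference, the same idea can also be written without exec by using importlib; this is just an alternative sketch under the same data_base layout, with the module and attribute names taken from the question:

import importlib
from pathlib import Path


def import_from_db(import_path, attr, rev_id=None):
    """Return `attr` from the requested revision module instead of injecting it into globals()."""
    package = importlib.import_module(import_path)        # e.g. data_base.foo_data
    pkg_dir = Path(list(package.__path__)[0])
    # Sort revisions numerically by the number after "rev_"
    rev_files = sorted(pkg_dir.glob('rev_*.py'),
                       key=lambda p: int(p.stem.split('_')[1]))
    revision = rev_files[-1 if rev_id is None else rev_id].stem
    module = importlib.import_module(f'{import_path}.{revision}')
    return getattr(module, attr)


# Usage (names from the question):
# foofoo = import_from_db('data_base.foo_data', 'foofoo')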
Python Version 3.7.5
Spark Version 3.0
Databricks Runtime 7.3
I'm currently working with paths in my data lake file system. This is what I'm starting from:
p = dbutils.fs.ls('dbfs:/databricks-datasets/nyctaxi')
print(p)
[FileInfo(path='dbfs:/databricks-datasets/nyctaxi/readme_nyctaxi.txt', name='readme_nyctaxi.txt', size=916),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/reference/', name='reference/', size=0),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/taxizone/', name='taxizone/', size=0),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/tripdata/', name='tripdata/', size=0)]
Now, to turn this into a valid pathlib PosixPath object, I pass it through a function:
from pathlib import Path

def create_valid_path(paths):
    return Path('/dbfs').joinpath(*[part for part in Path(paths).parts[1:]])
the output for tripdata is
PosixPath('/dbfs/databricks-datasets/nyctaxi/tripdata')
Now I want to read this into a Spark DataFrame after collecting a subset of CSVs into a list:
from pyspark.sql.functions import *
df = spark.read.format('csv').load(paths)
this returns
AttributeError: 'PosixPath' object has no attribute '_get_object_id'
The only way I can get this to work is to manually prepend dbfs:/ and convert each item back to a string, but I need pathlib to do some basic I/O operations. Am I missing something simple, or can PySpark simply not read a pathlib object?
e.g.
trip_paths_str = [str(Path('dbfs:').joinpath(*part.parts[2:])) for part in trip_paths]
print(trip_paths_str)
['dbfs:/databricks-datasets/nyctaxi/tripdata/fhv/fhv_tripdata_2015-01.csv.gz',
'dbfs:/databricks-datasets/nyctaxi/tripdata/fhv/fhv_tripdata_2015-02.csv.gz'...]
What about doing this then instead?
from pyspark.sql.functions import *
import os

def db_list_files(file_path):
    file_list = [file.path for file in dbutils.fs.ls(file_path) if os.path.basename(file.path)]
    return file_list

files = db_list_files('dbfs:/FileStore/tables/')
df = spark.read.format('text').load(files)
df.show()
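If you still want to keep working with pathlib objects for your I/O, you can also just convert them back to dbfs: URI strings at the last moment, since Spark expects plain string paths rather than Path objects. A minimal sketch, assuming trip_paths holds PosixPath('/dbfs/...') entries as in the question:

# Spark's load() wants string URIs, so convert the PosixPath objects back
# to "dbfs:/..." strings right before handing them to Spark.
trip_paths_str = [f"dbfs:/{p.relative_to('/dbfs').as_posix()}" for p in trip_paths]
df = spark.read.format('csv').load(trip_paths_str)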
I'm trying to run a Beam script in Python on GCP, following this tutorial:
https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b
but I keep getting the following error:
AttributeError: module 'google.cloud' has no attribute 'storage'
I have google-cloud-storage in my requirements.txt, so I'm really not sure what I'm missing here.
My full script:
import apache_beam as beam
import json
query = """
SELECT
year,
plurality,
apgar_5min,
mother_age,
father_age,
gestation_weeks,
ever_born,
case when mother_married = true then 1 else 0 end as mother_married,
weight_pounds as weight,
current_timestamp as time,
GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())
        new_x = self._pd.DataFrame.from_dict(element,
                                             orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]


# set up pipeline options
options = {'project': 'my-project-name',
           'runner': 'DataflowRunner',
           'temp_location': 'gs://bqr_dump/tmp',
           'staging_location': 'gs://bqr_dump/tmp'}

pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
            query=query,
            use_standard_sql=True))
        | 'Apply Model' >> beam.ParDo(ApplyDoFn())
        | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
            'pzn-pi-sto:beam_test.weight_preds',
            schema='guid:STRING,weight:FLOAT64,time:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
and my requirements.txt:
google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0
This issue usually comes down to two main causes: either the modules are not installed correctly (something broke during installation), or the modules are not imported correctly.
If the cause is a broken installation, reinstalling the packages, or checking them in a clean virtual environment, should be the solution; a similar case to yours was resolved this way.
For the second cause, try changing your code so that all the modules are imported at the beginning of the script, as demonstrated in the official examples. Your code should look something like this:
import apache_beam as beam
import json
import pandas as pd
import pickle as pkl
from google.cloud import storage
...
Let me know if this information helped you!
Make sure you have installed the correct version. The modules Google maintains get constant updates, and if you just run pip install for the required package it will install the latest version of the package.
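As a quick check, you can confirm from inside the environment that actually runs the pipeline which releases are installed (version numbers taken from the requirements.txt above; assumes both packages expose __version__, which recent releases do):

# Verify the installed versions match what is pinned in requirements.txt
import apache_beam
from google.cloud import storage

print(apache_beam.__version__)   # expect 2.20.0
print(storage.__version__)       # expect 1.30.0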
I'm trying to link my Kaggle project to Google Cloud Platform, but I can't seem to get it done even after following this guide: https://cloud.google.com/docs/authentication/getting-started
I still get this error:
DefaultCredentialsError: File /C:/Users/Jirah Marie Navarro/kagglebqtest-2999bd391350.json was not found.
This is my code:
# Replace 'kaggle-competitions-project' with YOUR OWN project id here --
PROJECT_ID = 'kagglebqtest'
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('bqml_example', exists_ok=True)
from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID
# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")
# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()
It seems that the GOOGLE_APPLICATION_CREDENTIALS environment variable is not set correctly.
I suggest you verify that the 'kagglebqtest-2999bd391350.json' file is actually in the path 'C:/Users/Jirah Marie Navarro/'.
I also recommend using a path without spaces, such as 'C:/' or 'C:/credentials/'; the JSON credential file may not be recognized because of the spaces in your path. So you can try something like this (in PowerShell):
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\credentials\kagglebqtest-2999bd391350.json"
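Alternatively (just another option, not part of the original setup), you can point the BigQuery client at the key file from Python itself, either by setting the environment variable before the client is created or by passing the credentials explicitly; the path below assumes you moved the key to C:\credentials as suggested above:

import os
from google.cloud import bigquery
from google.oauth2 import service_account

# Option 1: set the environment variable before creating the client
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\credentials\kagglebqtest-2999bd391350.json'
client = bigquery.Client(project='kagglebqtest', location='US')

# Option 2: load the service-account key explicitly and hand it to the client
creds = service_account.Credentials.from_service_account_file(
    r'C:\credentials\kagglebqtest-2999bd391350.json')
client = bigquery.Client(project='kagglebqtest', location='US', credentials=creds)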