Pyspark cannot load from pathlib object - python

Python Version 3.7.5
Spark Version 3.0
Databricks Runtime 7.3
I'm currently working with paths in my data lake file system. This is what a listing looks like:
p = dbutils.fs.ls('dbfs:/databricks-datasets/nyctaxi')
print(p)
[FileInfo(path='dbfs:/databricks-datasets/nyctaxi/readme_nyctaxi.txt', name='readme_nyctaxi.txt', size=916),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/reference/', name='reference/', size=0),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/taxizone/', name='taxizone/', size=0),
FileInfo(path='dbfs:/databricks-datasets/nyctaxi/tripdata/', name='tripdata/', size=0)]
Now, to turn this into a valid pathlib PosixPath object, I pass it through a function:
from pathlib import Path

def create_valid_path(paths):
    return Path('/dbfs').joinpath(*Path(paths).parts[1:])
The output for tripdata is:
PosixPath('/dbfs/databricks-datasets/nyctaxi/tripdata')
Now I want to read this into a Spark DataFrame after collecting a subset of CSVs into a list:
from pyspark.sql.functions import *
df = spark.read.format('csv').load(paths)
This returns:
AttributeError: 'PosixPath' object has no attribute '_get_object_id'
Now, the only way I can get this to work is to manually prepend dbfs:/ and convert each item back to a string. However, I need pathlib to do some basic I/O operations. Am I missing something simple, or can PySpark simply not read a pathlib object?
For example:
trip_paths_str = [str(Path('dbfs:').joinpath(*part.parts[2:])) for part in trip_paths]
print(trip_paths_str)
['dbfs:/databricks-datasets/nyctaxi/tripdata/fhv/fhv_tripdata_2015-01.csv.gz',
'dbfs:/databricks-datasets/nyctaxi/tripdata/fhv/fhv_tripdata_2015-02.csv.gz'...]
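For reference, spark.read.load accepts path strings (or a list of strings) rather than pathlib objects, so converting back to dbfs: URIs is what makes it work. A minimal sketch of that round trip (to_dbfs_uri is just an illustrative helper name, assuming trip_paths holds PosixPath objects under /dbfs as above):
from pathlib import Path

def to_dbfs_uri(p: Path) -> str:
    # Drop the local '/dbfs' mount prefix and rebuild the 'dbfs:/' URI
    return 'dbfs:/' + '/'.join(p.parts[2:])

paths = [to_dbfs_uri(p) for p in trip_paths]
df = spark.read.format('csv').load(paths)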

What about doing this instead?
from pyspark.sql.functions import *
import os
def db_list_files(file_path):
    file_list = [file.path for file in dbutils.fs.ls(file_path) if os.path.basename(file.path)]
    return file_list
files = db_list_files('dbfs:/FileStore/tables/')
df = spark.read.format('text').load(files)
df.show()
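Applied to the original question's data, the same helper could be pointed at one of the tripdata subfolders (a sketch; the fhv folder name is taken from the paths printed in the question):
files = db_list_files('dbfs:/databricks-datasets/nyctaxi/tripdata/fhv/')
df = spark.read.format('csv').load(files)
df.show()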

Related

How can I import JSON files

I have a problem. I have several JSON files, and I do not want to manually create collections and import these files by hand. I found the question Bulk import of .json files in arangodb with python, but unfortunately I got an error: [OUT] AttributeError: 'Database' object has no attribute 'collection'.
How can I import several JSON files fully automatically into collections via Python?
from pyArango.connection import *
import json

conn = Connection(username="root", password="")
db = conn.createDatabase(name="test")
a = db.collection('collection_name') # <- here is the error
for x in list_of_json_files:
    with open(x, 'r') as json_file:
        data = json.load(json_file)
    a.import_bulk(data)
I also looked at the documentation from ArangoDB https://www.arangodb.com/tutorials/tutorial-python/
There is no collection method on the db instance, which you are trying to call in your code on this line:
a = db.collection('collection_name') # <- here is the error
According to the docs, you should use the createCollection method of the db instance instead:
studentsCollection = db.createCollection(name="Students")
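For completeness, a rough sketch of the whole loop with pyArango (note that import_bulk appears to come from the python-arango package rather than pyArango; here each file is assumed to contain a list of documents, and the collection name is a placeholder):
import json
from pyArango.connection import Connection

conn = Connection(username="root", password="")
db = conn.createDatabase(name="test")
collection = db.createCollection(name="Students")  # create the collection first

for path in list_of_json_files:            # list_of_json_files as in the question
    with open(path, 'r') as json_file:
        data = json.load(json_file)
    # Assumption: each file contains a list of documents
    for item in data:
        doc = collection.createDocument(item)
        doc.save()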

Python: always import the last revision in the directory

Imagine that we have the following database structure, with the data stored in Python files ready to be imported:
data_base/
    foo_data/
        rev_1.py
        rev_2.py
    bar_data/
        rev_1.py
        rev_2.py
        rev_3.py
In my main script, I would like to import the last revision of the data available in the folder. For example, instead of doing this:
from data_base.foo_data.rev_2 import foofoo
from data_base.bar_data.rev_3 import barbar
I want to call a method:
import_from_db(path='data_base.foo_data', attr='foofoo', rev='last')
import_from_db(path='data_base.bar_data', attr='barbar', rev='last')
I could take a relative path to the database and use glob.glob to search for the last revision, but for that I would need to know the path to the data_base folder, which complicates things (imagine that the parent folder of data_base is in sys.path, so the from data_base.*** import works).
Is there an efficient way to retrieve a full path knowing only part of it (data_base.foo_data)? Other ideas?
I think it's better to just install the latest revision, but going on with your flow, you may use getattr on the module:
from data_base import foo_data

i = 1
your_module = None
while True:
    try:
        # Assumes the rev_* submodules are importable as attributes of foo_data
        # (e.g. they are imported in foo_data/__init__.py)
        your_module = getattr(foo_data, f'rev_{i}')
    except AttributeError:
        break
    i += 1
# Now your_module is the latest rev (or None if no revision was found)
@JohnDoriaN's idea led me to a quite simple solution:
import os, glob

def import_from_db(import_path, attr, rev_id=None):
    """Import `attr` from the requested revision (the latest one by default)."""
    # Get all the module/folder names
    dir_list = import_path.split('.')
    # Import the data folder itself (e.g. `from data_base import foo_data`)
    exec(f"from {'.'.join(dir_list[:-1])} import {dir_list[-1]}")
    db_parent = locals()[dir_list[-1]]
    # Get an absolute path corresponding to the db_parent folder
    abs_path = db_parent.__path__._path[0]
    rev_path = os.path.join(abs_path, 'rev_*.py')
    # Sorted lexicographically, so single-digit revision numbers are assumed
    rev_names = sorted(os.path.basename(x) for x in glob.glob(rev_path))
    if rev_id is None:
        revision = rev_names[-1]
    else:
        revision = rev_names[rev_id]
    revision = revision.split('.')[0]
    # Import the attribute into the global namespace
    exec(f'from {import_path}.{revision} import {attr}', globals())
Some explanations:
Apparently (I didn't know this), we can import a folder (a package) as a module; this module has a __path__ attribute (found using the built-in dir function).
glob.glob lets us use shell-style wildcard patterns (not full regular expressions) to search for files matching a pattern in the directory.
Using exec without an explicit namespace imports only into the local namespace (the namespace of the method), so it doesn't pollute the global namespace.
Using exec with globals() imports into the global namespace.
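A tiny standalone illustration of the exec namespace point (run it as a fresh script, so os is not already imported at module level):
def local_import():
    exec('import os')              # no explicit namespace: binding stays local to this call
    return 'os' in globals()       # False

def global_import():
    exec('import os', globals())   # explicit globals(): binds os at module level
    return 'os' in globals()       # True

print(local_import(), global_import())   # False True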

Python - esy-OSMfilter error when trying to work through the example - OSM_raw_data does not exist

I am getting the following error when trying to go through the esy-osm example:
INFO:esy.osmfilter.pre_filter:OSM_raw_data does not exist
I am using python 3.8 on Windows and the code I am using is below:
import os, sys
import configparser, contextlib
from esy.osmfilter import osm_colors as cc
from esy.osmfilter import run_filter
from esy.osmfilter import Node, Way, Relation
PBF_inputfile = os.path.join(os.getcwd(), 'Geospatial_data\OSM_raw/liechtenstein-latest.osm.pbf')
JSON_outputfile = os.path.join(os.getcwd(), 'Geospatial_data/OSM_filtered/liechtenstein.json')
prefilter = {Node: {}, Way: {"man_made":["pipeline",],}, Relation: {}}
whitefilter = []
blackfilter = []
[Data, _] = run_filter('noname',
                       PBF_inputfile,
                       JSON_outputfile,
                       prefilter,
                       whitefilter,
                       blackfilter,
                       NewPreFilterData=True,
                       CreateElements=False,
                       LoadElements=False,
                       verbose=True)
print(len(Data['Node']))
print(len(Data['Relation']))
print(len(Data['Way']))
Does anyone know where I am going wrong on this?
The PBF file cannot be found.
Please look at the path separator on your machine.
For Windows it's '\' and for Unix it's '/'.
You have used both simultaneously:
'Geospatial_data\OSM_raw/liechtenstein-latest.osm.pbf'
Cheers,
Adam
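For example, one portable way to build those paths is to let os.path.join supply the separator (a sketch using the file names from the question):
import os

PBF_inputfile = os.path.join(os.getcwd(), 'Geospatial_data', 'OSM_raw',
                             'liechtenstein-latest.osm.pbf')
JSON_outputfile = os.path.join(os.getcwd(), 'Geospatial_data', 'OSM_filtered',
                               'liechtenstein.json')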

module google.cloud has no attribute storage

I'm trying to run an Apache Beam script in Python on GCP, following this tutorial:
https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b
but I keep getting the following error:
AttributeError: module 'google.cloud' has no attribute 'storage'
I have google-cloud-storage in my requirements.txt, so I'm really not sure what I'm missing here.
My full script:
import apache_beam as beam
import json
query = """
SELECT
year,
plurality,
apgar_5min,
mother_age,
father_age,
gestation_weeks,
ever_born,
case when mother_married = true then 1 else 0 end as mother_married,
weight_pounds as weight,
current_timestamp as time,
GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())
        new_x = self._pd.DataFrame.from_dict(element,
                                             orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]

# set up pipeline options
options = {'project': 'my-project-name',  # project ID placeholder
           'runner': 'DataflowRunner',
           'temp_location': 'gs://bqr_dump/tmp',
           'staging_location': 'gs://bqr_dump/tmp'}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
            query=query,
            use_standard_sql=True))
        | 'Apply Model' >> beam.ParDo(ApplyDoFn())
        | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
            'pzn-pi-sto:beam_test.weight_preds',
            schema='guid:STRING,weight:FLOAT64,time:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
and my requirements.txt:
google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0
This issue is usually related to two main causes: either the modules were not installed correctly (something broke during the installation), or the module is not being imported correctly.
To fix the issue, in case the reason is broken modules, reinstalling them or checking them in a virtual environment would be the solution. As indicated here, in a similar case to yours, this should fix your case.
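A quick sanity check you could run in that environment (a sketch; use the same interpreter the pipeline uses):
from google.cloud import storage

print(storage.Client)  # should print the class, not raise AttributeError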
For the second reason, try to change your code and import all the modules at the beginning of the code, as demonstrated in this official example here. Your code should be something like this:
import apache_beam as beam
import json
import pandas as pd
import pickle as pkl
from google.cloud import storage
...
Let me know if this information helped you!
Make sure you have installed the correct version. The modules Google maintains receive constant updates; if you just run pip install for the required package, it will install the latest version of that package.

Python package with sample datasets but deferred download?

I have a data analysis tool that I made a Python package for and I'd like to include some sample datasets, but I don't want to include all the datasets directly in the Python package because it will bloat the size and slow down install for people who don't use them.
The behavior I want is that when a sample dataset is referenced, it automatically gets downloaded from a URL and saved to the package locally, but the next time it is used it reads the local version instead of re-downloading it. This caching should persist permanently for my package, not only for the duration of the Python session.
How can I do this?
I ended up making a folder under AppData using the appdirs package
datasets.py
import os

import pandas as pd
from pandasgui.utility import get_logger
from appdirs import user_data_dir
from tqdm import tqdm

logger = get_logger(__name__)

__all__ = ["all_datasets",
           "country_indicators",
           "us_shooting_incidents",
           "diamonds",
           "pokemon",
           "anscombe",
           "attention",
           "car_crashes",
           "dots",
           "exercise",
           "flights",
           "fmri",
           "gammas",
           "geyser",
           "iris",
           "mpg",
           "penguins",
           "planets",
           "tips",
           "titanic",
           "gapminder",
           "stockdata"]

dataset_names = [x for x in __all__ if x != "all_datasets"]
all_datasets = {}
root_data_dir = os.path.join(user_data_dir(), "pandasgui", "dataset_files")

# Open local data CSVs if they all exist
if all([os.path.exists(os.path.join(root_data_dir, f"{name}.csv")) for name in dataset_names]):
    for name in dataset_names:
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)

# Download data if it doesn't exist locally
else:
    os.makedirs(root_data_dir, exist_ok=True)
    logger.info(f"Downloading PandasGui sample datasets into {root_data_dir}...")
    pbar = tqdm(dataset_names, bar_format='{percentage:3.0f}% {bar} | {desc}')
    for name in pbar:
        pbar.set_description(f"{name}.csv")
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
        else:
            all_datasets[name] = pd.read_csv(
                os.path.join("https://raw.githubusercontent.com/adamerose/datasets/master/",
                             f"{name}.csv"))
            all_datasets[name].to_csv(data_path, index=False)

# Add the datasets to globals so they can be imported like `from pandasgui.datasets import iris`
for name in all_datasets.keys():
    globals()[name] = all_datasets[name]
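With that module in place, usage looks like this (a sketch, assuming the pandasgui package layout above):
from pandasgui.datasets import iris, titanic

print(iris.shape)
print(titanic.head())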
