I am trying to learn how to use gravityai and, frankly, I am a bit new to this. I followed https://www.youtube.com/watch?v=i6qL3NqFjs4 from Ania Kubow. At the end I encounter an error message. It appears in gravityai when trying to run the job, i.e. after uploading all zipped files (three .pkl files, one .py file, one .txt file, one .json file), once Docker is initialized and running:
Error running executable: usage: classify_financial_articles.py [-h] {run,serve} ... classify_financial_articles.py: error: argument subcommand: invalid choice: '/tmp/gai_temp/0675f15ca0b04cf98071474f19e38f3c/76f5cdc86a1241af8c01ce1b4d441b0c' (choose from 'run', 'serve').
I do not understand the error message and therefore cannot fix it. Is it an error in the code, or in the configuration on the gravityai platform? At no point do I run the .py file explicitly, so I conclude that it must come from gravityai, yet I don't understand the error. Can anyone help me?
I added the .py file, as it is the one throwing the error:
from gravityai import gravityai as grav
import pickle
import pandas as pd
model = pickle.load(open('financial_text_classifier.pkl', 'rb'))
tfidf_vectorizer = pickle.load(open('financial_text_vectorizer.pkl','rb'))
label_encder = pickle.load(open('financial_text_encoder.pkl', 'rb'))
def process(inPath, outPath):
    # read csv input file
    input_df = pd.read_csv(inPath)
    # vectorize the text body
    features = tfidf_vectorizer.transform(input_df['body'])
    # predict classes
    predictions = model.predict(features)
    # convert output labels back to categories
    input_df['category'] = label_encder.inverse_transform(predictions)
    # save results to csv
    output_df = input_df[['id', 'category']]
    output_df.to_csv(outPath, index=False)
grav.wait_for_requests(process)
I can't find any errors in the .py file
The error you got comes from the line that imports the gravityai library:
from gravityai import gravityai as grav
I believe you need to upload your project to the gravityai platform, and then you will be able to test it there.
To test locally you need to:
1) Add the line "import sys".
2) Comment out or delete the line "grav.wait_for_requests(process)".
3) Add the line "process(inPath=sys.argv[2], outPath=sys.argv[3])".
4) Run from the command line: "python classify_financial_articles.py run test_set.csv test_set_out.csv"
Example code:
from gravityai import gravityai as grav
import pickle
import pandas as pd
import sys
model = pickle.load(open('financial_text_classifier.pkl', 'rb'))
tfidf_vectorizer = pickle.load(open('financial_text_vectorizer.pkl', 'rb'))
label_encoder = pickle.load(open('financial_text_encoder.pkl', 'rb'))
def process(inPath, outPath):
    input_df = pd.read_csv(inPath)
    features = tfidf_vectorizer.transform(input_df['body'])
    predictions = model.predict(features)
    input_df['category'] = label_encoder.inverse_transform(predictions)
    output_df = input_df[['id', 'category']]
    output_df.to_csv(outPath, index=False)
process(inPath=sys.argv[2], outPath=sys.argv[3])
I am pretty new to python and have been working on a data validation program. I am trying to run my main.py file but I am getting the following error.
Traceback (most recent call last):
File "/Users/user/data_validation/main.py", line 5, in <module>
from src.compile_csv import combine_csv_files
File "/Users/user/data_validation/src/compile_csv.py", line 5, in <module>
from helpers.helper_methods import set_home_path
ModuleNotFoundError: No module named 'helpers'
I am using Python version 3.9.4.
This is my folder structure
data_validation
- src
  - files
    validation.csv
    - folder_one
      combined.csv
    - folder_two
      combined_two.csv
  - helpers
    helper_methods.py
  compile_csv.py
  mysql.py
  split.py
main.py
Both compile_csv.py and split.py use methods from helpers.helper_methods.py
My main.py, which throws the error when run, looks like the following:
import os
import sys
import earthpy as et
from src.mysql import insert_data, output_non_matches
from src.compile_csv import combine_csv_files
from src.split import final_merged_file
from src.helpers.helper_methods import set_home_path, does_file_exist
home_path = et.io.HOME
file_path = home_path, "data_validation", "src", "files"
folder_name = sys.argv[1]
def configure_file_path():
    master_aims_file = os.path.join(file_path, "validation.csv")
    output_csv = os.path.join(file_path, "output.csv.csv")
    gdpr_file_csv = set_home_path(folder_name + "_output.csv")
    output_csv = does_file_exist(os.path.join(file_path, folder_name + "_mismatch_output.csv"))
    return output_csv, master_aims_file, gdpr_file_csv
output_csv, master_aims_file, gdpr_file_csv = configure_file_path()
if __name__ == "__main__":
    print("🏁 Finding names which do not match name in master file")
    if (folder_name == "pd") and (folder_name == "wu"):
        final_merged_file()
    else:
        combine_csv_files()
    insert_failures = insert_data(output_csv, master_aims_file)
    output_failures = output_non_matches(output_csv, gdpr_file_csv)
    if insert_failures or output_failures:
        exit("⚠️ There were errors in finding non-matching data, read above for more info")
    os.remove(os.path.join(home_path, "data_validation", "members_data.db"))
    exit(f"✅ mismatches found and have been outputted to {output_csv} in the {folder_name} folder")
From what I understand, in Python 3 we do not need to use __init__.py, and you can use . to define the path during an import, so I am not entirely sure what I am doing wrong.
I am executing the file from /Users/user/data_validation using the following command: python main.py pd
There it is. The error is happening in your compile_csv.py file. I'm guessing in that file you have from helpers.helper_methods import blah. But you need to change it to
from .helpers.helper_methods import blah
OR
from src.helpers.helper_methods import blah
The reason is that imports are resolved relative to the directory you run from (the cwd), not relative to the file where the code lives. So you need to write the import relative to /Users/user/data_validation.
You can also try setting the PYTHONPATH environment variable; see the Python documentation on the module search path for more about it.
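As a rough alternative with the same effect, the src folder can be put on the import path at runtime near the top of main.py; this is only a sketch, mirroring what pointing PYTHONPATH at the src directory would do:
import os
import sys
# make "from helpers.helper_methods import ..." resolvable no matter where the script is launched from
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "src"))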
I installed sklearn in my environment and am now running it in a Jupyter notebook on Windows.
How can I avoid the error:
URLError: urlopen error [Errno 11004] getaddrinfo failed
I am running the following code:
import sklearn
import sklearn.ensemble
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
which gives the error with line 5:
----> 3 newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
I am behind a proxy on my working computer, is there any option to avoid this error and to be able to use the sample datasets?
According to the source code, scikit-learn will download the file from:
https://ndownloader.figshare.com/files/5975967
I am assuming that you cannot reach this location from behind the proxy.
Can you access the dataset by some other means? If yes, then you can download it manually and keep it at the location:
~/scikit_learn_data/
Here ~ refers to the user's home folder. You can use the following code to find the default location of that folder on your system:
from sklearn.datasets import get_data_home
print(get_data_home())
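If the proxy does allow outbound HTTPS when given explicit proxy settings, one way to fetch the archive is with a proxy handler from the standard library. This is only a sketch: the proxy address is a placeholder you would replace with your own, and the file name matches the script below.
import os
import urllib.request
# placeholder proxy address; replace with your actual proxy host and port
proxy = urllib.request.ProxyHandler({'https': 'http://myproxy.example.com:8080'})
opener = urllib.request.build_opener(proxy)
url = 'https://ndownloader.figshare.com/files/5975967'
dest = os.path.join(os.path.expanduser('~'), 'scikit_learn_data', '20newsbydate.tar.gz')
os.makedirs(os.path.dirname(dest), exist_ok=True)
with opener.open(url) as resp, open(dest, 'wb') as out:
    out.write(resp.read())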
Update: once the archive is in place, use the following script to convert it into the cached form in which scikit-learn keeps its data:
import codecs, os, pickle, shutil, tarfile
from sklearn.datasets import load_files
data_folder = os.path.expanduser('~/scikit_learn_data/')  # expand ~ so tarfile/open can resolve the path
target_folder = data_folder + '20news_home/'
tarfile.open(data_folder + '20newsbydate.tar.gz', "r:gz").extractall(path=target_folder)
cache = dict(train=load_files(target_folder + '20news-bydate-train', encoding='latin1'),
             test=load_files(target_folder + '20news-bydate-test', encoding='latin1'))
compressed_content = codecs.encode(pickle.dumps(cache), 'zlib_codec')
with open(data_folder + '20news-bydate_py3.pkz', 'wb') as f:
    f.write(compressed_content)
shutil.rmtree(target_folder)
Scikit-learn always checks whether the dataset exists locally before attempting to download it from the internet; for that it checks the location above.
After that you can run the fetch normally.
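To confirm the local cache is being used rather than the network, the fetch can be run with download_if_missing=False, which makes scikit-learn raise an error instead of attempting a download; a small check along these lines:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
# raises an error if the cached data is not found, instead of trying to download it
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, download_if_missing=False)
print(len(newsgroups_train.data))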
I am trying to use Dataflow to complete a task that requires a .csv and a .json file. From what I understand, I should be able to create a setup.py file that will include these files and distribute them to the workers.
This is how my files are laid out:
pipline.py
setup.py
utils /
-->__init__.py
-->CSV.csv
-->JSON.json
This is my setup.py file:
import setuptools
setuptools.setup(name='utils',
                 version='0.0.1',
                 description='utils',
                 packages=setuptools.find_packages(),
                 package_data={'utils': ['CSV.csv', 'JSON.json']},
                 include_package_data=True)
This is my beam.DoFn function:
class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd
        df_csv = pd.read_csv('CSV.csv')
        df_json = pd.read_json('JSON.json')
        # Do other stuff with dataframes
        yield [stuff]
My pipeline is set up like so:
dataflow_options = ['--job_name=pipline',
                    '--project=pipeline',
                    '--temp_location=gs://pipeline/temp',
                    '--staging_location=gs://pipeline/stage',
                    '--setup_file=./setup.py']
options = PipelineOptions(dataflow_options)
gcloud_options = options.view_as(GoogleCloudOptions)
options.view_as(StandardOptions).runner = 'DataflowRunner'
with beam.Pipeline(options=options) as p:
    update = p | beam.Create(files) | beam.ParDo(DoWork())
Basically I keep getting:
IOError: File CSV.csv does not exist
It doesn't think the .json file exists either, but it errors out before it reaches that step. Either the files are not making it to Dataflow, or I am referencing them incorrectly within the DoFn. Should I actually be putting the files into the data_files parameter of the setup function instead of package_data?
You need to upload the input files to Cloud Storage and reference a gs:// location rather than CSV.csv. I think you ran the code locally with the csv file in the same directory as the code, but running it with DataflowRunner requires the files to be in GCS.
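To illustrate that suggestion, here is a rough sketch of the DoFn reading both files from a bucket through Beam's FileSystems API; the gs:// paths are placeholders for wherever the files are uploaded, and pandas accepts the returned file-like objects directly:
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd
        # read the side files from GCS instead of the worker's local directory
        df_csv = pd.read_csv(FileSystems.open('gs://pipeline/inputs/CSV.csv'))
        df_json = pd.read_json(FileSystems.open('gs://pipeline/inputs/JSON.json'))
        # ... do other stuff with the dataframes ...
        yield element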
My specific issue is exactly as the title says. I have a large raster-processing script in Python and need to perform a clump function, which I cannot find in GDAL/Python, nor have I figured out how to write it myself.
I am getting better with Python all the time but am still newish, and I am learning R for this task (installed R version 3.4.1 (2017-06-30)).
I was able to get rpy2 installed within Python, and after spending a little time learning R, and with help on Stack Overflow, I have been able to perform several 'tests' of rpy2.
The most helpful info for getting rpy2 to respond was to establish where your R is within your Python session or script, from another Stack Overflow answer, as below:
import os
os.environ['PYTHONHOME'] = r'C:\Python27\ArcGIS10.3\Scripts\new_ve_folder\Scripts'
os.environ['PYTHONPATH'] = r'C:\Python27\ArcGIS10.3\Scripts\new_ve_folder\Lib\site-packages'
os.environ['R_HOME'] = r'C:\Program Files\R\R-3.4.1'
os.environ['R_USER'] = r'C:\Python27\ArcGIS10.3\Scripts\new_ve_folder\Lib\site-packages\rpy2'
However, I cannot get the main tests listed in the documentation (http://rpy.sourceforge.net/rpy2/doc-2.1/html/overview.html) to work:
import rpy2.robjects.tests
import unittest
# the verbosity level can be increased if needed
tr = unittest.TextTestRunner(verbosity = 1)
suite = rpy2.robjects.tests.suite()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'suite'
However:
import rpy2.robjects as robjects
pi = robjects.r['pi']
pi[0]
works just fine, as do a few other rpy2.robjects tests I have found. I can create string = ''' f <- function ... etc. ''' blocks and call those from Python.
If I use:
python -m 'rpy2.tests'
I get the following error.
r\Scripts>python -m 'rpy2.tests'
r\Scripts\python.exe: No module named 'rpy2
The documentation states that on Python 2.6 this should report that all tests were successful. I am using Python 2.7, and I also tried this on Python 3.3.
My script for clump starts as shown below.
I do not want to have to install the packages each time I run the script, as they are already installed in my R home.
I would like to use my Python variables if possible.
I need to figure out why rpy2 does not respond as the documentation indicates, or why I am getting errors, and then figure out the correct way to write the clump portion of my Python script.
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
packageNames = ('raster', 'rgdal')
if all(rpackages.isinstalled(x) for x in packageNames):
    have_packages = True
else:
    have_packages = False
if not have_packages:
    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)
    packnames_to_install = [x for x in packageNames if not rpackages.isinstalled(x)]
    if len(packnames_to_install) > 0:
        utils.install_packages(StrVector(packnames_to_install))
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
I have found several ways to call the raster and clump functions from R; however, if I cannot get rpy2 to respond correctly, I am not going to get any of these to work. But since several other tests do work, I am not sure that is the problem.
raster = robjects.r['raster']
raster = importr('raster')
clump = raster.clump
clump = robjects.r.clump
type(raster.clump)
tempDIR = r"C:\Users\script_out\temp"
slope_recode = os.path.join(tempDIR, "step2b_input.img")
outfile = os.path.join(tempDIR, "Rclumpfile.img")
raster.clump(slope_recode, filename=outfile, direction=4, gaps=True, format='HFA', overwrite=True)
This results in the following errors:
Traceback (most recent call last):
File "C:/Python27/ArcGIS10.3/Scripts/new_ve_folder/Scripts/rpy2_practice.py", line 97, in <module>
raster.clump(slope_recode, filename=outfile, direction=4, gaps=True, format='HFA', overwrite=True)
File "C:\Python27\ArcGIS10.3\Scripts\new_ve_folder\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "C:\Python27\ArcGIS10.3\Scripts\new_ve_folder\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'clump' for signature '"character"'
Issues:
1) Testing rpy2 on the command line and in a script (both produce errors, but I am still able to use basic rpy2).
2) Importing the R packages so as not to install them each time.
3) Finally getting my clump call written correctly.
If I have missed something basic, please point me in the right direction. Thanks all.
For your first problem, replace suite = rpy2.robjects.tests.suite() with suite = rpy2.tests.suite().
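Assuming the suite is exposed there as suggested, the test snippet from the question becomes:
import unittest
import rpy2.tests
# run the rpy2 test suite; verbosity can be increased if needed
tr = unittest.TextTestRunner(verbosity=1)
suite = rpy2.tests.suite()
tr.run(suite)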
For your third problem (getting clump to work correctly), you need to create a RasterLayer object in R using the image. I'm not familiar with the raster package, so I can't give you the exact steps.
I will point out the arcpy module is not "pythonic". Normally, strings of filenames are just strings in Python. arcpy is weird in using plain strings to represent objects like map layers.
In your example, slope_recode is just a string. That's why you got the error unable to find an inherited method for function 'clump' for signature '"character"'. It means slope_recode was passed to R as a character value (which it is), and the clump function expects a RasterLayer object. It doesn't know how to handle character values.
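A minimal sketch of that idea, reusing the paths from the question (the clump arguments are copied from the question rather than checked against the raster package documentation):
import os
from rpy2.robjects.packages import importr
raster_pkg = importr('raster')
tempDIR = r"C:\Users\script_out\temp"
slope_recode = os.path.join(tempDIR, "step2b_input.img")
outfile = os.path.join(tempDIR, "Rclumpfile.img")
# build an R RasterLayer from the image first, then clump that object
recode_layer = raster_pkg.raster(slope_recode)
raster_pkg.clump(recode_layer, filename=outfile, direction=4, gaps=True, format='HFA', overwrite=True)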
I got this all to work with the below code.
import gc
import os
import subprocess
import time
import warnings
# scriptPath and tempDIR are defined earlier in the full script
os.environ['PATH'] = os.path.join(scriptPath, 'path\\my_VE\\R\\R-3.4.2\\bin\\x64')
os.environ['PYTHONHOME'] = os.path.join(scriptPath, 'path\\my_VE\\Scripts\\64bit')
os.environ['PYTHONPATH'] = os.path.join(scriptPath, 'path\\my_VE\\Lib\\site-packages')
os.environ['R_HOME'] = os.path.join(scriptPath, 'path\\my_VE\\R\\R-3.4.2')
os.environ['R_USER'] = os.path.join(scriptPath, 'path\\my_VE\\Scripts\\new_ve_folder\\Scripts\\rpy2')
#
import platform
z = platform.architecture()
print(z)
## above will confirm you are working on 64 bit
gc.collect()
## this code snippet will tell you which library path R is reading
command = 'Rscript'
cmd = [command, '-e', ".libPaths()"]
print(cmd)
x = subprocess.Popen(cmd, shell=True)
x.wait()
import rpy2.robjects.packages as rpackages
import rpy2.robjects as robjects
from rpy2.robjects import r
import rpy2.interactive.packages
from rpy2.robjects import lib
from rpy2.robjects.lib import grid
# # grab r packages
print("loading packages from R")
## fails at this point with the following error
## Error: cannot allocate vector of size 232.6 Mb when working with large rasters
rpy2.robjects.packages.importr('raster')
rpy2.robjects.packages.importr('rgdal')
rpy2.robjects.packages.importr('sp')
rpy2.robjects.packages.importr('utils')
# rpy2.robjects.packages.importr('memory')
# rpy2.robjects.packages.importr('dplyr')
rpy2.robjects.packages.importr('data.table')
grid.activate()
# set python variables for R code names
raster = robjects.r['raster']
writeRaster = robjects.r['writeRaster']
# setwd = robjects.r['setwd']
clump = robjects.r['clump']
# head = robjects.r['head']
crs = robjects.r['crs']
dim = robjects.r['dim']
projInfo = robjects.r['projInfo']
slope_recode = os.path.join(tempDIR, "_lope_recode.img")
outfile = os.path.join(tempDIR, "Rclumpfile.img")
recode = raster(slope_recode) # this is taking the image and reading it into R raster package
## https://stackoverflow.com/questions/47399682/clear-r-memory-using-rpy2
gc.collect() # No noticeable effect on memory usage
time.sleep(2)
gc.collect() # Finally, memory usage drops
R = robjects.r
R('memory.limit()')
R('memory.limit(size = 65535)')
R('memory.limit()')
print"starting Clump with rpy2"
clump(recode, filename=outfile, direction=4, gaps="True", format="HFA")
final = raster(outfile)
final = crs("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0,-0,-0,-0,0 +no_defs")
print ("clump file created, CRS accurate, next step")
I am working on a project for my beginners' Python class and have gotten a little stuck. I have three .tif files that I want to do Zonal Statistics for, but I am getting an error. Here is my script:
import arcpy
import os
from arcpy import env
from arcpy.sa import *
env.workspace = r'C:\Users\alvaremi\Documents\Final Project_Python'
path = r'C:\Users\alvaremi\Documents\Final Project_Pythonn'
env.overwriteOutput = 1
arcpy.CheckOutExtension('Spatial')
in_zone_data = 'counties_in_cog.shp'
zone_field = 'NAME'
impervious = os.listdir(env.workspace + '\ImpvClipped')
print impervious
for year in impervious:
    if year.endswith(".tif"):
        outZonalStatistics = ZonalStatistics(in_zone_data, zone_field, year, "MEAN", "NODATA")
        outZonalStatistics.save(year[:8] + 'zonalstats')
print 'Done'
When I run it, I get this error:
ExecuteError: Failed to execute. Parameters are not valid.
ERROR 000865: Input value raster: 2001impvclipped.tif does not exist.
Failed to execute (ZonalStatistics).
I am also unsure of how to save the new files so that they keep the date on them. The files I want to run the Zonal Stats on are "2001impclipped", "2006impclipped", and "2011impclipped".
Thanks!
You need to add the full directory path to the filename in order for Python to find it.
fileName = env.workspace + '\\ImpvClipped\\' + year
ZonalStatistics(in_zone_data, zone_field, fileName, "MEAN", "NODATA")
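Putting that together with the question about keeping the date in the output name, a rough version of the loop could look like this (os.path.join sidesteps the backslash-escaping problems, and the first four characters of each file name carry the year; the output name is only a suggestion):
import os
for year in impervious:
    if year.endswith(".tif"):
        fileName = os.path.join(env.workspace, 'ImpvClipped', year)
        outZonalStatistics = ZonalStatistics(in_zone_data, zone_field, fileName, "MEAN", "NODATA")
        # save as TIFF so the year stays in the name (e.g. 2001_zonalstats.tif)
        outZonalStatistics.save(year[:4] + '_zonalstats.tif')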