Accessing external file in Python UDF


I am using Hive with a Python UDF. I defined a SQL file in which I add the Python UDF and call it. So far so good: I can process my query results with my Python function.
However, I now need to use an external .txt file in my Python UDF. I uploaded that file to my cluster (the same directory as the .sql and .py files) and also added it in my .sql file using this command:
ADD FILE /home/ra/stopWords.txt;
When I open this file in my Python UDF like this:
file = open("/home/ra/stopWords.txt", "r")
I get several errors. I cannot figure out how to add nested files and use them in Hive.
Any ideas?

All added files are located in the current working directory (./) of the UDF script, not at their original path.
If you add a single file using ADD FILE /dir1/dir2/dir3/myfile.txt, its path will be
./myfile.txt
If you add a directory using ADD FILE /dir1/dir2, the file's path will be
./dir2/dir3/myfile.txt
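
As a minimal sketch of what this means inside the UDF, the script below opens the stop-word file by its bare name, since ADD FILE /home/ra/stopWords.txt places it at ./stopWords.txt. The function names are hypothetical, and the sketch assumes a streaming UDF (rows arrive on stdin, one per line) and a file with one stop word per line; neither detail is stated in the question.

```python
import os
import sys


def load_stop_words(path="./stopWords.txt"):
    """Read one stop word per line from a file in the working directory."""
    with open(path, "r") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stop_words(lines, stop_words):
    """Yield each input line with the stop words dropped."""
    for line in lines:
        kept = [word for word in line.split() if word.lower() not in stop_words]
        yield " ".join(kept)


def main():
    # Files added with ADD FILE land in the task's working directory,
    # so only the base name of the original path is needed here.
    stop_words = load_stop_words(os.path.basename("/home/ra/stopWords.txt"))
    for cleaned in remove_stop_words(sys.stdin, stop_words):
        print(cleaned)
```

In the real UDF script, main() would be called at the bottom of the file.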

Related

Python/Jupyter getting a FileNotFoundError when attempting to read an excel file however said file is in the correct directory

data = pd.read_excel("ETH-USD")
I continually receive an error message informing me that the file cannot be found.
This is despite the fact that:
1: my working directory has been changed within Python to the folder where the Excel file is stored
2: the file name is entered exactly as displayed
Any ideas on why it is unable to find my file?

Is it possible that the excel file has an extension of .xlsx, but your file explorer is set to "hide file extensions"? Try:
data = pd.read_excel("ETH-USD.xlsx")
Or, see what's in the current directory by running:
import os
print(os.listdir())
A tip from the comments:
Windows tip too: hold Shift, right-click the Excel file and copy as path, then you can see its full path (even if you don't enable viewing file extensions in the file browser). – creanion
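
If the extension is hidden, listing every file whose name matches without its extension can reveal the real file name. This is only a sketch: find_candidates is a hypothetical helper, and the demo runs in a throwaway directory with made-up files.

```python
import tempfile
from pathlib import Path


def find_candidates(stem, directory="."):
    """Return names of files in `directory` whose name minus extension equals `stem`."""
    return sorted(
        p.name for p in Path(directory).iterdir()
        if p.is_file() and p.stem == stem
    )


# Demo in a temporary directory (file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ETH-USD.xlsx").touch()
    Path(tmp, "notes.txt").touch()
    matches = find_candidates("ETH-USD", tmp)

print(matches)  # ['ETH-USD.xlsx']
```

Calling find_candidates("ETH-USD") with no directory searches the current working directory, which is where pd.read_excel looks for a bare file name.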

Often, when running Python scripts from an IDE, the working directory (where you are running the script from) doesn't match the location of your script, which is why I find it much more reliable to use this instead:
import os
import pandas as pd
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
data = pd.read_excel(os.path.join(BASE_DIR, "ETH-USD"))
To add: while I do not use Jupyter, in VS Code, which I use, the working directory (where Python looks when you pass read_excel() a path that is not a full path) is often the folder currently opened in the editor, so I expect something similar is causing your issue.

databricks python dbutils can't move file from one directory to another

I have a file that I can see in my current working directory:
%sh
pwd
ls
The output of above is:
/databricks/driver
conf
sample.csv
logs
I want to move the sample.csv file from here to the Workspace/Shared directory, for which I am using dbutils.fs.mv:
dbutils.fs.mv("dbfs:/databricks/driver/sample.csv","dbfs:/Workspace/Shared/")
but this gives the error java.io.FileNotFoundException: dbfs:/databricks/driver/sample.csv
How do I resolve this error?

When you execute a command via %sh, it runs on the driver node, so the file is local to that node. But you're trying to move the file as if it were already on DBFS, so it isn't found. You need to change the scheme from dbfs to file to point to the file on the driver node, like this:
dbutils.fs.mv("file:///databricks/driver/sample.csv","dbfs:/Workspace/Shared/")
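
To make the scheme change explicit, a small helper can build the file: URI from the absolute path that pwd and ls report. This is a sketch only; driver_local_uri is a hypothetical name, and the dbutils call is left as a comment because dbutils exists only inside a Databricks notebook.

```python
from urllib.parse import quote


def driver_local_uri(path):
    """Turn an absolute driver-node path into a file:// URI."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return "file://" + quote(path)


source = driver_local_uri("/databricks/driver/sample.csv")
print(source)  # file:///databricks/driver/sample.csv

# On a Databricks cluster the move from the answer would then be:
# dbutils.fs.mv(source, "dbfs:/Workspace/Shared/")
```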

Cannot read file from NAS

I am trying to read an Excel file from a NAS using Jupyter Notebook (macOS, Python 3, Synology DS218+).
The script worked absolutely fine when the file was stored locally, but I cannot determine the correct code or file path adjustment to access the file once it was moved to the NAS.
I am logged into the NAS and from Mac Finder the file path is:
Server: smb://NAS/home/folder/file.xlsx
I have reviewed...
How to read an excel file directly from a Server with Python
Python - how to read path file/folder from server
https://apple.stackexchange.com/questions/337472/how-to-access-files-on-a-networked-smb-server-with-python-on-macos
... and tried numerous code variations as a result, but with no success.
I am using:
pd.read_excel("//NAS/home/folder/file.xlsx", sheet_name='total', header=84, index_col=0, usecols='B,AL,DC', skiprows=0, parse_dates=True).dropna()
But regardless of the code/file path variation, the same error is returned:
FileNotFoundError [Errno 2] No such file or directory: //NAS/home/folder/file.xlsx

I finally located the correct code / file path adjustment necessary to read the file. See https://apple.stackexchange.com/questions/349102/accessing-a-nas-device-from-python-code-on-mac-os
Although the drive was mapped and Finder showed the file path as "smb://NAS/home/folder/file.xlsx", copying the file path from Finder to the clipboard with Control+Command+C returned "/Volumes/home/folder/file.xlsx". Using this path located the file and allowed it to be read.
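
The mapping described here (drop the server name, prefix the share with /Volumes) can be sketched as a small helper. It assumes the share is mounted under /Volumes with its own name, as in this answer; the actual mount name can differ, so it is worth checking what is listed under /Volumes. The function name is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def smb_url_to_mount_path(url, mount_root="/Volumes"):
    """Translate smb://server/share/path into mount_root/share/path."""
    parts = urlsplit(url)
    if parts.scheme != "smb":
        raise ValueError("expected an smb:// URL")
    # Drop the server name; keep the share and everything below it.
    return str(PurePosixPath(mount_root) / unquote(parts.path).lstrip("/"))


path = smb_url_to_mount_path("smb://NAS/home/folder/file.xlsx")
print(path)  # /Volumes/home/folder/file.xlsx
```

The resulting string is what would be passed to pd.read_excel in place of the //NAS/... path.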

Where does Python search for files when using open?

I have the simplest file opening line in my code.
file = open("file.txt", "r+")
Where does python search for files? The only location that works for me is
C:/Users/useraccount
file = open("file.txt", "w")
This also creates the file in that specific location.
It won't open the file if the file is in the exact same folder as the python script itself.
Also, if I make it
file = open("folder/file.txt", "r+")
it will not open the file if the file is in C:/Users/*useraccount*/folder.
Is it possible to open files that aren't in that specific location?

If you pass a relative path, like file.txt, Python resolves it relative to the current working directory, which is the directory you ran the command from, not the directory that contains the script.
If you are in C:/Users/useraccount/ and you try to open file.txt, then Python tries to open C:/Users/useraccount/file.txt.
Similarly, if it's folder/file.txt, then Python tries to open C:/Users/useraccount/folder/file.txt.
To open files elsewhere, build an absolute path with the functions in the os.path module.

If you use relative paths, they will be relative to the current working directory. To find out the current working directory, run the following code snippet from Python.
import os
print(os.getcwd())
To avoid depending on the working directory, specify an absolute path.
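
A short sketch of both ideas: resolve shows the absolute path a relative name currently maps to, and beside_script builds a path next to a given script. Both helper names are hypothetical.

```python
import os


def resolve(path):
    """Return the absolute path that open(path) would use right now."""
    return os.path.abspath(path)


def beside_script(name, script_path):
    """Build a path to `name` in the same folder as the script at `script_path`."""
    return os.path.join(os.path.dirname(os.path.abspath(script_path)), name)


print(os.getcwd())          # the directory relative paths start from
print(resolve("file.txt"))  # that directory with file.txt appended
```

Inside a script, open(beside_script("file.txt", __file__), "r+") would open the file that sits in the same folder as the script, regardless of where the script was started from.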

How to place output of python script compiling latex document at desired location using "execute_process" command of cmake?

I have a CMake file from which I am executing a Python script using execute_process as follows:
execute_process (COMMAND C:/Programs/Python27/python.exe "C:/packaging/doc/release_doc.py"
--var_ProjectName "${TARGET}"
--var_version "${_VERSION}" OUTPUT_FILE "C:/packaging/doc/")
When CMake runs, I do not get the output at the location specified in OUTPUT_FILE. The Python script I am executing compiles a LaTeX .tex file and so generates the corresponding PDF document along with the .log, .aux and .out files.
When I execute the Python script from the directory where the .tex document is located, all the files are generated in that same directory and the PDF is correctly aligned. But when I execute the same Python script from CMake, all four files are placed in a location different from the .tex document and the PDF alignment is disrupted.
So, please suggest how I can execute this Python script from within CMake and have the output files written to the location I want.

Is "C:/packaging/doc/" a folder? OUTPUT_FILE expects a file name, so please replace it with something like "C:/packaging/doc/output.txt".
You can also try using OUTPUT_VARIABLE and check whether it makes any difference:
execute_process (COMMAND C:/Programs/Python27/python.exe "C:/packaging/doc/release_doc.py" --var_ProjectName "${TARGET}" --var_version "${_VERSION}" OUTPUT_VARIABLE test)
message(${test})
file(WRITE "C:/packaging/doc/output2.txt" "${test}")
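
The question notes that the script behaves correctly when started from the folder containing the .tex file, which suggests the generated files follow the working directory. One hedged way to handle that on the Python side is to run the LaTeX command with its working directory set explicitly. The contents of release_doc.py are not shown, so run_in_directory is a hypothetical helper, the sketch targets Python 3 rather than the Python 2.7 interpreter in the question, and the demo uses a child Python process in place of the real LaTeX command.

```python
import os
import subprocess
import sys
import tempfile


def run_in_directory(command, directory):
    """Run `command` with `directory` as its working directory; return stdout."""
    result = subprocess.run(command, cwd=directory, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Demo: a child process reports the working directory it was started in.
with tempfile.TemporaryDirectory() as tex_dir:
    reported = run_in_directory(
        [sys.executable, "-c", "import os; print(os.getcwd())"], tex_dir
    ).strip()
    same_place = os.path.realpath(reported) == os.path.realpath(tex_dir)

print(same_place)
```

In the real script, the command would be the LaTeX invocation and the directory would be the folder that holds the .tex file.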
