Running bash on python to read from S3 Bucket and saving output

Running bash on python to read from S3 Bucket and saving output - python

I'm trying to run the following bash command on python and save that output into a variable. I'm new to using bash so any help will be appreciated.
Here's my usecase I have data stored in an S3 bucket (let's say the path is s3://test-bucket/folder1/subd1/datafiles/)
in the datafiles folder there are multiple data files:
a1_03_27_2020_N.csv
a1_04_05_2021_O.csv
a1_07_16_2021_N.csv
I'm trying to select the latest file (in this case a1_07_16_2021_N) and then read that data file using pandas
Here's what I have so far
The command to select the latest file
ls -t a1*|head -1
but then I'm not sure how to
1- run that command on python
2- how to save that the output as a variable
(I know this is not correct but something like
latest_file = os.environ['ls -t a1*|head -1'])
Then read the file:
df = pd.read_csv(latest_file)
Thank you in advance again!

Python replaces most shell functionality. You can do the search and filtering in python itself. No need for a callout.
from pathlib import Path
dir_to_search = Path("test-bucket/folder1/subd1/datafiles/")
try:
latest = max(dir_to_search.glob("a1*.csv"), key=lambda path: path.stat().st_mtime)
print(latest)
except ValueError:
print("no csv here")
But if you want to run the shell, several functions in subprocess will do it. For instance,
import subprocess as subp
result = subp.run("ls -t test-bucket/folder1/subd1/datafiles/a1* | head -1",
shell=True,
capture_output=True, text=True).stdout.strip()

Related

PYTHON AND BATCH SCRIPT: Run file if it exists and create if it doesn't

Full Disclaimer: I DO NOT KNOW PYTHON.
Hi Guys,
I have made an AutoHotKey Script for my volume keys. I would like to create a batch file which runs a python file (so if I change computers, I can easily create this scripts) which would do the following
Check if volume_keys.ahk exists in the D Drive
If it exists, run that;
If it doesn't exist, then create a file named volume_keys.ahk and add my script to it.
My script is:
^!NumpadMult::Send {Volume_Mute}
^!NumpadAdd::Send {Volume_Up}
^!NumpadSub::Send {Volume_Down}
I know how to code the .bat file and just need help for the python point-of-view, but I request the community to check it:
#ECHO OFF
ECHO This script will run an AHK Script. If you want to stop this process from happening, then cross this window off.If you want to continye:
pause
cd d:
D:\run_volume_keys_ahk_script.py
I really appreciate any help by the community.
Thanks in advance

You can use the os library for this. Here's what the python program would look like.
import os
if os.path.isfile('D:\\volume_keys.ahk'): # check if it exists
os.system('D:\\volume_keys.ahk') # execute it
else:
with open('D:\\volume_keys.ahk', 'w') as f: # open it in w (write) mode
f.write('^!NumpadMult::Send {Volume_Mute} \
^!NumpadAdd::Send {Volume_Up} \
^!NumpadSub::Send {Volume_Down}') # Write to file
os.system('D:\\volume_keys.ahk') # execute

To activate the ahk script, you might want to use the subprocess module, of which I took the example from here
import subprocess
subprocess.call(["path/to/ahk.exe", "script.ahk"])
Note that you'll have to find the ahk executable on a computer before you can use the script, maybe you want to automatically check that too.
You can set the path you want to check for scripts in one string, and then add the filenames of your scripts as strings to a list. You can use listdir() from the os module to see any files and directories at a given path, then iterate over your scriptnames and check if it exists in that list of files. If it does, run it.
In this example I copy-pasted your script into a string as value for the key 'scriptname' in a dictionary, so that python can actually create the script file. This isn't really a neat way to do it though, you might want to have your scripts prepared in a directory next to your python script and copy them from there. See an example of how here
from os import listdir
from os.path import isfile, join
CHECK_PATH = "D:"
AHK_EXECUTABLE_PATH = "path/to/ahk.exe"
SCRIPTS_TO_CHECK = {'script1.ahk':"""^!NumpadMult::Send {Volume_Mute}
^!NumpadAdd::Send {Volume_Up}
^!NumpadSub::Send {Volume_Down} """, 'script2.ahk':" some other script here"}
files_to_check = set(listdir(CHECK_PATH)) # using a set for fast lookup later
for scriptname, script in SCRIPTS_TO_CHECK.items():
if not scriptname in files_to_check:
print(f"script {scriptname} not found, creating it.")
with open(scriptname, 'w') as file:
file.write(script)
# else
subprocess.call(AHK_EXECUTABLE_PATH, scriptname)

How can I run a binary executable with input file (bash command) in Python?

I have a binary executable named as "abc" and I have a input file called as "input.txt". I can run these with following bash command:
./abc < input.txt
How can I run this bash command in Python, I tried some ways but I got errors.
Edit:
I also need the store the output of the command.
Edit2:
I solved with this way, thanks for the helps.
input_path = path of the input.txt file.
out = subprocess.Popen(["./abc"],stdin=open(input_path),stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
stdout,stderr = out.communicate()
print(stdout)

use os.system
import os
os.system("echo test from shell");

Using subprocess is the best way to invoke system commands and executables. It provides better control than os.system() and is intended to replace it. The python documentation link below provides additional information.
https://docs.python.org/3/library/subprocess.html
Here is a bit of code that uses subprocess to read output from head to return the first 100 rows from a txt file and process it row by row. It gives you the output (out) and any errors (err).
mycmd = 'head -100 myfile.txt'
(out, err) = subprocess.Popen(mycmd, stdout=subprocess.PIPE, shell=True).communicate()
myrows = str(out.decode("utf-8")).split("\n")
for myrow in myrows:
# do something with myrow

This can be done with os module. The following code works perfectly fine.
import os
path = "path of the executable 'abc' and 'input.txt' file"
os.chdir(path)
os.system("./abc < input.txt")
Hope this works :)

Output redirection to multiple files

I need to run a python script for multiple input files and for each one, I want to generate a new corresponding output file (e.g. for input_16jun.txt I want the output file to be 16jun_output.txt). I tried doing something like:
nohup python script.py input_{16..22}jun.txt > {16..22}jun_output.txt &
But I keep getting "ambiguous redirect" error. Does anyone know how to fix this? Or any other better approach?

Looping over each input file like this with bash should work.
for f in input_*.txt; do python script.py $f > "${f:6:-4}"_output.txt; done
Alternatively if you want to do the loop in a python script.
import glob
import os
input_files = glob.glob("input_*.txt")
for f in input_files:
os.system("python script.py {} > {}_output.txt".format(f,f.split("input_")[1].rstrip(".txt")))
If you want to run script.py in parallel (rather than sequentially) you can also consider using the python multiprocessing package.

Get a list of file names from HDFS using python

Hadoop noob here.
I've searched for some tutorials on getting started with hadoop and python without much success. I do not need to do any work with mappers and reducers yet, but it's more of an access issue.
As a part of Hadoop cluster,
there are a bunch of .dat files on the HDFS.
In order to access those files on my client (local computer) using Python,
what do I need to have on my computer?
How do I query for filenames on HDFS ?
Any links would be helpful too.

As far as I've been able to tell there is no out-of-the-box solution for this, and most answers I've found have resorted to using calls to the hdfs command. I'm running on Linux, and have the same challenge. I've found the sh package to be useful. This handles running o/s commands for you and managing stdin/out/err.
See here for more info on it: https://amoffat.github.io/sh/
Not the neatest solution, but it's one line (ish) and uses standard packages.
Here's my cut-down code to grab an HDFS directory listing. It will list files and folders alike, so you might need to modify if you need to differentiate between them.
import sh
hdfsdir = '/somedirectory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]
My output - In this case these are all directories:
[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']
Let's break it down:
To run the hdfs dfs -ls /somedirectory command we can use the sh package like this:
import sh
sh.hdfs('dfs','-ls',hdfsdir)
sh allows you to call o/s commands seamlessly as if they were functions on the module. You pass command parameters as function parameters. Really neat.
For me this returns something like:
Found 366 items
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05
Split that into lines based on new line characters using .split('\n')
Obtain the last 'word' in the string using line.rsplit(None,1)[-1].
To prevent issues with empty elements in the list use if len(line.rsplit(None,1))
Finally remove the first element in the list (the Found 366 items) using [1:]

for the "query for filenames on HDFS" using just raw subprocess library for python 3:
from subprocess import Popen, PIPE
hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]

what do I need to have on my computer?
You need Hadoop installed and running and ofcourse, Python.
How do I query for filenames on HDFS ?
You can try something like this here. I haven't tested the code so don't just rely on it.
from subprocess import Popen, PIPE
process = Popen('hdfs dfs -cat filename.dat',shell=True,stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
check for returncode, std_err
if:
everything is OK, do whatever with stdout
else:
do something in else condition
You can also look at Pydoop which is a Python API for Hadoop.
Although my example include shell=true, you can try running without it as it is a security risk. Why you shouldn't use shell=True?

You should have login access to a node in the cluster. Let the cluster administrator pick the node and setup the account and inform you how to access the node securely. If you are the administrator, let me know if the cluster is local or remote and if remote then is it hosted on your computer, inside a corporation or on a 3rd party cloud and if so whose and then I can provide more relevant information.
To query file names in HDFS, login to a cluster node and run hadoop fs -ls [path]. Path is optional and if not provided, the files in your home directory are listed. If -R is provided as an option, then it lists all the files in path recursively. There are additional options for this command. For more information about this and other Hadoop file system shell commands see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.
An easy way to query HDFS file names in Python is to use esutil.hdfs.ls(hdfs_url='', recurse=False, full=False), which executes hadoop fs -ls hdfs_url in a subprocess, plus it has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py). esutil can be installed with pip install esutil. It is on PyPI at https://pypi.python.org/pypi/esutil, documentation for it is at http://code.google.com/p/esutil/ and its GitHub site is https://github.com/esheldon/esutil.

As JGC stated, the most straightforward thing you could do is start by logging onto (via ssh) one of the nodes (a server that is participating in a Hadoop cluster) and verifying that you have the correct access controls and privileges to:
List your home directory using the HDFS client i.e. hdfs dfs -ls
List the directory of interest that lives in HDFS i.e. hdfs dfs -ls <absolute or relative path to HDFS directory>
Then, in Python, you should use subprocesses and the HDFS client to access the paths of interest, and use the -C flag to exclude unnecessary metadata (to avoid doing ugly post-processing later).
i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])
Afterwards, split the output on new lines and then you will have your list of paths.
Here's an example along with logging and error handling (including for when the directory/file doesn't exist):
from subprocess import Popen, PIPE
import logging
logger = logging.getLogger(__name__)
FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'
class HdfsException(Exception):
pass
def hdfs_ls(dirname):
"""Returns list of HDFS directory entries."""
logger.info('Listing HDFS directory ' + dirname)
proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
(out, err) = proc.communicate()
if out:
logger.debug('stdout:\n' + out)
if proc.returncode != 0:
errmsg = 'Failed to list HDFS directory "' + dirname + '", return code ' + str(proc.returncode)
logger.error(errmsg)
logger.error(err)
if not FAILED_TO_LIST_DIRECTORY_MSG in err:
raise HdfsException(errmsg)
return []
elif err:
logger.debug('stderr:\n' + err)
return out.splitlines()
# dat_files will contain a proper Python list of the paths to the '.dat' files you mentioned above.
dat_files = hdfs_ls('/hdfs-dir-with-dat-files/')

The answer by #JGC was a big help. I wanted a version that was a more transparent function instead of a harder to read one-liner; I also swapped the string parsing to use regex so that it is both more transparent and less brittle to changes in the hdfs syntax. This version looks like this, the same general approach as JGC:
import re
import sh
def get_hdfs_files(directory:str) -> List[str]:
'''
Params:
directory: an HDFS directory e.g. /my/hdfs/location
'''
output = sh.hdfs('dfs','-ls',directory).split('\n')
files = []
for line in output:
match = re.search(f'({re.escape(directory)}.*$)', line)
if match:
files.append(match.group(0))
return files

Wildcard or * for matching a datetime python 2.7

I am trying to match the following string and not having any luck. Below you will find my attempt.
LOG FORMAT:
riskserver.2014-04-07-08:45:01.log
I think I only will need the year month and date. So I was attempting a wildcard * which python 2.7 does not seem to like.
cmd = 'tail -n10000 /opt/rubedo/log/riskserver.'+nowFormat+*'
Help is very much appreciated here. Thanks, I hope I explained this well, and some can understand.
I am using subprocess with grep involved.
tail: cannot open `/opt/rubedo/log/riskserver.2014-04-08' for reading: No such file or directory
grep: not: No such file or directory
EDIT:
now = datetime.datetime.now().strftime("%H:%M:%S")
nowFormat = datetime.datetime.now().strftime("%Y\-%m\-%d")

glob is the splat (*) , eg ls *.txt gets processed by the linux shell into ls f1.txt f2.txt f3.txt f4.txt ...
so that ls actually recieves a list of files that match not the matching string. that is what they mean in the comments
nowFormat = "2014-04-07"
cmd = 'tail -n10000 /opt/rubedo/log/riskserver.'+nowFormat+'*'
os.system(cmd) #this will execute it through your linux shell you should see the output, allthough this call will not give you access to the output in python
or in python
import glob
fnames = glob.glob('/opt/rubedo/log/riskserver.'+nowFormat+'*')
print fnames

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Running bash on python to read from S3 Bucket and saving output - python

Related

PYTHON AND BATCH SCRIPT: Run file if it exists and create if it doesn't

How can I run a binary executable with input file (bash command) in Python?

Output redirection to multiple files

Get a list of file names from HDFS using python

Wildcard or * for matching a datetime python 2.7

Categories

Resources