Get a list of file names from HDFS using python

Hadoop noob here.
I've searched for some tutorials on getting started with Hadoop and Python without much success. I do not need to do any work with mappers and reducers yet; it's more of an access issue.
As part of the Hadoop cluster, there are a bunch of .dat files on HDFS.
In order to access those files on my client (local computer) using Python, what do I need to have on my computer?
How do I query for filenames on HDFS?
Any links would be helpful too.

As far as I've been able to tell, there is no out-of-the-box solution for this, and most answers I've found resort to calls to the hdfs command. I'm running on Linux and have the same challenge. I've found the sh package to be useful; it handles running OS commands for you and manages stdin/stdout/stderr.
See here for more info on it: https://amoffat.github.io/sh/
Not the neatest solution, but it's one line (ish) and uses standard packages.
Here's my cut-down code to grab an HDFS directory listing. It will list files and folders alike, so you might need to modify if you need to differentiate between them.
import sh
hdfsdir = '/somedirectory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]
My output - In this case these are all directories:
[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']
Let's break it down:
To run the hdfs dfs -ls /somedirectory command we can use the sh package like this:
import sh
sh.hdfs('dfs','-ls',hdfsdir)
sh allows you to call OS commands seamlessly as if they were functions on the module. You pass command parameters as function parameters. Really neat.
For me this returns something like:
Found 366 items
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+ - impala hive 0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05
Split that into lines based on newline characters using .split('\n').
Obtain the last 'word' in each line using line.rsplit(None,1)[-1].
To prevent issues with empty elements in the list, use if len(line.rsplit(None,1)).
Finally, remove the first element in the list (the Found 366 items header) using [1:].
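If you do need to differentiate files from directories, as noted above, one option is to also look at the first character of the permissions column instead of keeping only the last field. A rough sketch (untested, same hdfsdir as above):
import sh

hdfsdir = '/somedirectory'
files, dirs = [], []
for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n'):
    parts = line.rsplit(None, 7)      # permissions, repl, owner, group, size, date, time, path
    if len(parts) < 8:
        continue                      # skips the "Found N items" header and blank lines
    # a leading 'd' in the permissions column marks a directory
    (dirs if parts[0].startswith('d') else files).append(parts[-1])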

for the "query for filenames on HDFS" using just raw subprocess library for python 3:
from subprocess import Popen, PIPE
hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]
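One thing worth adding to the snippet above: check the return code before parsing, otherwise a missing directory silently yields an empty list. A sketch (untested) building on the same call:
from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# returncode is populated once communicate() has finished
if process.returncode != 0:
    raise RuntimeError(std_err.decode())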

what do I need to have on my computer?
You need Hadoop installed and running, and of course, Python.
How do I query for filenames on HDFS?
You can try something like the snippet below. I haven't tested the code, so don't rely on it blindly.
from subprocess import Popen, PIPE
process = Popen('hdfs dfs -cat filename.dat',shell=True,stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
Check the return code and std_err, for example:
if process.returncode == 0:
    # everything is OK, do whatever you need with std_out
    ...
else:
    # something went wrong, handle it here (std_err has the details)
    ...
You can also look at Pydoop which is a Python API for Hadoop.
Although my example includes shell=True, you can try running without it, since shell=True is a security risk. See: Why you shouldn't use shell=True?
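For example, the same command without shell=True, passing the arguments as a list so no shell is involved (a sketch, untested):
from subprocess import Popen, PIPE

# no shell is spawned when the command is given as an argument list
process = Popen(['hdfs', 'dfs', '-cat', 'filename.dat'], stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()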

You should have login access to a node in the cluster. Let the cluster administrator pick the node, set up the account, and tell you how to access the node securely. If you are the administrator, let me know whether the cluster is local or remote, and if remote, whether it is hosted on your computer, inside a corporation, or on a third-party cloud (and if so, whose), and then I can provide more relevant information.
To query file names in HDFS, login to a cluster node and run hadoop fs -ls [path]. Path is optional and if not provided, the files in your home directory are listed. If -R is provided as an option, then it lists all the files in path recursively. There are additional options for this command. For more information about this and other Hadoop file system shell commands see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.
An easy way to query HDFS file names in Python is to use esutil.hdfs.ls(hdfs_url='', recurse=False, full=False), which executes hadoop fs -ls hdfs_url in a subprocess, plus it has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py). esutil can be installed with pip install esutil. It is on PyPI at https://pypi.python.org/pypi/esutil, documentation for it is at http://code.google.com/p/esutil/ and its GitHub site is https://github.com/esheldon/esutil.
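Based on the signature above, usage presumably looks something like this (an untested sketch; the path is a placeholder):
from esutil import hdfs

# recurse=True would descend into sub-directories; full=True keeps the full paths
files = hdfs.ls('/somedirectory', recurse=False, full=True)
print(files)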

As JGC stated, the most straightforward thing you could do is start by logging onto (via ssh) one of the nodes (a server that is participating in a Hadoop cluster) and verifying that you have the correct access controls and privileges to:
List your home directory using the HDFS client i.e. hdfs dfs -ls
List the directory of interest that lives in HDFS i.e. hdfs dfs -ls <absolute or relative path to HDFS directory>
Then, in Python, you should use subprocesses and the HDFS client to access the paths of interest, and use the -C flag to exclude unnecessary metadata (to avoid doing ugly post-processing later).
i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])
Afterwards, split the output on new lines and then you will have your list of paths.
Here's an example along with logging and error handling (including for when the directory/file doesn't exist):
from subprocess import Popen, PIPE
import logging

logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
    (out, err) = proc.communicate()
    out, err = out.decode(), err.decode()  # communicate() returns bytes on Python 3
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = ('Failed to list HDFS directory "' + dirname +
                  '", return code ' + str(proc.returncode))
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()

# dat_files will contain a proper Python list of the paths to the '.dat' files you mentioned above.
dat_files = hdfs_ls('/hdfs-dir-with-dat-files/')

The answer by @JGC was a big help. I wanted a version that was a more transparent function instead of a harder-to-read one-liner; I also swapped the string parsing to use a regex so that it is both more transparent and less brittle to changes in the hdfs -ls output. This version looks like this; it follows the same general approach as JGC's:
import re
from typing import List

import sh

def get_hdfs_files(directory: str) -> List[str]:
    '''
    Params:
        directory: an HDFS directory e.g. /my/hdfs/location
    '''
    output = sh.hdfs('dfs', '-ls', directory).split('\n')
    files = []
    for line in output:
        match = re.search(f'({re.escape(directory)}.*$)', line)
        if match:
            files.append(match.group(0))
    return files
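Usage is then straightforward; for example, to pick out the .dat files the original question mentions (the path here is illustrative):
# returns every path under the directory; filter for the .dat files if needed
dat_files = [f for f in get_hdfs_files('/somedirectory') if f.endswith('.dat')]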

Related

Running bash on python to read from S3 Bucket and saving output

I'm trying to run the following bash command from Python and save its output into a variable. I'm new to using bash, so any help will be appreciated.
Here's my usecase I have data stored in an S3 bucket (let's say the path is s3://test-bucket/folder1/subd1/datafiles/)
in the datafiles folder there are multiple data files:
a1_03_27_2020_N.csv
a1_04_05_2021_O.csv
a1_07_16_2021_N.csv
I'm trying to select the latest file (in this case a1_07_16_2021_N) and then read that data file using pandas
Here's what I have so far
The command to select the latest file
ls -t a1*|head -1
but then I'm not sure how to
1. run that command from Python
2. save the output as a variable
(I know this is not correct, but something like
latest_file = os.environ['ls -t a1*|head -1'])
Then read the file:
df = pd.read_csv(latest_file)
Thank you in advance again!
Python replaces most shell functionality. You can do the search and filtering in python itself. No need for a callout.
from pathlib import Path

dir_to_search = Path("test-bucket/folder1/subd1/datafiles/")
try:
    latest = max(dir_to_search.glob("a1*.csv"), key=lambda path: path.stat().st_mtime)
    print(latest)
except ValueError:
    print("no csv here")
But if you want to run the shell, several functions in subprocess will do it. For instance,
import subprocess as subp

result = subp.run("ls -t test-bucket/folder1/subd1/datafiles/a1* | head -1",
                  shell=True,
                  capture_output=True, text=True).stdout.strip()
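Either way, once you have the latest file you can hand it to pandas as the question intended (a small sketch assuming the latest variable from the pathlib version above):
import pandas as pd

# Path objects are accepted directly by read_csv
df = pd.read_csv(latest)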

I am trying to print the last line of every file in a directory using shell command from python script

I am storing the number of files in a directory in a variable and storing their names in an array. I'm unable to store file names in the array.
Here is the piece of code I have written.
import os
temp = os.system('ls -l /home/demo/ | wc -l')
no_of_files = temp - 1
command = "ls -l /home/demo/ | awk 'NR>1 {print $9}'"
file_list=[os.system(command)]
for i in range(len(file_list))
    os.system('tail -1 file_list[i]')
Your shell scripting is orders of magnitude too complex.
output = subprocess.check_output('tail -qn1 *', shell=True)
or if you really prefer,
os.system('tail -qn1 *')
which however does not capture the output in a Python variable.
If you have a recent-enough Python, you'll want to use subprocess.run() instead. You can also easily let Python do the enumeration of the files to avoid the pesky shell=True:
output = subprocess.check_output(['tail', '-qn1'] + os.listdir('.'))
As noted above, if you genuinely just want the output to be printed to the screen and not be available to Python, you can of course use os.system() instead, though subprocess is recommended even in the os.system() documentation because it is much more versatile and more efficient to boot (if used correctly). If you really insist on running one tail process per file (perhaps because your tail doesn't support the -q option?) you can do that too, of course:
for filename in os.listdir('.'):
    os.system("tail -n 1 '%s'" % filename)
This will still work incorrectly if you have a file name which contains a single quote. There are workarounds, but avoiding a shell is vastly preferred (so back to subprocess without shell=True and the problem of correctly coping with escaping shell metacharacters disappears because there is no shell to escape metacharacters from).
for filename in os.listdir('.'):
    print(subprocess.check_output(['tail', '-n1', filename]))
Finally, tail doesn't particularly do anything which cannot easily be done by Python itself.
for filename in os.listdir('.'):
    with open(filename, 'r') as handle:
        for line in handle:
            pass
        # print the last one only
        print(line.rstrip('\r\n'))
If you have knowledge of the expected line lengths and the files are big, maybe seek to somewhere near the end of the file, though obviously you need to know how far from the end to seek in order to be able to read all of the last line in each of the files.
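A rough sketch of that seek idea (my own illustration, assuming no line is longer than the chosen tail size):
import os

def last_line(filename, tail_bytes=4096):
    """Read only the end of the file and return its last line."""
    with open(filename, 'rb') as handle:
        size = os.path.getsize(filename)
        # jump close to the end so the whole file is never read
        handle.seek(max(0, size - tail_bytes))
        lines = handle.read().splitlines()
    return lines[-1].decode('utf-8', 'replace') if lines else ''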
os.system returns the exitcode of the command and not the output. Try using subprocess.check_output with shell=True
Example:
>>> a = subprocess.check_output("ls -l /home/demo/ | awk 'NR>1 {print $9}'", shell=True)
>>> a.decode("utf-8").split("\n")
Edit (as suggested by @tripleee): you probably don't want to do this, as it will get messy. Python has great functions for things like this. For example:
>>> import glob
>>> names = glob.glob("/home/demo/*")
will directly give you a list of files and folders inside that folder. Once you have this, you can just do len(names) to get the count your first command was producing.
Another option is:
>>> import os
>>> os.listdir("/home/demo")
Here, glob will give you the whole filepath /home/demo/file.txt and os.listdir will just give you the filename file.txt
The ls -l /home/demo/ | wc -l command also doesn't give the correct value, because ls -l prints a "total X" summary line at the top in addition to the file entries.
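For example (my own small illustration of the two options):
import glob
import os

names = glob.glob("/home/demo/*")                    # full paths, e.g. /home/demo/file.txt
no_of_files = len(names)                             # replaces the ls | wc -l counting
basenames = [os.path.basename(n) for n in names]     # just file.txt etc.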
You could likely use a loop without much issue:
import os

files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    with open(f, 'rb') as fh:
        last = fh.readlines()[-1].decode()
    print('file: {0}\n{1}\n'.format(f, last))
Output:
file: file.txt
Hello, World!
...
If your files are large then readlines() probably isn't the best option. Maybe go with tail instead:
import subprocess

for f in files:
    print('file: {0}'.format(f))
    subprocess.check_call(['tail', '-n', '1', f])
    print('\n')
The decode is optional; for text, "utf-8" usually works, and if it's a mix of binary/text/etc., something such as "iso-8859-1" should usually work.
You are not able to store the file names because os.system does not return the command's output the way you expect; it returns the exit status. For more information, see the os.system documentation.
From the docs
On Unix, the return value is the exit status of the process encoded in the format specified for wait(). Note that POSIX does not specify the meaning of the return value of the C system() function, so the return value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell after running command, given by the Windows environment variable COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always 0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit status of the command run; on systems using a non-native shell, consult your shell documentation.
os.system executes shell commands as-is; to capture the output of those shell commands, you have to use the Python subprocess module.
Note: in your case you can get the file names using either the glob module or os.listdir(); see How to list all files of a directory.

Serial Numbers from a Storage Controller over SSH

Background
I'm working on a bash script to pull serial numbers and part numbers from all the devices in a server rack. My goal is to be able to run a single script (inventory.sh) and walk away while it generates text files containing the information I need. I'm using bash for maximum compatibility; the RHEL 6.7 systems do have Perl and Python installed, but only with minimal libraries. So far I haven't had to use anything other than bash, but I'm not against calling a Perl or Python script from my bash script.
My Problem
I need to retrieve the Serial Numbers and Part numbers from the drives in a Dot Hill Systems AssuredSAN 3824, as well as the Serial numbers from the equipment inside. The only way I have found to get all the information I need is to connect over SSH and run the following three commands dumping the output to a local file:
show controllers
show frus
show disks
Limitations:
I don't have "sshpass" installed, and would prefer not to install it.
The Controller is not capable of storing SSH keys ( no option in custom shell).
The Controller also cannot write or transfer local files.
The Rack does NOT have access to the Internet.
I looked at paramiko, but while Python is installed I do not have pip.
I also cannot use CPAN.
For what it's worth, the output comes back in XML format. (I've already written the code to parse it in bash.)
Right now I think my best option would be to have a library for Python or Perl in the folder with my other scripts, and write a script to dump the commands' output to files that I can parse with my bash script. Which language is easier to just provide a library in a file? I'm looking for a library that is as small and simple as possible to use. I just need a way to get the output of those commands to XML files. Right now I am just using ssh 3 times in my script and having to enter the password each time.
Have a look at SNMP. There is a reasonable chance that you can use SNMP tools to remotely extract the information you need. The manufacturer should be able to provide you with the MIBs.
I ended up contacting the manufacturer and asking my question. They said that the system isn't set up for connecting without a password, and their SNMP is very basic and won't provide the information I need. They said to connect to the system with FTP and use "get logs" to download an archive of the configuration and logs. Not exactly ideal, as it takes 4 minutes just to run that one command, but it seems to be my only option. Below is the script I wrote to retrieve the file automatically by adding the login credentials to the .netrc file. This works on RHEL 6.7:
#!/bin/bash
#Retrieve the logs and configuration from a Dot Hill Systems AssuredSAN 3824 automatically.
#Modify "LINE" and "HOST" to fit your configuration.

LINE='machine <IP> login manage password <password>'
HOST='<IP>'
AUTOLOGIN="/root/.netrc"
FILE='logfiles.zip'

#Check for and verify the autologin file
if [ -f $AUTOLOGIN ]; then
    printf "Found auto-login file, checking for proper entry... \r"
    READLINE=`cat $AUTOLOGIN | grep "$LINE"`
    #Append the line to the end of .netrc if file exists but not the line.
    if [ "$LINE" != "$READLINE" ]; then
        printf "Proper entry not found, creating it... \r"
        echo "$LINE" >> "$AUTOLOGIN"
    else
        printf "Proper entry found... \r"
    fi
#Create the Autologin file if it doesn't exist
else
    printf "Auto-Login file does not exist, creating it and setting permissions...\r"
    echo "$LINE" > "$AUTOLOGIN"
    chmod 600 "$AUTOLOGIN"
fi

#Start getting the information from the controller. (This takes a VERY long time)
printf "Retrieving Storage Controller data, this will take awhile... \r"
ftp $HOST << SCRIPT
get logs $FILE
SCRIPT
exit 0
This gave me a bunch of files in the zip, but all I needed was the "store_....logs" file. It was about 500,000 lines long; the first portion is the entire configuration in XML format, then the configuration in text format, followed by the logs from the system. I parsed the file and stripped off the logs at the end, which cut the file down to 15,000 lines. From there I divided it into two files (config.xml and config.txt). I then pulled the XML output of the 3 commands that I needed and saved it to the 3 files my previously written script searches for. Now my inventory script pulls in everything it needs, albeit pretty slowly due to waiting 4 minutes for the system to generate the zip file. I hope this helps someone in the future.
Edit:
Waiting 4 minutes for the system to compile was taking too long. So I ended up using paramiko and python scripts to dump output from the commands to files that my other code can parse. It accepts the IP of the Controller as a parameter. Here is the script for those interested. Thank you again for all the help.
#!/usr/bin/env python
#Saves output of "show disks" from the storage Controller to an XML file.
import paramiko
import sys
import re
import xmltodict

IP = sys.argv[1]
USERNAME = "manage"
PASSWORD = "password"
FILENAME = "./logfiles/disks.xml"
cmd = "show disks"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

try:
    client.connect(IP, username=USERNAME, password=PASSWORD)
    stdin, stdout, stderr = client.exec_command(cmd)
except Exception as e:
    sys.exit(1)

data = ""
for line in stdout:
    if re.search('#', line):
        pass
    else:
        data += line

client.close()

f = open(FILENAME, 'w+')
f.write(data)
f.close()

sys.exit(0)
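If the script above is saved as, say, show_disks.py (the file name is my assumption, not from the original post), it would be invoked as python show_disks.py <controller IP>, since the controller address is read from sys.argv[1].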

Searching over 20,000+ files in network driver. Glob is slow. Command line gives error

I need to search for a file on a network drive (the folder contains twenty thousand or more files).
I tried using glob to search, but it is very slow; it usually needs 50s~60s to complete the search.
import glob
path = '\\\\network\\driver\\'
file = '12AC9K-XXE-849*'
print(glob.glob(path+file))
I also tried using command-line tools (the whole system is on Windows):
os.chdir(path)
print(os.system("dir 12AC9K-XXE-849*"))
I think the issue is with the network path, and I cannot use cmd.exe because I get the error message:
'\\network\driver\'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported. Defaulting to Windows directory.
The system cannot find the path specified.
Thanks tdelaney. I finally found the answer and can get the "dir" output, using subprocess:
import subprocess

# raw string so the backslashes in the UNC path are not treated as escapes
proc = subprocess.Popen(r'dir /b \\network\driver\12AC9K-XXE-849*',
                        stdout=subprocess.PIPE, shell=True)
(out, err) = proc.communicate()
out = out.splitlines()
print(out)
This gets the dir output. Thanks for all the help.

Hadoop commands from python

I am trying to get some stats for a directory in HDFS: the number of files/subdirectories and the size of each. I started out thinking that I could do this in bash.
#!/bin/bash
OP=$(hadoop fs -ls hdfs://mydirectory)
echo $(wc -l < "$OP")
I only have this much so far, and I quickly realised that Python might be a better option for this. However, I am not able to figure out how to execute Hadoop commands like hadoop fs -ls from Python.
Try the following snippet:
import subprocess

output = subprocess.Popen(["hadoop", "fs", "-ls", "/user"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in output.stdout:
    print(line)
Additionally, you can refer to this sub-process example, where you can get return status, output and error message separately.
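To tie this back to the original question (counting the entries), something like the following should work; this is a sketch I haven't run against a live cluster:
import subprocess

proc = subprocess.Popen(["hadoop", "fs", "-ls", "hdfs://mydirectory"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
if proc.returncode != 0:
    raise RuntimeError(err.decode())

# the first line is the "Found N items" header, so drop it before counting
entries = out.decode().splitlines()[1:]
print(len(entries), "files/subdirectories")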
See https://docs.python.org/2/library/commands.html for your options, including how to get the return status (in case of an error). The basic code you're missing is
import commands
hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')
Yes: deprecated in 2.6, still useful in 2.7, but removed from Python 3. If that bothers you, switch to
os.system(<command string>)
... or better yet use subprocess.call (introduced in 2.4).
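On Python 3, where the commands module is gone, a rough subprocess equivalent would be:
import subprocess

# check_output raises CalledProcessError on a non-zero exit status
hdir_list = subprocess.check_output(
    ["hadoop", "fs", "-ls", "hdfs://mydirectory"]).decode().splitlines()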
