I have a PySpark script where data is processed and then written out as CSV files. Since the end result should be ONE CSV file accessible via WinSCP, I do some additional processing to merge the part files produced on the worker nodes and transfer the result out of HDFS to the FTP server (I think that machine is called the edge node).
from py4j.java_gateway import java_import
import os
YYMM = date[2:7].replace('-','')
# First, clean out both HDFS and local folder so CSVs do not stack up (data history is stored in DB anyway if update option is enabled)
os.system('hdfs dfs -rm -f -r /hdfs/path/new/*')
os.system('rm -f /ftp/path/new/*')
#timestamp = str(datetime.now()).replace(' ','_').replace(':','-')[0:19]
df.coalesce(1).write.csv('/hdfs/path/new/dataset_temp_' + date, header = "true", sep = "|")
# By default, output CSV has weird name ("part-0000-..."). To give proper name and delete automatically created upper folder, do some more processing
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
sc = spark.sparkContext
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date + '/part*'))[0].getPath().getName()
fs.rename(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date + "/" + file), sc._jvm.Path('/hdfs/path/new/dataset_' + YYMM + '.csv'))
fs.delete(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date), True)
# Shift CSV file out of HDFS into "regular" SFTP server environment
os.system('hdfs dfs -copyToLocal hdfs://<server>/hdfs/path/new/dataset_' + YYMM + '.csv' + ' /ftp/path/new')
In client mode everything works fine. But when I switch to cluster mode, I get an error that the final /ftp/path/new in the copyToLocal command is not found, I suppose because the driver is now looking on the worker nodes and not on the edge node. Is there any way to overcome this? As an alternative, I thought about doing the final copyToLocal from a batch script outside of the Spark session, but I'd rather have it all in one script...
Instead of running OS commands in your Spark script, you can write the output directly to the FTP location. Provide the path to the FTP location with the save mode set to overwrite. You can then run the code to rename the data after your Spark script has completed.
YYMM = date[2:7].replace('-','')

df.coalesce(1).write \
    .mode("overwrite") \
    .csv('/ftp/path/new/{0}'.format(date), header="true", sep="|")

# run the command below in a separate step once the above code has executed
os.system("mv /ftp/path/new/{0}/*.csv /ftp/path/new/{0}/dataset_{1}.csv".format(date, YYMM))
I have made the assumption that the FTP location is accessible by the worker nodes, since you are able to run the copyToLocal command in client mode. If the location is not accessible, you will have to write the file to the HDFS location as before and run the moving and renaming of the file in a separate process/script outside of the Spark job.
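If the FTP path turns out not to be reachable from the worker nodes, here is a minimal sketch of that other route (untested, with assumptions about your environment): keep the Spark job writing the coalesced CSV to HDFS exactly as before, then run a small plain-Python step on the edge node after the job finishes to do the rename and the copyToLocal. The hdfs dfs -ls -C invocation and the hard-coded date below are placeholders:

import subprocess

date = '2020-01-31'                       # hypothetical; pass in the real run date
YYMM = date[2:7].replace('-', '')
src_dir = '/hdfs/path/new/dataset_temp_' + date
final = '/hdfs/path/new/dataset_' + YYMM + '.csv'

# find the single part file written by coalesce(1)
listing = subprocess.check_output(['hdfs', 'dfs', '-ls', '-C', src_dir]).decode().splitlines()
part = [p for p in listing if '/part-' in p][0]

subprocess.check_call(['hdfs', 'dfs', '-mv', part, final])           # give it a proper name
subprocess.check_call(['hdfs', 'dfs', '-rm', '-r', '-f', src_dir])   # drop the temp folder
subprocess.check_call(['hdfs', 'dfs', '-copyToLocal', final, '/ftp/path/new'])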
I am trying to run a Python script from a PowerShell script inside SQL Server Agent.
I was able to execute most of the Python script (Task1-Task2) except the last portion (Task3), where it runs a third-party exe file called SQBConverter (from RedGate) that converts files from SQB format to BAK format.
When I manually run the PowerShell script directly, which in turn runs the Python script, there is no issue.
I modified the "Log On As" account from the default ("Local System") to my own (JDoh), and the PowerShell script executes within SQL Server Agent, but it does everything except the part where it converts files from SQB to BAK format (Task3).
Without changing it to my own account (JDoh), it would not execute any part of the Python script at all.
I don't think there is any issue on the PowerShell side, because it still triggers the Python script when I change the "Log On As" account to "Local System". It shows no error and the SQL Server Agent job is reported as completed, but it does not run any of the tasks within the Python script at all.
So I am guessing it might be something to do with SQL Server Agent not being able to trigger/run the SQBConverter exe file.
Here is the whole Python code (ConvertToBAK.py) to give you the full idea of the logic. It does everything up to the point where it converts from SQB to BAK (Task3: the last two lines).
import os
from os import path
import datetime
from datetime import timedelta
import glob
import shutil
import re
import time, sys
today = datetime.date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
nonhyphen_yesterday = yesterday.replace('-','')
revised_yesterday = "LOG_us_xxxx_multi_replica_" + nonhyphen_yesterday
src = "Z:\\TestPCC\\FTP"
dst = "Z:\\TestPCC\\Yesterday"
password = "Password"
path = "Z:\\TestPCC\\FTP"
now = time.time()
### Task1: To delete old files (5 days or older)
for f in os.listdir(path):
    f = os.path.join(path, f)
    if os.stat(f).st_mtime < now - 5 * 86400:
        if os.path.isfile(f):
            os.remove(os.path.join(path, f))

filelist = glob.glob(os.path.join(dst, "*"))
for f in filelist:
    os.remove(f)
### Task2: To move all files from one folder to other folder location
src_files = os.listdir(src)
src_files1 = [g for g in os.listdir(src) if re.match(revised_yesterday, g)]
for file_name in src_files1:
    full_file_name = os.path.join(src, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, dst)
### Task3: Convert from SQB format to BAK format (running SQBConverter.exe)
for f in glob.glob(r'Z:\\TestPCC\\Yesterday\\*.SQB'):
    os.system( f'SQBConverter "{f}" "{f[:-4]}.bak" {password}' )
This is powershell code (Test.ps1):
$path = 'Z:\TestPCC'
$file = 'ConvertToBAK.py'
$cmd = $path+"\\"+$file # concatenate the path and the file name
Start-Process $cmd # launch the script
This is a screenshot of the SQL Server Agent step:
I looked at the properties of the SQBConverter exe file itself, and I granted FULL control to all users listed.
I got it working by modifying the last line of my Python code.
From:
os.system( f'SQBConverter "{f}" "{f[:-4]}.bak" {password}' )
To (absolute path, presumably because the account running the SQL Server Agent job does not have the SQBConverter location on its PATH):
os.system( f'Z:\\TestPCC\\SQBConverter.exe "{f}" "{f[:-4]}.bak" {password}' )
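If you also want to see conversion failures instead of having them silently swallowed by os.system, a small variation using subprocess is sketched below (the converter path and password simply reuse the values from the question; capture_output/text need Python 3.7+):

import glob
import subprocess

password = "Password"
converter = r"Z:\TestPCC\SQBConverter.exe"   # absolute path, as in the fix above

for f in glob.glob(r"Z:\TestPCC\Yesterday\*.SQB"):
    # run the converter without a shell and surface its exit code / stderr
    result = subprocess.run([converter, f, f[:-4] + ".bak", password],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("conversion failed for", f, ":", result.stderr)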
The piece of code below is part of a larger program which I am running on a remote server via a batch script with #!/bin/bash -l as its first line.
On my local machine it runs normally, but on the remote server permission issues arise. What may be wrong?
The description of the code may not be important to the problem, but basically the code uses awk to process the contents of the files based on the names of the files.
Why is awk denied permission to operate on the files? When I run awk directly at a shell prompt on the remote server it works normally.
#!/usr/bin/env python
import subprocess

list_of_files = ["file1", "file2", "file3"]
for file in list_of_files:
    awk_cmd = '''awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)) ++i; next} 1' ''' + file + " > tmp && mv tmp " + file + \
              " | cat files > 'pooled_file' "
    exitcode = subprocess.call(awk_cmd, shell=True)
Any help would be appreciated.
I am pretty sure it is a permissions issue, because when you land on the remote machine you are NOT landing in the directory where your Input_file(s) are present; you land in the HOME directory of the logged-in user on the remote server. So it is good practice to mention file names with complete paths (make sure the files with those paths are actually present in the target location too, or write a wrapper around it to check whether the files are present, as sketched below). Could you please try the following.
#!/usr/bin/env python
import subprocess

list_of_files = ["/full/path/file1", "/full/path/file2", "/full/path/file3"]
for file in list_of_files:
    awk_cmd = '''awk '/^>/{num=split(FILENAME,array,"/");print ">" substr(array[num],1,length(array[num])) ++i; next} 1' ''' + file + " > tmp$$ && mv tmp$$ " + file + \
              " | cat files > 'pooled_file' "
    exitcode = subprocess.call(awk_cmd, shell=True)
I haven't tested it, but I have changed it to use full paths. Since awk would then print the complete path along with the filename, I changed FILENAME in your code to take the last element of a split on "/", and I also changed the tmp temporary file to tmp$$ to be on the safer side.
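As a rough illustration of the wrapper idea mentioned above (a sketch only: the file list comes from the snippet above, the existence/readability checks are my assumptions about what you would want to verify, and the trailing cat into 'pooled_file' from the original command is left out for clarity):

#!/usr/bin/env python
import os
import subprocess

list_of_files = ["/full/path/file1", "/full/path/file2", "/full/path/file3"]

for file in list_of_files:
    # skip anything that is missing or not readable/writable before calling awk
    if not os.path.isfile(file):
        print("missing: " + file)
        continue
    if not os.access(file, os.R_OK | os.W_OK):
        print("no read/write permission: " + file)
        continue
    awk_cmd = '''awk '/^>/{num=split(FILENAME,array,"/");print ">" substr(array[num],1,length(array[num])) ++i; next} 1' ''' + file + " > tmp$$ && mv tmp$$ " + file
    exitcode = subprocess.call(awk_cmd, shell=True)
    print(file + " exited with " + str(exitcode))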
I have connected to an FTP server and the connection is successful.
import ftplib
ftp = ftplib.FTP('***', '****','****')
listoffiles = ftp.dir()
print (listoffiles)
I have a few CSV files on this FTP server and a few folders which contain some more CSVs.
I need to identify the list of folders in this location (home) and navigate into those folders. I think the cwd command should work for that.
I also need to read the CSVs stored on this FTP server. How can I do that? Is there a way to load the CSVs directly into pandas?
Based on the answer here (Python write create file directly in FTP) and my own knowledge about ftplib:
What you can do is the following:
from ftplib import FTP
import io, pandas

session = FTP('***', '****', '****')

# get filenames on the ftp home/root
remoteFilenames = session.nlst()
if ".." in remoteFilenames:
    remoteFilenames.remove("..")
if "." in remoteFilenames:
    remoteFilenames.remove(".")

# iterate over the filenames and check which ones are folders
for filename in remoteFilenames:
    dirTest = session.nlst(filename)
    # this dir test does not work on certain servers
    if dirTest and len(dirTest) > 1:
        # it's a directory => go into the directory
        session.cwd(filename)

        # get the filenames on the ftp one level deeper
        remoteFilenames2 = session.nlst()
        if ".." in remoteFilenames2:
            remoteFilenames2.remove("..")
        if "." in remoteFilenames2:
            remoteFilenames2.remove(".")

        for filename2 in remoteFilenames2:
            # check again whether the filename is a directory, and this time skip it
            dirTest = session.nlst(filename2)
            if dirTest and len(dirTest) > 1:
                continue

            # download the file, but first create a virtual file object for it
            download_file = io.BytesIO()
            session.retrbinary("RETR {}".format(filename2), download_file.write)
            download_file.seek(0)  # after writing, go back to the start of the virtual file
            pandas.read_csv(download_file)  # read the virtual file into pandas

            ##########
            # do your thing with pandas here
            ##########

            download_file.close()  # close the virtual file

        session.cwd("..")  # go back up before checking the next entry

session.quit()  # close the ftp session
Alternatively, if you know the structure of the FTP server, you could loop over a dictionary describing the folder/file structure and download the files via ftplib or urllib, as in this example:
for folder, files in {"folder1": ["file1", "file2"], "folder2": ["file1"]}.items():
    for file in files:
        path = "/{}/{}".format(folder, file)

        ##########
        # specific ftp file download stuff here
        ##########

        ##########
        # do your thing with pandas here
        ##########
Both solutions can be optimized by making them recursive, or more generally by supporting more than one level of folders; a rough recursive sketch follows.
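A minimal recursive sketch along those lines (untested; it reuses the nlst()-based directory test from above, which as noted does not work on every server, and it assumes nlst() returns bare names that can be joined onto the parent path):

from ftplib import FTP
import io
import pandas

def read_all_csvs(session, path=""):
    # walk the FTP tree and read every CSV found into pandas
    for name in session.nlst(path):
        if name in (".", ".."):
            continue
        full = path + "/" + name if path else name
        listing = session.nlst(full)
        if listing and len(listing) > 1:
            # same heuristic as above: more than one entry => treat it as a directory
            read_all_csvs(session, full)
        elif full.lower().endswith(".csv"):
            download_file = io.BytesIO()
            session.retrbinary("RETR {}".format(full), download_file.write)
            download_file.seek(0)
            df = pandas.read_csv(download_file)
            # do your thing with pandas here
            print(full, df.shape)

session = FTP('***', '****', '****')
read_all_csvs(session)
session.quit()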
Better late than never... I was able to read directly into pandas. Not sure if this works for anyone.
import pandas as pd
from ftplib import FTP
ftp = FTP('ftp.[domain].com') # you need to put in your correct ftp domain
ftp.login() # i don't need login info for my ftp
ftp.cwd('[Directory]') # change directory to where the file is
df = pd.read_csv("[file.csv]", delimiter = "|", encoding='latin1') # i needed to specify delimiter and encoding
df.head()
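Note that in the snippet above the pd.read_csv("[file.csv]") call reads from the local working directory; the ftplib session and cwd() do not affect it. If you want pandas itself to pull the file straight off the FTP server, read_csv also accepts an ftp:// URL, along these lines (host, path and credentials are placeholders):

import pandas as pd

# anonymous FTP; for authenticated access use ftp://user:password@host/...
url = "ftp://ftp.example.com/Directory/file.csv"
df = pd.read_csv(url, delimiter="|", encoding="latin1")
print(df.head())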
I am working on a code to copy images from a folder in a local directory to a remote directory. I am trying to use scp.
So in my directory, there is a folder that contains subfolders with images in it. There are also images that are in the main folder that are not in subfolders. I am trying to iterate through the subfolders and individual images and sort them by company, then make corresponding company folders for those images to be organized and copied onto the remote directory.
I am having problems creating the new company folder in the remote directory.
This is what I have:
def imageSync():
    path = os.path.normpath("Z:\Complete")
    folders = os.listdir(path)
    subfolder = []
    # separates subfolders from just images in complete folder
    for folder in folders:
        if folder[len(folder)-3:] == "jpg":
            pass
        else:
            subfolder.append(folder)
    p = dict()
    for x in range(len(subfolder)):
        p[x] = os.path.join(path, subfolder[x])
    sub = []
    for location in p.items():
        sub.append(location[1])
    noFold = []
    for s in sub:
        path1 = os.path.normpath(s)
        images = os.listdir(path1)
        for image in images:
            name = image.split("-")
            comp = name[0]
            pathway = os.path.join(path1, image)
            path2 = "scp " + pathway + " blah@192.168.1.10:/var/files/ImageSync/" + comp
            pathhh = os.system(path2)
            if not os.path.exists(pathhh):
                noFold.append(image)
There's more to the code, but I figured the top part would help explain what I am trying to do.
I have created an ssh key in hopes of making os.system work, but the os.system call on path2 is returning 1 when I would like to get the path to the remote server. (I tried this: How to store the return value of os.system that it has printed to stdout in python?)
Also how do I properly check to see if the company folder in the remote directory already exists?
I have looked at Secure Copy File from remote server via scp and os module in Python and How to copy a file to a remote server in Python using SCP or SSH? but I guess I am doing something wrong.
I'm new to Python so thanks for any help!
Try this to copy dirs and nested subdirs from local to remote:
cmd = "sshpass -p {} scp -r {}/* root@{}:{}".format(
    remote_root_pass,
    local_path,
    remote_ip,
    remote_path)
os.system(cmd)
Don't forget to import os.
You may check the exit code returned (0 means success).
Also, you might need to install sshpass ("yum install sshpass").
And change /etc/ssh/ssh_config
StrictHostKeyChecking ask
to:
StrictHostKeyChecking no
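The original question also asked how to check whether the company folder already exists on the remote side before copying. One way to do that from Python, assuming the ssh key mentioned in the question is set up, is to run test -d / mkdir -p over ssh before the scp. A rough sketch using the host and paths from the question (comp and pathway come from the loop in the question):

import subprocess

host = "blah@192.168.1.10"
remote_dir = "/var/files/ImageSync/" + comp      # comp = company name from the loop above

# test -d exits with 0 only if the directory already exists on the remote host
exists = subprocess.call(["ssh", host, "test -d " + remote_dir]) == 0
if not exists:
    # mkdir -p creates the folder (and any missing parents); harmless if it already exists
    subprocess.call(["ssh", host, "mkdir -p " + remote_dir])

# copy the image; scp exits with 0 on success
ok = subprocess.call(["scp", pathway, host + ":" + remote_dir]) == 0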
First off: I know of pyinotify.
What I want is an upload service to my home server using Dropbox.
I will have a Dropbox shared folder on my home server. Every time someone else who is sharing that folder puts anything into it, I want my home server to wait until it is fully synced, move all the files to another folder, and remove them from the Dropbox folder, thus saving Dropbox space.
The thing is, I can't just watch for changes in the folder and move the files right away, because if someone uploads a large file, Dropbox will already start downloading it and therefore show changes in the folder on my home server.
Is there some workaround? Is that somehow possible with the Dropbox API?
Haven't tried it myself, but the Dropbox CLI version seems to have a 'filestatus' method to check for current file status. Will report back when I have tried it myself.
There is a Python Dropbox CLI client, as you mentioned in your question. It returns "Idle..." when it isn't actively processing files. The absolutely simplest mechanism I can imagine for achieving what you want would be a while loop that checks the output of dropbox.py filestatus /home/directory/to/watch, performs an scp of the contents, then deletes the contents if that succeeded, and sleeps for five minutes or so.
Something like:
import time
from subprocess import check_call, check_output

DIR = "/directory/to/watch/"
REMOTE_DIR = "user@my_server.com:/folder"

while True:
    if check_output(["dropbox.py", "status", DIR]) == "\nIdle...":
        # shell=True so the * wildcard is expanded; check_call returns 0 on success
        if check_call("scp -r " + DIR + "* " + REMOTE_DIR, shell=True) == 0:
            check_call("rm -rf " + DIR + "*", shell=True)
    time.sleep(360)
Of course I would be very careful when testing something like this, put the wrong thing in that second check_call and you could lose your filesystem.
You could run incrond and have it wait for IN_CLOSE_WRITE events in your Dropbox folder. Then it would only be triggered when a file transfer completed.
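For illustration, an incrontab entry along those lines might look like the following (untested; the paths are placeholders, and $@ / $# are incron's wildcards for the watched directory and the file name that triggered the event):

/home/user/Dropbox/shared IN_CLOSE_WRITE mv $@/$# /home/user/archive/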
Here is a Ruby version that doesn't wait for Dropbox to be idle and can therefore start moving files while it is still syncing. It also ignores . and .., and it actually checks the filestatus of each file within a given directory.
I would then run this script either as a cronjob or in a separate screen session.
directory = "path/to/dir"
destination = "location/to/move/to"

Dir.foreach(directory) do |item|
  next if item == '.' or item == '..'
  fileStatus = `~/bin/dropbox.py filestatus #{directory + "/" + item}`
  puts "processing " + item
  if (fileStatus.include? "up to date")
    puts item + " is up to date, starting to move file now."
    # cp command here. Something along this line: `cp #{directory + "/" + item + destination}`
    # rm command here. Probably you want to confirm that all copied files are correct by comparing md5 or something similar.
  else
    puts item + " is not up to date, moving on to next file."
  end
end
This is the full script I ended up with:
# runs in Ruby 1.8.x (ftools)
require 'ftools'

directory = "path/to/dir"
destination = "location/to/move/to"

Dir.glob(directory+"/**/*") do |item|
  next if item == '.' or item == '..'
  fileStatus = `~/bin/dropbox.py filestatus #{item}`
  puts "processing " + item
  puts "filestatus: " + fileStatus
  if (fileStatus.include? "up to date")
    puts item.split('/',2)[1] + " is up to date, starting to move file now."
    `cp -r #{item + " " + destination + "/" + item.split('/',2)[1]}`
    # remove file in Dropbox folder, if current item is not a directory and
    # copied file is identical.
    if (!File.directory?(item) && File.cmp(item, destination + "/" + item.split('/',2)[1]))
      puts "remove " + item
      `rm -rf #{item}`
    end
  else
    puts item + " is not up to date, moving to next file."
  end
end