I'm working with Databricks and trying to find a good setup for my project. Basically I have to process many files, and for that I wrote quite a bit of Python code. I'm now faced with the problem of running this code on a Spark cluster in such a way that my source code files are present on the respective nodes as well. My approach so far (it even works) is to call map on an RDD, which then runs the parse_file function for each element of the RDD. Is there a better approach, so that I don't have to clone the whole git repository every time?
import os
import subprocess
import sys

def parse_file(filename, string):
    # fetch a fresh copy of the source tree onto the worker and make it importable
    os.system("rm -rf source_code_folder")
    os.system("git clone https://user:password@dev.azure.com/company/project/_git/repo source_code_folder")
    sys.path.append(os.path.join(subprocess.getoutput("pwd"), "source_code_folder/"))
    from my_module1.whatever import my_function
    from my_module2.whatever import some_other_function
    result = (my_function(string), some_other_function(string))
    return result

my_rdd = sc.wholeTextFiles("abfss://raw@storage_account.dfs.core.windows.net/files/*.txt")
processed_rdd = my_rdd.map(lambda x: (x[0], parse_file(x[0], x[1])))
Thanks!
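For what it's worth, here is a minimal sketch of one possible alternative (not from the original post): clone and zip the repository once on the driver, ship the archive to every executor with sc.addPyFile, and keep the imports inside the map function. The repository URL, folder and module names below are just the placeholders from the question.
import shutil
import subprocess

# clone once on the driver and pack the source tree into a zip
subprocess.run(["git", "clone", "https://dev.azure.com/company/project/_git/repo", "source_code_folder"], check=True)
shutil.make_archive("source_code", "zip", "source_code_folder")

# ship the archive to every executor; it gets added to their PYTHONPATH
sc.addPyFile("source_code.zip")

def parse_file(filename, string):
    # these imports now resolve against the zip distributed by addPyFile
    from my_module1.whatever import my_function
    from my_module2.whatever import some_other_function
    return (my_function(string), some_other_function(string))

my_rdd = sc.wholeTextFiles("abfss://raw@storage_account.dfs.core.windows.net/files/*.txt")
processed_rdd = my_rdd.map(lambda x: (x[0], parse_file(x[0], x[1])))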
I want to get the results of successive SUMO runs in CSV format directly from a Python script (not by using the xml2csv tools and the command line). Because the TIME prefix is prepended to the XML file name, I don't know how to deal with this part of the code.
Here I want each run to write its results separately by using the time prefix:
sumoCmd = [sumoBinary, "-c", "test4.sumocfg", "--tripinfo-output", "tripinfo.xml", "--output-prefix", "TIME"]
And here is where I must put the proper XML file name, which is my question:
import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
Any help would be appreciated.
Best, Ali
You can just find the file using glob, e.g.:
import glob
tripinfos = glob.glob("*tripinfo.xml")
To get the latest you can use sorted:
latest = sorted(tripinfos)[-1]
tree = ET.parse(latest)
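To go from there straight to CSV inside the script (the original goal), something along these lines could work; it is only a sketch that assumes the usual tripinfo layout of one <tripinfo> element per vehicle with attributes such as id, depart, arrival and duration, so check the actual attribute names in your output file.
import csv
import glob
import xml.etree.ElementTree as ET

fields = ["id", "depart", "arrival", "duration"]  # assumed attribute names
with open("tripinfo.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["run"] + fields)
    # one row per vehicle, for every prefixed tripinfo file found
    for run_file in sorted(glob.glob("*tripinfo.xml")):
        root = ET.parse(run_file).getroot()
        for trip in root.iter("tripinfo"):
            writer.writerow([run_file] + [trip.get(f) for f in fields])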
I'm currently writing a little Python3 script for monitoring parts of my macOS filesystem. It checks selected folders for new and modified files and copies them (via shutil.copy2) into a review folder which I check from time to time. The test for modification uses a comparison between the st_mtime of the original and the (already) copied files. While testing it I encountered some strange behaviour: Some files got copied over and over again, despite being unmodified.
After some poking around I found out that shutil.copy2 apparently doesn't always carry over the exact st_mtime. (I also tried out shutil.copystat explicitly, with the same result -- which isn't much of a surprise.)
To illustrate my problem: When I run the following code ...
from shutil import copy2
from os import stat
source = '/Users/me/myfile'
target = source + '-copy'
copy2(source, target)
print(stat(source).st_mtime, stat(target).st_mtime)
... the result sometimes (but not always) looks like this:
1600616170.8300607 1600616170.83006
When I use the nanosecond version, st_mtime_ns, the result looks like this:
1600616170830060720 1600616170830060000
Now my question: Does anyone know what's going on here?
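Independently of the explanation, the monitoring script can be made robust against this by comparing the timestamps only up to a tolerance instead of requiring exact equality (in the example above the copy keeps only microsecond resolution). A minimal sketch, with a hypothetical helper name and a tolerance chosen for illustration:
import os

def same_mtime(src, dst, tolerance_ns=1_000):
    # hypothetical helper: treat the copy as up to date if the mtimes
    # differ by less than tolerance_ns (here 1 microsecond)
    delta = abs(os.stat(src).st_mtime_ns - os.stat(dst).st_mtime_ns)
    return delta < tolerance_ns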
I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory next to the .sub file, named datasets, where the datasets are stored. Unfortunately, Condor does not give this created data back to me when it finishes. In other words, my goal is to get the created data into a "Datasets" subfolder next to the .sub file.
I tried:
1) not putting the data under the datasets subfolder, and I obtained the files as expected. However, this is not a smooth solution, since I generate around 100 files, which then end up mixed in with the .sub file and all the others.
2) I also tried to set this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. The spelling has already been checked.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer created directories or their contents at the end of the run. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all the other files. For example, create run.exe:
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
However, if you do not want to do this, and if HTCondor is using a common shared space such as AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.
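If the tarball route is used across hundreds of jobs, the unpacking step the question wanted to avoid can itself be scripted. A minimal sketch, assuming each job's Datasets.tar.gz comes back into its own job directory (the jobs/*/ layout is only an illustration):
import glob
import os
import tarfile

for archive in glob.glob("jobs/*/Datasets.tar.gz"):  # assumed directory layout
    job_dir = os.path.dirname(archive)
    with tarfile.open(archive) as tar:
        tar.extractall(path=job_dir)  # recreates Datasets/ next to each archive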
I am using the GitPython library to perform git operations and retrieve git info from Python code. I want to retrieve all revisions of a specific file, but I couldn't find a specific reference for this in the docs.
Can anybody give me a clue about which function would help in this regard? Thanks.
A follow-on, to read each file:
import git
repo = git.Repo()
path = "file_you_are_looking_for"
revlist = (
    (commit, (commit.tree / path).data_stream.read())
    for commit in repo.iter_commits(paths=path)
)
for commit, filecontents in revlist:
    ...
There is no such function, but it is easily implemented:
import git
repo = git.Repo()
path = "dir/file_you_are_looking_for"
commits_touching_path = list(repo.iter_commits(paths=path))
Performance will be moderate even if multiple paths are involved. Benchmarks and more code about that can be found in an issue on github.
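A small usage example on top of the snippet above, printing the metadata of each revision that touched the file (hexsha, authored_datetime and summary are standard GitPython commit attributes; the path is the placeholder from the answer):
import git

repo = git.Repo()
path = "dir/file_you_are_looking_for"
for commit in repo.iter_commits(paths=path):
    print(commit.hexsha[:8], commit.authored_datetime, commit.summary)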
I have a Subversion repo, i.e. "http://crsvn/trunk/foo" ... I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will do mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see if the branch has been merged.
So the first problem I have is that I cannot walk the repository or do
os.listdir('http://crsvn/branches/bar')
I get "the value label syntax is incorrect" (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
os.listdir takes a path, not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.
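For completeness, a minimal sketch of that second route, assuming the branches have already been checked out into a local working copy (the local path is made up for illustration):
import os

working_copy = "/tmp/crsvn-branches"  # hypothetical local checkout of http://crsvn/branches/bar
for root, dirs, files in os.walk(working_copy):
    dirs[:] = [d for d in dirs if d != ".svn"]  # skip Subversion metadata
    print(root, dirs, files)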