Can I implement my own Python-based git merge strategy?

I developed a Python library for merging large numbers of XML files in a very specific way. These XML files are split up and altered by multiple users in my group and it would be much easier to put everything into a Git repo and have git-merge manage everything via my Python code.
It seems possible to hook my code in as a git-mergetool, but then I would have to write my own code to handle the conflict markers left by the internal git-merge (i.e. parse the <<<<<<< ======= >>>>>>> markers), which would be more time-consuming.
So, is there a way to have Git's merge command automatically use my Python code instead of its internal git-merge?

You can implement a custom merge driver that's used for certain filetypes instead of Git's default merge driver.
The relevant documentation is in gitattributes(5).
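For a rough idea of the shape this takes, here is a minimal sketch; the driver name, script path, and file pattern are placeholders, and the merge logic itself is just a stub for your library:
# merge_xml.py - skeleton of a custom merge driver.
# Registered roughly like this:
#   .gitattributes:  *.xml merge=xmlmerge
#   .git/config:     [merge "xmlmerge"]
#                        name = custom XML merge driver
#                        driver = python merge_xml.py %O %A %B
# Git calls the driver with temp files holding the ancestor (%O), current (%A)
# and other (%B) versions; the driver must write the merge result into %A and
# exit 0 on a clean merge, or non-zero to leave the file marked as conflicted.
import sys

def merge(ancestor_path, current_path, other_path):
    # ... your library's XML merge logic goes here, writing the result into current_path ...
    return True  # True = merged cleanly

if __name__ == "__main__":
    ancestor, current, other = sys.argv[1:4]
    sys.exit(0 if merge(ancestor, current, other) else 1)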
Some related StackOverflow questions:
Git - how to force merge conflict and manual merge on selected file
How do I tell git to always select my local version for conflicted merges on a specific file?

How to version-control a set of input data along with its processing scripts?

I am working with a set of Python scripts that take data from an Excel file that is set up to behave as a pseudo-database. Excel is used instead of SQL software due to compatibility and access requirements for the other people I work with who aren't familiar with databases.
I have a set of about 10 tables with multiple records in each and relational keys linking them all (again in a pseudo-linking kind of way, using some flimsy data validation).
The scripts I am using are version controlled by Git, and I know the pitfalls of adding a .xlsx file to a repo, so I have kept it away. Since the data is a bit vulnerable, I want to make sure I have a way of keeping track of any changes we make to it. My thought was to have a script that breaks the Excel file into .csv tables and adds those to the repo, i.e.:
import pandas as pd
from pathlib import Path

excel_input_file = Path(r"<...>")
output_path = Path(r"<...>")

# Read every sheet into a dict of {sheet name: DataFrame}
tables_dict = pd.read_excel(excel_input_file, sheet_name=None)

# Write each sheet out as its own CSV, named after the sheet
for sheet_name, frame in tables_dict.items():
    frame.to_csv(output_path / (sheet_name + '.csv'), index=False)
Would this typically be a good method for keeping track of the input files at each stage?
Git tends to work better with text files than with binary files, as you've noted, so this would be a better choice than just checking in an Excel file. Specifically, Git can merge and diff these files, whereas it cannot merge the Excel file natively.
Typically the way that people handle this sort of situation is to take one or more plain text input files (e.g., CSV or SQL) and then build them into the usable output format (e.g., Excel or database) as part of the build or test step, depending on where they're needed. I've done similar things by using a Git fast-export dump to create test Git repositories, and it generally works well.
If you had just one input file, which you don't in this case, you could also use smudge and clean filters to turn the source file in the repository into a different format in the checkout. You can read about this with man gitattributes.
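As a concrete sketch of that build-step idea, the tracked CSVs can be reassembled into a single workbook with pandas; the directory and output filename here are assumptions:
# Rebuild the Excel workbook from the tracked CSV tables as a build/test step.
from pathlib import Path
import pandas as pd

csv_dir = Path("tables")             # directory containing one CSV per sheet (assumed)
output_file = Path("database.xlsx")  # assumed output workbook name

with pd.ExcelWriter(output_file) as writer:
    for csv_file in sorted(csv_dir.glob("*.csv")):
        # Each CSV becomes a sheet named after the file it came from
        pd.read_csv(csv_file).to_excel(writer, sheet_name=csv_file.stem, index=False)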

VS Integration Services: flat file source to OLE DB destination - detect new data columns in file, add columns to table,import columns automatically?

My place of work receives sets of pipe delimited files from many different clients that we use Visual Studio Integration Services projects to import into tables in our MS SQL 2008 R2 server for later processing - specifically with Data Flow Tasks containing Flat File Source to OLE DB Destination steps. Each data flow task has columns that are specifically mapped to columns in our tables, but the chances of a column addition in any file from any client are relatively high (and we are rarely warned that there will be changes), which is becoming tedious as I currently need to...
Run a python script that uses pyodbc to grab the columns contained in the destination tables and compare them to the source files to find out if there is a difference in columns
Execute the necessary SQL to add the columns to the destination tables
Open the corresponding VS Solution, refresh the columns in the flat file sources that have new columns and manually map each new column to the newly created columns in the OLE DB Destination
We are quickly getting more and more sites that I have to do this with, and I desperately need to find a way to automate this. The VS project can easily be automated if we could depend on the changes being accounted for, but as of now this needs to be a manual process to ensure we load all the data properly. Things I've thought about but have been unable to execute...
Using an XML parser - combined with the output of the python script mentioned above - to append new column mappings to the source/destination objects in the VS Package.dtsx.xml. I hit a dead end when I could not find more information about creating a valid "DTS:DTSID" for a new column mapping, and the file became corrupted whenever I edited it. This also seemed like a very unstable option.
Finding a built-in event handler in Visual Studio that throws an error if the flat file has a new, un-mapped column - I would be fine with this as a solution, because we could confidently schedule the import projects to run automatically and only worry about changing the mapping for projects that failed. I could not find a built-in feature that does this. I'm also aware I could do this with a Python script similar to the one mentioned above that fails if there are differences, but that would be extremely tedious to implement due to file-naming conventions and the fact that there are 50+ clients with more on the way.
I am open to any type of solution, even if it's just an idea. As this is my first question on Stack Overflow, I apologize if this was asked poorly and ask for feedback if the question could be improved. Thanks in advance to those that take the time to read!
Edit:
@Larnu stated that SSIS by default throws an error when unrecognized columns are found in the files. However, this does not currently happen with our Visual Studio Integration Services projects, and our team would certainly resist converting all packages to SSIS at this point. It would be wonderful if someone could provide insight into how to ensure the package fails if there are new columns - in VS. If this isn't possible, I may have to pursue the difficult route mentioned by @Dave Cullum, though I don't think I get paid enough for that!
Also, talking sense into the clients has proven to be impossible - the addition of columns will always be a crapshoot!
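For reference, the kind of comparison script mentioned above might look roughly like this; the connection string, table name, and file path are placeholders:
# Compare the header of a pipe-delimited flat file against the destination
# table's columns and fail if the file has columns the table doesn't.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

def table_columns(table_name):
    cursor = conn.cursor()
    cursor.execute(
        "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ?",
        table_name,
    )
    return {row.COLUMN_NAME for row in cursor.fetchall()}

def file_columns(path):
    # First line of the flat file is assumed to be a pipe-delimited header row
    with open(path) as f:
        return {col.strip() for col in f.readline().rstrip("\n").split("|")}

new_columns = file_columns(r"C:\imports\client_a.txt") - table_columns("ClientA_Staging")
if new_columns:
    raise SystemExit("Unmapped columns found: " + ", ".join(sorted(new_columns)))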
Using a Script Task, you can read your file and record how many pipes are in a line:
using (System.IO.StreamReader sr = new System.IO.StreamReader(path))
{
    string line = sr.ReadLine();
    // Column count = number of pipe delimiters in the header line + 1
    int ColumnCount = line.Length - line.Replace("|", "").Length + 1;
}
I assume you know how to set that to a variable.
Now add an Execute SQL task and store the result as another variable:
Select Count(*)
from INFORMATION_SCHEMA.columns
where TABLE_NAME = [your destination table]
Now, exiting the Execute SQL task, add a conditional arrow and compare the numbers. If they are equal, continue your process. If they are not equal, go ahead and send an email (or some other type of notification).

Git: get changes released to master over time

As a personal project, I'd like to examine different Python libraries and projects (be they proprietary or open source) and analyze how the code changed over time across releases, to gather some information about technical debt (mainly through static code analysis). I'm doing this using the gitpython library. However, I'm struggling to filter the merge commits to the master.
I filter the merge commits using git.log("--merges", "--first-parent", "master"), from which I extract the commit hashes and then filter these particular commits from all repository commits.
As the second part, I'd like to get all changed files in each merge commit. I'm able to access the blobs via the git tree, but I don't know how to get only changed files.
Is there an efficient way to accomplish this? Thanks!
... I'd like to get all changed files in each merge commit. ... but I don't know how to get only changed files.
Once you have your commit list as described above, loop over it and run git diff with the --name-only flag for each merge commit, diffing it against its first parent.
From the git diff documentation:
--name-only
Show only names of changed files.
--name-status
Show only the names and status of changed files. See the description of the --diff-filter option on what the status letters mean.
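In gitpython, a rough equivalent might look like this (the repository path is an assumption):
from git import Repo

repo = Repo("/path/to/repo")  # assumed local clone

# Merge commits reachable from master, following only the first parent,
# i.e. the same set as `git log --merges --first-parent master`
for commit in repo.iter_commits("master", merges=True, first_parent=True):
    parent = commit.parents[0]
    # Equivalent to `git diff --name-only <first-parent> <merge-commit>`
    changed_files = repo.git.diff("--name-only", parent.hexsha, commit.hexsha).splitlines()
    print(commit.hexsha, changed_files)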

Maya Python: fix matching names

I am writing a script to export alembic caches of animation in a massive project containing lots of Maya files. Our main character is having an issue; along the way his eyes somehow ended up with the same name. This has created issues with the alembic export. Does Maya already have some sort of cleanup function that can correct matching names?
Any two objects can have the same name, but never the same DAG path. In your script, make sure all your ls, listRelatives calls, etc. have the fullPath, longName, or long flags set so you always operate on full DAG paths as opposed to the possibly conflicting short names.
To my knowledge, Maya (and its Python API) does not offer anything like that.
You'll have to run a snippet that checks for duplicates before the export.
Or, alternatively, use an already existing script and run that.
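A minimal sketch of such a duplicate check (it assumes it runs inside Maya's Python interpreter before the export):
import maya.cmds as cmds
from collections import defaultdict

# Group every DAG node's full path by its short name and report collisions
by_short_name = defaultdict(list)
for full_path in cmds.ls(dag=True, long=True):
    by_short_name[full_path.rsplit("|", 1)[-1]].append(full_path)

for short_name, paths in by_short_name.items():
    if len(paths) > 1:
        print("Duplicate short name: " + short_name)
        for path in paths:
            print("    " + path)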

Convert CVS/SVN to a Programming Snippets Site

I use CVS to maintain all my Python snippets, notes, and C/C++ code. As the hosting provider also provides a public web server, I was thinking that I should automatically convert the CVS repository into a programming snippets website.
cvsweb is not what I mean.
doxygen is for a complete project and for browsing self-referencing code online. I think doxygen is more like a web-based ctags.
I tried rest2web, but it requires that I write /restweb headers and that files be .txt files, and that would interfere with the programming language syntax.
An approach I have thought of is:
1) Run source-highlight and create .html pages for all the scripts.
2) Write a script to index those .html pages and create an index page.
3) Create the website from those pages.
Before proceeding, I thought I would discuss it here in case members have any suggestions.
What do you do when you want to maintain your snippets and notes in CVS and also auto-generate a good website from them? I like rest2web for converting notes to HTML.
Run Trac on the server linked to the (svn) repository. The Trac wiki can conveniently refer to files and changesets. You get TODO tickets, too.
enscript or pygmentize (part of pygments) can be used to convert code to HTML. You can use a custom header or footer to link to the actual code for download.
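For example, a small pygments-based converter might look like this (the input and output paths are assumptions):
# Render one snippet to a standalone, syntax-highlighted HTML page.
from pathlib import Path
from pygments import highlight
from pygments.lexers import get_lexer_for_filename
from pygments.formatters import HtmlFormatter

source = Path("snippets/example.py")   # hypothetical snippet in the working copy
code = source.read_text()
lexer = get_lexer_for_filename(source.name)
formatter = HtmlFormatter(full=True, linenos=True)  # full=True emits a complete page

Path("html").mkdir(exist_ok=True)
Path("html/example.py.html").write_text(highlight(code, lexer, formatter))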
I finally settled on rest2web. I had to do the following:
Used a separate Python script to recursively copy the files in CVS to a separate directory.
Added extra index.txt and template.txt files to all the directories which I wanted to be in the webpage.
The best thing about rest2web is that it supports Python scripting within template.txt, so I just ran a loop over the contents and indexed them on the page.
There is still a lot more to do to automate the entire process, e.g. inline viewing of programs and colorization, which I think can be done with some more trial and error.
I have the completed website here; it is called uthcode.
