Calling Python from Stata - python

This is probably very easy, but after looking through documentation and possible examples online for the past several hours I cannot figure it out.
I have a large dataset (a spreadsheet) that gets heavily cleaned by a DO file. In the DO file I then want to save certain variables of the cleaned data as a temporary .csv, run some Python scripts that produce a new CSV, and then append that output to my cleaned data.
If that was unclear here is an example.
After cleaning, my dataset (XYZ) goes from variables A to Z with 100 observations. I want to take variables A and D through F and save them as test.csv. I then want to run a Python script that takes this data and creates new variables AA to GG. I want to then take that information and append it to the XYZ dataset (making the dataset now go from A to GG with 100 observations), and then be able to run a second part of my DO file for analysis.
I have been doing this manually and it is fine but the file is going to start changing quickly and it would save me a lot of time.

Would this work (assuming you can get to Python)?
tempfile myfiletemp
save `myfiletemp'
outsheet A D-F using myfile1.csv, comma
shell python.exe myscript.py
insheet using myfile2.csv, clear
append using `myfiletemp'

Type "help shell" in Stata. What you want to do is shell out from Stata, call Python, and then have Stata resume whatever you want it to do after the Python script has completed. One caveat on your sketch: "append using" stacks observations below the existing ones, but you want to add variables to the same 100 observations; "merge 1:1 _n using" in the last line does that instead.
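To round out the answer, here is a minimal sketch of what myscript.py itself could look like, using only the standard library's csv module. The file names match the Stata snippet in the question; the derived variable AA and its formula are purely hypothetical placeholders for the real computation (the first block writing myfile1.csv only stands in for Stata's export so the sketch is self-contained):

```python
import csv

# Stand-in for the file Stata's outsheet would create (hypothetical data).
with open("myfile1.csv", "w", newline="") as f:
    f.write("A,D,E,F\n1,4,5,6\n2,7,8,9\n")

# myscript.py logic: read Stata's export as a list of dicts.
with open("myfile1.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    # Placeholder derivation: AA doubles A; replace with the real computation.
    row["AA"] = str(float(row["A"]) * 2)

# Write the enriched table; the header now carries the new variable names.
with open("myfile2.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```

Stata's insheet reads the header row, so the new variable names come in automatically.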

Related

Writing timestamps from commands displayed in Jenkins Output Log to either .txt or .CSV file

I'm relatively new to Python so was wondering if anyone can give some hints or tips regarding something I'm wanting to do using Python while it runs as part of a build on a Jenkins Pipeline.
To give a basic breakdown, I want to export/save timestamps from the Jenkins output, which currently timestamps all commands/strings that appear within it, while it is running a build, to either a .txt file or .csv file. These timestamps will be taken when specific commands/strings occur in the Jenkins output. I've given an example below of the timestamp and command being looked for.
"2021-08-17 11:46:38,899 - LOG: Successfully sent the test record"
I'd prefer to save just the timestamp itself, but if the full line needs to be saved then that would work as well, as there is a lot of information generated in the console that isn't of interest for what I want to do.
My ultimate goal would be to do this for multiple different and unique commands/strings that occur in the Jenkins output. Along with this, some testing I’d be doing would involve running the same script over and over for a set number of loops, so I’d want the timestamp data to be saved into a singular output file (and not overwritten) or in separate output files for each loop.
Any hints or tips for this would be greatly appreciated, as I've reached a dead end on what I can search up online. I've looked at using the logging module, at using the wait_for_value function to find the required command/string in the console output and then save/print it to a variable, and at whether regex would be suitable for the task.
Thanks in advance for any help on this.
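Regex is well suited to this. A minimal sketch, assuming log lines shaped like the example above (the console text is inlined here for illustration; in Jenkins you would read it from the build's log file instead). Opening the output file in append mode keeps timestamps from repeated runs instead of overwriting them:

```python
import re

# Hypothetical console text as it would appear in the Jenkins output.
console = """2021-08-17 11:46:38,899 - LOG: Successfully sent the test record
2021-08-17 11:46:40,123 - LOG: some other line that is not of interest
2021-08-17 11:46:41,500 - LOG: Successfully sent the test record"""

# Capture the leading timestamp only on lines containing the target string.
pattern = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*Successfully sent the test record"
)

# Append mode ("a") accumulates results across looped runs of the script.
with open("timestamps.csv", "a", newline="") as f:
    for line in console.splitlines():
        m = pattern.match(line)
        if m:
            f.write(m.group(1) + "\n")
```

For multiple different commands/strings, one pattern per target string (or a single alternation) written to the same file would extend this naturally.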

How does one deal with refresh delay when calling python from VBA using .txt as input and output?

I am using VBA macros in order to automate several data processing steps in Excel, such as data reduction and visualization. But since Excel has no appropriate fit for my purposes, I use a Python script built on scipy's least-squares cubic B-spline function. The input and output are done via .txt files, since I adjusted the script from a manual one I got from a friend.
VBA calls Python
Call common.callLSQCBSpline(targetrng, ThisWorkbook) ' calls Python, which works
Call common.callLoadFitTxt(targetrng, ThisWorkbook)  ' loads Python's output
Now the funny business:
This works in debug mode but does not when running "full speed". The solution to the problem is simply to wait for the directory where the .txt is written to refresh, so that the current and not the previous output file is loaded. My solution currently looks like this:
Call common.callLSQCBSpline(targetrng, ThisWorkbook)
Application.Wait (Now + 0.00005)
Call common.callLoadFitTxt(targetrng, ThisWorkbook)
This is slow and annoying but works. Is there a way to speed this up? The Python script works fine and writes the output.txt file properly. VBA just needs a second or two before it can load it. The .txt files are very small, under 1 kB.
Thanks in advance!
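One way to remove the race entirely is on the Python side: write the results to a temporary file and atomically rename it into place, so VBA can never open a half-written output.txt. A minimal sketch (the file name and contents are placeholders for the real fit output):

```python
import os
import tempfile

def write_atomically(path, text):
    """Write text to path so readers never see a half-written file.

    The data goes to a temporary file in the same directory first;
    os.replace then swaps it into place as a single atomic step.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, path)  # atomic on both Windows and POSIX
    except Exception:
        os.remove(tmp_path)  # clean up the temp file on failure
        raise

write_atomically("output.txt", "fit results go here\n")
```

On the VBA side you could then poll for the file with Dir() in a short loop instead of a fixed Application.Wait, since the file only appears once it is complete.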

access pandas dataframe from another file while it is still being updated

Consider a dataframe that constantly has new values appended at a given interval (say, every 10 mins) over a specific length of time (say, 300 mins). While data is being added to this dataframe, I want to simultaneously be able to read it from another file [meaning: perform some further processing/analytics on the dataframe values from another .py file]. How can I achieve this? I suspect that I need to use the multiprocessing or multithreading library, but can I read the dataframe from memory, or do I have to first write it to disk and then read the stored file?
Also, how can I run the first file (which appends the data) in the background so as to be able to work on other files from the IPython shell (I am using Spyder 3.3 and Python 2.7)?
I did some online reading on multiprocessing but couldn't understand how to go about the two issues referred to above. Generally, any pointers on how this can be achieved in the simplest possible way would be helpful.
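The simplest route is to let the appending script write the frame to disk and have the reader re-read that file on demand; the two scripts then only share a path, not memory. A minimal sketch using a background thread as a stand-in for the first script (the file name and row layout are made up; with pandas the reader would just call pd.read_csv on the same path, and the intervals are shortened from minutes to fractions of a second for illustration):

```python
import threading
import time

def producer(path, n_rows, interval):
    # Appends one row per interval; stands in for the data-collection script.
    for i in range(n_rows):
        with open(path, "a", newline="") as f:
            f.write(f"{i},{i * 10}\n")
        time.sleep(interval)

def read_snapshot(path):
    # Re-reads whatever is on disk right now (pd.read_csv in the pandas case).
    with open(path) as f:
        return [line.strip().split(",") for line in f]

path = "stream.csv"
open(path, "w").close()                    # start from an empty file
t = threading.Thread(target=producer, args=(path, 5, 0.05))
t.start()                                  # runs in the background
time.sleep(0.12)
partial = read_snapshot(path)              # reader sees a partial snapshot
t.join()
final = read_snapshot(path)                # complete once the producer ends
```

With two genuinely separate .py files the pattern is the same: run the appender as its own process (e.g. from a second console) and have the analysis script simply re-read the file each time it needs fresh data.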

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any big amount of data? If it shows only on particular files, then most probably it is related to some feature of those files and will also show on a smaller file with the same feature. If the main reason is just the big amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck of reading the file? Is it just a hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
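One concrete trick along the lines of the third suggestion: parse the files once and cache the parsed result with pickle, so later debugging runs skip the slow read entirely. A minimal sketch; expensive_read and all file names are placeholders for the real loading code:

```python
import os
import pickle

def expensive_read(path):
    # Placeholder for the real multi-minute file reading/parsing.
    return {"rows": list(range(1000)), "source": path}

def load_data(path, cache="data.pickle"):
    """Load the expensive input once, then reuse a pickled copy.

    On the first run the real files are parsed and the result is pickled;
    later runs deserialize the cache in a fraction of the time. Delete
    the cache file whenever the inputs (or the parsing code) change.
    """
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    data = expensive_read(path)       # the slow step, done only once
    with open(cache, "wb") as f:
        pickle.dump(data, f)
    return data

first = load_data("big_input.txt")    # slow path, writes data.pickle
second = load_data("big_input.txt")   # fast path, reads data.pickle
```

An alternative with the same effect is to do the slow load in an interactive session (IPython/Jupyter) once and repeatedly re-run only the changed code against the in-memory data, as the reload link above suggests.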

Excel Big Data Calculation (PCA...)

I have to do some calculation on data stored in excel for my internship.
I am supposed to aggregate market data (50 assets over 15 years) and do a Principal Component Analysis over the aggregated data.
For the moment I have the market data in a worksheet; I save it as tab-separated text (like CSV but with tabs instead of commas). Then I read it with R and use a powerful package to do the PCA. Finally, with R I create another tab-separated text file and read it through Excel. I now have the data and results in Excel and I can plot everything I want.
The problem is that the process is not enough automated for my colleagues.
As they said, they want a button in Excel which launches the PCA when clicked.
I've tried to install an Excel package (Rexcel) which allows you to use R functions directly in Excel. It's not working (a server problem) and not well documented. So I'm trying to find other ways to do big calculations directly in Excel. It seems that there is the same kind of package for using Python in Excel. I've also heard about other powerful languages compatible with Excel. The problem is that I can't install what I want on my computer (yeah, I have to call an IT guy for every package I want to install...), so it already took me 2/3 days to try the R solution. This is also why I'm looking for a simple solution; my colleagues won't have 2/3 days to install some Excel package to use my macro...
So I'm here to ask: what would be the easiest way to do PCA, using tools from other languages, directly in Excel?
Many thanks in advance
You can use the very handy executable Rscript to launch your R scripts automatically.
Within VBA you create a macro where you type something like this:
retVal = Shell(MY_RSCRIPT_BAT, vbNormalFocus) ' VBA code here
I assume that you can call a VBA macro from a button.
Your MY_RSCRIPT_BAT is a .bat file where you type something like:
@echo off
C:
PATH R_PATH;%path%
cd DEMO_PATH
Rscript your_pca_script.R
exit
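If a Python-in-Excel bridge turns out to be easier to get installed than Rexcel, the PCA itself needs nothing beyond numpy. A minimal sketch, with random numbers standing in for the 50 assets over 15 years (the function and variable names are illustrative, not a definitive implementation):

```python
import numpy as np

def pca(data, n_components=2):
    """Principal component analysis via SVD (no extra packages needed).

    data: observations in rows, variables (assets) in columns.
    Returns the scores (projected data) and the component loadings.
    """
    centered = data - data.mean(axis=0)          # center each variable
    # SVD of the centered matrix; rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    scores = centered @ components.T             # project onto the axes
    return scores, components

rng = np.random.default_rng(0)
prices = rng.normal(size=(180, 50))  # e.g. 15 years of monthly data, 50 assets
scores, components = pca(prices, n_components=3)
```

A script like this could be launched from the same VBA Shell call shown above, in place of Rscript, reading the tab-separated export and writing the scores back out for Excel to pick up.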
