I am writing some macros that call Python code to perform operations on ranges in Excel. It is much easier to do a lot of the required operations with pandas in Python. Because I want to do this while the spreadsheet is open (and may not have been saved), I am using the win32com.client to read in a range of cells to convert to a Pandas dataframe. However, this is extremely slow, presumably because the way I calculate it is very inefficient:
import datetime
import pytz
import pandas
import time
import win32com.client
def range_to_table(excelRange, tsy, tsx, height, width, add_cell_refs = True):
ii = 0
keys = []
while ii < width:
keys.append(str(excelRange[ii]))
ii += 1
colnumbers = {key:jj+tsx for jj, key in enumerate(keys)}
keys.append('rownumber')
mydict = {key:[] for key in keys}
while ii < width*height:
mydict[keys[ii%width]].append(excelRange[ii].value)
ii += 1
for yy in range(tsy + 1, tsy + 1 + height - 1): # add 1 to not include header
mydict['rownumber'].append(yy)
return (mydict, colnumbers)
ExcelApp = win32com.client.GetActiveObject('Excel.Application')
wb = ExcelApp.Workbooks('myworkbook.xlsm')
sheet_num = [sheet.Name for sheet in wb.Sheets].index("myworksheet name") + 1
ws = wb.Worksheets(sheet_num)
height = int(ws.Cells(1, 3)) # obtain table height from formula in excel spreadsheet
width = int(ws.Cells(1, 2)) # obtain table width from formula in excel spreadsheet
myrange = ws.Range(ws.Cells(2, 1), ws.Cells(2 + height - 1, 1 + width - 1))
df, colnumbers = range_to_table(myrange, 1, 1, height, width)
df = pandas.DataFrame.from_dict(df)
This works, but the range_to_table function I wrote is extremely slow for large tables since it iterates over each cell one by one.
I suspect there is probably a much better way to convert the Excel Range object to a Pandas dataframe. Do you know of a better way?
Here a simplified example of what my range would look like:
The height and width variables in the code are just taken from cells immediately above the table:
Any ideas here, or am I just going to have to save the workbook and use Pandas to read in the table from the saved file?
There are two parts to the operation: defining the spreadsheet range and then getting the data into Python. Here is the test data that I'm working with:
1. Defining the range: Excel has a feature called Dynamic Ranges. This allows you to give a name to a range whose extent is variable.
I've set up a dynamic range called 'DynRange', and you can see that it uses the row and column counts from $C$1 and $C$2 to define the size of the array.
Once you have this definition, the range can be used by Name in Python, and saves you having to access the row and column count explicitly.
2. Using this range in Python via win32.com: Once you have defined the name in Excel, handling it in Python is much easier.
import win32com.client as wc
import pandas as pd
#Create a dispatch interface
xl = wc.gencache.EnsureDispatch('Excel.Application')
filepath = 'SomeFilePath\\TestBook.xlsx'
#Open the workbook
wb = xl.Workbooks.Open(filepath)
#Get the Worksheet by name
ws = wb.Sheets('Sheet1')
#Use the Value property to get all the data in the range
listVals = ws.Range('DynRange').Value
#Construct the dataframe, using first row as headers
df = pd.DataFrame(listVals[1:],columns=listVals[0])
#Optionally process the datetime value to avoid tz warnings
df['Datetime'] = df['Datetime'].dt.tz_convert(None)
print(df)
wb.Close()
Output:
Datetime Principal Source Amt Cost Basis
0 2021-04-21 04:59:00 -5.0 1.001 5.0
1 2021-04-25 15:16:00 -348.26 1.001 10.0
2 2021-04-29 11:04:00 0.0 1.001 5.0
3 2021-04-29 21:26:00 0.0 1.001 5.0
4 2021-04-29 23:39:00 0.0 1.001 5.0
5 2021-05-02 14:00:00 -2488.4 1.001 5.0
As the OP suspects, iterating over the range cell-by-cell performs slowly. The COM infrastructure has to do a good deal of processing to pass data from one process (Excel) to another (Python). This is known as 'marshalling'. Most of the time is spent packing up the variables on one side and unpacking on the other. It is much more efficient to marshal the entire contents of an Excel Range in one go (as a 2D array) and Excel allows this by exposing the Value property on the Range as a whole, rather than by cell.
You can try using multiprocessing for this. You could have each worker scanning a different column for example, or even do the same on the lines.
Minor changes to your code are needed:
Create a function iterating over the columns and storing the information in a dict
Use the simple multiprocessing example # https://pymotw.com/2/multiprocessing/basics.html
Create a function appending all different dicts created by each worker into a single one
That should divide your compute time by the amount of workers used.
Related
I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and storing the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt
#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}
#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
gene=row[0],row[1],row[2],row[3],row[4]
expGenes.append(row)
#print(expGenes)
#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
read=row[0],row[1],row[2]
MCNgenes.append(read)
#find read counts
for r in expGenes:
#include FPKM in the dictionary
d[r[0]]=[r[4]]
regionCount=0
#set start and stop points based on transcript file
chr=r[1]
start=int(r[2])
stop=int(r[3])
#print("start:",start,"stop:",stop)
for row in MCNgenes:
if start < int(row[1]) < stop:
regionCount+=1
d[r[0]].append(regionCount)
n+=1
df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
If I understood correctly and you are trying to match between coordinates of genes in different files I believe the best option would be to use something like KDTree partitioning algorithm.
You can use KDtree to partition your DNA and RNA data. I'm assumming you're using 'start' and 'stop' as 'coordinates':
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree
dna = pd.DataFrame() # this is your dataframe with DNA data
rna = pd.DataFrame() # Same for RNA
# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]
dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)
# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
coord = np.array(start, stop)
dist, idx = dna_kd.query(coord, k=1)
# Assuming you need an exact match
if np.islose(dist, 0):
# Now that you have the index of the matchin row in DNA data
# you can extract information using the index and do whatever
# you want with it
dna_gene_data = dna.loc[idx, :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely extremely easy to work with at the cost of it being inefficient! Scientific libraries (such as pandas and numpy) help here by only paying the Python overhead a minimum limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with).
General advice
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (ie. there may be only 10 columns, but 10,000 rows ; go over the columns)
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a builtin one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
I'm a building energy simulation modeller with an Excel-question to enable automated large-scale simulations using parameter samples (samples generated using Monte Carlo). Now I have the following question in saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this or is it better to use some VBA-code? Anybody some idea?
(I'm quit experienced in simulation modelling but not in programming, therefore this rather easy parameter-sample-saving question in Excel. (Solutions in Python are also welcome if that's easier for you people))
my idea would be to use Python along with Pandas as it's one of the most flexible solutions, as your use case might expand in the future.
I'm gonna try making this as simple as possible. Though I'm assuming, that you have Python, that you know how to install packages via pip or conda and are ready to run a python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_xlsx('path/to/your/file.xlsx')
(Note that you might need to install the xlrd package, in addition to pandas)
Now you have a powerful data structure, that you can manipulate in plenty of ways. I guess the most intuitive one, would be to loop over all items. Use string formatting, which is best explained over here and put the strings together the way you need them:
outputs = {}
for row in df.index:
s = ""
for col in df.columns:
s += "{}={};\n".format(col, df[col][row])
print(s)
now you just need to write to a file using python's io method open. I'll just name the files by the index of the row, but this solution will overwrite older text files, created by earlier runs of this script. You might wonna add something unique like the date and time or the name of the file you read to it or increment the file name further with multiple runs of the script, for example like this.
All together we get:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
s = ""
for col in df.columns:
s += "{}={};\n".format(col, df[col][row])
file = open('test_{:03}.txt'.format(file_count), "w")
file.write(s)
file.close()
file_count += 1
Note that it's probably not the most elegant way and that there are one liners out there, but since you are not a programmer I thought you might prefer a more intuitive way, that you can tweak yourself easily.
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
Dim Hdr(1 To 4) As String
Dim x As Long
Dim y As Long
Dim z As Long
For x = 1 To 4
Hdr(x) = Cells(1, x)
Next x
x = 1
For y = 1 To 5
ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
ActiveSheet.Name = y
For z = 1 To 4
With ActiveSheet
.Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
End With
Next z
x = x + 1
ActiveSheet.Move
ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
ActiveWorkbook.Close SaveChanges:=False
Next y
End Sub
If you can save your Excel spreadsheet as a CSV file then this python script will do what you want.
with open('data.csv') as file:
data_list = [l.rstrip('\n').split(',') for l in file]
counter = 1
for x in range (1, len (data_list)) :
output_file_name = str (counter) + '.txt'
with open (output_file_name, 'w' ) as file :
for x in range (len (data_list [counter])) :
print (x)
output_string = data_list [0] [x] + '=' + data_list [counter] [x] + ';\n'
file.write (output_string)
counter += 1
I'm reading in a very large (15M lines) csv file into a panda dataframe. I then want to split it in smaller ones (ultimately creating smaller csv files, or a panda panel...).
I have working code but it's VERY slow. I believe it's not taking advantage of the fact that my dataframe is 'ordered'.
The df looks like:
ticker date open high low
0 AAPL 1999-11-18 45.50 50.0000 40.0000
1 AAPL 1999-11-19 42.94 43.0000 39.8100
2 AAPL 1999-11-22 41.31 44.0000 40.0600
...
1000 MSFT 1999-11-18 45.50 50.0000 40.0000
1001 MSFT 1999-11-19 42.94 43.0000 39.8100
1002 MSFT 1999-11-22 41.31 44.0000 40.0600
...
7663 IBM 1999-11-18 45.50 50.0000 40.0000
7664 IBM 1999-11-19 42.94 43.0000 39.8100
7665 IBM 1999-11-22 41.31 44.0000 40.0600
I want to take all rows where symbol=='AAPL', and make a dataframe with it. Then all rows where symbol=='MSFT', and so on. The number of rows for each symbol is NOT the same, and the code has to adapt. I might load in a new 'large' csv where everything is different.
This is what I came up with:
#Read database
alldata = pd.read_csv('./alldata.csv')
#get a list of all unique ticker present in the database
alltickers = alldata.iloc[:,0].unique();
#write data of each ticker in its own csv file
for ticker in alltickers:
print('Creating csv for '+ticker)
#get data for current ticker
tickerdata = alldata.loc[alldata['ticker'] == ticker]
#remove column with ticker symbol (will be the file name) and reindex as
#we're grabbing from somwhere in a large dataframe
tickerdata = tickerdata.iloc[:,1:13].reset_index(drop=True)
#write csv
tickerdata.to_csv('./split/'+ticker+'.csv')
this takes forever to run. I thought it was the file I/O, but I commented the write csv part in the for loop, and I see that this line is the problem:
tickerdata = alldata.loc[alldata['ticker'] == ticker]
I wonder if pandas is looking in the WHOLE dataframe every single time. I do know that the dataframe is in order of ticker. Is there a way to leverage that?
Thank you very much!
Dave
Easiest way to do this is to create a dictionary of the dataframes using a dictionary comprehension and pandas groupby
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}
dodf['IBM']
ticker date open high low
7663 IBM 1999-11-18 45.50 50.0 40.00
7664 IBM 1999-11-19 42.94 43.0 39.81
7665 IBM 1999-11-22 41.31 44.0 40.06
It makes sense that creating a boolean index of length 15 million, and doing it repeatedly, is going to take a little while. Honestly, for splitting the file into subfiles, I think Pandas is the wrong tool for the job. I'd just use a simple loop to iterate over the lines in the input file, writing them to the appropriate output file as they come. This doesn't even have to load the whole file at once, so it will be fairly fast.
import itertools as it
tickers = set()
with open('./alldata.csv') as f:
headers = next(f)
for ticker, lines in it.groupby(f, lambda s: s.split(',', 1)[0]):
with open('./split/{}.csv'.format(ticker), 'a') as w:
if ticker not in tickers:
w.writelines([headers])
tickers.add(ticker)
w.writelines(lines)
Then you can load each individual split file using pd.read_csv() and turn that into its own DataFrame.
If you know that the file is ordered by ticker, then you can skip everything involving the set tickers (which tracks which tickers have already been encountered). But that's a fairly cheap check.
Probably, the best approach is to use groupby. Suppose:
>>> df
ticker v1 v2
0 A 6 0.655625
1 A 2 0.573070
2 A 7 0.549985
3 B 32 0.155053
4 B 10 0.438095
5 B 26 0.310344
6 C 23 0.558831
7 C 15 0.930617
8 C 32 0.276483
Then group:
>>> grouped = df.groupby('ticker', as_index=False)
Finally, iterate over your groups:
>>> for g, df_g in grouped:
... print('creating csv for ', g)
... print(df_g.to_csv())
...
creating csv for A
,ticker,v1,v2
0,A,6,0.6556248347252436
1,A,2,0.5730698850517599
2,A,7,0.5499849530664374
creating csv for B
,ticker,v1,v2
3,B,32,0.15505313728451087
4,B,10,0.43809490694469133
5,B,26,0.31034386153099336
creating csv for C
,ticker,v1,v2
6,C,23,0.5588311692150466
7,C,15,0.930617426953476
8,C,32,0.2764826801584902
Of course, here I am printing a csv, but you can do whatever you want.
Using groupby is great, but it does not take advantage of the fact that the data is presorted and so will likely have more overhead compared to a solution that does. For a large dataset, this could be a noticeable slowdown.
Here is a method which is optimized for the sorted case:
import pandas as pd
import numpy as np
alldata = pd.read_csv("tickers.csv")
tickers = np.array(alldata.ticker)
# use numpy to compute change points, should
# be super fast and yield performance boost over groupby:
change_points = np.where(
tickers[1:] != tickers[:-1])[0].tolist()
# add last point in as well to get last ticker block
change_points += [tickers.size - 1]
prev_idx = 0
for idx in change_points:
ticker = alldata.ticker[idx]
print('Creating csv for ' + ticker)
# get data for current ticker
tickerdata = alldata.iloc[prev_idx: idx + 1]
tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
tickerdata.to_csv('./split/' + ticker + '.csv')
prev_idx = idx + 1
I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)' . I have repeated this experiment 15x, by changing a specific parameter (Collision Energy). As a result, I have 15 CSV files and would like to align/join them within the same range of m/z values and same interval values. Due to the instrument thresholding rules, certain m/z values were not recorded, thus I have files that cannot simply be exported into excel and copy/pasted. The data looks a bit like the tables posted below
Dataset 1: x | y Dataset 2: x | y
--------- ---------
0.0 5 0.0 2
0.5 3 0.5 6
2.0 7 1.0 9
3.0 1 2.5 1
3.0 4
Using matlab I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesnt matter to me as I can change it later, however I would like to have the table continue as such:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in Matlab so that this is at least semi-automated? I've had posted earlier describing a similar scenario but it turned out that it could not carry out what I need. I must admit that my mind is not of a programmer so I have been struggling with this problem quite a bit.
PS- Is this problem best executed in Matlab or Python?
I don't know or use matlab so my answer is pure python based. I think python and matlab should be equally well suited to read csv files and generate a master table.
Please consider this answer more as pointer to how to address the problem in python.
In python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can read natively a large set of file formats including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.cvs')
master_table = pd.concat([df, df2])
Pandas further allows to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This step is not needed (just recreates your tables) and I copied dataset2 to create dataset3 for illustration. You could use 'readtable' to import your data i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge tables using outerjoin. You could use loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as Excel Sheet by using writetable
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');
I managed to create a solution to my problem based on learning through everyone's input and taking an online matlab courses. I am not a natural coder so my script is not as elegant as the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.
% Create a matrix with the input x range then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange)
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});
files = dir('*.csv');
for q=1:length(files);
% Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
data{q} = csvread(files(q).name,2,1);
data1 = data(q);
data2 = cell2mat(data1);
data3 = array2table(data2,'VariableNames',{'XAxis','YAxis'});
% Join the datarange table and the intensity table to obtain an evenly spaced m/z range
data3 = outerjoin(datarange,data3,'MergeKeys',true);
data3.YAxis(isnan(data3.YAxis)) = 0;
data3.XAxis = round(data3.XAxis,1);
% Remove duplicate values
data4 = sortrows(data3,[1 -2]);
[~, idx] = unique(data4.XAxis);
data4 = data4(idx,:);
% Save the file as the same name in CSV without underscores or dashes
filename = files(q).name;
filename = strrep(filename,'_','');
filename = strrep(filename,'-','');
filename = strrep(filename,'.csv','');
writetable(data4,filename,'FileType','text');
clear data data1 data2 data3 data4 filename
end
clear
I do a lot of data analysis in perl and I am trying to replicate this work in python using pandas, numpy, matplotlib, etc.
The general workflow goes as follows:
1) glob all the files in a directory
2) parse the files because they have metadata
3) use regex to isolate relevant lines in a given file (They usually begin with a tag such as 'LOOPS')
4) split the lines that match the tag and load data into hashes
5) do some data analysis
6) make some plots
Here is a sample of what I typically do in perl:
print"Reading File:\n"; # gets data
foreach my $vol ($SmallV, $LargeV) {
my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
my #files = <$base_name*>; # globs for file names
foreach my $f (#files) { # loops through matching files
print"... $f\n";
my #split = split(/_/, $f);
my $beta = $split[4];
if (!grep{$_ eq $beta} #{$Beta{$vol}}) { # constructs Beta hash
push(#{$Beta{$vol}}, $split[4]);
}
open(IN, "<", "$f") or die "cannot open < $f: $!"; # reads in the file
chomp(my #in = <IN>);
close IN;
my #lines = grep{$_=~/^LOOPS/} #in; # greps for lines with the header LOOPS
foreach my $l (#lines) { # loops through matched lines
my #split = split(/\s+/, $l); # splits matched lines
push(#{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
if (!grep{$_ eq $split[1]} #smearingt) {# fills the smearing time array
push(#smearingt, $split[1]);
}
if (!grep{$_ eq $split[4]} #{$block{$vol}}) {# fills the number of blockings
push(#{$block{$vol}}, $split[4]);
}
}
}
foreach my $beta (#{$Beta{$vol}}) {
foreach my $loop (0,1,2,3,4) { # loops over observables
foreach my $b (#{$block{$vol}}) { # beta values
foreach my $t (#smearingt) { # and smearing times
$avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(#{$val{$vol}{$beta}{$t}{$loop}{$b}}); # to find statistics
$err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(#{$val{$vol}{$beta}{$t}{$loop}{$b}});
}
}
}
}
}
print"File Read in Complete!\n";
My hope is to load this data into a Hierarchical Indexed data structure with indices of the perl hash becoming indicies of my python data structure. Every example I have come across so far of pandas data structures has been highly contrived where the whole structure (indicies and values) was assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I can not assign the data all at once because I don't know what mass, beta, sizes, etc are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable, I will have to parse through them using regex which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages, standard deviations, perform mathematical operations, and plot the data.
Typical data has a header that is an unknown number of lines long but the stuff I care about looks like this:
Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
What I (think) I want, I preface this with think because I am new to python and an expert may know a better data structure, is a Hierarchical Indexed Series that would look like this:
volume mass beta observable t value
1224 0.0 5.6 0 0 1.234
1 1.490
2 1.222
1 0 1.234
1 1.234
2448 0.0 5.7 0 1 1.234
and so on like this: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical
For those of you who don't understand the perl:
The meat and potatoes of what I need is this:
push(#{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
What I have here is a hash called 'val'. This is a hash of arrays. I believe in python speak this would be a dict of lists. Here each thing that looks like this: '{$something}' is a key in the hash 'val' and I am appending the value stored in the variable $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data is there are a lot of keys for each quantity that I am interested in.
==========
UPDATE
I have come up with the following code which results in this error:
Traceback (most recent call last):
File "wflow_2lattice_matching.py", line 39, in <module>
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined
Code:
#!/usr/bin/python
from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy
flavor = 4
mass = 0.0
vol = []
b = []
m_t = []
w_t = []
val = []
#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
for filename in filelist:
print 'Reading filename: '+filename
f = open(filename, 'r')
junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
ftext = f.readlines()
for line in ftext:
if re.match('^WFLOW.*', line):
line=line.strip()
junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
vol.append(v)
b.append(beta)
m_t.append(mont_t)
w_t.append(smear_t)
val.append(wilson_flow)
zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
You are getting the following:
NameError: name 'MultiIndex' is not defined
because you are not importing MultiIndex directly when you import Series and DataFrame.
You have -
from pandas import Series, DataFrame
You need -
from pandas import Series, DataFrame, MultiIndex
or you can instead refer to MultiIndex using pd.MultiIndex since you are importing pandas as pd
Hopefully this helps you get started?
import sys, os
def regex_match(line) :
return 'LOOPS' in line
my_hash = {}
for fd in os.listdir(sys.argv[1]) : # for each file in this directory
for line in open(sys.argv[1] + '/' + fd) : # get each line of the file
if regex_match(line) : # if its a line I want
line.rstrip('\n').split('\t') # get the data I want
my_hash[line[1]] = line[2] # store the data
for key in my_hash : # data science can go here?
do_something(key, my_hash[key] * 12)
# plots
p.s. make the first line
#!/usr/bin/python
(or whatever which python returns ) to run as an executable
To glob your files, use the built-in glob module in Python.
To read your csv files after globbing them, the read_csv function that you can import using from pandas.io.parsers import read_csv will help you do that.
As for MultiIndex feature in the pandas dataframe that you instantiate after using read_csv, you can then use them to organize your data and slice them anyway you want.
3 pertinent links for your reference.
Understanding MultiIndex dataframes in pandas - understanding MultiIndex and Benefits of panda's multiindex?
Using glob in a directory to grab and manipulate your files - extract values/renaming filename in python