I do a lot of data analysis in Perl and I am trying to replicate this work in Python using pandas, numpy, matplotlib, etc.
The general workflow goes as follows:
1) glob all the files in a directory
2) parse the files because they have metadata
3) use regex to isolate relevant lines in a given file (they usually begin with a tag such as 'LOOPS')
4) split the lines that match the tag and load the data into hashes
5) do some data analysis
6) make some plots
Here is a sample of what I typically do in Perl:
print"Reading File:\n"; # gets data
foreach my $vol ($SmallV, $LargeV) {
my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
my #files = <$base_name*>; # globs for file names
foreach my $f (#files) { # loops through matching files
print"... $f\n";
my #split = split(/_/, $f);
my $beta = $split[4];
if (!grep{$_ eq $beta} #{$Beta{$vol}}) { # constructs Beta hash
push(#{$Beta{$vol}}, $split[4]);
}
open(IN, "<", "$f") or die "cannot open < $f: $!"; # reads in the file
chomp(my #in = <IN>);
close IN;
my #lines = grep{$_=~/^LOOPS/} #in; # greps for lines with the header LOOPS
foreach my $l (#lines) { # loops through matched lines
my #split = split(/\s+/, $l); # splits matched lines
push(#{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
if (!grep{$_ eq $split[1]} #smearingt) {# fills the smearing time array
push(#smearingt, $split[1]);
}
if (!grep{$_ eq $split[4]} #{$block{$vol}}) {# fills the number of blockings
push(#{$block{$vol}}, $split[4]);
}
}
}
foreach my $beta (#{$Beta{$vol}}) {
foreach my $loop (0,1,2,3,4) { # loops over observables
foreach my $b (#{$block{$vol}}) { # beta values
foreach my $t (#smearingt) { # and smearing times
$avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(#{$val{$vol}{$beta}{$t}{$loop}{$b}}); # to find statistics
$err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(#{$val{$vol}{$beta}{$t}{$loop}{$b}});
}
}
}
}
}
print"File Read in Complete!\n";
My hope is to load this data into a hierarchically indexed data structure, with the indices of the Perl hash becoming the indices of my Python data structure. Every example of pandas data structures I have come across so far has been highly contrived, where the whole structure (indices and values) was assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know what masses, betas, sizes, etc. are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable, and I will have to parse through them using regex, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages and standard deviations, perform mathematical operations, and plot the data.
Typical data has a header that is an unknown number of lines long but the stuff I care about looks like this:
Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
What I think I want (I preface this with "think" because I am new to Python and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:
volume  mass  beta  observable  t  value
1224    0.0   5.6   0           0  1.234
                                1  1.490
                                2  1.222
                    1           0  1.234
                                1  1.234
2448    0.0   5.7   0           1  1.234
and so on like this: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical
For those of you who don't understand the Perl:
The meat and potatoes of what I need is this:
push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
What I have here is a hash called 'val'. It is a hash of arrays; I believe in Python speak this would be a dict of lists. Each thing that looks like '{$something}' is a key in the hash 'val', and I am appending the value stored in $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
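If it helps to see the translation, a minimal sketch of that Perl hash-of-arrays in Python would be a defaultdict of lists keyed by tuples (the variable names below are placeholders, and the column positions follow the Perl split indices):
from collections import defaultdict

val = defaultdict(list)   # dict of lists keyed by (vol, beta, smearing_time, loop, blocking)

# hypothetical: vol and beta come from the filename, loops_lines are the grepped LOOPS lines
for line in loops_lines:
    split = line.split()
    val[(vol, beta, split[1], split[2], split[4])].append(float(split[6]))
Tuple keys like these map naturally onto a pandas MultiIndex later on.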
==========
UPDATE
I have come up with the following code which results in this error:
Traceback (most recent call last):
File "wflow_2lattice_matching.py", line 39, in <module>
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined
Code:
#!/usr/bin/python
from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy
flavor = 4
mass = 0.0
vol = []
b = []
m_t = []
w_t = []
val = []
#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
    filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
    for filename in filelist:
        print 'Reading filename: '+filename
        f = open(filename, 'r')
        junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
        ftext = f.readlines()
        for line in ftext:
            if re.match('^WFLOW.*', line):
                line = line.strip()
                junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
                vol.append(v)
                b.append(beta)
                m_t.append(mont_t)
                w_t.append(smear_t)
                val.append(wilson_flow)
zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
You are getting the following:
NameError: name 'MultiIndex' is not defined
because you are not importing MultiIndex directly when you import Series and DataFrame.
You have -
from pandas import Series, DataFrame
You need -
from pandas import Series, DataFrame, MultiIndex
Or you can instead refer to MultiIndex as pd.MultiIndex, since you are importing pandas as pd.
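With the import fixed, the end of your script would then look something like this (a sketch; note that the names list in the posted code also seems to be missing a closing quote after 'montecarlo_time', and zip probably should take the list b rather than the single string beta):
from pandas import Series, DataFrame, MultiIndex

zipped = list(zip(vol, b, m_t, w_t))   # b is the list of beta values collected in the loop
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time', 'smearing_time'])
data = Series(val, index=index)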
Hopefully this helps you get started?
import sys, os

def regex_match(line):
    return 'LOOPS' in line

my_hash = {}
for fd in os.listdir(sys.argv[1]):                  # for each file in this directory
    for line in open(sys.argv[1] + '/' + fd):       # get each line of the file
        if regex_match(line):                       # if it's a line I want
            fields = line.rstrip('\n').split('\t')  # get the data I want
            my_hash[fields[1]] = fields[2]          # store the data

for key in my_hash:                                 # data science can go here?
    do_something(key, my_hash[key] * 12)

# plots
P.S. Make the first line
#!/usr/bin/python
(or whatever which python returns) so you can run it as an executable.
To glob your files, use the built-in glob module in Python.
To read your CSV files after globbing them, use the read_csv function, which you can import with from pandas.io.parsers import read_csv.
As for the MultiIndex feature of the pandas DataFrame you instantiate after using read_csv, you can use it to organize your data and slice it any way you want.
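A minimal sketch of how those pieces could fit together for the LOOPS records shown above (the glob pattern and the column names are assumptions, not your actual layout):
import glob
import io
import pandas as pd

frames = []
for filename in glob.glob('4flav_*/BlockedWflow_low_*'):      # assumed pattern
    with open(filename) as fh:
        # keep only the LOOPS records; the other line types have different widths
        loops = ''.join(line for line in fh if line.startswith('LOOPS'))
    df = pd.read_csv(io.StringIO(loops), delim_whitespace=True, header=None,
                     names=['tag', 'smearing_time', 'observable', 'col3',
                            'blocking', 'col5', 'value'])
    df['filename'] = filename
    frames.append(df)

data = (pd.concat(frames)
          .set_index(['filename', 'smearing_time', 'observable', 'blocking'])['value'])
With a MultiIndexed Series like this, groupby(level=[...]).mean() and .std() give the averages and standard deviations computed in the Perl version.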
3 pertinent links for your reference.
Understanding MultiIndex dataframes in pandas - understanding MultiIndex and Benefits of panda's multiindex?
Using glob in a directory to grab and manipulate your files - extract values/renaming filename in python
Related
I am writing some macros that call Python code to perform operations on ranges in Excel. It is much easier to do a lot of the required operations with pandas in Python. Because I want to do this while the spreadsheet is open (and may not have been saved), I am using the win32com.client to read in a range of cells to convert to a Pandas dataframe. However, this is extremely slow, presumably because the way I calculate it is very inefficient:
import datetime
import pytz
import pandas
import time
import win32com.client
def range_to_table(excelRange, tsy, tsx, height, width, add_cell_refs=True):
    ii = 0
    keys = []
    while ii < width:
        keys.append(str(excelRange[ii]))
        ii += 1
    colnumbers = {key: jj + tsx for jj, key in enumerate(keys)}
    keys.append('rownumber')
    mydict = {key: [] for key in keys}
    while ii < width*height:
        mydict[keys[ii % width]].append(excelRange[ii].value)
        ii += 1
    for yy in range(tsy + 1, tsy + 1 + height - 1):  # add 1 to not include header
        mydict['rownumber'].append(yy)
    return (mydict, colnumbers)
ExcelApp = win32com.client.GetActiveObject('Excel.Application')
wb = ExcelApp.Workbooks('myworkbook.xlsm')
sheet_num = [sheet.Name for sheet in wb.Sheets].index("myworksheet name") + 1
ws = wb.Worksheets(sheet_num)
height = int(ws.Cells(1, 3)) # obtain table height from formula in excel spreadsheet
width = int(ws.Cells(1, 2)) # obtain table width from formula in excel spreadsheet
myrange = ws.Range(ws.Cells(2, 1), ws.Cells(2 + height - 1, 1 + width - 1))
df, colnumbers = range_to_table(myrange, 1, 1, height, width)
df = pandas.DataFrame.from_dict(df)
This works, but the range_to_table function I wrote is extremely slow for large tables since it iterates over each cell one by one.
I suspect there is probably a much better way to convert the Excel Range object to a Pandas dataframe. Do you know of a better way?
Here is a simplified example of what my range would look like.
The height and width variables in the code are just taken from the cells immediately above the table.
Any ideas here, or am I just going to have to save the workbook and use Pandas to read in the table from the saved file?
There are two parts to the operation: defining the spreadsheet range and then getting the data into Python. Here is the test data that I'm working with:
1. Defining the range: Excel has a feature called Dynamic Ranges. This allows you to give a name to a range whose extent is variable.
I've set up a dynamic range called 'DynRange', and you can see that it uses the row and column counts from $C$1 and $C$2 to define the size of the array.
Once you have this definition, the range can be used by Name in Python, and saves you having to access the row and column count explicitly.
2. Using this range in Python via win32.com: Once you have defined the name in Excel, handling it in Python is much easier.
import win32com.client as wc
import pandas as pd
#Create a dispatch interface
xl = wc.gencache.EnsureDispatch('Excel.Application')
filepath = 'SomeFilePath\\TestBook.xlsx'
#Open the workbook
wb = xl.Workbooks.Open(filepath)
#Get the Worksheet by name
ws = wb.Sheets('Sheet1')
#Use the Value property to get all the data in the range
listVals = ws.Range('DynRange').Value
#Construct the dataframe, using first row as headers
df = pd.DataFrame(listVals[1:],columns=listVals[0])
#Optionally process the datetime value to avoid tz warnings
df['Datetime'] = df['Datetime'].dt.tz_convert(None)
print(df)
wb.Close()
Output:
Datetime Principal Source Amt Cost Basis
0 2021-04-21 04:59:00 -5.0 1.001 5.0
1 2021-04-25 15:16:00 -348.26 1.001 10.0
2 2021-04-29 11:04:00 0.0 1.001 5.0
3 2021-04-29 21:26:00 0.0 1.001 5.0
4 2021-04-29 23:39:00 0.0 1.001 5.0
5 2021-05-02 14:00:00 -2488.4 1.001 5.0
As the OP suspects, iterating over the range cell-by-cell performs slowly. The COM infrastructure has to do a good deal of processing to pass data from one process (Excel) to another (Python). This is known as 'marshalling'. Most of the time is spent packing up the variables on one side and unpacking on the other. It is much more efficient to marshal the entire contents of an Excel Range in one go (as a 2D array) and Excel allows this by exposing the Value property on the Range as a whole, rather than by cell.
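If you would rather keep the computed range from the question instead of defining a name, the same single-call idea still applies; roughly (a sketch assuming height and width as in the question, with the header in row 2):
# Fetch the whole block of cells in one COM call; for a multi-cell range,
# .Value comes back as a tuple of row tuples.
vals = ws.Range(ws.Cells(2, 1), ws.Cells(1 + height, width)).Value

# First row is the header, the rest is data
df = pd.DataFrame(list(vals[1:]), columns=vals[0])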
You can try using multiprocessing for this. You could have each worker scan a different column, for example, or do the same across rows.
Minor changes to your code are needed:
Create a function iterating over the columns and storing the information in a dict
Use the simple multiprocessing example from https://pymotw.com/2/multiprocessing/basics.html
Create a function appending the different dicts created by each worker into a single one (a sketch of this structure follows below)
That should divide your compute time by the number of workers used.
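A rough sketch of that structure (note that COM objects themselves can't be handed to worker processes, so this assumes the raw cell values have already been pulled out of Excel, one list per column):
import multiprocessing as mp

def convert_column(args):
    # placeholder per-column work: turn raw cell values into a clean list
    name, values = args
    return name, [v for v in values]

def columns_to_dict_parallel(headers, columns, workers=4):
    # one task per column; the parent merges the per-column results into one dict
    with mp.Pool(workers) as pool:
        results = pool.map(convert_column, list(zip(headers, columns)))
    return dict(results)

if __name__ == '__main__':
    headers = ['a', 'b']                      # hypothetical header row
    columns = [[1, 2, 3], [4, 5, 6]]          # hypothetical cell values, one list per column
    print(columns_to_dict_parallel(headers, columns))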
I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes), plus 12 files corresponding to DNA reads. I have a Python script that searches for reads overlapping each gene's coordinates and stores the values in a dictionary. I then use this dictionary to create a pandas dataframe and save it as a CSV. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}
#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)

#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)

#find read counts
n=0 #counter
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
    n+=1

df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
If I understood correctly and you are trying to match coordinates of genes between different files, I believe the best option would be to use something like a KDTree partitioning algorithm.
You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as the coordinates:
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame() # this is your dataframe with DNA data
rna = pd.DataFrame() # Same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist, 0):
        # Now that you have the index of the matching row in DNA data
        # you can extract information using the index and do whatever
        # you want with it
        dna_gene_data = dna.iloc[idx[0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely easy to work with, at the cost of being inefficient. Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with directly).
General advice
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e., if there are only 10 columns but 10,000 rows, go over the columns); see the sketch after the links below for these ideas applied to the counting loop in the question
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
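As one illustration of that advice applied to the counting loop in this question: if the check is just whether a read's start falls between a gene's start and stop, sorting the read starts once and using numpy's searchsorted removes the inner Python loop entirely. This is only a sketch; the file names and column names are assumptions based on the samples posted, and, like the original loop, it ignores the chromosome column:
import numpy as np
import pandas as pd

# assumed layouts matching the samples in the question
genes = pd.read_csv('top500_R8_7-CON-2-RNA.gtf', sep='\t', header=None,
                    names=['gene', 'chrom', 'start', 'stop', 'fpkm'])
reads = pd.read_csv('D2_7-CON-2-6.bed', sep='\t', header=None,
                    names=['chrom', 'start', 'stop', 'name', 'cov', 'strand'])

# sort all read start positions once
read_starts = np.sort(reads['start'].to_numpy())

# count read starts strictly inside (gene start, gene stop) for every gene at once
genes['read_count'] = (np.searchsorted(read_starts, genes['stop'], side='left')
                       - np.searchsorted(read_starts, genes['start'], side='right'))

genes[['gene', 'fpkm', 'read_count']].to_csv('counts_forHeatmap.csv', index=False)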
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a built-in one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
I'm a building energy simulation modeller with an Excel question about enabling automated large-scale simulations using parameter samples (generated using Monte Carlo). I have the following question about saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Does anybody have an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather basic parameter-sample-saving question about Excel. Solutions in Python are also welcome if that's easier for you.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to make this as simple as possible, though I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install the xlrd package, in addition to pandas)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive one would be to loop over all items. Use string formatting, which is best explained over here, and put the strings together the way you need them:
outputs = {}
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open. I'll just name the files by a running count, but this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique, like the date and time or the name of the file you read, or keep incrementing the file name across multiple runs of the script.
All together we get:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')

file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    file = open('test_{:03}.txt'.format(file_count), "w")
    file.write(s)
    file.close()
    file_count += 1
Note that it's probably not the most elegant way and that there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive way that you can tweak yourself easily.
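For reference, one of those more compact variants could look like this (same assumed xlsx layout, using iterrows):
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')

for i, row in df.iterrows():
    with open('{}.txt'.format(i + 1), 'w') as fh:
        fh.write(''.join('{}={};\n'.format(col, val) for col, val in row.items()))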
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long

    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x

    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file then this python script will do what you want.
with open('data.csv') as file:
    data_list = [l.rstrip('\n').split(',') for l in file]

counter = 1
for x in range(1, len(data_list)):
    output_file_name = str(counter) + '.txt'
    with open(output_file_name, 'w') as file:
        for x in range(len(data_list[counter])):
            print(x)
            output_string = data_list[0][x] + '=' + data_list[counter][x] + ';\n'
            file.write(output_string)
    counter += 1
I have several hundred text files. I want to extract a specific column with a set number of rows. The files are exactly the same; the only thing different is the data values. I want to put that data into a new text file, with each new column preceding the previous one.
The file is a .sed, basically the same as a .txt file. This is what it looks like (the file actually goes from Wvl 350-2150):
Comment:
Version: 2.2
File Name: C:\Users\HyLab\Desktop\Curtis
Bernard\PSR+3500_1596061\PSR+3500_1596061\2019_Feb_16\Contact_00186.sed
<Metadata>
Collected By:
Sample Name:
Location:
Description:
Environment:
</Metadata>
Instrument: PSR+3500_SN1596061 [3]
Detectors: 512,256,256
Measurement: REFLECTANCE
Date: 02/16/2019,02/16/2019
Time: 13:07:52.66,13:29:17.00
Temperature (C): 31.29,8.68,-5.71,31.53,8.74,-5.64
Battery Voltage: 7.56,7.20
Averages: 10,10
Integration: 2,2,2,10,8,2
Dark Mode: AUTO,AUTO
Foreoptic: PROBE {DN}, PROBE {DN}
Radiometric Calibration: DN
Units: None
Wavelength Range: 350,2500
Latitude: n/a
Longitude: n/a
Altitude: n/a
GPS Time: n/a
Satellites: n/a
Calibrated Reference Correction File: none
Channels: 2151
Columns [5]:
Data:
Chan.# Wvl Norm. DN (Ref.) Norm. DN (Target) Reflect. %
0 350.0 1.173460E+002 1.509889E+001 13.7935
1 351.0 1.202493E+002 1.529762E+001 13.6399
2 352.0 1.232869E+002 1.547818E+001 13.4636
3 353.0 1.264006E+002 1.563467E+001 13.2665
4 354.0 1.294906E+002 1.578425E+001 13.0723
I've taken some coding classes, but that was a long time ago. I figured this is a pretty straightforward problem even for a novice coder, which I am not, but I can't seem to find anything like this, so I was hoping for help on here.
I honestly don't need anything fancy; just something like this would be amazing, so I don't have to copy and paste from each file!
12.3 11.3 etc...
12.3 11.3 etc...
12.3 11.3 etc...
etc.. etc.. etc...
In MATLAB R2016b or later, the easiest way to do this would be using readtable:
t = readtable('file.sed', delimitedTextImportOptions( ...
'NumVariables', 5, 'DataLines', 36, ...
'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
where
file.sed is the name of the file
'NumVariables', 5 means there are 5 columns of data
'DataLines', 36 means the data starts on the 36th line and continues to the end of the file
'Delimiter', ' ' means the character that separates the columns is a space
'ConsecutiveDelimitersRule', 'join' means treat more than one space as if they were just one (rather than as if they separate empty columns of data).
This assumes that the example file you've posted is in the exact format of your real data. If it's different you may have to modify the parameters above, possibly with reference to the help for delimitedTextImportOptions or as an alternative, fixedWidthImportOptions.
Now you have a MATLAB table t with five columns, of which column 2 is the wavelengths and column 5 is the reflectances - I assume that's the one you want? You can access that column with
t(:,5)
So to collect all the reflectance columns into one table you would do something like
fileList = something % get the list of files from somewhere - say as a string array or a cell array of char
resultTable = table;
for ii = 1:numel(fileList)
    sedFile = fileList{ii};
    t = readtable(sedFile, delimitedTextImportOptions( ...
        'NumVariables', 5, 'DataLines', 36, ...
        'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
    t.Properties.VariableNames{5} = sprintf('Reflectance%d', ii);
    resultTable = [resultTable, t(:,5)];
end
The t.Properties.VariableNames ... line is there because column 5 of t will be called Var5 every time, but in the result table each variable name needs to be unique. Here we're renaming the output table variables Reflectance1, Reflectance2 etc but you could change this to whatever you want - perhaps the name of the actual file from sedFile - as long as it's a valid unique variable name.
Finally you can save the result table to a text file using writetable. See the MATLAB help for how to use that.
In Python 3.x with numpy:
import numpy as np

file_list = something  # filenames in a Python list
result_array = None
for sed_file in file_list:
    reflectance_column = np.genfromtxt(sed_file, skip_header=35, usecols=4)
    result_array = (reflectance_column if result_array is None else
                    np.column_stack((result_array, reflectance_column)))

np.savetxt('outputfile.txt', result_array)
Here
skip_header=35 ignores the first 35 lines
usecols=4 only returns column 5 (Python uses zero-based indexing)
see the help for savetxt for further details
I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)' . I have repeated this experiment 15x, by changing a specific parameter (Collision Energy). As a result, I have 15 CSV files and would like to align/join them within the same range of m/z values and same interval values. Due to the instrument thresholding rules, certain m/z values were not recorded, thus I have files that cannot simply be exported into excel and copy/pasted. The data looks a bit like the tables posted below
Dataset 1:  x  | y        Dataset 2:  x  | y
           ----|---                  ----|---
            0.0| 5                    0.0| 2
            0.5| 3                    0.5| 6
            2.0| 7                    1.0| 9
            3.0| 1                    2.5| 1
                                      3.0| 4
Using matlab I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck, because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesn't matter to me as I can change it later; however, I would like to have the table continue as such:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in MATLAB so that this is at least semi-automated? I posted earlier describing a similar scenario, but it turned out that the suggested approach could not do what I need. I must admit that mine is not the mind of a programmer, so I have been struggling with this problem quite a bit.
PS: Is this problem best handled in MATLAB or Python?
I don't know or use MATLAB, so my answer is purely Python based. I think Python and MATLAB should be equally well suited to reading CSV files and generating a master table.
Please consider this answer more as a pointer to how to address the problem in Python.
In python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can read natively a large set of file formats including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')
master_table = pd.concat([df, df2])
Pandas further allows to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
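As one hedged illustration tied to this question (column names assumed from the post above): an outer merge on the shared m/z column lines the files up on one axis, and fillna replaces the gaps with zeros, similar to the MATLAB outerjoin step.
import pandas as pd

df1 = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')

# outer-join on the m/z column so missing values become NaN, then fill with 0
master_table = (df1.merge(df2, on='XThompson', how='outer', suffixes=('_1', '_2'))
                   .fillna(0))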
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This step is not needed (it just recreates your tables), and I copied dataset2 to create dataset3 for illustration. You could use 'readtable' to import your data, i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge the tables using outerjoin. You could use a loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as an Excel sheet using writetable:
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');
I managed to create a solution to my problem based on learning from everyone's input and taking an online MATLAB course. I am not a natural coder, so my script is not as elegant as the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.
% Create a matrix with the input x range then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange)
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});
files = dir('*.csv');
for q = 1:length(files)
    % Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
    data{q} = csvread(files(q).name,2,1);
    data1 = data(q);
    data2 = cell2mat(data1);
    data3 = array2table(data2,'VariableNames',{'XAxis','YAxis'});
    % Join the datarange table and the intensity table to obtain an evenly spaced m/z range
    data3 = outerjoin(datarange,data3,'MergeKeys',true);
    data3.YAxis(isnan(data3.YAxis)) = 0;
    data3.XAxis = round(data3.XAxis,1);
    % Remove duplicate values
    data4 = sortrows(data3,[1 -2]);
    [~, idx] = unique(data4.XAxis);
    data4 = data4(idx,:);
    % Save the file as the same name in CSV without underscores or dashes
    filename = files(q).name;
    filename = strrep(filename,'_','');
    filename = strrep(filename,'-','');
    filename = strrep(filename,'.csv','');
    writetable(data4,filename,'FileType','text');
    clear data data1 data2 data3 data4 filename
end
clear