Reading columns of data into arrays in Python

I am new to Python. I use Fortran to generate the data file I wish to read, but for many reasons I would like to use Python rather than Fortran to calculate averages and statistics on the data.
I need to read the entries in the first three rows as strings, and then the data that begins in the fourth row onwards as numbers. I don't need the first column, but I do need each of the remaining columns as its own array.
# Instantaneous properties
# MC_STEP Density Pressure Energy_Total
# (molec/A^3) (bar) (kJ/mol)-Ext
0 0.34130959E-01 0.52255964E+05 0.26562549E+04
10 0.34130959E-01 0.52174646E+05 0.25835710E+04
20 0.34130959E-01 0.52050492E+05 0.25278775E+04
The data goes on for thousands, and sometimes millions, of lines.
I have tried the following, but I run into problems because I can't analyze the lists I have made, and I can't seem to convert them to arrays. I would prefer to just create arrays to begin with, but converting my lists to arrays would work too. With my current approach I get an error when I try to use an element of one of the lists, e.g. Energy(i):
with open('nvt_test_1.out.box1.prp1') as f:
    Title = f.readline()
    Properties = f.readline()
    Units = f.readline()
    Density = []
    Pressure = []
    Energy = []
    for line in f:
        row = line.split()
        Density.append(row[1])
        Pressure.append(row[2])
        Energy.append(row[3])
I appreciate any help!

I would use the pandas module for this task:
import pandas as pd

In [9]: df = pd.read_csv('a.csv', delim_whitespace=True, comment='#',
   ...:                  skiprows=3, header=None,
   ...:                  names=['MC_STEP', 'Density', 'Pressure', 'Energy_Total'])
Data Frame:
In [10]: df
Out[10]:
   MC_STEP   Density   Pressure  Energy_Total
0        0  0.034131  52255.964     2656.2549
1       10  0.034131  52174.646     2583.5710
2       20  0.034131  52050.492     2527.8775
Average values for all columns:
In [11]: df.mean()
Out[11]:
MC_STEP 10.000000
Density 0.034131
Pressure 52160.367333
Energy_Total 2589.234467
dtype: float64
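If you specifically want each column as its own array afterwards, you can pull it out of the DataFrame. A small follow-up sketch (.to_numpy() assumes pandas 0.24 or newer; on older versions use .values instead):
density = df['Density'].to_numpy()
pressure = df['Pressure'].to_numpy()
energy = df['Energy_Total'].to_numpy()
print(energy.mean(), energy.std())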

You can think of a Python list like an array in other languages, and it is well optimised. If you have special needs there is an array type available, but it is rarely used; alternatively there is numpy.array, which is designed for scientific computation (you have to install the NumPy package for that).
Before performing calculations, cast each string to a float, as in energy.append(float(row[3])).
You can also do it all at once using the map function:
row = list(map(float, line.split()))
(In Python 3, map returns an iterator, so wrap it in list() if you want to index the result.)
Last, as @Hamms said, access list elements with square brackets: e = energy[i]
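Putting those suggestions together, a minimal sketch of the question's loop with the float conversion and a NumPy conversion at the end (assuming the same file name and column layout as in the question):
import numpy as np

density, pressure, energy = [], [], []

with open('nvt_test_1.out.box1.prp1') as f:
    for _ in range(3):
        next(f)                  # skip the three '#' header lines
    for line in f:
        row = [float(x) for x in line.split()]
        density.append(row[1])
        pressure.append(row[2])
        energy.append(row[3])

energy = np.array(energy)        # convert to a NumPy array for statistics
print(energy.mean(), energy.std())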

You can also use the csv module's DictReader to read each row into a dictionary. Note that csv only accepts a single-character delimiter, so for whitespace-separated data combine delimiter=' ' with skipinitialspace=True:
import csv

with open('filename', 'r') as f:
    reader = csv.DictReader(f, delimiter=' ', skipinitialspace=True,
                            fieldnames=('MC_STEP', 'DENSITY', 'PRESSURE', 'ENERGY_TOTAL'))
    for row in reader:
        Density.append(float(row['DENSITY']))
        Pressure.append(float(row['PRESSURE']))
        Energy.append(float(row['ENERGY_TOTAL']))
Of course this assumes that the file is formatted more like a CSV (that is, no comments). If the file does have comments at the top, you can skip each of them before initializing the DictReader with:
next(f)
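For the three comment lines in the question's file, that skipping step would look something like this (a sketch continuing the snippet above; the column names are the same assumed fieldnames):
import csv

with open('nvt_test_1.out.box1.prp1') as f:
    for _ in range(3):
        next(f)          # consume the three '#' comment lines
    reader = csv.DictReader(f, delimiter=' ', skipinitialspace=True,
                            fieldnames=('MC_STEP', 'DENSITY', 'PRESSURE', 'ENERGY_TOTAL'))
    energies = [float(row['ENERGY_TOTAL']) for row in reader]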

Related

Trying to split csv column data into lists after reading in using pandas library

I have a csv file containing 3 columns of data: column 1 is a time vector, column 2 is the untuned circuit response, and column 3 is the tuned circuit response.
I am reading this csv data into Python using pandas:
df = pd.read_csv(filename, delimiter=",")
I am now trying to create 3 lists, one for each column of data. I have tried the following, but it is not working: the lists end up empty.
for col in df:
    time.append(col[0])
    untuned.append(col[1])
    tuned.append(col[2])
Can anyone give me some help with this? Thanks.
You can use the pandas Series tolist() method:
time = df['time vector'].tolist()
untuned = df['untuned circuit'].tolist()
tuned = df['tuned circuit'].tolist()
To be honest, if your use case is just to get the data into lists, use csv.reader instead; it avoids a lot of overhead.
import csv

time = list()
untuned = list()
tuned = list()

with open("filename.csv") as csv_data_file:
    csv_reader = csv.reader(csv_data_file, delimiter=",")
    for each_row in csv_reader:
        time.append(each_row[0])
        untuned.append(each_row[1])
        tuned.append(each_row[2])
If you have other use cases that require pandas, or your file is large and you need the power of pandas, use .tolist() as suggested by @Bruno Mello.
You can also use an iterator:
for index, row in df.iterrows():
    time.append(row[0])
    untuned.append(row[1])
    tuned.append(row[2])

Problem either with number of characters exceeding cell limit, or storing lists of variable length

The problem:
I have lists of genes expressed in 53 different tissues. Originally this data was stored in a maximal array of the genes, with 'NaN' wherever there was no expression. I am trying to create new lists for each tissue that contain just the genes expressed, as it was very inefficient to search through this array every time I ran my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas DataFrame and then converting it to csv. But this does not accept lists of varying length, unless I put each list in as a single item. However, when I then save the DataFrame to a csv, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell, and I get an error about the string length exceeding Excel's character-per-cell limit.
Therefore I need a way of either dealing with this limit or storing my lists in a different way. I would rather have just one file for all the lists.
My code:
import csv
import pandas as pd
import math
import numpy as np

# Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0,1,2,3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]

data = []
for tissue in tissuelist:
    # Create array to keep track of the protein mRNAs in tissue that are not present in the network
    # initiate with first tissue, protein
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)
print(data)

df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd

df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows=[0,1,2,3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')

res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import that many columns.)
Don't open the csv in Excel directly; use a blank worksheet and import the csv (Data - External Data, From Text), otherwise you can't separate the values into Excel columns in one run (at least in Excel 2010).
Alternatively, create your data variable as a dictionary and save the dictionary to a JSON file using json.dump:
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)
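To read the lists back later, json.load restores the dictionary. A small usage sketch, assuming the filename.json written above ('lung' is just an illustrative tissue key, not one from the original data):
import json

with open('filename.json') as fp:
    data = json.load(fp)

# data maps each tissue name to its list of expressed genes
print(data['lung'][:10])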

Column statistics from given input file?

I am given a .txt file of data:
1,2,3,0,0
1,0,4,5,0
1,1,1,1,1
3,4,5,6,0
1,0,1,0,3
3,3,4,0,0
My objective is to calculate the min,max,avg,range,median of the columns of given data and write it to an output .txt file.
My logic in approaching this question is as follows
Step 1) Read the data
infile = open("Data.txt", "r")
tempLine = infile.readline()
while tempLine:
print(tempLine.split(','))
tempLine = infile.readline()
Obviously it's not perfect, but the idea is that the data can be read this way...
Step 2) Store the data into corresponding list variables row1, row2, ..., row6
Step 3) Combine the above lists all into one, giving a final list like this...
flist = [[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]]
Step 4) Using a nested for loop, access the elements individually and store them into list variables col1, col2, col3, ..., col5
Step 5) Calculate min, max, etc. and write to the output file
My question is: with my rather beginner knowledge of computer science and Python, is this logic inefficient, and could there be an easier, better approach to solving this problem?
My main problem is probably steps 2 through 5. The rest I know how to do for sure.
Any advice would be helpful!
Try numpy. The numpy library provides fast options for dealing with nested lists in a list, or simply, matrices.
To use numpy, you must import numpy at the beginning of your code.
numpy.matrix('1,2,3,0,0;1,0,4,5,0;....;3,3,4,0,0')
will give you the equivalent of
flist = [[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]]
straight off the bat.
Also, you can work along an axis (axis=0 reduces over the rows, giving one value per column) and get the mean, min, and max easily using:
max([axis, out])          Return the maximum value along an axis.
mean([axis, dtype, out])  Returns the average of the matrix elements along the given axis.
min([axis, out])          Return the minimum value along an axis.
These are from https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html, the numpy documentation, so for more information please read the numpy docs.
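A small sketch of that approach, using numpy.array rather than numpy.matrix and assuming the six rows from the question:
import numpy as np

data = np.array([[1, 2, 3, 0, 0],
                 [1, 0, 4, 5, 0],
                 [1, 1, 1, 1, 1],
                 [3, 4, 5, 6, 0],
                 [1, 0, 1, 0, 3],
                 [3, 3, 4, 0, 0]])

# axis=0 reduces over rows, giving one value per column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
col_avg = data.mean(axis=0)
col_range = col_max - col_min
col_median = np.median(data, axis=0)
print(col_min, col_max, col_avg, col_range, col_median)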
To get the data I would do something like this:
from statistics import median

with open("Data.txt", "r") as infile:
    # convert each value to a number, otherwise min/max compare strings and sum fails
    rows = [[float(x) for x in line.split(',')] for line in infile]

# note: this loops over rows; use zip(*rows) instead if you want per-column statistics
for row in rows:
    minRow = min(row)
    maxRow = max(row)
    avgRow = sum(row) / len(row)
    rangeRow = maxRow - minRow
    medianRow = median(row)
    # then write the data to the output file
You can use the pandas library for this (http://pandas.pydata.org/)
The code below worked for me:
import pandas as pd
df = pd.read_csv('data.txt',header=None)
somestats = df.describe()
somestats.to_csv('dataOut.txt')
This is how I ended up doing it, in case anyone is curious:
import numpy

infile = open("Data1.txt", "r")
outfile = open("ColStats.txt", "w")

oMat = numpy.loadtxt(infile)
tMat = numpy.transpose(oMat)  # Create new matrix where columns of oMat become rows and rows become columns
#print(tMat)

for x in range(5):
    tempM = tMat[x]
    mn = min(tempM)
    mx = max(tempM)
    avg = sum(tempM) / 6.0
    rng = mx - mn
    median = numpy.median(tempM)
    out = "[{} {} {} {} {}]".format(mn, mx, avg, rng, median)
    outfile.write(out + '\n')

infile.close()
outfile.close()
#print(tMat)

Read CSV file that needs data sanitization prior to loading into dataframe

I'm reading a CSV file into pandas. The issue is that the file needs some rows removed and values calculated from the other rows. My current idea starts like this:
with open(down_path.name) as csv_file:
    rdr = csv.DictReader(csv_file)
    for row in rdr:
        type = row['']
        if type == 'Summary':
            current_ward = row['Name']
        else:
            name = row['Name']
            count1 = row['Count1']
            count2 = row['Count2']
            count3 = row['Count3']
            index_count += 1
            # write to someplace
,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0
The end result needs to end up in a dataframe that I can concatenate to an existing dataframe.
The braindead way to do this is simply to do my conversions, create a new CSV file, then read that in. That seems like a non-pythonic way to go.
I need to take out the summary lines, combine those with similar names (Aloha 1 and Aloha I), remove the individual stat info, and put the Aloha 1 label on each of the individuals. Plus I need to add which month this data is from. As you can see, the data needs some work :)
The desired output would be:
Jan-16, Aloha 1, John, 1, 2, 3
where the Aloha 1 comes from the summary line above it.
My personal preference would be to do everything in Pandas.
Perhaps something like this...
# imports
import numpy as np
import pandas as pd
from io import StringIO  # on Python 2, use: from StringIO import StringIO

# read in your data
data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""
df = pd.read_csv(StringIO(data))

# give the first column a better name for convenience
df.rename(columns={'Unnamed: 0': 'Desc'}, inplace=True)

# create a mask for the Ward Summary lines
ws_mask = df.Desc == 'Ward Summary'

# create a ward_name column that has names only for Ward Summary lines
df['ward_name'] = np.where(ws_mask, df.Name, np.nan)

# forward fill the missing ward names from the previous summary line
df.ward_name.fillna(method='ffill', inplace=True)

# get rid of the ward summary lines
df = df[~ws_mask]

# get rid of the Desc column
df = df.drop('Desc', axis=1)
Yes; you pass over the data more than once, so you could potentially do better with a smarter single pass algorithm. But, if performance isn't your main concern, I think this has benefits in terms of conciseness and readability.
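For reference, a rough single-pass sketch of the same idea using csv directly (not the answer's method; the 'Jan-16' month label is hardcoded here purely as an assumption taken from the desired output, and the Aloha 1 / Aloha I name merging is not handled):
import csv
from io import StringIO

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""

rows = []
current_ward = None
for row in csv.DictReader(StringIO(data)):
    if row[''] == 'Ward Summary':
        # remember the ward name and carry it onto the individuals below it
        current_ward = row['Name']
    else:
        rows.append(['Jan-16', current_ward, row['Name'],
                     row['count1'], row['count2'], row['count3']])

print(rows)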

Converting values of named tuples from strings to integers

I'm creating a script to read a csv file into a set of named tuples from their column headers. I will then use these namedtuples to pull out rows of data which meet certain criteria.
I've worked out the input (shown below), but am having issues with filtering the data before outputting it to another file.
import csv
from collections import namedtuple

with open('test_data.csv') as f:
    f_csv = csv.reader(f)                   # read using csv.reader()
    Base = namedtuple('Base', next(f_csv))  # create namedtuple keys from header row
    for r in f_csv:                         # for each row in the file
        row = Base(*r)
        # Process row
        print(row)                          # print data
The contents of my input file are as follows:
Locus Total_Depth Average_Depth_sample Depth_for_17
chr1:6484996 1030 1030 1030
chr1:6484997 14 14 14
chr1:6484998 0 0 0
And they are printed by my code as follows:
Base(Locus='chr1:6484996', Total_Depth='1030',
Average_Depth_sample='1030', Depth_for_17='1030')
Base(Locus='chr1:6484997', Total_Depth='14',
Average_Depth_sample='14', Depth_for_17='14')
Base(Locus='chr1:6484998', Total_Depth='0', Average_Depth_sample='0',
Depth_for_17='0')
I want to be able to pull out only the records with a Total_Depth greater than 15.
Intuitively I tried the following:
if Base.Total_Depth >= 15:
    print row
However this only prints the final row of data (from the above output table). I think the problem is twofold: as far as I can tell I'm not storing my namedtuples anywhere for them to be referenced later, and the numbers are being read in as strings rather than as integers.
Firstly, can someone correct me if I need to store my namedtuples somewhere?
And secondly, how do I convert the string values to integers? Or is this not possible because namedtuples are immutable?
Thanks!
I previously asked a similar question with respect to dictionaries, but now would like to use namedtuples instead. :)
Map your values to int when creating the named tuple instances:
row = Base(r[0], *map(int, r[1:]))
This keeps the r[0] value as a string, and maps the remaining values to int().
This does require knowledge of the CSV columns as which ones can be converted to integer is hardcoded here.
Demo:
>>> from collections import namedtuple
>>> Base = namedtuple('Base', ['Locus', 'Total_Depth', 'Average_Depth_sample', 'Depth_for_17'])
>>> r = ['chr1:6484996', '1030', '1030', '1030']
>>> Base(r[0], *map(int, r[1:]))
Base(Locus='chr1:6484996', Total_Depth=1030, Average_Depth_sample=1030, Depth_for_17=1030)
Note that you should test against the row instances, not the Base class:
if row.Total_Depth >= 15:
within the loop, or in a new loop over the collected rows.
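A minimal sketch of that second option, collecting the converted rows and then filtering (assuming the same test_data.csv and columns as in the question):
import csv
from collections import namedtuple

with open('test_data.csv') as f:
    f_csv = csv.reader(f)
    Base = namedtuple('Base', next(f_csv))
    # keep the Locus string, convert the numeric columns to int
    rows = [Base(r[0], *map(int, r[1:])) for r in f_csv]

# filter on the converted integer field
deep = [row for row in rows if row.Total_Depth >= 15]
for row in deep:
    print(row)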
