Turn a 2x16 dataframe into a 4x4 matrix in Python

I am trying to convert a file into an adjacency matrix. I need to do this in a way that allows files of different sizes to fill the matrix. My current test file is of size 4; what I need is a general approach that also works for larger files.
This is my test file. The values 1 through 4 in the first column indicate which matrix column each Boolean value belongs to.
1,0
1,0
1,1
1,1
2,0
2,0
2,0
2,1
3,1
3,0
3,0
3,1
4,1
4,1
4,1
4,0
I would like an end result of:
0 0 1 1
0 0 0 1
1 0 0 1
1 1 1 0
Here is the code I have that produces a dataframe similar to my input file.
# Importing needed libraries
import os.path
from math import sqrt
import numpy as np
import pandas as pd

# changing filepath to a variable name
fileName = "./testAlgorithm.csv"

# opening file, doing file check, converting
# file to dataframe
if os.path.isfile(fileName):
    with open(fileName, "r") as csvfile:
        df = pd.read_csv(csvfile, header=None)
else:
    print(f"file {fileName} does not exist")

# counts the number of lines in the data file and
# returns its square root (the matrix dimension)
def simpleCount(fileName):
    lines = 0
    with open(fileName) as f:
        for line in f:
            lines += 1
    return sqrt(lines)

# method call for line count.
lineNum = simpleCount(fileName)
print(df)
num = int(lineNum)

df = pd.DataFrame({"A":[0,0,1,1,0,0,0,1,1,0,0,1,1,1,1,0]})
df.values.reshape(4,4)
If you want to turn it back into a dataframe:
pd.DataFrame(df.values.reshape(4,4), columns=["A", "B", "C", "D"])
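To tie this back to the question's input file, a minimal sketch (assuming df is the two-column frame produced by read_csv above and num is the dimension computed from simpleCount):
# Reshape the Boolean column (column 1) of the raw two-column frame
# into a num x num adjacency matrix.
adj = pd.DataFrame(df[1].values.reshape(num, num))
print(adj)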

You can try the following (this assumes the input dataframe df has columns named c1 and c2, holding the group number and the Boolean value respectively):
dummy = pd.DataFrame(columns=['c1','c2','c3','c4'])
dummy['c1'] = np.array(df['c2'].loc[df['c1'] == 1])
dummy['c2'] = np.array(df['c2'].loc[df['c1'] == 2])
dummy['c3'] = np.array(df['c2'].loc[df['c1'] == 3])
dummy['c4'] = np.array(df['c2'].loc[df['c1'] == 4])
It will give you:
c1 c2 c3 c4
0 0 0 1 1
1 0 0 0 1
2 1 0 0 1
3 1 1 1 0
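The four assignments can also be generalized to any number of groups; a sketch of the same idea (my own illustration, again assuming columns c1 and c2, and that every group has the same number of rows, as an adjacency matrix requires):
import pandas as pd

# Build one output column per distinct group value in c1.
dummy = pd.DataFrame({f"c{k}": g["c2"].to_numpy() for k, g in df.groupby("c1")})
print(dummy)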

Related

Creating matrix of 0 and 1 from a string vector in R or python

I want to create a matrix of 0 and 1 from a vector where each string contains the two names I want to map to the matrix. For example, if I have the following vector
vector_matrix <- c("A_B", "A_C", "B_C", "B_D", "C_D")
I would like to transform it into the following matrix
A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
I am open to any suggestion, but it is better if there is some built-in function that can deal with it. I am trying to do a very similar thing, but at a scale where I will generate a matrix of 25 million cells.
I would prefer the code in R, but it doesn't matter if there is some pythonic solution :)
Edit:
So when I say "A_B", I want a "1" in row A column B. It doesn't matter if it is the contrary (column A row B).
Edit:
I would like to have a matrix where its rownames and colnames are the letters.
Create a two-column data frame d from the data, compute the factor levels, convert each column of d to a factor with those levels, and finally run table. The second line sorts each row; it isn't actually needed for the input shown, so it could be omitted, but you might need it for other data if B_A is to be regarded as A_B.
d <- read.table(text = vector_matrix, sep = "_")
d[] <- t(apply(d, 1, sort))
tab <- table( lapply(d, factor, levels = levels(factor(unlist(d)))) )
tab
giving this table:
V2
V1 A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
The result can be checked visually with a heatmap, or plotted as a graph via igraph:
heatmap(tab[nrow(tab):1, ], NA, NA, col = 2:3, symm = TRUE)
library(igraph)
g <- graph_from_adjacency_matrix(tab, mode = "undirected")
plot(g)
The following should work in Python. It splits the input data into two lists, converts the characters to indices, and sets those positions of a matrix to 1.
import numpy as np
vector_matrix = ("A_B", "A_C", "B_C", "B_D", "C_D")
# Split data in two lists
rows, cols = zip(*(s.split("_") for s in vector_matrix))
print(rows, cols)
>>> ('A', 'A', 'B', 'B', 'C') ('B', 'C', 'C', 'D', 'D')
# With inspiration from: https://stackoverflow.com/a/5706787/10603874
row_idxs = np.array([ord(char) - 65 for char in rows])
col_idxs = np.array([ord(char) - 65 for char in cols])
print(row_idxs, col_idxs)
>>> [0 0 1 1 2] [1 2 2 3 3]
n_rows = row_idxs.max() + 1
n_cols = col_idxs.max() + 1
print(n_rows, n_cols)
>>> 3 4
mat = np.zeros((n_rows, n_cols), dtype=int)
mat[row_idxs, col_idxs] = 1
print(mat)
>>>
[[0 1 1 0]
[0 0 1 1]
[0 0 0 1]]
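This yields a 3x4 matrix because D never occurs as a row label. If you want the square, letter-labelled matrix shown in the question, one possible extension (a sketch building on the row_idxs and col_idxs arrays above):
import pandas as pd

# Pad to a square matrix covering every letter seen on either side,
# then attach the letters as row and column labels.
n = max(row_idxs.max(), col_idxs.max()) + 1
labels = [chr(ord("A") + i) for i in range(n)]
mat = np.zeros((n, n), dtype=int)
mat[row_idxs, col_idxs] = 1
print(pd.DataFrame(mat, index=labels, columns=labels))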

How to create an adjacency matrix in pandas such that the labels are preserved when rows and cols are rearranged

I have never used pandas or numpy for this purpose before and am wondering what's the idiomatic way to construct labeled adjacency matrices in pandas.
My data comes in a shape similar to this. Each "uL22"-type entry is a protein, and the arrays are the neighbors of that protein. Hence (in the example below) an adjacency matrix would have 1s in the bL31 row, uL5 column, and the converse, etc.
My problem is twofold:
The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than the set contained in the nbrtree, so I'm wondering what's the best way to map my nbrtree data to that set, say a 100 by 100 matrix corresponding to the neighborhood relationships of 100 proteins.
I'm not quite sure how to "bind" the names (i.e. uL32, etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around the names move accordingly. (I'm planning to rearrange the adjacency matrix to have a block-diagonal structure.)
"nbrtree": {
"bL31": ["uL5"],
"uL5": ["bL31"],
"bL32": ["uL22"],
"uL22": ["bL32","bL17"],
...
"bL33": ["bL35"],
"bL35": ["bL33","uL15"],
"uL13": ["bL20"],
"bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40
I'm sure this is a manipulation that people perform daily; I'm just not quite familiar with how dataframes work, so I'm probably looking for something very obvious.
Thank you so much!
I don't fully understand your question, but from what I gather, try out this code.
from pprint import pprint as pp
import pandas as pd

dic = {"first": {
    "a": ["b", "d"],
    "b": ["a", "h"],
    "c": ["d"],
    "d": ["c", "g"],
    "e": ["f"],
    "f": ["e", "d"],
    "g": ["h", "a"],
    "h": ["g", "b"]
}}

col = list(dic['first'].keys())
data = pd.DataFrame(0, index=col, columns=col, dtype=int)

# For each node x, set a 1 in row x at every neighbor column listed in y
for x, y in dic['first'].items():
    data.loc[x, y] = 1
pp(data)
The output from this code is:
a b c d e f g h
a 0 1 0 1 0 0 0 0
b 1 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0 0
d 0 0 1 0 0 0 1 0
e 0 0 0 0 0 1 0 0
f 0 0 0 1 1 0 0 0
g 1 0 0 0 0 0 0 1
h 0 1 0 0 0 0 1 0
Note that this adjacency matrix is not symmetric, as I have used some random data.
To absorb your labels into the dataframe as data, change to the following:
data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = int)
data.loc['index'] = [0]+col
data.loc[:, 'column'] = ['*']+col
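It is also worth noting that because the labels live in the DataFrame's index and columns, they already move with the rows when you reorder them; a minimal sketch (using the 8x8 frame built in the first snippet and a hypothetical block-diagonal ordering):
# Reorder rows and columns symmetrically; the labels follow automatically.
order = ['a', 'b', 'g', 'h', 'c', 'd', 'e', 'f']  # hypothetical ordering
print(data.reindex(index=order, columns=order))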

Using a for loop to create a new data frame based on an old data frame from multiple conditions

I am new to Python and am trying to write code that creates a new dataframe from an old dataframe, based on conditions on the old dataframe combined with the result in the row above in the new dataframe.
Here is an example of what I am trying to do:
[image: the raw data]
I need to create a new dataframe where, if the corresponding position in the raw data is 0, the result is 0, and if it is greater than 0, the result is 1 plus the value in the row above.
I need to remove any instances where the consecutive number of intervals doesn't reach at least 3
The way I think about the code is as follows, but being new to Python I am struggling.
From Raw data to Dataframe 2:
if (1,1)=0 then (1a, 1a)= 0: # line 1
else (1a,1a)=1;
if (2,1)=0 then (2a,1a)=0; # line 2
else (2a,1a)= (1a,1a)+1 = 2;
if (3,1)=0 then (3a,1a)=0; # line 3
From Dataframe 2 to 3:
If any of the last 3 rows is greater than 3, then return that cell's value, else return 0.
I am not sure how to make any of these work, if there is an easier way to do/think about this then what I am doing please let me know. Any help is appreciated!
Based on your question, the output I was able to generate was:
Earlier, the DataFrame looked like so:
A B C
0.05 5 0 0
0.10 7 0 1
0.15 0 0 12
0.20 0 4 3
0.25 1 0 5
0.30 21 5 0
0.35 6 0 9
0.40 15 0 0
Now, the DataFrame looks like so:
A B C
0.05 0 0 0
0.10 0 0 1
0.15 0 0 2
0.20 0 0 3
0.25 1 0 4
0.30 2 0 0
0.35 3 0 0
0.40 4 0 0
The code I used for this is given below; just copy the following code into a new file, say code.py, and run it:
import re
import pandas as pd

def get_continuous_runs(ext_list, threshold):
    # Convert the column to a 0/1 string, then use a regex to find
    # runs of 1s that are at least `threshold` long.
    mylist = list(ext_list)
    for i in range(len(mylist)):
        if mylist[i] != 0:
            mylist[i] = 1
    samp = "".join(map(str, mylist))
    finder = re.finditer(r"1{%s,}" % threshold, samp)
    ranges = [x.span() for x in finder]
    return ranges

def build_column(ranges, max_len):
    # Turn each qualifying run into an increasing 1, 2, 3, ... sequence.
    answer = [0] * max_len
    for r in ranges:
        start = r[0]
        run_len = r[1] - start
        for i in range(run_len):
            answer[start + i] = i + 1
    return answer

def main(df):
    print("Earlier, the DataFrame looked like so:")
    print(df)
    ndf = df.copy()
    for col_name, col_data in df.items():
        ranges = get_continuous_runs(col_data.values, 4)
        column_len = len(col_data.values)
        new_column = build_column(ranges, column_len)
        ndf[col_name] = new_column
    print("\nNow, the DataFrame looks like so:")
    print(ndf)
    return

if __name__ == '__main__':
    raw_data = [
        (5,0,0), (7,0,1), (0,0,12), (0,4,3),
        (1,0,5), (21,5,0), (6,0,9), (15,0,0),
    ]
    df = pd.DataFrame(
        raw_data,
        columns=list("ABC"),
        index=[0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40]
    )
    main(df)
You can adjust the threshold argument passed to get_continuous_runs() inside main() (currently 4) to require a different number of consecutive intervals (i.e. more than 3).
As always, start by reading the main() function to understand how everything works. I have tried to use good variable names to aid understanding. My method might seem a little contrived because I am using regex, but I didn't want to overwhelm a beginner with a custom run-length counter.
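For reference, the run-counting step can also be done with pandas alone; a minimal sketch of that alternative on one column (my own illustration, not part of the answer above):
import pandas as pd

s = pd.Series([5, 7, 0, 0, 1, 21, 6, 15])            # column A of the raw data
nonzero = s.gt(0).astype(int)
groups = (1 - nonzero).cumsum()                      # identifies each run
runs = nonzero.groupby(groups).cumsum()              # 1, 2, 0, 0, 1, 2, 3, 4
sizes = nonzero.groupby(groups).transform("sum")     # length of each nonzero run
result = runs.where(sizes >= 4, 0)                   # 0, 0, 0, 0, 1, 2, 3, 4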

how to populate a matrix in python

I was trying to write code whose output is a matrix, but being a novice, I am not getting it right. Basically I want to generate a matrix of counts of each of A, C, G, T for each column. I was able to do it for a single column but am having a hard time doing it for the other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')

A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []

for line in fh_in:
    line = line.strip()
    if not line.startswith(">"):
        for nuc in line[0]:
            if nuc == "A":
                A_count += 1
            if nuc == "C":
                C_count += 1
            if nuc == "G":
                G_count += 1
            if nuc == "T":
                T_count += 1

result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that i want is
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.
First make a list of the rows, stripping out the lines starting with >. Then you can zip this to turn it into a list of columns. Then you can make a list of column counts of each letter.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
    print letter, [column.count(letter) for column in columns]
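(Note for Python 3: zip returns a one-shot iterator there, so you would need columns = list(zip(*rows)) to count more than one letter, and the print statement becomes the print() function.)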
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
    if not line.startswith('>'):
        for i, letter in enumerate(line.strip()):
            counts[letter][i] += 1
for letter, columns in counts.items():
    print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
    if not line.startswith('>'):
        counts.update(enumerate(line.strip()))
columns = range(max(counts.keys())[0] + 1)
for letter in 'ACGT':
    print letter, [counts[column, letter] for column in columns]
You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np

data = np.loadtxt("raw.txt", comments=">",
                  converters={0: lambda s: [x for x in s]}, dtype=str)

print (data=="A").sum(axis=0)
print (data=="T").sum(axis=0)
print (data=="C").sum(axis=0)
print (data=="G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage to this is that the numpy array you've constructed can do other things. For example, let's say I wanted to know, instead of the sum, the average number of times we find an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]
import collections

answer = []
with open('blah') as infile:
    # zip(infile, infile) pairs each ">" header line with the sequence
    # line that follows it, so this keeps only the sequence lines
    rows = [line.strip() for _, line in zip(infile, infile)]

cols = zip(*rows)
for col in cols:
    d = collections.Counter(col)
    answer.append([d[i] for i in "ATCG"])
answer = [list(i) for i in zip(*answer)]

for line in answer:
    print(' '.join([str(i) for i in line]))
Output:
5 1 0 0 5 5 0 0
1 5 0 0 0 1 1 6
0 0 1 4 2 0 6 1
1 1 6 3 0 1 0 0

How to find the average of multiple columns in a file using python

Hi, I have a file with too many columns to open in Excel. Each column has 10 rows of numerical values (0-2) plus a header row giving the column's title. I would like the output to be the name of each column and the average of its 10 rows. The file is too large to open in Excel 2000, so I have to try using Python. Any tips on an easy way to do this?
Here is a sample of the first 3 columns:
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
I want Python to output a text file like:
Trial 1 Trial 2 Trial 3
1 2 1 (whatever the averages are)
A memory-friendly solution without using any modules:
with open("filename", "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
for line in f:
# Skip empty lines
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
sums[i] += int(values[i])
numRows += 1
for index, summedRowValue in enumerate(sums):
print columns[index], 1.0 * summedRowValue / numRows
You can use Numpy:
import numpy as np
from StringIO import StringIO
s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')
data = np.loadtxt(s, skiprows=1) # skip header row
print data.mean(axis=0) # column means
# OUTPUT: array([ 0.8, 1. , 0.8])
Note that the first argument to loadtxt could be the name of your file instead of a file-like object.
You can use the builtin csv module:
import csv

csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]
print ' '.join(headers)
print ' '.join(str(x) for x in averages)
Result:
Trial1 Trial2 Trial3
1 1 1
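(Note for Python 3: use next(csvReader) instead of csvReader.next(), wrap map(int, row) in list(...) since map returns an iterator, and use the print() function.)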
Less of an answer than it is an alternative understanding of the problem:
You could think of each line as a vector. In this way, the column-by-column average is just the average of these vectors. All you need in order to do this is:
A way to read a line into a vector object,
A vector addition operation,
Scalar multiplication (or division) of vectors.
Python comes (I think) with most of this already built in, and it should lead to some easily readable code; a minimal sketch follows.
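For illustration, a small sketch of that vector view (assuming a whitespace-separated file named input.txt, as in the answers above):
with open("input.txt") as f:
    headers = f.readline().split()             # column titles
    total = None
    n = 0
    for line in f:
        if not line.strip():
            continue
        vec = [int(x) for x in line.split()]   # read a line into a vector
        # vector addition, component by component
        total = vec if total is None else [a + b for a, b in zip(total, vec)]
        n += 1

means = [t / float(n) for t in total]          # scalar division of the sum
print(' '.join(headers))
print(' '.join(str(m) for m in means))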
