I have a big file of word/tag pairs saved like this:
This/DT gene/NN called/VBN gametocide/NN
Now I want to put these pairs into a DataFrame with their counts like this:
        DT  NN  ...
This     1   0
gene     0   1
  :
I tried doing this with a dict that counts the pairs and then putting the counts into a DataFrame:
from collections import defaultdict
import pandas as pd

file = open("data.txt", "r")
train = file.read()
words = train.split()

data = defaultdict(int)
for i in words:
    data[i] += 1

matrixB = pd.DataFrame()
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count
But this takes a really long time (the file has around 300,000 of these pairs). Is there a faster way to do this?
What was wrong with the answers from your other question?
from collections import Counter
import pandas as pd

with open('data.txt') as f:
    train = f.read()

c = Counter(tuple(x.split('/')) for x in train.split())
s = pd.Series(c)
# unstack pivots the tag level into columns; absent pairs become NaN,
# so fill them with 0 and cast back to int so the counts print as integers
df = s.unstack().fillna(0).astype(int)
print(df)
yields
DT NN VBN
This 1 0 0
called 0 0 1
gametocide 0 1 0
gene 0 1 0
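The reason this is fast: the Counter's tuple keys become a two-level MultiIndex on the Series, and unstack() pivots the inner level (the tags) into columns in a single vectorized step, instead of growing the frame one .loc assignment at a time. The intermediate Series looks roughly like this:

>>> s
This        DT     1
gene        NN     1
called      VBN    1
gametocide  NN     1
dtype: int64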
I thought this question was remarkably similar... Why did you post twice?
>>> from collections import Counter
>>> import pandas as pd
>>> text = "This/DT gene/NN called/VBN gametocide/NN"
>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0).astype(int)
DT NN VBN
This 1 0 0
called 0 0 1
gametocide 0 1 0
gene 0 1 0
I have never used pandas or numpy for this purpose before, and I'm wondering what the idiomatic way is to construct labeled adjacency matrices in pandas.
My data comes in a shape similar to the snippet below. Each "uL22"-type entry is a protein, and the arrays are the neighbors of that protein. Hence, in the example below, an adjacency matrix would have 1s in the bL31 row, uL5 column, and conversely, etc.
My problem is twofold:
The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than the set contained in the nbrtree, so I'm wondering what the best way is to map my nbrtree data onto that set, say a 100 by 100 matrix corresponding to the neighborhood relationships of 100 proteins.
I'm not quite sure how to "bind" the names (i.e. uL32, etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around, the names move accordingly. (I'm planning to rearrange the adjacency matrix to have a block-diagonal structure.)
"nbrtree": {
"bL31": ["uL5"],
"uL5": ["bL31"],
"bL32": ["uL22"],
"uL22": ["bL32","bL17"],
...
"bL33": ["bL35"],
"bL35": ["bL33","uL15"],
"uL13": ["bL20"],
"bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40
I'm sure this is a manipulation that people perform daily; I'm just not quite familiar with how dataframes work, so I'm probably missing something very obvious.
Thank you so much!
I don't fully understand your question, but from what I gather, try out this code.
from pprint import pprint as pp
import pandas as pd
dic = {"first": {
"a": ["b","d"],
"b": ["a","h"],
"c": ["d"],
"d": ["c","g"],
"e": ["f"],
"f": ["e","d"],
"g": ["h","a"],
"h": ["g","b"]
}}
col = list(dic['first'].keys())
data = pd.DataFrame(0, index = col, columns = col, dtype = int)
for x, y in dic['first'].items():
    # y is a list of neighbours, so this sets row x at every column in y
    data.loc[x, y] = 1

pp(data)
The output from this code is:
a b c d e f g h
a 0 1 0 1 0 0 0 0
b 1 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0 0
d 0 0 1 0 0 0 1 0
e 0 0 0 0 0 1 0 0
f 0 0 0 1 1 0 0 0
g 1 0 0 0 0 0 0 1
h 0 1 0 0 0 0 1 0
Note that this adjacency matrix is not symmetric, as I have used some random data.
To absorb your labels into the dataframe, change it to the following:
data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = int)
data.loc['index'] = [0]+col
data.loc[:, 'column'] = ['*']+col
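To address the two points in the question more directly: reindex can expand the small matrix onto a larger, fixed protein set, and because the labels travel with the rows and columns, reordering by label is all that is needed for a block-diagonal rearrangement. A minimal sketch (the full_set and order lists here are made-up placeholders, not data from the question):

import pandas as pd

nbrtree = {
    "bL31": ["uL5"], "uL5": ["bL31"],
    "bL32": ["uL22"], "uL22": ["bL32", "bL17"],
}

# build the small adjacency matrix from the names seen in nbrtree
names = sorted(set(nbrtree) | {n for v in nbrtree.values() for n in v})
adj = pd.DataFrame(0, index=names, columns=names, dtype=int)
for prot, nbrs in nbrtree.items():
    adj.loc[prot, nbrs] = 1

# point 1: map onto the larger fixed protein set; missing pairs become 0
full_set = ["bL17", "bL31", "bL32", "uL5", "uL22", "uL30"]  # placeholder list
adj = adj.reindex(index=full_set, columns=full_set, fill_value=0)

# point 2: labels move with their rows/columns, so reordering by label is safe
order = ["uL5", "bL31", "uL22", "bL32", "bL17", "uL30"]  # e.g. block-diagonal order
adj = adj.loc[order, order]
print(adj)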
I have an input .txt file with an adjacency matrix that looks like:
A B C
A 0 55 0
B 55 0 0
C 0 0 0
How can I parse this input into a 2D array or nested dictionary?
e.g.
map['A']['B'] = 55
import StringIO

# this is just for the sake of a self-contained example;
# this would be your actual file opened with open()
inf = StringIO.StringIO(" A B C\n"
                        "A 0 55 0\n"
                        "B 55 0 0\n"
                        "C 0 0 0")

import re

map = {}
lines = inf.readlines()
headers = []
# extract each group of consecutive word characters.
# If your headers might contain dashes or other non-word characters,
# you might want ([^\s]+) instead.
for header in re.findall('(\w+)', lines[0]):
    headers.append(header)
    map[header] = {}

for line in lines[1:]:
    items = re.findall('(\w+)', line)
    rowname = items[0]
    for idx, item in enumerate(items[1:]):
        # store as int so map['A']['B'] == 55 rather than '55'
        map[headers[idx]][rowname] = int(item)

print map
from io import StringIO
import pandas as pd

d = StringIO(u" A B C\nA 0 55 0\nB 55 0 0\nC 0 0 0\n")
map = pd.read_table(d, header=0, index_col=0, delim_whitespace=True)
print(map)
A B C
A 0 55 0
B 55 0 0
C 0 0 0
print(map['A']['B'])
55
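If you specifically need the nested-dictionary form rather than a DataFrame, to_dict() will convert it; with the default orientation the outer keys are the columns and the inner keys the rows, which for a symmetric matrix matches the map['A']['B'] access from the question:

nested = map.to_dict()   # {'A': {'A': 0, 'B': 55, 'C': 0}, ...}
print(nested['A']['B'])  # 55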
My Python code:

import operator

with open('index.txt') as f:
    lines = f.read().splitlines()
print type(lines)
print len(lines)

l2 = lines[1::3]
print len(l2)
print l2[0]

list1 = [0, 2]
my_items = operator.itemgetter(*list1)
new_list = [my_items(x) for x in l2]

with open('newindex1.txt', 'w') as thefile:
    for item in l2:
        thefile.write("%s\n" % item)
A couple of lines from index.txt:
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
A couple of lines from newindex1.txt:
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file as a list, then choose every third row, and finally select the first and third columns from that list. It seems that I do not understand how operator works.
If I try Back2Basics' solution:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got
File "a12.py", line 4, in <module>
anotherarray = myarray[::3][0,2]
IndexError: too many indices
You don't need to read all the data into memory at all; you can use itertools.islice to parse the rows you want and the csv lib to read and write the data:

from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open('newindex1.txt', 'w') as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    # each islice call consumes three rows and keeps the second one,
    # matching lines[1::3]; the empty-list sentinel stops at end of file
    for row in iter(lambda: list(islice(r, 1, 3, 3)), []):
        wr.writerow(itemgetter(0, 2)(row[0]))
I'd highly suggest using numpy for this, since it is all numerical data that fits nicely into memory. The code looks like this:

import numpy as np

# np.fromfile with sep=' ' returns a flat 1-D array (which is why the
# two-axis indexing above raises IndexError), so reshape it into rows
# of three first
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)
anotherarray = myarray[::3, ::2]

and then to write the file:

anotherarray.tofile('newfile.txt', sep=" ")

The array slicing [::3, ::2] reads as "take every third row starting from row 0, and take every other column starting from column 0".
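Since the file is plain whitespace-delimited text, np.loadtxt may be a more direct fit here: it returns a 2-D array straight away, so no reshape is needed. A small sketch, using the row/column selection the question's own code aims for (rows 1, 4, 7, ... and columns 0 and 2):

import numpy as np

data = np.loadtxt('index.txt', dtype=int)  # 2-D array, one row per line
selected = data[1::3, ::2]                 # rows 1, 4, 7, ...; columns 0 and 2
np.savetxt('newindex1.txt', selected, fmt='%d')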
I think you need something like this:

lines = []
with open('index.txt', 'r') as fi:
    lines = fi.read().splitlines()

lines = [line.split() for line in lines]

with open('answer.txt', 'w') as fo:
    for row in range(len(lines)):
        # keep rows 1, 4, 7, ... to match lines[1::3]
        if row % 3 == 1:
            fo.write('%s %s\n' % (lines[row][0], lines[row][2]))
I was trying to write code that outputs a matrix, but being a novice, I am not getting it right. Basically, I want to generate a matrix of counts of each of A, C, G, and T for each column. I was able to do it for a single column, but I am having a hard time doing it for the other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')

A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []

for line in fh_in:
    line = line.strip()
    if not line.startswith(">"):
        for nuc in line[0]:
            if nuc == "A":
                A_count += 1
            if nuc == "C":
                C_count += 1
            if nuc == "G":
                G_count += 1
            if nuc == "T":
                T_count += 1

result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that I want is:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.
First make a list of the rows, stripping out the lines starting with >. Then you can zip this to turn it into a list of columns. Then you can make a list of column counts of each letter.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
    print letter, [column.count(letter) for column in columns]
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
    if not line.startswith('>'):
        for i, letter in enumerate(line.strip()):
            counts[letter][i] += 1
for letter, columns in counts.items():
    print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
    if not line.startswith('>'):
        counts.update(enumerate(line.strip()))
# max(counts)[0] is the largest column index, so add 1 to include it
columns = range(max(counts)[0] + 1)
for letter in 'ACGT':
    print letter, [counts[column, letter] for column in columns]
You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np
data = np.loadtxt("raw.txt", comments=">",
                  converters={0: lambda s: [x for x in s]}, dtype=str)
print (data=="A").sum(axis=0)
print (data=="T").sum(axis=0)
print (data=="C").sum(axis=0)
print (data=="G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage of this is that the numpy array you've constructed can do other things. For example, let's say I wanted to know, instead of the sum, the average number of times we found an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]
import collections

answer = []
with open('blah') as infile:
    # zipping the file iterator with itself pairs up (header, sequence)
    # lines; keep only the sequence from each pair
    rows = [line.strip() for _, line in zip(infile, infile)]

cols = zip(*rows)
for col in cols:
    d = collections.Counter(col)
    answer.append([d[i] for i in "ATCG"])

answer = [list(i) for i in zip(*answer)]
for line in answer:
    print(' '.join([str(i) for i in line]))
Output:
5 1 0 0 5 5 0 0
1 5 0 0 0 1 1 6
0 0 1 4 2 0 6 1
1 1 6 3 0 1 0 0
How do I count the frequency of the numbers given in a text file? The text file is as follows.
0
2
0
1
0
1
55
100
100
I want the output as follows
0 3
1 2
2 1
55 1
100 2
I tried this, without success:
def histogram(A, flAsList=False):
    """Return histogram of values in array A."""
    H = {}
    for val in A:
        H[val] = H.get(val, 0) + 1
    if flAsList:
        return H.items()
    return H
Is there a better way? Thanks in advance!
Use Counter. It's the best way for this type of problem:
from collections import Counter

with open('file.txt', 'r') as fd:
    lines = fd.read().split()

counter = Counter(lines)
# sorts items
items = sorted(counter.items(), key=lambda x: int(x[0]))
# prints desired output
for k, repetitions in items:
    print k, '\t', repetitions
The output:
0 3
1 2
2 1
55 1
100 2
Use a Counter object for this:
from collections import Counter
c = Counter(A)
Now the c variable will hold a frequency map of each of the values. For instance:
Counter(['a', 'b', 'c', 'a', 'c', 'a'])
=> Counter({'a': 3, 'c': 2, 'b': 1})
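If you also want the values ordered by how often they occur, Counter provides most_common(), which returns (value, count) pairs sorted by descending count:

Counter(['a', 'b', 'c', 'a', 'c', 'a']).most_common()
=> [('a', 3), ('c', 2), ('b', 1)]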
Please consider using update:
def histogram(A, flAsList=False):
    """Return histogram of values in array A."""
    H = {}
    for val in A:
        # H[val] = H.get(val, 0) + 1
        if H.has_key(val):
            H[val] = H[val] + 1
        else:
            H.update({val: 1})
    if flAsList:
        return H.items()
    return H
Simple approach using a dictionary:
histogram = {}

with open("file", "r") as f:
    for line in f:
        try:
            histogram[line.strip()] += 1
        except KeyError:
            histogram[line.strip()] = 1

for key in sorted(histogram.keys(), key=int):
    print key, "\t", histogram[key]
Output:
0 3
1 2
2 1
55 1
100 2
Edit:
To select a specific column you'd want to split the line using split(). For example, taking the sixth field by splitting on a single space:
try:
    histogram[line.strip().split(' ')[5]] += 1
except KeyError:
    histogram[line.strip().split(' ')[5]] = 1
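One caveat: split(' ') treats every single space as a separator, so consecutive spaces produce empty fields and can shift the index. If the columns are separated by variable whitespace, split() with no argument (which also strips the line ends, making the strip() call unnecessary) is safer:

try:
    histogram[line.split()[5]] += 1
except KeyError:
    histogram[line.split()[5]] = 1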