Considering column instead of strings - python

I am trying to apply the following code to this column:
Test
"Find the number behind the lines"
"Look at the sky"
"It is a such wonderful day today"
In the following code, docs are a list of documents; in my case they should be the rows in Test column.
D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue
        if i > j:
            D[i, j] = D[j, i]
How can I apply it to my column?
The code above assumes that docs is the list of strings/rows (each a list of words) and computes the array of pairwise distances D. The problem is applying it to a column.
The expected output (which I cannot produce with the code above, unfortunately) would be the similarity between a reference sentence and the other sentences. i and j are indices that run through each row of the Test column. The algorithm I am going to use is the mover's distance.

def f(docs):
    D = np.zeros((len(docs), len(docs)))
    for i in range(len(docs)):
        for j in range(len(docs)):
            if i == j:
                continue
            if i > j:
                D[i, j] = D[j, i]
df.Test.apply(lambda x: f(x))

From your question I understood that you want to label the rows and columns of the matrix with the strings in docs. If that's right, try this:
import numpy as np
import pandas as pd
docs=["Find the number behind the lines","Look at the sky","It is a such wonderful day today"]
D = np.zeros((len(docs), len(docs)))
df=pd.DataFrame(D,columns=docs,index=docs)
print(df)
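To actually fill D from a column, one option is to turn the column into a plain list first (with pandas that would be df["Test"].tolist()) and compute only the upper triangle, mirroring each value into the lower one. The sketch below works under that assumption. jaccard_distance is a hypothetical stand-in for the mover's distance, which would need a word-embedding model, and pairwise_distances is a name made up for this example.

```python
def jaccard_distance(a, b):
    """Placeholder distance: 1 - |A & B| / |A | B| on lowercase word sets."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def pairwise_distances(docs, dist=jaccard_distance):
    """Return a symmetric n x n list of lists with a zero diagonal."""
    n = len(docs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it.
            D[i][j] = D[j][i] = dist(docs[i], docs[j])
    return D


# With pandas this list would come from df["Test"].tolist().
docs = [
    "Find the number behind the lines",
    "Look at the sky",
    "It is a such wonderful day today",
]
D = pairwise_distances(docs)
for row in D:
    print(row)
```

Row 0 of D then holds the distances from the first (reference) sentence to all the others. Any other function taking two strings can be passed as dist.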

Related

Is there a way to use vectorization in place of this nested for loop?

I have this current (working as designed) nested for-loop that grabs some data of interest and creates a pandas dataframe from it. In an effort to cut down on runtime, I have been trying to figure out if vectorization could be a good replacement for the nested for-loop. Is there any way to vectorize these loops?
for i, row in enumerate(ldamodel[corpus]):
    row = sorted(row, key=lambda x: (x[1]), reverse=True)
    # Get the Dominant topic, % Contribution and Keywords for each document
    for j, (topic_num, prop_topic) in enumerate(row):
        if j == 0: # Pull dominant topic
            wp = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join([word for word, prop in wp])
            sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), ldamodel.get_document_topics(corpus[i]), topic_keywords]), ignore_index=True)
        else:
            break
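Since the inner loop only ever uses the first element of the sorted row, the sort plus inner loop can probably be replaced by a single max() per document. The sketch below assumes each row is a list of (topic_num, proportion) pairs, as the loop above suggests. doc_topics is a hypothetical stand-in for ldamodel[corpus], and dominant_topics is a name made up for this example.

```python
def dominant_topics(doc_topics):
    """Return the (topic_num, proportion) pair with the highest proportion per document."""
    return [max(row, key=lambda pair: pair[1]) for row in doc_topics]


# Hypothetical stand-in for ldamodel[corpus]: one list of pairs per document.
doc_topics = [
    [(0, 0.1), (1, 0.7), (2, 0.2)],
    [(0, 0.6), (2, 0.4)],
]
rows = dominant_topics(doc_topics)
print(rows)  # [(1, 0.7), (0, 0.6)]
```

Collecting the rows in a list and building the DataFrame once at the end (for example with pd.DataFrame(rows, columns=[...])) also avoids calling append inside the loop, which copies the frame on every iteration and is no longer available in pandas 2.0 and later.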

Print index of row and column in adjacency matrix

I have a function that builds an adjacency matrix. I want to make the matrix more readable for humans, so I decided to print the row index like this:
Now I want to print the column index in the same way, but I can't do it properly. The best result I get is this:
Any ideas or suggestions on how I can print the column indexes neatly?
Source code here.
def generate_adjacency_matrix(vertices):
    # Create empty Matrix
    matrix = [['.' for _ in range(len(vertices))] for _ in range(len(vertices))]
    # Fill Matrix
    for row in range(len(matrix)):
        for num in range(len(matrix)):
            if num in vertices[row]:
                matrix[row][num] = '1'
    # Print column numbers
    numbers = list(range(len(matrix)))
    for i in range(len(numbers)):
        numbers[i] = str(numbers[i])
    print(' ', numbers)
    #Print matrix and row numbers
    for i in range(len(matrix)):
        if len(str(i)) == 1:
            print(str(i) + ' ', matrix[i])
        else:
            print(i, matrix[i])
If it matters, the parameter of my function is a dictionary that looks like:
{0:[1],
1:[0,12,8],
2:[3,8,15]
....
20:[18]
}
If you know you're only going up to 20, then just pad everything to 2 characters.
For the header row:
numbers[i] = str(numbers[i]).zfill(2)
For the other rows, set the cells to ". " or "1 " or something else that looks neat.
That would seem to be the easiest way.
An alternative is to have two header rows, one above the other: the first holds the tens digit, the second the units digit. That lets you keep a width of 1 in the table as well, which you may need.
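The padding idea can be sketched as follows, building the output as strings instead of printing Python lists. This is only an illustration: format_adjacency_matrix is a made-up name, the small graph is a hypothetical input, and the code assumes the dictionary keys are exactly 0 to n-1.

```python
def format_adjacency_matrix(vertices):
    """Return the adjacency matrix as text with aligned row and column labels."""
    n = len(vertices)
    width = len(str(n - 1))  # widest label, e.g. 2 when indices go up to 20
    header = " " * width + " " + " ".join(str(c).rjust(width) for c in range(n))
    lines = [header]
    for r in range(n):
        cells = ("1" if c in vertices[r] else "." for c in range(n))
        row = " ".join(cell.rjust(width) for cell in cells)
        lines.append(str(r).rjust(width) + " " + row)
    return "\n".join(lines)


# Hypothetical small graph; keys are assumed to be 0..n-1.
graph = {0: [1], 1: [0, 2], 2: [1]}
print(format_adjacency_matrix(graph))
```

Because every label and every cell is right-justified to the same width, the columns stay aligned whether the largest index has one digit or several.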

How to handle index out of range and invalid syntax for a sample with binary values in a list of lists in Python? [closed]

The following is my sample that consists of binary values 0 and 1
sample = [['0100','0101','0101'],['011','100','001','001'],['001','001','001']]
For any number of elements in the list and any number of elements in the list of lists I need to do the following:
A. Convert it into a list of lists such that the corresponding characters of each element are strung together:
column = [['000','111','000','011'],['0100','1000','1011'],['000','000','111']]
B. Create a counter (nn) to count the length of each element and divide it by (nn-1):
nn = [[3,3,3,3],[4,4,4],[3,3,3]]
nn - 1 = [[2,2,2,2],[3,3,3],[2,2,2]]
d = nn-1
div = nn/d
C. Need to calculate a parameter for pi. Here is a link showing how this can be done for a list https://eval.in/672980.
I have tried to write the code for the same. I hit the following errors:
A. l += seq_list[j][i]
IndexError: list index out of range.
I am certain that i,j and k are all in the correct range.
B. counters = [Counter(sub_list) for sub_list in column]
^
SyntaxError: invalid syntax
Why is it invalid syntax?
Any ideas on how to correct the errors? I tried different ways to do the same, but I am unable to do so.
#Transposing. Moving along the columns only
column = []
for k in range(len(seq_list)):
    for i in range(len(seq_list[k][0])): #Length of the row
        l = ""
        for j in range(len(seq_list[k])): #Length of the column
            l += seq_list[j][i]
        column.append(l)
print "\n Making the columns as a list: " + str(column)
#Creating a separate list where -/? will not be part of the sequence
tt = ["".join(y for y in x if y in {'0','1'}) for x in column]
#Creating a counter that stores n/n-1 values
counters = [Counter(sub_list) for sub_list in tt]
nn =[]
d = []
for counter in counters:
    binary_count = sum((val for key, val in counter.items() if key in "01"))
    nn.append(binary_count)
d = [i - 1 for i in nn]
div = [int(b) / int(m) for b,m in zip(nn, d)]
I think this should work for you:
seq_list = [['0100','0101','0101'],['011','100','001','001'],['001','001','001']]
results = []
for k in range(len(seq_list)):
    column_list = [[] for i in range(len(seq_list[k][0]))]
    for seq in seq_list[k]:
        for i, nuc in enumerate(seq):
            column_list[i].append(nuc)
    tt = ["".join(y for y in x if y in {'0','1'}) for x in column_list]
    results.append(tt)
### Creating a counter that stores n/n-1 values
BINARY = {'0','1'}
counts = [[sum(c in BINARY for c in s) for s in pair] for pair in results]
countsminusone1 = [[(sum(c in BINARY for c in s)-1) for s in pair] for pair in results]
countsminusone = [[1 if x <= 0 else x for x in pair] for pair in countsminusone1]
bananasplit = [[n/d for n, d in zip(subq, subr)] for subq, subr in zip(counts, countsminusone)]
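A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: zip(*group) transposes each inner list directly. It assumes Python 3 division and that all strings inside one inner list have the same length (zip silently truncates to the shortest otherwise). transpose_group is a name made up for this example.

```python
seq_list = [['0100', '0101', '0101'], ['011', '100', '001', '001'], ['001', '001', '001']]


def transpose_group(group):
    """Join the i-th characters of every string in the group into one string."""
    return ["".join(chars) for chars in zip(*group)]


columns = [transpose_group(group) for group in seq_list]

# Count only '0'/'1' characters, then compute n / (n - 1).
# max(n - 1, 1) guards against dividing by zero, like the "1 if x <= 0" step above.
counts = [[sum(ch in "01" for ch in col) for col in group] for group in columns]
ratios = [[n / max(n - 1, 1) for n in group] for group in counts]

print(columns)
print(counts)
print(ratios)
```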

Checking the same elements in a list : python

Hey there, I'm new to coding and I want to write a program that compares the elements of two lists and returns the elements they have in common.
So far I have written this code, but I'm having a problem with the algorithm: intersection is a set operation, so I can't find the actual common elements with it.
In my code I want to look at each string and find how similar they are.
What I've tried to do is:
input="AGA"
input1="ACA"
input=input_a
if len(input1) == len(input):
    i = 0
    while i < len(input1):
        j = 0
        while j < len(input_a):
            input_length = list(input_a)
            if input1[i] != input_a[j]:
                if input1[i] in input_a:
                    print "1st %s" % input_length
                    print "2nd %s" % set(input1)
                    intersection = set(DNA_input_length).intersection(set(input1))
                    print intersection
                    total = len(intersection)
                    print (float(total) / float(
                        len(input1))) * 100, "is the similarity percentage"
                    break
                DNA_input_length.remove(input_a[i])
            j = j + 1
        break
I guess what is wrong with my code is the intersection part.
For input and input1 I want to see the common elements A, A (both strings contain two A's); however, I get just one A.
How can I improve this code so that it finds the common elements as two A's, not one? I really need your help.
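I would define similarity via the Hamming distance between the words (which I think is what you want): count the mismatched positions and add the difference in length, so a lower score means more similar.
word1 = "AGA"
word2 = "ACAT"
score = sum(a != b for a, b in zip(word1, word2)) + abs(len(word1) - len(word2))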
If you just need to find the intersecting elements of 2 flat lists, do:
a = "AGA"
b = "ACA"
c = set(a) & set(b)
print(c)
> {'A'}

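A set keeps each element only once, which is why the intersection above yields a single 'A'. If repeated elements should be counted, a multiset intersection with collections.Counter is one option. The sketch below is a minimal illustration. common_elements is a made-up name, and dividing by the length of the second string mirrors the percentage computed in the question.

```python
from collections import Counter


def common_elements(a, b):
    """Multiset intersection: each element as often as it occurs in both inputs."""
    return list((Counter(a) & Counter(b)).elements())


common = common_elements("AGA", "ACA")
print(common)  # ['A', 'A']

similarity = 100.0 * len(common) / len("ACA")
print(similarity, "is the similarity percentage")
```
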
Create multiple dictionaries from a single iterator in nested for loops

I have a nested list comprehension which has created a list of six lists of ~29,000 items. I'm trying to parse this list of final data and create six separate dictionaries from it. Right now the code is very unpythonic; I need the right statement to properly accomplish the following:
1.) Create six dictionaries from a single statement.
2.) Scale to any length list, i.e., not hardcoding a counter shown as is.
I've run into multiple issues, and have tried the following:
1.) Using while loops
2.) Using break statements, will break out of the inner most loop, but then does not properly create other dictionaries. Also break statements set by a binary switch.
3.) if, else conditions for n number of indices, indices iterate from 1-29,000, then repeat.
Note the ellipses designate code omitted for brevity.
# Parse csv files for samples, creating a dictionary of key, value pairs and multiple lists.
with open('genes_1') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))
sample_1_values = [j for i, j in (sorted([x for x in {i: float(j)
    for i, j in cread_1}.items()], key = lambda v: v[1]))]
sample_1_genes = [i for i, j in (sorted([x for x in {i: float(j)
    for i, j in cread_1}.items()], key = lambda v: v[1]))]
...
# Compute row means.
mean_values = []
for i, (a, b, c, d, e, f) in enumerate(zip(sample_1_values, sample_2_values, sample_3_values, sample_4_values, sample_5_values, sample_6_values)):
    mean_values.append((a + b + c + d + e + f)/6)
# Provide proper gene names for mean values and replace original data values by corresponding means.
sample_genes_list = [i for i in sample_1_genes, sample_2_genes, sample_3_genes, sample_4_genes, sample_5_genes, sample_6_genes]
sample_final_list = [sorted(zip(sg, mean_values)) for sg in sample_genes_list]
# Create multiple dictionaries from normalized values for each dataset.
class BreakIt(Exception): pass
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
...
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_6_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            if count > 147975:
                sample_6_dict_normalized[genes] = values
            count = count + 1
            if count == 177570:
                raise BreakIt
except BreakIt:
    pass
# Pull expression values to qualify overexpressed proteins.
print 'ERG values:'
print 'Sample 1:', round(sample_1_dict_normalized.get('ERG'), 3)
print 'Sample 6:', round(sample_6_dict_normalized.get('ERG'), 3)
Your code is too long for me to give an exact answer, so I will answer very generally.
First, you are using enumerate for no reason. If you don't need both the index and the value, you probably don't need enumerate.
This part:
with open('genes.csv') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))
sample_1_dict = {i: float(j) for i, j in cread_1}
sample_1_list = [x for x in sample_1_dict.items()]
sample_1_values_sorted = sorted(sample_1_list, key=lambda expvalues: expvalues[1])
sample_1_genes = [i for i, j in sample_1_values_sorted]
sample_1_values = [j for i, j in sample_1_values_sorted]
sample_1_graph_raw = [float(j) for i, j in cread_1]
should be (a) using a list named samples and (b) much shorter, since you don't really need to extract all this information from sample_1_dict and move it around right now. It can be something like:
samples = [None] * 6
for k in range(6):
    with open('genes.csv') as f: #but something specific to k
        cread = list(csv.reader(f, delimiter = '\t'))
    samples[k] = {i: float(j) for i, j in cread}
After that, calculating the sum and mean will be way more natural.
In this part:
class BreakIt(Exception): pass
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
you should be (a) iterating over the samples list mentioned earlier, and (b) not using count at all, since you can iterate naturally over samples or samples[i] or something like that.
Your code has several problems. You should put your code in functions that preferably do one thing each. Then you can call a function for each sample without repeating the same code six times (I assume that is what the ellipsis is hiding). Give each function a self-describing name and a docstring that explains what it does. There is quite a bit of unnecessary code. Some of this might become obvious once you have it in functions. Since functions take arguments, you can pass in your 29595, for example.
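Putting both suggestions together, the six dictionaries can come out of one list comprehension with no counters or break exceptions. The sketch below is one reading of what the question's code is trying to do (rank each sample's genes by value, average the values at each rank across samples, and give each gene the mean for its rank), so treat it as an assumption. normalize_by_rank_means is a made-up name, and the two tiny samples with the gene TP53 are hypothetical data.

```python
def normalize_by_rank_means(samples):
    """Replace each value by the mean, across samples, of the values at its rank.

    samples: list of {gene: value} dicts that all hold the same number of genes.
    Returns one {gene: mean_value} dict per input sample, for any number of samples.
    """
    # Genes of each sample ordered by their value (ascending).
    ranked_genes = [sorted(sample, key=sample.get) for sample in samples]
    # Values of each sample in ascending order.
    sorted_values = [sorted(sample.values()) for sample in samples]
    # Mean across samples at each rank position.
    rank_means = [sum(vals) / len(vals) for vals in zip(*sorted_values)]
    return [dict(zip(genes, rank_means)) for genes in ranked_genes]


# Hypothetical data: two samples instead of six, two genes instead of ~29,000.
samples = [
    {"ERG": 1.0, "TP53": 3.0},
    {"ERG": 4.0, "TP53": 2.0},
]
normalized = normalize_by_rank_means(samples)
print("Sample 1:", round(normalized[0]["ERG"], 3))
print("Sample 2:", round(normalized[1]["ERG"], 3))
```

The result is a list indexed by sample, so normalized[5] would take the place of sample_6_dict_normalized, and nothing depends on the hardcoded row count.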
