outputting dictionary values to text file: uncoupling lists and strings

I have a dictionary called shared_double_lists which is made of 6 keys called [(0, 1), (1, 2), (1, 3), (2, 3), (0, 3), (0, 2)]. The values for all the keys are lists.
I am trying to output the values for key (0, 1) to a file. Here is my code:
output = open('test_output.txt', 'w')
counter = 0
for locus in shared_double_lists[(0, 1)]:
    for value in locus:
        output.write(str(shared_double_lists[(0, 1)][counter]))
        output.write("\t")
    output.write("\n")
    counter += 1
output.close()
This almost works, the output looks like this:
['ACmerged_contig_10464', '1259', '.', 'G', 'C', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
['ACmerged_contig_10464', '1260', '.', 'A', 'T', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
Whereas I want it to look like this:
ACmerged_contig_10464 1259 . G C 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
ACmerged_contig_10464 1260 . A T 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
i.e. the file should not contain the lines in list format; instead, the items of each list should be separated by tabs.

You can simply join a list of strings into a single string with str.join:
my_string = '\t'.join(my_list)
Here '\t' joins the items with a tab, but you can use any separator you want.
In this example, the following line replaces the str() call on the whole list:
output.write('\t'.join(shared_double_lists[(0, 1)][counter]))
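For completeness, here is a minimal sketch of the whole write loop built around join. The dictionary contents are shortened stand-ins for the asker's data, and write_rows and out_path are hypothetical names introduced only for this illustration.

```python
import os
import tempfile

# Shortened, hypothetical stand-in for the asker's dictionary:
# each value is a list of rows, and each row is a list of strings.
shared_double_lists = {
    (0, 1): [
        ['ACmerged_contig_10464', '1259', '.', 'G', 'C'],
        ['ACmerged_contig_10464', '1260', '.', 'A', 'T'],
    ],
}


def write_rows(rows, path):
    """Write each row as one tab-separated line."""
    # The with-statement closes the file even if an error occurs.
    with open(path, 'w') as output:
        for row in rows:
            # map(str, ...) guards against non-string items such as ints.
            output.write('\t'.join(map(str, row)) + '\n')


out_path = os.path.join(tempfile.gettempdir(), 'test_output.txt')
write_rows(shared_double_lists[(0, 1)], out_path)
```

Iterating over the rows directly also removes the need for the separate counter variable.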

Related

Where did 0 index go in return statements of tuples in this sorting algorithm?

I'm looking at a sorting algorithm to put lowercase letters in front, then uppercase, then odd, and lastly even. For example String1234 becomes ginrtS1324
Code
def getKey(x):
    if x.islower():
        return(1,x)
    elif x.isupper():
        return(2,x)
    elif x.isdigit():
        if int(x)%2 == 1:
            return(3,x)
        else:
            return(4,x)
print(*sorted('String1234',key=getKey),sep='')
I understand that tuples are returned as (1, g), (1,i)... (2, S), (3, 1), (3, 3) (4, 2), (4,4). What I don't understand is why a list is created: ['g', 'i', 'n', 'r', 't', '1', '3', '2', '4'] and what happened to the 0 indexes of the tuples?

sorted returns a sorted list with the elements of whatever iterable you pass into it:
>>> sorted('String1234')
['1', '2', '3', '4', 'S', 'g', 'i', 'n', 'r', 't']
If you want to turn this back into a string, an easy way is join():
>>> ''.join(sorted('String1234'))
'1234Sginrt'
If you pass a key parameter, the resulting keys (obtained by calling the key function on each element to be sorted) are used for the comparison within the sort, but the output is still built out of the original elements, not the keys!
>>> ''.join(sorted('String1234', key=getKey))
'ginrtS1324'
If you wanted to get the list of tuples rather than a list of letters, you'd do that by mapping your key function over the list itself before sorting it (and do that instead of passing it as a separate parameter to sorted):
>>> sorted(map(getKey, 'String1234'))
[(1, 'g'), (1, 'i'), (1, 'n'), (1, 'r'), (1, 't'), (2, 'S'), (3, '1'), (3, '3'), (4, '2'), (4, '4')]
>>> ''.join(map(lambda x: ''.join(map(str, x)), sorted(map(getKey, 'String1234'))))
'1g1i1n1r1t2S31334244'
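To make the role of the key visible, the sketch below spells out roughly what sorted does with a key function: pair each element with its key, order by the key, then discard the key. This is an illustration of the idea (often called decorate-sort-undecorate), not a description of how CPython implements it; get_key is the question's function, renamed and reformatted.

```python
def get_key(x):
    # Same ranking as the question's getKey: lowercase, uppercase, odd, even.
    if x.islower():
        return (1, x)
    elif x.isupper():
        return (2, x)
    elif x.isdigit():
        return (3, x) if int(x) % 2 == 1 else (4, x)


chars = 'String1234'

# Decorate: pair every original element with its key.
decorated = [(get_key(c), c) for c in chars]
# Sort: compare the keys only.
decorated.sort(key=lambda pair: pair[0])
# Undecorate: keep the original elements and throw the keys away.
result = [c for _, c in decorated]

print(''.join(result))  # ginrtS1324
```

The keys exist only during the comparison step, which is why the tuples (and their 0 index) never show up in the output.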

Python - Finding most occurring words in a CSV row

I want to find the most occurring substring in a CSV row either by itself, or by using a list of keywords for lookup.
I've found a way to get the top 5 most frequent words in each row of a CSV file with Python, but that doesn't serve my purpose. It gives me results like this:
[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]
Whereas, I would want something like 'Trojan', 'Downloader', 'Powershell' ... as the top results.
The matching words can be a substring of a value (cell) in the CSV or can be a combination of two or more words. Can someone help fix this, either with a keywords list or without one?
Thanks!

Let my_values = ['A', 'B', 'C', 'A', 'Z', 'Z' ,'X' , 'A' ,'X','H','D' ,'A','S', 'A', 'Z'] be the list of words you want to count.
Now create a dictionary that will store the number of occurrences of every word.
count_dict={}
Populate the dictionary with the appropriate values:
for i in my_values:
    if count_dict.get(i) is None:  # Not in the dictionary yet, so this is the first occurrence
        count_dict[i] = 1
    else:
        count_dict[i] = count_dict[i] + 1  # Seen before, so increment its count
Now sort the items of the dictionary by their number of occurrences:
import operator
sorted_items = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)
Now you have your expected results!
The 3 most frequent values are:
print(sorted_items[:3])
output:
[('A', 5), ('Z', 3), ('X', 2)]
The 2 most frequent values are:
print(sorted_items[:2])
output:
[('A', 5), ('Z', 3)]
and so on.
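The counting above works on whole cell values, whereas the question asks for words inside the cells. One possible approach, sketched here under the assumption that a "word" is any run of letters and that case does not matter, is to tokenise each cell first and then count with collections.Counter. The cells list is a small sample copied from the question, and token_counts and min_length are hypothetical names.

```python
import re
from collections import Counter

# A few cell values copied from the question's output (leading spaces included).
cells = [
    ' Trojan.PowerShell.LNK.Gen.2',
    ' Suspicious ZIP!lnk',
    ' HEUR:Trojan-Downloader.WinLNK.Powedon.a',
    ' Win32.Trojan-downloader.Powedon.Lrsa',
    ' PowerShell.DownLoader.466',
]


def token_counts(values, min_length=3):
    """Count case-insensitive alphabetic tokens across all values."""
    counts = Counter()
    for value in values:
        # Assumption: tokens are runs of letters; digits and punctuation split them.
        tokens = re.findall(r'[a-z]+', value.lower())
        # Skip very short tokens such as the trailing 'a'.
        counts.update(t for t in tokens if len(t) >= min_length)
    return counts


counts = token_counts(cells)
print(counts.most_common(3))
```

This does not find words embedded in longer tokens (for example 'lnk' inside 'winlnk'); for that, a keyword list checked with the in operator would be needed.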

Vectorizing trigrams with all possible 3-grams - Python

I'm trying to create a 3-gram model to apply machine learning techniques.
Basically, this is what I'm trying:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import itertools
my_array = ['worda', 'wordb']
vector = CountVectorizer(analyzer=nltk.trigrams,ngram_range=(3,3))
vector.fit_transform(my_array)
My vocabulary:
{('o', 'r', 'd'): 0,
('r', 'd', 'a'): 1,
('r', 'd', 'b'): 2,
('w', 'o', 'r'): 3}
None of my words have spaces or special characters.
So when I run this:
tr_test = vector.transform(['word1'])
print(tr_test)
print(tr_test.shape)
I get this return:
(0, 0) 1
(0, 1) 1
(0, 3) 1
(1, 4) #this is the shape
I think this is right... at least makes sense...
But I would like to represent each word with a matrix containing all 3-gram possibilities. So, each word would be represented by a (1x17576) matrix.
Now I'm using a 1x4 matrix (in this particular case), because my vocabulary is built from my data.
17576 (26^3) is the number of all 3-letter combinations in the alphabet (aaa, aab, aac, etc.).
I tried to set my vocabulary to an array with all 3-grams possibilities, like this:
# This creates an array with all 3-letter combinations
# ['aaa', 'aab', 'aac', ...]
from string import ascii_lowercase
keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 3)]
vector = CountVectorizer(analyzer=nltk.trigrams,ngram_range=(3,3), vocabulary=keywords)
This didn't work... Can someone figure out how to do this?
Thanks!!!

I changed the analyzer to 'char', and it seems to work now:
from string import ascii_lowercase
keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 3)]
vector = CountVectorizer(analyzer='char', ngram_range=(3,3), vocabulary=keywords)
tr_test = vector.transform(['word1'])
print(tr_test)
And the output is:
(0, 9909) 1
(0, 15253) 1
Just as a check:
test = vector.transform(['aaa aab'])
print(test)
The output:
(0, 0) 1
(0, 1) 1
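The column numbers in these outputs can be checked by hand. The sketch below does not call scikit-learn at all; it only rebuilds the same fixed vocabulary and looks up where each character trigram of a word lands. char_trigrams and trigram_columns are hypothetical helper names, and the real vectorizer's preprocessing (such as whitespace handling) is not reproduced here.

```python
import itertools
from string import ascii_lowercase

# Fixed vocabulary: 'aaa' is column 0, 'aab' is column 1, ..., 'zzz' is column 17575.
keywords = [''.join(t) for t in itertools.product(ascii_lowercase, repeat=3)]
index_of = {gram: i for i, gram in enumerate(keywords)}


def char_trigrams(word):
    """All overlapping 3-character substrings of the word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_columns(word):
    """Distinct column indices of the word's trigrams that are in the vocabulary."""
    return sorted({index_of[g] for g in char_trigrams(word) if g in index_of})


print(len(keywords))             # 17576
print(trigram_columns('word1'))  # [9909, 15253]
```

Here 'ord' maps to 9909 and 'wor' to 15253, while 'rd1' is dropped because it contains a digit and is therefore not in the vocabulary.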

Python:Update list of tuples

I have a list of tuples like this:
list = [(1, 'q'), (2, 'w'), (3, 'e'), (4, 'r')]
and I am trying to create an update function update(item, num) which searches for the item in the list and then changes its num.
For example, if I call update(w,6) the result would be
list = [(1, 'q'), (6, 'w'), (3, 'e'), (4, 'r')]
I tried this code but I got an error:
if item in heap:
    heap.remove(item)
    Pushheap(item,num)
else:
    Pushheap(item,num)
Pushheap is a function that pushes tuples onto the heap.
Any ideas?

You can simply scan through the list looking for a tuple with the desired letter and replace the whole tuple (you can't modify tuples), breaking out of the loop when you've found the required item. Eg,
lst = [(1, 'q'), (2, 'w'), (3, 'e'), (4, 'r')]
def update(item, num):
    for i, t in enumerate(lst):
        if t[1] == item:
            lst[i] = num, item
            break
update('w', 6)
print(lst)
output
[(1, 'q'), (6, 'w'), (3, 'e'), (4, 'r')]
However, you should seriously consider using a dictionary instead of a list of tuples. Searching a dictionary is much more efficient than doing a linear scan over a list.

As noted in the comments, you are using an immutable data structure for data items that you are attempting to change. Without further context, it looks like you want a dictionary, not a list of tuples, and it also looks like you want the second item in the tuple (the letter) to be the key, since you are planning on modifying the number.
Using these assumptions, I recommend converting the list of tuples to a dictionary and then using normal dictionary assignment. This also assumes that order is not important (if it is, you can use an OrderedDict) and that the same letter does not appear twice (if it does, only the last number will be in the dict).
>>> lst = [(1, 'q'), (2, 'w'), (3, 'e'), (4, 'r')]
>>> item_dict = dict(i[::-1] for i in lst)
>>> item_dict
{'q': 1, 'r': 4, 'e': 3, 'w': 2}
>>> item_dict['w'] = 6
>>> item_dict
{'q': 1, 'r': 4, 'e': 3, 'w': 6}
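The session above shows the arbitrary key order of older Python versions. Since Python 3.7, a plain dict keeps insertion order, so the round trip back to a list of tuples preserves the original positions. A minimal sketch, assuming each letter appears only once:

```python
lst = [(1, 'q'), (2, 'w'), (3, 'e'), (4, 'r')]

# Map letter -> number; dicts keep insertion order in Python 3.7+.
item_dict = {letter: num for num, letter in lst}

# Update in place; an existing key keeps its original position.
item_dict['w'] = 6

# Convert back to (number, letter) tuples if that shape is still needed.
updated = [(num, letter) for letter, num in item_dict.items()]
print(updated)  # [(1, 'q'), (6, 'w'), (3, 'e'), (4, 'r')]
```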

Tuples are immutable objects, which means that once they're created, you can't change their contents.
You can work around this, however, by replacing the tuple you want to change. Possibly something such as this:
def change_item_in_list(lst, item, num):
    for pos, tup in enumerate(lst):
        if tup[1] == item:
            lst[pos] = (num, item)
            return
l = [(1, 'q'), (2, 'w'), (3, 'e'), (4, 'r')]
print(l)
change_item_in_list(l, 'w', 6)
print(l)
But as @brianpck has already said, you probably want an (ordered) dictionary instead of a list of tuples.

Detecting if string iterator is a blank space

I'm attempting to write a small block of code that detects the most frequently occurring character. However, I've become stuck on not being able to detect if a value is blank space.
Below is the code I have:
text = "Hello World!"
## Use lower() because case does not matter
setList = list(set(text.lower()))
for s in setList:
    if s.isalpha() and s != " ":
        ## Do Something
        pass
    else:
        setList.remove(s)
The problem is that setList ends up with the following values:
[' ', 'e', 'd', 'h', 'l', 'o', 'r', 'w']
I've tried multiple ways of detecting the blank space with no luck, including using strip() on the original string value. isspace() will not work because it looks for at least one character.

The problem is that you are removing items from a list while iterating over it. Never do that. Consider this case:
['!', ' ', 'e', 'd', 'h', 'l', 'o', 'r', 'w']
This is what setList looks like after converting to a set and then to a list. In the first iteration, ! is seen and removed from setList. Now that ! is removed, the remaining items shift left, so the space character takes over the current position. For the next iteration, the iterator advances and points to e, skipping the space. That is why the space is still there in the output. You can check this with this program:
num_list = range(10)
for i in num_list:
    print i,
    if i % 2 == 1:
        num_list.remove(i)
        pass
Output
0 1 3 5 7 9
But if you comment num_list.remove(i), the output will become
0 1 2 3 4 5 6 7 8 9
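Two common ways to avoid the skipping are sketched below in Python 3 syntax: build a new list with only the wanted items, or iterate over a copy while removing from the original. The variable letters is a name introduced for this illustration.

```python
text = "Hello World!"

# Option 1: build a new list instead of removing from the one being iterated.
letters = sorted(c for c in set(text.lower()) if c.isalpha())
print(letters)  # ['d', 'e', 'h', 'l', 'o', 'r', 'w']

# Option 2: iterate over a shallow copy (num_list[:]) and remove from the original.
num_list = list(range(10))
for i in num_list[:]:
    if i % 2 == 1:
        num_list.remove(i)
print(num_list)  # [0, 2, 4, 6, 8]
```

The first option is usually easier to read; sorted is used only to make the output order predictable, since sets are unordered.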
To solve your actual problem, you can use collections.Counter to find the frequency of characters, like this
from collections import Counter
d = Counter(text.lower())
if " " in d: del d[" "] # Remove the count of space char
print d.most_common()
Output
[('l', 3), ('o', 2), ('!', 1), ('e', 1), ('d', 1), ('h', 1), ('r', 1), ('w', 1)]

A short way is to first remove the spaces from the text:
>>> text = "Hello world!"
>>> text = text.replace(" ", "")
>>> max(text, key=text.count)
'l'
This isn't very efficient though, because count scans the entire string once for each character (O(n^2)).
For longer strings it's better to use collections.Counter or collections.defaultdict to do the counting in a single pass.
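A single-pass version with collections.defaultdict might look like the sketch below (Python 3 syntax). most_frequent_char is a hypothetical name, and the function assumes the text contains at least one non-space character; otherwise max raises ValueError.

```python
from collections import defaultdict


def most_frequent_char(text):
    """Return the most frequent non-whitespace character, ignoring case."""
    counts = defaultdict(int)
    # One pass over the text: O(n) instead of O(n^2).
    for c in text.lower():
        if not c.isspace():
            counts[c] += 1
    # Pick the character with the highest count.
    return max(counts, key=counts.get)


print(most_frequent_char("Hello World!"))  # l
```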

How about removing the blank spaces before you start with lists and sets:
import re
text = "Hello world!"
text = re.sub(' ', '', text)
# text is now "Helloworld!"

The above answers are legitimate. You could also use the built-in str.count method if you are not concerned with algorithmic complexity. For example:
## Use lower() because case does not matter
text=text.lower()
num=0
most_freq=None
for s in text:
    cur=text.count(s)
    if cur>num:
        most_freq=s
        num=cur
    else:
        pass

How about using split()? It returns an empty list, which is falsy, when the character is a blank space:
>>> [ x for x in text if x.split()]
['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd', '!']
To count the duplicates:
>>> d = dict()
>>> for e in [ x for x in text if x.split()]:
...     d[e] = d.get(e,0) + 1
...
>>> print d
{'!': 1, 'e': 1, 'd': 1, 'H': 1, 'l': 3, 'o': 2, 'r': 1, 'W': 1}

To get the single most frequent, use max:
text = "Hello World!"
count={}
for c in text.lower():
    if c.isspace():
        continue
    count[c]=count.get(c, 0)+1
print count
# {'!': 1, 'e': 1, 'd': 1, 'h': 1, 'l': 3, 'o': 2, 'r': 1, 'w': 1}
print max(count, key=count.get)
# 'l'
If you want the whole shebang:
print sorted(count.items(), key=lambda t: (-t[1], t[0]))
# [('l', 3), ('o', 2), ('!', 1), ('d', 1), ('e', 1), ('h', 1), ('r', 1), ('w', 1)]
If you want to use Counter and use a generator type approach, you could do:
from collections import Counter
from string import ascii_lowercase
print Counter(c.lower() for c in text if c.lower() in ascii_lowercase)
# Counter({'l': 3, 'o': 2, 'e': 1, 'd': 1, 'h': 1, 'r': 1, 'w': 1})
