Python - Finding most occurring words in a CSV row - python

I want to find the most occurring substring in a CSV row either by itself, or by using a list of keywords for lookup.
I've found a way to find out the top 5 most occurring words in each row of a CSV file using Python using the below responses, but, that doesn't solve my purpose. It gives me results like -
[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]
Whereas, I would want something like 'Trojan', 'Downloader', 'Powershell' ... as the top results.
The matching words can be a substring of a value (cell) in the CSV or can be a combination of two or more words. Can someone help fix this either by using a keywords list or without.
Thanks!

Let, my_values = ['A', 'B', 'C', 'A', 'Z', 'Z' ,'X' , 'A' ,'X','H','D' ,'A','S', 'A', 'Z'] is your list of words which is to sort.
Now take a list which will store information of occurrences of every words.
count_dict={}
Populate the dictionary with appropriate values :
for i in my_values:
if count_dict.get(i)==None: #If the value is not present in the dictionary then this is the first occurrence of the value
count_dict[i]=1
else:
count_dict[i] = count_dict[i]+1 #If previously found then increment it's value
Now sort the values of dict according to their occurrences :
sorted_items= sorted(count_dict.items(),key=operator.itemgetter(1),reverse=True)
Now you have your expected results!
The most occurring 3 values are:
print(sorted_items[:3])
output :
[('A', 5), ('Z', 3), ('X', 2)]
The most occurring 2 values are :
print(sorted_items[:3])
output:
[('A', 5), ('Z', 3)]
and so on.

Related

outputting dictionary values to text file: uncoupling lists and strings

I have a dictionary called shared_double_lists which is made of 6 keys called [(0, 1), (1, 2), (1, 3), (2, 3), (0, 3), (0, 2)]. The values for all the keys are lists.
I am trying to output the values for key (0, 1) to a file. Here is my code:
output = open('test_output.txt', 'w')
counter = 0
for locus in shared_double_lists[(0, 1)]:
for value in locus:
output.write(str(shared_double_lists[(0, 1)][counter]))
output.write ("\t")
output.write ("\n")
counter +=1
output.close()
This almost works, the output looks like this:
['ACmerged_contig_10464', '1259', '.', 'G', 'C', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
['ACmerged_contig_10464', '1260', '.', 'A', 'T', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
Whereas I want it to look like this:
ACmerged_contig_10464 1259 . G C 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
ACmerged_contig_10464 1260 . A T 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
i.e. not have the lines of text in list format in the file, but have each item of each list separated by a tab
You can simply join lists to a string: Docs
my_string = '\t'.join(my_list)
\t should join them with a tab, but you can use what you want there.
In this example:
output.write('\t'.join(shared_double_lists[(0, 1)][counter]))

Compare List of Tuples and Return Indices of Matched Values

I'm new to programming and am having some trouble with this exercise. The goal is to write a function that returns a list of matching items.
Items are defined by a tuple with a letter and a number and we consider item 1 to match item 2 if:
Both their letters are vowels (aeiou), or both are consonants
AND
The sum of their numbers is a multiple of 3
NOTE: The return list should not include duplicate matches --> (1,2) contains the same information as (2,1), the output list should only contain one of them.
Here's an example:
***input:*** [('a', 4), ('b', 5), ('c', 1), ('d', 3), ('e', 2), ('f',6)]
***output:*** [(0,4), (1,2), (3,5)]
Any help would be much appreciated!
from itertools import combinations
lst = [('a', 4), ('b', 5), ('c', 1), ('d', 3), ('e', 2), ('f',6)]
vowels = 'aeiou'
matched = [(i[0],j[0]) for (i,j) in combinations(enumerate(lst),2) if (i[1][0] in vowels) == (j[1][0] in vowels) and ((i[1][1] + j[1][1]) % 3 == 0)]
print(matched)
Sorry, I'm high enough rep to comment, but i'll edit / update once I can.
Im a little confused about the question, what is the purpose of the letters, should we be using their positon in the alphabet as their value? i.e a=0, b=1?
what are we comparing one tuple to?
Thanks
You can use itertools.combinations with enumerate to iterate all combinations and output indices. Combinations do not include permutations, so you will not see duplicates.
from itertools import combinations
lst = [('a', 4), ('b', 5), ('c', 1), ('d', 3), ('e', 2), ('f',6)]
def checker(lst):
vowels = set('aeiou')
for (idx_i, i), (idx_j, j) in combinations(enumerate(lst), 2):
if ((i[0] in vowels) == (j[0] in vowels)) and ((i[1] + j[1]) % 3 == 0):
yield idx_i, idx_j
res = list(checker(lst))
# [(0, 4), (1, 2), (3, 5)]

Trying to create python program, but simplified with def

I'm trying to create a program without importing anything. The program lets the user input a passage, then prints how many A's there are in the message, how many B's, etc.
So it works...it's just VERY long. I'm new to coding, and I know that there is a way to simplify the code below with def but I'm not really sure how. Can anyone help?
You need no methods, but you can definately cut it short:
String can be used as an array of characters.
You can use the index method to determine what is the position of the letter in the alphabet.
You can iterate a zipped list of pairs from the alphabet and the counter list, to produce the output.
Use if letter in alphabet as a guard to ensure the letter is valid for the alphabet, instead of hard coding the alphabet. That way you can even expand your alphabet. (Note that the counter is set to the length of the alphabet).
Here is a suggestion:
message = input('what is your message? ').upper()
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
counter = [0] * len(alphabet)
for letter in message:
if letter in alphabet:
counter[alphabet.index(letter)] += 1
for letter, count in zip(alphabet, counter):
print(letter, ':', count)
One can do it with a one line instruction, where we make use of:
count method of string that returns the numbers of element contained in a string
chr function that gives a character from an int. chr(65) gives a A, chr(66) gives a B, ...
join function that concatenates strings of a list
The result looks like
message = input('what is your message? ').upper()
print('\n'.join([chr(65+i)+':'+str(message.count(chr(65+i))) for i in range(26)]))
For a very short and elegant solution use the Counter unit from the collections module:
from collections import Counter
message=raw_input("what is your message?")
message=message.upper()
c = Counter(message)
print c.most_common()
This counts every kind of letter in the message. And it can even sort the result for you quickly. Here is a sample dialog:
"what is your message?Hi there, new Pythonist!
[(' ', 3), ('E', 3), ('H', 3), ('T', 3), ('I', 2), ('N', 2), ('!', 1), (',', 1), ('O', 1), ('P', 1), ('S', 1), ('R', 1), ('W', 1), ('Y', 1)]"

Cipher Decryption based on letter frequencys - Python

Okay so I need to make a python program that takes an encrypted string and from this works out the English plain text using letter frequency. Now from what I gather I should be taking the string and using string.count to get the frequency although I am stuck from here.
After getting the frequency how can I then say the most frequent letter in the cipher is 'e' so print all of the most frequent letter as 'e', the 2nd most frequent is 't' and so on?
Can anyone give me a few things to look at which could help with the creation of this?
from collections import Counter
code_string = "abcdhjshslsldjhdjh"
letters = Counter(code_string)
print(letters.most_common())
results in
[('h', 4), ('d', 3), ('j', 3), ('s', 3), ('l', 2), ('a', 1), ('c', 1), ('b', 1)]

How to count characters in a file and print them sorted alphanumerically

(If you have a better title, do edit, I couldn't explain it properly! :)
So this is my code:
with open('cipher.txt') as f:
f = f.read().replace(' ', '')
new = []
let = []
for i in f:
let.append(i)
if i.count(i) > 1:
i.count(i) == 1
else:
new = sorted([i + ' ' + str(f.count(i)) for i in f])
for o in new:
print(o)
And this is cipher.txt:
xli uymgo fvsar jsb
I'm supposed to print out the letters used and how many times they are used, my code works, but I need it alphabetical, I tried putting them in a list list(a) and then sorting them, but i didn't quite get it, any ideas? Thanks in advance!
Whenever dealing with counting, you can use collections.Counter here:
>>> from collections import Counter
>>> print sorted(Counter('xli uymgo fvsar jsb'.replace(' ', '')).most_common())
[('a', 1), ('b', 1), ('f', 1), ('g', 1), ('i', 1), ('j', 1), ('l', 1), ('m', 1), ('o', 1), ('r', 1), ('s', 2), ('u', 1), ('v', 1), ('x', 1), ('y', 1)]
If you can't import any modules, then you can append a to a list and then sort it:
new = []
for i in f:
new.append(i + ' ' + str(f.count(i)) # Note that i is a string, so str() is unnecessary
Or, using a list comprehension:
new = [i + ' ' + str(f.count(i)) for i in f]
Finally, to sort it, just put sorted() around it. No extra parameters are needed because your outcome is alphabetical :).
Here's a oneliner without imports:
{s[i]: n for i, n in enumerate(map(s.count, s))}
And in alphabetical order (if the above is d):
for k in sorted(d): print k, d[k]
Or another version (oneliner alphabetical):
sorted(set([(s[i], n) for i, n in enumerate(map(s.count, s))]))

Categories