Creating a Python dictionary with letters as keys and their frequencies as values - python

I have to read text that looks like
TCCATCTACT
GGGCCTTCCT
TCCATCTACC
etc...
I want to create a dictionary. How can I read through this and set T, C, A, and G as the keys, with the values being the frequency each letter
appears throughout the text?

Simply pass the whole string to a collections.Counter() object and it'll count each character.
It may be more efficient to do so line by line, so as not to require too much memory:
from collections import Counter

counts = Counter()
with open('inputtextfilename') as infh:
    for line in infh:
        counts.update(line.strip())
The str.strip() call removes any whitespace (such as the newline character).
A quick demo using your sample input:
>>> from collections import Counter
>>> sample = '''\
... TCCATCTACT
... GGGCCTTCCT
... TCCATCTACC
... '''.splitlines(True)
>>> counts = Counter()
>>> for line in sample:
...     counts.update(line.strip())
...
>>> for letter, count in counts.most_common():
...     print(letter, count)
...
C 13
T 10
A 4
G 3
I used the Counter.most_common() method to get a sorted list of letter-count pairs (in order from most to least common).
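If you also want each letter as a fraction of the total rather than a raw count, you can divide each count by the sum of all counts. A small extension of the above (not part of the original answer), with the sample data inlined as one string:

```python
from collections import Counter

# The three sample lines joined into one string.
counts = Counter("TCCATCTACTGGGCCTTCCTTCCATCTACC")
total = sum(counts.values())  # total number of letters counted

# Print each letter with its count and relative frequency, most common first.
for letter, count in counts.most_common():
    print(f"{letter}: {count} ({count / total:.1%})")
```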

How can I clean this data for easier visualizing?

I'm writing a program to read a set of data rows and quantify matching sets. I have the code below, but I would like to cut or filter out the numbers at the end, which are preventing rows from being recognized as a match.
import collections

a = "test.txt"  # This can be changed to a = input("What's the filename? ")
line_file = open(a, "r")
print(line_file.readable())  # Readable check.
# print(line_file.read())  # Prints each individual line.

# Code for quantity counter.
counts = collections.Counter()  # Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()
This is what it outputs; however, I'd like it to not read the numbers at the end and to pair the matching sets accordingly.
A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1
What I'm aiming for is for it to ignore the "-77" here and instead count the total as x3:
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
Split each element on the dashes and check whether the last part is a number. If so, remove it, then continue on.
from collections import Counter

def trunc(s):
    # Drop a trailing numeric segment, e.g. 'A2-W-FF-DIN-11' -> 'A2-W-FF-DIN'.
    parts = s.split('-')
    if parts[-1].isnumeric():
        return '-'.join(parts[:-1])
    return s

with open('data.txt') as f:
    data = [trunc(x.rstrip()) for x in f.readlines()]

counts = Counter(data)
for k, v in counts.items():
    print(k, v)
Output
A2-W-FF-DIN 2
A2-FF-DIN 1
B12-H-BB-DD 3
C1-GH-KK-LOP 1
You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.
Here, (?P<base>.+?) is a non-greedy match of any character except for a newline grouped under the name "base", (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 occurring at the end of the "base" group and puts the digits in a group named "suffix", and \Z is the end of the string.
This is what it does in action:
>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}
So you can see that, in this instance, whether or not the string has a digit suffix, the base is the same.
All together, here's a self-contained example of how it might be applied to data like this:
import collections
import re

regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP",
]

counts = collections.Counter()
# Iterates through the data and updates the counter.
for datum in sample_data:
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    counts.update((number,))

# Prints each number and how many instances were found.
for key, count in counts.items():
    print(f"{key}: x{count}")
For which the output is
A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1
Or in the example code you provided, it might look like this:
import collections
import re

# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

a = "test.txt"
line_file = open(a, "r")
print(line_file.readable())  # Readable check.

# Creates a new counter.
counts = collections.Counter()
with open(a) as infile:
    for line in infile:
        for number in line.split():
            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()

Create code in Python to get the most frequent word and tag pair from a list

I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).
Example of txt file:
1 i PRP
2 want VBP
3 to TO
4 go VB
I want to find the frequency of the word and tag as a pair in the list in order to find the most frequently assigned tag to a word.
Example of Results:
3 (food, NN), 2 (Brave, ADJ)
My idea is to open the file, read it line by line and split each line, build a counter using a dictionary, and print the pairs from most common to least common in descending order.
My code is extremely rough (I'm almost embarrassed to post it):
file = open("/Users/Desktop/Folder1/trained.txt")
wordcount = {}
for word in file.read().split():
    from collections import Counter
    c = Counter()
    for d in dicts.values():
        c += Counter(d)
print(c.most_common())
file.close()
Obviously, I'm getting no results. Anything will help. Thanks.
UPDATE:
So I got this code posted on here which worked, but my results are kinda funky. Here's the code (the author removed it, so I don't know who to credit):
file = open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True))
here are my results:
[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
There seems to be a mix of good results with words and other results with spaces or random numbers. Does anyone know a way to remove the entries that aren't real words? Also, I know \t signifies a tab; is there a way to remove that as well? You guys really helped a lot.
You need to have a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, without checking every word to see if it is known.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
for row in file:  # read one line into `row`
    if not row.strip():
        continue  # ignore empty lines
    pos, word, tag = row.split()
    counts[word.lower()][tag] += 1
That's it, you can now check the most common tag of any word:
print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
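And if you want the most frequent tag for every word at once, you can loop over the outer dictionary and take the top entry of each inner Counter. A sketch building on the structure above, with a few inline sample rows in place of the file (the rows here are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical sample rows in the same "position word tag" format.
rows = [
    "1 i PRP",
    "2 want VBP",
    "3 want VB",
    "4 want VBP",
]

counts = defaultdict(Counter)
for row in rows:
    pos, word, tag = row.split()
    counts[word.lower()][tag] += 1

# Report the single most frequent tag for each word.
for word, tags in counts.items():
    tag, n = tags.most_common(1)[0]
    print(f"{word}: {tag} x{n}")
```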
If you don't mind using pandas which is a great library for tabular data I would do the following:
import pandas as pd

df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=" ", header=None,
                 names=["position", "word", "tag"])
df["word_tag_counts"] = df.groupby(["word", "tag"])["position"].transform("count")
Then if you only want the maximum one from each group you can do:
df.groupby(["word", "tag"]).max()["word_tag_counts"]
which should give you a table with the values you want
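A related pandas idiom, if you only need the counts themselves rather than a new column, is to take the group sizes directly. A sketch with an inline DataFrame standing in for the file (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical sample data in the same three-column layout.
df = pd.DataFrame({
    "position": [1, 2, 3, 4],
    "word": ["i", "want", "want", "want"],
    "tag": ["PRP", "VBP", "VB", "VBP"],
})

# Count each (word, tag) pair, most frequent first.
pair_counts = df.groupby(["word", "tag"]).size().sort_values(ascending=False)
print(pair_counts)
```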

Constructing peculiar dictionary out of file (python)

I'd like to automatically form a dictionary from files that have the following structure.
str11 str12 str13
str21 str22
str31 str32 str33 str34
...
that is, two, three or four strings each line, with spaces in between. The dictionary I'd like to construct out of this list must have following structure:
{str11:(str12,str13),str21:(str22),str31:(str32,str33,str34), ... }
(that is, all entries str*1 are the keys -- all of them different -- and the remaining ones are the values). What can I use?
>>> with open('abc') as f:
...     dic = {}
...     for line in f:
...         key, val = line.split(None, 1)
...         dic[key] = tuple(val.split())
...
>>> dic
{'str31': ('str32', 'str33', 'str34'),
 'str21': ('str22',),
 'str11': ('str12', 'str13')}
If you want the order of items to be preserved, then consider using OrderedDict (on Python 3.7+, plain dicts preserve insertion order as well):
>>> from collections import OrderedDict
>>> with open('abc') as f:
dic = OrderedDict()
for line in f:
key, val = line.split(None,1)
dic[key] = tuple(val.split())
...
>>> dic
OrderedDict([
('str11', ('str12', 'str13')),
('str21', ('str22',)),
('str31', ('str32', 'str33', 'str34'))
])
Using a StringIO instance for simplicity:
import io
fobj = io.StringIO("""str11 str12 str13
str21 str22
str31 str32 str33 str34""")
One line does the trick:
>>> {line.split(None, 1)[0]: tuple(line.split()[1:]) for line in fobj}
{'str11': ('str12', 'str13'),
'str21': ('str22',),
'str31': ('str32', 'str33', 'str34')}
Note the line.split(None, 1). This limits the splitting to one item because we have to use .split() twice in a dict comprehension. We cannot store intermediate results for reuse as in a loop. The None means split at any whitespace.
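On Python 3.8+ you can avoid calling .split() twice by binding the split result with an assignment expression (the walrus operator) inside the comprehension. This is an alternative sketch, not part of the original answer, using an inline list in place of the file object:

```python
lines = ["str11 str12 str13", "str21 str22", "str31 str32 str33 str34"]

# `parts` is bound once per line via the walrus operator, then reused
# for both the key (parts[0]) and the value (parts[1:]).
dic = {(parts := line.split())[0]: tuple(parts[1:]) for line in lines}
print(dic)
```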
For an OrderedDict you can also get away with one line using a generator expression:
from collections import OrderedDict
>>> OrderedDict((line.split(None, 1)[0], tuple(line.split()[1:]))
...             for line in fobj)
OrderedDict([('str11', ('str12', 'str13')), ('str21', ('str22',)),
             ('str31', ('str32', 'str33', 'str34'))])

Count the number of repetitions of values in a list and generate an outfile

I have a file having a few columns like:
PAIR 1MFK 1 URANIUM 82 HELIUM 112 2.5506
PAIR 2JGH 2 PLUTONIUM 98 POTASSIUM 88 5.3003
PAIR 345G 3 SODIUM 23 CARBON 14 1.664
PAIR 4IG5 4 LITHIUM 82 ARGON 99 2.5506
PAIR 234G 5 URANIUM 99 KRYPTON 89 1.664
Now what I wanted to do is read the last column, count how many times each value repeats, and generate an output file containing two columns, 'VALUE' and 'NO OF TIMES REPEATED'.
I have tried like:
inp = ('filename'.'r').read().strip().replace('\t', ' ').split('\n')
from collections import defaultdict
D = defaultdict(line)
for line in map(str.split, inp):
    k = line[-1]
    D[k].append(line)
I'm stuck here.
Please help!
There are a number of issues with the code as posted. The file needs to be opened with open('filename', 'r'), and the argument to defaultdict should be list, not line. Here is a fixed-up version of your code:
from collections import defaultdict

D = defaultdict(list)
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k].append(line)

print('VALUE  NO TIMES REPEATED')
print('-----  -----------------')
for value, lines in D.items():
    print('%-6s %d' % (value, len(lines)))
Another way to do it is to use collections.Counter to conveniently sum the number of repetitions. That lets you simplify the code a bit:
from collections import Counter

D = Counter()
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k] += 1

print('VALUE  NO TIMES REPEATED')
print('-----  -----------------')
for value, count in D.items():
    print('%-6s %d' % (value, count))
Now what I wanted to do is read the last column and iterate the values for repetitions and generate an output file containing two column 'VALUE' & 'NO OF TIMES REPEATED'.
So use collections.Counter to count the number of times each value appears, not a defaultdict. (It's not at all clear what you're trying to do with the defaultdict, and your initialization won't work, anyway; defaultdict is constructed with a callable that will create a default value. In your case, the default value you apparently had in mind is an empty list, so you would use list to initialize the defaultdict.) You don't need to store the lines to count them. The Counter counts them for you automatically.
Also, processing the entire file ahead of time is a bit ugly, since you can iterate over the file directly and get lines, which does part of the processing for you. Although you can actually do that iteration automatically in the Counter creation.
Here is a complete solution:
from collections import Counter

with open('input', 'r') as data:
    histogram = Counter(line.split('\t')[-1].strip() for line in data)

with open('output', 'w') as result:
    for item in histogram.items():
        result.write('%s\t%s\n' % item)
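If the output file should list the most repeated values first, Counter.most_common() returns the pairs in descending order of count. A small variation on the above (an assumption about the desired ordering, not stated in the question), with the sample rows inlined:

```python
from collections import Counter

lines = [
    "PAIR 1MFK 1 URANIUM 82 HELIUM 112 2.5506",
    "PAIR 2JGH 2 PLUTONIUM 98 POTASSIUM 88 5.3003",
    "PAIR 345G 3 SODIUM 23 CARBON 14 1.664",
    "PAIR 4IG5 4 LITHIUM 82 ARGON 99 2.5506",
    "PAIR 234G 5 URANIUM 99 KRYPTON 89 1.664",
]

# Count the last whitespace-separated column of each row.
histogram = Counter(line.split()[-1] for line in lines)

# most_common() yields (value, count) pairs, highest count first.
for value, count in histogram.most_common():
    print(f"{value}\t{count}")
```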

Searching and writing

I need to write a program which looks for words with the same three middle characters (each word is 5 characters long) in a list, then writes them into a file like this:
wasdy
casde
tasdf
gsadk
csade
hsadi
Between the groups of similar words I need to leave an empty line. I am kinda stuck.
Is there a way to do this? I use Python 3.2 .
Thanks for your help.
I would use the itertools.groupby function for this. Assuming wordlist is a list of the words you want to group, this code does the trick.
import itertools

for k, v in itertools.groupby(wordlist, lambda word: word[1:4]):
    # here, k is the key the words are grouped by, i.e. word[1:4]
    # and v is an iterable of the words in the group
    for word in v:
        print(word)
    print()
itertools.groupby(wordlist, lambda word: word[1:4]) basically takes all the words, and groups them by word[1:4], i.e. the three middle characters. Here's the output of the above code with your sample data:
wasdy
casde
tasdf

gsadk
csade
hsadi
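One caveat worth knowing: itertools.groupby only groups adjacent items, so if matching words are not already next to each other in the list, sort by the same key first. A sketch of that assumption with a shuffled word list:

```python
import itertools

# Matching words are deliberately not adjacent here.
wordlist = ['wasdy', 'gsadk', 'casde', 'csade', 'tasdf', 'hsadi']
key = lambda word: word[1:4]

# Sorting by the grouping key brings matching words together,
# so groupby can collect each set into a single complete group.
for k, group in itertools.groupby(sorted(wordlist, key=key), key):
    print(' '.join(group))
```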
To get you started: try using the builtin sorted function on the list of words, and for the key you should experiment with slicing each word, e.g. x[1:4].
For example:
some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']
sorted(some_list, key=lambda x: sorted(x[1:4]))
# outputs ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi', 'other']
edit: It was unclear to me whether you wanted "same three middle characters, in order" or just "same three middle characters". If the latter, then you could look at sorted(some_list, key = lambda x: x[1:4]) instead.
Try:
from collections import defaultdict

dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
then, to write to an output file:
with open('outfile.txt', 'w') as f:
    for key in dict_of_words:
        f.write('\n'.join(dict_of_words[key]))
        f.write('\n\n')  # blank line between groups
word_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']

def test_word(word):
    # True when 'a', 's' and 'd' all appear among the middle three characters.
    return all(x in word[1:4] for x in ['a', 's', 'd'])

f = open('yourfile.txt', 'w')
f.write('\n'.join(word for word in word_list if test_word(word)))
f.close()
returns:
wasdy
casde
tasdf
gsadk
csade
hsadi