Create dictionary and see if key always has same value - python

If I had a file of lines starting with a number followed by some text, how could I see if the numbers are always followed by different text? For example:
0 Brucella abortus Brucellaceae
0 Brucella ceti Brucellaceae
0 Brucella canis Brucellaceae
0 Brucella ceti Brucellaceae
So here, I'd like to know that 0 is followed by 3 different "types" of text.
Ideally I could read the file into a Python script that would output something like this:
1:250
2:98
3:78
4:65
etc.
The first number would be the number of different "texts", and the number after the : would be how many numbers have that many different texts.
I have the following script that calculates how many times a "text" is found under different numbers, so I'm wondering how to reverse it, in a sense, so that I know how many different texts each number has and how many numbers share each count. The script reads the file of numbers and "texts" into a dictionary, but I'm unsure how to manipulate this dictionary to get what I want.
#!/usr/bin/env python
# Dictionary of cluster number broken into species, genus, family
fileIn = 'usearchclusternumgenus.txt'
d = {}
with open(fileIn, "r") as f:
    for line in f:
        clu, gen, spec, fam = line.split()
        d.setdefault(clu, []).append(spec)
# Iterate through and find out how many times each key occurs
vals = {}  # A dictionary to store how often each value occurs.
for i in d.values():
    for j in set(i):  # Convert to a set to remove duplicates
        vals[j] = 1 + vals.get(j, 0)  # If we've seen this value, increment the count;
                                      # otherwise we get the default of 0 and increment that
# print vals
# Iterate through each possible frequency and find how many values have that count.
counts = {}  # A dictionary to store the final frequencies.
# We will iterate from 0 (which is a valid count) to the maximum count
for i in range(0, max(vals.values()) + 1):
    # Find all values that have the current frequency, count them
    # and add them to the frequency dictionary
    counts[i] = len([x for x in vals.values() if x == i])
for key in sorted(counts.keys()):
    if counts[key] > 0:
        print key, ":", counts[key]

Use a collections.defaultdict() object with a set as the factory to track different lines, then print out the sizes of the collected sets:
from collections import defaultdict

unique_clu = defaultdict(set)
with open(fileIn) as infh:
    for line in infh:
        clu, gen, spec, rest = line.split(None, 3)
        unique_clu[clu].add(spec)

for key in sorted(unique_clu):
    count = len(unique_clu[key])
    if count:
        print '{}:{}'.format(key, count)
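If the goal is the distribution described in the question (number of distinct texts, followed by how many cluster numbers have that many), a minimal sketch on top of the unique_clu mapping above, using collections.Counter, might be:
from collections import Counter

# Count how many cluster numbers have each number of distinct species texts.
distribution = Counter(len(specs) for specs in unique_clu.values())
for n_texts in sorted(distribution):
    print('{}:{}'.format(n_texts, distribution[n_texts]))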

Related

How can I clean this data for easier visualizing?

I'm writing a program to read a set of data rows and quantify matching sets. I have the code below; however, I would like to cut or filter off the trailing numbers, which are preventing rows from being recognized as a match.
import collections

a = "test.txt"  # This can be changed to a = input("What's the filename? ")
line_file = open(a, "r")
print(line_file.readable())  # Readable check.
# print(line_file.read())  # Prints each individual line.

# Code for quantity counter.
counts = collections.Counter()  # Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()
This is what it outputs; however, I'd like it to not read the numbers at the end, and to pair the matching sets accordingly.
A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1
What I'm aiming for is for it to ignore the "-77" here, so that these two entries are counted together as x3:
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
Split each element on the dashes and check whether the last part is a number. If so, remove it, then continue on.
from collections import Counter

def trunc(s):
    parts = s.split('-')
    if parts[-1].isnumeric():
        return '-'.join(parts[:-1])
    return s

with open('data.txt') as f:
    data = [trunc(x.rstrip()) for x in f.readlines()]

counts = Counter(data)
for k, v in counts.items():
    print(k, v)
Output
A2-W-FF-DIN 2
A2-FF-DIN 1
B12-H-BB-DD 3
C1-GH-KK-LOP 1
You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.
Here, (?P<base>.+?) is a non-greedy match of any character except for a newline grouped under the name "base", (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 occurring at the end of the "base" group and puts the digits in a group named "suffix", and \Z is the end of the string.
This is what it does in action:
>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}
So you can see that, in this instance, whether or not the string has a digit suffix, the base is the same.
All together, here's a self-contained example of how it might be applied to data like this:
import collections
import re

regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP"
]

counts = collections.Counter()
# Iterates through the data and updates the counter.
for datum in sample_data:
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    counts.update((number,))
# Prints each number and how many instances were found.
for key, count in counts.items():
    print(f"{key}: x{count}")
For which the output is
A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1
Or in the example code you provided, it might look like this:
import collections
import re

# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

a = "test.txt"
line_file = open(a, "r")
print(line_file.readable())  # Readable check.

# Creates a new counter.
counts = collections.Counter()
with open(a) as infile:
    for line in infile:
        for number in line.split():
            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()

Group nltk.FreqDist output by first word (python)

I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column as below. The intent is to group the output of nltk.FreqDist by the first word.
What I have so far
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
I have 10000+ rows in my output.
My Expected Output
I would like to group the output by the first word and extract it as a dataframe
What I have tried among other solutions
I have tried adapting solutions given here and here, but no satisfactory results.
Any help/guidance appreciated.
Try the following (documentation is inside the code):
import itertools

import nltk

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter from a string to its occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear 3 or more times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this by a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only groups consecutive items, so sort by the key first.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted by their first word
    sorted(frequent_inputs.keys(), key=first_word),
    # And key by the first word
    first_word)

# If you would also want to keep the count of each word, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of words and count, sorted by the first word of the string
    sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
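As a small usage sketch (assuming the names above): groupby yields lazy (key, group) pairs, so you may want to materialize the result into a plain dict before reusing it:
# Turn the lazy groupby result into {first_word: [(message, count), ...]}.
grouped = {
    word: list(pairs)
    for word, pairs in first_word_to_inputs_and_counts
}
for word, pairs in grouped.items():
    print(word, pairs)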
I managed to do it like below. There could be an easier implementation. But for now, this gives me what I had expected.
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
# Removing empty rows
filter = temp["word"] != ""
dfNew = temp[filter]
# Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
# New column with the sentences split without the first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
# Subsetting required columns
dfNew = dfNew[['first_word', 'rest_words']]
# Grouping by first word
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
# Transpose
dfNew.T
Sample Output

How to get the sequence counts (in fasta) with conditions using python?

I have a fasta file (fasta is a format in which each header line starts with > and is followed by a sequence line corresponding to that header). I want to get the counts for sequences matching TRINITY, and the number of sequences starting with >K that follow each >TRINITY sequence. I was able to get the counts for >TRINITY sequences, but I'm not sure how to get the counts of >K sequences for the corresponding >TRINITY group. How can I get this done in Python?
myfasta.fasta:
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>K00363:119:HTJ23BBXX:1:2228:9678:46223_(reversed) 1:N:0:CGATGTAT
TAGATTTAAAATAGACGCTTCCATAGA
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>TRINITY_DN555_c0_g1_i1
>K00363:119:HTJ23BBXX:1:2228:9658:46188_(reversed) 1:N:0:CGATGTAT
CGATGCTAGATTTAAAATAGACG
>K00363:119:HTJ23BBXX:1:2106:15260:10387_(reversed) 1:N:0:CGATGTAT
TTAAAATAGACGCTTCCATAGAGA
Result I want:
reference                  reference_counts  Corresponding_K_sequences
>TRINITY_DN12824_c0_g1_i1  2                 3
>TRINITY_DN555_c0_g1_i1    1                 2
Here is the code I have written. It only accounts for >TRINITY sequence counts; I couldn't extend it to also count the corresponding >K sequences, so any help would be appreciated.
To Run:
python code.py myfasta.fasta output.txt
import sys
import os
from Bio import SeqIO
from collections import defaultdict

filename = sys.argv[1]
outfile = sys.argv[2]

dedup_records = defaultdict(list)
for record in SeqIO.parse(filename, "fasta"):
    #print(record)
    #print(record.id)
    if record.id.startswith('TRINITY'):
        #print(record.id)
        # Use the sequence as the key and then have a list of id's as the value
        dedup_records[str(record.seq)].append(record.id)
#print(dedup_records)

with open(outfile, 'w') as output:
    # to get the counts of duplicated TRINITY ids (sorted order)
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        #output.write("{} {}\n".format(ids, len(ids)))
        print(ids, len(ids))
You have the correct kind of thinking but you need to keep track of the last header that starts with "TRINITY" and slightly alter your structure:
from Bio import SeqIO
from collections import defaultdict

TRIN, d = None, defaultdict(lambda: [0, 0])
for r in SeqIO.parse('myfasta.fasta', 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K'):
        if TRIN:
            d[TRIN][1] += 1

print('reference\treference_counts\tCorresponding_K_sequences')
for k, v in d.items():
    print('{}\t{}\t{}'.format(k, v[0], v[1]))
Outputs:
reference reference_counts Corresponding_K_sequences
TRINITY_DN12824_c0_g1_i1 2 3
TRINITY_DN555_c0_g1_i1 1 2
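If the table should go to the output file passed on the command line (as in the question's python code.py myfasta.fasta output.txt invocation) rather than stdout, a minimal sketch reusing the outfile variable from the question and the d dictionary built above would be:
# Hedged sketch: write the same tab-separated table to the outfile argument.
with open(outfile, 'w') as output:
    output.write('reference\treference_counts\tCorresponding_K_sequences\n')
    for k, v in d.items():
        output.write('{}\t{}\t{}\n'.format(k, v[0], v[1]))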

How to sort a large number of lists to get a top 10 of the longest lists

So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temporary list.
How would I create a top-ten list containing the lists with the most elements in the whole file?
This is the code I have now.
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0
for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like:
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']
# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())

# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))

# print the first 10 lines
print(lines[:10])
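With around 400,000 lines, heapq.nlargest can keep just the ten longest without sorting everything; a small sketch assuming the same "so.txt" input as above:
import heapq

# Keep only the ten longest lines without sorting the whole file.
with open("so.txt") as f:
    top_ten = heapq.nlargest(10, (line.split() for line in f), key=len)
print(top_ten)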
Why not use collections to display the top 10? i.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))
When wanting to count and then find the one(s) with the highest counts, collections.Counter comes to mind:
from collections import Counter

lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)

print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']
Use a for loop and max() maybe? You say you've got a for loop that's placing the values into a temp array. From that you could use "max()" to pick out the largest value and put that into a list.
As an open for loop, something like appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to do the same process on the new list(s) until you get to the desired top 10 scenario.
EDIT: I've just realised that I've misread your question. You want to get the lists with the most elements, not the highest values. Ok.
len() is a good one for this.
longest = []
for templist in data:
    if len(templist) > len(longest):
        longest = templist
That would give you the current longest, and from there you could build a top-10 list of lengths, or of the lists themselves, or both.
If your data is really as shown with each number the same length, then I would make a dictionary with key = line, value = length, get the top value / key pairs in the dictionary and voila. Sounds easy enough.
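A minimal sketch of that dictionary idea, assuming the same node.txt input as the question (note that duplicate lines collapse into a single dictionary key):
# Map each line to its number of elements, then take the ten largest.
line_lengths = {}
with open('node.txt') as f:
    for line in f:
        line_lengths[line.strip()] = len(line.split())

top_ten = sorted(line_lengths.items(), key=lambda kv: kv[1], reverse=True)[:10]
for text, length in top_ten:
    print(length, text)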

Storing 3 different variables (dict. or list) while iterating through documents?

I am iterating through hundreds of thousands of words in several documents, looking to find the frequencies of contractions in English. I have formatted the documents appropriately, and it's now a matter of writing the correct function and storing the data properly. I need to store information for each document on which contractions were found and how frequently they were used in the document. Ideally, my data frame would look something like the following:
filename  contraction  count
file1     it's         34
file1     they're      13
file1     she's        9
file2     it's         14
file2     we're        15
file3     it's         4
file4     it's         45
file4     she's        13
How can I best go about this?
Edit: Here's my code, thus far:
for i in contractions_list:  # for each of the 144 contractions in my list
    for l in every_link:  # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1
Where processURL_short() is a function I wrote that scrapes a website and returns a speech as str.
Edit2:
link_store = {}
for i in contractions_list_test:  # for each of the 144 contractions
    for l in every_link_test:  # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print i, l, count
Here's my file-naming code:
splitlink = l.split("/")
president = splitlink[4]
speech_num = splitlink[-1]
filename = "{0}_{1}".format(president,speech_num)
Opening and reading are slow operations: don't cycle through the entire file list 144 times.
Exceptions are slow: throwing an exception for every non-contraction in every speech will be ponderous.
Don't cycle through your list of contractions checking against words. Instead, use the built-in in operator to see whether each word is in the contraction list, and then use a dictionary to tally the entries, just as you might do by hand.
Go through the files, word by word. When you see a word on the contraction list, see whether it's already on your tally sheet. If so, add a mark, if not, add it to the sheet with a count of 1.
Here's an example. I've made very short speeches and a trivial processURL_short function.
def processURL_short(string):
    return string.lower()

every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]

contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]

for l in every_link:  # for each speech
    contraction_count = {}
    content = processURL_short(l)
    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1
    for key, value in contraction_count.items():
        print key, '\t', value
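If the end goal is the filename / contraction / count table from the question, a hedged sketch, assuming pandas is available and reusing processURL_short, every_link, and contraction_list from the example above, could collect rows while counting; in real code the filename column would come from the president/speech_num naming shown in the question's edit rather than the raw link:
import pandas as pd

# Hypothetical assembly of the desired table, one row per (speech, contraction).
rows = []
for l in every_link:
    contraction_count = {}
    for word in processURL_short(l).split():
        if word in contraction_list:
            contraction_count[word] = contraction_count.get(word, 0) + 1
    for contraction, count in contraction_count.items():
        rows.append({'filename': l, 'contraction': contraction, 'count': count})

df = pd.DataFrame(rows, columns=['filename', 'contraction', 'count'])
print(df)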
you can have your structure set up like this:
links = {}
for l in every_link:
    links[l] = {}
    for i in contractions_list:
        count = 0
        ...  # here is where you do your count, which you seem to know how to do
        ...  # note that in your code, I think you meant if i in word / if i == word for your final if statement
        if count: links[l][i] = count  # only adds the value if count is not 0
you would end up with a data structure like this:
links = {
    'file1': {
        "it's": 34,
        "they're": 14,
        ...,
    },
    'file2': {
        ....,
    },
    ...,
}
which you could easily iterate through to write the necessary data to your file (which I again assume you know how to do, since it's seemingly not part of the question)
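For completeness, a minimal sketch of that iteration, assuming the links structure above and a hypothetical output filename:
# Write the nested dict out as a tab-separated filename/contraction/count table.
with open('contraction_counts.txt', 'w') as out:
    out.write('filename\tcontraction\tcount\n')
    for filename, counts in links.items():
        for contraction, count in counts.items():
            out.write('{}\t{}\t{}\n'.format(filename, contraction, count))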
Dictionaries seem to be the best option here, because they will allow you easier manipulation of your data. Your goal should be indexing the results by the filename extracted from the link (the URL of the speech text) to a mapping of contraction and its count.
Something like:
{"file1": {"it's": 34, "they're": 13, "she's": 9},
"file2": {"it's": 14, "we're": 15},
"file3": {"it's": 4},
"file4": {"it's": 45, "she's": 13}}
Here's the full code:
ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c: 0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions
Let's go into each step.
First we initialize ret; it will be the resulting dictionary. Then we use a generator expression to run processURL_short() for each step (instead of going through the whole link list at once). It yields tuples of (<link-name>, <speech-text>) so we can use the link name later.
Next comes the contractions count mapping, initialized to 0s, which will be used to count contractions.
Then we split the text into words; for each word we look it up in the contractions mapping and, if found, count it; otherwise a KeyError is raised for each key not found. (Another answer here stated that this will perform poorly; another possibility is checking with in, like word in contractions.)
Finally:
ret[file_naming_code(link)] = contractions
Now ret is a dictionary mapping each filename to its contraction occurrences, and you can easily create your table. Here's how you would get your output:
print '\t'.join(('filename', 'contraction', 'count'))
for link, counts in ret.items():
    for name, count in counts.items():
        print '\t'.join((link, name, str(count)))
