I am trying to identify, from a text file, the set of words that appear at least some number of times within any single line of the file. I have a list to hold the qualifying words. The file is read line by line. For each line, the words that occur in it and their counts are put into a dictionary. Words with a count higher than the threshold are appended to the list. The code operating on a single line looks as follows (I pseudocoded some parts that don't pertain to the problem):
words = []
candidates = {}
for line in text:
    for word in line:  # pseudocode: word extraction is simplified here
        if word in candidates:
            candidates[word] += 1
        else:
            candidates[word] = 1
    for word in candidates:
        if candidates[word] > threshold:
            if word not in words:
                words.append(word)
    # candidates.clear()
At the end of each line, I was hoping to empty the dictionary so that it does not carry useless content forward. However, the line that is currently commented out, candidates.clear(), erases the content of the list and leaves only the qualifying words from the final line. When this line is removed, the output is correct.
Can someone please explain why this is happening? Does the append() method of the list class make a local copy of the data, or does it just keep a pointer? Does the dictionary's clear() method release not only the dict's references to the key-value pairs, but also other objects' references to them?
#EDIT: To address some of the comments: the word extraction in each line is pseudocode. I did not think this step was relevant to the problem. If you are interested, here's the original code: https://github.com/muyezhu/python/blob/master/freqword
This code looks for frequently occurring short pieces of DNA in a long sequence. Sample data can be downloaded at this link: http://rosalind.info/problems/1d/
Trying your linked code with the linked dataset shows that you're only getting one set of updates to kmers because the outermost for loop only runs once.
This is due to the range call you're using: range(0, len(genome) - L + 1, L - k). In the example data, len(genome) is 100, L is 75 and k is 5. That means your range is range(0, 26, 70), which yields only 0 (the next value would be 70, which is much greater than the upper bound of 26).
I'm pretty sure you don't want to give the L - k step argument to range. If you change the loop code to use range(len(genome) - L + 1), you get the expected results in kmers: ['CGACA', 'GAAGA', 'AATGT'].
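As a quick check of both points, here is a minimal sketch. The word counts and threshold are made-up stand-ins, not the data from the linked code. It shows that clearing the dictionary leaves the list alone (append() stores a reference to the same string object, and clear() only drops the dictionary's own references), and that the stepped range really produces a single start position.

```python
# Hypothetical counts and threshold, standing in for one processed line.
words = []
candidates = {"spam": 3, "eggs": 1}
threshold = 2

for word in candidates:
    if candidates[word] > threshold and word not in words:
        words.append(word)

# Emptying the dictionary does not touch the list: the list holds its own
# reference to the string "spam".
candidates.clear()
print(words)       # ['spam']
print(candidates)  # {}

# The range from the linked code, with len(genome)=100, L=75, k=5.
window_starts = list(range(0, 100 - 75 + 1, 75 - 5))
print(window_starts)                # [0]
print(len(range(100 - 75 + 1)))     # 26 start positions without the step
```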
I want to make a function that takes the path of a .txt file as input and returns a dictionary built according to specific rules. If the .txt file is empty,
the function should return nothing. For this function, I ask that no imports and no list comprehensions are used, only for/while loops and if statements.
This matches the material I am learning right now, and I would like to be able to follow and interpret the function step by step.
An example of a .txt file is below. The amount of lines can vary but every line is formatted such that they appear in the order:
word + a string of 3 numbers connected by commas.
terra,4,5,6
cloud,5,6,7
squall,6,0,8
terra,4,5,8
cloud,6,5,7
First I would like to break down the steps of the function.
Each component of the string that is separated by a comma serves a specific purpose:
The second-to-last number in the string is subtracted from the last number to form a value in the dictionary.
For example, the last two numbers of terra,4,5,6 are subtracted (6 - 5) to form a value of [1] in the dictionary.
The alphabetical words form the keys of the dictionary. If there are multiple entries of the same word in a .txt file, then a single key is formed
and it contains all the values of the duplicate entries.
For example, terra,4,5,6 , terra,4,4,6 , and terra,4,4,7 will output ('terra', 4):[1,2,3] as a key and value respectively.
However, for entries to be treated as duplicates, the first number after the word must also be the same. If it is not, the entries become separate keys.
For example, terra,4,5,6 and terra,5,4,6 will appear separately from each other in the dictionary as ('terra', 4):[1] and ('terra', 5):[2] respectively.
Example input
If we use the example .txt file mentioned above, the call should look like create_dict("***files/example.txt") and should output the dictionary
{('terra', 4):[1,3],('cloud', 5):[1],('squall', 6):[8],('cloud', 6):[2]}. I will add a link to the .txt file for the sake of recreating this example. (Note that *** is a placeholder for the rest of the path.)
What I'm Trying:
testfiles = (open("**files/example.txt").read()).split('\n')
int_list = []
alpha_list = []
for values in testfiles:
    ao = values.split(',') #returns only a portion of the list. why?
    for values in ao:
        if values.isnumeric():
            int_list.append(values) #retrieves list of ints from list
    for values in ao:
        if values.isalpha():
            alpha_list.append(values) #retrieves a list of words
{((alpha_list[0]), int(int_list[0])):(int(int_list[2])-(int(int_list[1])))} #each line will always have 3 number values so I used index
This returns {('squall', 6): 1}, which is mainly just a proof of concept and not a solution. I wanted to see whether it was possible to use the numbers and words I found in int_list and alpha_list, by index, to generate entries in the dictionary. If so, the same could be applied to the rest of the strings in the .txt file.
Your input is in CSV format.
You really should be using one of these
https://docs.python.org/3/library/csv.html#csv.reader
https://docs.python.org/3/library/csv.html#csv.DictReader
since "odd" characters within a comma-separated field
are non-trivial to handle.
Better to let the library worry about such details.
Using defaultdict(list) is the most natural way,
the most readable way, to implement your dup key requirement.
https://docs.python.org/3/library/collections.html#collections.defaultdict
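For comparison, a minimal sketch of what the library-based approach could look like. The io.StringIO object is only a stand-in for the opened file, and the names sample, grouped and result are chosen here for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the contents of example.txt from the question.
sample = io.StringIO(
    "terra,4,5,6\n"
    "cloud,5,6,7\n"
    "squall,6,0,8\n"
    "terra,4,5,8\n"
    "cloud,6,5,7\n"
)

# Missing keys start out as an empty list, so no "if key not in d" is needed.
grouped = defaultdict(list)
for word, a, b, c in csv.reader(sample):
    grouped[(word, int(a))].append(int(c) - int(b))

result = dict(grouped)
print(result)
```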
I know, I know, "no import";
now on to a variant solution.
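def create_dict(path):
    d = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing newline
            word, nums = line.split(',', maxsplit=1)
            a, b, c = map(int, nums.split(','))
            delta = c - b       # last number minus second-to-last
            key = (word, a)
            if key not in d:
                d[key] = []
            d[key].append(delta)
    return d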
I know this is not the place to ask for homework answers, but I just want some direction since I'm completely lost. I started a Python course at my college where the professor assumes no one has experience in the language. This is our first assignment:
Build on this code to create a subsequent loop that starts at W and takes a random walk of a particular maximum length specified as a variable. If there are multiple possible next states, choose one uniformly at random. If there are no next states, then you should stop the loop, even if you haven't reached your maximum length.
I understand that this code my professor provided is walking through each character of the string s, but I don't know how it's actually working. I don't get what c in enumerate(s) or next[c]=[] are doing. Any help explaining how this works or how to handle characters and strings in Python would be greatly appreciated. I've been coding in other languages for a few years and I have no idea how to even start this assignment.
next = {} # This will hold the directed graph
s = "Welcome to cs 477"
# This loops through all of the characters in s
# and keeps track of their indices
for i, c in enumerate(s):
    if not c in next:
        # If this is the first time seeing a particular
        # character, we need to make a new key/value pair for it
        next[c] = []
    if i < len(s)-1: # If there is a character after this
        # Record that s[i+1] is one of the following characters
        next[c].append(s[i+1])
enumerate(s) lets you iterate over a sequence (here, the string s) and get both each element and its position, as described in the Python documentation. In your case, the variable i holds the index of the current element, starting from 0, and c holds the current character. If you were to print i and c inside your for loop, the result would look like this:
for i, c in enumerate(s):
    print(i, c)
# 0 W
# 1 e
# 2 l
# ...
Regarding next[c] = [], you first need to understand that next is a dictionary, which is basically a hash table mapping keys to values. What next[c] = [] does is add the character c to the dictionary as a key, with an empty list as its value.
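For the random walk part of the assignment, one possible shape is sketched below. This is only a sketch under assumptions: the names graph, random_walk and max_length are made up here (graph replaces next to avoid shadowing the built-in), and "uniformly at random" is read as picking uniformly among the recorded followers, so a follower that was recorded twice is twice as likely.

```python
import random

s = "Welcome to cs 477"

# Rebuild the graph from the question under a different name.
graph = {}
for i, c in enumerate(s):
    if c not in graph:
        graph[c] = []
    if i < len(s) - 1:
        graph[c].append(s[i + 1])

def random_walk(graph, start, max_length):
    """Take at most max_length steps from start, stopping early at a dead end."""
    walk = [start]
    current = start
    for _ in range(max_length):
        followers = graph.get(current, [])
        if not followers:
            break  # no next state: stop even though steps remain
        current = random.choice(followers)
        walk.append(current)
    return "".join(walk)

print(random_walk(graph, "W", 10))
```

The output differs from run to run; with this particular string the walk always starts "We", because "e" is the only character recorded after "W".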
I was writing a Python program that prints an array built from user input, in the order the user entered each item. Unfortunately, I have had a few problems with that: once it repeated the first item twice with one of the sets, and with another set it put the last 2 items at the beginning.
I checked the array in the shell and it contained the right number of items in the right order, so I don't know what is going on. My script looks something like this:
i = 1
lines = []
for i in range (1, (leng + 1)):
    lines.append(input())
input() # The data entered here is not used; the input is a wait for the user to be ready.
i = 0
for i in range (0, (leng + 1)):
    print(lines[i - len(lines)])
My searches found nothing for my purposes (but then again, I may not have used the correct search terms, like in my last question).
Please answer or find a duplicate if existing. I'd like an answer.
Don't you just want this?
for line in lines:
    print(line)
EDIT
As an explanation of what's wrong with your code: you're looping one too many times (leng + 1 instead of leng). You're also indexing with i - len(lines), which works through negative indexing and is equivalent to i as long as i < len(lines); on the extra final iteration it becomes lines[0], so the first item is printed again. Another fix for your code could be:
for i in range(len(lines)):
    print(lines[i])
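To see the off-by-one in isolation, here is a small sketch with a made-up three-item list in place of the user input:

```python
# Hypothetical data standing in for the lines read with input().
lines = ["a", "b", "c"]
leng = len(lines)

# Original loop: the indices are -3, -2, -1 and then 0.
printed = []
for i in range(0, leng + 1):
    printed.append(lines[i - len(lines)])
print(printed)  # ['a', 'b', 'c', 'a']

# Corrected loop bounds: each item exactly once, in order.
fixed = [lines[i] for i in range(len(lines))]
print(fixed)    # ['a', 'b', 'c']
```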
SECOND EDIT
Rewriting your full code to what I think is the simplest, most idiomatic version:
# store leng lines
lines = [input() for _ in range(leng)]
# wait for user to be ready
input()
# print all the lines
for line in lines:
    print(line)
Okay, so I've annotated almost everything in my code, but I'm struggling slightly with annotating my for loop. I've got all of it except the first two lines, which I just don't know how to explain in a way that makes sense to anyone but me. It would be great if I could get some tips on this!
y = {} #
Positions = [] #
for i, word in enumerate (Sentence): #This starts a for loop to go through every word within the variable 'Sentence', which is a list.
    if not word in y: #This continues the code below if the word isn't found within the variable 'y'.
        y[word] = (i+1) #This gives the word that wasn't found within the variable 'y' its position plus 1, so that it doesn't confuse those unfamiliar with computer science counting from 0.
    Positions = Positions + [y[word]] #This sets the variable 'Positions' to the variables 'Positions' and '[y[word]]'.
If you're going to comment a variable, then the comment should explain what the variable contains (or, to be precise, since the purpose of the code is to populate these variables, what we intend the variable to contain) and/or what it's expected to be used for. Since we don't see this data used for anything, I'll stick to the former:
y = {} # dictionary mapping words to the (1-based) index of their first occurrence in the sentence
Positions = [] # list containing, for each position in the sentence, the (1-based) index of the first occurrence of the word at that position.
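To check those comments against actual behaviour, here is a small runnable sketch. The sample sentence is made up, since the question does not show how Sentence is built.

```python
# Hypothetical sample input.
Sentence = "the cat sat on the mat".split()

y = {}          # word -> 1-based index of its first occurrence
Positions = []  # for each position, the first-occurrence index of its word
for i, word in enumerate(Sentence):
    if word not in y:
        y[word] = i + 1
    Positions = Positions + [y[word]]

print(y)          # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 6}
print(Positions)  # [1, 2, 3, 4, 1, 6]
```

The second "the" maps back to 1, which is why 5 never appears in Positions.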
In one you are declaring a dictionary:
y = {} #
in another a list:
Positions = [] #
Dictionaries store objects under keys. Lists are ordered sequences of elements, accessed by position.
I have a 4x4 table of letters and I want to find all possible paths through it. They are candidates for being words. I have a problem with the variable "used". It is a list of all the cells the path has already visited, so that the path doesn't go there again. There should be one used list for every path, but it doesn't work correctly. For example, I had a test print that printed the current word and the used list. Sometimes the word had only one letter, but the path had gone through all 16 cells/indices.
The for loop of size 8 covers all possible directions. The main function executes the chase function 16 times, once for every possible starting point. The move function returns the index reached after moving in a specific direction, and is_allowed tests whether it is allowed to move in a certain direction.
sample input: oakaoastsniuttot. (4x4 table, where the first 4 letters are the first row, etc.)
sample output: all the real words from a word list that can be found in the table
In my case it might output one or two words but not nearly all, because it thinks some cells are used even though they are not.
def chase(current_place, used, word):
    used.append(current_place) #used === list of indices that have been used
    word += letter_list[current_place]
    if len(word)>=11:
        return 0
    for i in range(3,9):
        if len(word) == i and word in right_list[i-3]: #right_list === list of all words
            print(word)
            break
    for i in range(8):
        if is_allowed(current_place, i) and (move(current_place, i) not in used):
            chase(move(current_place, i), used, word)
The problem is that there's only one used list that gets passed around. You have two options for fixing this in chase():
Make a copy of used and work with that copy.
Before you return from the function, undo the append() that was done at the start.
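Both options can be sketched on a toy problem. Everything below is illustrative rather than the original program: the 2x2 grid, the neighbours helper (standing in for move and is_allowed) and the found list are assumptions made for this example.

```python
# Hypothetical 2x2 grid; cells are indexed 0..3 row by row.
letters = "abcd"
SIZE = 2

def neighbours(cell):
    """Return the indices of cells adjacent to cell, diagonals included."""
    row, col = divmod(cell, SIZE)
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr or dc) and 0 <= r < SIZE and 0 <= c < SIZE:
                result.append(r * SIZE + c)
    return result

def chase_copy(cell, used, word, found):
    """Option 1: every call works on its own copy of used."""
    used = used + [cell]  # builds a new list; the caller's list is untouched
    word += letters[cell]
    found.append(word)
    for nxt in neighbours(cell):
        if nxt not in used:
            chase_copy(nxt, used, word, found)

def chase_undo(cell, used, word, found):
    """Option 2: share one list, but undo the append before returning."""
    used.append(cell)
    word += letters[cell]
    found.append(word)
    for nxt in neighbours(cell):
        if nxt not in used:
            chase_undo(nxt, used, word, found)
    used.pop()  # backtrack so sibling paths see this cell as free again

paths_copy, paths_undo = [], []
for start in range(SIZE * SIZE):
    chase_copy(start, [], "", paths_copy)
    chase_undo(start, [], "", paths_undo)

print(len(paths_copy), len(paths_undo))  # 64 64
```

Copying is simpler to reason about but allocates a new list per call; undoing the append avoids the copies, at the cost of having to pop on every exit path of the function (including any early return).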