Is there a foolproof way of matching two similar string sequences? - python

This is the CSV in question. I got this data from Extra History, a video series that covers many historical topics through multi-part mini-series, such as 'Rome: The Punic Wars' or 'Europe: The First Crusades' (in the CSV). Episodes in these mini-series are numbered #1 through #6 (though Justinian & Theodora has two runs, the first numbered #1-#6 and the second #7-#12).
I would like to do some statistical analysis on these mini-series, which entails sorting the numbered episodes (e.g. episodes #2-#6) into their appropriate series, so that the end result looks something like this; I can then easily automate sorting them into the appropriate Python lists.
My Python code matches the #2-#6 episodes to the #1 episode correctly about 99% of the time, with only one big error (in red) and one slight error (in yellow, because the first episode of that series is #7, not #1). However, I have the nagging feeling that there is an easier and more foolproof way, since the strings are well organized, with regular patterns in their names. Is that possible? And can I achieve it with my current code, or should I approach it from a different angle?
import csv
import re

eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    reader = csv.reader(f)
    data = list(reader)

series_first = []
series_rest = []
singles_music_lies = []
all_episodes = []  # names of all episodes

# Separates all videos into 3 non-overlapping lists: first episodes of a series,
# the other numbered episodes of the series, and singles/music/lies videos.
for video in data:
    all_episodes.append(video[0])
    # Need regex because a normal string search for #1 also matches (Justinian &) Theodora #10.
    if len(re.findall(r'\b1\b', video[0])) == 1:
        series_first.append(video[0])
    elif '#' not in video[0]:
        singles_music_lies.append(video[0])
    else:
        series_rest.append(video[0])
# Dice's Coefficient
# Taken from John Rutledge's answer (with NinjaMeTimbers' modification):
# https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
# ------------------------------------------------------------------------------------------
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]


def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
#-------------------------------------------------------------------------------------------
# Only take the first few words of each episode's name for comparison, because the first couple of
# words are almost always the name of the series. Can't make it too short, or words like 'the', 'of',
# etc. will get matched (now or in the future), or too long, because that increases the chance of a
# superfluous match. Does much better than without limiting to the first few words.
def first_three_words(name_string):
    # e.g. ''.join vs ' '.join
    first_three = ' '.join(name_string.split()[:5])   # --> 'The Haitian Revolution', slightly worse
    #first_three = ''.join(name_string.split()[:5])   # --> 'TheHaitianRevolution', slightly better
    return first_three


# Compare the given episode with all first episodes, and return a list of comparison scores.
def compared_with_first(episode, series_name=series_first):
    episode_scores = []
    for i in series_name:
        x = first_three_words(episode)
        y = first_three_words(i)
        #comparison_score = round(string_similarity(episode, i), 4)
        comparison_score = round(string_similarity(x, y), 4)
        episode_scores.append((comparison_score, i))
    return episode_scores
matches = []
# Go through video number 2, 3, 4, etc. in a series and compare them with the first episode
# of every series, then get a comparison score.
for episode in series_rest:
    scores_list = compared_with_first(episode)
    similarity_score = 0
    most_likely_match = []
    # Go through the list of comparison scores returned from compared_with_first,
    # then append the current episode / highest score / first episode to
    # most_likely_match; repeat for all non-first episodes.
    for score in scores_list:
        if score[0] > similarity_score:
            similarity_score = score[0]
            most_likely_match.clear()  # MIGHT HAVE BEEN THE CRUCIAL KEY
            most_likely_match.append((episode, score))
    matches.append(most_likely_match)

final_match = []
for i in matches:
    final_match.append((i[0][0], i[0][1][1], i[0][1][0]))

# Just to get the output in the desired presentation.
path = '/Users/Work/Desktop/'
with open('EH Sorting Episodes.csv', 'w', newline='', encoding='UTF-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for currentRow in final_match:
        csvwriter.writerow(currentRow)
        #print(currentRow)
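Edit: to make the 'regular patterns' idea concrete, here is a minimal sketch of what I have in mind (assuming every numbered episode title really does end in '#N', with the series name being everything before the '#'; the helper names are hypothetical and not part of my current script):

import re
from collections import defaultdict

def series_key(title):
    # Strip the trailing '#N' (and anything after it) and normalize case/whitespace.
    return re.sub(r'\s*#\d+.*$', '', title).strip().lower()

def group_by_series(titles):
    groups = defaultdict(list)
    for title in titles:
        if '#' in title:  # numbered episode
            groups[series_key(title)].append(title)
    return groups

# e.g. group_by_series(['Rome: The Punic Wars #1', 'Rome: The Punic Wars #2'])
# -> {'rome: the punic wars': ['Rome: The Punic Wars #1', 'Rome: The Punic Wars #2']}

Whether that shared-prefix assumption holds for every row of the CSV is exactly what I am unsure about.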

Related

Finding the last data number recorded in the text using Python

I have a txt file of data that is recorded daily. The program runs every day, records the data it receives from the user, and assigns a number to each entry, like this:
#1
data number 1
data
data
data
-------------
#2
data number 2
text
text
-------------
#3
data number 3
-------------
My problem is with numbering the data. When I run the program to record an entry in the txt file, it should find the number of the last recorded entry, add one to it, and record my data under that new number.
But I can't work out how to find the last data number.
I tried this: find "#" in the text, list all the numbers after the hashtags, and take the biggest one as the number of the last recorded entry.
record_list = []
text_file = open(r'test.txt', 'r')
line = text_file.read().splitlines()
for Number in line:
    hashtag = Number[Number.find('#')]
    if hashtag == '#':
        hashtag = Number[Number.find('#')+1]
        hashtag = int(hashtag)
        record_list.append(hashtag)
last_number = max(record_list)
But when I use hashtag = Number[Number.find('#')], even in the lines where there is no hashtag, it returns the first or last letters in that line as a hashtag.
And if the text file is empty, it gives the following error:
hashtag = Number[Number.find('#')]
~~~~~~^^^^^^^^^^^^^^^^^^
IndexError: string index out of range
How can I find the number of the last data and use it in saving the next data?
Consider:
>>> s = "hello world"
>>> s[s.find('#')]
'd'
>>> s.find('#')
-1
If '#' is not in the line, find returns -1, which, when used as an index, returns the last character.
We can use regular expressions and a list comprehension as one approach to solving this. Iterate over the lines, selecting only those which match the pattern of a numbered line. We'll match the number part, converting that to an int. We select the last one, which should be the highest number.
import re

with open('test.txt', 'r') as text_file:
    next_number = [
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ][-1] + 1
Or we can pass a generator expression to max to ensure we get the highest number.
with open('test.txt', 'r') as text_file:
    next_number = max(
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ) + 1
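Note that the question also mentions the case where the text file is empty (or has no numbered lines yet); the two snippets above would then raise an IndexError or a ValueError respectively. One way to handle that, assuming you want the first entry to become #1, is to give max a default:

import re

with open('test.txt', 'r') as text_file:
    next_number = max(
        (int(m.group(1))
         for x in text_file.read().splitlines()
         if (m := re.match(r'^\s*#(\d+)\s*$', x))),
        default=0,  # no numbered lines yet -> the next entry becomes #1
    ) + 1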

Write map to a csv in python

I am not sure if the title of this is right. I know it is not a list, and I am trying to collect the results into a dictionary, but it only keeps the last value of my loop.
I have pasted all my code, but my question is specifically about the candidates loop, where I am trying to get the percentage of votes per candidate. When I print the information it looks like this:
[screenshot of the printed election results]
As you can see, the third section of the results shows each candidate next to their percentage and total votes. This result is what I am not sure about (it is not a list and not a dictionary).
I am trying to write this to my output CSV; however, after trying many approaches I always end up writing only the last result, which is O'Tooley.
I am new at this, so I am not sure why, even though I save my percentage in a list after each loop iteration, I still end up saving only the percentage for O'Tooley. That's why I decided to print after each iteration; it was my only way to check that all the results look as in the picture.
import os
import csv

electiondatapath = os.path.join('../..', 'gt-atl-data-pt-03-2020-u-c', '03-Python', 'Homework', 'PyPoll', 'Resources', 'election_data.csv')

with open(electiondatapath) as csvelectionfile:
    csvreader = csv.reader(csvelectionfile, delimiter=',')
    # Read the header row first
    csv_header = next(csvelectionfile)

    # hold number of rows, which will be the total votes
    num_rows = 0
    # total votes per candidate
    totalvotesDic = {}
    # list to zip and write to csv
    results = []

    for row in csvreader:
        # total number of votes cast
        num_rows += 1
        # Check if the candidate is in the dictionary keys; if not, add the candidate and count one vote, else add 1 to their votes
        if row[2] not in totalvotesDic.keys():
            totalvotesDic[row[2]] = 1
        else:
            totalvotesDic[row[2]] += 1

    print("Election Results")
    print("-----------------------")
    print(f"Total Votes: {(num_rows)}")
    print("-----------------------")

    # get the percentage of votes and print the result next to the candidate and total votes
    for candidates in totalvotesDic.keys():
        #totalvotesDic[candidates].append("{:.2%}".format(totalvotesDic[candidates] / num_rows))
        candidates_info = candidates, "{:.2%}".format(totalvotesDic[candidates] / num_rows), "(", totalvotesDic[candidates], ")"
        print(candidates, "{:.2%}".format(totalvotesDic[candidates] / num_rows), "(", totalvotesDic[candidates], ")")

    # get the winner out of the candidates
    winner = max(totalvotesDic, key=totalvotesDic.get)

    print("-----------------------")
    print(f"Winner: {(winner)}")
    print("-----------------------")

    # append to the list to zip
    results.append("Election Results")
    results.append(f"Total Votes: {(num_rows)}")
    results.append(candidates_info)
    results.append(f"Winner: {(winner)}")

# zip list together
cleaned_csv = zip(results)

# Set variable for output file
output_file = os.path.join("output_Pypoll.csv")

# Open the output file
with open(output_file, "w") as datafile:
    writer = csv.writer(datafile)
    # Write in zipped rows
    writer.writerows(cleaned_csv)
In each iteration you overwrite the variable candidates_info with just one candidate's data. You need to accumulate the strings, as in this example:
candidates_info = ""
for candidates in totalvotesDic.keys():
candidates_info = '\n'.join([candidates_info, candidates + "{:.2%}".format(totalvotesDic[candidates] / num_rows) + "("+ str(totalvotesDic[candidates])+ ")"])
print(candidates_info)
# prints
# O'Tooley 0.00%(2)
# Someone 30.00%(..)
Also, you don't need keys(). Try this instead:
candidates_info = ""
for candidates, votes in totalvotesDic.items():
    candidates_info = '\n'.join([candidates_info, str(candidates) + "{:.2%}".format(votes / num_rows) + "(" + str(votes) + ")"])
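If the end goal is a CSV with one row per candidate (rather than one long string), a minimal sketch along the same lines, reusing totalvotesDic and num_rows from your code, is to collect a tuple per candidate and write those rows directly:

import csv

# totalvotesDic and num_rows are as computed in the question's code.
rows = [("Candidate", "Percentage", "Votes")]
for candidate, votes in totalvotesDic.items():
    rows.append((candidate, "{:.2%}".format(votes / num_rows), votes))

with open("output_Pypoll.csv", "w", newline="") as datafile:
    writer = csv.writer(datafile)
    writer.writerows(rows)  # one row per candidate, no zip() needed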

Conditional Probability of a List followed by another term NLTK

NO CODE NEEDED
I want the probability that, given a series of words, a particular word follows that series. I am currently working with nltk/python and was wondering if there is a simple function to do this, or if I need to code it myself by iterating through the text and counting all occurrences.
Thanks
You have to iterate over the whole text first and count the n-grams so that you can compute their probability given a preceding sequence.
Here is a very simple example:
import re
from collections import defaultdict, Counter

# Tokenize the text in a very naive way.
text = "The Maroon Bells are a pair of peaks in the Elk Mountains of Colorado, United States, close to the town of Aspen. The two peaks are separated by around 500 meters (one-third of a mile). Maroon Peak is the higher of the two, with an altitude of 14,163 feet (4317.0 m), and North Maroon Peak rises to 14,019 feet (4273.0 m), making them both fourteeners. The Maroon Bells are a popular tourist destination for day and overnight visitors, with around 300,000 visitors every season."
tokens = re.findall(r"\w+", text.lower(), re.U)


def get_ngram_mapping(tokens, n):
    # Add markers for the beginning and end of the text.
    tokens = ["[BOS]"] + tokens + ["[EOS]"]
    # Map a preceding sequence of n-1 tokens to a list
    # of following tokens. 'defaultdict' is used to
    # give us an empty list when we access a key that
    # does not exist yet.
    ngram_mapping = defaultdict(list)
    # Iterate through the text using a moving window
    # of length n.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i+n]
        preceding_sequence = tuple(window[:-1])
        following_token = window[-1]
        # Example for n=3: 'it is good' =>
        # ngram_mapping[("it", "is")] = ["good"]
        ngram_mapping[preceding_sequence].append(following_token)
    return ngram_mapping


def compute_ngram_probability(ngram_mapping):
    ngram_probability = {}
    for preceding, following in ngram_mapping.items():
        # Let's count which tokens appear right
        # behind the tokens in the preceding sequence.
        # Example: Counter(['a', 'a', 'b'])
        # => {'a': 2, 'b': 1}
        token_counts = Counter(following)
        # Next we compute the probability that
        # a token 'w' follows our sequence 's'
        # by dividing by the frequency of 's'.
        frequency_s = len(following)
        token_probability = defaultdict(float)
        for token, token_frequency in token_counts.items():
            token_probability[token] = token_frequency / frequency_s
        ngram_probability[preceding] = token_probability
    return ngram_probability


ngrams = get_ngram_mapping(tokens, n=2)
ngram_probability = compute_ngram_probability(ngrams)
print(ngram_probability[("the",)]["elk"])      # = 0.14285714285714285
print(ngram_probability[("the",)]["unknown"])  # = 0.0
I needed to solve the same issue. I used the nltk.ngrams() function to get the n-grams and then reshaped them into bigram-like pairs, because nltk.ConditionalFreqDist() requires pairs. Then I fed the results into nltk.ConditionalProbDist(). You can find example code below:
import nltk
from collections import defaultdict

# 'tokens', 'n' and 'topk' are assumed to be defined already (e.g. as in the previous answer).
ngram_prob = defaultdict(float)
ngrams_as_bigrams = []
ngrams_as_bigrams.extend([(t[:-1], t[-1]) for t in nltk.ngrams(tokens, n)])
cfd = nltk.ConditionalFreqDist(ngrams_as_bigrams)
cpdist = nltk.ConditionalProbDist(cfd, nltk.LidstoneProbDist, gamma=0.2, bins=len(tokens))
for (pre, follow) in ngrams_as_bigrams:
    all_st = pre + (follow,)
    ngram_prob[all_st] = cpdist[pre].prob(follow)
sorted_ngrams = [' '.join(k) for k, v in sorted(ngram_prob.items(), key=lambda x: x[1])[::-1]][:topk]

How do I quickly extract data from this massive csv file?

I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of the genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. The same scaffold and position can appear in different nuclei.
Given input for start and end positions (a scaffold and position for each), I'm supposed to output a CSV file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this with 16 columns (one per nucleus), showing the data from top to bottom. The leftmost column would be the reference genome in that range, which I access through a dictionary of its scaffolds.
In my code I have a defaultdict of lists: the key is a string combining the scaffold and the position, and the value is a list of lists, so that the data for each nucleus can be appended under the same location, and in the end each location has data from every nucleus.
Of course, this is very slow. How should I be doing it instead?
Code:
# let's plan this
# input is start and finish - when you hit the first, add it and keep going until you hit the next or larger
# dictionary of arrays
# loop through everything, output data for each nucleus

import csv
from collections import defaultdict

inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41,51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(lambda: defaultdict(dict))
scaffold = ''

for line in open('Allpaths_SL1_corrected.fasta', 'r'):
    if line[0] == '>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] = line.rstrip()
print('Genome dictionary done.')

with open('automated.csv', 'rt') as read:
    for line in csv.reader(read, delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3], line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count % 1000000 == 0:
            print('Checkpoint ' + str(count) + '!')

with open('region.csv', 'w') as fp:
    wr = csv.writer(fp, delimiter=',', lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0, 16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['', ''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',')+1:])-1], key, nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2GB and 77,822,354 lines. Of those lines, you seem to only be focused on 26,804,253 lines or about 1/3.
As a general suggestion, you can speed thing up by:
Avoid processing the data you are not interested in (2/3 of the file);
Speed up identifying the data you are interested in;
Avoid things that are repeated millions of times and tend to be slow (processing each line as csv, reassembling a string, etc.);
Avoid reading all data when you can break it up into blocks or lines (memory will get tight)
Use faster tools like numpy, pandas and pypy
Your data is block oriented, so you can use a FlipFlop-type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python in operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        rtr = True if self.state else False
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct, since I spent no time trying to understand your files overall..):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[2], row[3]), []).append([row[3], row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in cPython 2.7.
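As an illustration of the "use faster tools like numpy, pandas and pypy" point above, here is a minimal Python 3 sketch with pandas (the column positions and the idea of filtering by a simple scaffold/position range are assumptions on my part, not taken from your files): read the CSV in chunks so memory stays bounded, and keep only the rows in the wanted range.

import pandas as pd

scaffold, start_pos, end_pos = 'scaffold_41', 51335, 51457
wanted = []

# Read the large file in chunks; columns 1 and 2 are assumed to be scaffold and position.
for chunk in pd.read_csv('automated.csv', header=None, chunksize=1_000_000):
    mask = (chunk[1] == scaffold) & chunk[2].between(start_pos, end_pos)
    wanted.append(chunk[mask])

region = pd.concat(wanted, ignore_index=True)
region.to_csv('region.csv', index=False, header=False)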
Best

Better way to compare all items in a dataframe and replace similar items with fuzzy matching python

I'm wondering if there's a better way to compare all items in a dataframe column to each other and replace those items if they have a high fuzzy set matching score. I ended up using combinations, but my feeling is that this is memory intensive and inefficient. My code is below.
To clarify: the central question here is not the fuzzy matching aspect, but the aspect of comparing all items in a list to each other and then replacing those items that match.
newl = list(true_df2.Name.unique())

def remove_duplicate_names(newl, Name, origdf, namesave):
    """
    This function removes duplicate names. It replaces longer names with shorter names.
    It takes in (1) newl: a list of unique names, where generic words have already been stripped out.
    (2) Name: name of dataframe column
    (3) origdf: original dataframe that is being rewritten
    (4) namesave: name of saved matchedwords file, e.g. 'save1'. I created (4) because this file
    takes a long time to run.
    Returns a dataframe
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[Name].unique())
    if isinstance(newl, list):
        cnl = list(combinations(newl, 2))
    matchword = []
    for i in cnl:
        fp = fuzz.partial_ratio(i[0], i[1])
        if len(i[0]) > 3 and len(i[1]) > 3:
            if not i[0] == i[1]:
                #if i[0] or i[1] == 'York University':
                #    continue
                #I can edit these conditions to make matches more or less strict
                #higher values mean more strict
                #using more criteria with 'and' means more strict
                if fp >= 98:
                    shortstr = min(i, key=len)
                    longstr = max(i, key=len)
                    matchword.append((shortstr, longstr))
    for pair in matchword:
        #replace in each longstring spot, the shorter string
        print 'pair', pair
        print origdf[Name][origdf[Name].str.contains(pair[1])]
        #origdf[Name][origdf[Name].str.contains(pair[1])] = pair[0].strip()
        origdf.ix[origdf[Name].str.contains(pair[1]), 'Name'] = pair[0]
    return origdf
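One possible direction (a sketch, not a tested answer): iterate over combinations lazily instead of materializing list(combinations(...)), build a dict mapping each longer name to its shorter fuzzy match, and apply it once with Series.replace. Note this does an exact replacement rather than the substring-based .str.contains replacement in the question, so it only approximates the same behavior.

from itertools import combinations
from fuzzywuzzy import fuzz

def build_replacements(names, threshold=98):
    # Map each longer name to its shorter fuzzy-matched variant, lazily.
    replacements = {}
    for a, b in combinations(names, 2):  # generator, not list(...)
        if len(a) > 3 and len(b) > 3 and a != b:
            if fuzz.partial_ratio(a, b) >= threshold:
                replacements[max((a, b), key=len)] = min((a, b), key=len)
    return replacements

# Usage sketch with the dataframe from the question:
# mapping = build_replacements(true_df2.Name.unique())
# true_df2['Name'] = true_df2['Name'].replace(mapping)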
