I want to find the 20 most common names, and their frequency, in a country.
Let's say I have lists of all residents' first names in 100 cities, and each list can contain a lot of names. Concretely, say we are talking about 100 lists, each with 1,000 strings.
What is the most efficient method to get the 20 most common names, and their frequencies, in the entire country?
This is the direction I started with, assuming each city's list is a text file in the same directory:
Use the pandas and collections modules.
Iterate through each city's .txt file and read it into a string. Then turn it into a Counter from the collections module, and from the counter's dict into a DataFrame.
Union each DataFrame with the previous one.
Then group the DataFrame by name and aggregate the counts.
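A rough sketch of that plan (assuming each city file sits in the working directory and holds one comma-separated line of names; the delimiter is an assumption):

import os
import pandas as pd
from collections import Counter

frames = []
for path in (p for p in os.listdir('.') if p.endswith('.txt')):
    with open(path) as f:
        names = f.read().split(',')
    # Per-city counts: Counter -> one-column DataFrame indexed by name
    frames.append(pd.DataFrame.from_dict(Counter(names), orient='index', columns=['count']))

# "Union" the per-city frames, then group by name and sum the counts
combined = pd.concat(frames)
top20 = combined.groupby(level=0)['count'].sum().nlargest(20)
print(top20)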
But I suspect this method might not scale, as the combined DataFrame can get too big.
I'd appreciate any advice on that. Thank you.
Here is some sample code:
import os
from collections import Counter

# Collect every city file in the current directory
cities = [i for i in os.listdir(".") if i.endswith(".txt")]

d = Counter()
for file in cities:
    with open(file) as f:
        # Adjust the line below to match how the names are delimited in your files
        data = f.read().split(",")
    d.update(Counter(data))

# The 20 most common names and their frequencies across all files
out = d.most_common(20)
print(out)
You can also use the NLTK library; I used the code below for a similar purpose.
from nltk import FreqDist

fd = FreqDist(text)           # text: an iterable of the names to count
top_20 = fd.most_common(20)   # it's done, you've got the top 20 tokens :)
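Here, text needs to hold every name in the country. For example, assuming the same comma-separated city files as above, it could be built like this:

import os
from nltk import FreqDist

# Gather all names from every city file into one flat list (comma-delimited files assumed)
text = []
for path in (p for p in os.listdir('.') if p.endswith('.txt')):
    with open(path) as f:
        text.extend(f.read().split(','))

fd = FreqDist(text)
top_20 = fd.most_common(20)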
I have two Excel sheets and am building a small program to compare two columns from these sheets to find the differences. The problem is that, since most of these inputs are entered manually, there are a lot of spelling errors, which should be ignored. The program should highlight new or deleted data.
I was reading about fuzzy text matching and found this code online (link), but its output just generates a CSV with the exact same entries (not what I wanted). I'll still add it here so you have an idea of what I'm talking about.
from __future__ import division
import numpy as np
import pandas as pd
from collections import Counter
import collections
from fuzzywuzzy import fuzz
import time
from two_lists_similarity import Calculate_Similarity as cs

# The first file
book_old = pd.read_excel(r' #Input file here', sheet_name = '#Sheet Name Here')
# Selecting the column I want to compare
data_old = book_old.iloc[7:, 2].tolist()

# Second file to compare with
book_new = pd.read_excel(r'#source here', sheet_name = '#Sheet name')
data_new = book_new.iloc[7:, 2].tolist()  # selecting the column

inp_list = data_old
ref_list = data_new

# This is what I picked up online because I couldn't do it myself.
# The plan is to iterate over the lists and find entries that are different, ignoring spelling.
# Create an instance of the class (csObj is an object of the Calculate_Similarity class).
csObj = cs(inp_list, ref_list)
csObj.fuzzy_match_output(output_csv_name = 'pkg_sim_test_vsc.csv', output_csv_path = r'#Output path')
What you need is probably some function that computes how different two strings are.
Turns out there are already some algorithms for that! Check out the Damerau–Levenshtein distance, which seems to be the closest to your use-case. From Wikipedia:
Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
However, this will only pick up on simple typos and is prone to false positives, so you may want to combine it with some other mechanism.
There are Python implementations of this algorithm on the web (see here or there).
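As an illustration, here is a small, unoptimized sketch of the restricted ("optimal string alignment") variant of that distance; the function name and test strings are only examples:

def osa_distance(a, b):
    # Dynamic-programming table: d[i][j] is the distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("recieve", "receive"))   # 1 -- a single transposition
print(osa_distance("motor", "mptor"))       # 1 -- a single substitution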
Alternatively, feel free to check out some other algorithms like:
the Hamming distance
the Levenshtein distance
For a little background, this is the CSV file I'm starting with (the data is nonsensical and only used as a proof of concept):
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Luke,Wallace,luke.wallace#lycos.com,test,
David,Wright,david.wright#hotmail.com,test,
Nathaniel,Butler,nathaniel.butler#aol.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
What I am attempting to do with this CSV on a larger scale is: if the first value in a row is duplicated, take the extra data from the second entry, append it to the row with the first instance of that value, and remove the duplicate row. For example, in the data above "Eli" appears twice; the first instance has "test" after the email value, while the second instance has no value there but does have a value in the next column over.
I would want it to go from this:
Eli,Simpson,noah.simpson#hotmail.com,test,,
Eli,Mitchell,eli.mitchell#aol.com,,test2
To this:
Eli,Simpson,noah.simpson#hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv

# Use a raw string so the backslashes in the Windows path are not treated as escapes
f = open(r'C:\Projects\Python\Test.csv', 'r')
csv_f = csv.reader(f)

test_list = []
for row in csv_f:
    test_list.append(row[0])

print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas, you could use the .drop_duplicates() method. An example would look something like this:
import pandas as pd

csv_f = pd.read_csv(r'C:\a file with addresses')
# Keep the first occurrence; assign the result back (or pass inplace=True)
csv_f = csv_f.drop_duplicates(subset=['thing_to_drop'], keep='first')
See the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I am kind of a newbie in Python as well, but I would suggest using csv.DictReader and treating the file as a sequence of dictionaries, meaning every row is a dictionary.
This way you can iterate through the names easily.
Second, I would suggest building a list of names you have already seen as you iterate through the file, so you can check whether a name is already known, for example:
name_list.append("eli")
Then check with if "eli" in name_list:
and, when it matches, add the extra key/value to the first occurrence; a sketch of this is a few lines below.
I don't know if this is best practice so don't roast me guys, but this is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well.
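A minimal sketch of that idea (the field names and output file name are made up, since the sample CSV has no header row):

import csv

fieldnames = ['first', 'last', 'email', 'val1', 'val2']  # hypothetical column names
merged = {}  # first name -> the first row seen for that name

with open('Test.csv', newline='') as f:
    for row in csv.DictReader(f, fieldnames=fieldnames):
        first = row['first']
        if first not in merged:
            merged[first] = row
        else:
            # Duplicate name: copy any values the first row left empty, then drop this row
            for key, value in row.items():
                if value and not merged[first][key]:
                    merged[first][key] = value

with open('merged.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writerows(merged.values())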
Here is a helpful link for reading about csv handling.
Programming noob here, trying to use sent_tokenize to split text into separate sentences. While it appears to be working (in the console, making each sentence its own list item), when I append it to an empty list I end up with a list (well, a list of lists of lists, given the syntax) of len 1 that I cannot iterate through. Basically, I want to be able to extract each individual sentence so that I can compare it with something, e.g. the string "Summer is great." There may be a better way to accomplish this, but please try to give me a simple solution, because noob. I imagine there is a flag at the end of every sentence I could use to append sentences one at a time, so pointing me to that might be enough.
I've reviewed the documentation and tried adding the following code, but still end up with my listz being of length 1, rather than broken into individual sentences.
import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize

listz = []
s = "Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks."
listz.append([word_tokenize(t) for t in sent_tokenize(s)])
print(listz)
Expected output:
listz = [["Good muffins cost $3.88 in New York."],
         ["Please buy me two of them."], ["Thanks."]]
You should use extend:
listz.extend([word_tokenize(t) for t in sent_tokenize(s)])
But in this case, simple assignment works:
listz = [word_tokenize(t) for t in sent_tokenize(s)]
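If the end goal is just to compare whole sentences against a string like "Summer is great.", it may be simpler to skip word_tokenize and keep the sentences as strings; a rough sketch:

sentences = sent_tokenize(s)
for sentence in sentences:
    # Compare each sentence directly with the target string
    if sentence == "Summer is great.":
        print("Found it")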
I have a text file with only one column containing text content. I want to find the top 3 most frequent items and the 3 least frequent items. I have tried some of the solutions in other posts, but I am not able to get what I want. I tried finding the mode as shown below, but it just outputs all the rows. I also tried using Counter and its most_common function, but they do the same thing, i.e. print all the rows in the file. Any help is appreciated.
# My Code
import pandas as pd
df = pd.read_csv('sample.txt')
print(df.mode())
You can use Python's built-in Counter.
from collections import Counter

# Read the file directly into a Counter (one item per line)
with open('file') as f:
    cnts = Counter(l.strip() for l in f)

# Display the 3 most common lines
print(cnts.most_common(3))

# Display the 3 least common lines
print(cnts.most_common()[-3:])
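Since the question already loads the file with pandas, value_counts gives the same answer; a sketch assuming sample.txt has a single column and no header row:

import pandas as pd

# header=None so the first line counts as data rather than a column name
df = pd.read_csv('sample.txt', header=None, names=['item'])
counts = df['item'].value_counts()

print(counts.head(3))   # 3 most frequent items
print(counts.tail(3))   # 3 least frequent items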
I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entity in List A, I'd like to compute a Levenshtein-style similarity score against each entity in List B. The record in List B with the highest score will be the group to which I assign that data point.
I'm very rusty in Python and am playing around with FuzzyWuzzy to get the distance between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop for each, but like I said, I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not FuzzyWuzzy), I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like
from fuzzywuzzy import process
from collections import defaultdict

complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

group = defaultdict(list)
for name in complicated_names:
    # extractOne returns a (best_match, score) tuple; [0] keeps just the matched string
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that supplies a default value for missing keys (here, an empty list).
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
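To inspect the result you could print the groups; with the toy lists above the matches would most likely come out as follows (the exact output depends on fuzzywuzzy's scorer, so treat it as an illustration):

for generic, matched in group.items():
    print(generic, '->', matched)
# couch -> ['leather couch']
# screwdriver -> ['left-handed screwdriver']
# peeler -> ['tomato peeler']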