I need to write a program which looks for words with the same three middle characters (each word is 5 characters long) in a list, then writes them into a file like this:
wasdy
casde
tasdf
gsadk
csade
hsadi
Between the similar words I need to leave an empty line. I am kinda stuck.
Is there a way to do this? I use Python 3.2.
Thanks for your help.
I would use the itertools.groupby function for this. Assuming wordlist is a list of the words you want to group, this code does the trick.
import itertools

for k, v in itertools.groupby(wordlist, lambda word: word[1:4]):
    # here, k is the key the words are grouped by, i.e. word[1:4]
    # and v is an iterable of the words in the group
    for word in v:
        print(word)
    print()
itertools.groupby(wordlist, lambda word: word[1:4]) basically takes all the words and groups them by word[1:4], i.e. the three middle characters. Note that groupby only groups consecutive elements, so if your list isn't already ordered by those characters, sort it by the same key first. Here's the output of the above code with your sample data:
wasdy
casde
tasdf

gsadk
csade
hsadi
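Putting that together with the sorting caveat, here is a minimal sketch that sorts first and writes the groups to a file with an empty line between them (the filename grouped.txt is just a placeholder):

import itertools

wordlist = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']
middle = lambda word: word[1:4]  # the three middle characters

with open('grouped.txt', 'w') as f:
    # sort by the key first so each group is one consecutive run
    for k, group in itertools.groupby(sorted(wordlist, key=middle), middle):
        f.write('\n'.join(group))
        f.write('\n\n')  # empty line between groups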
To get you started: try using the built-in sorted function on the list of words, and for the key you should experiment with using a slice, e.g. word[1:4].
For example:
some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']
sorted(some_list, key = lambda x: sorted(x[1:4]))
# outputs ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi', 'other']
edit: It was unclear to me whether you wanted "same three middle characters, in order" or just "same three middle characters". If the latter, then you could look at sorted(some_list, key = lambda x: x[1:4]) instead.
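To see the difference between the two keys on a tiny example (an illustrative snippet):

some_list = ['wasdy', 'gsadk']
sorted(some_list, key = lambda x: x[1:4])          # keys 'asd' and 'sad' differ: two groups
sorted(some_list, key = lambda x: sorted(x[1:4]))  # both keys are ['a', 'd', 's']: one group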
Try:
from collections import defaultdict

dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
Then, to write to an output file:
with open('outfile.txt', 'w') as f:
    for key in dict_of_words:
        f.write('\n'.join(dict_of_words[key]))
        f.write('\n\n')  # blank line after each group
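With the sample words, the grouped dict comes out like this (a quick check):

from collections import defaultdict

list_of_words = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']
dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
print(dict(dict_of_words))
# {'asd': ['wasdy', 'casde', 'tasdf'], 'sad': ['gsadk', 'csade', 'hsadi']} (key order may vary)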
word_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']

def test_word(word):
    # True if 'a', 's' and 'd' all appear among the middle characters
    return all(x in word[1:4] for x in ['a', 's', 'd'])

f = open('yourfile.txt', 'w')
f.write('\n'.join(word for word in word_list if test_word(word)))
f.close()
yourfile.txt then contains:
wasdy
casde
tasdf
gsadk
csade
hsadi
This code reads words from a text file, sorts them, and then stores them in another text file.
# file contains words
file = open('/content/gdrive/MyDrive/Post_OCR_Classifictaion/Dict_try.txt').read().split()
# sorting order based on letters
letters = "abcçdefgğhıijklmnoöprsştuüvyz"
d = {i: letters.index(i) for i in letters}
# sort function
sorted_list = sorted(file, key=d.get)
# store after sorting in new file
textfile = open("/content/gdrive/MyDrive/Post_OCR_Classifictaion/Dict_try_sort.txt", "w")
for element in sorted_list:
    textfile.write(element + "\n")
textfile.close()
These are the words in the text file:
aço
çzb
ogğ
beg
zğe
öge
ğg
gaço
ogğ
But it gives an error:
Here:
sorted_list = sorted(file, key=d.get)
file is a list of words, whilst d is a dict whose keys are single letters. You first need to retrieve the first letter of each word and then look it up in the dict, for example using a lambda, i.e.
sorted_list = sorted(file, key=lambda word: d[word[0]])
You can convert your words to indexes:
def key(word):
    return tuple(d[x] for x in word)

sorted_list = sorted(file, key=key)
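A quick check with a few of the sample words (illustrative only):

letters = "abcçdefgğhıijklmnoöprsştuüvyz"
d = {i: letters.index(i) for i in letters}

def key(word):
    return tuple(d[x] for x in word)

print(sorted(["çzb", "aço", "beg"], key=key))
# ['aço', 'beg', 'çzb'] -- 'ç' sorts between 'c' and 'd' in this alphabet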
You should probably try to account for both upper and lower case, and also for any circumstance where a line in file either has a length of zero or starts with a character that is not in your list. Consider this possible solution:-
letters = "abcçdefgğhıijklmnoöprsştuüvyz"
letters = letters.upper() + letters
def myfunc(s):
try:
return letters.index(s[0])
except Exception:
pass
return -1
ml = ['ğc', 'öp', 'Rr', 'Po', 'aw', 'tp', 'çd']
mls = sorted(ml, key=myfunc)
print(mls)
The output of this would be:-
['Po', 'Rr', 'aw', 'çd', 'ğc', 'öp', 'tp']
I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).
Example of txt file:
1 i PRP
2 want VBP
3 to TO
4 go VB
I want to find the frequency of the word and tag as a pair in the list in order to find the most frequently assigned tag to a word.
Example of Results:
3 (food, NN), 2 (Brave, ADJ)
My idea is to start by opening the file from the folder, reading it line by line and splitting, setting up a counter using a dictionary, and printing from the most common to the least common in descending order.
My code is extremely rough (I'm almost embarrassed to post it):
file=open("/Users/Desktop/Folder1/trained.txt")
wordcount={}
for word in file.read().split():
from collections import Counter
c = Counter()
for d in dicts.values():
c += Counter(d)
print(c.most_common())
file.close()
Obviously, I'm getting no results. Anything will help. Thanks.
UPDATE:
So I got this code posted on here, which worked, but my results are kinda funky. Here's the code (the author removed it, so I don't know who to credit):
file=open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
if i[1:] in d.keys():
d[i[1:]] += 1
else:
d[i[1:]] = 1
print (sorted(d.items(), key=lambda x: x[1], reverse=True))
Here are my results:
[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
There seems to be a mix of good results with words and other results with spaces or random numbers. Does anyone know a way to remove the entries that aren't real words? Also, I know \t signifies a tab; is there a way to remove that as well? You guys really helped a lot.
You need to have a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, without checking every word to see if it is known.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("/Users/Desktop/Folder1/trained.txt") as file:
    for row in file:  # read one line into `row`
        if not row.strip():
            continue  # ignore empty lines
        pos, word, tag = row.split()
        counts[word.lower()][tag] += 1
That's it, you can now check the most common tag of any word:
print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
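And to list the most frequent tag for every word at once (a small follow-up sketch, assuming counts was built as above):

for word, tags in counts.items():
    tag, freq = tags.most_common(1)[0]
    print("%s: %s (%d)" % (word, tag, freq))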
If you don't mind using pandas, which is a great library for tabular data, I would do the following:
import pandas as pd
df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=" ", header=None, names=["position", "word", "tag"])
df["word_tag_counts"] = df.groupby(["word", "tag"]).transform("count")
Then if you only want the maximum one from each group you can do:
df.groupby(["word", "tag"]).max()["word_tag_counts"]
which should give you a table with the values you want.
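If what you ultimately want is the single most frequent tag per word, one way (a sketch building on the df above) is to count the pairs and keep the largest count per word:

pair_counts = df.groupby(["word", "tag"]).size().reset_index(name="count")
top_tags = pair_counts.sort_values("count", ascending=False).drop_duplicates("word")
print(top_tags)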
I made a function, read_dictionary, that builds a dictionary from the words in a file, where each value is the word's frequency. I'm now trying to make another function that prints the top three words in the file by count, so it looks like this:
"The top three words in fileName are:
word : number
word : number
word : number"
This is what my code is:
def top_three_by_count(fileName):
    freq_words = sorted(read_dictionary(f), key = read_dictionary(f).get,
                        reverse = True)
    top_3 = freq_words[:3]
    print top_3

print top_three_by_count(f)
You can use collections.Counter.
from collections import Counter

def top_three_by_count(fileName):
    return [i[0] for i in Counter(read_dictionary(fileName)).most_common(3)]
Actually, you don't need the read_dictionary() function at all. The following code snippet does the whole thing for you.
with open('demo.txt', 'r') as f:
    print Counter([i.strip() for i in f.read().split(' ')]).most_common(3)
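To reproduce the exact wording asked for, something like this should work (a sketch, assuming your read_dictionary function behaves as described):

from collections import Counter

def top_three_by_count(fileName):
    print "The top three words in %s are:" % fileName
    for word, count in Counter(read_dictionary(fileName)).most_common(3):
        print "%s : %d" % (word, count)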
I am trying to make my own lemmatizer for Spanish in Python 2.7, using a lemmatization dictionary.
I would like to replace all of the words in a certain text with their lemma form. This is the code that I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()
lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val
txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is an example dictionary file containing the lemmatized forms used to replace the words in the input (my_text_lower). The example dictionary is a tab-separated, two-column file in which column 1 represents the values and column 2 represents the keys to match.
ExampleDictionary
flojo floja
flojo flojas
flojo flojos
cargamento cargamentos
cargante cargantes
decepción decepciones
decepcionante decepcionantes
decentar decenté
decentar decentéis
decentar decentemos
decentar decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase, as listed in my_text within the code), my actual output currently is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong with the code.
It seems to be replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and then replacing it.
For instance, this is the result that I am getting when I use the entire dictionary (more than 50,000 entries). This problem does not happen with my small example dictionary, only when I use the complete dictionary, which makes me think that perhaps it is double-"replacing" at some point?
Is there a pythonic technique that I am missing and can incorporate into this code to make my search-and-replace more precise, so that it identifies full words for replacement rather than chunks and/or does NOT make any double replacements?
Because you use text.replace, there's a chance that you'll still be matching a sub-string, and the text will get processed again. It's better to process one input word at a time and build the output string word by word.
I've switched your key-value pairs the other way around (because you want to look up the right-hand form and find the word on the left), and I mainly changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result
my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()
lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
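One thing to note about this version: result is built by concatenating tokens with spaces, so the output starts with a space and punctuation comes out space-separated (e.g. "cargante ." rather than "cargante."). Also, in Python 2 the \w class only matches ASCII letters by default, so accented words like decenté get split apart; if you decode the text to unicode, passing re.UNICODE is one fix (a hypothetical tweak):

input = re.findall(r"[\w']+|[.,!?;]", text, re.UNICODE)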
I see two problems with your code:
it will also replace words if they appear as part of a bigger word
by replacing words one after the other, you could replace (parts of) words that have already been replaced
Instead of that loop, I suggest using re.sub with word boundaries \b to make sure that you replace complete words only. This way, you can also pass a callable as a replacement function.
import re

def replace_all(text, dic):
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)
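A quick check with just the ASCII words from the sample (sidestepping the Python 2 unicode caveat mentioned above):

dic = {'cargantes': 'cargante', 'decepcionantes': 'decepcionante'}
print replace_all('flojo y cargantes. decepcionantes.', dic)
# flojo y cargante. decepcionante.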
My problem is to replace strings in a text file with other strings. The key strings are in a dict called word_list. I've tried the following; nothing seems to work. It prints out the sentence in document.txt as it appears, with no replacement:
word_list = {'hi' : 'test', 'how' : 'teddy'}

with open("document.txt") as main:
    words = main.read().split()

replaced = []
for y in words:
    replacement = word_list.get(y, y)
    replaced.append(replacement)

text = ' '.join(word_list.get(y, y) for y in words)
print text

new_main = open("done.txt", 'w')
new_main.write(text)
new_main.close()
Content of document.txt:
hi you, how is he?
Current output is the same as document.txt when it should be:
test you, teddy is he?
Any solutions/ help would be appreciated :)
As you seem to want to replace words, this will use a more natural definition of 'word':
import re

word_list = {'hi' : 'test', 'how' : 'teddy'}

with open('document.txt') as main, open('done.txt', 'w') as done:
    text = main.read()
    done.write(re.sub(r'\b\w+\b', lambda x: word_list.get(x.group(), x.group()), text))
word_list = {'hi' : 'test', 'how' : 'teddy'}

with open("document.txt") as main:
    with open('done.txt', 'w') as new_main:
        input_data = main.read()
        for key, value in word_list.iteritems():
            input_data = input_data.replace(key, value)
        new_main.write(input_data)
This will read the entire contents of the file (not the most efficient if it's a large file), then iterate over the search-and-replace items in your dictionary, calling replace on the input text. Once complete, it writes the data out to your new file.
Some things to remember with this approach:
if your input file is large, it will be slow
your search pattern can also match word fragments, i.e. hi will match which, so you should cater for that too (see the sketch below)
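For instance, the word-boundary regex from the other answer avoids the fragment problem, since the hi inside which is not a whole word (a minimal sketch):

import re

word_list = {'hi': 'test', 'how': 'teddy'}
text = 'hi you, which way how?'
print re.sub(r'\b\w+\b', lambda m: word_list.get(m.group(), m.group()), text)
# test you, which way teddy?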