Create a dictionary from a text file - python

Alright, I am trying to create a dictionary from a text file, where each key is a single lowercase character and each value is a list of the words from the file that start with that letter.
The text file contains one lowercase word per line, e.g.:
airport
bathroom
boss
bottle
elephant
Desired output:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']}
I haven't got a lot done really; I'm just confused about how I would get the first character from each line, set it as the key, and append the values. I would really appreciate it if someone could help me get started.
words = {}
for line in infile:
    line = line.strip()  # not sure if this line is correct

So let's examine your example:
words = {}
for line in infile:
    line = line.strip()
This is a good start. Now you want to do something with each line. You'll probably need the first character, which you can access through line[0]:
first = line[0]
Then you want to check whether the letter is already in the dict. If not, you can add a new, empty list:
if first not in words:
    words[first] = []
Then you can append the word to that list:
words[first].append(line)
And you're done!
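Putting those steps together, a minimal sketch (the filename words.txt is an assumption):
words = {}
with open('words.txt') as infile:
    for line in infile:
        line = line.strip()
        if not line:                 # skip blank lines
            continue
        first = line[0]              # first character is the key
        if first not in words:
            words[first] = []        # start a new list for unseen letters
        words[first].append(line)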
If the lines are already sorted like in your example file, you can also make use of itertools.groupby, which is a bit more sophisticated:
from itertools import groupby
from operator import itemgetter
with open('infile.txt', 'r') as f:
    words = {k: list(map(str.strip, g)) for k, g in groupby(f, key=itemgetter(0))}
You can also sort the lines first, which makes this method generally applicable:
groupby(sorted(f), ...)
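A fuller sketch of that variant (stripping before sorting so the sort and grouping keys are clean; the filename is an assumption):
from itertools import groupby
from operator import itemgetter

with open('infile.txt') as f:
    lines = sorted(line.strip() for line in f if line.strip())
    # group consecutive words by their first character
    words = {k: list(g) for k, g in groupby(lines, key=itemgetter(0))}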

defaultdict from the collections module is a good choice for this kind of task:
>>> import collections
>>> words = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...     lines = [l.strip() for l in f if l.strip()]
...
>>> lines
['airport', 'bathroom', 'boss', 'bottle', 'elephant']
>>> for word in lines:
...     words[word[0]].append(word)
...
>>> print words
defaultdict(<type 'list'>, {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']})
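(The print words line shows this is a Python 2 session; in Python 3 it would be print(words).)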

Related

How do I count objects/substrings in a very specifically formatted file?

I have a file formatted this way -
{'apple': 4, 'orange': 3, 'peach': 1}
{}
{'apple': 1, 'banana': 1}
{'peach': 1}
{}
{}
{'pear': 3}
...
[10k more lines like this]
I want to create a new text file to store the total count of each of these fruits/objects like this -
apple:110
banana:200
pineapple:50
...
How do I do this?
My attempt: I've tried using Python (If this is confusing, please skip it) -
f = open("fruits.txt","r")
lines = f.readlines()
f.close()
g = open("number_of_fruits.txt","a")
for line in lines: #Iterating through every line,
for character in "{}'": #Removing extra characters,
line = line.replace(character, "")
for i in range(0,line.count(":")): #Using the number of colons as a counter,
line = line[ [m.start() for m in re.finditer("[a-z]",line)][i] : [m.start() for m in re.finditer("[0-9]",line)][i] + 1 ] #Slice the line like this - line[ith time I detect any letter : ith time I detect any number + 1]
#And then somehow store that number in temp, slicing however needed for every new fruit
#Open a new file
#First look if any of the fruits in my line already exist
#If they do:
#Convert that sliced number part of string to integer, add temp to it, and write it back to the file
#else:
#Make a newline entry with the object name and the sliced number from line.
The number of functions in Python is very overwhelming to begin with. And at this point I'm just considering using C, which is already a terrible idea.
Avoid using eval.
I would opt for treating it as JSON if you can ensure the formatting will be as above.
import json
from collections import Counter
with open('fruits.txt') as f:
    counts = Counter()
    for line in f.readlines():
        counts.update(json.loads(line.replace("'", '"')))
If you want the output as defined above:
for fruit, count in counts.items():
    print(f"{fruit}:{count}")
Updated Answer
Based on @DarryIG's literal_eval suggestion in the comments, which removes the need for JSON entirely.
from ast import literal_eval
from collections import Counter
with open('fruits.txt') as f:
    counts = Counter()
    for line in f.readlines():
        counts.update(literal_eval(line))
You can use the standard library's literal_eval to evaluate each line as a dictionary:
from ast import literal_eval
from collections import Counter

with open("input.txt", 'r') as inputFile:
    counts = Counter()
    for line in inputFile:
        a = literal_eval(line)
        counts.update(Counter(a))

print(dict(counts))
output:
{'apple': 5, 'orange': 3, 'banana': 1, 'peach': 2, 'pear': 3}
using defaultdict and json
import json
from collections import defaultdict
result = defaultdict(int)
with open('fruits.txt') as f:
    for line in f:
        data = json.loads(line.replace("'", '"'))
        for fruit, num in data.items():
            result[fruit] += num

print(result)
output
defaultdict(<class 'int'>, {'apple': 5, 'orange': 3, 'peach': 2, 'banana': 1, 'pear': 3})
EDIT: I would recommend using @BenjaminRowell's answer (I upvoted it). I will keep this one here for reference.
EDIT2 (22 May 2020): If the file used double quotes instead of single quotes, this would be the ndjson/jsonlines format (there is an interesting discussion on the relationship between the two). You can use the ndjson or jsonlines packages to process it, e.g.:
import ndjson
from collections import Counter
with open('sample.txt') as f:
    # if the file used double quotes, you could do:
    # data = ndjson.load(f)
    # because it uses single quotes, read the whole file and replace the quotes
    data = f.read()

data = ndjson.loads(data.replace("'", '"'))

counts = Counter()
for item in data:
    counts.update(item)

print(counts)

How to remove duplicate text from two different files in python

My problem: I have two files, "text1.txt" and "text2.txt"
"Text1.txt" contains the following:
Banana, rotten
Apple, ripe
Cheese, fresh
and "Text2.txt" contains the following:
Banana, good
Dragon, edible
Cheese, nice
What I want is code that checks text2.txt against text1.txt and removes the whole line whenever the word before the comma appears in both files. So, in this case, it would look like this:
"Text1.txt" is changed to the following, and Text2.txt is left unchanged:
Apple, ripe
What I managed to do is check whether the words are duplicates, ignoring the comma, but I struggled even with that. My attempt is below:
New_food = open("text1.txt", "r+")
All_food = open("text2.txt")
food = New_food.readlines()
food2 = All_food.readlines()

# The following calculates how many lines the text file has
def file_len(fname):
    with open(fname) as s:
        for t, l in enumerate(s):
            pass
    return t + 1

# calculates line number
n = file_len("text1.txt")
m = file_len("text2.txt")

for g in range(n):
    food_r = food[g]
    for j in range(m):
        food2_r = food2[j]
        if food_r == food2_r:
            print(5)  # only when they match
I managed to split the line before the comma using this piece of code:
word = "cheese , fresh"
type_, *vals = word.split(',')
print(type_) #this would return cheese
I rewrote some of your code into the following script:
file1 = open("text1.txt", "r+")
file2 = open("text2.txt")

# Lists of lines from the files
food_list_1 = file1.readlines()
food_list_2 = file2.readlines()

# Food names that occur in file 2
file_2_only_foods = list()
for line in food_list_2:
    file_2_only_foods.append(line.split(',')[0])

def determine(x):
    food_type = x.split(',')[0]
    return food_type in file_2_only_foods

result = [x for x in food_list_1 if not determine(x)]

file1.close()
file1 = open("text1.txt", 'w')
file1.writelines(result)
This puts all of file 2's food names into the file_2_only_foods list, which is then used to check whether each value from file 1 also occurs in file 2.
In order to write the file, we have to close it first and then reopen it in write mode. The result from my code is exactly what you described.
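As a side note, the close-and-reopen dance can be avoided with context managers; here is a minimal sketch of the same logic using with (same file names as in the question):
with open("text1.txt") as f1, open("text2.txt") as f2:
    food_list_1 = f1.readlines()
    # set of food names that occur in file 2
    file_2_foods = {line.split(',')[0] for line in f2}

# keep only the lines from file 1 whose food name is not in file 2
result = [x for x in food_list_1 if x.split(',')[0] not in file_2_foods]

with open("text1.txt", 'w') as f1:
    f1.writelines(result)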
If there are no duplicates within each file, you could go through both files and add all elements to a Counter (https://docs.python.org/2/library/collections.html), and then on a second pass remove all elements that have a count larger than 1.
>>> from collections import Counter
>>> food1 = open("Text1.txt")
>>> food2 = open("Text2.txt")
>>> counter1 = Counter(item.split(",")[0] for item in food1.readlines())
>>> counter2 = Counter(item.split(",")[0] for item in food2.readlines())
>>> counter = counter1 + counter2
>>> counter
Counter({'Cheese': 2, 'Banana': 2, 'Apple': 1, 'Dragon': 1})
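The second pass is left implicit in the answer; a minimal sketch of it (re-reading Text1.txt, since the file object above has already been consumed):
with open("Text1.txt") as food1:
    result = [line for line in food1 if counter[line.split(",")[0]] == 1]
# result: ['Apple, ripe\n'] (trailing newline depends on the file)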
You can use regular expressions to extract the words from the text. Regular expressions reference: https://docs.python.org/3/library/re.html
You can extract all the first words from a file with this one-liner:
re.findall(r"^\s*(\w+)", file.read(), re.MULTILINE)
Demo:
>>> txt = """
... Banana, rotten
... Apple, ripe
... Cheese, fresh
... """
>>>
>>> re.findall(r"^\s*(\w+)", txt, re.MULTILINE)
['Banana', 'Apple', 'Cheese']
>>>
The function below extracts all the words to filter on, then efficiently filters the target file line by line.
>>> import re
>>> def filter_lines(filter_path, target_path, output_path):
...
...     with open(filter_path, 'r') as filter_file, \
...          open(target_path, 'r') as target_file, \
...          open(output_path, 'w+') as output_file:
...
...         filter_words = re.findall(r"^\s*(\w+)",
...                                   filter_file.read(),
...                                   re.MULTILINE)
...         filter_words = set(filter_words)
...
...         for line in target_file:
...             m = re.findall(r"^\s*(\w+)", line)
...             if not (m and m[0] in filter_words):
...                 output_file.write(line)
>>>
>>> filter_lines('text2.txt', 'text1.txt', 'filtered_text1.txt')
>>>
Side note: in cases where you need to check many items for membership against a large collection, as in if item in long_list:, a set is much better than a list because lookups are fast (constant-time hashing); with lists, lookups iterate over the items until a match is found.
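A small illustration of the difference (the sizes and names here are arbitrary):
long_list = ['word%d' % i for i in range(100000)]
long_set = set(long_list)   # one-time conversion cost

'word99999' in long_list    # O(n): scans the list item by item
'word99999' in long_set     # O(1) on average: a single hash lookup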

How would I index words on each line

Write a function named lineIndex that takes a file name, fName,
as a parameter and returns a dictionary, d, that indexes the
words in fName by line number, with the first line in fName
being numbered 0.
Each word in fName should be a key in the returned dictionary d,
and the corresponding value should be a list of the line numbers
on which the word occurs. A line number should occur no more
than one time in a given list of line numbers.**
I tried numerous ways but couldn't find a solution. Here is what I have accomplished; I am not sure how to avoid repeated words.
def lineindex(fname):
    ifile = open(fname, 'rt')
    readfile = ifile.readlines()
    d = {}
    fst = []
    for line in readfile:
        #print(readfile[0][0])
        #print(readfile.index(line))
        #print(line)
        split = line.split()
        for word in split:
            if word not in d:
                d[word] = line.index(word)
            else:
                return d
Sample input
I have no pride
I have no shame
You gotta make it rain
Make it rain rain rain
Correct output
{'rain': [2, 3], 'gotta': [2], 'make': [2], 'it': [2, 3], 'shame': [1], 'I': [0,1], 'You': [2], 'have': [0, 1], 'no': [0,1], 'Make': [3], 'pride': [0]}
Edit 2:
def lineindex(fname):
    ifile = open(fname, 'rt')
    readfile = ifile.readlines()
    d = {}
    for line in readfile:
        #print(line, readfile.index(line))
        words = line.split()
        for word in words:
            #print(word, readfile.index(line))
            if word not in d:
                d[word] = readfile.index(line)
            else:
                return d
You're close. What you need is a set: it can only contain one copy of each element, so it handles the repeated words for you. You also left the line numbers out of your code, so look at enumerate for that. Then you can look at collections.defaultdict, which creates a default value for keys that don't exist yet.
from collections import defaultdict

def lineindex(fname):
    dd = defaultdict(list)
    with open(fname) as fin:
        for lineno, line in enumerate(fin):
            for word in set(line.split()):
                dd[word].append(lineno)
    return dd
Purely using builtins, then:
def lineindex(fname):
    dd = {}
    with open(fname) as fin:
        for lineno, line in enumerate(fin):
            for word in set(line.split()):
                dd.setdefault(word, []).append(lineno)
    return dd
A version without imports:
def lineindex(fname):
    d = {}
    with open(fname) as fobj:
        for lineno, line in enumerate(fobj):
            for word in set(line.split()):
                d.setdefault(word, []).append(lineno)
    return d
>>> lineindex('sample.txt') == out  # out holds the expected dict from the question
True
You can use the setdefault method of dictionaries. It looks up the key and returns the value if the key is there. If it cannot find the key, it inserts the default (here a new list) and returns it, so the list can be appended to immediately.
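For example:
d = {}
d.setdefault('rain', []).append(2)  # key missing: inserts [] and returns it
d.setdefault('rain', []).append(3)  # key present: returns the existing list
# d == {'rain': [2, 3]}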

how to omit the less frequent words from a dictionary in python?

I have a dictionary. I want to omit the words with a count of 1 from the dictionary. How can I do it? Any help? And I want to extract the bigram model of the remaining words. How can I do that?
import codecs

file = codecs.open("Pezeshki339.txt", 'r', 'utf8')
txt = file.read()
txt = txt[1:]
token = txt.split()

count = {}
for word in token:
    if word not in count:
        count[word] = 1
    else:
        count[word] += 1

for k, v in count.items():
    print(k, v)
I could edit my code as follows. But there is still a question: how can I create the bigram matrix and smooth it using the add-one method? I would appreciate any suggestion that fits my code.
import nltk
from collections import Counter
import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
    for line in file:
        token = line.split()

spl = 80 * len(token) / 100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))

cn = Counter(train)
known_words = [word for word, v in cn.items() if v > 1]  # removes the rare words and puts the rest in a list
print(known_words)
print(len(known_words))

bigram = nltk.bigrams(known_words)
frequency = nltk.FreqDist(bigram)
for f in frequency:
    print(f, frequency[f])
Use a Counter dict to count the words, then filter the .items(), removing keys that have a value of 1:
from collections import Counter
import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as f:
    cn = Counter(word for line in f for word in line.split())

print(dict((word, v) for word, v in cn.items() if v > 1))
If you just want the words, use a list comprehension:
print([word for word, v in cn.items() if v > 1])
You don't need to call read; you can split each line as you go. Also, if you want to remove punctuation, you need to strip it:
from string import punctuation
cn = Counter(word.strip(punctuation) for line in file for word in line.split())
import collections

c = collections.Counter(['a', 'a', 'b'])  # Just an example - use your words
[w for (w, n) in c.items() if n > 1]
Padraic's solution works great. But here is a solution that can just go underneath your code, instead of rewriting it completely:
newdictionary = {}
for k, v in count.items():
    if v != 1:
        newdictionary[k] = v
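For the add-one (Laplace) smoothing part of the question, which none of the answers address, here is a minimal sketch; it assumes known_words is the filtered word sequence from the question's code, and everything else is illustrative:
from collections import Counter

bigrams = list(zip(known_words, known_words[1:]))  # adjacent word pairs
bigram_counts = Counter(bigrams)
unigram_counts = Counter(known_words)
V = len(unigram_counts)  # vocabulary size

def smoothed_prob(w1, w2):
    # add-one smoothing: every possible bigram gets a pseudo-count of 1
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)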

how can I read from a file and append each word to a dictionary?

What I want to do is read from a file and then, for each word, add it to a dictionary along with its number of occurrences.
example:
'today is sunday. tomorrow is not sunday.'
my dictionary would then be this:
{'today': 1, 'is': 2, 'sunday': 2, 'tomorrow': 1, 'not': 1}
The way I'm going about it is to use readline and split to create a list, and then add each element and its count to an empty dictionary, but it's not really working so far. Here's what I have, although it's incomplete:
file = open('any_file.txt', 'r')
for line in file.readline().split():
    for i in range(len(line)):
        new_dict[i] = line.count(i)  # I'm getting an error here as well, saying that
return new_dict                      # I can't convert int to str implicitly
The problem with this is that when my dictionary updates as each line is read, the value of a word won't accumulate. So if 'sunday' occurred 3 more times on another line, my dictionary would contain {'sunday': 3} instead of {'sunday': 5}. Any help? I have no idea where to go from here and I'm new to all of this.
You are looking for collections.Counter.
E.g.:
from collections import Counter
from itertools import chain

with open("file.txt") as file:
    counts = Counter(chain.from_iterable(line.split() for line in file))
(This also uses an itertools.chain.from_iterable() generator expression.)
Note that your example only reads the first line; I presume this wasn't intentional. This solution counts across the whole file (it's trivial to change that).
Here is a simple version that doesn't deal with punctuation:
from collections import Counter

counter = Counter()
with open('any_file.txt') as file:
    for line in file:
        for word in line.split():
            counter[word] += 1
This can also be written as:
from collections import Counter

with open('any_file.txt') as file:
    counter = Counter(word for line in file for word in line.split())
Here's one way to solve the problem using a plain dict:
counter = {}
with open('any_file.txt') as file:
    for line in file:
        for word in line.split():
            if word not in counter:
                counter[word] = 1
            else:
                counter[word] += 1
Try this:
file = open('any_file.txt', 'r')
myDict = {}
for line in file:
    lineSplit = line.split()
    for x in range(len(lineSplit)):
        if lineSplit[x] in myDict:
            myDict[lineSplit[x]] += 1
        else:
            myDict[lineSplit[x]] = 1
file.close()
print(myDict)
Are you using Python 2.7+ or 3? If so, use Counter from the collections library:
import re
from collections import Counter

words = re.findall(r'\w+', open('any_file.txt').read().lower())
Counter(words).most_common(10)
Note that most_common returns a list of tuples, though. It should be easy for you to turn the list of tuples into a dictionary.
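For example:
top_ten = dict(Counter(words).most_common(10))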
