How would I index words on each line - python

Write a function named lineIndex that takes a file name, fName,
as a parameter and returns a dictionary, d, that indexes the
words in fName by line number, with the first line in fName
being numbered 0.
Each word in fName should be a key in the returned dictionary d,
and the corresponding value should be a list of the line numbers
on which the word occurs. A line number should occur no more
than one time in a given list of line numbers.
I tried numerous ways but couldn't find a solution. Here is what I have accomplished so far; I am not sure how to remove repeating words.
def lineindex(fname):
    ifile = open(fname, 'rt')
    readfile = ifile.readlines()
    d = {}
    fst = []
    for line in readfile:
        #print(readfile[0][0])
        #print(readfile.index(line))
        #print(line)
        split = line.split()
        for word in split:
            if word not in d:
                d[word] = line.index(word)
            else:
                return d
Sample input
I have no pride
I have no shame
You gotta make it rain
Make it rain rain rain
Correct output
{'rain': [2, 3], 'gotta': [2], 'make': [2], 'it': [2, 3], 'shame': [1], 'I': [0, 1], 'You': [2], 'have': [0, 1], 'no': [0, 1], 'Make': [3], 'pride': [0]}
Edit 2:
def lineindex(fname):
    ifile = open(fname, 'rt')
    readfile = ifile.readlines()
    d = {}
    for line in readfile:
        #print(line, readfile.index(line))
        words = line.split()
        for word in words:
            #print(word, readfile.index(line))
            if word not in d:
                d[word] = readfile.index(line)
            else:
                return d

You're close. What you need is a set: a set holds each element at most once, so it will deduplicate the repeated words on a line for you. You also left the line numbers out of your code, so look at enumerate for that. Then you can look at collections.defaultdict, which creates a default value for keys that don't exist yet.
from collections import defaultdict

def lineindex(fname):
    dd = defaultdict(list)
    with open(fname) as fin:
        for lineno, line in enumerate(fin):
            for word in set(line.split()):
                dd[word].append(lineno)
    return dd
Purely using builtins, then:
def lineindex(fname):
    dd = {}
    with open(fname) as fin:
        for lineno, line in enumerate(fin):
            for word in set(line.split()):
                dd.setdefault(word, []).append(lineno)
    return dd

A version without imports:
def lineindex(fname):
    d = {}
    with open(fname) as fobj:
        for lineno, line in enumerate(fobj):
            for word in set(line.split()):
                d.setdefault(word, []).append(lineno)
    return d
>>> lineindex('sample.txt') == out  # out is the expected dictionary from the question
True
This uses the setdefault method of dictionaries. It looks for the key and returns the value if the key is there. If it cannot find the key, it inserts and returns a new empty list that can be appended to immediately.
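If setdefault is new to you, its behaviour is easy to see in isolation (a toy example, not from the question):

```python
d = {}
# Key missing: setdefault stores [] under 'rain' and returns it, so append works.
d.setdefault('rain', []).append(2)
# Key present: the existing list is returned and the default [] is ignored.
d.setdefault('rain', []).append(3)
print(d)  # {'rain': [2, 3]}
```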

Related

How do I count objects/substrings in a very specifically formatted file?

I have a file formatted this way -
{'apple': 4, 'orange': 3, 'peach': 1}
{}
{'apple': 1, 'banana': 1}
{'peach': 1}
{}
{}
{'pear': 3}
...
[10k more lines like this]
I want to create a new text file to store the total count of each of these fruits/objects like this -
apple:110
banana:200
pineapple:50
...
How do I do this?
My attempt: I've tried using Python (If this is confusing, please skip it) -
f = open("fruits.txt", "r")
lines = f.readlines()
f.close()
g = open("number_of_fruits.txt", "a")
for line in lines:  # Iterating through every line,
    for character in "{}'":  # Removing extra characters,
        line = line.replace(character, "")
    for i in range(0, line.count(":")):  # Using the number of colons as a counter,
        # Slice the line like this - line[ith time I detect any letter : ith time I detect any number + 1]
        line = line[[m.start() for m in re.finditer("[a-z]", line)][i] : [m.start() for m in re.finditer("[0-9]", line)][i] + 1]
        # And then somehow store that number in temp, slicing however needed for every new fruit
        # Open a new file
        # First look if any of the fruits in my line already exist
        # If they do:
        #     Convert that sliced number part of string to integer, add temp to it, and write it back to the file
        # else:
        #     Make a newline entry with the object name and the sliced number from line.
The number of functions in Python is very overwhelming to begin with. And at this point I'm just considering using C, which is already a terrible idea.
Avoid using eval.
I would opt for treating it as JSON if you can ensure the formatting will be as above.
import json
from collections import Counter

with open('fruits.txt') as f:
    counts = Counter()
    for line in f:
        counts.update(json.loads(line.replace("'", '"')))
If you want the output as defined above:
for fruit, count in counts.items():
    print(f"{fruit}:{count}")
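Since the question asked for the totals in a new text file rather than on stdout, the same loop can write them out. A minimal sketch, with a stand-in Counter in place of the one built from fruits.txt above (the output file name is taken from the question):

```python
from collections import Counter

# Stand-in for the Counter built from fruits.txt above.
counts = Counter({'apple': 5, 'orange': 3, 'peach': 2})

with open('number_of_fruits.txt', 'w') as out:
    for fruit, count in counts.items():
        out.write(f"{fruit}:{count}\n")
```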
Updated Answer
Based on @DarryIG's literal_eval suggestion in the comments, which removes the need for JSON.
from ast import literal_eval
from collections import Counter

with open('fruits.txt') as f:
    counts = Counter()
    for line in f:
        counts.update(literal_eval(line))
You can use the standard library's ast.literal_eval to evaluate each line as a dictionary:
from ast import literal_eval
from collections import Counter

with open("input.txt", 'r') as inputFile:
    counts = Counter()
    for line in inputFile:
        a = literal_eval(line)
        counts.update(Counter(a))
print(dict(counts))
output:
{'apple': 5, 'orange': 3, 'banana': 1, 'peach': 2, 'pear': 3}
Using defaultdict and json:

import json
from collections import defaultdict

result = defaultdict(int)
with open('fruits.txt') as f:
    for line in f:
        data = json.loads(line.replace("'", '"'))
        for fruit, num in data.items():
            result[fruit] += num
print(result)
output
defaultdict(<class 'int'>, {'apple': 5, 'orange': 3, 'peach': 2, 'banana': 1, 'pear': 3})
EDIT: I would recommend using @BenjaminRowell's answer (I upvoted it). I will keep this one just for brevity.
EDIT 2 (22 May 2020): If the file used double quotes instead of single quotes, this would be the ndjson/jsonlines format (there is an interesting discussion of the relationship between the two). You can use the ndjson or jsonlines packages to process it, e.g.:
import ndjson
from collections import Counter

with open('sample.txt') as f:
    # if the file used double quotes, you could do:
    # data = ndjson.load(f)
    # because it uses single quotes, read the whole file and replace the quotes
    data = f.read()
    data = ndjson.loads(data.replace("'", '"'))

counts = Counter()
for item in data:
    counts.update(item)
print(counts)

How to Remove duplicate lines from a text file and the unique related to this duplicate

How can I remove duplicate lines from a file, and also the original line that the duplicates match?
Example:
Input file:
line 1 : Messi , 1
line 2 : Messi , 2
line 3 : CR7 , 2
I want the output file to be:
line 1 : CR7 , 2
(I want to keep only "CR7 , 2": delete the duplicate lines and also the original they duplicate.)
Whether lines count as duplicates depends on the first field: if two lines match on the first field, I want to delete both.
How do I do this in Python? With this code, what would I need to edit:
lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
What is the best way to do this job?
Have you tried Counter?
This works for example:
import collections
a = [1, 1, 2]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)
Output: [2]
Or with a longer example:
import collections
a = [1, 1, 1, 2, 4, 4, 4, 5, 3]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)
Output: [2, 5, 3]
EDIT:
Since you don't have a list at the beginning, there are two ways, depending on the file size: use the first for small enough files (otherwise you might run into memory problems), or the second for large files.
Read the file as a list and use the previous answer:

import collections

lines = [line for line in open(infilename)]
out = [k for k, v in collections.Counter(lines).items() if v == 1]
with open(outfilename, 'w') as outfile:
    for o in out:
        outfile.write(o)
The first line reads your file completely into a list, which means really large files would be loaded into memory. If your files are too large for that, you can go ahead and use a sort of "blacklist":
Using a blacklist:

lines_seen = set()  # holds lines already seen
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen and line not in blacklist:  # not a duplicate
        lines_seen.add(line)
    else:
        lines_seen.discard(line)
        blacklist.add(line)
for l in lines_seen:
    outfile.write(l)
outfile.close()
Here you add all lines to the set and only write the set to the file at the end. The blacklist remembers every line that occurred more than once, so those lines are not written even once. You can't read and write in one pass, because you don't know whether the same line will appear again later. With further information (e.g. duplicate lines always appearing consecutively) you could do it differently.
EDIT 2
If you want to do it depending on the first part, you need to remember which line a first field came from, so that the stored original can be dropped when a duplicate shows up:

first_to_line = {}  # maps the first field to the line it came from
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    first = line.split(',')[0]
    if first not in first_to_line and first not in blacklist:  # not a duplicate
        first_to_line[first] = line
    else:
        first_to_line.pop(first, None)  # drop the stored original as well
        blacklist.add(first)
print(len(first_to_line))
for l in first_to_line.values():
    outfile.write(l)
outfile.close()
P.S.: So far I have just been adding code; there might be a better way, for example with a dict:
For example with a dict:
lines_dict = {}
for line in open(infilename, 'r'):
    key = line.split(',')[0]
    if key not in lines_dict:
        lines_dict[key] = [line]
    else:
        lines_dict[key].append(line)
with open(outfilename, 'w') as outfile:
    for key, value in lines_dict.items():
        if len(value) == 1:
            outfile.write(value[0])
Given your input you can do something like this:
seen = {}  # key maps to the line it was first seen on
double_seen = set()
with open('input.txt') as f:
    for line in f:
        _, key = line.split(':')
        key = key.strip()
        if key not in seen:  # have not seen this yet?
            seen[key] = line  # then add it to the dictionary
        else:
            double_seen.add(key)  # else we have seen this more than once

# Now we can just write back to a different file
with open('output.txt', 'w') as f2:
    for key in set(seen.keys()) - double_seen:
        f2.write(seen[key])
Input I used:
line 1 : Messi
line 2 : Messi
line 3 : CR7
Output:
line 3 : CR7
Note this solution assumes Python 3.7+, since it relies on dictionaries preserving insertion order.

Making python dictionary from a text file with multiple keys

I have a text file named file.txt with some numbers like the following :
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
1 90 4.914E-08 1.563E-07 5.193E-07
2 2 9.254E-07 5.166E-06 9.723E-06
2 3 1.366E-06 -5.184E-06 7.580E-06
2 4 2.966E-06 5.979E-07 9.702E-08
2 5 5.254E-07 0.166E-02 9.723E-06
3 23 1.366E-06 -5.184E-03 7.580E-06
3 24 3.244E-03 5.239E-04 9.002E-08
I want to build a Python dictionary where the first number in each row is the key, the second number is always ignored, and the last three numbers are the values. But in a dictionary a key cannot be repeated, so when I run my code (attached at the end of the question), what I get is
'1' : ['90', '4.914E-08', '1.563E-07', '5.193E-07']
'2' : ['5', '5.254E-07', '0.166E-02', '9.723E-06']
'3' : ['24', '3.244E-03', '5.239E-04', '9.002E-08']
All the other numbers are removed, and only the last row is kept as the values. What I need is to have all the numbers against a key, say 1, to be appended in the dictionary. For example, what I need is :
'1' : ['8.106E-08', '2.052E-08', '3.837E-08', '-4.766E-09', '9.003E-08', '4.812E-07', '4.914E-08', '1.563E-07', '5.193E-07']
Is it possible to do it elegantly in python? The code I have right now is the following :
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        diction[pa[0]] = pa[1:]

or, equivalently:

with open('file.txt') as f:
    diction = {pa[0]: pa[1:] for pa in map(str.split, f)}
You can use a defaultdict.

from collections import defaultdict

data = defaultdict(list)
with open("file.txt", "r") as f:
    for line in f:
        line = line.split()
        data[line[0]].extend(line[2:])
Try this:
from collections import defaultdict

diction = defaultdict(list)
with open("file.txt") as f:
    for line in f:
        key, _, *values = line.strip().split()
        diction[key].extend(values)

print(diction)
This is a solution for Python 3, because the statement a, *b = tuple1 is invalid in Python 2. Look at the solution from @cha0site if you are using Python 2.
Make the value of each key in diction a list and extend that list on each iteration. As your code is written now, the line diction[pa[0]] = pa[1:] overwrites the value of diction[pa[0]] every time the key reappears, which explains the behavior you're seeing.
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        try:
            diction[pa[0]].extend(pa[2:])  # skip the second column
        except KeyError:
            diction[pa[0]] = pa[2:]
In this code each value of diction will be a list. In each iteration if the key exists that list will be extended with new values from pa giving you a list of all the values for each key.
To do this in a very simple for loop:
with open('file.txt') as f:
    return_dict = {}
    for item_list in map(str.split, f):
        if item_list[0] not in return_dict:
            return_dict[item_list[0]] = []
        return_dict[item_list[0]].extend(item_list[2:])  # skip the second column
Or, if you wanted to use defaultdict in a one-liner-ish way:

from collections import defaultdict

with open('file.txt') as f:
    return_dict = defaultdict(list)
    [return_dict[item_list[0]].extend(item_list[2:]) for item_list in map(str.split, f)]

how can I read from a file and append each word to a dictionary?

What I want to do is read from a file and then, for each word, append it to a dictionary along with its number of occurrences.
example:
'today is sunday. tomorrow is not sunday.'
my dictionary would then be this:
{'today': 1, 'is': 2, 'sunday': 2, 'tomorrow': 1, 'not': 1}
The way I'm going about it is to use readline and split to create a list, and then append each element and its value to an empty dictionary, but it's not really working so far. Here's what I have so far, although it's incomplete:
file = open('any_file.txt', 'r')
for line in file.readline().split():
    for i in range(len(line)):
        new_dict[i] = line.count(i)  # I'm getting an error here as well, saying that
return new_dict                      # I can't convert int to str implicitly
The problem with this is that as each line is read, the value for a word doesn't accumulate across lines. So if 'sunday' occurred 3 times on another line, my dictionary would contain {'sunday': 3} instead of {'sunday': 5}. Any help? I have no idea where to go from here and I'm new to all of this.
You are looking for collections.Counter.
e.g.:

from collections import Counter
from itertools import chain

with open("file.txt") as file:
    counts = Counter(chain.from_iterable(line.split() for line in file))

(This uses an itertools.chain.from_iterable() generator expression too.)
Note that your example only reads the first line; I presume this wasn't intentional, so this solution counts across the whole file (it is trivial to swap that around).
Here is a simple version that doesn't deal with punctuation:

from collections import Counter

counter = Counter()
with open('any_file.txt', 'r') as file:
    for line in file:
        for word in line.split():
            counter[word] += 1

It can also be written like this:

from collections import Counter

with open('any_file.txt') as file:
    counter = Counter(word for line in file for word in line.split())
Here's one way to solve the problem using a plain dict:

counter = {}
with open('any_file.txt', 'r') as file:
    for line in file:
        for word in line.split():
            if word not in counter:
                counter[word] = 1
            else:
                counter[word] += 1
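The if/else can also be collapsed with dict.get, which returns a supplied default when the key is missing. A small sketch on the sentence from the question rather than a file:

```python
counter = {}
for word in 'today is sunday. tomorrow is not sunday.'.split():
    word = word.strip('.')  # drop the trailing periods from the example sentence
    # get returns the current count, or 0 if the word hasn't been seen yet
    counter[word] = counter.get(word, 0) + 1
print(counter)  # {'today': 1, 'is': 2, 'sunday': 2, 'tomorrow': 1, 'not': 1}
```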
Try this:

file = open('any_file.txt', 'r')
myDict = {}
for line in file:
    for word in line.split():
        if word in myDict:
            myDict[word] += 1
        else:
            myDict[word] = 1
file.close()
print(myDict)
Do you use Python 3 or Python 2.7? If so, use Counter from the collections library:

import re
from collections import Counter

words = re.findall(r'\w+', open('any_file.txt').read().lower())
print(Counter(words).most_common(10))

This gives you a list of tuples, though. It should be easy for you to turn the list of tuples into a dictionary.
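Turning that list of (word, count) tuples into a dictionary is a one-liner, since dict() accepts an iterable of pairs. A small sketch on an inline string instead of a file:

```python
from collections import Counter

words = 'today is sunday tomorrow is not sunday'.split()
pairs = Counter(words).most_common()  # list of (word, count) tuples, highest count first
print(dict(pairs))  # {'is': 2, 'sunday': 2, 'today': 1, 'tomorrow': 1, 'not': 1}
```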

Create a dictionary from text file

Alright, well, I am trying to create a dictionary from a text file so that each key is a single lowercase character and each value is a list of the words from the file that start with that letter.
The text file contains one lowercase word per line, e.g.:
airport
bathroom
boss
bottle
elephant
Output:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e':['elephant']}
Haven't got a lot done really; I'm just confused about how I would get the first character of each line, set it as the key, and append the values. Would really appreciate it if someone could help me get started.
words = {}
for line in infile:
    line = line.strip()  # not sure if this line is correct
So let's examine your example:
words = {}
for line in infile:
    line = line.strip()
This looks good for a beginning. Now you want to do something with the line. Probably you'll need the first character, which you can access through line[0]:
first = line[0]
Then you want to check whether the letter is already in the dict. If not, you can add a new, empty list:
if first not in words:
    words[first] = []
Then you can append the word to that list:
words[first].append(line)
And you're done!
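Put together, the steps above form one small function (the name words_by_letter and reading via a file name are illustrative choices, not from the question):

```python
def words_by_letter(fname):
    words = {}
    with open(fname) as infile:
        for line in infile:
            line = line.strip()
            if not line:        # skip blank lines
                continue
            first = line[0]     # the first character is the key
            if first not in words:
                words[first] = []
            words[first].append(line)
    return words
```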
If the lines are already sorted like in your example file, you can also make use of itertools.groupby, which is a bit more sophisticated:
from itertools import groupby
from operator import itemgetter

with open('infile.txt', 'r') as f:
    words = {k: [w.strip() for w in g] for k, g in groupby(f, key=itemgetter(0))}
You can also sort the lines first, which makes this method generally applicable:
groupby(sorted(f), ...)
defaultdict from the collections module is a good choice for this kind of task:
>>> import collections
>>> words = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
... lines = [l.strip() for l in f if l.strip()]
...
>>> lines
['airport', 'bathroom', 'boss', 'bottle', 'elephant']
>>> for word in lines:
... words[word[0]].append(word)
...
>>> print words
defaultdict(<type 'list'>, {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']})
