How to remove duplicate text from two different files in Python

My problem: I have two files, "text1.txt" and "text2.txt"
"Text1.txt" contains the following:
Banana, rotten
Apple, ripe
Cheese, fresh
and "Text2.txt" contains the following:
Banana, good
Dragon, edible
Cheese, nice
What I want is code that checks text1.txt against text2.txt and removes every line of text1.txt whose word before the comma also appears in text2.txt. So, in this case, "Text1.txt" would be changed to the following, and "Text2.txt" would be left unchanged:
Apple, ripe
What I have managed to do so far is check whether the lines are duplicates, without taking the comma into account, and I struggled even with that. My attempt is below:
New_food = open("text1.txt", "r+")
All_food = open("text2.txt")
food = New_food.readlines()
food2 = All_food.readlines()
# The following calculates how many lines the text file has
def file_len(fname):
    with open(fname) as s:
        for t, l in enumerate(s):
            pass
    return t + 1
# calculates line number
n = file_len("text1.txt")
m = file_len("text2.txt")
for g in range(n):
    food_r = food[g]
    for j in range(m):
        food2_r = food2[j]
        if food_r == food2_r:
            print(5)  # only when they match
I can cut a line off before it reaches the comma using this piece of code:
word = "cheese , fresh"
type_, *vals = word.split(',')
print(type_) #this would return cheese

I rewrote some of your code into the following script:
file1 = open("text1.txt", "r+")
file2 = open("text2.txt")
# Lists of lines from both files
food_list_1 = file1.readlines()
food_list_2 = file2.readlines()
file2.close()
# Food names (the text before the comma) found in text2.txt
file_2_only_foods = list()
for line in food_list_2:
    file_2_only_foods.append(line.split(',')[0])
def determine(x):
    # True if the food name of line x also occurs in text2.txt
    food_type = x.split(',')[0]
    return food_type in file_2_only_foods
result = [x for x in food_list_1 if not determine(x)]
file1.close()
# Reopen text1.txt in write mode to replace its contents
file1 = open("text1.txt", 'w')
file1.writelines(result)
file1.close()
This puts the food names from text2.txt into the file_2_only_foods list, which is then used to check whether the values from list 1 are unique or not.
In order to write the file, we have to close the previously opened file and then reopen it in write mode to write the results. The result from my code is exactly what you described.
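For comparison, here is a minimal sketch of the same idea using a set for the lookups and with blocks so the files are closed automatically. The names drop_shared and rewrite_file are only illustrative, and the sketch assumes that the key is whatever comes before the first comma, with surrounding whitespace ignored.

```python
def drop_shared(target_lines, filter_lines):
    """Return the target lines whose key (text before the first comma)
    does not occur as a key in filter_lines."""
    # Collect the keys of the filter file once; set lookups are fast.
    keys = {line.split(",")[0].strip() for line in filter_lines if line.strip()}
    return [line for line in target_lines
            if line.split(",")[0].strip() not in keys]


def rewrite_file(target_path="text1.txt", filter_path="text2.txt"):
    """Rewrite target_path in place, keeping only the lines unique to it."""
    with open(target_path) as target, open(filter_path) as flt:
        kept = drop_shared(target.readlines(), flt)
    # Reopen in write mode only after everything has been read.
    with open(target_path, "w") as target:
        target.writelines(kept)
```

Keeping the filtering in a function that works on plain lists of lines makes it easy to try out without touching any files.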

If there are no duplicates within each file, you could go through both files and add all elements to a Counter (https://docs.python.org/2/library/collections.html), and then on a second pass remove all elements that have a count larger than 1.
>>> from collections import Counter
>>> food1 = open("Text1.txt")
>>> food2 = open("Text2.txt")
>>> counter1 = Counter(item.split(",")[0] for item in food1.readlines())
>>> counter2 = Counter(item.split(",")[0] for item in food2.readlines())
>>> counter = counter1 + counter2
>>> counter
Counter({'Cheese': 2, 'Banana': 2, 'Apple': 1, 'Dragon': 1})
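The second pass is not shown above, so here is a rough sketch of how it might look. The helper names key_of and unique_to_first are made up for illustration, and the sketch relies on the same assumption as the answer: no key is repeated within a single file.

```python
from collections import Counter


def key_of(line):
    """Text before the first comma, without surrounding whitespace."""
    return line.split(",")[0].strip()


def unique_to_first(lines1, lines2):
    """Keep the lines of lines1 whose key occurs exactly once overall.

    lines1 must be a list (it is iterated twice), not an open file.
    """
    counts = Counter(key_of(l) for l in lines1) + Counter(key_of(l) for l in lines2)
    return [l for l in lines1 if counts[key_of(l)] == 1]
```

The returned list could then be written back to the first file with writelines().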

You can use regular expressions to extract the words from the text. Regular expressions reference: https://docs.python.org/3/library/re.html
You can extract the first word of every line in a file with this one-liner:
re.findall(r"^\s*(\w+)", file.read(), re.MULTILINE)
Demo:
>>> import re
>>> txt = """
... Banana, rotten
... Apple, ripe
... Cheese, fresh
... """
>>>
>>> re.findall(r"^\s*(\w+)", txt, re.MULTILINE)
['Banana', 'Apple', 'Cheese']
>>>
The function below extracts all the words to filter on, then efficiently filters the target file line by line.
>>> def filter_lines(filter_path, target_path, output_path):
...     with open(filter_path, 'r') as filter_file, \
...          open(target_path, 'r') as target_file, \
...          open(output_path, 'w+') as output_file:
...         filter_words = re.findall(r"^\s*(\w+)",
...                                   filter_file.read(),
...                                   re.MULTILINE)
...         filter_words = set(filter_words)
...         for line in target_file:
...             m = re.findall(r"^\s*(\w+)", line)
...             if not (m and m[0] in filter_words):
...                 output_file.write(line)
...
>>> filter_lines('text2.txt', 'text1.txt', 'filtered_text1.txt')
>>>
>>>
Side note: generally, when you maintain a large collection of items that is checked for membership in expressions like if item in long_list:, a set is much better than a list because lookups are fast; with a list, a lookup iterates over the items until the one you're looking for is found.

Related

Making python dictionary from a text file with multiple keys

I have a text file named file.txt with some numbers like the following :
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
1 90 4.914E-08 1.563E-07 5.193E-07
2 2 9.254E-07 5.166E-06 9.723E-06
2 3 1.366E-06 -5.184E-06 7.580E-06
2 4 2.966E-06 5.979E-07 9.702E-08
2 5 5.254E-07 0.166E-02 9.723E-06
3 23 1.366E-06 -5.184E-03 7.580E-06
3 24 3.244E-03 5.239E-04 9.002E-08
I want to build a python dictionary, where the first number in each row is the key, the second number is always ignored, and the last three numbers are put as values. But in a dictionary, a key can not be repeated, so when I write my code (attached at the end of the question), what I get is
'1' : [ '90' '4.914E-08' '1.563E-07' '5.193E-07' ]
'2' : [ '5' '5.254E-07' '0.166E-02' '9.723E-06' ]
'3' : [ '24' '3.244E-03' '5.239E-04' '9.002E-08' ]
All the other numbers are removed, and only the last row is kept as the values. What I need is to have all the numbers against a key, say 1, to be appended in the dictionary. For example, what I need is :
'1' : ['8.106E-08' '2.052E-08' '3.837E-08' '-4.766E-09' '9.003E-08' '4.812E-07' '4.914E-08' '1.563E-07' '5.193E-07']
Is it possible to do it elegantly in python? The code I have right now is the following :
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        diction[pa[0]] = pa[1:]
with open('file.txt') as f:
    diction = {pa[0]: pa[1:] for pa in map(str.split, f)}
You can use a defaultdict.
from collections import defaultdict
data = defaultdict(list)
with open("file.txt", "r") as f:
    for line in f:
        line = line.split()
        data[line[0]].extend(line[2:])
Try this:
from collections import defaultdict
diction = defaultdict(list)
with open("file.txt") as f:
    for line in f:
        key, _, *values = line.strip().split()
        diction[key].extend(values)
print(diction)
This is a solution for Python 3, because the statement a, *b = tuple1 is invalid in Python 2. Look at the solution of @cha0site if you are using Python 2.
Make the value of each key in diction be a list and extend that list with each iteration. With your code as it is written now when you say diction[pa[0]] = pa[1:] you're overwriting the value in diction[pa[0]] each time the key appears, which describes the behavior you're seeing.
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        try:
            # pa[2:] skips the second column, which should be ignored
            diction[pa[0]].extend(pa[2:])
        except KeyError:
            diction[pa[0]] = pa[2:]
In this code each value of diction will be a list. In each iteration if the key exists that list will be extended with new values from pa giving you a list of all the values for each key.
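To see the difference between overwriting and extending on a small scale, here is a minimal sketch that uses the first few rows from the question as an in-memory string instead of file.txt. The function names overwrite and accumulate, and the SAMPLE constant, are only illustrative.

```python
from collections import defaultdict

# First rows of the question's data, held in memory instead of file.txt.
SAMPLE = """\
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
2 2 9.254E-07 5.166E-06 9.723E-06
"""


def overwrite(lines):
    """Plain assignment: each row replaces what was stored for its key."""
    result = {}
    for line in lines:
        key, _, *values = line.split()  # the second column is discarded
        result[key] = values
    return result


def accumulate(lines):
    """extend(): each row's values are added to the key's existing list."""
    result = defaultdict(list)
    for line in lines:
        key, _, *values = line.split()
        result[key].extend(values)
    return dict(result)


rows = SAMPLE.splitlines()
print(len(overwrite(rows)["1"]))   # only the last row for key '1' survives
print(len(accumulate(rows)["1"]))  # values from both rows for key '1'
```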
To do this in a very simple for loop (wrapped in a function so that the return statement is valid; the name build_dict is only a placeholder):
def build_dict(path='file.txt'):
    with open(path) as f:
        return_dict = {}
        for item_list in map(str.split, f):
            if item_list[0] not in return_dict:
                return_dict[item_list[0]] = []
            # item_list[2:] skips the second column, which the question says to ignore
            return_dict[item_list[0]].extend(item_list[2:])
    return return_dict
Or, if you wanted to use defaultdict in a one liner-ish:
from collections import defaultdict
def build_dict(path='file.txt'):
    with open(path) as f:
        return_dict = defaultdict(list)
        [return_dict[item_list[0]].extend(item_list[2:]) for item_list in map(str.split, f)]
    return return_dict

Extracting data from csv

I have a csv file with each row containing lists of adjectives.
For example, the first 2 rows are as follows:
["happy","sad","colorful"]
["horrible","sad","cheerful","happy"]
I want to extract all the data from this file to get a list containing each adjective only once.
Here, it would be the following list:
["happy","sad","colorful","horrible","cheerful"]
I am doing this using Python.
import csv
with open('adj.csv', 'rb') as f:
    reader = csv.reader(f)
    adj_list = list(reader)
filtered_list = []
for l in adj_list:
    if l not in new_list:
        filtered_list.append(l)
Supposing that "memory is not important" and that a one-liner is what you are looking for:
from itertools import chain
from csv import reader
print(list(set(chain(*reader(open('file.csv'))))))
having 'file.csv' content like this:
happy, sad, colorful
horrible, sad, cheerful, happy
OUTPUT:
['horrible', ' colorful', ' sad', ' cheerful', ' happy', 'happy']
You can remove the list() part if you don't mind receiving a Python set instead of a list.
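Note that in the output above ' happy' and 'happy' count as different words, because the sample file has a space after each comma. One possible way around this, sketched here under the assumption that such spaces are not meaningful, is the skipinitialspace option of csv.reader. The function name unique_words is illustrative, and the CSV content is passed in as a string to keep the example self-contained.

```python
import csv
import io


def unique_words(csv_text):
    """Return the set of distinct words in comma-separated text."""
    # skipinitialspace=True drops whitespace that directly follows a delimiter.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return {word for row in reader for word in row}


sample = "happy, sad, colorful\nhorrible, sad, cheerful, happy\n"
print(sorted(unique_words(sample)))
```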
Assuming you are only interested in a list of unique words where order does not matter:
# Option A1
import csv
with open("adj.csv", "r") as f:
    seen = set()
    reader = csv.reader(f)
    for line in reader:
        for word in line:
            seen.add(word)
list(seen)
# ['cheerful', 'colorful', 'horrible', 'happy', 'sad']
More concisely:
# Option A2
with open("adj.csv", "r") as f:
    reader = csv.reader(f)
    unique_words = {word for line in reader for word in line}
list(unique_words)
The with statement safely opens and closes the file. We are simply adding every word to a set. We cast the filtered result to list() and get a list of unique (unordered) words.
Alternatives
If order does matter, implement the unique_everseen itertools recipe.
From itertools recipes:
import itertools as it

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in it.filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
You can implement this manually or install a third-party library that implements it for you, such as more_itertools, e.g. pip install more_itertools
# Option B
import csv
import more_itertools as mit
with open("adj.csv", "r") as f:
reader = csv.reader(f)
words = (word for line in reader for word in line)
unique_words = list(mit.unique_everseen(words))
unique_words
# ['happy', 'sad', 'colorful', 'horrible', 'cheerful']
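If you would rather not install anything, an order-preserving alternative is dict.fromkeys, since regular dicts keep insertion order from Python 3.7 on. This is only a sketch of that alternative, not the recipe above. The function name ordered_unique is made up, and the CSV content is passed as a string so the example runs on its own.

```python
import csv
import io


def ordered_unique(csv_text):
    """Return each word once, in order of first appearance."""
    reader = csv.reader(io.StringIO(csv_text))
    words = (word for row in reader for word in row)
    # Dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


sample = "happy,sad,colorful\nhorrible,sad,cheerful,happy\n"
print(ordered_unique(sample))
```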

Appending multiple scores to one entry in a python dictionary

I have a textfile and it looks like this :
zor:10
zor:21
bob:30
qwerty:46
I want it to look like {'zor': [10, 21], 'bob': [30]} etc., but the numbers keep getting replaced when I add multiple scores to a name. I split each line so that the names and scores are in separate positions.
elif schClass == '2':
    schClass = open("scores2.txt", 'r')
    li = open("scores2.txt", 'r')
    data = li.read().splitlines()
    for li in data:
        name = li.split(":")[0]
        score = li.split(":")[1]
        if name not in diction1:
            diction1[name] = score
        elif name in diction1 :
            diction1[name] = diction1[name + int(score)]
    print(diction1)
You are not building lists; simply use the dict.setdefault() method to insert a list object when the key is missing, and append your values:
diction1 = {}
with open("scores2.txt", 'r') as infile:
    for line in infile:
        name, _, score = line.partition(':')
        diction1.setdefault(name, []).append(int(score))
I took the liberty of cleaning up your code a little; I'm using the file as a context manager so that it is closed again automatically. By looping directly over the file you get individual lines, so there is no need to split first. And I used str.partition() to split just once (it is more efficient for that case than str.split()).
Demo:
>>> from io import StringIO
>>> sample = '''\
... zor:10
... zor:21
... bob:30
... qwerty:46
... '''
>>> diction1 = {}
>>> with StringIO(sample) as infile:
...     for line in infile:
...         name, _, score = line.partition(':')
...         diction1.setdefault(name, []).append(int(score))
...
>>> diction1
{'bob': [30], 'qwerty': [46], 'zor': [10, 21]}
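One detail worth knowing about str.partition() is that it always returns three parts, and the separator part is empty when the separator is missing. Here is a small sketch of how that could be used to skip malformed lines. The helper name parse_score is hypothetical, and silently skipping bad lines is just one possible policy.

```python
def parse_score(line):
    """Return (name, score) for a 'name:score' line, or None if malformed."""
    name, sep, score = line.partition(":")
    if not sep:
        # No colon on this line, so partition() left sep empty.
        return None
    try:
        return name, int(score)  # int() tolerates the trailing newline
    except ValueError:
        return None              # the part after the colon is not a number


diction1 = {}
for line in ["zor:10\n", "zor:21\n", "no colon here\n", "bob:30\n"]:
    parsed = parse_score(line)
    if parsed is not None:
        name, score = parsed
        diction1.setdefault(name, []).append(score)
print(diction1)
```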

Build a dictionary from successful regex matches in python

I'm pretty new to Python, and I'm trying to parse a file. Only certain lines in the file contain data of interest, and I want to end up with a dictionary of the stuff parsed from valid matching lines in the file.
The code below works, but it's a bit ugly and I'm trying to learn how it should be done, perhaps with a comprehension, or else with a multiline regex. I'm using Python 3.2.
file_data = open('x:\\path\\to\\file','r').readlines()
my_list = []
for line in file_data:
    # discard lines which don't match at all
    if re.search(pattern, line):
        # icky, repeating search!!
        one_tuple = re.search(pattern, line).group(3,2)
        my_list.append(one_tuple)
my_dict = dict(my_list)
Can you suggest a better implementation?
Thanks for the replies. After putting them together I got
file_data = open('x:\\path\\to\\file','r').read()
my_list = re.findall(pattern, file_data, re.MULTILINE)
my_dict = {c:b for a,b,c in my_list}
but I don't think I could have gotten there today without the help.
Here are some quick'n'dirty optimisations to your code:
my_dict = dict()
with open(r'x:\path\to\file', 'r') as data:
    for line in data:
        match = re.search(pattern, line)
        if match:
            one_tuple = match.group(3, 2)
            my_dict[one_tuple[0]] = one_tuple[1]
In the spirit of EAFP I'd suggest
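my_dict = {}
with open(r'x:\path\to\file', 'r') as data:
    for line in data:
        try:
            m = re.search(pattern, line)
            # group 3 is the key and group 2 the value, as in the question
            my_dict[m.group(3)] = m.group(2)
        except AttributeError:
            pass  # no match: re.search returned None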
Another way is to keep using lists, but redesign the pattern so that it contains only two groups (key, value). Then you could simply do:
matches = [re.findall(pattern, line) for line in data]
mydict = dict(x[0] for x in matches if x)
matchRes = pattern.match(line)
if matchRes:
    my_dict = matchRes.groupdict()
I'm not sure I'd recommend it, but here's a way you could try to use a comprehension instead (I substituted a string for the file for simplicity):
>>> import re
>>> data = """1foo bar
... 2bing baz
... 3spam eggs
... nomatch
... """
>>> pattern = r"(.)(\w+)\s(\w+)"
>>> {x[0]: x[1] for x in (m.group(3, 2) for m in (re.search(pattern, line) for line in data.splitlines()) if m)}
{'baz': 'bing', 'eggs': 'spam', 'bar': 'foo'}
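A variation on the comprehension idea, sketched here with named groups so that the key and value positions are spelled out in the pattern itself. The group names key and value, the function name build_dict, and the pattern are assumptions made for this example (the question never shows its real pattern). The sample text is the one from the demo above.

```python
import re

# Hypothetical pattern: one leading character, then the value, then the key.
pattern = re.compile(r"(.)(?P<value>\w+)\s(?P<key>\w+)")


def build_dict(lines):
    """Map key -> value for every line the pattern matches."""
    matches = (pattern.search(line) for line in lines)
    # Lines without a match yield None and are filtered out.
    return {m.group("key"): m.group("value") for m in matches if m}


data = """1foo bar
2bing baz
3spam eggs
nomatch
"""
print(build_dict(data.splitlines()))
```

Each line is searched only once, and the dictionary is built without an intermediate list of tuples.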

Create a dictionary from text file

Alright, well, I am trying to create a dictionary from a text file so that each key is a single lowercase character and each value is a list of the words from the file that start with that letter.
The text file contains one lowercase word per line, e.g.:
airport
bathroom
boss
bottle
elephant
Output:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e':['elephant']}
I haven't got a lot done really; I'm just confused about how I would get the first character of each line, set it as the key, and append the values. I would really appreciate it if someone could help me get started.
words = {}
for line in infile:
    line = line.strip() # not sure if this line is correct
So let's examine your example:
words = {}
for line in infile:
    line = line.strip()
This looks good for a beginning. Now you want to do something with the line. Probably you'll need the first character, which you can access through line[0]:
first = line[0]
Then you want to check whether the letter is already in the dict. If not, you can add a new, empty list:
if first not in words:
    words[first] = []
Then you can append the word to that list:
words[first].append(line)
And you're done!
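Put together, the steps above might look like the following sketch. The function name index_by_first_letter is only illustrative, and skipping blank lines is an addition of mine, since line[0] would raise an IndexError on an empty string.

```python
def index_by_first_letter(lines):
    """Group words by their first character."""
    words = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines; line[0] would fail on them
        first = line[0]
        if first not in words:
            words[first] = []
        words[first].append(line)
    return words


sample = ["airport\n", "bathroom\n", "boss\n", "bottle\n", "elephant\n"]
print(index_by_first_letter(sample))
```

An open file can be passed in place of the list, because iterating over a file also yields one line at a time.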
If the lines are already sorted like in your example file, you can also make use of itertools.groupby, which is a bit more sophisticated:
from itertools import groupby
from operator import itemgetter
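with open('infile.txt', 'r') as f:
    # Build each list right away: a groupby group is no longer valid once the next group starts
    words = {k: [line.strip() for line in g] for k, g in groupby(f, key=itemgetter(0))}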
You can also sort the lines first, which makes this method generally applicable:
groupby(sorted(f), ...)
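As a sketch of the sorted variant, here is a version that strips and sorts the lines before grouping, so the input order no longer matters. The function name group_by_first_letter is made up for the example, and blank lines are dropped because itemgetter(0) would fail on an empty string.

```python
from itertools import groupby
from operator import itemgetter


def group_by_first_letter(lines):
    """Group words by first character, regardless of input order."""
    # Strip first, drop blank lines, then sort so equal first letters are adjacent.
    stripped = sorted(line.strip() for line in lines if line.strip())
    # list(g) materialises each group before groupby moves on to the next one.
    return {k: list(g) for k, g in groupby(stripped, key=itemgetter(0))}


unsorted_lines = ["boss\n", "airport\n", "bottle\n", "elephant\n", "bathroom\n"]
print(group_by_first_letter(unsorted_lines))
```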
defaultdict from the collections module is a good choice for this kind of task:
>>> import collections
>>> words = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...     lines = [l.strip() for l in f if l.strip()]
...
>>> lines
['airport', 'bathroom', 'boss', 'bottle', 'elephant']
>>> for word in lines:
...     words[word[0]].append(word)
...
>>> print(words)
defaultdict(<class 'list'>, {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']})
