remove similar lines in text file

I don't normally use Python, but I have a script written in it. This is the relevant part of the script:
elif line.find("CONECT") > -1:
    con = line.split()
    line_value = line_value + 1
    #print line_value
    #print con[2]
    try:
        line_j = "e" + ', ' + str(line_value) + ', ' + con[2] + "\n"
        output_file.write(line_j)
        print(line_j)
        line_i = "e" + ', ' + str(line_value) + ', ' + con[3] + "\n"
        output_file.write(line_i)
        print(line_i)
        line_k = "e"+ ', ' + str(line_value) + ', ' + con[4] + "\n"
        print(line_k)
        output_file.write(line_k)
    except IndexError:
        continue
which gives .txt output in this format:
e, 1, 2
e, 1, 3
e, 1, 4
e, 2, 1
e, 2, 3
etc.
I need to remove similar lines that contain the same numbers, regardless of the order of those numbers,
e.g. the line e, 2, 1 (which repeats e, 1, 2).
Is it possible?

Of course, it is better to modify your code so that those lines are skipped BEFORE you write them to the file. You can use a list to store the values you have already saved and, on each iteration, check whether the values you want to add already exist in that list. The code below isn't tested or optimized, but it illustrates the idea:
# 'added = []' should be placed somewhere before 'if'
added = []
# your part of the code
elif line.find("CONECT") > -1:
    con = line.split()
    line_value = line_value + 1
    try:
        line_j = "e, %s, %s\n" % (str(line_value),con[2])
        tmp = sorted((str(line_value),con[2]))
        if tmp not in added:
            added.append(tmp)
            output_file.write(line_j)
            print(line_j)
        line_i = "e, %s, %s\n" % (str(line_value),con[3])
        tmp = sorted((str(line_value),con[3]))
        if tmp not in added:
            added.append(tmp)
            output_file.write(line_i)
            print(line_i)
        line_k = "e, %s, %s\n" % (str(line_value),con[4])
        tmp = sorted((str(line_value),con[4]))
        if tmp not in added:
            added.append(tmp)
            print(line_k)
            output_file.write(line_k)
    except IndexError:
        continue

Here is a comparison function for two lines of your file:
from collections import Counter

def compare(line1, line2):
    els1 = line1.strip().split(', ')
    els2 = line2.strip().split(', ')
    return Counter(els1) == Counter(els2)
See the documentation for the collections.Counter class.
If the count of each element doesn't matter, you can use set instead of Counter.
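As a rough sketch of how such an order-independent comparison could be applied to a file that has already been written, each line can be reduced to a sorted tuple of its fields (which compares the same way as a Counter) and filtered in one pass. The function name drop_similar_lines and the sample lines are assumptions made for illustration only:

```python
def drop_similar_lines(lines):
    """Keep the first line for each group of fields, ignoring field order."""
    seen = set()
    kept = []
    for line in lines:
        # Sorting the fields gives the same key for "e, 1, 2" and "e, 2, 1".
        key = tuple(sorted(line.strip().split(', ')))
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Hypothetical sample mirroring the format shown in the question.
sample = ["e, 1, 2\n", "e, 1, 3\n", "e, 2, 1\n", "e, 2, 3\n"]
print(''.join(drop_similar_lines(sample)), end='')
```

Using a set of keys keeps each membership test cheap, which may matter if the file has many lines.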

The following approach should work. First add the following line further up in your code, before the loop:
seen = set()
Then replace everything inside the try with the following code:
for con_value in con[2:5]:
    # Compare as strings so that (1, "2") and (2, "1") give the same set
    entry = frozenset((str(line_value), con_value))
    if entry not in seen:
        seen.add(entry)
        line_j = "e" + ', ' + str(line_value) + ', ' + con_value + "\n"
        output_file.write(line_j)
        print(line_j)
Make sure this code is indented to the same level as the code it replaces.

Related

Line split is not functioning as intended

I am trying to get this code to split one at a time, but it is not functioning as expected:
for line in text_line:
    one_line = line.split(' ',1)
    if len(one_line) > 1:
        acro = one_line[0].strip()
        meaning = one_line[1].strip()
        if acro in acronyms_dict:
            acronyms_dict[acro] = acronyms_dict[acro] + ', ' + meaning
        else:
            acronyms_dict[acro] = meaning

Remove the ' ' from the str.split. The file is using tabs to delimit the acronyms:
import requests

data_site = requests.get(
    "https://raw.githubusercontent.com/priscian/nlp/master/OpenNLP/models/coref/acronyms.txt"
)
text_line = data_site.text.split("\n")
acronyms_dict = {}
for line in text_line:
    one_line = line.split(maxsplit=1)  # <-- remove the ' '
    if len(one_line) > 1:
        acro = one_line[0].strip()
        meaning = one_line[1].strip()
        if acro in acronyms_dict:
            acronyms_dict[acro] = acronyms_dict[acro] + ", " + meaning
        else:
            acronyms_dict[acro] = meaning
print(acronyms_dict)
Prints:
{
'24KHGE': '24 Karat Heavy Gold Electroplate',
'2B1Q': '2 Binary 1 Quaternary',
'2D': '2-Dimensional',
...

How would I format python code using python?

Let's say I've got this code in python:
total=0for i in range(100):print(i)if i > 50:total=total+i
How would I make an algorithm in python to format this python code into the code below:
total=0
for i in range(100):
    print(i)
    if i > 50:
        total=total+i
Assume that everything is nested under each other, such that another statement would be assumed to be inside the if block.

This was quite a fun exercise! I'm running out of juice so just posting this as is. It works on your example but probably not much for anything more complex.
code_block = "total=0for i in range(100):print(i)if i > 50:total=total+iprint('finished')"
code_block_b = "def okay() {print('ff')while True:print('blbl')break}"
line_break_before = ['for', 'while', 'if', 'print', 'break', '}']
line_break_after = [':', '{']
indent_chars = [':', '{']
unindent_chars = ['}']
# Add line breaks before keywords
for kw in line_break_before:
    kw_indexes = [idx for idx in range(len(code_block)) if code_block[idx:idx + len(kw)] == kw]
    for kw_idx in kw_indexes[::-1]:
        code_block = code_block[:kw_idx] + '\n' + code_block[kw_idx:]
# Add line breaks after other keywords if not present already
for kw in line_break_after:
    kw_indexes = [idx for idx in range(len(code_block)) if code_block[idx:idx + len(kw)] == kw]
    for kw_idx in kw_indexes[::-1]:
        if code_block[kw_idx + 1: kw_idx + 2] != '\n':
            code_block = code_block[:kw_idx + 1] + '\n' + code_block[kw_idx + 1:]
# Add indentation
indent = 0
formatted_code_lines = []
for line in code_block.split('\n'):
    if line[-1] in unindent_chars:
        indent = 0
    formatted_code_lines.append(' ' * indent)
    if line[-1] in indent_chars:
        indent += 4
    formatted_code_lines.append(line + '\n')
code_block = ''.join(formatted_code_lines)
print(code_block)
The basic premise for formatting is based around keywords. There are keys that require a line break before, and keys that require a line break after them. After that, the indentation was counted +4 spaces for every line after each : symbol. I tested some formatting with braces too in code_block_b.
Output a
total=0
for i in range(100):
    print(i)
    if i > 50:
        total=total+i
Output b
def okay() {
    print('ff')
    while True:
        print('blbl')
        break
}

list index out of range when extending list

This function takes an email body as input and returns the values after Application name, Source and Message respectively, and it works fine:
def parse_subject(line):
    info = {}
    segments = line.split(' ')
    info['time'] = segments[0]+' '+segments[1]
    for i in range(2, len(segments)):
        key = ''
        if segments[i] == 'Application name:':
            key = 'appname'
        elif segments[i] == 'Source:':
            key = 'source'
        elif segments[i] == 'Message:':
            key = 'message'
        if key != '':
            i += 1
            info[key] = segments[i]
    return info
For another email body format I need to use more segments, because I have to search more lines in the message body. So I changed info['time'], and as soon as I use more than two segments I get out-of-range errors:
info['time'] = segments[0]+' '+segments[1]+' '+segments[2]+' '+segments[3]+' '+segments[4]+' '+segments[5]......up to segment[17]
I may need to extend it even further,
and the function above fails with "list index out of range".
I changed the code but got the same error.
I also tried changing the start of the range to match the number of segments, with the same result:
for i in range(<number of segments>, len(segments)):
The number of segments will vary, because the string after Message has a different value each time; sometimes it is a URL string.
Question
When I hard-code the number of segments, let's say up to segments[17],
what do I need to change in the function so that it does not raise an index error?
def parse_subject(line):
    info = {}
    segments = line.split(' ')
    info['time'] = (segments[0]+' '+segments[1] + ' ' + segments[2] + ' ' + segments[3] + ' ' + segments[4] + ' ' + segments[5] + ' ' + segments[6] + ' ' + segments[7] + ' ' + segments[8] +' ' + segments[9] + ' ' + segments[10] + ' ' + segments[11] + ' ' + segments[12] +' ' + segments[13] + ' ' + segments[14] + ' '
                    + segments[15] +' ' + segments[16] + ' ' + segments[17])
    for i in range(16, len(segments)):
        key = ''
        if segments[i] == 'name:':
            key = 'appname'
        elif segments[i] == 'Source:':
            key = 'source'
        elif segments[i] == 'Message:':
            key = 'message'
        if key != '':
            i += 1
            info[key] = segments[i]
    return info
if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
    body = get_autosys_body(mail)
    # print(body)
    for line in body.splitlines():
        if 'Application Name' in line:
            job_info = parse_subject(line)
            break
    print(job_info)
I need to pass the line variable (content below)
name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null
to the parse_subject(line) function and, from the content above, get:
Contoso.Service as the value of job_info['appname']
host15 as the value of job_info['source']
null as the value of job_info['message']

You need to debug your code. The error is telling you exactly what is wrong.
def old_parse_subject(line):
    info = {}
    segments = line.split(' ')
    if len(segments) < 18:
        raise ValueError("segments[17] won't work if segments is not that long")
You could have done a print(len(segments)) or just print (segments) right before where you know the error is.
For reading an email header, if you know it has multiple lines, you get those with split('\n') and then for each line if you know it is "name: value" you get that with split(':', 1).
The second argument to split says only split on 1 colon, because any additional colons are allowed to be part of the data. For example, timestamps have colons.
def parse_subject(headers):
    info = {}
    # split the full header into separate lines
    for line in headers.split('\n'):
        # split on colon, but only once
        key, value = line.split(':', 1)
        # store info
        info[key] = value
    return info
data = """name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null"""
print (parse_subject(data))
{'name': 'Contoso.Service', 'Source': ' host15', 'Timestamp': ' 2019-01-22T00:00:43.901Z', 'Message': 'null'}
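To get the exact keys the question asks for (appname, source, message) without the leading spaces seen in the output above, one possible variation is sketched below. The function name parse_fields and the wanted mapping are assumptions for illustration; they are not part of the original code:

```python
def parse_fields(headers):
    """Map the 'name', 'Source' and 'Message' lines to the question's keys."""
    # Hypothetical mapping from header names (lower-cased) to result keys.
    wanted = {'name': 'appname', 'source': 'source', 'message': 'message'}
    info = {}
    for line in headers.splitlines():
        if ':' not in line:
            continue  # skip lines that are not "key: value"
        # Split on the first colon only; timestamps contain more colons.
        key, value = line.split(':', 1)
        key = key.strip().lower()
        if key in wanted:
            info[wanted[key]] = value.strip()
    return info


data = """name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null"""
job_info = parse_fields(data)
print(job_info)
```

Skipping lines without a colon avoids the unpacking error that a blank or free-text line in the body would otherwise cause.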

Concatenating multiple strings from dictionary and save in file using python

I am able to write the hostname to /tmp/filter.log, but how can I write all three values [hostname, owner, seats] to the file?
def list_hosts(nc):
resp = nc.send_service_request('ListHosts', json.dumps({}))
result = resp['result']
l = []
f=open("/tmp/filter.log", "w+")
for r in result:
if "team-prod" in r['owner']:
print r['owner'], r['hostname'], r['seats']
f.write(r['hostname'] + "\n")
f.close()
l.append(r['hostname'])
return l
nc = create_client('zone', 'team_PROD_USERNAME', 'team_PROD_PASSWORD')
l = list_hosts(nc)
print l
The file should have entries as below:
team-prod\*, np-team-052, [u'123123123-18d1-483d-9af8-169ac66b26e4']
Current entry is:
np-team-052

f.write(str(r['owner']) + ', ' + str(r['hostname']) + ', ' + str(r['seats']) + '\n')
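For context, here is a minimal Python 3 sketch of how that write could sit inside the loop, assuming each record is a dict with owner, hostname and seats keys as in the question. The function name write_hosts, the sample records and the log path are made up for illustration:

```python
import os
import tempfile


def write_hosts(records, path):
    """Write 'owner, hostname, seats' for matching records; return hostnames."""
    matched = []
    # The with block closes the file once, after all records are written.
    with open(path, 'w') as f:
        for r in records:
            if 'team-prod' in r['owner']:
                fields = (r['owner'], r['hostname'], r['seats'])
                f.write(', '.join(str(x) for x in fields) + '\n')
                matched.append(r['hostname'])
    return matched


# Hypothetical records standing in for resp['result'].
records = [
    {'owner': 'team-prod-a', 'hostname': 'np-team-052', 'seats': ['seat-1']},
    {'owner': 'other', 'hostname': 'np-other-001', 'seats': []},
]
log_path = os.path.join(tempfile.gettempdir(), 'filter_example.log')
hosts = write_hosts(records, log_path)
print(hosts)
```

Converting every field with str() matters here because seats is a list, and concatenating a list to a string raises a TypeError.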

Python function performance

I have 130 lines of code. Everything except lines 79 to 89 works fine and runs in about 0.16 seconds. However, after adding a 10-line function (lines 79-89), the program takes 70-75 seconds. The data file used in that function (u.data) has 100000 lines of numerical data in this format:
>196 242 3 881250949
There are four numbers on every line. When I ran that function in a separate Python file while testing (before adding it to the main program), it ran in 0.15 seconds. However, with the same code in the main program, the whole program takes almost 70 seconds.
Here is my code:
""" Assignment 5: Movie Reviews
Date: 30.12.2016
"""
import os.path
import time
start_time = time.time()
""" FUNCTIONS """
# Getting film names in film folder
def get_film_name():
    name = ''
    for word in read_data.split(' '):
        if ('(' in word) == False:
            name += word + ' '
        else:
            break
    return name.strip(' ')
# Function for removing date for comparison
def throw_date(string):
    a_list = string.split()[:-1]
    new_string = ''
    for i in a_list:
        new_string += i + ' '
    return new_string.strip(' ')
def film_genre(film_name):
    oboist = []
    genr_list = ['unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
                 'Fantasy',
                 'Movie-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
    for item in u_item_list:
        if throw_date(str(item[1])) == film_name:
            for i in range(4, len(item)):
                oboist.append(item[i])
    dictionary = dict(zip(genr_list, oboist))
    genres = ''
    for key, value in dictionary.items():
        if value == '1':
            genres += key + ' '
    return genres.strip(' ')
def film_link(film_name):
    link = ''
    for item in u_item_list:
        if throw_date(str(item[1])) == film_name:
            link += item[3]
    return link
def film_review(film_name):
    review = ''
    for r, d, filess in os.walk('film'):
        for fs in filess:
            fullpat = os.path.join(r, fs)
            with open(fullpat, 'r') as a_file:
                data = a_file.read()
                if str(film_name).lower() in str(data.split('\n', 1)[0]).lower():
                    for i, line in enumerate(data):
                        if i > 1:
                            review += line
                a_file.close()
    return review
def film_id(film_name):
    for film in u_item_list:
        if throw_date(film[1]) == film_name:
            return film[0]
def total_user_and_rate(film_name):
    rate = 0
    user = 0
    with open('u.data', 'r') as data_file:
        rate_data = data_file.read()
        for l in rate_data.split('\n'):
            if l.split('\t')[1] == film_id(film_name):
                user += 1
                rate += int(l.split('\t')[2])
        data_file.close()
    print('Total User:' + str(int(user)) + '\nTotal Rate: ' + str(rate / user))
""" MAIN CODE"""
review_file = open("review.txt", 'w')
film_name_list = []
# Look for txt files and extract the film names
for root, dirs, files in os.walk('film'):
    for f in files:
        fullpath = os.path.join(root, f)
        with open(fullpath, 'r') as file:
            read_data = file.read()
            film_name_list.append(get_film_name())
            file.close()
with open('u.item', 'r') as item_file:
    item_data = item_file.read()
    item_file.close()
u_item_list = []
for line in item_data.split('\n'):
    temp = [word for word in line.split('|')]
    u_item_list.append(temp)
film_name_list = [i.lower() for i in film_name_list]
updated_film_list = []
print(u_item_list)
# Operation for review.txt
for film_data_list in u_item_list:
    if throw_date(str(film_data_list[1]).lower()) in film_name_list:
        strin = film_data_list[0] + " " + film_data_list[1] + " is found in the folder" + '\n'
        print(film_data_list[0] + " " + film_data_list[1] + " is found in the folder")
        updated_film_list.append(throw_date(str(film_data_list[1])))
        review_file.write(strin)
    else:
        strin = film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[3] + '\n'
        print(film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[3])
        review_file.write(strin)
total_user_and_rate('Titanic')
print("time elapsed: {:.2f}s".format(time.time() - start_time))
And my question is what can be the reason for that? Is the function
("total_user_and_rate(film_name)")
problematic? Or can there be other problems in other parts? Or is it normal because of the file?

I see a couple of unnecessary things.
You call film_id(film_name) inside the loop for every line of the file, you really only need to call it once before the loop.
You don't need to read the file, then split it to iterate over it, just iterate over the lines of the file.
You split each line twice, just do it once
Refactored for these changes:
def total_user_and_rate(film_name):
    rate = 0
    user = 0
    f_id = film_id(film_name)
    with open('u.data', 'r') as data_file:
        for line in data_file:
            line = line.split('\t')
            if line[1] == f_id:
                user += 1
                rate += int(line[2])
        data_file.close()
    print('Total User:' + str(int(user)) + '\nTotal Rate: ' + str(rate / user))

In your test you were probably testing with a much smaller u.item file. Or doing something else to ensure film_id was much quicker. (By quicker, I mean it probably ran on the nanosecond scale.)
The problem you have is that computers are so fast you didn't realise when you'd actually made a big mistake doing something that runs "slowly" in computer time.
If your if l.split('\t')[1] == film_id(film_name): line takes 1 millisecond, then when processing a 100,000 line u.data file, you could expect your total_user_and_rate function to take 100 seconds.
The problem is that film_id iterates over all your films to find the correct id, and it does so for every single line in u.data. You'd be lucky if the film you're looking for is near the beginning of u_item_list, because then the function would return almost immediately. But as soon as you run your new function for a film near the end of u_item_list, you'll notice performance problems.
wwii has explained how to optimise the total_user_and_rate function. But you could also gain performance improvements by changing u_item_list to use a dictionary. This would improve the performance of functions like film_id from O(n) complexity to O(1). I.e. it would still run on the nanosecond scale no matter how many films are included.
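A minimal sketch of that dictionary idea follows, assuming u_item_list rows have the id in position 0 and the dated title in position 1 as in the question. The helper name build_id_index and the sample rows are assumptions for illustration, and throw_date is rewritten compactly here:

```python
def throw_date(title):
    """Drop the trailing '(year)' token, as in the question's helper."""
    return ' '.join(title.split()[:-1])


def build_id_index(u_item_list):
    """Build a title -> id dictionary once, so each lookup is O(1)."""
    index = {}
    for item in u_item_list:
        if len(item) > 1:  # skip empty rows such as a trailing blank line
            # setdefault keeps the first match, like the original film_id loop.
            index.setdefault(throw_date(item[1]), item[0])
    return index


# Illustrative rows only; the real list is read from u.item.
u_item_list = [
    ['1', 'Toy Story (1995)'],
    ['313', 'Titanic (1997)'],
    [''],
]
film_ids = build_id_index(u_item_list)
print(film_ids.get('Titanic'))
```

The index is built once after u.item is read, and film_ids.get(film_name) can then stand in for film_id(film_name), returning None for an unknown title just as the original function does.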
