Text file
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file;
Output
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I have already tried various approaches and none of them work as expected. I just want to keep only one URL per domain from the list.
Please tell me how I can do it with a Bash script or Python.
PS: I want to filter and save the full URLs from the list, not only the root domains.
With awk and / as field separator:
awk -F '/' '!seen[$3]++' file
If your file contains Windows line breaks (carriage returns) then I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
Output:
https://www.google.com/1/
https://www.bing.com
A Python solution using one of the itertools recipes (unique_everseen) and urllib.parse.urlparse. Let file.txt contain:
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
then
from itertools import filterfalse
from urllib.parse import urlparse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBcCAD', str.lower) --> A B c D
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

def get_netloc(url):
    return urlparse(url).netloc

with open("file.txt", "r") as fin:
    with open("file_uniq.txt", "w") as fout:
        for line in unique_everseen(fin, key=get_netloc):
            fout.write(line)
This creates the file file_uniq.txt with the following content:
https://www.google.com/1/
https://www.bing.com
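If the generator recipe feels heavy, here is a more compact sketch with a plain set, under the same assumptions (reading file.txt, writing file_uniq.txt):

from urllib.parse import urlparse

seen = set()
with open("file.txt") as fin, open("file_uniq.txt", "w") as fout:
    for line in fin:
        netloc = urlparse(line.strip()).netloc
        if netloc and netloc not in seen:
            seen.add(netloc)       # first URL seen for this domain wins
            fout.write(line)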
AH! I'm new to Python. Trying to get the pattern here, but could use some assistance to get unblocked.
Scenario:
testZip.zip file with test.rpt files inside
The .rpt files have multiple areas of interest ("AOI") to parse
AOI1: Line starting with $$
AOI2: Multiple lines starting with a single $
Goal:
To get AOI's into tabular format for upload to SQL
Sample file:
$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE
###########################################################################################
$KEY= 9/21/2020 3:53:55 PM/2002/B0/295.30/305.30/4FAOA973_3.0_v2.19.2.0_20150203_1/20201002110149
$TIMESTAMP= 20201002110149
$MORECOLUMNS= more columns
$YETMORE = yay
Tried so far:
import zipfile

def get_aoi1(zip):
    z = zipfile.ZipFile(zip)
    for f in z.namelist():
        with z.open(f, 'r') as rptf:
            for l in rptf.readlines():
                if l.find(b"$$") != -1:
                    return l

def get_aoi2(zip):
    z = zipfile.ZipFile(zip)
    for f in z.namelist():
        with z.open(f, 'r') as rptf:
            for l in rptf.readlines():
                if l.find(b"$") != -1:
                    return l

aoi1 = get_aoi1('testZip.zip')
aoi2 = get_aoi2('testZip.zip')
print(aoi1)
print(aoi2)
Results:
I get the same results for both functions
b"$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE\r\n"
b"$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE\r\n"
How do I get the results in text instead of bytes (b) and remove the \r\n from AOI1?
There doesn't seem to be an r option for z.open()
I've been unsuccessful with .strip()
EDIT 1:
Thanks for the tip @furas!
return l.strip().decode() worked for removing the newline and the b prefix.
How do I get the correct results from AOI2 (lines with a single $ in a tabular format)?
EDIT 2:
@furas 2021!
Adding the following logic to aoi2 function worked great.
col_array = []
for l in rptf.readlines():
    if not l.startswith(b"$$") and l.startswith(b"$"):
        col_array.append(l)
return col_array
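Putting both edits together, here is a rough sketch of one combined function (get_aois is a made-up name) that returns decoded, stripped text for both AOIs, assuming the same testZip.zip layout:

import zipfile

def get_aois(zip_path):
    aoi1 = None
    aoi2 = []
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            with z.open(name, 'r') as rptf:
                for raw in rptf:
                    line = raw.strip().decode()
                    if line.startswith("$$"):
                        if aoi1 is None:
                            aoi1 = line        # first $$ line only
                    elif line.startswith("$"):
                        aoi2.append(line)      # single-$ lines for the tabular AOI
    return aoi1, aoi2

aoi1, aoi2 = get_aois('testZip.zip')
print(aoi1)
print(aoi2)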
I am looking to use the code I have here to match domains to their DNS resolver names.
Current CSV output
domain1 dns1 dns2 dns3 \n domain2 dns1 dns2 dns3 \n etc
This format is incorrect because it puts all domains and DNS resolvers into the same row instead of starting a new row for each domain; they are only separated by a blank cell caused by the newline character. I instead want it written as below, where each domain and its DNS resolvers get their own row.
Expected CSV output:
domain1 dns1 dns2 dns3
domain2 dns1 dns2 dns3
domain3 dns1 dns2 dns3
etc...
I want the CSV file to be written in the correct format. With the code I have, every time a domain is passed to dns_resolver it should move on to a new list index, so that each domain and its DNS resolvers get their own list and therefore their own row in the CSV file.
Currently the code does not iterate through the list indexes correctly and does not add the domain and its DNS names to any list. When everything is written into a single list it works, but then it all ends up in one row, which is incorrect. So instead of one list I want a list of lists, writing each domain to its own inner list and then each inner list to its own CSV row. Normally the domains are read into a list from a CSV file, but for the sake of this example I entered three values.
import dns.resolver
import csv
import os
from os.path import dirname, abspath

r = 0

def dns_resolver(domain):
    server = []
    resolvers = []
    try:
        resolvers = dns.resolver.resolve(domain, 'NS')
        #dns_list.append(domain)
        for x in resolvers:
            #dns_list.append(x.target)
            #dns_list.append('\n')
            server.append(str(x.target))
    except:
        server.append('did not resolve')
    finally:
        return (domain, *server)

# Read in all domains from csv file domains.csv & count how many domains there are listed
domain_list = ['google.com', 'facebook.com', 'github.com']
domain_amount = 0
with open(domainFName, 'r') as file:
    for line in csv.reader(file):
        name = (line)
        domain_list.append(line)
        domain_amount += 1

for first_domain in domain_list:
    for x in first_domain:
        outputWriter.writerow(dns_resolver(x))
You can simply make your dns_resolver function return a list for a given domain.
The *server unpacking is shorthand for spreading each resolved server into the returned list.
Using a list comprehension, collect all of these lists into a list of lists to write to the CSV.
def dns_resolver(domain):
    # do your dns resolution
    # server = dns.resolver.resolve(domain, 'NS')
    server = ["dns1", "dns2", "dns3", "dns4"]
    return [domain, *server]

# Read in all domains
domain_list = ['google.com', 'facebook.com', 'github.com']
print([dns_resolver(d) for d in domain_list])
Output:
[
['google.com', 'dns1', 'dns2', 'dns3', 'dns4'],
['facebook.com', 'dns1', 'dns2', 'dns3', 'dns4'],
['github.com', 'dns1', 'dns2', 'dns3', 'dns4']
]
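To get each domain onto its own CSV row, you can then hand that list of lists to csv.writer.writerows. A minimal sketch, assuming an output file name of dns_output.csv:

import csv

rows = [dns_resolver(d) for d in domain_list]
with open("dns_output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)   # each inner list becomes its own row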
I have a bunch of log files with the following format:
[Timestamp1] Text1
Text2
Text3
[Timestamp2] Text4
Text5
...
...
Where the number of text lines following a timestamp can vary from 0 to many. All the lines following a timestamp until the next timestamp are part of the previous log statement.
Example:
[2016-03-05T23:18:23.672Z] Some log text
[2016-03-05T23:18:23.672Z] Some other log text
[2016-03-05T23:18:23.672Z] Yet another log text
Some text
Some text
Some text
Some text
[2016-03-05T23:18:23.672Z] Log text
Log text
I am trying to create a log merge script for these kinds of log files and have been unsuccessful so far.
If the logs were in a standard format where each line is a separate log entry, it would be straightforward to create a log merge script using fileinput and sorting.
I think I am looking for a way to treat multiple lines as a single log entry that is sortable on the associated timestamp.
Any pointers?
You can write a generator that acts as an adapter for your log stream to do the chunking for you. Something like this:
def log_chunker(log_lines):
    batch = []
    for line in log_lines:
        if batch and has_timestamp(line):
            # detected a new log statement, so yield the previous one
            yield batch
            batch = []
        batch.append(line)
    yield batch
This will turn your raw log lines into batches where each one is a list of lines, and the first line in each list has the timestamp. You can build the rest from there. It might make more sense to start batch as an empty string and tack on the rest of the message directly; whatever works for you.
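For completeness, has_timestamp above is a placeholder; a minimal sketch, assuming the bracketed ISO-8601 timestamps shown in the example:

import re

# Matches lines that begin with "[YYYY-MM-DDT...", per the example format.
TIMESTAMP_RE = re.compile(r'^\[\d{4}-\d{2}-\d{2}T')

def has_timestamp(line):
    return TIMESTAMP_RE.match(line) is not None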
Side note: if you're merging multiple timestamped logs, you shouldn't need to perform a global sort at all if you use a streaming merge-sort.
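A sketch of that streaming merge with heapq.merge, reusing log_chunker (merge_logs and the file names are made up for illustration; each input file must already be in timestamp order, and the key argument needs Python 3.5+):

from heapq import merge

def merge_logs(paths, out_path):
    # Each batch starts with its bracketed ISO timestamp, so the first
    # line of the batch works as the sort key.
    files = [open(p) for p in paths]
    try:
        with open(out_path, 'w') as out:
            batches = merge(*(log_chunker(f) for f in files),
                            key=lambda batch: batch[0])
            for batch in batches:
                out.writelines(batch)
    finally:
        for f in files:
            f.close()

merge_logs(['a.log', 'b.log'], 'merged.log')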
The following approach should work well.
from heapq import merge
from itertools import groupby
import re
import glob
re_timestamp = re.compile(r'\[\d{4}-\d{2}-\d{2}')
def get_log_entry(f):
    entry = ''
    for timestamp, g in groupby(f, lambda x: re_timestamp.match(x) is not None):
        entries = [row.strip() + '\n' for row in g]
        if timestamp:
            if len(entries) > 1:
                for entry in entries[:-1]:
                    yield entry
            entry = entries[-1]
        else:
            yield entry + ''.join(entries)

files = [open(f) for f in glob.glob('*.log')]    # Open all log files

with open('output.txt', 'w') as f_output:
    for entry in merge(*[get_log_entry(f) for f in files]):
        f_output.write(''.join(entry))

for f in files:
    f.close()
It makes use of the merge function to combine a list of iterables in order.
As your timestamps are naturally ordered, all that is needed is a function to read whole entries at a time from each file. This is done using a regular expression to spot the lines in each file that start with a timestamp, and groupby is used to read the matching rows in at once.
glob is used to first find all files in your folder with a .log extension.
You can easily break it into chunks using re.split() with a capturing regexp:
pieces = re.split(r"(^\[20\d\d-.*?\])", logtext, flags=re.M)
You can make the regexp as precise as you wish; I just require [20\d\d- at the start of a line. The result contains the matching and non-matching parts of logtext, as alternating pieces (starting with an empty non-matching part).
>>> print(pieces[:5])
['', '[2016-03-05T23:18:23.672Z] ', 'Some log text\n', '[2016-03-05T23:18:23.672Z] ', 'Some other log text\n']
It remains to reassemble the log parts, which you can do with this recipe from itertools:
import itertools

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)
log_entries = list("".join(pair) for pair in pairwise(pieces[1:]))[::2]   # keep every other pair: (timestamp, body)
If you have several of these lists, you can indeed just combine and sort them, or use a fancier merge sort if you have lots of data. I understand your question to be about splitting up the log entries, so I won't go into this.
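For the simple combine-and-sort case, a sketch (log_entries_a and log_entries_b are hypothetical lists built the same way from two files; the ISO timestamps at the start of each entry sort correctly as plain strings):

import itertools

merged = sorted(itertools.chain(log_entries_a, log_entries_b))
with open("merged.log", "w") as out:
    out.writelines(merged)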
In a folder I have files with names like the following:
Q1234_ABC_B02_12232.hl7
12313_SDDD_Q03_44545.hl7
Q43434_SAD_B02_2312.hl7
4324_SDSD_W05_344423423.hl7
3123123_DSD_D06_67578.hl7
and many such files
I need to write a Python script to count the number of files whose names begin with "Q" and have "B02" after the second underscore, which means I should get a count of 2. I have tried the following script but have not gotten the desired result.
import re
import os

resultsDict = {}
myString1 = ""
regex = r'[^_]+_([^_]*)_.*'
for file_name in os.listdir("."):
    m = file_name.split("_")
    if len(m) > 2:
        myString = m[2]
        if "B02" in myString:
            myString1 = myString
            if myString1 in resultsDict:
                resultsDict[myString1] += 1
            else:
                resultsDict.update({myString1: 1})
    else:
        print "error in the string! there are less then 2 _"
print resultsDict
I am using python 2.6.6. Any help would be useful.
As of this writing, there are several answers with a wrong regex.
One of these is probably better:
r'^Q[^_]*_[^_]*_B02_.*'
r'^Q[^_]*_[^_]*_B02.*'
r'^Q[^_]*_[^_]*_B02(_.*|$)'
If you use .* instead, the regex might consume an intermediate underscore, so you are no longer able to enforce that B02 comes after the second _.
After that, testing for matching values (re.match) is a simple loop over the various file names (os.listdir or glob.glob). Here is an example using a list comprehension:
>>> l = [file for file in os.listdir(".") if re.match(r'^Q[^_]*_[^_]*_B02.*', file)]
>>> l
['Q1234_ABC_B02_12232.hl7', 'Q43434_SAD_B02_2312.hl7']
>>> len(l)
2
For better performance you might wish to compile the regex first (re.compile).
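For instance, a minimal sketch combining one of the patterns above with a precompiled regex, run from the folder containing the files:

import os
import re

pattern = re.compile(r'^Q[^_]*_[^_]*_B02(_.*|$)')
matching = [f for f in os.listdir(".") if pattern.match(f)]
print(len(matching))   # 2 with the sample file names from the question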
Since a comment by @camh above makes me think that you may have jumped into Python because you couldn't find a shell-based solution, here is how to do the same thing using only bash:
sh$ shopt -s extglob
sh$ ls Q*([^_])_*([^_])_B02*
Q1234_ABC_B02_12232.hl7 Q43434_SAD_B02_2312.hl7
sh$ ls Q*([^_])_*([^_])_B02* | wc -l
# Beware: counting with `ls ... | wc -l` *won't* work if some file names contain '\n' !!!
Use a regular expression:
import re
import os

resultsDict = {}
expression = "^Q.*_.*_B02_.*"
p = re.compile(expression)
for file_name in os.listdir("."):
    if p.match(file_name):
        if file_name in resultsDict:
            resultsDict[file_name] = resultsDict[file_name] + 1
        else:
            resultsDict[file_name] = 1
You can try with this regular expression:
'^Q.*_.*_B02_.*'
This code will match all files in the current directory according to your requirements.
import os
import re

regex = r'^Q\w+_\w+_B02'  # This will match any word character between the underscores
for f in os.listdir("."):
    if re.match(regex, f, re.I):
        print f
A word character is A-Z, a-z, 0-9, or an underscore.
A solution with list comprehensions instead of regular expressions. First, get all the file names that start with Q, and split them on the underscores:
import os
dirs = [d.split('_') for d in os.listdir(".") if d.startswith('Q')]
Now get all names with two underscores or more:
dirs = [d for d in dirs if len(d) > 2]
Finally, narrow it down:
dirs = [d for d in dirs if d[2] == 'B02']
You could combine the last two comprehensions into one:
dirs = [d for d in dirs if len(d) > 2 and d[2] == 'B02']
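Since the goal is a count, the final number is then just the length of that list:

print(len(dirs))   # 2 with the sample file names above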
I'm on Python 2.7.1 and I'm trying to identify all text files that don't contain some text string.
The program seemed to be working at first but whenever I add the text string to a file, it keeps coming up as if it doesn't contain it (false positive). When I check the contents of the text file, the string is clearly present.
The code I tried to write is
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    fList = []
    for fol, fols, fils in os.walk(rdir):
        fList.extend([os.path.join(rdir, fol, fil) for fil in fils if fil.endswith(extens) and fil.startswith(start)])
    if fList:
        for fil in fList:
            rFil = open(fil)
            for line in rFil:
                if not cSens:
                    line, sstring = line.lower(), sstring.lower()
                if sstring in line:
                    fList.remove(fil)
                    break
            rFil.close()
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n' % (len(fList), plur, sstring)
        for fil in fList:
            print fil
    else:
        print 'No files were found that don\'t contain %(sstring)s.' % locals()

scanFiles2(rdir=r'C:\temp', sstring='!!syn', extens='.html', start='#', cSens=False)
I guess there's a flaw in the code but I really don't see it.
UPDATE
The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.
Could text encoding be an issue here? I prefixed the search string with U to account for Unicode encoding but it didn't make any difference.
Does Python in some way cache file contents? I don't think so but that could somewhat account for files to still pop up after having been corrected.
Could some kind of malware cause symptoms like these? Seems highly unlikely to me but I'm kinda desperate to get this fixed.
Modifying a list while iterating over it causes unexpected results:
For example:
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 4, 3, 0, 5]
Workaround: iterate over a copy.
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst[:]:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 3, 5]
Alternatively, you can append the valid file paths instead of removing from the whole file list.
Modified version (appending files that do not contain sstring instead of removing them):
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # This only needs to be done once.
        sstring = sstring.lower()

    fList = []
    for fol, fols, fils in os.walk(rdir):
        for fil in fils:
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil:
                    if not cSens:
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    fList.append(fil)
    ...
list.remove takes O(n) time, while list.append takes O(1). See Time Complexity.
Use the with statement if possible.
falsetru already showed you why you should not remove lines from a list while looping over it; list iterators do not and cannot update their position when the list is shortened, so if the item at index 3 was processed and then removed, the next iteration reads index 4, which now holds the element that was previously at index 5 (the old item 4 is skipped).
List comprehension version using fnmatch.filter() and any(), with a filter lambda for case-insensitive matching:
import os
import fnmatch

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    lfilter = sstring.__eq__ if cSens else lambda l, s=sstring.lower(): l.lower() == s
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(r, fname)
            for r, _, f in os.walk(rdir)
            for fname in fnmatch.filter(f, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(r, fname)))]
but perhaps you'd be better off sticking to a more readable loop:
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    lfilter = sstring.__eq__ if cSens else lambda l, s=sstring.lower(): l.lower() == s
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result
Another alternative that opens the searching up to using regular expressions (although just using grep with appropriate options would still be better):
import mmap
import os
import re
import fnmatch
def scan_files(rootdir, search_string, extension, start='', case_sensitive=False):
    rx = re.compile(re.escape(search_string), flags=re.I if not case_sensitive else 0)
    name_filter = start + '*' + extension
    for root, dirs, files in os.walk(rootdir):
        for fname in fnmatch.filter(files, name_filter):
            with open(os.path.join(root, fname)) as fin:
                try:
                    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                except ValueError:
                    continue  # empty files etc.... include this or not?
                if not next(rx.finditer(mm), None):
                    yield fin.name
Then use list on that if you want the names materialised or treat it as you would any other generator...
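For example, materialising the generator with the same arguments as the question's call:

missing = list(scan_files(r'C:\temp', '!!syn', '.html', start='#'))
for path in missing:
    print(path)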
Please do not write a Python program for that. This program already exists. Use grep:
grep * -Ilre 'main' 2> /dev/null
99client/.git/COMMIT_EDITMSG
99client/taxis-android/build/incremental/mergeResources/production/merger.xml
99client/taxis-android/build/incremental/mergeResources/production/inputs.data
99client/taxis-android/build/incremental/mergeResources/production/outputs.data
99client/taxis-android/build/incremental/mergeResources/release/merger.xml
99client/taxis-android/build/incremental/mergeResources/release/inputs.data
99client/taxis-android/build/incremental/mergeResources/release/outputs.data
99client/taxis-android/build/incremental/mergeResources/debug/merger.xml
99client/taxis-android/build/incremental/mergeResources/debug/inputs.data
(...)
http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#Introduction
If you need the list in Python, simply execute grep from it and collect the result.
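A rough sketch of that with subprocess, searching the current directory (note that grep's -L option, unlike -l, lists the files that do not contain the pattern, which is what the question asks for; the pattern and .html filter come from the question):

import subprocess

# -L lists files that do NOT contain the pattern, -r recurses,
# -i ignores case, --include limits the search to .html files.
proc = subprocess.Popen(
    ['grep', '-riL', '--include=*.html', '!!syn', '.'],
    stdout=subprocess.PIPE)
out, _ = proc.communicate()
missing = out.splitlines()
print(missing)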