regex to find match in element of list - python

I'm new to Python and have compiled a list of items from a file; each item pairs an element that appeared in the file with its frequency in the file, like this:
('95.108.240.252', 9)
It's mostly IP addresses I'm gathering. I'd like to output the address and frequency like this instead:
IP Frequency
95.108.240.252 9
I'm trying to do this by running a regex over the list item and printing the result, but when I try it returns: TypeError: expected string or bytes-like object
This is the code I'm using to do all of this now:
import re

ips = []  # IP address list
for line in f:
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)  # Get all IPs line by line
    if match:
        ips.append(match.group())  # if found, add to list

from collections import defaultdict
freq = defaultdict(int)
for i in ips:
    freq[i] += 1  # get frequency of IPs

print("IP\t\t Frequency")  # Print header
freqsort = sorted(freq.items(), reverse=True, key=lambda item: item[1])  # sort in descending frequency
for c in range(0, 4):  # print the 4 most frequent IPs
    # print(freqsort[c])  # This line prints the item like ('95.108.240.252', 9)
    m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c])  # This is the line returning the error - trying to parse the IP on its own from the list
    print(m1.group())  # Then print it
I'm not trying to parse the frequency yet; I just wanted the IPs as a starting point.

The second parameter to re.search() should be a string, but you are passing a tuple, so it raises an error saying it expected a string or buffer.
NOTE: you also need to make sure there are at least 4 entries in the frequency list, otherwise you will get an index-out-of-range error.
Delete the last two lines and use this instead:
print(freqsort[c][0])
If you want to stick to your regex format you can use the following, though the regex is redundant here:
m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c][0])  # parse the IP on its own from the tuple's first element
print(m1.group())
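Since each entry of freqsort is already an (ip, count) tuple, you can also print both columns with plain tuple unpacking; a minimal sketch (slicing with [:4] also sidesteps the index-out-of-range case):
# unpack each (ip, count) pair; freqsort[:4] is safe even with fewer than 4 entries
print("IP\t\t Frequency")
for ip, count in freqsort[:4]:
    print(f"{ip}\t{count}")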

If your input is a bytes object (e.g. read from a file opened in binary mode), use a bytes pattern instead:
# notice the `b` before the quotes.
match = re.search(rb'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)

Try a regex with positive lookbehind and lookahead, applied to the string form of the tuple:
(?<=\(\')(.*)(?=\').*(\d+)
The first captured group will be your IP and the second the frequency.
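A small sketch of how that would be wired up; note the item has to go through str() first, and for multi-digit counts the trailing part is safer written lazily with an end anchor (the greedy .* otherwise leaves only the last digit in the second group):
import re

item = ('95.108.240.252', 9)
# lookbehind/lookahead pull the quoted IP; the lazy .*? plus \)$ anchors the full count
m = re.search(r"(?<=\(')(.*)(?=').*?(\d+)\)$", str(item))
if m:
    print(m.group(1), m.group(2))  # 95.108.240.252 9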

You can use ipaddress and Counter from the stdlib to assist with this:
from collections import Counter
from ipaddress import ip_address

with open('somefile.log') as fin:
    ips = Counter()
    for line in fin:
        ip, rest_of_line = line.partition(' ')[::2]
        try:
            ips[ip_address(ip)] += 1
        except ValueError:
            pass

print(ips.most_common(4))
This'll also handle both IPv4 and IPv6 style addresses and make sure they're technically correct, not just "look" correct. Using a collections.Counter also gives you a .most_common() method to automatically sort by the most frequent and limit the output to the top n.
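To see the validation aspect in isolation: ip_address() parses the address rather than pattern-matching it, so strings that merely look like addresses are rejected. A quick sketch:
from ipaddress import ip_address

# '999.1.1.1' satisfies a \d{1,3} regex but is not a valid address
for candidate in ('95.108.240.252', '999.1.1.1', '2001:db8::1'):
    try:
        print(candidate, '->', ip_address(candidate))
    except ValueError:
        print(candidate, '-> rejected: not a valid IP address')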

Related

I need to compare two sets and alert on any matched elements

Brand new to Python.
First
I want to pull in a list of IP addresses from the body of a webpage using requests. I have that list and have used splitlines() to format it and remove anything other than IP addresses. I am adding this list to a set.
Second
I want to pull in a list of IP addresses from a CSV file. I have the list, have also formatted it using splitlines(), and added it to a set. However, if I run len() on the set, I am missing around 1,000 lines (out of 18,000).
Additionally, I've tried several different ways to compare the sets, but I don't seem to be getting any red flags that an element exists in both sets. This could be due to the missing lines.
4 hours' worth of Googling - finally decided to ask for help.
r = requests.get(url)
black = set()
for line in r.text.splitlines():
    bip = line.split(' ')[0]
    black.add(bip)
# print(black)  # Print for testing

file = "file_wip.csv"
white = set()
with open(file, 'r') as filehandle:
    for line in filehandle:
        wip = line.split(',')[0]
        white.add(wip)
# print(white)  # Print for testing

# black.intersection(white)  <-- my attempts to compare
# set(black) == set(white)
1. len() on the sets does not provide an accurate line count
2. comparing the sets comes back empty
Your logic seems to be correct:
black = set(['93.43.2.3', '83.23.2.2', '98.21.2.4'])
white = set(['54.54.3.2', '90.90.32.3', '98.21.2.4'])
print(black.intersection(white))
Output
{'98.21.2.4'}
Have you checked the print(black) and print(white) output for any discrepancies?
If your data has duplicate values, the sets will remove them; that might be the reason for the length mismatch.
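Another common source of both symptoms is invisible differences such as trailing '\r' or stray whitespace, which make visually identical entries compare unequal. A hedged sketch of normalizing both inputs before comparing (normalize is a hypothetical helper, not part of your code):
def normalize(lines, sep):
    # hypothetical helper: first field only, stripped of whitespace/CR; skip blanks
    return {line.split(sep)[0].strip() for line in lines if line.strip()}

black = normalize(r.text.splitlines(), ' ')
with open("file_wip.csv") as filehandle:
    white = normalize(filehandle, ',')

print(len(black), len(white), len(black & white))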

How do I print this list in a readable form?

I have written a short Python script to search a logfile for urls with a certain HTTP status code. The script works as intended and counts how often a url is requested in combination with that status code. The dictionary with the results is unsorted, so I sorted the data afterwards using the values in the dictionary. This part of the script works as intended and I get a sorted list with the urls and the counts. The list looks like:
([('http://example1.com"', 1), ('http://example2.com"', 5), ('http://example3.com"', 10)])
I just want to make it more readable and print the list in rows:
http://example1.com 1
http://example2.com 5
http://example3.com 10
I started with Python only two weeks ago and I can't find a solution. I tried several solutions I found here on Stack Overflow but nothing works. My current solution prints all urls in separate rows but does not show the count. I can't use a comma as a separator because some urls in my logfile contain commas. I'm sorry for my bad English and the stupid question. Thank you in advance.
from operator import itemgetter
from collections import OrderedDict

d = dict()
with open("access.log", "r") as f:
    for line in f:
        line_split = line.split()
        list = line_split[5], line_split[8]
        url = line_split[8]
        string = '407'
        if string in line_split[5]:
            if url in d:
                d[url] += 1
            else:
                d[url] = 1

sorted_d = OrderedDict(sorted(d.items(), key=itemgetter(1)))
for element in sorted_d:
    parts = element.split(') ')
    print(parts)
for url, count in sorted_d.items():
    print(f'{url} {count}')
Replace your last for loop with the above.
To explain: we unpack the url, count pairs of sorted_d in the for loop, and then use an f-string to print the url and count separated by a space.
First, if you're already importing from the collections library, why not import a Counter?
from collections import Counter

d = Counter()
with open("access.log", "r") as f:
    for line in f:
        line_split = line.split()
        url = line_split[8]
        if '407' in line_split[5]:
            d[url] += 1

for key, value in d.most_common():  # or reversed(d.most_common())
    print(f'{key} {value}')
There are many good tutorials on how to format strings in Python, such as this one.
Here is an example of how to print a dictionary as a table. I set the width of the columns with the variables c1 and c2.
c1 = 34; c2 = 10
printstr = '\n|%s|%s|' % ('-' * c1, '-' * c2)
for key in sorted(d.keys()):
    val_str = str(d[key])
    printstr += '\n|%s|%s|' % (str(key).ljust(c1), val_str.rjust(c2))
printstr += '\n|%s|%s|\n\n' % ('-' * c1, '-' * c2)
print(printstr)
The string method ljust() creates a string of the length passed as an argument, with the content left-justified; rjust() does the same but right-justified.
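A one-line illustration of the padding (using the c1/c2 widths from above):
# ljust pads on the right to 34 chars, rjust pads on the left to 10
print('|' + 'http://example3.com'.ljust(34) + '|' + '10'.rjust(10) + '|')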

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m<somenumber>. For example, some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m number to the smallest. Then, when I loop through PATH in database (shown in the code), I am looping through in decreasing m.
Here is the relevant part of the code:
base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()

counter = 0
mu_list = np.array([])
delta_list = np.array([])
ofsset = 0.00136
beta = 0

for PATH in database:
    if os.path.exists(str(PATH)+'/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.n.dat')
        n7_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0+ofsset-(n1_array[0]*2.+4.*n7_array[0])), 6)
        par = open(str(PATH)+"/params10", "r")
        for line in par:
            counter = counter+1
            if re.match("mu", line):
                mioMU = re.findall('\d+', line.translate(None, ';'))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
        print mu, delta, PATH
        mu_list = np.append(mu_list, mu)
        delta_list = np.append(delta_list, delta)
        optimal_counter = 0

print delta_list, mu_list
I have checked the possible flagged duplicate but I can't seem to get it to work for mine, because my file doesn't technically contain separate strings and numbers. The 'number' I need to sort by is contained in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m<somenumber> part.
Assuming that the number part of your line has the form of a float, you can use a regular expression to match that part and convert it from string to float.
After that you can use this information to sort all the lines read from your file. I added an invalid line in order to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)
# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines are read from the file and stored in a list (called l here). The regex group criterion catches the float part contained in **m12.50**, as you can see on regex101. Iterating through all the lines then gives you a new list containing all matching groups as floats. If the regex does not match on a given string, or casting the group to a float fails, crit is set to zero in order to have those invalid lines at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples based on each tuple's first element and write the corresponding string to a new list, output.
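Applied to the actual paths from the question (no ** markers), a shorter variant is to extract the number inside a key function and sort in reverse, since the question asks for largest m first. A sketch; the pattern is an assumption based on the sample lines:
import re

def m_number(path):
    # assumed layout: ...u14m<float>_all.beta/... ; lines without it sort last
    m = re.search(r'm(\d+(?:\.\d+)?)_', path)
    return float(m.group(1)) if m else float('-inf')

with open('base8.dat') as f:
    lines = f.read().splitlines()

lines.sort(key=m_number, reverse=True)  # largest m first

with open('base8.dat', 'w') as f:
    f.write('\n'.join(lines) + '\n')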

In Python,if startswith values in tuple, I also need to return which value

I have an area codes file that I put in a tuple:
for line1 in area_codes_file.readlines():
    if area_code_extract.search(line1):
        area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do two things:
1 is to keep the number
2 is to know which area code it matched, as I need to put area codes in brackets.
So far, I was only able to do 1:
for line in txt.readlines():
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line)
How do I do the second part?
The simple (if not necessarily highest-performance) approach is to check each prefix individually and keep the first match:
for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line, next(filter(line.startswith, area_codes)))
Since we know filter(line.startswith, area_codes) will get at least one hit (the startswith check just passed), we just pull the first hit using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
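A tiny illustration of the next(filter(...)) idiom with made-up codes:
# made-up data, purely to show the idiom
area_codes = ('0113', '0114', '0161')
line = '0161 496 0000'
print(next(filter(line.startswith, area_codes)))  # -> 0161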
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re

# Function that will match any of the given prefixes, returning a match object on a hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Returns None on miss, match object on hit
        m = area_code_matcher(line)
        if m is not None:
            # Whatever matched is in the 0th grouping
            print(line, m.group())
Lastly, one final approach you can use if the area codes are of fixed length: rather than using startswith, you can slice directly, and you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset will allow much faster lookup
area_codes_set = frozenset(area_codes)

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Assuming lines that match always start with ###
        if line[:3] in area_codes_set:
            print(line, line[:3])

Match lines from a file and parse them on Python

I have this file with different lines, and I want to take only some information from each line (not all of it). Here is a sample of what the file looks like:
18:10:12.960404 IP 132.227.127.62.12017 > 134.157.0.129.53: 28192+ A? safebrowsing-cache.google.com. (47)
18:10:12.961114 IP 134.157.0.129.53 > 132.227.127.62.12017: 28192 12/4/4 CNAME safebrowsing.cache.l.google.com., A 173.194.40.102, A 173.194.40.103, A 173.194.40.104, A 173.194.40.105, A 173.194.40.110, A 173.194.40.96, A 173.194.40.97, A 173.194.40.98, A 173.194.40.99, A 173.194.40.100, A 173.194.40.101 (394)
18:13:46.206371 IP 132.227.127.62.49296 > 134.157.0.129.53: 47153+ PTR? b._dns-sd._udp.upmc.fr. (40)
18:13:46.206871 IP 134.157.0.129.53 > 132.227.127.62.49296: 47153 NXDomain* 0/1/0 (101)
18:28:57.253746 IP 132.227.127.62.54232 > 134.157.0.129.53: 52694+ TXT? time.apple.com. (32)
18:28:57.254647 IP 134.157.0.129.53 > 132.227.127.62.54232: 52694 1/8/8 TXT "ntp minpoll 9 maxpoll 12 iburst" (381)
.......
.......
It is actually the output of a DNS capture, and from it I want to extract these elements:
[timestamp], [src ip], [src prt], [dst ip], [dst prt], [domain (if present)], [related ip addresses]
After looking through old topics on the site, I found that re.match() is a great and helpful way to do that, but since, as you can see, every line is different from the others, I am kind of lost; some help would be great. Here is the code I have written so far:
import re
import sys

def extractDNS(filename):
    objList = []
    obj = {}
    with open(filename) as fi:
        for line in fi:
            line = line.lower().strip()
            # 18:09:29.960404
            m = re.match(r"(\d+):(\d+):(\d+\.\d+)", line)
            if m:
                obj = {}  # New object detected
                hou = int(m.group(1))
                min = int(m.group(2))
                sec = float(m.group(3))
                obj["time"] = (hou * 3600) + (min * 60) + sec
                objList.append(obj)
            # IP 134.157.0.129.53 -- use re.search since this sits mid-line,
            # and lowercase "ip" because the line was lowercased above
            m = re.search(r"ip\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.(\d+)", line)
            if m:
                obj["dnssrcip"] = m.group(1)
                obj["dnssrcport"] = m.group(2)
            # > 134.157.0.129.53:
            m = re.search(r"\s+>\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.(\d+):", line)
            if m:
                obj["dnsdstip"] = m.group(1)
                obj["dnsdstport"] = m.group(2)
    tstFile3 = open("outputFile", "w+")
    tstFile3.write("%s\n" % objList)
    tstFile3.close()

extractDNS(sys.argv[1])
I know I have to add if/else statements after this, because what comes after differs every time. These are the 3 cases I generally get in every DNS output file:
- A? followed by CNAME, the exact domain, and the IP addresses,
- PTR? followed by NXDomain, meaning the domain is non-existent, so I will just ignore this line,
- TXT? followed by a domain, but it only gives words, so I'll ignore this one too.
I only want the requests whose responses contain IP addresses, which in this case are the A? ones.
If you know the first 5 columns are always present, why not just split the line up and handle those directly (use datetime for the timestamp, and manually parse the IP addresses/ports)? Then you can use your regular expression to match only the CNAME records and the contents you are interested in from that one field. There is no need to have a regular expression scan over all the different possibilities if you aren't going to use the output anyway: if a line doesn't match the CNAME form, you don't care how it would be handled. At least, that's what it sounds like. A rough sketch of the split-first idea is below.
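A minimal sketch, with field positions assumed from the sample lines above:
# positions assumed from the sample: <time> IP <src> > <dst>: ...
line = '18:10:12.960404 IP 132.227.127.62.12017 > 134.157.0.129.53: 28192+ A? safebrowsing-cache.google.com. (47)'
fields = line.split()
timestamp = fields[0]                               # '18:10:12.960404'
src_ip, _, src_port = fields[2].rpartition('.')     # '132.227.127.62', '12017'
dst_ip, _, dst_port = fields[4].rstrip(':').rpartition('.')
print(timestamp, src_ip, src_port, dst_ip, dst_port)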
As user632657 said above, there's no need to care about the lines you, well, don't care about. Just use one regular expression per line, and if it doesn't match, ignore that line:
pattern = re.compile(r'(\d{2}):(\d{2}):(\d{2}\.\d+)\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+(\d+).*?CNAME\s+([^,]+),\s+(.*?)\s+\(\d+\)')
That will match the CNAME records only. You only need to define it once, outside your loop. Then, within the loop:
try:
    # the pattern has ten groups: the query id after the colon is one of them
    (hour, minute, seconds, source_ip, source_port,
     dst_ip, dst_port, query_id, domain, records) = pattern.match(line).groups()
except AttributeError:  # pattern.match() returned None: not a CNAME line
    continue
records = [r.split() for r in records.split(', ')]
This will pull all the fields you've asked about into the relevant variables, and parse the associated records into a list of (type, value) pairs, which I figure will be useful :P
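For the sample CNAME line above, records would come out as [['A', '173.194.40.102'], ['A', '173.194.40.103'], ...], with the CNAME target itself already captured separately in domain.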
