I need to compare two sets and alert on any matched elements - python

Brand new to Python.
First
I want to pull in a list of IP addresses from the body of a webpage using requests. I have that list, have used splitlines() to format it and remove anything other than IP addresses, and am adding the result to a set.
Second
I want to pull in a list of IP addresses from a CSV file. I have the list, have also formatted it using splitlines(), and added it to a set. However, if I run len() on the set, I am missing around 1,000 lines (out of 18,000).
Additionally, I've tried several different ways to compare the sets, but I don't seem to be getting any red flags that an element exists in both sets. This could be due to the missing lines.
Four hours' worth of Googling later, I finally decided to ask for help.
import requests

r = requests.get(url)

black = set()
for line in r.text.splitlines():
    bip = line.split(' ')[0]
    black.add(bip)
# print(black)  # Print for testing

file = "file_wip.csv"
white = set()
with open(file, 'r') as filehandle:
    for line in filehandle:
        wip = line.split(',')[0]
        white.add(wip)
# print(white)  # Print for testing

# black.intersection(white)  <-- my attempts to compare
# set(black) == set(white)
1. len() on the sets does not give an accurate line count
2. comparing the sets produces nothing

Your logic seems to be correct:
black = set(['93.43.2.3', '83.23.2.2', '98.21.2.4'])
white = set(['54.54.3.2', '90.90.32.3', '98.21.2.4'])
print(black.intersection(white))
Output
{'98.21.2.4'}
Have you checked the print(black) and print(white) output for any discrepancies?
If your data has duplicate values, they will be removed when added to a set. That might be the reason for the length mismatch.
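A quick way to confirm that is to compare the raw line count with the set size (a minimal sketch, assuming the file_wip.csv from the question):
from collections import Counter

with open("file_wip.csv", "r") as filehandle:
    ips = [line.split(',')[0] for line in filehandle]

print(len(ips), len(set(ips)))      # if these differ, you have duplicates
print(Counter(ips).most_common(5))  # the most frequently duplicated values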

Related

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the following:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line
            #    that includes "Subject:". Ignore the "Subject:" line, stop the
            #    "Full text:" subloop, and add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
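For instance, here is the kind of headache that shadowing can cause (a hypothetical session):
list = []    # rebinds the name list, hiding the built-in class
list("abc")  # TypeError: 'list' object is not callable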
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
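For reference, a minimal sketch of that iterator approach (untested, and assuming the same sample.txt format as above):
with open('sample.txt', encoding="utf8") as f:
    it = iter(f)
    titles, texts, subjects = [], [], []
    try:
        line = next(it)
        while True:
            if line.startswith("Title:"):
                titles.append(line)
                line = next(it)
            elif line.startswith("Full text:"):
                full_text = line
                line = next(it)
                # consume continuation lines until the Subject: line
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(it)
                texts.append(full_text)
            elif line.startswith("Subject:"):
                subjects.append(line)
                line = next(it)
            else:
                line = next(it)
    except StopIteration:
        pass  # end of file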
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split the text on the keys
chunks = re.split(regex, text)[1:]

# reshape the flat list into (key, value) pairs grouped per article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
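To see why the reshape works, it helps to look at what re.split produces (a sketch of the intermediate value for the sample data; the exact strings may differ):
chunks = re.split(regex, text)[1:]
# ['Title', 'title of an article',
#  'Full text', 'unfortunately the full text of each article,\n...',
#  'Subject', 'Python',
#  'Title', 'title of another article',
#  ...]
# reshape(-1, 3, 2) turns this flat list into one 3x2 block per article,
# i.e. three (key, value) pairs, and dict() turns each block into a record.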

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have an "On", "Off", or "Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
if (words[0] in l or words[1] in l or words[2] in l):
log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before the script: everything after the "In" time is useless for what I'm looking for, so I only output the first three indices
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates an empty set, or a one-element set containing True, depending on whether one of the words matches.
(Untested)
import re

logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)

with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to a list or insert in a dict
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted) a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
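For example, a minimal sketch of that strptime conversion and the duration subtraction (using the timestamps from the example line; the trailing Z is assumed to mean UTC):
from datetime import datetime

fmt = '%Y-%m-%dT%H:%M:%S.%f'
# strptime has no directive for the trailing 'Z', so strip it first
out_dt = datetime.strptime('2020-01-31T00:30:20.150Z'.rstrip('Zz'), fmt)
in_dt = datetime.strptime('2020-01-31T00:30:20.140Z'.rstrip('Zz'), fmt)

duration = out_dt - in_dt        # a datetime.timedelta
print(duration.total_seconds())  # 0.01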
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']

all_text = " ".join(data)

# This is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time.
# Iterate pairs of ("bad thing", "what to replace it with") (or a list of bad things):
for thing in [(": ", ":"), (list('[]{}"'), "")]:
    whatt = thing[0]
    withh = thing[1]
    # if it's a list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# the format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning, your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict(part.split(":", 1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in 0 0's post.

Convert .txt into .csv when some rows have missing data for certain columns (python)

I have a .txt file that is currently formatted kind of like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
...
The first column will never have any missing values.
I'm trying to use python to convert this into a .csv file. I know how to do this if I have all of the column data for each row, but my .txt is missing some data in certain columns. How can I convert this to a .csv while making sure the same type of data remains in the same column? Thanks :)
Split by commas. You know the pattern should be word, word, int (I'm assuming), then a string in the pattern of www.word.word.
If there is only 1 word at the front instead of 2, add another comma after the first word.
If the number is missing, add a comma after the second word.
Etc...
Say you get a line "Susie,www.regexr.com": you know that there is a missing word and a missing number, so add 2 commas after the first word.
It's essentially a bunch of if statements or a switch-case statement.
There probably is a more elegant way of doing this, but my mind is fried from dealing with server and phone issues all morning.
This isn't tested in any way, I hope I didn't just embarrass myself:
import re

url_pattern = r'www\.\S+\.\S+'  # matches URLs like www.google.com

# read_line is a line read from the file
split_line = read_line.split(',')
num_elements = len(split_line)  # do this only once for efficiency
if num_elements == 3:  # need to add an element somewhere, depending on what's missing
    if re.search(url_pattern, split_line[2]):  # starting at the last element: is it a URL?
        if re.search(r'\d', split_line[1]):  # if the previous element is a digit
            # if so, the only element missing is the word at split_line[1]; add a comma there
            read_line = split_line[0] + ',' + ',' + split_line[1] + ',' + split_line[2]
        else:
            # if not, the number is missing; add a comma at split_line[2]
            read_line = split_line[0] + ',' + split_line[1] + ',' + ',' + split_line[2]
    else:
        # last element isn't a URL; add a comma in its place
        read_line = split_line[0] + ',' + split_line[1] + ',' + split_line[2] + ','
elif num_elements == 2:  # need two elements; the first one is assumed to always be there
    if re.search(url_pattern, split_line[1]):  # the second element is a URL
        # insert 2 commas for the missing word and number
        read_line = split_line[0] + ',,,' + split_line[1]
    elif re.search(r'\d', split_line[1]):  # the second element contains digits
        # insert commas for the missing word and URL
        read_line = split_line[0] + ',,' + split_line[1] + ','
    else:
        # insert commas for the missing number and URL
        read_line = split_line[0] + ',' + split_line[1] + ',,'
elif num_elements == 1:
    read_line = split_line[0] + ',,,'
I thought about your issue, and I can only offer a half-baked solution, since your file does not mark missing data with something like ,,.
Your current file looks like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
If you find a way to change your file to this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,,35,www.website.com
Charles,banana,,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
You can use the solution below. For info, I've put your input into a text file:
In [1]: import pandas as pd
In [2]: population = pd.read_csv('input_to_csv.txt', header=None)
In [3]: mod_population = population.fillna("NaN")
In [4]: mod_population.to_csv('output_to_csv.csv', index=False, header=False)
One suggestion would be to do a regex check, if you can assume some kind of uniformity. For example, build a list of regex patterns, since each piece of data seems to be different.
If the second column you read in matches letters and spaces, it's likely food. On the other hand, if it's a digit match, you should assume the food is missing. If it's a URL match, you are missing both. You'll want to be thorough with your test cases, but if the actual data is similar to your example, you have three relatively distinct cases: a string, an integer, and a URL. This should make writing the regex patterns relatively trivial. Importing re and using re.search should let you test each regex without too much overhead.
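A minimal sketch of that idea (the patterns and column names here are assumptions based on the sample data, not a definitive implementation):
import re

# one pattern per column type, tried in order
patterns = {
    'number': re.compile(r'^\d+$'),
    'url': re.compile(r'^www\..+\..+$'),
    'food': re.compile(r'^[A-Za-z ]+$'),
}

def classify(field):
    for kind, pat in patterns.items():
        if pat.match(field):
            return kind
    return 'unknown'

def normalize(line):
    fields = line.rstrip('\n').split(',')
    row = {'name': fields[0], 'food': '', 'number': '', 'url': ''}
    for field in fields[1:]:
        kind = classify(field)
        if kind in row:
            row[kind] = field
    return '{name},{food},{number},{url}'.format(**row)

print(normalize('Anita,35,www.website.com'))  # -> Anita,,35,www.website.com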

In Python, if startswith matches a value in a tuple, I also need to return which value

I have an area codes file I put in a tuple:
for line1 in area_codes_file.readlines():
    if area_code_extract.search(line1):
        area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do two things:
1 is to keep the number
2 is to know which area code it matched, as I need to put the area code in brackets.
So far, I was only able to do 1:
for line in txt.readlines():
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line)
How do I do the second part?
The simple (if not necessarily highest performance) approach is to check each prefix individually, and keep the first match:
for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line, next(filter(line.startswith, area_codes)))
Since we know filter(line.startswith, area_codes) will get exactly one hit, we just pull the hit using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
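For instance, with hypothetical area codes, the mechanics look like this:
>>> area_codes = ('01632', '0113', '020')
>>> next(filter('01632960123'.startswith, area_codes))
'01632'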
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re

# Function that will match any of the given prefixes, returning a match object on a hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Returns None on a miss, a match object on a hit
        m = area_code_matcher(line)
        if m is not None:
            # Whatever matched is in the 0th grouping
            print(line, m.group())
Lastly, there is one final approach you can use if the area codes are of fixed length. Rather than using startswith, you can slice directly; you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset will allow much faster lookup
area_codes_set = frozenset(area_codes)

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Assuming lines that match always start with ###
        if line[:3] in area_codes_set:
            print(line, line[:3])

Workaround for index out of range while searching through FASTA file

I'm working on a program that lets the user enter a sequence they want to find inside a FASTA file, after which the program shows the description line and the sequence that belongs to it.
The FASTA can be found at hugheslab.ccbr.utoronto.ca/supplementary-data/IRC/IRC_representative_cdna.fa.gz, it's approx. 87 MB.
The idea is to first create a list with the locations of the description lines, which always start with a >. Once you know where the description lines are, you can search for the search_term in the lines between two description lines. This is exactly what is done in the fourth paragraph of the code; it results in a list 48,425 elements long. Here is an idea of what the results look like: http://imgur.com/Lxy8hnI
Now the fifth paragraph is meant to search between two description lines. Let's take lines 0 and 15 as an example: with a = 0, these are description_list[a] and description_list[a+1], since description_list[0] = 0 and description_list[1] = 15. Between these lines the if-statement searches for the search term; if it finds one, it saves description_list[a] into start_position_list and description_list[a+1] into stop_position_list, which are used later on.
So as you can imagine, a simple term like 'ATCG' will occur often, which means start_position_list and stop_position_list will contain a lot of duplicates; these are removed using list(set(start_position_list)) and sorting afterwards. That way start_position_list[0] and stop_position_list[0] will be 0 and 15, like this: http://imgur.com/QcOsuhM, which can then be used as a range for which lines to print out to show the sequence.
Now, of course, the big issue is that line 15, for i in range(description_list[a], description_list[a+1]):, will eventually ask for [a+1] when a is already the last index of description_list, and therefore gives a list index out of range error, as you can see here as well: http://imgur.com/hi7d4tr
What would be the best solution for this? It's still necessary to go through all the description lines, and I can't come up with a better structure to go through them all.
file = open("IRC_representative_cdna.fa")
file_list = list(file)

search_term = input("Enter your search term: ")
description_list = []
start_position_list = []
stop_position_list = []

for x in range(0, len(file_list)):
    if ">" in file_list[x]:
        description_list.append(x)

for a in range(0, len(description_list)):
    for i in range(description_list[a], description_list[a+1]):
        if search_term in file_list[i]:
            start_position_list.append(description_list[a])
            stop_position_list.append(description_list[a+1])
The way to avoid the subscript out of range error is to shorten the loop. Replace the line
for a in range(0, len(description_list)):
by
for a in range(0, len(description_list)-1):
Also, I think that you can use a list comprehension to build up description_list (note that it has to collect the indices, not the lines themselves):
description_list = [i for i, line in enumerate(file_list) if line.startswith('>')]
in addition to being shorter, it is more efficient, since startswith doesn't do a linear search over the entire line when only the starting character is relevant.
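One caveat with shortening the loop: the sequence after the last description line is then never searched. A minimal sketch that covers the final record as well, by pairing each description line index with the next one (using len(file_list) as the bound for the last record; this is a variant of the poster's loop, not their exact code):
# zip pairs each description line index with the following one;
# the last record runs to the end of the file
for start, stop in zip(description_list, description_list[1:] + [len(file_list)]):
    # range(start + 1, stop) skips the description line itself
    if any(search_term in file_list[i] for i in range(start + 1, stop)):
        start_position_list.append(start)
        stop_position_list.append(stop)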
Here is a solution that uses the biopython package, thus saving you the headache of parsing interleaved fasta yourself:
from Bio import SeqIO

file = open("IRC_representative_cdna.fa")
search_term = input("Enter your search term: ")

for record in SeqIO.parse(file, "fasta"):
    rec_seq = record.seq
    if search_term in rec_seq:
        print(record.id)
        print(rec_seq)
It wasn't very clear to me what your desired output is, but this code can easily be changed to fit it.
