Matching regex to list items in Python - python

I am attempting to write a Python script that shows the URL flow on my installation of nginx. My script currently opens my 'rewrites' file, which contains a list of regexes and locations like so:
rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;
Python reads the file and trims the first and last words off each line (rewritei and permanent;), which leaves a list like so:
[
['^/ungrad/info.cfm$', '/ungrad/info'],
['^/admiss/testing.cfm$', '/admiss/testing'],
['^/ungrad/testing/$', '/ungrad/info.cfm']
]
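The parsing step described above might look roughly like the sketch below. The sample lines and the helper name parse_rewrites are assumptions made for illustration; a real script would read the lines from the rewrites file instead.

```python
# Hypothetical sample lines; a real script would read them from the rewrites file.
sample_lines = [
    "rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;",
    "rewritei ^/admiss/testing.cfm$ /admiss/testing/ permanent;",
    "",
]

def parse_rewrites(lines):
    """Return [pattern, target] pairs, dropping the directive and the flag."""
    rules = []
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        rules.append([parts[1], parts[2]])
    return rules

rules = parse_rewrites(sample_lines)
print(rules)
```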
This results in the first element being the URL watched, and the second being the URL redirected to. What I would like to do now is take the first element of each pair, run that regex over the entire list, and check whether it matches any of the second elements.
With the example above, [0][0] would match [2][1].
However I am having trouble thinking of a good and efficient way to do this.

import re

a = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]

def matchingfun(pattern):
    # Return every rule whose redirect target (second element) matches the pattern
    return [rule for rule in a if pattern.match(rule[1])]

result1 = []
for rule in a:
    pattern = re.compile(rule[0])  # compile the watched-URL regex
    for matched in matchingfun(pattern):
        if matched not in result1:  # lists are unhashable, so de-duplicate by hand
            result1.append(matched)
print("matched url is %s" % result1)
This is a bit inefficient, I guess, since every regex is tried against every target, but it works to some extent. I hope this answers your query. The snippet above prints the rules whose second item (the redirect target) is matched by one of the regexes in the list.
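If the goal is to report which rule feeds into which, as in "[0][0] would match [2][1]", a variant that returns index pairs may be easier to read. This is only a sketch: the function name find_chains is made up for illustration, and it still compares every pattern with every target, because arbitrary regexes leave little room for a shortcut.

```python
import re

rules = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm'],
]

def find_chains(rules):
    """Return (i, j) pairs where the pattern of rule i matches the target of rule j."""
    # Compile each regex once instead of once per comparison.
    compiled = [re.compile(pattern) for pattern, _ in rules]
    chains = []
    for i, regex in enumerate(compiled):
        for j, (_, target) in enumerate(rules):
            if regex.search(target):
                chains.append((i, j))
    return chains

chains = find_chains(rules)
print(chains)  # [(0, 2)]
```

Each pair (i, j) means that a request redirected by rule j would be picked up again by rule i.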

Related

Create a new array list for all matching filenames in ''main'' array

I have code in Python that loops through a directory of images and returns an array of 105 images. Now I need it to go through the array and group the matching images by name. Example: Mainlist = [Image_Sun_01, Image_Sun_02, Image_Moon_01], and I want it to create a separate list for each group of matching images, like so:
List_01 = [Image_Sun_01, Image_Sun_02]
List_02 = [Image_Moon_01]
What is the best way to do this?
Edit:
To clarify, I want it to match the words with each other, so "Sun" goes with "Sun" into one list and "Moon" with "Moon" into a new list.
From the sample data shown in the question, it appears that the "key" is the part of the filename between the two underscore characters. If that is the case, one idea would be to build a dictionary keyed on those tokens. Something like this:
Mainlist = ['Image_Sun_01.cr2', 'Image_Sun_02.jpg', 'Image_Moon_01.raw']
result = {}
for image in Mainlist:
    key = image.split('_')[1]
    result.setdefault(key, []).append(image)
print(result)
Output:
{'Sun': ['Image_Sun_01.cr2', 'Image_Sun_02.jpg'], 'Moon': ['Image_Moon_01.raw']}
Note:
Subsequently, access to 'Sun' or 'Moon' images is trivial
I have added data to your Mainlist to check the result of my proposed code, and I get 3 lists of different sizes in the output.
import re
Mainlist = ['Image_Sun_01', 'Image_Sun_02', 'Image_Moon_01', 'Image_Moon_02', 'Image_Moon_03', 'Image_Earth_01']
prev_pattern = ''
nb = 0
for i in range(len(Mainlist)):
    # Leading run of letters and underscores, e.g. 'Image_Sun_'
    new_pattern = re.search(r'[a-zA-Z_]+', Mainlist[i]).group(0)
    if new_pattern != prev_pattern:
        nb += 1
        prev_pattern = new_pattern
    if f"List_{nb:02d}" in globals():
        globals()[f"List_{nb:02d}"] += [Mainlist[i]]
    else:
        globals()[f"List_{nb:02d}"] = [Mainlist[i]]
print(List_01)
print(List_02)
print(List_03)
Output:
['Image_Sun_01', 'Image_Sun_02']
['Image_Moon_01', 'Image_Moon_02', 'Image_Moon_03']
['Image_Earth_01']
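Creating variables through globals() works, but a dictionary that maps the generated names to their lists is usually easier to handle afterwards. The sketch below is one possible variant, not the approach used above; the names groups and lists are assumptions for illustration. Unlike the prev_pattern comparison, it does not require matching names to be adjacent in Mainlist.

```python
import re

Mainlist = ['Image_Sun_01', 'Image_Sun_02', 'Image_Moon_01',
            'Image_Moon_02', 'Image_Moon_03', 'Image_Earth_01']

groups = {}  # prefix such as 'Image_Sun_' -> generated name such as 'List_01'
lists = {}   # generated name -> images that share the prefix
for image in Mainlist:
    prefix = re.search(r'[a-zA-Z_]+', image).group(0)
    if prefix not in groups:
        # Number the groups in order of first appearance.
        groups[prefix] = f"List_{len(groups) + 1:02d}"
    lists.setdefault(groups[prefix], []).append(image)

for name, images in lists.items():
    print(name, images)
```

The lists are then available as lists['List_01'], lists['List_02'], and so on.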

Need to Get 4 URLs in Output After Remove Letter S but Get only Last URL

The 4 URLs below contain the letter s. I need to remove this letter and print all 4 URLs, but the problem is that I get only the last website, not all 4 sites.
Note: the language used is Python.
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n', 'https:/www.stackoverflow.com\n',
         'https:/www.pythonhow.com\n']
file1_remove_s = []
for line in file1:
    file1_remove_s = line.replace('s','',1)
print(file1_remove_s)
You are reassigning file1_remove_s from a list object to the modified list element on every iteration, so only the last element survives. You want to use append instead:
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n', 'https:/www.stackoverflow.com\n',
         'https:/www.pythonhow.com\n']
file1_remove_s = []
for line in file1:
    file1_remove_s.append(line.replace('s','',1))
print(file1_remove_s)
You are keeping only the last item of the list, because the = operator rebinds the variable on every iteration. This is actually a perfect place to use a list comprehension, so your code could look like:
file1 = [file1_remove_s.replace('s','',1) for file1_remove_s in file1]
This collects the formatted strings (each with the first "s" removed) into a new list. Because that list is assigned to the name of the initial list, the initial list is replaced by the new one, which holds the texts in the format you want.
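As a small illustration of the list-comprehension approach, the sketch below builds a separate list and prints one URL per line. The name cleaned is an assumption chosen so that the original file1 stays untouched.

```python
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n',
         'https:/www.stackoverflow.com\n', 'https:/www.pythonhow.com\n']

# replace(..., 1) removes only the first 's' in each line (the one in 'https').
cleaned = [line.replace('s', '', 1) for line in file1]

for url in cleaned:
    print(url.rstrip('\n'))  # drop the trailing newline before printing
```

Note that the count argument of 1 matters here: without it, the "s" in "stackoverflow" would be removed as well.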

Including Exception in python

I have a file at /location/all-list-info.txt that contains items in the following format:
aaa:xxx:abc.com:1857:xxx1:rel5t2:y
ifa:yyy:xyz.com:1858:yyy1:rel5t2:y
I process these items with the Python code below:
def pITEMName():
    global itemList
    itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = itemList.split("|")
    count = len(items)
    print 'Total Distint Item Count : ', count
    pipelst = itemList.split('|')
    filepath = '/location/all-item-info.txt '
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            del pipelst[index]
    for lns in pipelst:
        print lns,' is wrong item Name'
    f.close()
    if podList:
When executed, the Python code above gives a prompt:
Enter pipe separated list of ITEMS:
I then pass in the items:
Enter pipe separated list of ITEMS: aaa|ifa-mc|ggg-mc
After I press Enter, the code proceeds as follows:
Enter pipe separated list of ITEMS : aaa|ifa-mc|ggg-mc
Total Distint Item Count : 3
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
ITEMs Belonging to other Centers :
Item Count From Other Center = 0
ITEMs Belonging to Current Centers :
Active Items in US1 :
^IFA$
Test Active Items in US1 :
^AAA$
Ignored Item Count From Current center = 0
You Have Entered ItemList belonging to this center as: ^IFA$|^AAA$
Active Pod Count : 2
My question: if I add the '-mc' suffix to an item in the input, it is reported as a wrong item whether or not the item is present in the /location/all-item-info.txt file. Please have a look at this part of the output again:
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
In the example above, 'ifa' is present in /location/all-items-info.txt, whereas 'ggg' is not.
Please help me with what I can change in the code above so that an item with the -mc suffix that is present in the /location/all-items-info.txt file is not counted as a wrong item name. Only items that are not present in the /location/all-items-info.txt file should be counted.
Please give me your help.
Thanks,
Ritesh.
If you want to avoid checking for -mc as well, then you can modify this part of your script -
pipelst = itemList.split('|')
To -
pipelst = [i.split('-')[0] for i in itemList.split('|')]
It's a bit unclear exactly what you are asking, but basically to ignore any '-mc' from user input, you can explicitly preprocess the user input to strip it out:
pipelst = itemList.split('|')
pipelst = [item.rsplit('-mc',1)[0] for item in pipelst]
If instead you want to allow for the possibility of -mc-suffixed words in the file as well, simply add the stripped version to the list instead of replacing the original:
pipelst = itemList.split('|')
for item in list(pipelst):  # iterate over a copy, since pipelst grows inside the loop
    if item.endswith('-mc'):
        pipelst.append(item.rsplit('-mc',1)[0])
Another issue: based on the example lines you gave from /location/all-list-info.txt, it sounds like all the items are lowercase. However, the script explicitly converts the user input to uppercase. String equality and the in operator are case-sensitive, so for instance:
>>> print 'ifa-mc' in ['IFA-MC']
False
You probably want:
itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).lower().strip()
and you could use .upper() only when printing or wherever it is needed
Finally, there are a few other things that could be tweaked to make the code a bit faster and cleaner. The main one that comes to mind is that pipelst should probably be a Python set and not a list: checking inclusion and removing items would then be much faster for large lists, and the code to remove an item from a set is much cleaner:
>>> desserts = set(['ice cream', 'cookies', 'cake'])
>>> if 'cake' in desserts:
...     desserts.remove('cake')
>>> print desserts
set(['cookies', 'ice cream'])
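Putting these suggestions together (a set of known names, case-insensitive comparison, and an ignored -mc suffix), a minimal sketch could look like the following. It is written for Python 3, unlike the Python 2 code in the question, and the function name find_wrong_items, the suffix parameter, and the in-memory sample file are all assumptions made for illustration.

```python
import io

# Hypothetical stand-in for the contents of the item file.
sample_file = io.StringIO(
    "aaa:xxx:abc.com:1857:xxx1:rel5t2:y\n"
    "ifa:yyy:xyz.com:1858:yyy1:rel5t2:y\n"
)

def find_wrong_items(user_input, item_file, suffix='-mc'):
    """Return the entered names whose base name is not listed in the file."""
    # The first colon-separated field of each non-empty line is the item name.
    known = {line.split(':', 1)[0].strip().lower()
             for line in item_file if line.strip()}
    wrong = []
    for name in user_input.strip().split('|'):
        base = name.strip().lower()
        if base.endswith(suffix):
            base = base[:-len(suffix)]  # ignore the suffix when looking up
        if base not in known:
            wrong.append(name.strip())
    return wrong

wrong = find_wrong_items('aaa|ifa-mc|ggg-mc', sample_file)
for name in wrong:
    print(name.upper(), 'is wrong item Name')
```

With the sample data, only ggg-mc is reported, because ifa is present in the file once the suffix is ignored.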

Extracting a string from html tags in python

Hopefully this isn't a duplicate of a question that I've overlooked; I've been scouring this forum for someone who has posted a problem similar to the one below...
Basically, I've created a Python script that scrapes the callsign of each ship from the URL shown below and appends them to a list. In short, it works; however, whenever I iterate through the list and display each element, there is a '[' and ']' around each of the callsigns. I've shown the output of my script below:
Output
*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']
As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.
Ideally, I want the output to be without any of the square brackets and just the callsign as a string.
*********************** Contents of 'listOfCallSigns' List ***********************
0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590
This script I'm using currently is shown below:
My script
# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint
# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"
# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")
# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []
# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))
print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row
Does anyone know how to remove the square brackets surrounding each callsign and just display the string?
Thanks in advance! :)
Change the last lines to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0] # <-- added a [0] here
Alternatively, you can also add the [0] here:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0]) # <-- added a [0] here
The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":
>>> listOfCallSigns
[ ['311062900'], ['235056239'], ['311063300'], ['236111791'],
  ['245639000'], ['305500000'], ['235077805'], ['235011590'] ]
When you enumerate your listOfCallSigns, the row variable is basically the re.findall(...) that you appended earlier in the code (that's why you can add the [0] after either of them).
So row and re.findall(...) are both of type "list of string(s)" and look like this:
>>> row
['311062900']
And to get the string inside the list, you need to access its first element, i.e.:
>>> row[0]
'311062900'
Hope this helps!
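One caveat with indexing [0] is that it raises an IndexError when a cell contains no nine-digit number, because re.findall then returns an empty list. A possible alternative is re.search, which returns a single match object or None. The sketch below uses Python 3 and made-up cell markup (the cells list is an assumption, not the real page content).

```python
import re

# Hypothetical cell markup; the real <td> contents on the page may differ.
cells = ['<td>311062900</td>', '<td>235056239</td>', '<td>no number</td>']

call_signs = []
for cell in cells:
    match = re.search(r"\d{9}", cell)  # first run of nine digits, or None
    if match:
        call_signs.append(match.group(0))  # a plain string, not a list

for i, sign in enumerate(call_signs):
    print(i, sign)
```

Each element of call_signs is then a string, so no brackets appear when it is printed.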
This can also be done by stripping the unwanted characters from the string like so (this two-argument form of translate works on Python 2 str objects):
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a

Scan through txt, append certain data to an empty list in Python

I have a text file that I am reading in Python. I'm trying to extract certain elements from the text file that follow keywords and append them to empty lists. The file looks like this:
So I want to make two empty lists:
The 1st list will hold the sequence names.
The 2nd list will be a list of lists in the format [Bacteria, Phylum, Class, Order, Family, Genus, Species].
Most of the organisms will be Uncultured bacterium. I am trying to add the Uncultured bacterium along with the following IDs, which are separated by ;
Is there any way to scan for a certain word and, when the word is found, take the word that comes after it (separated by a '\t')?
I need it to create a dictionary that translates the sequence name to the taxonomic data.
I know I will need an empty list to append the names to:
seq_names=[ ]
a second list to put the taxonomy lists into:
taxonomy=[ ]
and a 3rd list that will be reset after every iteration:
temp = [ ]
I'm sure it can be done in Biopython, but I'm working on my Python skills.
Yes, there is a way.
You can split the string that you get from reading the file into a list using the built-in split method. From this you can find the index of the word you are looking for and then use this index plus one to get the word after it. For example, take a text file called test.txt that looks like so (the formatting is a bit weird because SO doesn't seem to like hard tabs):
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will return desired as 'eight'
Edit: To return every following word in the list
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This is using list comprehensions. It enumerates the list of words and if the word is what you are looking for includes the word at the next index in the list.
Edit2: To split it on both new lines and tabs, you can use regular expressions:
import re
f = open('test.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
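One edge case in the comprehensions above: if the keyword happens to be the last word, words[ind+1] raises an IndexError. A minimal sketch that skips that case is shown below; the function name words_after and the sample text are assumptions for illustration, and the text is held in a string rather than read from a file so the example is self-contained.

```python
import re

# Hypothetical sample text with tabs and a newline as separators.
text = "one\ttwo\tseven\teight\nnine\tseven\tten\tseven"

def words_after(text, keyword):
    """Return every word that directly follows keyword."""
    words = re.split(r'[\t\n]', text)
    # Stop one short of the end so that words[i + 1] always exists.
    return [words[i + 1] for i, word in enumerate(words[:-1]) if word == keyword]

found = words_after(text, 'seven')
print(found)  # ['eight', 'ten']
```

The final "seven" has no successor, so it contributes nothing to the result.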
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
'some_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species],
'some_other_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the files, .split('\t') to break them into "columns" and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...] to load the row into the dictionary.
So,
for row in inp_file.readlines():
    row = row.split('\t')
    my_data[row[0]] = [row[10], row[11], row[13], ...]
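A runnable version of this idea, using the csv module to handle the tab splitting, might look like the sketch below. The sample rows and the column layout (sequence name first, taxonomy fields after it) are assumptions, since the real file layout is not shown; the column indices would need adjusting to match the actual data.

```python
import csv
import io

# Hypothetical tab-separated rows: sequence name, then taxonomy fields.
sample = io.StringIO(
    "seq_001\tBacteria\tFirmicutes\tBacilli\n"
    "seq_002\tBacteria\tProteobacteria\tGammaproteobacteria\n"
)

my_data = {}
for row in csv.reader(sample, delimiter='\t'):
    if not row:
        continue  # skip empty lines
    # Assumed layout: first column is the key, the rest is the taxonomy.
    my_data[row[0]] = row[1:]

print(my_data['seq_001'])
```

With a real file, io.StringIO would be replaced by an open file object.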
