Calling a Python function on each entry of a text file - python

I'm trying to find regex matches for each entry of a text file in order to structure the data better.
It keeps returning "No match", but if I call the function manually on an entry, it works.
import re

# The patterns
r1 = re.compile('.*full.*time.*', flags=re.IGNORECASE)
r2 = re.compile('.*contingent.*', flags=re.IGNORECASE)
r3 = re.compile('.*intern', flags=re.IGNORECASE)

def doSomething1():
    print("Full Time")

def doSomething2():
    print("Contract")

def doSomething3():
    print("Internship")

def default():
    print("No match")

def match(r, s):
    mo = re.match(r, s)
    try:
        return mo.group()
    except AttributeError:
        return None

def delegate(s):
    try:
        action = {
            match(r1, s): doSomething1,
            match(r2, s): doSomething2,
            match(r3, s): doSomething3
        }[s]()
        return action
    except KeyError:
        return default()

with open('data.txt', 'r') as data:
    for job in data:
        delegate(job)
This is the data.txt:
Full Time Remote
Contingent
Intern

If you set the flags to flags = re.IGNORECASE | re.DOTALL, then all three lines will match.
According to the docs, if the DOTALL flag has been specified, . matches any character, including a newline.
But the design of your delegate function is a bit awkward; it would help if you told us what you ultimately want to achieve.
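For illustration, here is a minimal standalone sketch (not part of the original answer; the variable names are only for the demo) of what the extra flag changes for the question's first pattern:

import re

line = "Full Time Remote\n"   # what each iteration of `for job in data:` yields

r1_plain  = re.compile('.*full.*time.*', flags=re.IGNORECASE)
r1_dotall = re.compile('.*full.*time.*', flags=re.IGNORECASE | re.DOTALL)

print(repr(r1_plain.match(line).group()))   # 'Full Time Remote'   -- not equal to the raw line
print(repr(r1_dotall.match(line).group()))  # 'Full Time Remote\n' -- equal to the raw line, so the [s] lookup in delegate finds a key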

The dictionary in delegate is looked up with the raw line s, e.g. "Full Time Remote\n", but match(r1, s) only returns "Full Time Remote". Each line read from the file keeps its trailing newline, and . does not match a newline by default, so the matched group never equals s and the [s] lookup raises KeyError. You can handle it with the re.DOTALL flag as Lei Yang said, or by calling s.strip() to remove the newline, as sketched below.
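A minimal sketch of that strip() variant, keeping the question's regex dispatch otherwise unchanged (a starting point rather than a tested drop-in):

def delegate(s):
    s = s.strip()  # drop the trailing newline so the matched group can equal the lookup key
    try:
        action = {
            match(r1, s): doSomething1,
            match(r2, s): doSomething2,
            match(r3, s): doSomething3
        }[s]()
        return action
    except KeyError:
        return default()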
However, I think string methods are enough for this task.
def delegate(s):
    # ignore case by lower() and remove newline by strip()
    s = s.lower().strip("\n")
    # check index by find() to ensure "full" before "time"
    if "full" in s and s.find("full") < s.find("time"):
        print("Full Time")
    elif "contingent" in s:
        print("Contract")
    # last 6 characters are "intern"
    elif s[-6:] == "intern":
        print("Intern")
    else:
        print("No match")

with open('data.txt', 'r') as data:
    for job in data:
        delegate(job)
With data.txt as:
Full Time Remote
Time Remote Full
Contingent
Intern
Intern 123
Result:
Full Time
No match
Contract
Intern
No match

Related

python: extract parts from line when using different delimiter

I am reading stdin line by line:
for line in sys.stdin:
    ...
Each line has following format:
: 1631373881:0;echo
I need to extract the first number (epoch time) and the command (last part after ';')
How can I extract these when the delimiter is not the same?
input_str = ": 1631373881:0;echo".split(";")
command = input_str[-1]
number = input_str[0].split(":")[1].replace(" ","")
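Applied to the stdin loop from the question, that could look something like this (a sketch which assumes every line really follows the ':' / ';' layout shown above):

import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split(";")
    command = parts[-1]
    number = parts[0].split(":")[1].replace(" ", "")
    print(number, command)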
If you know the lines always have the same format, you can bet on regular expressions:
import re

MASK = re.compile(': (\\d+):\\d+;(.+)')

def extract(line):
    matches = MASK.findall(line)
    return matches[0] if matches else None

def test():
    assert extract(": 1631373881:0;echo test") == ("1631373881", "echo test")
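And, hypothetically, wiring extract() into the stdin loop from the question (a sketch; it reuses the extract() defined above and assumes one record per line):

import sys

for line in sys.stdin:
    result = extract(line.rstrip("\n"))
    if result is not None:
        epoch, command = result
        print(epoch, command)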

replace function not working with list items

I am trying to use the replace function to take items from a list and replace the fields below with their corresponding values, but no matter what I do, it only seems to work at the end of the range: on the last possible value of i it successfully replaces a substring, but before that it does not.
for i in range(len(fieldNameList)):
    foo = fieldNameList[i]
    bar = fieldValueList[i]
    msg = msg.replace(foo, bar)
    print msg
This is what I get after running that code
<<name>> <<color>> <<age>>
<<name>> <<color>> <<age>>
<<name>> <<color>> 18
I've been stuck on this for way too long. Any advice would be greatly appreciated. Thanks :)
Full code:
def writeDocument():
    msgFile = raw_input("Name of file you would like to create or write to?: ")
    msgFile = open(msgFile, 'w+')
    msg = raw_input("\nType your message here. Indicate replaceable fields by surrounding them with \'<<>>\' Do not use spaces inside your fieldnames.\n\nYou can also create your fieldname list here. Write your first fieldname surrounded by <<>> followed by the value you'd like to assign, then repeat, separating everything by one space. Example: \"<<name>> ryan <<color>> blue\"\n\n")
    msg = msg.replace(' ', '\n')
    msgFile.write(msg)
    msgFile.close()
    print "\nDocument written successfully.\n"

def fillDocument():
    msgFile = raw_input("Name of file containing the message you'd like to fill?: ")
    fieldFile = raw_input("Name of file containing the fieldname list?: ")
    msgFile = open(msgFile, 'r+')
    fieldFile = open(fieldFile, 'r')
    fieldNameList = []
    fieldValueList = []
    fieldLine = fieldFile.readline()
    while fieldLine != '':
        fieldNameList.append(fieldLine)
        fieldLine = fieldFile.readline()
        fieldValueList.append(fieldLine)
        fieldLine = fieldFile.readline()
    print fieldNameList[0]
    print fieldValueList[0]
    print fieldNameList[1]
    print fieldValueList[1]
    msg = msgFile.readline()
    for i in range(len(fieldNameList)):
        foo = fieldNameList[i]
        bar = fieldValueList[i]
        msg = msg.replace(foo, bar)
        print msg
    msgFile.close()
    fieldFile.close()

###Program Starts#####--------------------
while True==True:
    objective = input("What would you like to do?\n1. Create a new document\n2. Fill in a document with fieldnames\n")
    if objective == 1:
        writeDocument()
    elif objective == 2:
        fillDocument()
    else:
        print "That's not a valid choice."
Message file:
<<name>> <<color>> <<age>>
Fieldname file:
<<name>>
ryan
<<color>>
blue
<<age>>
18
Cause:
This is because every line read from the "Fieldname" file, except the last one, contains a trailing "\n" character. So when the program reaches the replacing part, fieldNameList, fieldValueList and msg look like this:
fieldNameList = ['<<name>>\n', '<<color>>\n', '<<age>>\n']
fieldValueList = ['ryan\n', 'blue\n', '18']
msg = '<<name>> <<color>> <<age>>\n'
So the replace() function actually searches for '<<name>>\n', '<<color>>\n' and '<<age>>\n' in the msg string, and only the <<age>> field gets replaced. (You must have a "\n" at the end of the msg file, otherwise it won't be replaced either.)
Solution:
Use the rstrip() method when reading lines to strip the newline character at the end:
fieldLine = fieldFile.readline().rstrip()
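A minimal sketch of the reading loop with that fix applied (it assumes the fieldname file alternates name/value lines exactly as in the question):

fieldNameList = []
fieldValueList = []
fieldLine = fieldFile.readline().rstrip()
while fieldLine != '':
    fieldNameList.append(fieldLine)
    fieldValueList.append(fieldFile.readline().rstrip())
    fieldLine = fieldFile.readline().rstrip()

With the newlines stripped, the keys that replace() searches for match the fields in msg exactly.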

How to extract data from a text file based on a regular expression pattern

I need some help with a Python program. I've tried so many things, for hours, but it doesn't work.
Can anyone help me?
This is what I need:
I have this file: http://www.filedropper.com which contains information about proteins.
I want to filter only the proteins which match the ...exists.
From these proteins, I need only the ... (the text of 6 tokens, after >sp|, and the species (second line, between the [])
I want the .. and ..in a .., and eventually in a table.
....
Human AAA111
Mouse BBB222
Fruit fly CCC333
What I have so far:
import re

def main():
    ReadFile()
    file = open("file.txt", "r")
    FilterOnRegEx(file)

def ReadFile():
    try:
        file = open("file.txt", "r")
    except IOError:
        print("File not found!")
    except:
        print("Something went wrong.")

def FilterOnRegEx(file):
    f = "[AG].{4}GK[ST]"
    for line in file:
        if f in line:
            print(line)

main()
You're a hero if you help me out!
My first recommendation is to use a with statement when opening files:
with open("ploop.fa", "r") as file:
    FilterOnRegEx(file)
The problem with your FilterOnRegEx method is the test if f in line (the pattern string, called ploop below). The in operator, with string arguments, searches the string line for that exact literal text, not for a regex match.
Instead, you need to compile the pattern text into an re object and then search for matches:
def FilterOnRegEx(file):
    ploop = "[AG].{4}GK[ST]"
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            print(line)
This will help you to move forward.
As a next step, I would suggest learning about generators. Printing the lines that match is great, but that doesn't help you to do further operations with them. I might change print to yield so that I could then process the data further such as extracting the parts you want and reformatting it for output.
As a simple demonstration:
def FilterOnRegEx(file):
    ploop = "[AG].{4}GK[ST]"
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            yield line

with open("ploop.fa", "r") as file:
    for line in FilterOnRegEx(file):
        print(line)
Addendum: I ran the code I posted above using the sample data you posted, and it successfully prints some lines and not others. In other words, the regular expression did match some of the lines and did not match others. So far so good. However, the data you need is not all on one line in the input! That means that filtering individual lines on the pattern is insufficient. (Unless, of course, I'm simply not seeing the correct line breaks in the question.) The way the data appears in the question, you'll need to implement a more robust parser with state, so it knows when a record begins, when a record ends, and what any given line is within a record.
This seems to work on your sample text. I don't know if you can have more than one extract per file, and I'm out of time here, so you'll have to extend it if needed:
#!python3
import re

Extract = {}

def match_notes(line):
    global _State
    pattern = r"^\s+(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'notes' not in Extract:
            Extract['notes'] = []
        Extract['notes'].append(m.group(1))
        return True
    else:
        _State = match_sp
        return False

def match_pattern(line):
    global _State
    pattern = r"^\s+Pattern: (.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern'] = m.group(1)
        _State = match_notes
        return True
    return False

def match_sp(line):
    global _State
    pattern = r">sp\|([^|]+)\|(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'sp' not in Extract:
            Extract['sp'] = []
        spinfo = {
            'accession code': m.group(1),
            'other code': m.group(2),
        }
        Extract['sp'].append(spinfo)
        _State = match_sp_note
        return True
    return False

def match_sp_note(line):
    """Second line of >sp paragraph"""
    global _State
    pattern = r"^([^[]*)\[([^]]+)\)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['note'] = m.group(1).strip()
        spinfo['species'] = m.group(2).strip()
        spinfo['sequence'] = ''
        _State = match_sp_sequence
        return True
    return False

def match_sp_range(line):
    """Last line of >sp paragraph"""
    global _State
    pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['range'] = (m.group(1), m.group(2))
        spinfo['flags'] = m.group(3)
        _State = match_sp
        return True
    return False

def match_sp_sequence(line):
    """Middle block of >sp paragraph"""
    global _State
    spinfo = Extract['sp'][-1]
    if re.match("^\s", line):
        # End of sequence. Check for pattern, reset state for sp
        if re.match(r"[AG].{4}GK[ST]", spinfo['sequence']):
            spinfo['ag_4gkst'] = True
        else:
            spinfo['ag_4gkst'] = False
        _State = match_sp_range
        return False
    spinfo['sequence'] += line.rstrip()
    return True

def match_start(line):
    """Start of outer item"""
    global _State
    pattern = r"^Hits for ([A-Z]+\d+)|([^:]+) : (?:\[occurs (\w+)\])?"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern_id'] = m.group(1)
        Extract['title'] = m.group(2)
        Extract['occurrence'] = m.group(3)
        _State = match_pattern
        return True
    return False

_State = match_start

def process_line(line):
    while True:
        state = _State
        if state(line):
            return True
        if _State is not state:
            continue
        if len(line) == 0:
            return False
        print("Unexpected line:", line)
        print("State was:", _State)
        return False

def process_file(filename):
    with open(filename, "r") as infile:
        for line in infile:
            process_line(line.rstrip())

process_file("ploop.fa")

import pprint
pprint.pprint(Extract)

Python regex groupdict returns single characters instead of strings for groups

I'm running up against a really confusing issue with Regex matching in Python.
I have a pair of regex patterns that work fine in debugging tools such as regex101:
[Hex&Oct matching Pattern] (Code in testing window is the same as the file contents in the console test)
[Base64 matching Pattern] (Far from ideal, but the minimum length of 15 characters helps avoid false positives)
[Hex|Oct splitting Pattern] (Variation of Hex&Oct with different named groups)
However, once implemented in the script, the patterns fail to match anything unless both compiled and prepended with r before the opening quote.
Even then, the matches return single characters from the group dict.
Can anyone provide any pointers as to what I am doing wrong here?
deobf.py:
#!/bin/python
import sys
import getopt
import re
import base64

####################################################################################
#
# Setting up global vars and functions
#
####################################################################################

# Assemble Pattern Dictionary
pattern = {}
pattern["HexOct"] = re.compile(r'([\"\'])(?P<obf_code>(\\[xX012]?[\dA-Fa-f]{2})*)\1')
pattern["Base64"] = re.compile(r'([\"\'])(?P<obf_code>[\dA-Za-z\/\+]{15,}={0,2})\1')

# Assemble more precise Pattern handling:
sub_pattern = {}
sub_pattern["HexOct"] = re.compile(r'((?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[012]?[\d]{2}))')

#print pattern # trying to Debug Pattern Dicts
#print sub_pattern # trying to Debug Pattern Dicts

# Global Var init
file_in = ""
file_out = ""
code_string = ""
format_code = False

# Prints the Help screen
def usage():
    print "How to use deobf.py:"
    print "-----------------------------------------------------------\n"
    print "$ python deobf.py -i {inputfile.php} [-o {outputfile.txt}]\n"
    print "Other options include:"
    print "-----------------------------------------------------------"
    print "-f : Format - Format the output code with indentations"
    print "-h : Help - Prints this info\n"
    print "-----------------------------------------------------------"
    print "You can also use the long forms:"
    print "-i : --in"
    print "-o : --out"
    print "-f : --format"
    print "-h : --Help"

# Combination wrapper for the above two functions
def deHexOct(obf_code):
    match = re.search(sub_pattern["HexOct"], obf_code)
    if match:
        # Find and process Hex obfuscated elements
        for HexObj in match.groupdict()["Hex"]:
            print match.groupdict()["Hex"]
            print "Processing:"
            print HexObj.pattern
            obf_code.replace(HexObj, chr(int(HexObj), 16))
        # Find and process Oct obfuscated elements
        for OctObj in set(match.groupdict()["Oct"]):
            print "Processing:"
            print OctObj
            obf_code.replace(OctObj, chr(int(OctObj), 8))
    return obf_code

# Crunch the Data
def deObfuscate(file_string):
    # Identify HexOct sections and process
    match = re.search(pattern["HexOct"], file_string)
    if match:
        print "HexOct Obfuscation found."
        for HexOctObj in match.groupdict()["obf_code"]:
            print "Processing:"
            print HexOctObj
            file_string.replace(HexOctObj, deHexOct(HexOctObj))
    # Identify B64 sections and process
    match = re.search(pattern["Base64"], file_string)
    if match:
        print "Base64 Obfuscation found."
        for B64Obj in match.groupdict()["obf_code"]:
            print "Processing:"
            print B64Obj
            file_string.replace(B64Obj, base64.b64decode(B64Obj))
    # Return the (hopefully) deobfuscated string
    return file_string

# File to String
def loadFile(file_path):
    try:
        file_data = open(file_path)
        file_string = file_data.read()
        file_data.close()
        return file_string
    except ValueError, TypeError:
        print "[ERROR] Problem loading the File: " + file_path

# String to File
def saveFile(file_path, file_string):
    try:
        file_data = open(file_path, 'w')
        file_data.write(file_string)
        file_data.close()
    except ValueError, TypeError:
        print "[ERROR] Problem saving the File: " + file_path

####################################################################################
#
# Main body of Script
#
####################################################################################

# Getting the args
try:
    opts, args = getopt.getopt(sys.argv[1:], "hi:o:f", ["help", "in", "out", "format"])
except getopt.GetoptError:
    usage()
    sys.exit(2)

# Handling the args
for opt, arg in opts:
    if opt in ("-h", "--help"):
        usage()
        sys.exit()
    elif opt in ("-i", "--in"):
        file_in = arg
        print "Designated input file: " + file_in
    elif opt in ("-o", "--out"):
        file_out = arg
        print "Designated output file: " + file_out
    elif opt in ("-f", "--format"):
        format_code = True
        print "Code Formatting mode enabled"

# Checking the input
if file_in == "":
    print "[ERROR] - No Input File Specified"
    usage()
    sys.exit(2)

# Checking or assigning the output
if file_out == "":
    file_out = file_in + "-deObfuscated.txt"
    print "[INFO] - No Output File Specified - Automatically assigned: " + file_out

# Zhu Li, Do the Thing!
code_string = loadFile(file_in)
deObf_String = deObfuscate(str(code_string))
saveFile(file_out, deObf_String)
The Console output from my debug prints is as follows:
C:\Users\NJB\workspace\python\deObf>deobf.py -i "Form 5138.php"
Designated input file: Form 5138.php
[INFO] - No Output File Specified - Automatically assigned: Form 5138.php-deObfuscated.txt
HexOct Obfuscation found.
Processing:
\
Processing:
x
Processing:
6
Processing:
1
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
7
Processing:
5
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
6
Processing:
1
Your regular expression is matching groups just fine, but you are then iterating through the characters in the matched group.
This gives the string you just matched: match.groupdict()["Hex"]
This iterates over the characters in the string:
for HexObj in match.groupdict()["Hex"]:
Instead, you want to iterate over the successive matches, so use re.finditer() instead of re.search(). Something like:
def deHexOct(obf_code):
    for match in re.finditer(sub_pattern["HexOct"], obf_code):
        # Find and process Hex obfuscated elements
        groups = match.groupdict()
        hex = groups["Hex"]
        if hex:
            print "hex:", hex
            # do processing here
        oct = groups["Oct"]
        if oct:
            print "oct:", oct
            # do processing here
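One hedged way to fill in the "do processing here" placeholders, decoding each escape and substituting it back into the string (this is an assumption about the intended behaviour, not part of the original answer, and it presumes the escapes encode plain ASCII):

def deHexOct(obf_code):
    for match in re.finditer(sub_pattern["HexOct"], obf_code):
        groups = match.groupdict()
        hex_esc = groups["Hex"]
        if hex_esc:
            # e.g. '\x61' -> chr(0x61) -> 'a'; note that replace() returns a new string
            obf_code = obf_code.replace(hex_esc, chr(int(hex_esc[2:], 16)))
        oct_esc = groups["Oct"]
        if oct_esc:
            # e.g. '\141' -> chr(0o141) -> 'a' (assumes the digits are valid octal)
            obf_code = obf_code.replace(oct_esc, chr(int(oct_esc[1:], 8)))
    return obf_code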
Also, the r in front of the string just stops Python interpreting backslashes as escapes and is needed for regular expressions because they also use backslash for escapes. An alternative would be to double every backslash in your regex; then you wouldn't need the r prefix but the regex might become even less readable.
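A quick demonstration of that point, independent of this particular script:

import re

# With a raw string the backslash reaches the regex engine intact:
print(re.match(r'\d+', '123').group())    # 123
# Without the r prefix you have to double the backslash to build the same pattern:
print(re.match('\\d+', '123').group())    # 123
# Both spellings produce the same two-character pattern string:
print(r'\d+' == '\\d+')                   # True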

Sorting Problems when using a list

I have a .txt file that contains a list of IP addresses:
111.67.74.234:8080
111.67.75.89:8080
12.155.183.18:3128
128.208.04.198:2124
142.169.1.233:80
There's a lot more than that though :)
Anyway, I imported this into a list using Python and I'm trying to sort it, but I'm having trouble. Anybody have any ideas?
EDIT:
OK, since that was vague, this is what I had so far.
f = open("/Users/jch5324/Python/Proxy/resources/data/list-proxy.txt", 'r+')
lines = [x.split() for x in f]
new_file = (sorted(lines, key=lambda x:x[:18]))
You're probably sorting them by ASCII string comparison ('.' < '5', etc.), when you'd rather they sort numerically. Try converting them to tuples of ints, then sorting:
def ipPortToTuple(string):
    """
    '12.34.5.678:910' -> (12,34,5,678,910)
    """
    ip, port = string.strip().split(':')
    # convert the port too, so the whole key is numeric as the docstring promises
    return tuple(int(i) for i in ip.split('.')) + (int(port),)

with open('myfile.txt') as f:
    nonemptyLines = (line for line in f if line.strip() != '')
    sorted(nonemptyLines, key=ipPortToTuple)
edit: The ValueError you are getting is because your text files are not entirely in the #.#.#.#:# format as you imply. (There may be comments or blank lines, though in this case the error would hint that there is a line with more than one ':'.) You can use debugging techniques to home in on your issue, by catching the exception and emitting useful debugging data:
def tryParseLines(lines):
    for line in lines:
        try:
            yield ipPortToTuple(line.strip())
        except Exception:
            if __debug__:
                print('line {} did not match #.#.#.#:# format'.format(repr(line)))

with open('myfile.txt') as f:
    sorted(tryParseLines(f))
I was a bit sloppy in the above, in that it still lets some invalid IP addresses through (e.g. #.#.#.#.#, or 257.-1.#.#). Below is a more thorough solution, which allows you to do things like compare IP addresses with the < operator, which also makes sorting work naturally:
#!/usr/bin/python3
import functools
import re

@functools.total_ordering
class Ipv4Port(object):
    regex = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})')

    def __init__(self, ipv4:(int,int,int,int), port:int):
        try:
            assert type(ipv4)==tuple and len(ipv4)==4, 'ipv4 not 4-length tuple'
            assert all(0<=x<256 for x in ipv4), 'ipv4 numbers not in valid range (0<=n<256)'
            assert type(port)==int, 'port must be integer'
        except AssertionError as ex:
            print('Invalid IPv4 input: ipv4={}, port={}'.format(repr(ipv4), repr(port)))
            raise ex
        self.ipv4 = ipv4
        self.port = port
        self._tuple = ipv4 + (port,)

    @classmethod
    def fromString(cls, string:'12.34.5.678:910'):
        try:
            a, b, c, d, port = cls.regex.match(string.strip()).groups()
            ip = tuple(int(x) for x in (a, b, c, d))
            return cls(ip, int(port))
        except Exception as ex:
            args = list(ex.args) if ex.args else ['']
            args[0] += "\n...indicating ipv4 string {} doesn't match #.#.#.#:# format\n\n".format(repr(string))
            ex.args = tuple(args)
            raise ex

    def __lt__(self, other):
        return self._tuple < other._tuple

    def __eq__(self, other):
        return self._tuple == other._tuple

    def __repr__(self):
        #return 'Ipv4Port(ipv4={ipv4}, port={port})'.format(**self.__dict__)
        return "Ipv4Port.fromString('{}.{}.{}.{}:{}')".format(*self._tuple)
and then:
def tryParseLines(lines):
    for line in lines:
        line = line.strip()
        if line != '':
            try:
                yield Ipv4Port.fromString(line)
            except AssertionError as ex:
                raise ex
            except Exception as ex:
                if __debug__:
                    print(ex)
                raise ex
Demo:
>>> lines = '222.111.22.44:214 \n222.1.1.1:234\n 23.1.35.6:199'.splitlines()
>>> sorted(tryParseLines(lines))
[Ipv4Port.fromString('23.1.35.6:199'), Ipv4Port.fromString('222.1.1.1:234'), Ipv4Port.fromString('222.111.22.44:214')]
Changing the values to be for example 264... or ...-35... will result in the appropriate errors.
@Ninjagecko's solution is the best, but here is another way of doing it using re:
>>> import re
>>> with open('ips.txt') as f:
...     print sorted(f, key=lambda line: map(int, re.split(r'\.|:', line.strip())))
['12.155.183.18:3128\n', '111.67.74.234:8080\n', '111.67.75.89:8080\n',
'128.208.04.198:2124\n', '142.169.1.233:80 \n']
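For reference, a Python 3 flavoured equivalent of that snippet (a sketch: print becomes a function and map() has to be wrapped in tuple() so the sort keys are comparable):

import re

with open('ips.txt') as f:
    print(sorted(f, key=lambda line: tuple(map(int, re.split(r'\.|:', line.strip())))))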
You can pre-process the list so it can be sorted using the built-in comparison, and then process it back to a more normal format.
After padding, the strings are all the same length and can be sorted; afterwards, we simply remove all the spaces.
You can google around and find other examples of this.
for i in range(len(address)):
    address[i] = "%3s.%3s.%3s.%3s" % tuple(address[i].split("."))

address.sort()

for i in range(len(address)):
    address[i] = address[i].replace(" ", "")
If you have a ton of IP addresses, you will get better processing times with C++. It will be more work up front, but the processing will be faster.
