I am trying to take a field from Salesforce that has line breaks and pull out the words and punctuation with a Python step in Zapier. Here is my code, but it returns an empty string. If there is a better or easier way, let me know; I am super new to code and Frankensteined this together from googling.
import re
string = input_data['ac']
regex = r"^[a-z,A-Z].*[?.!]$"
cleaned = re.findall(regex, string)
return [{'cleaned': cleaned}]
Here are 2 pictures: the original comment and the current result. I have it working, but I would like to keep the punctuation by updating the code.
Original Comment
Current Result
JSON parser error
The following just finds sentences by looking for a letter and then scanning until it finds one of the sentence-terminating characters.
import re
s = input_data['ac']
# remove multiple, consecutive carriage returns and/or newlines
s = re.sub(r'[\r\n]+', '', s)
regex = r"""(?x) # verbose flag
[A-Za-z] # a letter
[^?.!]* # zero or more non-sentence-ending characters (or use .*? for a non-greedy match)
[?.!] # a sentence-ending character
"""
cleaned = re.findall(regex, s)
result = [{'cleaned': cleaned}]
#return result # only legal in a function
Regex Demo
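For context, the pattern in the question returns nothing because ^ and $ (without re.MULTILINE) anchor to the whole field and . does not cross a newline, so a multi-line comment can never match. The sketch below is a small variation on the answer above, not a definitive version: it replaces each run of line breaks with a single space rather than nothing, so words on adjacent lines are not glued together. The function name extract_sentences and the sample input_data dict are assumptions made for illustration; in Zapier the dict is supplied by the Code step.

```python
import re

def extract_sentences(text):
    """Return sentence-like chunks, each keeping its final punctuation."""
    # Collapse every run of CR/LF characters into one space.
    flat = re.sub(r'[\r\n]+', ' ', text)
    # A letter, then anything up to and including the first ., ? or !
    return re.findall(r'[A-Za-z][^?.!]*[?.!]', flat)

# Hypothetical stand-in for the dict Zapier passes to the step.
input_data = {'ac': 'Called the client.\r\nAre they renewing?\nYes, next\nweek!'}

# The pattern from the question finds nothing in a multi-line value.
original = re.findall(r"^[a-z,A-Z].*[?.!]$", input_data['ac'])
cleaned = extract_sentences(input_data['ac'])
print(original)
print(cleaned)
```

How the result is handed back to Zapier is left exactly as in the code above.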
I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need "match this" as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
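You need to use re.search, since re.match only tries to match from the beginning of the string. To match until a space or period is encountered (note the space after the colon inside the lookbehind; without it the match begins at that space and comes back empty):
re.search(r'(?<=test : )[^.\s]*',text)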
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
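A short sketch of the difference, using the string from the question. The variable names anchored and found are illustrative only, and the \s* plus capture group is one possible way to drop the space that follows the colon.

```python
import re

text = "test : match this."

# re.match only tries position 0, where the lookbehind cannot succeed.
anchored = re.match(r'(?<=test :).*', text)

# re.search scans the string; \s* skips the space after the colon and
# the group captures everything up to (not including) the first period.
found = re.search(r'(?<=test :)\s*([^.]*)', text)

print(anchored)
print(found.group(1))
```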
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return "match this." with the trailing period included. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match(), since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
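To illustrate the whole-word point, here is a minimal sketch. The second sample string ("contest : skip this.") is made up for the comparison and does not come from the question.

```python
import re

whole_word = re.compile(r'\btest\s*:\s*(.*)\.')

samples = ["test : match this.", "contest : skip this."]
results = {}
for s in samples:
    m = whole_word.search(s)
    # \b refuses to match inside "contest", so that sample yields None.
    results[s] = m.group(1) if m else None

print(results)

# Without the r prefix, '\b' is the single backspace character.
print('\b' == '\x08')
```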
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
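A minimal sketch of that generalisation follows. The helper name text_after_prefix and its terminator parameter are hypothetical, introduced only for this example. It also guards against a missing period: str.find returns -1 in that case, which would otherwise silently cut off the last character of the slice.

```python
def text_after_prefix(line, prefix, terminator='.'):
    """Return the text between prefix and the first terminator, or None."""
    if not line.startswith(prefix):
        return None
    # Look for the terminator only after the prefix.
    end = line.find(terminator, len(prefix))
    if end == -1:
        # No terminator: keep everything to the end of the line.
        end = len(line)
    return line[len(prefix):end].strip()

lines = ["test: match this.", "note: no period", "other line."]
for line in lines:
    for prefix in ("test:", "note:"):
        value = text_after_prefix(line, prefix)
        if value is not None:
            print(prefix, value)
```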
import re

with open('INFILE.txt', 'r') as fileinput:
    fileoutput = fileinput.read()
# drop any comma that sits directly between two letters
replace = re.sub(r'([A-Za-z]),([A-Za-z])', r'\1\2', fileoutput)
print(replace)
with open('OUTFILE.txt', 'w') as replaceout:
    replaceout.write(replace)
The code above deletes any comma that sits between two letters, upper or lower case. How do I insert a comma between a letter and a digit? I tried the code
replace = re.sub(r"([a-z])([0-9])", r",\1", fileoutput)
but it does not work. Any suggestions on how to insert a comma between any letter and any digit?
This may help you understand how to add the comma and refer back to what you captured. The parentheses around a pattern capture the matched value so you can use it later in the replacement. The first group you capture is referenced as \1, the second as \2, and so on.
Inside the square brackets you tell the regex which characters may match; without a quantifier, the brackets match exactly one character. So the code below will put a comma after each character.
import re
test = "123frogger"
replace = re.sub(r'([A-Za-z0-9])', r'\1,', test)
creating the output
1,2,3,f,r,o,g,g,e,r,
Here's an update based on one of your comments above about the content of what you are trying to adjust.
import re
test = "Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"
replace = re.sub(r'([A-Za-z])([0-9].*)', r'\1,\2', test)
It will output the following.
Vilniausnuoma,483,NuomaVilniuiiraplinkVilniu
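The pattern above only handles the first letter-to-digit boundary, because .* swallows the rest of the string. The attempt in the question fails for a different reason: its replacement r",\1" puts the comma before the letter and drops the digit. A possible general sketch, assuming a comma is wanted at every boundary in either direction, uses zero-width lookarounds so no character is consumed. The function name is made up for this example.

```python
import re

# Matches the empty position between a letter and a digit (either order).
BOUNDARY = re.compile(r'(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])')

def comma_between_letters_and_digits(s):
    """Insert a comma wherever a letter touches a digit."""
    return BOUNDARY.sub(',', s)

print(comma_between_letters_and_digits("Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"))
print(comma_between_letters_and_digits("abc123def"))
```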
I can't seem to find an example of this, but I doubt the regex is that sophisticated. Is there a simple way of getting the immediately preceding digits of a certain character in Python?
For the character "A" and the string:
"&#238A"
It should return 238A
As long as you intend to include the trailing character in the resulting match, the regex pattern to do that is very simple. For instance, if you want to capture any series of digits followed by a letter A, the pattern would be \d+A
If you are on python 3, try this.
Please refer to this link for more information.
import re
char = "A" # the character you're searching for.
string = "BA &#238A 123A" # test string.
regex = "[0-9]+%s" % char # one or more digits ([0-9]+) followed by the desired character
compiled_regex = re.compile(regex) # compile the regex
result = compiled_regex.findall(string)
print (result)
>>['238A', '123A']
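One caveat with building the pattern from a variable: if the character is a regex metacharacter such as . or ?, it needs escaping. The sketch below wraps the same idea in a hypothetical helper, digits_before, and uses re.escape for that; the sample strings are invented for illustration.

```python
import re

def digits_before(char, text):
    """Find every run of digits immediately followed by char."""
    # re.escape makes characters such as '.' match literally.
    pattern = re.compile(r'\d+' + re.escape(char))
    return pattern.findall(text)

seats = digits_before('A', 'room 12A, seat 7A, gate B')
dotted = digits_before('.', 'v1.2 and 30.')
print(seats)
print(dotted)
```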
I have a list of regex patterns.
rgx_list = ['pattern_1', 'pattern_2', 'pattern_3']
And I am using a function to loop through the list, compile the regex's, and apply a findall to grab the matched terms and then I would like a way of deleting said terms from the text.
def clean_text(rgx_list, text):
    matches = []
    for r in rgx_list:
        rgx = re.compile(r)
        found_matches = re.findall(rgx, text)
        matches.append(found_matches)
I want to do something like text.delete(matches) so that all of the matches will be deleted from the text and then I can return the cleansed text.
Does anyone know how to do this? My current code will only work for one match of each pattern, but the text may have more than one occurrence of the same pattern and I would like to eliminate all matches.
Use sub to replace matched patterns with an empty string. No need to separately find the matches first.
def clean_text(rgx_list, text):
    new_text = text
    for rgx_match in rgx_list:
        new_text = re.sub(rgx_match, '', new_text)
    return new_text
For simple regex you can OR the expressions together using a "|". There are examples of combining regex using OR on stack overflow.
For really complex regex I would loop through the list of regex. You could get timeouts from combined complex regex.
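For the simple case, combining the patterns with | might look like the sketch below. The function name clean_text_combined and the sample patterns are assumptions for illustration. Each pattern is wrapped in a non-capturing group so that an alternation inside one pattern cannot leak into its neighbours. Note that a single combined pass is not always equivalent to the loop: the loop can remove text that only becomes a match after an earlier pattern has been deleted.

```python
import re

def clean_text_combined(rgx_list, text):
    """Remove every match of any pattern in rgx_list in one pass."""
    # (?:...) keeps each pattern self-contained inside the alternation.
    combined = re.compile('|'.join('(?:%s)' % p for p in rgx_list))
    return combined.sub('', text)

cleaned = clean_text_combined([r'\d+', r'foo'], 'foo 12 bar 345 foo')
print(repr(cleaned))
```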
I want to print the lines between specific strings. My string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start markers and search them for main.
Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        # remember the most recent ##start header
        if line[0:7] == "##start":
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])

for main in get_mains(my_string):
    print(main)
There is a way to do this with Python's regular expressions, called regex for short.
Basically, regex is this whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special character that stands for any letter in the alphabet (plus a few extras). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n, see the linked webpage). Then, everything in parentheses becomes a group. Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match anything that is a complete line, any number of times, but as few as possible'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1] # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
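Note that the second line is ##start/new/... rather than the ##start/info/version/... asked for in the question: {line}*? is free to step over a later ##start header on its way to a main line. One possible adjustment, offered as a sketch rather than the answer's own method, is a negative lookahead that stops the skipped lines (and the main line itself) from starting with ##start. The names not_header and found are made up for this example, and the sample string is the one from the question without its trailing "..." lines.

```python
import re

my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

# A whole line that does not begin a new ##start section.
not_header = r'(?:(?!##start).*\n)'
pattern = re.compile(
    r'^(##start.*)\n{body}*?(?!##start)(.*main)$'.format(body=not_header),
    re.MULTILINE,
)

# Each match is a (header, main_line) pair; join them with a slash.
found = ['/'.join(pair) for pair in pattern.findall(my_string)]
for line in found:
    print(line)
```

With the sample data this pairs each main line with the header of its own section, and the ##start/new section, which has no main line, produces nothing.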
So Regex is pretty cool and other languages have it too. It is a very powerful tool, and you should learn how to use it here.
Also, just a side note: I use the .format function because I think it looks much cleaner and is easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') simply evaluates to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world