Split string with multiple words using regex in python

Suppose I have an expression:
exp="\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50 and \"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = 'Kingston' or \"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20"
I want to split the expression on the words "and" and "or", so that my result will be:
exp=['\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50','\"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = \'Kingston\'','\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20']
This is what I have tried:
import re
res=re.split('and|or|',exp)
but it splits on every character. How can I make it split on whole words instead?

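import itertools
exp = itertools.chain.from_iterable(y.split('or') for y in exp.split('and'))
exp = [x.strip() for x in exp]
Explanation: first split on 'and', then split each resulting piece on 'or'. This produces a list of lists. itertools.chain flattens it into a single sequence, and the list comprehension strips the extra spaces from each element. Note that plain 'and' and 'or' also match inside longer words, so this only works when those letters never appear elsewhere in the expression.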

Your regex has three alternatives: "and", "or" or the empty string: and|or|
Omit the trailing | to split just by those two words.
import re
res=re.split('and|or', exp)
Note that this will not work reliably; it'll split on any instance of "and", even when it's in quotes or part of a word. You could make it split only on full words using \b, but that will still split on a product name like 'Black and Decker'. If you need it to be reliable and general, you'll have to parse the string using the full syntax (probably using an off-the-shelf parser, if it's standard SQL or similar).
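To make the \b suggestion concrete, here is a minimal sketch that splits only on the standalone words and also swallows the surrounding whitespace. The name WORD_SPLIT and the second test string are made up for illustration; the last line shows the caveat mentioned above, where a quoted product name is still split.

```python
import re

exp = ('"OLS"."ORDER_ITEMS"."QUANTITY" <50 and '
       '"OLS"."PRODUCTS"."PRODUCT_NAME" = \'Kingston\' or '
       '"OLS"."ORDER_ITEMS"."QUANTITY" <20')

# \b restricts the match to whole words; \s* removes the spaces around them.
WORD_SPLIT = re.compile(r'\s*\b(?:and|or)\b\s*')

parts = WORD_SPLIT.split(exp)
print(parts)

# Caveat: "and" inside a quoted literal is a whole word too, so it is still split.
tricky = "NAME = 'Black and Decker' or QTY < 20"
tricky_parts = WORD_SPLIT.split(tricky)
print(tricky_parts)
```

The pattern is case-sensitive, so "ORDER_ITEMS" is safe here; handling uppercase AND/OR or quoted literals properly would need a real parser.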

You can do it in 2 steps: [ss for s in exp.split(" and ") for ss in s.split(' or ')]

Related

Notepad++ Regex to insert random letter or number every other character-position

I'm hoping to figure out a simple search and replace in Notepad++ to slightly obfuscate text by littering it with random letters and numbers every second ("other") character, and then be able to reverse that again with another macro.
So:
banana
would become:
bma0ndaNn4aR
(b?a?n?a?n?a?)
...And then be able to undo this again by removing every other character with a backspace.
...
I found this method so far (from the question "How to insert spaces between characters using Regex?"):
(?<=.)(?!$)
But as best I understand, this is not actually capturing anything so I can't use this to replace with expressions I've found for printing random letters and numbers, such as:
^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])$
I'm sure a tweak to that would work and then I could reverse it all by replacing the same search with \b.

There are better ways of doing this, but you can use the following Python prototype as a starting point for your own script:
import string
import random
inputText = 'banana'
#encoding
obfuscatedText = ''.join([x + random.choice(string.ascii_letters+string.digits) for x in inputText])
print(obfuscatedText)
#decoding
originalText = ''.join([x for x in obfuscatedText][0:len(obfuscatedText)-1:2])
print(originalText)
Explanations:
Encoding:
[x for x in inputText] generates a list of characters from the input string.
random.choice(string.ascii_letters+string.digits) picks one character from the union of string.ascii_letters and string.digits.
x + random.choice(string.ascii_letters+string.digits) creates a two-character string by concatenating each character of the input with the generated character.
''.join() builds a single string from that list.
Decoding:
[x for x in obfuscatedText][0:len(obfuscatedText)-1:2] keeps only the characters located at indices 0, 2, 4, 6, ...
''.join() regenerates a string from the resulting list.
Execution:
$ python obfuscate.py
biaLncaIn4aE
banana
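The same idea can be written a little more compactly with slicing, since a string can be sliced directly with a step of 2. This is only a sketch; the function names obfuscate and deobfuscate are chosen here for illustration.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits

def obfuscate(text):
    """Insert one random filler character after every character of text."""
    return ''.join(ch + random.choice(ALPHABET) for ch in text)

def deobfuscate(text):
    """Keep the characters at even indices, dropping the filler."""
    return text[::2]

encoded = obfuscate('banana')
decoded = deobfuscate(encoded)
print(encoded)
print(decoded)
```

The encoded string differs on every run because the filler is random; only the round trip back to the original is deterministic.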

Problems when trying to filter strings in a list?

I have a large list of strings and I would like to extract everything inside parentheses, so I am using the following regex:
text_list = [' 1__(this_is_a_string) 74_string__(anotherString_with_underscores) question__(stringWithAlot_of_underscores) 1.0__(another_withUnderscores) 23:59:59__(get_arguments_end) 2018-05-13 00:00:00__(get_arguments_start)']
import re
r = re.compile(r'\([^)]*\)')
a_lis = list(filter(r.search, text_list))
print(a_lis)
I tested my regex here, and it works. However, when I apply the above regex, I end up with an empty list:
[]
Any idea how to extract all the tokens inside parentheses from a list?

Your regex is OK (though perhaps you don't want to capture the parentheses as part of the match), but search() is the wrong method to use: it returns a match object for the first match only. You want findall() to get the text of all the matches:
list(map(r.findall, text_list))
This will give you a list of lists, where each inner list contains the strings which were inside parentheses.
For example, given this input:
text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']
The result is:
[['(qwe)', '(gdfd)'], [], ['(rgf)']]
If you want to exclude the parentheses, change the regex slightly:
r'\(([^)]*)\)'
The unescaped parentheses within the escaped ones form a capturing group, and findall() then returns only the captured text.
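Putting both variants side by side, as a small sketch using the sample input from this answer (the variable names are illustrative only):

```python
import re

text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']

with_parens = re.compile(r'\([^)]*\)')    # match includes the parentheses
inner_only = re.compile(r'\(([^)]*)\)')   # capturing group keeps only the inside

matches = [with_parens.findall(s) for s in text_list]
inner = [inner_only.findall(s) for s in text_list]

# Flatten into a single list of tokens if the per-string grouping is not needed.
flat = [tok for s in text_list for tok in inner_only.findall(s)]

print(matches)
print(inner)
print(flat)
```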

Confusion with string split method in python

Consider the following example
a= 'Apple'
b = a.split(',')
print(b)
Output is ['Apple'].
I don't understand why it returns a list even when there is no ',' character in 'Apple'.
There may be cases where we call split expecting more than one element in the list, but because the separator is not present in the string there is only one element. Wouldn't it be better if this mistake were caught by the split method itself?

The behaviour of a.split(',') when no commas are present in a is perfectly consistent with the way it behaves when there are a positive number of commas in a.
a.split(',') says to split string a into a list of substrings that are delimited by ',' in a; the delimiter is not preserved in the substrings.
If 1 comma is found you get 2 substrings in the list, if 2 commas are found you get 3 substrings in the list, and in general, if n commas are found you get n+1 substrings in the list. So if 0 commas are found you get 1 substring in the list.
If you want 0 substrings in the list, then you'll need to supply a string with -1 commas in it. Good luck with that. :)
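The n+1 rule is easy to check directly. A quick sketch, with sample strings made up for illustration:

```python
samples = ['Apple', 'Apple,Pear', 'Apple,Pear,Plum', ',', '']

results = {}
for s in samples:
    parts = s.split(',')
    # n commas always yield n + 1 substrings, including when n == 0.
    assert len(parts) == s.count(',') + 1
    results[s] = parts
    print(repr(s), '->', parts)
```

Even the empty string follows the rule: it contains zero commas, so the result is a list holding one (empty) substring.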

The docstring of that method says:
Return a list of the words in the string S, using sep as the delimiter string.
The delimiter is used to separate multiple parts of the string; having only one part is not an error.

That's the way the split() method works. If you do not want that behaviour, you can implement your own my_split() function as follows (note that it returns the original string, not a list, when the delimiter is absent):
def my_split(s, d=' '):
    return s.split(d) if d in s else s
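If the goal is for a missing separator to be flagged as an error, as the question suggests, one option is a small wrapper that raises instead of returning a different type. This is a sketch; split_strict is a hypothetical helper, not part of the standard library.

```python
def split_strict(s, sep=','):
    """Split s on sep, but raise ValueError if sep does not occur in s."""
    if sep not in s:
        raise ValueError(f'separator {sep!r} not found in {s!r}')
    return s.split(sep)

print(split_strict('Apple,Pear'))

try:
    split_strict('Apple')
except ValueError as err:
    print('caught:', err)
```

Raising keeps the return type consistent: callers always get a list or an exception, never a bare string.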

replace multiple words - python

Suppose the input is "some word".
I want to replace this input with "<strong>some</strong> <strong>word</strong>" in some other text which contains this input.
I am trying this code:
input = "some word".split()
pattern = re.compile('(%s)' % input, re.IGNORECASE)
result = pattern.sub(r'<strong>\1</strong>',text)
but it fails, and I know why. I am wondering how to pass all elements of the list input to compile() so that (%s) can match each of them.
I appreciate any help.

The right approach, since you're already splitting the list, is to surround each item of the list directly (never using a regex at all):
sterm = "some word".split()
result = " ".join("<strong>%s</strong>" % w for w in sterm)
In case you're wondering, the pattern you were looking for was:
pattern = re.compile('(%s)' % '|'.join(sterm), re.IGNORECASE)
This works on your string because the regular expression would become
(some|word)
which means "matches some or matches word".
However, this is not a good approach, as it does not work for all strings. For example, consider cases where a search word also occurs inside other words, such as searching for "a banana" in the text
a banana and an apple
which becomes:
<strong>a</strong> <strong>banana</strong> <strong>a</strong>nd <strong>a</strong>n <strong>a</strong>pple
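One way to avoid matching inside longer words is to anchor the alternation with \b word boundaries. The sketch below assumes the search terms are plain words; highlight is a hypothetical helper name, and re.escape is added so that punctuation in a term is not treated as regex syntax.

```python
import re

def highlight(terms, text):
    """Wrap whole-word, case-insensitive matches of any term in <strong> tags."""
    words = terms.split()
    # \b on both sides stops "a" from matching inside "and", "an" or "apple".
    pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, words)),
                         re.IGNORECASE)
    return pattern.sub(r'<strong>\1</strong>', text)

first = highlight('a banana', 'a banana and an apple')
second = highlight('some word', 'Some word, many other words')
print(first)
print(second)
```

Word boundaries only behave as expected for terms that start and end with letters, digits or underscores, so terms with leading or trailing punctuation would need different handling.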

It looks like you want to search for multiple words: this word or that word. That means you need to separate the alternatives with |, as in the script below:
import re
text = "some word many other words"
input = '|'.join('some word'.split())
pattern = re.compile('(%s)' % input, flags=0)
print(pattern.sub(r'<strong>\1</strong>', text))

I'm not completely sure I understand what you're asking, but if you want to pass all the elements of input as separate arguments in a function call, you can use *input instead of input; * unpacks the list into its elements. Note, though, that re.compile() expects a single pattern string, so as an alternative, couldn't you just join the list with | and add ( at the beginning and ) at the end?

Alternatively, you can use str.join() with a list comprehension to create the intended result.
text = "some word many other words".split()
result = ' '.join(['<strong>'+i+'</strong>' for i in text])

manipulating list items python

line = "english: while french: pendant que spanish: mientras german: während "
words = line.split('\t')
for each in words:
    each = each.rstrip()
print(words)
The string in line is tab-delimited but also has a single space after each translated word, so while split returns the list I'm after, each word annoyingly has a whitespace character at the end.
In the loop I'm trying to go through the list and remove any trailing whitespace from the strings, but it doesn't seem to work. Any suggestions?

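Calling line.split() with no arguments splits on any run of whitespace, so it returns stripped words, but it also breaks multi-word entries such as "pendant que" apart.
Reassigning each inside the loop does not change the words list.
It should be done like this:
for i in range(len(words)):
    words[i] = words[i].rstrip()
Or:
words = list(map(str.rstrip, words))
See the map docs for details on map.
Or as a one-liner with a list comprehension:
words = [x.rstrip() for x in line.split("\t")]
Or with regex .findall, requiring each match to end in a character that is neither a tab nor a space:
words = re.findall(r"[^\t]*[^\t ]", line)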

words = line.split('\t')
words = [ i.rstrip() for i in words ]
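To see why the original loop had no effect, here is a small sketch. It assumes the entries are separated by tabs with one trailing space before each tab, which is a guess at the layout described in the question.

```python
# Assumed layout: entries separated by tabs, each followed by one trailing space.
line = "english: while \tfrench: pendant que \tspanish: mientras \tgerman: während "

words = line.split('\t')
for each in words:
    each = each.rstrip()   # rebinds the name only; the list still holds the old strings
unchanged = list(words)

# Building a new list (or assigning by index) is what actually stores the results.
stripped = [w.rstrip() for w in words]

print(unchanged)
print(stripped)
```

Strings are immutable, so rstrip() always returns a new string; the result has to be stored somewhere for the change to be visible.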

You can use a regular expression:
import re
words = re.split(r' *\t| +$', line)[:-1]
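With this you define the whole sequence (optional spaces followed by a tab, or the spaces at the end of the line) as the delimiter. The * operator allows any number of spaces before the tab, including none. The final [:-1] drops the empty string that the trailing spaces leave at the end of the list.
EDIT: Fixed after Roger Pate pointed out an error.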
