Python: identify random syntax in text - python

I have this input text file, bio.txt:
Enter for a chance to {win|earn|gain|obtain|succeed|acquire|get}
1⃣Click {Link|Url|Link up|Site|Web link} Below️
2⃣Enter Name
3⃣Do the submit(inside optin {put|have|positioned|set|placed|apply|insert|locate|situate|put|save|stick|know|keep} {shipping|delivery|shipment} adress)
I need to locate syntax like {win|earn|gain|obtain|succeed|acquire|get} and return a random word from it, for example: win
How can I locate this in Python, starting from my code:
input = open('bio.txt', 'r').read()

First, read the text file into a string and find the groups with a regex such as "\{([^{}]+)\}", which also accepts options containing capitals or spaces (for example "Link up"). Then split each match on "|" to build a list of candidate words. It could be achieved as follows:
import re, random
seed = []
with open('bio.txt', 'r') as f:
    matches = re.findall(r'\{([^{}]+)\}', f.read())
# note: this pools the options of every {...} group into one list
for m in matches:
    seed.extend(m.split('|'))
word = random.choice(seed)

You can search each line for your pattern with a regex ("\{.*?\}" fits your example; the non-greedy ? matters because one line can contain several groups).
Then, once you have found a match, strip the braces and split it on the separator ("|" in your example).
Finally, return a random element of the list.
Regular expression doc: https://docs.python.org/2/library/re.html
Python's common string operations doc (including split): https://docs.python.org/2/library/string.html
Get a random element of a list: How to randomly select an item from a list?
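The steps above can be sketched as follows. This is one possible sketch, not either answerer's exact code: the helper name spin and the sample string are hypothetical, and it assumes groups are never nested. It replaces each {a|b|c} group in place with one randomly chosen option by passing a function to re.sub.

```python
import random
import re

# Matches one {opt1|opt2|...} group; assumes groups are not nested.
GROUP = re.compile(r'\{([^{}]+)\}')

def spin(text, rng=random):
    """Replace every {a|b|c} group with one randomly chosen option."""
    def pick(match):
        # match.group(1) is the text between the braces
        return rng.choice(match.group(1).split('|'))
    return GROUP.sub(pick, text)

# Hypothetical sample line in the same style as bio.txt
sample = "Click {Link|Url|Site} Below to {win|earn|get}"
print(spin(sample))
```

Passing a seeded random.Random instance as rng makes the output reproducible, which can help when testing.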

Related

Search for a specific string at a specified position in a sentence

I am trying to search for a given string at a specified location in a sentence. A name is embedded in the sentence and I want to check whether that name is present or not.
The sentence is:
s = "Category: Image_by_Thoma_for_my_favourite_Theme"
My Python code is:
if 'Th' in s:
    print('present')
Note that this prints present because 'Th' occurs in both 'Thoma' and 'Theme'.
So I guess the best approach is to use a regex to get the exact position of 'Thoma' and print present. I am new to Python and don't know much about regex. Or is there a way to check whether the word 'Thoma' is positioned exactly after the preceding words in s?
I hope I've understood your question correctly. Are you searching for the word that comes after Image_by_? You can use a regex for that:
import re
s = "Category: Image_by_Thoma_for_my_favourite_Theme"
pat = re.compile(r"Image_by_([^_]+)")
match = pat.search(s)
if match:
    print(match.group(1))
Prints:
Thoma
You can then check if .group(1) is equal to Thoma, or use str.startswith to check if the name starts with Th.
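As a sketch of that follow-up check (the helper name author_of is hypothetical, and the Image_by_ marker is taken from the example sentence), the captured name can be compared exactly or by prefix:

```python
import re

PAT = re.compile(r"Image_by_([^_]+)")

def author_of(s):
    """Return the name right after 'Image_by_', or None if the marker is absent."""
    match = PAT.search(s)
    return match.group(1) if match else None

s = "Category: Image_by_Thoma_for_my_favourite_Theme"
name = author_of(s)
print(name == "Thoma")                              # exact comparison
print(name is not None and name.startswith("Th"))   # prefix comparison
```

Because only the text right after Image_by_ is captured, the trailing Theme no longer affects the result.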

How to find an unknown string from a known pattern? Python re.findall

I have an HTML text with strings such as
sentence-transformers/paraphrase-MiniLM-L6-v2
I want to extract all the strings that appear after "sentence-transformers/".
I tried models = re.findall("sentence-transformers/"+"(\w+)", text), but it only outputs the first word (paraphrase), while I want the full "paraphrase-MiniLM-L6-v2".
Also, I don't know the length of "paraphrase-MiniLM-L6-v2" a priori.
How can I extract the full string?
Many thanks,
Ele
The problem with your regex is that - is not considered a word character, and you are only searching for word characters. The following regex works on your example:
import re
text = 'sentence-transformers/paraphrase-MiniLM-L6-v2'
models = re.findall(r'sentence-transformers/([\w-]+)', text)
assert models[0] == 'paraphrase-MiniLM-L6-v2'
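For a longer text containing several model names, the same pattern returns all of them. The HTML below is made up for illustration; only the first model name comes from the question, and the second is invented.

```python
import re

# Hypothetical HTML snippet; the second model name is invented for the example.
text = (
    '<a href="/sentence-transformers/paraphrase-MiniLM-L6-v2">one</a> '
    '<a href="/sentence-transformers/example-model_v1">two</a>'
)

# [\w-]+ accepts letters, digits, underscores and hyphens,
# so each match stops at the closing quote.
models = re.findall(r'sentence-transformers/([\w-]+)', text)
print(models)
```

If names may contain other characters, such as dots, the character class would need extending, for example to [\w.-]+.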

How would I remove the Arabic prefix "ال" from an Arabic string?

I have tried things like this, but there is no change between the input and output:
def remove_al(text):
    if text.startswith('ال'):
        text.replace('ال','')
    return text
text.replace returns the updated string but doesn't change the original, so you should change the code to
text = text.replace(...)
Note that in Python strings are "immutable"; there's no way to change even a single character of a string; you can only create a new string with the value you want.
If you want to remove only the prefix ال and not every ال combination in the string, I'd suggest using:
def remove_prefix_al(text):
    if text.startswith('ال'):
        return text[2:]
    return text
If you simply use text.replace('ال',''), this will replace all ال combinations:
Example
text = 'الاستقلال'
text.replace('ال','')
Output:
'استقل'
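I would recommend the method str.lstrip instead of rolling your own in this case. Be aware, though, that lstrip treats its argument as a set of characters rather than a prefix, so it keeps stripping any further leading ا or ل; it is only safe when the stem does not begin with one of those letters.
Example text (alrashid) in Arabic: 'الرَشِيد'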
text = 'الرَشِيد'
clean_text = text.lstrip('ال')
print(clean_text)
Note that even though Arabic reads from right to left, lstrip strips the start of the string (which is visually on the right).
Also, as user 6502 noted, the issue in your code is that Python strings are immutable: replace returns a new string, which the function discarded, so it returned the input unchanged.
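The difference between stripping a set of characters and removing an exact prefix can be seen in a short sketch. The helper remove_exact_prefix and the sample word are hypothetical; on Python 3.9+ the built-in str.removeprefix has the same effect as the helper.

```python
def remove_exact_prefix(text, prefix='ال'):
    """Remove prefix once if present (like str.removeprefix on Python 3.9+)."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# "the language": the prefix ال followed by a stem that itself starts with ل
word = 'اللغة'

print(word.lstrip('ال'))          # strips every leading ا or ل, including the stem's ل
print(remove_exact_prefix(word))  # strips only the two-character prefix
```

For a word such as 'الرَشِيد', whose stem starts with a different letter, both approaches give the same result.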
"ال" as a prefix is complex enough in Arabic that you will need a regex to separate it accurately from its stem and from other prefixes. The following code will help you isolate "ال" in most words:
import re
text = 'والشعر كالليل أسود'
words = text.split()
for word in words:
    # optional و/ف, optional ب/ك, optional لل, optional ال, then the rest of the word
    alx = re.search(r'''^
        ([وف])?
        ([بك])?
        (لل)?
        (ال)?
        (.*)$''', word, re.X)
    groups = [alx.group(1), alx.group(2), alx.group(3), alx.group(4), alx.group(5)]
    groups = [x for x in groups if x]
    print(word, groups)
Running that (in Jupyter) prints each word followed by the list of its non-empty parts.

Retrieve part of string, variable length

I'm trying to learn how to use regular expressions with Python. I want to retrieve an ID number (in parentheses) at the end of a string that looks like this:
"This is a string of variable length (561401)"
The ID number (561401 in this example) can be of variable length, as can the text.
"This is another string of variable length (99521199)"
My code fails:
import re
import selenium
# [Code omitted here, I use selenium to navigate a web page]
result = driver.find_element_by_class_name("class_name")
print(result.text)  # [This correctly prints the whole string "This is a text of variable length (561401)"]
id = re.findall("??????", result.text)  # [Not sure what to do here]
print(id)
This should work for your example:
(?<=\()[0-9]*
(?<=...) is a lookbehind: it requires the given text to come directly before the match but doesn't consume it. In this case, I used \(. ( is a special character, so it has to be escaped with \. [0-9] matches any digit. The * means match any number of repetitions of the directly preceding rule, so [0-9]* means match as many digits as there are.
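A minimal sketch of the lookbehind in use, on the sample string from the question. As a small variation on the pattern above, it uses \d+ instead of [0-9]* so that an empty "()" cannot produce an empty match.

```python
import re

text = "This is a string of variable length (561401)"

# (?<=\() requires an opening parenthesis just before the digits
# without including it in the match; \d+ needs at least one digit.
match = re.search(r'(?<=\()\d+', text)
if match:
    print(match.group())
```

re.search returns None when no parenthesised number is present, so the if guard avoids an AttributeError.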
Solved this thanks to Kaz's link, very useful:
http://regex101.com/
id = re.findall(r"(\d+)", result.text)
print(id[0])
You can use this simple solution:
>>> originString = "This is a string of variable length (561401)"
>>> str1 = originString.replace("(", " ")
>>> str1
'This is a string of variable length  561401)'
>>> str2 = str1.replace(")", " ")
>>> str2
'This is a string of variable length  561401 '
>>> [int(s) for s in str2.split() if s.isdigit()]
[561401]
First, I replace the parentheses with spaces, and then I search the new string for integers.
There is no real need for regular expressions here: if the ID is always at the end and always in parentheses, you can split the string, take the last element, and remove the parentheses by slicing ([1:-1]). Regexes are also relatively expensive compared with plain string methods.
line = "This is another string of variable length (99521199)"
print(line.split()[-1][1:-1])
If you did want to use regular expressions, I would do this:
import re
line = "This is another string of variable length (99521199)"
id_match = re.match(r'.*\((\d+)\)', line)
if id_match:
    print(id_match.group(1))
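If the ID should only count when it sits at the very end of the string, one option is to anchor the pattern with $. The sketch below is a variation, not the answerer's code, and the helper name trailing_id is hypothetical.

```python
import re

# \((\d+)\) captures digits inside parentheses;
# \s*$ pins them to the end of the string (trailing whitespace allowed).
TRAILING_ID = re.compile(r'\((\d+)\)\s*$')

def trailing_id(line):
    """Return the parenthesised ID at the end of line as a string, or None."""
    match = TRAILING_ID.search(line)
    return match.group(1) if match else None

print(trailing_id("This is another string of variable length (99521199)"))
print(trailing_id("Text with (123) in the middle only"))
```

Returning None rather than raising lets the caller decide how to handle strings without an ID.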

replace multiple words - python

There can be an input "some word".
I want to replace this input with "<strong>some</strong> <strong>word</strong>" in some other text which contains this input
I am trying with this code:
input = "some word".split()
pattern = re.compile('(%s)' % input, re.IGNORECASE)
result = pattern.sub(r'<strong>\1</strong>',text)
but it fails, and I know why. I am wondering how to pass all the elements of the list input to compile() so that (%s) can catch each of them.
I appreciate any help.
The right approach, since you're already splitting the list, is to surround each item of the list directly (never using a regex at all):
sterm = "some word".split()
result = " ".join("<strong>%s</strong>" % w for w in sterm)
In case you're wondering, the pattern you were looking for was:
pattern = re.compile('(%s)' % '|'.join(sterm), re.IGNORECASE)
This works on your string because the regular expression would become
(some|word)
which means "matches some or matches word".
However, this is not a good approach, as it does not work for all strings. For example, consider cases where a search term also occurs inside other words, such as searching for "a" and "banana" in
a banana and an apple
which becomes:
<strong>a</strong> <strong>banana</strong> <strong>a</strong>nd <strong>a</strong>n <strong>a</strong>pple
It looks like you want to search for multiple words: this word or that word. That means you need to separate your search terms with |, as in the script below:
import re
text = "some word many other words"
terms = '|'.join('some word'.split())
pattern = re.compile('(%s)' % terms, flags=0)
print(pattern.sub(r'<strong>\1</strong>', text))
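A pattern built this way also matches inside longer words (for example the a in and). One hedged variation adds \b word boundaries and escapes each term with re.escape. The function name highlight is hypothetical, and the sketch assumes every term starts and ends with a word character, since \b only works at the edge of one.

```python
import re

def highlight(text, terms):
    """Wrap whole-word occurrences of each term in <strong> tags."""
    # re.escape guards against terms containing regex metacharacters;
    # \b on both sides stops matches inside longer words.
    pattern = re.compile(
        r'\b(%s)\b' % '|'.join(re.escape(t) for t in terms),
        re.IGNORECASE,
    )
    return pattern.sub(r'<strong>\1</strong>', text)

print(highlight("a banana and an apple", ["a", "banana"]))
```

With the boundaries in place, and, an and apple are left untouched while the standalone a and banana are wrapped.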
I'm not completely sure I know what you're asking, but if you want to pass all the elements of input as separate arguments in a function call, you can use *input instead of input; * unpacks the list into its elements. Note, however, that re.compile takes a single pattern (plus optional flags), so unpacking does not help here. As an alternative, couldn't you just join the list with | and add ( at the beginning and ) at the end?
Alternatively, you can use the join operator with a list comprehension to create the intended result.
text = "some word many other words".split()
result = ' '.join(['<strong>'+i+'</strong>' for i in text])
