I have an html text with strings such as
sentence-transformers/paraphrase-MiniLM-L6-v2
I want to extract all the strings that appear after "sentence-transformers/".
I tried models = re.findall("sentence-transformers/"+"(\w+)", text), but it only output the first word (paraphrase), while I want the full "paraphrase-MiniLM-L6-v2".
Also, I don't know len("paraphrase-MiniLM-L6-v2") a priori.
How can I extract the full string?
Many thanks,
Ele
The problem with your regex is that the hyphen (-) is not a word character, so \w+ stops at the first one. The following regex, which accepts word characters and hyphens, works on your example:
import re
text = 'sentence-transformers/paraphrase-MiniLM-L6-v2'
models = re.findall(r'sentence-transformers/([\w-]+)', text)
assert models[0] == 'paraphrase-MiniLM-L6-v2'
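To see how the same pattern behaves on a longer piece of HTML with several occurrences, here is a minimal sketch. The html string and its second model name are made up for illustration; only the first name comes from the question.

```python
import re

# Hypothetical HTML snippet; only the first model name is from the question.
html = (
    '<a href="/sentence-transformers/paraphrase-MiniLM-L6-v2">model</a> '
    '<a href="/sentence-transformers/all-mpnet-base-v2">model</a>'
)

# [\w-]+ accepts letters, digits, underscores and hyphens, so each match
# runs up to the closing quote instead of stopping at the first hyphen.
models = re.findall(r'sentence-transformers/([\w-]+)', html)
print(models)
```

If the names can also contain dots, the character class would need to be widened to [\w.-].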
I am trying to search for a given string in a specified location in the sentence. A name is embedded in a sentence and I want to search if that said name is present or not.
The sentence is:
s = "Category: Image_by_Thoma_for_my_favourite_Theme"
My python code is:
if 'Th' in s:
    print('present')
Note that this will return present because of 'Thoma' and 'Theme'.
So I guess the best approach is using a regex to get the exact position of 'Thoma' and return present. I am new to Python and I really do not know much about regex. Or is there a way to check whether the word 'Thoma' is positioned exactly after the words in s?
I hope I've understood your question correctly. Are you searching for word that is after Image_by_? You can use Regex for that:
import re
s = "Category: Image_by_Thoma_for_my_favourite_Theme"
pat = re.compile(r"Image_by_([^_]+)")
match = pat.search(s)
if match:
    print(match.group(1))
Prints:
Thoma
You can then check, if the .group(1) is equal to Thoma, or use str.startswith to check if the name starts with Th.
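The two follow-up checks mentioned above could look like this. It is only a sketch; the variable names name, is_thoma and starts_with_th are my own.

```python
import re

s = "Category: Image_by_Thoma_for_my_favourite_Theme"

# Capture the run of non-underscore characters right after "Image_by_".
match = re.search(r"Image_by_([^_]+)", s)
name = match.group(1) if match else None

# Exact comparison against the expected name.
is_thoma = name == "Thoma"
# Looser check: only the first two letters have to match.
starts_with_th = name is not None and name.startswith("Th")

print(is_thoma, starts_with_th)
```

Because the pattern is anchored on Image_by_, the later 'Theme' in the string no longer influences the result.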
I have a number of long strings and I want to match those that contain all words of a given list.
keywords=['special','dreams']
search_string1="This is something that manifests especially in dreams"
search_string2="This is something that manifests in special cases in dreams"
I want only search_string2 matched. So far I have this code:
if all(x in search_text for x in keywords):
    print("matched")
The problem is that it will also match search_string1. Obviously I need to include some regex matching that uses \w or \b, but I can't figure out how I can include a regex in the if all statement.
Can anyone help?
You can use a regex to do the same, but I prefer to just use plain Python.
A string can be split into a list of words with str.split (and str.join joins a list back into a string). Testing word in list_of_words then tells you whether the whole word is in the list, rather than a substring.
keywords = ['special', 'dreams']
found = True
for word in keywords:
    if word not in search_string1.split():
        found = False
Could be not the best idea, but we could check if one set is a part of another set:
keywords = ['special', 'dreams']
strs = [
"This is something that manifests especially in dreams",
"This is something that manifests in special cases in dreams"
]
_keywords = set(keywords)
for s in strs:
    s_set = set(s.split())
    if _keywords.issubset(s_set):
        print(f"Matched: {s}")
Axe319's comment works and is closest to my original question of how to solve the problem using regex. To quote the solution again:
all(re.search(fr'\b{x}\b', search_text) for x in keywords)
Thanks to everyone!
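For reference, here is that one-liner applied to both strings from the question. This is a sketch: the helper name contains_all_words is hypothetical, and the re.escape call is my addition so that keywords containing regex metacharacters are treated literally.

```python
import re

keywords = ['special', 'dreams']
search_string1 = "This is something that manifests especially in dreams"
search_string2 = "This is something that manifests in special cases in dreams"

def contains_all_words(text, words):
    # \b marks a word boundary, so 'special' does not match inside 'especially'.
    # re.escape is an addition to the quoted solution (see lead-in).
    return all(re.search(rf'\b{re.escape(w)}\b', text) for w in words)

print(contains_all_words(search_string1, keywords))
print(contains_all_words(search_string2, keywords))
```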
I have tried things like this, but there is no change between the input and output:
def remove_al(text):
    if text.startswith('ال'):
        text.replace('ال','')
    return text
text.replace returns the updated string but doesn't change the original, and your code discards that return value. You should change the code to
text = text.replace(...)
Note that in Python strings are "immutable"; there's no way to change even a single character of a string; you can only create a new string with the value you want.
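A quick way to convince yourself of this is to keep both the original and the returned value. This is a minimal sketch; the sample word is simply one that contains ال twice.

```python
text = 'الاستقلال'

# replace builds and returns a new string; text itself is left untouched.
result = text.replace('ال', '')

print(text)    # still the original word
print(result)  # the new string with every ال removed
```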
If you only want to remove the prefix ال and not every ال combination in the string, I'd suggest using:
def remove_prefix_al(text):
    if text.startswith('ال'):
        return text[2:]
    return text
If you simply use text.replace('ال',''), this will replace all ال combinations:
Example
text = 'الاستقلال'
text.replace('ال','')
Output:
'استقل'
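I would recommend a built-in string method instead of rolling your own in this case. str.lstrip works for the example below, but be aware that it treats its argument as a set of characters rather than a literal prefix, so it keeps stripping any leading ا or ل. If the stem can itself begin with one of those letters, use str.removeprefix (Python 3.9+) or a startswith check instead.
Example text (alrashid) in Arabic: 'الرَشِيد'
text = 'الرَشِيد'
clean_text = text.lstrip('ال')
print(clean_text)
Note that even though Arabic reads from right to left, lstrip strips the start of the string (which is visually on the right).
Also, as user 6502 noted, the issue in your code is that Python strings are immutable and the result of replace was discarded, so the function returned its input unchanged.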
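One caveat worth checking before relying on lstrip: it strips a set of characters, not a literal prefix. The sketch below compares the two behaviours. The word 'اللغة' and the helper remove_prefix are my own illustrations, not part of the answers above.

```python
def remove_prefix(word, prefix):
    # Equivalent to word.removeprefix(prefix) on Python 3.9+.
    return word[len(prefix):] if word.startswith(prefix) else word

# Hypothetical example: the stem after ال itself begins with ل.
word = 'اللغة'

set_stripped = word.lstrip('ال')         # strips every leading ا or ل
prefix_only = remove_prefix(word, 'ال')  # removes exactly one leading ال

print(set_stripped, prefix_only)
```

Here lstrip removes three characters while the prefix-based version removes only the two that make up ال.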
"ال" as prefix is quite complex in Arabic that you will need Regex to accurately separate it from its stem and other prefixes. The following code will help you isolate "ال" from most words:
import re
text = 'والشعر كالليل أسود'
words = text.split()
for word in words:
    alx = re.search(r'''^
        ([وف])?
        ([بك])?
        (لل)?
        (ال)?
        (.*)$''', word, re.X)
    groups = [alx.group(1), alx.group(2), alx.group(3), alx.group(4), alx.group(5)]
    groups = [x for x in groups if x]
    print(word, groups)
Running that (in Jupyter) you will get:
I have this input text file, bio.txt:
Enter for a chance to {win|earn|gain|obtain|succeed|acquire|get}
1⃣Click {Link|Url|Link up|Site|Web link} Below️
2⃣Enter Name
3⃣Do the submit(inside optin {put|have|positioned|set|placed|apply|insert|locate|situate|put|save|stick|know|keep} {shipping|delivery|shipment} adress)
I need to locate syntax like {win|earn|gain|obtain|succeed|acquire|get} and return a random word from it, for example: win
How can I do this in Python, starting from my code:
input = open('bio.txt', 'r').read()
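First, read the text file into a string; then find each group with a regex such as "\{([^{}]+)\}" and split it on "|" to build the list of candidate words. (A pattern like "{([a-z|]+)}" would miss groups that contain capitals or spaces, such as {Link|Url|Link up|Site|Web link}.) It could be done as follows:
import random
import re
# read the whole file; UTF-8 is assumed because the text contains emoji
with open('bio.txt', 'r', encoding='utf-8') as f:
    content = f.read()
seed = []
# collect the alternatives of every {a|b|c} group into one list
for group in re.findall(r'\{([^{}]+)\}', content):
    seed.extend(group.split('|'))
word = random.choice(seed)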
You can search for your pattern ("\{.*?\}" according to your example; the non-greedy .*? matters because one line can contain several groups) with a regex on each line.
Then, once you have found it, simply split the match on a separator ("|" according to your example).
Finally, return a random element of the list.
Regular expression doc : https://docs.python.org/2/library/re.html
Python's string common operation doc (including split ) https://docs.python.org/2/library/string.html
Get a random element of a list : How to randomly select an item from a list?
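Putting those steps together, one possible sketch replaces every {a|b|c} group in a line with one randomly chosen alternative. The function name spin and the use of re.sub with a callback are my own choices, not something the answers above prescribe.

```python
import random
import re

def spin(text):
    # Replace each {a|b|c} group with one randomly chosen alternative.
    # [^{}]+ keeps every match inside a single pair of braces.
    return re.sub(
        r'\{([^{}]+)\}',
        lambda m: random.choice(m.group(1).split('|')),
        text,
    )

line = "Enter for a chance to {win|earn|gain|obtain|succeed|acquire|get}"
print(spin(line))
```

To process the whole file, the same function can be called on the string returned by open('bio.txt', encoding='utf-8').read().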
This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following unicode information contains just such emoticons: pdf.
Using a string with english words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.
The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:
$cat <file containing the strings with emoticons> | ./emo.py
emo.py pseudo script:
import re
import sys
for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
        #do some counting using .split(" ")
        #print the counting
The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:
"Smiley emoticon rocks! I like you."
The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.
First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.
A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:
import re
s=u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(r'[\U0001f600-\U0001f650]', s))
Or, if the string is big enough that building up the whole findall list seems wasteful:
emoticons = re.finditer(r'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)
Counting words, you can do separately:
wordcount = len(s.split())
If you want to do it all at once, you can use an alternation group:
word_and_emoticon_count = len(re.findall(r'\w+|[\U0001f600-\U0001f650]', s))
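Here is the whole approach as one runnable sketch, assuming Python 3.3 or later, where strings hold full code points and the re module understands \U escapes in patterns. Replacing emoticons with spaces before splitting is my own choice, so that an emoticon wedged between two words does not glue them together.

```python
import re

s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"

# Character class covering the range used in the answer above.
emoticon = re.compile(r'[\U0001f600-\U0001f650]')

emoticon_count = len(emoticon.findall(s))
# Turn emoticons into spaces first, then count whitespace-separated words.
word_count = len(emoticon.sub(' ', s).split())
# Words (\w+) and emoticons counted together in a single pass.
combined_count = len(re.findall(r'\w+|[\U0001f600-\U0001f650]', s))

print(emoticon_count, word_count, combined_count)
```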
As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for these characters, but that's OK, because they don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.
Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:
(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)
You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])
Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:
\ud83d[\ude00-\ude50]
One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
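The conversion from a code point to its surrogate-pair code units, which the patterns above rely on, can be sketched as follows. The function name to_surrogate_pair is hypothetical; the arithmetic is the standard UTF-16 encoding of code points above U+FFFF.

```python
def to_surrogate_pair(code_point):
    # Only code points above the Basic Multilingual Plane have surrogate pairs.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)     # top 10 bits select the lead unit
    trail = 0xDC00 + (offset & 0x3FF)  # low 10 bits select the trail unit
    return lead, trail

# Range endpoints used in the examples above.
for cp in (0x1F600, 0x1F650, 0x1E050, 0x1FBBF):
    lead, trail = to_surrogate_pair(cp)
    print(f"U+{cp:05X} -> \\u{lead:04x} \\u{trail:04x}")
```

The printed pairs give the lead and trail values that go into the character classes of the surrogate-pair regexps.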
My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨👩👦👦 once, although it consists of 4 emojis.
import emoji
import regex
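def split_count(text):
    emoji_counter = 0
    # \X matches one grapheme cluster at a time
    data = regex.findall(r'\X', text)
    for word in data:
        # emoji.UNICODE_EMOJI comes from older releases of the emoji package;
        # releases from 2.0 on expose emoji.EMOJI_DATA instead
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # Remove the emoji from the given text
            text = text.replace(word, '')
    words_counter = len(text.split())
    return emoji_counter, words_counter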
Testing:
line = "hello 👩🏾🎓 emoji hello 👨👩👦👦 how are 😊 you today🙅🏽🙅🏽"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))
Output:
Number of emojis - 5, number of words - 7
If you are trying to read unicode characters outside the ascii range, don't convert into the ascii range. Just leave it as unicode and work from there (untested):
import sys
count = 0
# code points U+1F600 up to (but not including) U+1F650
emoticons = set(range(int('1f600',16), int('1f650', 16)))
for row in sys.stdin:
    for char in row:
        if ord(char) in emoticons:
            count += 1
print("%d emoticons found" % count)
Not the best solution, but it should work.
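This is my solution using re. Note that it counts every character that is not a word character, whitespace, a comma, or a period, so other punctuation such as "!" is counted as well: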
import re
text = "your text with emojis"
em_count = len(re.findall(r'[^\w\s,.]', text))
print(em_count)