Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have text in an Excel file that looks something like this:
alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu
I am trying to use a regex to match all of these words and save them into a set().
I have tried r"(\w+..*?_.*?\w+)", as seen here, but I can't seem to capture the word huhu, which has no special characters.
Your regex only captures words that contain a _, and huhu doesn't.
You could change your regex to match one or more letters, digits, underscores, or dots:
([\w.]+)
I've forked your regex101.
If you wish to match something more precise, you might need to give us more information about your context and what exactly you are trying to match.
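As a rough illustration of the suggestion above, here is a minimal Python sketch that collects the matches into a set. The sample text is copied from the question; the names `text`, `WORD_RE`, and `words` are made up for this example, and reading the Excel file itself is left out.

```python
import re

# Sample values copied from the question; in practice they would be read
# from the Excel file (that part is not shown here).
text = """alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu"""

# One or more word characters (letters, digits, underscore) or dots.
WORD_RE = re.compile(r"[\w.]+")

# findall returns every non-overlapping match; set() removes duplicates.
words = set(WORD_RE.findall(text))
print(sorted(words))
```

Because the character class no longer requires an underscore, a plain word such as huhu is matched as well.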
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed last year.
I want to detect and/or replace odd non-emoji Unicode characters that break my tokenization pipeline, like \uf0fc, which renders like a cup/glass.
That character is not contained in the emojis package, which I tried for filtering.
Is there a class that describes all such characters?
Is there a way I can reliably detect them?
This is a character from a Private Use Area. It happens to look like a tankard in your font, but the Unicode standard doesn't mandate a specific look or meaning for these; it has whatever meaning you assign to it. The idea is that you agree upon a meaning with whoever you're communicating with - privately, meaning without getting the Unicode Consortium involved.
You can use the standard unicodedata module to check whether a character is from the Co category, or just hardcode the ranges, as described here.
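A minimal sketch of the `unicodedata` approach, assuming you want to either detect or replace such characters. The helper names `is_private_use` and `replace_private_use` are invented for this example and are not part of any library.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is in general category Co."""
    return unicodedata.category(ch) == "Co"


def replace_private_use(text: str, replacement: str = "") -> str:
    """Replace every Private Use character in text with replacement."""
    return "".join(replacement if is_private_use(ch) else ch for ch in text)


sample = "done \uf0fc next"
print(any(is_private_use(ch) for ch in sample))  # detection
print(replace_private_use(sample, "?"))          # replacement
```

Checking the category rather than hardcoding ranges also covers the supplementary Private Use planes, at the cost of one lookup per character.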
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
How can I find common substrings of two strings in order to show changes/edit of a string?
What I am trying to do is compare an old version of a string:
string_old = "My name is pm730! How are you?"
with a new/edited version of the string:
string_new = "My name isn't pm730, it is pm740!"
Deleted substrings are not important. New substrings should be distinguished somehow, so that I could output it like this eventually:
My name isn't pm730 , it is pm740!
This task sounds easy but is more complicated than I thought. My hope is that a similar implementation is already available, but unfortunately I can't find one.
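One possible starting point, offered only as a sketch, is the standard library's `difflib.SequenceMatcher`, which reports which parts of the new string are unchanged and which were inserted or replaced. The function name `mark_additions` and the bracket markers are assumptions for this example. The comparison is character-based, so the marked spans may not line up with whole words.

```python
import difflib


def mark_additions(old: str, new: str, start: str = "[", end: str = "]") -> str:
    """Return new with every inserted or replaced span wrapped in markers."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    parts = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        segment = new[j1:j2]
        if tag == "equal":
            parts.append(segment)                    # common substring
        elif tag in ("insert", "replace"):
            parts.append(f"{start}{segment}{end}")   # new text
        # "delete" spans exist only in old and are ignored
    return "".join(parts)


string_old = "My name is pm730! How are you?"
string_new = "My name isn't pm730, it is pm740!"
print(mark_additions(string_old, string_new))
```

To get word-level marks instead, the same loop could be run over `old.split()` and `new.split()` and the pieces joined with spaces.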
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I am trying to capture any alphanumeric character between single quotes ('').
Regex
'(.*.doc)' will only capture .doc files.
'(\w)' should capture any alpha numeric character.
But I am looking to capture any character between '' except the ---- characters.
Here you can use the following regular expression: ([^\-\[\][\n']+)
An example:
regexr.com/5btcs
Is this good?
'[^'-]*'
This means a single ', then any run of characters other than ' or -, then another '.
If you wish to capture things around the dashes though, you might have to capture inclusively and filter them out.
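The "capture inclusively and filter" idea might look like the sketch below. The sample input is made up, since the question does not show the actual text, and `QUOTED_RE`, `captured`, and `cleaned` are names chosen for this example.

```python
import re

# Hypothetical input: the question does not include the real data.
text = "'report.doc' '----' 'a--b'"

# Step 1: capture everything between a pair of single quotes, dashes included.
QUOTED_RE = re.compile(r"'([^']*)'")
captured = QUOTED_RE.findall(text)

# Step 2: drop the dashes afterwards and discard captures that become empty.
cleaned = [c.replace("-", "") for c in captured]
cleaned = [c for c in cleaned if c]

print(captured)
print(cleaned)
```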
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am using scrapy to scrape the date that a comment was posted on a forum. I have been able to scrape the contents of the div that contains the date, but it has escaped whitespace characters on both sides that make the string unusable. I need to create a regular expression which matches everything except for the escaped characters.
The string I am working with is "\r\n\t\t\t\r\n\t\t\t\t08-07-2019, 11:37:16 AM\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t". I want only to match the date inside.
The pattern that I was trying to use was (?<!\\\\)\\+[\\w-]+, as was recommended by other topics, but this doesn't match anything in that string.
You don't need a regex if you want to match everything except the surrounding whitespace. I strongly recommend using Item Loaders in Scrapy to process your fields (using .strip() etc.).
You can also remove unwanted characters from your string using XPath normalize-space():
event_time = response.xpath('normalize-space(string(//YOUR/XPATH/HERE))').get()
But if you want to match part of a complex string, you can of course use a regular expression:
event_time = response.xpath('//YOUR/XPATH/HERE').re_first(r'(\d{2}-\d{2}-\d{4},\s+\d{2}:\d{2}:\d{2}\s+\w{2})')
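Outside of Scrapy, both ideas can be tried on the raw string from the question with plain Python. This is only a sketch: the variable names are made up, and the pattern is the same one passed to `re_first` above.

```python
import re

# The raw string quoted in the question.
raw = "\r\n\t\t\t\r\n\t\t\t\t08-07-2019, 11:37:16 AM\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t"

# Option 1: the unwanted characters are all whitespace, so strip() is enough.
stripped = raw.strip()

# Option 2: pull the timestamp out with the regular expression.
DATE_RE = re.compile(r"\d{2}-\d{2}-\d{4},\s+\d{2}:\d{2}:\d{2}\s+\w{2}")
match = DATE_RE.search(raw)
extracted = match.group(0) if match else None

print(repr(stripped))
print(repr(extracted))
```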
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
How can I match strings that are not inside a set of strings using python regular expressions?
Ex: set of strings ('/abc|/bcd')
I want to match any string other than those in the parentheses. It should be an exact match.
Here's a fun one for you:
^(?!\/(?:abc|bcd)$).+
It uses a negative lookahead to ensure that the string being matched isn't one of the strings you don't want, then grabs whatever else is left.
Demo on Regex101
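A short Python check of the negative-lookahead pattern, as a sketch. The helper name `is_allowed` is invented for this example, and the `\/` escape is dropped because Python does not need it.

```python
import re

# Reject exactly "/abc" or "/bcd"; accept any other non-empty string.
NOT_IN_SET = re.compile(r"^(?!/(?:abc|bcd)$).+")


def is_allowed(s: str) -> bool:
    """Return True if s is not one of the excluded strings."""
    return NOT_IN_SET.match(s) is not None


for candidate in ["/abc", "/bcd", "/abcd", "/ab", "abc"]:
    print(candidate, is_allowed(candidate))
```

The `$` inside the lookahead is what makes the exclusion exact: "/abcd" still matches because "/abc" is not followed by the end of the string.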