Regex: match everything but escaped characters [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am using Scrapy to scrape the date that a comment was posted on a forum. I have been able to scrape the contents of the div that contains the date, but it has escaped whitespace characters (\r, \n, \t) on both sides that make the string unusable. I need a regular expression that matches everything except those escaped characters.
The string I am working with is "\r\n\t\t\t\r\n\t\t\t\t08-07-2019, 11:37:16 AM\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t". I want only to match the date inside.
The pattern that I was trying to use was (?<!\\\\)\\+[\\w-]+, as was recommended by other topics, but this doesn't match anything in that string.

You don't need a regex if you want to match everything. I strongly recommend using Item Loaders in Scrapy to process your fields (using .strip(), etc.).
You can also remove unwanted whitespace from your string using XPath normalize-space():
event_time = response.xpath('normalize-space(string(//YOUR/XPATH/HERE))').get()
But if you want to match part of a complex string, you can of course use a regular expression:
event_time = response.xpath('//YOUR/XPATH/HERE').re_first(r'(\d{2}-\d{2}-\d{4},\s+\d{2}:\d{2}:\d{2}\s+\w{2})')
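Outside of Scrapy, the same two approaches can be tried on the string from the question with only the standard library. This is a minimal sketch, not the answerer's code: the variable names are made up for illustration, and the pattern is the one from the re_first() call above.

```python
import re

# The raw value from the question, with whitespace on both sides.
raw = "\r\n\t\t\t\r\n\t\t\t\t08-07-2019, 11:37:16 AM\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t"

# Option 1: no regex at all. str.strip() removes leading and trailing
# whitespace, which includes \r, \n and \t.
stripped = raw.strip()

# Option 2: pull out just the date with the pattern from the answer.
DATE_RE = re.compile(r"\d{2}-\d{2}-\d{4},\s+\d{2}:\d{2}:\d{2}\s+\w{2}")
match = DATE_RE.search(raw)
extracted = match.group(0) if match else None

print(repr(stripped))
print(repr(extracted))
```

Both variables end up holding the bare date string. strip() is enough when the date is the only content; the regex is the safer choice when other text may surround it.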

Related

Detect / replace utf characters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed last year.
I want to detect and/or replace weird non-emoji Unicode characters that break my tokenization pipeline, like \uf0fc, which renders like a cup/glass.
That image / code is not contained in the emojis package, which I tried for filtering.
Is there a class that describes all such characters?
Is there a way I can reliably detect them?
This is a character from a Private Use Area. It happens to look like a tankard in your font, but the Unicode standard doesn't mandate a specific look or meaning for these; it has whatever meaning you assign to it. The idea is that you agree upon a meaning with whoever you're communicating with - privately, meaning without getting the Unicode Consortium involved.
You can use the standard unicodedata module to check whether a character is from the Co category, or just hardcode the ranges, as described here.
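A minimal sketch of the unicodedata approach follows. The helper names is_private_use and strip_private_use are hypothetical, chosen only for this example; unicodedata.category() is the standard-library call that returns the two-letter general category.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if ch belongs to the Private Use category (Co)."""
    return unicodedata.category(ch) == "Co"


def strip_private_use(text: str, replacement: str = "") -> str:
    """Replace every Private Use character in text with replacement."""
    return "".join(replacement if is_private_use(ch) else ch for ch in text)


sample = "tokens \uf0fc here"
print(is_private_use("\uf0fc"))
print(repr(strip_private_use(sample)))
```

Passing a placeholder such as " " as replacement instead of the empty string keeps token boundaries intact if the character sat between two words.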

Regex capture group is not capturing data [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I am trying to capture any alphanumeric characters between ''.
Regex
'(.*.doc)' will only capture .doc files.
'(\w)' should capture any alphanumeric character.
But I am looking to capture any character between '' except the ---- characters.
Here you can use the following regular expression: ([^\-\[\][\n']+)
An example:
regexr.com/5btcs
Is this good?
'[^'-]*'
It means a single ', then any run of characters that are neither ' nor -, then another '.
If you wish to capture things around the dashes though, you might have to capture inclusively and filter them out.
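The "capture inclusively and filter" idea from the last sentence could look like the sketch below. The sample text is invented for illustration, and the pattern '([^']*)' is an assumption on my part (it is not the pattern from either answer): it grabs everything between a pair of single quotes, dashes included, and the dashes are removed afterwards.

```python
import re

# Hypothetical input; the question does not show real sample data.
text = "'ab--cd' and 'report.doc'"

# Step 1: capture everything between single quotes, including dashes.
quoted = re.findall(r"'([^']*)'", text)

# Step 2: filter the dashes out of each captured value.
cleaned = [value.replace("-", "") for value in quoted]

print(quoted)
print(cleaned)
```

Splitting the work into two steps keeps the regex simple and keeps the text on both sides of the dashes, which a single character class that excludes - cannot do.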

Find all words including those with special characters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have text in an Excel file that looks something like this:
alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu
I am trying to use a regex to match all of these words and save them into a set().
I have tried r"(\w+..*?_.*?\w+)" as seen here, but I can't seem to capture the word huhu, which has no special characters.
Your regex only captures words that have a _ in them, and huhu doesn't.
You could change your regex to match one or more letters, digits, underscores, and dots:
([\w.]+)
I've forked your regex101.
If you wish to match something more precise, you might need to give us more information about your context and what exactly you are trying to match.
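As a quick check, the suggested character class can be run against the words from the question and collected into a set(). This sketch assumes the cell contents have already been read from the Excel file into a single newline-separated string; how they are read is not covered here.

```python
import re

# The sample words from the question, one per line.
text = """alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu"""

# [\w.]+ matches runs of letters, digits, underscores and dots,
# so each line above comes back as one match.
words = set(re.findall(r"[\w.]+", text))

print(sorted(words))
```

Because \w already includes the underscore, huhu and dan.dan_mmem_ber are matched by the same pattern.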

Matches whatever regular expression is not inside the parentheses [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
How can I match strings that are not in a given set of strings using Python regular expressions?
Example: the set of strings ('/abc|/bcd')
I want to match any string other than the ones in the parentheses. The exclusion should be an exact match.
Here's a fun one for you:
^(?!\/(?:abc|bcd)$).+
It uses a negative lookahead to ensure that the string being matched isn't one of the strings you don't want, then grabs whatever else is left.
Demo on Regex101
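In Python the lookahead pattern can be used as in the sketch below. The helper name is_allowed is hypothetical, and the \/ escape is dropped because Python's re module does not need the slash escaped; the pattern is otherwise the one from the answer.

```python
import re

# Negative lookahead: reject the string only if it is exactly /abc or /bcd.
NOT_EXCLUDED = re.compile(r"^(?!/(?:abc|bcd)$).+")


def is_allowed(value: str) -> bool:
    """Return True unless value is exactly one of the excluded strings."""
    return NOT_EXCLUDED.match(value) is not None


for candidate in ["/abc", "/bcd", "/abcd", "/xyz"]:
    print(candidate, is_allowed(candidate))
```

The $ inside the lookahead is what makes the exclusion exact: /abcd passes because the lookahead only fails on a string that ends right after /abc or /bcd. If no regex is required, a plain membership test such as value not in {"/abc", "/bcd"} expresses the same rule.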

Robot Framework: how to encode string using ascii? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have a keyword:
Verify Payment Method Field
    Element Text Should Be    ${paymentMethodValueField}    PDF-lasku sähköpostiin
Here are the logs:
Step 3 Fields verification :: OK: Display Customer Information fie... | FAIL |
The text of element '//div/span' should have been 'PDF-lasku s?hk?postiin' but in fact it was 'PDF-lasku s?hk?postiin'.
I need to write something like this, but I don't know how:
PDF-lasku s[ascii symbol]hk[ascii symbol]postiin
Can somebody help me?
I would probably convert the whole thing to one format or another and then evaluate it. Or is it important that the ASCII characters are located in certain parts of the string? If not, and you simply want to verify that what is returned is exactly what you expect, I'd probably use Encode String To Bytes for simplicity; if the ASCII is important, the encoding/decoding keyword might serve your needs.
http://robotframework.org/robotframework/latest/libraries/String.html#Encode%20String%20To%20Bytes
By using the above you could set it to ignore the characters that cannot be converted or replace them with a known character that you provide. Simply get the text first, then perform whatever manipulation you want and evaluate.
The alternative with regard to decoding/encoding if ASCII location is important is:
http://robotframework.org/robotframework/latest/libraries/BuiltIn.html#Convert%20To%20Bytes
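To see what the "ignore" and "replace" options do to the text from the question, here is a plain Python sketch. It assumes the Robot Framework keyword's errors argument behaves like the errors argument of Python's str.encode(); the variable names are only for illustration.

```python
text = "PDF-lasku sähköpostiin"

# errors="replace" swaps each character that ASCII cannot represent for "?".
replaced = text.encode("ascii", errors="replace")

# errors="ignore" silently drops those characters instead.
ignored = text.encode("ascii", errors="ignore")

print(replaced)
print(ignored)
```

With "replace" the result has the same s?hk?postiin shape as the log output, so applying the same conversion to both the expected and the actual text before comparing them would make the check independent of how ä and ö are encoded.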
