I am using python 2.7 with scrapy .20
I have this test
0552121152, +97143321090
I want to get the value before the comma and the value after it.
My regular expression is
\s*(.*), for the one before
and ,\s*(.*) for the one after
in both ways I got the whole test.
why please?
Edit
For who need a full details, here it is:
This is the page I am scraping http://www.justproperty.com/apartments/old-town/1092713-amazing-2br-apartment-in-the-old-town-island-with-a-burj-khalifa-and-address-hotel-view.html
and this is my scrapy code in cmd in shell:
s =sel.xpath("normalize-space(.//span[#class='content_agent']/span[last()]/span[2]/text())")
s.re(r'\s*(.*),')
Note
I am not asking about the solution to get these values, I am asking about why my mentioned-regular expression is not working
Try doing it this way:
import re
string = "0552121152, +97143321090"
pattern = re.compile(r"\s*(.*),\s*(.*)")
print pattern.search(string).groups()
Result:
('0552121152', '+97143321090')
[Finished in 0.3s]
The only difference might be because I used the .groups() method, that's why I'm getting it. If you can reproduce this on your end, then it's not your regular expression that is wrong but your coding style.
Other than that, I fail to see how the regex fails at all.
That because . matches also , and it's a greedy selection. Use something like more specific like [^,] or \d instead of . symbol.
Something like this would store the part before the comma in \1 and the part after the comma in \2
/(\s?[\d]+)(?:[, ]+)([+\d]+)/
This looks for an optional number of whitespace characters followed by some digits. Then it looks for either a space or comma (or both) followed by another string made up of a character class of numbers and/or a plus sign.
Related
I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
I am looking for a pattern that matches everything until the first occurrence of a specific character, say a ";" - a semicolon.
I wrote this:
/^(.*);/
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.
You need
/^[^;]*/
The [^;] is a character class, it matches everything but a semicolon.
^ (start of line anchor) is added to the beginning of the regex so only the first match on each line is captured. This may or may not be required, depending on whether possible subsequent matches are desired.
To cite the perlre manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
This should work in most regex dialects.
Would;
/^(.*?);/
work?
The ? is a lazy operator, so the regex grabs as little as possible before matching the ;.
/^[^;]*/
The [^;] says match anything except a semicolon. The square brackets are a set matching operator, it's essentially, match any character in this set of characters, the ^ at the start makes it an inverse match, so match anything not in this set.
None of the proposed answers did work for me. (e.g. in notepad++)
But
^.*?(?=\;)
did.
Try /[^;]*/
Google regex character classes for details.
sample text:
"this is a test sentence; to prove this regex; that is g;iven below"
If for example we have the sample text above, the regex /(.*?\;)/ will give you everything until the first occurence of semicolon (;), including the semicolon: "this is a test sentence;"
Try /[^;]*/
That's a negating character class.
This was very helpful for me as I was trying to figure out how to match all the characters in an xml tag including attributes. I was running into the "matches everything to the end" problem with:
/<simpleChoice.*>/
but was able to resolve the issue with:
/<simpleChoice[^>]*>/
after reading this post. Thanks all.
this is not a regex solution, but something simple enough for your problem description. Just split your string and get the first item from your array.
$str = "match everything until first ; blah ; blah end ";
$s = explode(";",$str,2);
print $s[0];
output
$ php test.php
match everything until first
This will match up to the first occurrence only in each string and will ignore subsequent occurrences.
/^([^;]*);*/
"/^([^\/]*)\/$/" worked for me, to get only top "folders" from an array like:
a/ <- this
a/b/
c/ <- this
c/d/
/d/e/
f/ <- this
Really kinda sad that no one has given you the correct answer....
In regex, ? makes it non greedy. By default regex will match as much as it can (greedy)
Simply add a ? and it will be non-greedy and match as little as possible!
Good luck, hope that helps.
This works for getting the content from the beginning of a line till the first word,
/^.*?([^\s]+)/gm
I faced a similar problem including all the characters until the first comma after the word entity_id. The solution that worked was this in Bigquery:
SELECT regexp_extract(line_items,r'entity_id*[^,]*')
I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit refering to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl includes what you are about to replace.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here , since your regex splits every match it finds into groups which are 1,2... so on. This is so because of the parenthesis () you have placed in the regex.
$1 , $2 or \1,\2 can be used to refer to them.
In this case: The regex is replacing all numbers after the leading 0 (which is caught by group 2) with itself.
Note: \1 is not necessary. works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub('\d',' ',s)
'awd frr cr '
>>>
Explanation:
As it is, '\d' is for integer so removes them and replaces with repl (in this case ' ').
I have been struggling with creating a regular expression that will differentiate between an object definition and calling that object in Python. The purpose is syntax highlighting.
This is the situation which needs to be resolved: (numbers denote line)
0 def otherfunc(vars...):
1 pass
2
3 otherfunc(vars...)
I am interested in matching the name of the object, but not if preceded anywhere by def in the same line. The result on the above code should be:
"otherfunc", line: 3
Is regular expressions capable of doing something like this?
EDIT: I am only concerned with scanning/searching a single line at a time.
You could use negative lookbehind. This matches an atom that is not preceded by an atom. So in your case your looking for otherfunc which is not preceded by "def"
I'm use PCRE regex here.
(?<!def\s)otherfunc
I like Richards answer, however I would also take into considerarion the valid function name characters of phyton and intendation. So this is what I came up with:
(?<!(def\s))(?<=^|\s)[a-zA-Z_][\w_]*(?=\()
See this working sample on Rexex101
Explanation
Matches valid python function names if
(?<!(def\s)) they are not following a def and a whitespace and
(?<=^|\s) are either at the beginning of a line, or following a whitespace (this is the closest you get, since lookbehinds dont support wildcard specifiers) and
are followed by a opening bracket (()
Note that I am not an phyton dev, so for the sake of simplicity [a-zA-Z_][\w_]* matches valid phyton 2.x function names, you can extend this part of the expression to phyton 3.x which I have no clue of ;)