regular expression to remove only leading zeros in pyspark? - python

I have the input string +00000000995510.32 and I need to remove the + sign and the leading zeros; my output should be 995510.32.
Is there a regular expression to do this in regexp_replace?
My current code:
df.withColumn("vl_fat", regexp_replace(col("vl_fat"), "^([0-9]|[1-9][0-9])$+", ""))
but that didn't work

If you want to practise regex, try https://regex101.com/. The pattern you describe starts with one + followed by zero or more 0s, which in Python regex is [+][0]*. You also need to consider regex lookaheads, which can get a little weird. This should work, however:
(?![+])(?![0]).*
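You can sanity-check that lookahead pattern outside Spark with plain Python `re` (a quick sketch using the sample value from the question):

```python
import re

s = "+00000000995510.32"
# re.search skips ahead to the first position whose next character is
# neither '+' nor '0'; from there .* grabs the rest of the string
m = re.search(r"(?![+])(?![0]).*", s)
print(m.group())  # -> 995510.32
```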

You can use the regex "\+0+" to catch the leading +000...
Explanation from regex101:
\+ matches the character + literally (case sensitive)
0 matches the character 0 literally (case sensitive)
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
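As a quick check of that pattern with plain Python `re` (a sketch; in Spark the same expression would go into regexp_replace). The `^` anchor is an addition here, to keep the match at the start of the value:

```python
import re

s = "+00000000995510.32"
# \+0+ matches the literal '+' followed by one or more '0's; the ^
# anchor (added here) restricts the match to the start of the value
print(re.sub(r"^\+0+", "", s))  # -> 995510.32
```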

My two cents: you can use regexp_extract (which seems to suit your use case better) and convert the input string into a float:
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame(
    [('+00000000995510.32',),
     ('34.32',),
     ('+00000.34',),
     ('+0444444',),
     ('9.',)],
    T.StructType([
        T.StructField('input_string', T.StringType())
    ])
)
df.withColumn(
    'parsed_float',
    F.regexp_extract('input_string', r'^(\+0+|)(\d+(\.\d*|))$', 2).cast(T.FloatType())
).show()
This is what you get:
+------------------+------------+
| input_string|parsed_float|
+------------------+------------+
|+00000000995510.32| 995510.3|
| 34.32| 34.32|
| +00000.34| 0.34|
| +0444444| 444444.0|
| 9.| 9.0|
+------------------+------------+
For the regex:
(\+0+|): this captures the (optional) initial + followed by one or more 0s
(\d+(\.\d*|)): this captures the whole figure, described as a sequence of digits followed by an (optional) sequence composed of a . and any number of decimals
The third argument of regexp_extract is the group you are interested in; in this case it is the second one, i.e., (\d+(\.\d*|)).

Instead of regex you might like to use TRIM. I find this easier to read and it better conveys the intention of the code. Note that TRIM treats '+0' as a set of characters rather than a sequence, so it will strip any leading run of + and 0 characters, including a + appearing directly after your leading zeros.
import pyspark.sql.functions as F
df = spark.createDataFrame([('+00000000995510.32',)], ['number'])
df.withColumn('trimmed', F.expr("TRIM(LEADING '+0' FROM number)")).show()
+------------------+---------+
| number| trimmed|
+------------------+---------+
|+00000000995510.32|995510.32|
+------------------+---------+
Or if you want an actual number, you could simply cast it to float (or decimal). Note any value which cannot be cast will become NULL.
df.withColumn('trimmed', F.col('number').cast('float')).show()
+------------------+--------+
| number| trimmed|
+------------------+--------+
|+00000000995510.32|995510.3|
+------------------+--------+


ValueError: could not convert string to float: " " (empty string?)

How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one exponent, 4E+1.
However when I run it I then get the error as in title which I presume is an empty string.
What am I missing here to allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall("[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
The problem is not only the number of groups, but the fact that the last alternative in your regex is optional (see ? added right after it, and your regex demo). However, since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the string if the match is not at the string start position.
It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# => 0
# 0 5.110000e+00
# 1 5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
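A quick plain-`re` check (a sketch) that the listed variants all pick up the scientific-notation case:

```python
import re

sample = "EIV (5.11E+12 gCO₂/t·nm)"
patterns = [
    r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)",
    r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)",
    r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)",
]
for p in patterns:
    # each variant captures the full number including the exponent
    print(re.search(p, sample).group(1))  # -> 5.11E+12 for all three
```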
If your column consists of data in the same format as you posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_exctracted_column'] = df['column containing that value'].str.extract('(\d+(?:\.\d+)?)')
df
5.11

How to split a string in python based on separator with separator as a part of one of the chunks?

Looking for an elegant way to:
Split a string based on a separator
Instead of discarding the separator, making it a part of the split chunks.
For instance I do have date and time data like:
D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30
Sometimes there's a D, sometimes not (however I always want it to be part of the first chunk), there are no trailing or leading zeros for the time, and the timezone only has ':' sometimes. The point is, it is necessary to split on these 'D', 'T', '+' characters because the segments might not all be the same length. If they were, it would be easier to split by index. I want to split over multiple characters like T and + and keep them as part of the data as well, like:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
I know a nicer way would be to clean the data first and normalize all rows to follow the same pattern, but I'm just curious how to do it as-is.
For now my ugly solution looks like:
[i+j for _, i in enumerate(['D','T','TZ']) for __, j in enumerate('D2018-4-21T3:55+6'.replace('T',' ').replace('D', ' ').replace('+', ' +').split()) if _ == __]
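The one-liner above can be sketched more directly with re.split on a zero-width lookahead (requires Python 3.7+, where re.split accepts zero-width patterns); chunks is just an illustrative helper name:

```python
import re

def chunks(s):
    # split *before* each 'T' or '+' (zero-width lookahead, so the
    # separator is kept as part of the following chunk)
    date, time, tz = re.split(r"(?=[T+])", s, maxsplit=2)
    if not date.startswith("D"):   # normalize the optional leading D
        date = "D" + date
    return [date, time, "TZ" + tz]

print(chunks("D2018-4-21T3:55+6"))      # -> ['D2018-4-21', 'T3:55', 'TZ+6']
print(chunks("2018-4-4T3:15+6"))        # -> ['D2018-4-4', 'T3:15', 'TZ+6']
print(chunks("D2018-11-21T12:45+6:30")) # -> ['D2018-11-21', 'T12:45', 'TZ+6:30']
```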
Use a regular expression
Reference:
https://docs.python.org/3/library/re.html
(...)
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use \( or \), or enclose them inside a
character class: [(], [)].
import re
a = '''D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30'''
b = a.splitlines()
for i in b:
    m = re.search(r'^D?(.*)([T].*?)([-+].*)$', i)
    if m:
        print(["D%s" % m.group(1), m.group(2), "TZ%s" % m.group(3)])
Result:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']

regex to fix csv quotes

I have a simple csv with quotes, something like:
"something","something","something","something",...
BUT, sometimes I get csv with
"something","som"ething"","s"omething",...
and I wanted to create a regex that will fix this problem, does someone have something to offer?
something that will take out everything from the string that is not a number or text; but when I take out " I need to make sure it's not the ones that bound the string, because I need those.
So from "som"ething"","s"ometh8 ing" I'd expect => "something","someth8 ing"
I'm using Scala but any solution will be great!
thanks!!
Simple solution
A simple solution in Scala:
scala> val input = """"som"ething"","s"ometh8 ing""""
input: String = "som"ething"","s"ometh8 ing"
scala> val values = input.split("\",\"").map(_.filter(c => c.isLetterOrDigit || c.isWhitespace))
values: Array[String] = Array(something, someth8 ing)
scala> val output = values.mkString("\"", "\",\"", "\"")
output: String = "something","someth8 ing"
Assuming you never have "," inside your values, but if you do then there's no way to fix your CSV unambiguously anyway.
This isn't the most optimal solution speed or memory-wise, but it's short and simple.
EDIT: Regex solution
In case you really want some regexes, enjoy:
scala> input.replaceAll("""(^"|"$|","|[\p{IsAlphabetic}\p{Digit}\p{Space}])|.""", "$1")
res17: String = "something","someth8 ing"
This tries to match " at the beginning or end of input OR "," anywhere else OR any of your approved characters. If any of these match, it goes to the first capturing group. Otherwise, it matches any character (.), but doesn't capture it in a group, so the first group stays empty. Then, the matched substring is replaced with $1, which is the content of the first capturing group.
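The same capture-or-drop trick translates to Python, for what it's worth (a sketch; `[\w ]` stands in for the approved-character class, and this relies on Python 3.5+, where an unmatched group in the replacement expands to an empty string):

```python
import re

s = '"som"ething"","s"ometh8 ing"'
# boundary quotes, the "," separator, and approved characters are
# captured by group 1 and kept; anything else falls through to the
# bare '.' alternative, leaves group 1 empty, and is dropped
fixed = re.sub(r'(^"|"$|","|[\w ])|.', r'\1', s)
print(fixed)  # -> "something","someth8 ing"
```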
I still think the first solution is cleaner and easier to understand.
import re
csv_string = '"something","som"ething"","s"omething"' + '\n'
for each_str in re.findall(r'(.*?)[,\n]', csv_string):
    print(re.sub(r'"', '', each_str))
Adding a line feed to the end of the string lets re.findall capture the last part of the string as well.

Best Way to Add Implied Multiplication (*) to a Python String?

I know how to do this in other languages that are stronger with RegEx, but I'm not sure about Python. Basically what I'm trying to do is convert
(30(5-10x))+10=20
into
(30*(5-10x))+10=20
And it would also be nice to add
(30*(5-10*x))+10=20
if the x is preceded by a number.
In Perl, the regex would look something like
/ \w+\K (?=\() | \)\K (?=\w) | \)\K (?=\() /*/
And to take care of the x's:
/\d\K(?=[x])/*/
How is this best done in Python?
I like it better the way Nate did it, i.e., just add the *, rather than taking the surrounding characters out only to put them back in.
>>> e = '(1+2)(30(5-10x))x+10=20'
>>> re.sub('(?<=\w|\))(?=\() | (?<=\))(?=\w) | (?<=\d)(?=x)', '*', e, flags=re.X)
'(1+2)*(30*(5-10*x))*x+10=20'
The three parts are "before (", "after )", and "between digit and x". Could maybe be combined, but then we might combine too much, so I find it clearer and safer this way.
You can do this for your example, and adapt it accordingly for strings like (x+2)10 if you need:
import re
s = '(30(5-10x))+10=20'
r = re.sub(r'(\d)([(a-zA-Z])', r'\1*\2', s)
print(r) # Prints (30*(5-10*x))+10=20
Short explanation:
The capturing groups \1 and \2 are substituted with ... the same capturing groups and an additional * in between them.
(Code improvements by #shashank)
I will only translate your regular expressions so that they do not use \K.
You can achieve this by also matching the preceding characters, and using them in the replacement as a backreference.
/ \w+\K (?=\() | \)\K (?=\w) | \)\K (?=\() /*/
becomes
re.sub(r'(\w+(?=\()|\)(?=\w)|\)(?=\())', r'\1*', original)
And to take care of the x's:
/\d\K(?=[x])/*/
becomes
re.sub(r'(\d(?=[x]))', r'\1*', original)
Notes:
I removed the spaces; I suppose you added them for legibility.
I don't think your Perl regex is specific enough. For instance it will transform sin(x) into sin*(x).

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days, trying to capture, with a single definition, several cases of inconsistent formatting in the address field of a database.
I am new to Python and regular expressions, and have gotten great feedback here on Stack Overflow; with my new knowledge, I built a regex that is getting close to the final result, but I still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print group1
print group2
print group3
This should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3; instead, it returns:
('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other two isolate the 'a' at the end, and the second one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more (letter or one or more spaces)"
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the 1-or-more symbol + inside your brackets [ ], since your regex will already be matching one or more of them based on the outer +. I'd rewrite this part like this:
([\w\s]*\w)
Which will match eagerly up to the last alphanumeric character ("zero or more (letter or space) followed by a letter"). This does assume you have at least one character, but is better than your assumption that a single space would work as well.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
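A quick sketch checking that this separator alternation accepts exactly those three forms:

```python
import re

sep = re.compile(r"(?:,\s*\(|,\s*|\s*\()")
# comma + paren, comma only, paren only -- all three separator forms
for form in [", (", ", ", " ("]:
    print(bool(sep.fullmatch(form)))  # -> True for each
```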
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the city name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, it will eagerly match the characters venezuel and then need to satisfy this final expression with what it has left, a. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
Testing it on your examples:
>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
... return re.findall('\w+',text)
...
>>> for sample in samples:
... print find_words(sample)
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']
