Split String with Python Regexp - python

If I have a string like:
"|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
I need to split the string into the following form:
CLL23|STR. CALIFORNIA
CLL12|AV. PHILADELFIA 438
CLL10|AV. 234 DEPTO 34
I tried the following:
r=re.compile('(?<=[|])([\w]+)')
v_sal=r.findall(v_campo)
print v_sal
Result:
['CLL23', 'CLL12', 'CLL10']
How could I get the rest of the string in Python?

Let's define your string:
>>> s = "|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
Now, let's print the formatted form:
>>> print('\n'.join('CLL' + word.rstrip('|') for word in s.split('|CLL') if word))
CLL23|STR. CALIFORNIA
CLL12|AV. PHILADELFIA 438
CLL10|AV. 234 DEPTO 34
The above divides on |CLL. This seems to work for your sample input.

Another simple solution would be to split() the string at every '|' and then print them in chunks:
s="|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
s1=filter(None, s.split('|')) #split string and filter empty strings
for x,y in zip(s1[0::2], s1[1::2]):
    print x + '|' + y
Output:
>>>
CLL23|STR. CALIFORNIA
CLL12|AV. PHILADELFIA 438
CLL10|AV. 234 DEPTO 34
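Since the question asks about regexes, a regex version is also possible. This is only a minimal sketch, assuming the fields always come in code/address pairs and never contain a '|' themselves. The names pairs and lines are mine, not from the question.

```python
import re

s = "|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"

# Each match is "|code|address"; the two groups capture the two fields.
pairs = re.findall(r'\|([^|]+)\|([^|]+)', s)

# Re-join every (code, address) tuple with a single '|'.
lines = ['|'.join(pair) for pair in pairs]
print('\n'.join(lines))
```

The trailing '|' in the input is left unmatched, so no empty entry needs to be filtered out.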

Related

Regex to unify a format of phone numbers in Python

I'm trying to write a regex to match a phone number like +34 (prefix), a single space, followed by 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I would need a regex to work with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}), but it does not work.
Ultimately I'd like to get a phone number like +34 (prefix), a single space, followed by 9 digits, no matter which of these cases is given. For that I'm using the re.sub() function, but I'm not sure how.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)
You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match number followed by any number of spaces 9 times
$ - end of string
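To go from matching to a unified format, one option is to capture the digit part and strip its whitespace. This is a rough sketch rather than a definitive implementation: the capturing group around the digits and the names PHONE_RE and normalize are my additions, not part of the pattern above.

```python
import re

# Same pattern as above, with a capturing group around the nine digits.
PHONE_RE = re.compile(r'^\+34\s*((?:\d\s*){9})$')

def normalize(phone):
    """Return '+34 ' followed by 9 contiguous digits, or None if no match."""
    m = PHONE_RE.match(phone)
    if m is None:
        return None
    # Drop any whitespace that separated the digits.
    digits = re.sub(r'\s+', '', m.group(1))
    return '+34 ' + digits

for raw in ["+34 886 24 68 98", "+34 980 202 157", "+34 846082423", "+34920459596"]:
    print(normalize(raw))
```

All four sample inputs from the question end up in the same "+34 NNNNNNNNN" shape.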
Here's a simple approach: use regex to get the plus sign and all the numbers into an array (one char per element), then use other list and string manipulation operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157
I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
That allows you to search for numbers with whitespace optionally placed before any of the digits. The non-capturing group (?:...) lets \s?\d repeat without each digit being listed as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678

How does this code format the list to the desired output?

I'm a beginner python student and I'm having trouble with lists.
This was last week's class, and I need help understanding how the commands in this code operate to change the list output.
The commands we were taught on lists and formatting are split() , strip() , slice() , append() and format()
More specifically, the part I don't understand is how the program knows which parts are the names and which parts are the grades. Where does it split the names from the grades, and what do we use the tuple for?
The list given to us is like this
Hopper, Grace 100 98 87 97
Knuth, Donald 82 87 92 81
Goldberg, Adele 94 96 90 91
Kernighan, Brian 89 74 89 77
Liskov, Barbara 87 97 81 85
and the desired output should look like this
Output
The code we wrote in class is:
exam1_score = 0
exam2_score = 0
exam3_score = 0
exam4_score = 0
name = ""
scores=[]
f=open("scores.txt","r")
for line in f:
    temp_list = []
    name = line[0:18]
    list = line[19:len(line)].split()
    ex1_score = int(list[0])
    ex2_score = int(list[1])
    ex3_score = int(list[2])
    ex4_score = int(list[3])
    avg_score = float(float(ex1_score+ex2_score+ex3_score+ex4_score)/float(4))
    temp_list.append(name)
    temp_list.append(ex1_score)
    temp_list.append(ex2_score)
    temp_list.append(ex3_score)
    temp_list.append(ex4_score)
    temp_list.append(avg_score)
    scores.append(tuple(temp_list))
scores = sorted(scores)
print("{:20s}{:6s}{:6s}{:6s}{:6s}{:10s}".format("Name", "Exam1", "Exam2", "Exam3", "Exam4", "Mean"))
ex1_mean = 0
ex2_mean = 0
ex3_mean = 0
ex4_mean = 0
avg_mean = 0
for baslik in scores:
    print ("{:20s}{:6d}{:6d}{:6d}{:6d}{:10.2f}".format(baslik[0],baslik[1],baslik[2],baslik[3],baslik[4],baslik[5]))
    ex1_mean = ex1_mean+baslik[1]
    ex2_mean = ex2_mean+baslik[2]
    ex3_mean = ex3_mean+baslik[3]
    ex4_mean = ex4_mean+baslik[4]
    avg_mean = avg_mean+baslik[5]
print ("{:20s}{:6s}{:6s}{:6s}{:6s}{:10.2s}".format("Exam Mean",str(ex1_mean/len(scores)),str(ex2_mean/len(scores)),str(ex3_mean/len(scores)),str(ex4_mean/len(scores)),str(avg_mean/len(scores))))
f.close()
I guess that each input is on a different line, i.e.
Hopper, Grace 100 98 87 97
Knuth, Donald 82 87 92 81
Goldberg, Adele 94 96 90 91
Kernighan, Brian 89 74 89 77
Liskov, Barbara 87 97 81 85
where the first 18 characters include the name, because we have the line:
name = line[0:18]
Afterwards we split the remaining string with
list = line[19:len(line)].split()
So, the input is most likely formatted, such that you can hard-code the split of names and grades.
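Here is a small sketch of that idea. It assumes the file pads every name with spaces to a fixed width (the sample as pasted in the question has its spacing collapsed), so I build two padded lines by hand. The parse function and the variable names are mine, not from the class code.

```python
def parse(line):
    # Columns 0-17 hold the name, everything from column 19 on holds the grades.
    name = line[0:18]
    grades = [int(x) for x in line[19:].split()]
    mean = sum(grades) / float(len(grades))
    # One tuple per student: (name, exam1, exam2, exam3, exam4, mean)
    return (name,) + tuple(grades) + (mean,)

# Hypothetical fixed-width lines: name padded to 19 characters, then the grades.
lines = [
    "{:<19s}{}".format("Knuth, Donald", "82 87 92 81\n"),
    "{:<19s}{}".format("Hopper, Grace", "100 98 87 97\n"),
]

# Tuples compare element by element, so sorting orders the records by name.
records = sorted(parse(line) for line in lines)
for record in records:
    print(record)
```

As far as I can tell, that is also what the tuple is for in the class code: each student becomes one fixed record, and sorted(scores) can order the records by their first element, the name.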
Sounds like you were never told what strings and characters are. Understanding those concepts will give you some background to understand what's going on.
You can think of a word as a string. The letters in that word are the characters.
A sentence is a string as well. The only difference between a word and a sentence is the spaces between the words. The computer does not care whether a character in the string is a space or a letter. We have to tell the computer to remove the spaces.
So all the split() function does (by default) is separate a sentence on the spaces. It takes all those words and puts them into a list.
For example, say the last sentence was stored in variable sentence.
In python that would look like:
sentence = "It takes all those words and puts them into a list"
Now say we just want each word in a list.
broken_sentence = sentence.split()
If you print broken_sentence with
print(broken_sentence)
you will get
['It', 'takes', 'all', 'those', 'words', 'and', 'puts', 'them', 'into', 'a', 'list']
Now the computer has a list. If you want the word It, you access that via index or slicing.
print(broken_sentence[0]) will print the FIRST element in the list which is 'It'.
Ask your teacher why counting starts with zero in computers, that should be a fun discussion.
Hope this breaks it down a little bit for you.

removing words from a list from pandas column - python 2.7

I have a text file which contains some strings that I want to remove from my data frame. The data frame observations contain those texts which are present in the text file.
Here is the text file: https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD
Here is the data: https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz
I am using the following code -
import nltk
from nltk.tokenize import word_tokenize
file = open("D://Users/Shivam/Desktop/rahulB/fliter.txt")
result = file.read()
words = word_tokenize(result)
I loaded the text files and converted them into words/tokens.
This is my dataframe:
text
0 What Fresh Hell Is This? January 31, 2018 ...A...
1 What Fresh Hell Is This? February 27, 2018 My ...
2 What Fresh Hell Is This? March 31, 2018 Trump ...
3 What Fresh Hell Is This? April 29, 2018 Michel...
4 Join Email List Contribute Join AMERICAblog Ac...
As you can see, some texts are present in all the rows, such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac", "Sign in Daily Roundup MS Legislature Elected O", etc.
I used this for loop
for word in words:
    df['text'].replace(word, ' ')
This is my error:
error Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
1577 def replace(self, pat, repl, n=-1, case=None, flags=0):
1578 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579 flags=flags)
1580 return self._wrap_result(result)
1581
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
422 if use_re:
423 n = n if n >= 0 else 0
--> 424 regex = re.compile(pat, flags=flags)
425 f = lambda x: regex.sub(repl=repl, string=x, count=n)
426 else:
D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
192 def compile(pattern, flags=0):
193 "Compile a regular expression pattern, returning a pattern object."
--> 194 return _compile(pattern, flags)
195
196 def purge():
D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
249 p = sre_compile.compile(pattern, flags)
250 except error, v:
--> 251 raise error, v # invalid expression
252 if not bypass_cache:
253 if len(_cache) >= _MAXCACHE:
error: nothing to repeat
You can use str.replace
Ex:
df['text'] = df['text'].str.replace("|".join(words), " ")
You can modify your code in this way:
for word in words:
    df['text'] = df['text'].str.replace(word, ' ')
You may use
df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")
The r"(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) line will perform these steps:
[re.escape(x) for x in words] - will escape all special chars in the words to be used with regex safely
"|".join([...]) - will create alternations that will be matched by regex engine
r"\s*(?<!\w)(?:{})(?!\w)".format(....) - will create a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that will match words as whole words from the list (\s* will also remove 0+ whitespaces before the words).
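The "nothing to repeat" error most likely comes from tokens that are regex operators, such as ?, ending up unescaped in the joined pattern. A sketch with plain re that reproduces this and shows the escaped pattern compiling. The token list here is hypothetical (I am guessing at what word_tokenize produced) and pandas is left out to keep the example self-contained.

```python
import re

# Hypothetical tokens; the lone "?" is the kind of token that breaks the pattern.
words = ["What", "Fresh", "Hell", "Is", "This", "?"]

# Unescaped: the "?" alternative has nothing before it to repeat.
try:
    re.compile("|".join(words))
    unescaped_compiles = True
except re.error as exc:
    unescaped_compiles = False
    print("unescaped pattern failed:", exc)

# Escaped, and restricted to whole words by the lookarounds.
pattern = re.compile(
    r"\s*(?<!\w)(?:{})(?!\w)".format("|".join(re.escape(w) for w in words))
)

cleaned = pattern.sub(" ", "What Fresh Hell Is This? January 31, 2018")
print(repr(cleaned))

# "Is" is not removed from inside a longer word.
print(pattern.sub(" ", "Island"))
```

The same compiled pattern string can be passed to df['text'].str.replace as in the answer above.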

How can I match the whole regex not the subexpression

Say I have the following regex to search a series of room numbers:
import re
re.findall(r'\b(\d)\d\1\b','101 102 103 201 202 203')
I want to find the room numbers whose first and last digits are the same (101 and 202). The above code gives
['1','2']
which corresponds to the subexpression (\d). But how can it return the whole room numbers, like 101 and 202?
import re
print [i for i,j in re.findall(r'\b((\d)\d\2)\b','101 102 103 201 202 203')]
or
print [i[0] for i in re.findall(r'\b((\d)\d\2)\b','101 102 103 201 202 203')]
You can use a list comprehension here. re.findall returns all the groups in a regex, so you need two groups: the first captures the whole room number, and the second is used for the backreference. Each result is then a tuple of two items, and you keep only the first one.
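A different technique, not the one used above: re.finditer yields match objects, and group(0) is always the whole match, so the original pattern can stay as it is. A short sketch in Python 3 syntax (the variable names are mine):

```python
import re

rooms = '101 102 103 201 202 203'

# group(0) is the entire match, regardless of how many groups the pattern has.
matches = [m.group(0) for m in re.finditer(r'\b(\d)\d\1\b', rooms)]
print(matches)
```

This avoids adding an outer group and renumbering the backreference.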

Making a decryption program in python

So a little while ago I asked for some help with an encryption program,
And you guys were amazing and came up with the solution.
So I come to you again in search of help for the equivalent decryption program.
The code I have got so far is like this:
whinger = 0
bewds = raw_input ('Please enter the encrypted message: ')
bewds = bewds.replace(' ', ', ')
warble = [bewds]
print warble
wetler = len(warble)
warble.reverse();
while whinger < wetler:
    print chr(warble[whinger]),
    whinger += 1
But when I input
101 103 97 115 115 101 109
it comes up with the error that the input is not an integer.
What I need is when I enter the numbers it turns them into a list of integers.
But I don't want to have to input all the numbers separately.
Thanks in advance for your help :P
To convert input string into a list of integers:
numbers = [int(s) for s in "101 103 97 115 115 101 109".split()]
Here's almost the simplest way I can think of to do it:
s = '101 103 97 115 115 101 109'
numbers = []
for number_str in s.replace(',', ' ').split():
    numbers.append(int(number_str))
It will allow the numbers to be separated with commas and/or one or more space characters. If you only want to allow spaces, leave the ".replace(',', ' ')" out.
Your problem is that raw_input returns a string. So you have two options.
1. Use the regular expression library re, e.g.:
import re
bewds = raw_input ('Please enter the encrypted message: ')
some_list = []
for find in re.finditer(r"\d+", bewds):
    some_list.append(int(find.group(0)))
2, Or you can use split method as described in the most voted answer to this question: sscanf in Python
You could also use map
numbers = map(int, '101 103 97 115 115 101 109'.split())
This returns a list in Python 2, but a map object in Python 3, which you might want to convert into a list.
numbers = list(map(int, '101 103 97 115 115 101 109'.split()))
This does exactly the same as J. F. Sebastian's answer.
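Putting the pieces together, the whole decryption step could look roughly like this. It is a sketch in Python 3 syntax. It assumes the encryption wrote out the character codes of the message in reverse order, which is what the warble.reverse() call in the question suggests. The decrypt function name is mine.

```python
def decrypt(encrypted):
    # Turn "101 103 97" into [101, 103, 97].
    codes = [int(token) for token in encrypted.split()]
    # Undo the assumed reversal and map every code back to its character.
    return ''.join(chr(code) for code in reversed(codes))

print(decrypt('101 103 97 115 115 101 109'))
```

In Python 2 you would read the input with raw_input() and pass the resulting string to decrypt().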
