Extract values from String using Python - python

I am getting a string
name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"
I need to get the values Mathew, Thomas, PR123T, male.
Also, if the string doesn't have a value for zipcode, no value should be assigned for it.
I am a newbie to Python. Please help.

You need to use the .split() method that is available on every string. First split on the comma, then split each piece on = and select the element at index 1 (the value).
Once this is done, .join() the values with a comma again.
def split_my_fields(input_string):
    if 'zipcode=""' not in input_string:
        output = ', '.join(e.split('=')[1].replace('"','') for e in input_string.split(','))
        print(f'Output is {output}')
        return output
    else:
        print('Zipcode is empty.')

split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output:
>>> split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output is Mathew, Thomas, PR123T, male
'Mathew, Thomas, PR123T, male'
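The question also asks that an empty zipcode should not be assigned at all. One way to sketch that is to collect the key/value pairs into a dict and drop the empty ones. The name extract_fields is hypothetical, and the pattern assumes that every value is wrapped in double quotes and contains no embedded double quote.

```python
import re

# Assumption: values are always double-quoted and contain no embedded quotes.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

def extract_fields(input_string):
    """Return a dict of key -> value, skipping keys whose value is empty."""
    return {key: value for key, value in PAIR_RE.findall(input_string) if value}

fields = extract_fields('name="Mathew",lastname="Thomas",zipcode="",gender="male"')
print(fields)  # zipcode is absent because its value was empty
```

Callers can then use fields.get('zipcode') and treat None as "no value".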

In fact, my dear friend, you can use the third-party parse package (pip install parse):
>>> from parse import parse
>>> parse("name={},lastname={},zipcode={},gender={}","name='Mathew',lastname='Thomas',zipcode='PR123T',gender='male'")
<Result ("'Mathew'", "'Thomas'", "'PR123T'", "'male'") {}>

You can use named groups and create dictionary with keys corresponding to the group names:
import re
text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
expr = re.compile(r'^(name="(\s+)?(?P<name>.*?)(\s+)?")?,?(lastname="(\s+)?(?P<lastname>.*?)(\s+)?")?,?(zipcode="(\s+)?(?P<zipcode>.*?)(\s+)?")?,?(gender="(\s+)?(?P<gender>.*?)(\s+)?")?$')
match = expr.search(text).groupdict()
print(match['name']) # Mathew
print(match['lastname']) # Thomas
print(match['zipcode']) # PR123T
print(match['gender']) # male
The pattern captures everything between the double quotes and strips whitespace around it. For an empty zipcode value it returns an empty string (the same applies to the other named groups). It also handles missing key-value pairs, as long as the order in which the keys appear stays the same (e.g. text = 'name="Mathew",lastname="Thomas",gender="male"'); the group for a missing pair is None.
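To make the difference between an empty value and a missing pair concrete, here is a small sketch that reuses the pattern from this answer and keeps only the groups that carry a value. The helper name present_values is hypothetical.

```python
import re

# Same pattern as in the answer, copied verbatim.
expr = re.compile(r'^(name="(\s+)?(?P<name>.*?)(\s+)?")?,?(lastname="(\s+)?(?P<lastname>.*?)(\s+)?")?,?(zipcode="(\s+)?(?P<zipcode>.*?)(\s+)?")?,?(gender="(\s+)?(?P<gender>.*?)(\s+)?")?$')

def present_values(text):
    """Keep only the named groups that matched with a non-empty value."""
    match = expr.search(text)
    if match is None:  # the text does not fit the expected layout
        return {}
    # An empty value gives '' and a missing pair gives None; both are dropped.
    return {key: value for key, value in match.groupdict().items() if value}

print(present_values('name="Mathew",lastname="Thomas",zipcode="",gender="male"'))
print(present_values('name="Mathew",lastname="Thomas",gender="male"'))
```

Both calls should yield a dict without a zipcode key.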

Related

Excluding a specific string of characters in a str()-function

A small issue I've encountered during coding.
I'm looking to print out the name of a .txt file.
For example, the file is named: verdata_florida.txt, or verdata_newyork.txt
How can I exclude .txt and verdata_, but keep the string between? It must work for any number of characters, but .txt and verdata_ must be excluded.
This is where I am so far, I've already defined filename to be input()
print("Average TAM at", str(filename[8:**????**]), "is higher than ")
3 ways of doing it:
using str.split twice:
>>> "verdata_florida.txt".split("_")[1].split(".")[0]
'florida'
using str.partition twice (you won't get an exception if the format doesn't match, and probably faster too):
>>> "verdata_florida.txt".partition("_")[2].partition(".")[0]
'florida'
using re, keeping only center part:
>>> import re
>>> re.sub(r".*_(.*)\..*",r"\1","verdata_florida.txt")
'florida'
All of the above must be tuned if _ and . appear multiple times (depending on whether the longest or the shortest string should be kept).
EDIT: In your case, though, the prefix and suffix seem fixed. In that case, just use str.replace twice:
>>> "verdata_florida.txt".replace("verdata_","").replace(".txt","")
'florida'
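One caveat with str.replace is that it removes the substring wherever it occurs, not only at the ends. A minimal sketch that strips verdata_ only from the start and .txt only from the end could look like this; the helper name strip_affixes is hypothetical. On Python 3.9 and later, str.removeprefix and str.removesuffix cover the same need.

```python
def strip_affixes(filename, prefix="verdata_", suffix=".txt"):
    """Remove prefix from the start and suffix from the end, if present."""
    if prefix and filename.startswith(prefix):
        filename = filename[len(prefix):]
    if suffix and filename.endswith(suffix):
        filename = filename[:-len(suffix)]
    return filename

print(strip_affixes("verdata_florida.txt"))       # florida
print(strip_affixes("verdata_my.txt_notes.txt"))  # my.txt_notes
```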
Assuming you want to split on the first _ and the last ., you can use slicing together with the index and rindex methods. They search for the first (index) or last (rindex) occurrence of the substring passed as the argument and return its position. If the substring is not found, they raise a ValueError. If you want the search but not the ValueError, you can use find and rfind instead, which do the same thing but return -1 when there is no match.
s = 'verdata_new_hampshire.txt'
s_trunc = s[s.index('_') + 1: s.rindex('.')] # or s[s.find('_') + 1: s.rfind('.')]
print(s_trunc) # new_hampshire
Of course, if you are always going to exclude verdata_ and .txt you could always hardcode the slice as well.
print(s[8:-4]) # new_hampshire
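The -1 returned by find and rfind deserves some care, because it is also a valid slice index. The following sketch shows the pitfall and one way to guard against it; the helper name middle_part is hypothetical.

```python
def middle_part(s):
    """Return the text between the first '_' and the last '.', or None."""
    start = s.find('_')
    end = s.rfind('.')
    # -1 means "not found"; also reject a '.' that comes before the '_'.
    if start == -1 or end == -1 or end <= start:
        return None
    return s[start + 1:end]

print(middle_part('verdata_new_hampshire.txt'))  # new_hampshire
print(middle_part('verdata'))                    # None

# Without the guard, the slice silently becomes s[0:-1]:
s = 'verdata'
print(s[s.find('_') + 1: s.rfind('.')])          # verdat
```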
You can leverage str.split() on strings. For example:
s = 'verdata_newyork.txt'
s.split('verdata_')
# ['', 'newyork.txt']
s.split('verdata_')[1]
# 'newyork.txt'
s.split('verdata_')[1].split('.txt')
# ['newyork', '']
s.split('verdata_')[1].split('.txt')[0]
# 'newyork'
You can just split the string on the dot and the underscore like this:
filename = "verdata_prague.txt"
name = filename.split(".")[0]  # verdata_prague
name = name.split("_")[1]  # prague
or use the replace method:
filename = "verdata_prague.txt"
name = filename.replace(".txt", "")  # verdata_prague
name = name.replace("verdata_", "")  # prague

Regular Expression in Python 3

I am new here and have just started using regular expressions in my Python code. I have a string with 6 commas in it. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and that last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
Here is my code in Python:
if (re.search('"', string)):
    matches = re.findall(r'\"(.+?)\"',string)
    matches1 = re.sub(",", "", matches[0])
    string = re.sub(matches[0],matches1,string)
    string = re.sub('"','',string)
My problem is, I want to give a condition that the code only works for the last bit ("Cherry,") but unfortunately it affects other words in the middle (Cherry,Apple), which has the same text as the one between the quotation marks! That results in reducing the number of commas (from 6 to 4) as it merges two fields (Cherry,Apple) and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Cherry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.00000,business,acquisitions-mergers,acquisition-bid,interest,acquiree,fact,,,,,,,,,,,,,acquisition-interest-acquiree,Cherry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN20000424000597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many Thanks in advance
For your task you don't need regular expressions, just use replace:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer third-party regex module, where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# parts
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')
def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)
parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
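If installing a third-party module is not an option, the same idea of touching only the quoted text can be sketched with the standard re module by passing a function as the replacement. The helper name unquote_fields is hypothetical, and the sketch assumes quoted fields contain no escaped double quotes.

```python
import re

def unquote_fields(line):
    """Drop the quotes and any commas that sit inside a quoted field."""
    # The callback only sees text between double quotes, so identical
    # text elsewhere in the line (Cherry,Apple) is left alone.
    return re.sub(r'"([^"]*)"', lambda m: m.group(1).replace(',', ''), line)

string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
print(unquote_fields(string))  # Fruits,Pear,Cherry,Apple,Orange,Cherry
```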
Why not simply use this:
>>> ans_string = string.replace('"', '')[0:-1]
Output
>>> ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
For the sake of simplicity and algorithmic complexity.
You might consider using the csv module to do this.
Example:
>>> import csv
>>> s = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',','') for row in csv.reader([s]) for e in row])
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The csv module will strip the quotes but keep the commas on each quoted field. Then you can just remove that comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a CSV file, you would do something like this (in pseudocode):
with open(file, newline='') as csv_fo:
    # modify(field) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
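As a runnable variant of the idea, the sketch below parses one line with csv.reader, removes the commas inside each field, and writes the row back with csv.writer so that any field that still needs quoting gets it. The helper name clean_line is hypothetical.

```python
import csv
import io

def clean_line(line):
    """Parse one CSV line, remove commas inside fields, and re-join it."""
    row = next(csv.reader([line]))
    cleaned = [field.replace(',', '') for field in row]
    out = io.StringIO()
    # lineterminator='' keeps the writer from appending a newline.
    csv.writer(out, lineterminator='').writerow(cleaned)
    return out.getvalue()

print(clean_line('Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'))
```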

Regular expression for multiple occurrences in python

I need to parse lines having multiple language codes as below
008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>
008800002 is an ID,
Bruxelles-Nord$Br�ussel Nord$ is name one,
deu is language one,
$Brussel Noord$ is name two,
nld is language two.
So the idea is that a name and language pair can appear any number of times, and I need to collect them all.
The language in <> is 3 characters in length (fixed),
and all names end with a $ sign.
I tried this one, but it is not giving the expected output.
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)
(?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)
I have no idea how to get repeated elements.
It takes
Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.
Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.
import re
line = u"008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>"
m = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
id = m.group(1)
names = {}
for m in re.finditer(r"(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[m.group(2)] = m.group(1)
print(id, names)
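A dict keyed by language keeps only one name per language code. If the same code can appear more than once, a list of pairs preserves everything. The sketch below is one way to do that with re.findall; the function name parse_stop is hypothetical, and the sample line is an ASCII-only stand-in for the one in the question.

```python
import re

# ASCII-only stand-in for the sample line (assumption, to sidestep encoding issues).
line = "008800002 Bruxelles-Nord$Bruessel Nord$<deu>$Brussel Noord$<nld>"

def parse_stop(line):
    """Return (stop_id, [(name, lang), ...]) for one input line."""
    stop_id, rest = line.split(None, 1)
    # A name is everything up to the next '<'; the language code is exactly 3 characters.
    pairs = re.findall(r'([^<]*)<([^<>]{3})>', rest)
    return stop_id, pairs

print(parse_stop(line))
```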
\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>
Try this. Just grab the captures. See the demo:
http://regex101.com/r/hS3dT7/4

Extracting 2 strings from regular expression Python

I am trying to extract city, state and/or zip code from a string using a regular expression. The regex I am using (from here get city, state or zip from a string in python) is ([^\d]+)?(\d{5})? and when I tested it on http://regex101.com/ it accurately selects the two strings I want to match.
However I'm not sure how to separate these two strings in Python. Here is what I have tried:
import re
string = "binghamton ny 13905"
reg = re.compile(r'([^\d]+)?(\d{5})?')
match = reg.match(string)
return match.group()
This simply returns the entire string. Is there a way to pull each match individually?
I have also tried separating the regular expression into two distinct regular expressions (one for city, state and one for zip code) however the zip code regex either returns an empty string or None. All help is appreciated, thanks.
Probably the easiest way is to name the two capturing groups:
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
and then access the groupdict:
>>> match = reg.match("binghamton ny 13905")
>>> match.groupdict()
{'city': 'binghamton ny ', 'zip': '13905'}
This gives you easy access to the two pieces of information by name, rather than index.
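Since both groups are optional, either one can come back as None, and the city group keeps its trailing space. A minimal sketch that tidies both up follows; the helper name split_location is hypothetical.

```python
import re

LOCATION_RE = re.compile(r'(?P<city>\D+)?(?P<zip>\d{5})?')

def split_location(text):
    """Return (city_state, zip_code); either part is None when absent."""
    # Every part of the pattern is optional, so match() always succeeds.
    match = LOCATION_RE.match(text)
    city = match.group('city')
    return (city.strip() if city else None, match.group('zip'))

print(split_location("binghamton ny 13905"))  # ('binghamton ny', '13905')
print(split_location("13905"))                # (None, '13905')
print(split_location("binghamton ny"))        # ('binghamton ny', None)
```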
I would agree with jonrsharpe
string = "binghamton ny 13905"
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
result = re.match(reg, string)
Additionally you can access the variables by name like this:
result.group('city')
result.group('zip')
Python re reference page
r = re.search(r"([^\d]+)?(\d{5})?", string)
r.groups()
('binghamton ny ', '13905')

programmatically find and replace content dynamically in a string in python

I need to find and replace patterns in a string with dynamically generated content.
Let's say I want to find all strings within '' in the string and double them.
A string like:
my 'cat' is 'white' should become my 'catcat' is 'whitewhite'
Any match could also appear twice in the string.
Thank you.
Make use of the power of regular expressions. In this particular case:
import re
s = "my 'cat' is 'white'"
print(re.sub(r"'([^']+)'", r"'\1\1'", s)) # prints my 'catcat' is 'whitewhite'
\1 refers to the first group in the regex (called $1 in some other implementations).
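When the replacement has to be computed rather than written as a template, re.sub also accepts a function that receives each match object and returns the replacement text. The sketch below uses that; the helper name replace_quoted and its transform parameter are hypothetical.

```python
import re

def replace_quoted(text, transform=lambda word: word * 2):
    """Replace each '...' part with transform(part), keeping the quotes."""
    return re.sub(r"'([^']+)'", lambda m: "'" + transform(m.group(1)) + "'", text)

print(replace_quoted("my 'cat' is 'white'"))             # my 'catcat' is 'whitewhite'
print(replace_quoted("my 'cat' is 'white'", str.upper))  # my 'CAT' is 'WHITE'
```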
It's also pretty easy to do it without regex in your case:
s = "my 'cat' is 'white'".split("'")
# the parts between the ' are at the 1, 3, 5 .. index
print(s[1::2])
# replace them with new elements
s[1::2] = [x+x for x in s[1::2]]
# join that stuff back together
print("'".join(s))
