How do I split these strings into arrays of strings? - python

I have several strings with phrases or words separated by multiple spaces.
c1 = "St. Louis  12  Cardinals"
c2 = "Boston  16  Red Sox"
c3 = "New York  13  Yankees"
How do I write a function, perhaps using the Python split() function, to separate each line into an array of strings? For instance, c1 would go to ['St. Louis', '12', 'Cardinals'].
Calling split(" ") and then trimming the component entities won't work because some entities such as St. Louis or Red Sox have spaces in them.
However, I do know that all entities are at least 2 spaces apart and that no entity has 2 spaces within it. By the way, I actually have around 100 cities to deal with, not 3. Thanks!

Without regular expressions:
c1 = "St. Louis  12  Cardinals"
words = [w.strip() for w in c1.split('  ') if w.strip()]
# words == ['St. Louis', '12', 'Cardinals']
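The same idea scales to all of the lines; a minimal sketch wrapping it in a helper (the name `split_entities` is illustrative, not from the original):

```python
def split_entities(line):
    # Entities are separated by 2 or more spaces, so split on two spaces
    # and discard the empty or whitespace-only pieces left by longer runs.
    return [w.strip() for w in line.split('  ') if w.strip()]

for line in ["St. Louis  12  Cardinals",
             "Boston  16  Red Sox",
             "New York  13  Yankees"]:
    print(split_entities(line))
```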

import re
re.split(r' {2,}', c1)
re.split(r' {2,}', c2)
re.split(r' {2,}', c3)

You can use re.split
>>> re.split(r'\s{2,}', 'St. Louis  12  Cardinals')
['St. Louis', '12', 'Cardinals']

You could do this with regular expressions:
import re
blahRegex = re.compile(r'(.*?)\s+(\d+)\s+(.*)')
for line in open('filename', 'r'):
    m = blahRegex.match(line)
    if m is not None:
        city = m.group(1)
        rank = m.group(2)
        team = m.group(3)
There are a lot of ways to skin that cat: you could use named groups, or make your regular expression tighter. But this should do it.
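A sketch of the named-group variant (the group names `city`, `rank`, and `team` and the sample string are illustrative):

```python
import re

# Anchoring the separators to 2+ whitespace characters and making the
# last group greedy avoids the empty match a trailing lazy (.*?) gives.
row_re = re.compile(r'(?P<city>.+?)\s{2,}(?P<rank>\d+)\s{2,}(?P<team>.+)')

m = row_re.match("St. Louis  12  Cardinals")
if m is not None:
    print(m.group('city'), m.group('rank'), m.group('team'))
```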

It looks like that content is fixed-width. If that is always the case and assuming those are spaces and not tabs, then you can always reverse it using slices:
split_fields = lambda s: [s[:16].strip(), s[16:31].strip(), s[31:].strip()]
or:
def split_fields(s):
    return [s[:16].strip(), s[16:31].strip(), s[31:].strip()]
Example usage:
>>> split_fields(c1)
['St. Louis', '12', 'Cardinals']
>>> split_fields(c2)
['Boston', '16', 'Red Sox']
>>> split_fields(c3)
['New York', '13', 'Yankees']

Related

splitting a text by a capital letter after a small letter, without losing the small letter

I have the following type of strings:
"CanadaUnited States",
"GermanyEnglandSpain"
I want to split them into the countries' names, i.e.:
['Canada', 'United States']
['Germany', 'England', 'Spain']
I have tried using the following regex:
text = "GermanyEnglandSpain"
re.split('[a-z](?=[A-Z])', text)
and I'm getting:
['German', 'Englan', 'Spain']
How can I not lose the last char in every word?
Thanks!
I would use re.findall here:
inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries) # ['Canada', 'United States']
The regex pattern used here says to match:
[A-Z][a-z]+ match a leading uppercase word of a country name
(?: [A-Z][a-z]+)* followed by space and another capital word, 0 or more times
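Applied to both sample strings, the pattern gives:

```python
import re

# Match a capitalized word, optionally followed by more capitalized words
# separated by single spaces.
pattern = r'[A-Z][a-z]+(?: [A-Z][a-z]+)*'
print(re.findall(pattern, "CanadaUnited States"))  # ['Canada', 'United States']
print(re.findall(pattern, "GermanyEnglandSpain"))  # ['Germany', 'England', 'Spain']
```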
My answer is longer than Tim's because I wanted to cover more cases, so you can adapt it as needed. You can shorten it by using lambda functions and combining multiple regexes into one.
Basic flow: add a space before every uppercase letter, replace runs of multiple spaces with *, split on single spaces, then replace each * with a single space.
import re
text = "GermanyUnited StatesEnglandUnited StatesSpain"
text2 = re.sub('([A-Z])', r' \1', text)  # adds a single space before every uppercase letter
print(text2)
# Germany United  States England United  States Spain
text3 = re.sub(r'\s{2,}', '*', text2)  # replaces 2 or more spaces with * so that we can replace later
print(text3)
# Germany United*States England United*States Spain
text4 = re.split(' ', text3)  # splits the text into a list on every single space
print(text4)
# ['', 'Germany', 'United*States', 'England', 'United*States', 'Spain']
text5 = []
for i in text4:
    text5.append(re.sub(r'\*', ' ', i))  # replace every * with a single space
text5 = list(filter(None, text5))  # remove empty elements
print(text5)
# ['Germany', 'United States', 'England', 'United States', 'Spain']
You can use re.split with a capture group like so, but then you will also need to filter out the empty strings between adjacent matches:
import re
text = "GermanyEnglandSpain"
res = re.split('([A-Z][a-z]*)', text)
res = list(filter(None, res))
print(res)
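For reference, the capture group makes re.split keep the delimiters, with empty strings between adjacent matches that the filter then removes:

```python
import re

# With a capture group, re.split interleaves the captured delimiters
# with the (here empty) strings between them.
res = re.split('([A-Z][a-z]*)', "GermanyEnglandSpain")
print(res)  # ['', 'Germany', '', 'England', '', 'Spain', '']
print(list(filter(None, res)))  # ['Germany', 'England', 'Spain']
```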

Remove Numbers and Turn into a List

A clearer way of asking the question is:
If I have a string as follows:
'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
How do I turn that string into:
Palm Beach, Gavea, Maronas, Iowa, Orange Park
So that is: make each item in the list title case (i.e. uppercase first letter, rest lowercase), delete the numbers, and delete the word 'Race'.
I am setting up to export to Excel.
Thanks in advance - Angus
You can do it without importing any library:
races = """PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"""
''.join([ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ','')]).split('xxx')
Output:
['PALM BEACH.', 'Gavea', 'Maronas', 'IOWA', 'ORANGE PARK.', '']
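To get exactly the result the question asks for (title case, no trailing periods, no empty entries), a small cleanup pass on top of that works; a sketch:

```python
races = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
# Same digit-replacement trick, then strip periods, title-case, and
# drop empty pieces.
parts = ''.join(ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ', '')).split('xxx')
cleaned = [p.rstrip('.').title() for p in parts if p]
print(cleaned)  # ['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']
```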
You can use re.split and some string manipulation:
>>> import re
>>> s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
>>> # Split on 'Race' followed by a space and digits
>>> race_names = re.split(r'Race \d+', s)
>>> def format_name(name):
...     # Remove the trailing period on some race names
...     name = name.rstrip('.')
...     # Change the name to title case
...     name = name.title()
...     return name
...
>>> # Format the names and remove any empty entries in the list
>>> race_names = [format_name(name) for name in race_names if name]
>>> race_names
['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']

Iterate and match all elements with regex

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work, I think it's because of my pattern but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
The | in your attempt needs grouping with parentheses; without them, the alternation applies to the whole pattern, not just the separator word. Anyway, it's more complex than it needs to be.
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string: ^
Match an uppercase alpha character: [A-Z]
Match any number of lowercase alpha characters: [a-z]*
Match any number of whitespace characters: \s* (you can match plain spaces instead by using ` *`)
Match the above between one and unlimited times, all captured into capture group 1: (...)+ where ... contains everything above
Match a whitespace character, followed by any character (except newline) any number of times: \s.*
$1: Replace with capture group 1
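In Python, the same substitution can be sketched with re.sub, using \1 in place of $1:

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
# Keep the run of capitalized words at the start, drop the union word
# and everything after it.
result = [re.sub(r'^((?:[A-Z][a-z]*\s*)+)\s.*', r'\1', s) for s in data]
print(result)  # ['Alice Smith', 'Tim', 'Uncle Neo']
```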

Python - split() over many spaces

I followed this answer's (Python: Split by 1 or more occurrences of a delimiter) directions to a T and it keeps failing so I'm wondering if it's something simple I'm missing or if I need a new method to solve this.
I have the following .eml file:
My goal is to eventually parse out all the fish stocks and their corresponding weight amounts, but for a test I'm just using the following code:
with open(file_path) as f:
    for line in f:
        if "Haddock" in line:
            #fish, remainder = re.split(" +", line)
            fish, remainder = line.split()
            print(line.lower().strip())
            print("fish:", fish)
            print("remainder:", remainder)
and it fails on the line fish, remainder = line.split() with the error
ValueError: too many values to unpack (expected 2)
which tells me that Python is failing because it is trying to split on too many spaces, right? Or am I misunderstanding this? I want to get two values back from this process: the name of the fish (a string containing all the text before the many spaces) and the quantity (integer from the right side of the input line).
Any help would be appreciated.
You may use the below regular expression for splitting:
fish, remainder = re.split(r'(?<=\w)\s+(?=\d)', line.strip())
it will split and give `['GB Haddock West', '22572']`
I would like the fish to be GB Haddock West and the remainder to be 22572
You could do something like this:
s = line.split()
fish, remainder = " ".join(s[:-1]), s[-1]
Instead of using split(), you could use rindex() to find the last space and split there.
at = line.rindex(" ")
fish, remainder = line[:at], line[at+1:]
Both will output:
print(fish) # GB Haddock West
print(remainder) # 22572
Yes ... you can split on multiple spaces. However, unless you can specify the number of spaces, you're going to get additional empty fields in the middle, just as you're getting now. For instance:
in_stuff = [
    "GB Haddock West       22572",
    "GB Cod West          7207",
    "GB Haddock East       3776"
]
for line in in_stuff:
    print(line.split("  "))
Output:
['GB Haddock West', '', '', ' 22572']
['GB Cod West', '', '', '', '', '7207']
['GB Haddock East', '', '', ' 3776']
However, a simple change will get what you want: pick off the first and last fields from this:
for line in in_stuff:
    fields = line.split("  ")
    print(fields[0], int(fields[-1]))
Output:
GB Haddock West 22572
GB Cod West 7207
GB Haddock East 3776
Will that solve your problem?
Building upon Vallentin's answer, but using the extended unpacking features of Python 3:
In [8]: line = "GB Haddock West 22572"
In [9]: *fish, remainder = line.split()
In [10]: print(" ".join(fish))
GB Haddock West
In [11]: print(int(remainder))
22572

What is an efficient way to match words in a string?

Example:
names = ['James John', 'Robert David', 'Paul', ...]  # the list has 5K items
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names)  # this returns False; 'James' is not in the list
is_name_in_text(text2, names)  # this returns 'James John'
is_name_in_text(text3, names)  # this returns 'Paul'
is_name_in_text() searches if any of the name list is in text.
The easy way is to just check each name with the in operator, but the list has 5,000 items, so it is not efficient. I could split the text into words and check if the words are in the list, but that is not going to work when a matching name has more than one word. Line 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
...
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re

names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
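A runnable sketch of that approach with the sample names:

```python
import re

names = ['James John', 'Robert David', 'Paul']
# One alternation of word-bounded, escaped names.
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')

m = names_re.search('I saw James John today')
print(m.group() if m else None)  # James John
print(names_re.search('I saw James today'))  # None: 'James' alone is not a listed name
```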
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
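A minimal sketch of that change:

```python
names = ['James John', 'Robert David', 'Paul']  # imagine 5,000 entries
names = set(names)  # membership tests are now O(1) on average

print('James John' in names)  # True
print('James' in names)       # False: only exact full names are members
```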
