Python - split() over many spaces

I followed this answer's (Python: Split by 1 or more occurrences of a delimiter) directions to a T and it keeps failing so I'm wondering if it's something simple I'm missing or if I need a new method to solve this.
I have the following .eml file:
My goal is to eventually parse out all the fish stocks and their corresponding weight amounts, but for a test I'm just using the following code:
with open(file_path) as f:
    for line in f:
        if ("Haddock" in line):
            #fish, remainder = re.split(" +", line)
            fish, remainder = line.split()
            print(line.lower().strip())
            print("fish:", fish)
            print("remainder:", remainder)
and it fails on the line fish, remainder = line.split() with the error
ValueError: too many values to unpack (expected 2)
which tells me that Python is failing because it is trying to split on too many spaces, right? Or am I misunderstanding this? I want to get two values back from this process: the name of the fish (a string containing all the text before the many spaces) and the quantity (integer from the right side of the input line).
Any help would be appreciated.
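To see why the unpacking fails, it may help to look at what split() actually returns. The sample line below is an assumption (the .eml contents are not shown here), built from the values mentioned in the question:

```python
# Hypothetical sample line: fish name, a run of spaces, then the weight.
line = "GB Haddock West          22572\n"

# With no arguments, split() breaks on every run of whitespace,
# so the three-word name becomes three separate items.
parts = line.split()
print(parts)
print(len(parts))  # more than the 2 names on the left-hand side

# Splitting once from the right keeps the name together.
fish, remainder = line.rsplit(None, 1)
print(repr(fish), repr(remainder))
```

So the error is not about "too many spaces" as such: the name itself contains spaces, and each of its words becomes a separate value.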

You can use the regular expression below to split the line:
fish, remainder = re.split(r'(?<=\w)\s+(?=\d)', line.strip())
It will split the line and give `['GB Haddock West', '22572']`.
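As a rough, self-contained check of that pattern, here is a sketch using an assumed sample line (the real file contents are not shown). It also illustrates one caveat of splitting before any digit:

```python
import re

PATTERN = r'(?<=\w)\s+(?=\d)'  # whitespace after a word char, before a digit

line = "GB Haddock West          22572\n"  # hypothetical sample line
fish, remainder = re.split(PATTERN, line.strip())
print(fish, int(remainder))

# Caveat: a name containing a word that starts with a digit splits there too.
print(re.split(PATTERN, "Area 5 Cod          100"))
```

If stock names can contain numbers, splitting once from the right is probably the safer choice.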

I would like the fish to be GB Haddock West and the remainder to be 22572
You could do something like this:
s = line.split()
fish, remainder = " ".join(s[:-1]), s[-1]
Instead of using split() you could utilize rindex() and find the last space and split between there.
at = line.rindex(" ")
fish, remainder = line[:at], line[at+1:]
Both will output:
print(fish) # GB Haddock West
print(remainder) # 22572

Yes ... you can split on multiple spaces. However, unless you can specify the number of spaces, you're going to get additional empty fields in the middle, just as you're getting now. For instance:
in_stuff = [
    "GB Haddock West          22572",
    "GB Cod West               7207",
    "GB Haddock East          3776"
]
for line in in_stuff:
    print(line.split("   "))
Output:
['GB Haddock West', '', '', ' 22572']
['GB Cod West', '', '', '', '', '7207']
['GB Haddock East', '', '', ' 3776']
However, a simple change will get what you want: pick off the first and last fields from this:
for line in in_stuff:
    fields = line.split("   ")
    print(fields[0], int(fields[-1]))
Output:
GB Haddock West 22572
GB Cod West 7207
GB Haddock East 3776
Will that solve your problem?

Building upon @Vallentin's answer, but using the extended unpacking features of Python 3:
In [8]: line = "GB Haddock West 22572"
In [9]: *fish, remainder = line.split()
In [10]: print(" ".join(fish))
GB Haddock West
In [11]: print(int(remainder))
22572

Related

Python regex to match Ledger/hledger account journal entry

I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+) which does not allow for any number of sub-accounts and possible spaces. I don't think I want to have OR operators in there either, as this is going to be concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.
You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
import re
s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
    categories, price = mat.groups()
    categories = categories.split(':')
Now categories will be a list containing the categories, and price a string with the price. For your sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'
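Note that the pattern above excludes whitespace from account names, so it would not match the "food and wine" example. A possible variation, offered only as a sketch (the names LEDGER_RE and parse_posting are made up for illustration), anchors on the trailing amount instead and splits the account on colons afterwards:

```python
import re

# Hypothetical pattern: everything up to the last non-space character
# before the amount is the account; the amount is digits.digits.
LEDGER_RE = re.compile(r'^\s*(.*\S)\s+(\d+\.\d+)\s*$')

def parse_posting(line):
    """Return (list of sub-accounts, amount string), or None if no match."""
    m = LEDGER_RE.match(line)
    if m is None:
        return None
    account, amount = m.groups()
    return [part.strip() for part in account.split(':')], amount

print(parse_posting("expenses:food:food and wine      20.99"))
print(parse_posting("expenses:food:wine:speciality 19.99"))
```

This assumes every posting ends in an amount with a decimal point; lines without one return None.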
You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
    entries = line.split(":")  # first split all the entries
    last = entries.pop()  # take the last entry
    return entries + last.rsplit(" ", 1)  # split on last space and return all together
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
    entries = [e.strip() for e in line.split(":")]
    last = entries.pop()
    return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']

Python - split() producing ValueError

I am trying to split the line:
American plaice - 11,000 lbs # 35 cents or trade for SNE stocks
at the word or but I receive ValueError: not enough values to unpack (expected 2, got 1).
Which doesn't make sense, if I split the sentence at or then that will indeed leave 2 sides, not 1.
Here's my code:
if ('-' in line) and ('lbs' in line):
    fish, remainder = line.split('-')
    if 'trade' in remainder:
        weight, price = remainder.split('to ')
        weight, price = remainder.split('or')
The 'to' line is what I normally use, and it has worked fine, but this new line appeared without a 'to' but instead an 'or' so I tried writing one line that would tackle either condition but couldn't figure it out so I simply wrote a second and am now running into the error listed above.
Any help is appreciated, thanks.
The most straightforward way is probably to use a regular expression to do the split. Then you can split on either word, whichever appears. The ?: inside the parentheses makes the group non-capturing so that the matched word doesn't appear in the output.
import re
# ...
weight, price = re.split(" (?:or|to) ", remainder, maxsplit=1)
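As a quick illustration of that call, here is a sketch wrapped in a helper (the name split_weight_price is made up) that also copes with lines where neither word appears:

```python
import re

def split_weight_price(remainder):
    """Split on the first standalone ' or ' / ' to '; return None if absent."""
    parts = re.split(r" (?:or|to) ", remainder, maxsplit=1)
    if len(parts) != 2:
        return None
    weight, price = (p.strip() for p in parts)
    return weight, price

print(split_weight_price(" 11,000 lbs # 35 cents or trade for SNE stocks"))
print(split_weight_price(" 500 lbs"))
```

Because the pattern requires a space on both sides, the "or" inside "for" is not treated as a separator.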
You split on 'to ' before you attempt to split on 'or', which is throwing the error. The return value of remainder.split('to ') is [' 11,000 lbs # 35 cents or trade for SNE stocks'] which cannot be unpacked to two separate values. You can fix this by testing for which word you need to split on first.
if ('-' in line) and ('lbs' in line):
    fish, remainder = line.split('-')
    if 'trade' in remainder:
        if 'to ' in remainder:
            weight, price = remainder.split('to ')
        elif ' or ' in remainder:
            weight, price = remainder.split(' or ')  # add spaces so we don't match 'for'
This should solve your problem by checking if your separator is in the string first.
Also note that split(sep, 1) makes sure that your string will be split at most once (e.g. "hello all world".split(" ", 1) == ["hello", "all world"]).
if ('-' in line) and ('lbs' in line):
    fish, remainder = line.split('-')
    if 'trade' in remainder:
        weight, price = remainder.split(' to ', 1) if ' to ' in remainder else remainder.split(' or ', 1)
The problem is that the word "for" also contains an "or", so you will end up with the following:
a = 'American plaice - 11,000 lbs # 35 cents or trade for SNE stocks'
a.split('or')
gives
['American plaice - 11,000 lbs # 35 cents ', ' trade f', ' SNE stocks']
Stephen Rauch's answer does fix the problem.
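Another way to avoid matching the "or" inside "for" is a regex word boundary. This is only a sketch of that alternative, using the same sample string:

```python
import re

a = 'American plaice - 11,000 lbs # 35 cents or trade for SNE stocks'

# \b matches only at word edges, so the "or" inside "for" is skipped.
parts = re.split(r'\s*\bor\b\s*', a, maxsplit=1)
print(parts)
```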
Once you have done the split(), you have a list, not a string, so you cannot call split() on it again. And if you just copy the line, then you will overwrite your other results. You can instead normalize the separator in the string first and then split once (the surrounding spaces keep "for" from matching):
weight, price = remainder.replace(' or ', ' to ').split(' to ', 1)

Cut off a middle word from a string in Python

I am trying to cut off a few words from the scraped data.
3 Bedroom, Residential Apartment in Velachery
There are many rows of data like this. I am trying to remove the word 'Bedroom' from the string. I am using beautiful soup and python to scrape the webpage, and here I am using this
for eachproperty in properties:
    print(eachproperty.string[2:])
I know what the above code will do. But I cannot figure out how to just remove the "Bedroom" which is between 3 and ,Residen....
>>> import re
>>> strs = "3 Bedroom, Residential Apartment in Velachery"
>>> re.sub(r'\s*Bedroom\s*', '', strs)
'3, Residential Apartment in Velachery'
or:
>>> strs.replace(' Bedroom', '')
'3, Residential Apartment in Velachery'
Note that strings are immutable, so you need to assign the result of re.sub and str.replace to a variable.
What you need is the replace method:
line = "3 Bedroom, Residential Apartment in Velachery"
line = line.replace("Bedroom", "")
# For multiple lines, build a new list (rebinding the loop variable alone would not change the list)
lines = [line.replace("Bedroom", "") for line in lines]
A quick answer is
k = input_string.split()
if "Bedroom" in k:
    k.remove("Bedroom")
answer = ' '.join(k)
This won't handle punctuation like in your question. To do that you need
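rem = "Bedroom"
answer = input_string  # unchanged if rem is not found
for i in range(len(input_string) - len(rem) + 1):
    if input_string[i:i+len(rem)] == rem:
        # keep everything before and after the first occurrence
        answer = input_string[:i] + input_string[i+len(rem):]
        break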
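For comparison, here is a sketch of a whole-word removal that copes with the trailing comma. The helper name remove_word is made up for illustration:

```python
import re

def remove_word(text, word):
    # \b keeps the match to whole words; \s* also drops the space before it.
    return re.sub(r'\s*\b' + re.escape(word) + r'\b', '', text)

s = "3 Bedroom, Residential Apartment in Velachery"
print(remove_word(s, "Bedroom"))
print(remove_word("3 Bedrooms, Villa", "Bedroom"))  # left unchanged: not a whole word
```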

How do I split these strings into arrays of strings?

I have several strings with phrases or words separated by multiple spaces.
c1 = "St. Louis       12             Cardinals"
c2 = "Boston          16             Red Sox"
c3 = "New York        13             Yankees"
How do I write a function perhaps using the python split(" ") function to separate each line into an array of strings? For instance, c1 would go to ['St. Louis', '12', 'Cardinals'].
Calling split(" ") and then trimming the component entities won't work because some entities such as St. Louis or Red Sox have spaces in them.
However, I do know that all entities are at least 2 spaces apart and that no entity has 2 spaces within it. By the way, I actually have around 100 cities to deal with, not 3. Thanks!
Without regular expressions:
c1 = "St. Louis       12             Cardinals"
words = [w.strip() for w in c1.split('  ') if w]
# words == ['St. Louis', '12', 'Cardinals']
import re
re.split(r' {2,}', c1)
re.split(r' {2,}', c2)
re.split(r' {2,}', c3)
You can use re.split
>>> re.split(r'\s{2,}', 'St. Louis       12             Cardinals')
['St. Louis', '12', 'Cardinals']
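Applied to many rows, the same idea might look like the sketch below. The column spacing in the sample strings is an assumption; any gap of two or more spaces would behave the same way:

```python
import re

def split_columns(line):
    # Two or more spaces separate fields; single spaces stay inside a field.
    return re.split(r' {2,}', line.strip())

rows = [
    "St. Louis       12             Cardinals",
    "Boston          16             Red Sox",
    "New York        13             Yankees",
]
for row in rows:
    print(split_columns(row))
```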
You could do this with regular expressions:
import re
# the last group is greedy; a lazy (.*?) at the end would match an empty string
blahRegex = re.compile(r'(.*?)\s+(\d+)\s+(.*)')
with open('filename') as f:
    for line in f:
        m = blahRegex.match(line)
        if m is not None:
            city = m.group(1)
            rank = m.group(2)
            team = m.group(3).strip()
There are a lot of ways to skin that cat: you could use named groups, or make your regular expression tighter. But this should do it.
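A named-group version of that idea might look like this sketch. The group names and the two-space requirement are assumptions based on the question's description:

```python
import re

# Fields are separated by at least two spaces; the middle field is numeric.
ROW_RE = re.compile(r'^(?P<city>.+?)\s{2,}(?P<rank>\d+)\s{2,}(?P<team>.+?)\s*$')

m = ROW_RE.match("Boston          16             Red Sox\n")
if m is not None:
    print(m.group('city'), m.group('rank'), m.group('team'))
    print(m.groupdict())
```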
It looks like that content is fixed-width. If that is always the case and assuming those are spaces and not tabs, then you can always reverse it using slices:
split_fields = lambda s: [s[:16].strip(), s[16:31].strip(), s[31:].strip()]
or:
def split_fields(s):
    return [s[:16].strip(), s[16:31].strip(), s[31:].strip()]
Example usage:
>>> split_fields(c1)
['St. Louis', '12', 'Cardinals']
>>> split_fields(c2)
['Boston', '16', 'Red Sox']
>>> split_fields(c3)
['New York', '13', 'Yankees']

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name Degree CLASS CODE EDU Scores
--------------------------------------------------------------------------------------
John Marshall CSC 78659944 89989 BE 900
Think Code DB I10 MSC 87782 1231 MS 878
Mary 200 Jones CIVIL 98993483 32985 BE 898
John G. S Mech 7653 54 MS 65
Silent Ghost Python Ninja 788505 88448 MS Comp 887
Conditions :
More than one space should be compressed to a delimiter (is a pipe better? The end goal is to store these files in a database).
Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some alphabets.
The first and second columns are both strings. They almost always have more than one space between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!)
The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.
Hope I made sense! I have a feeling that this task can be done in a oneliner. I don't want to loop, loop, loop :(
Muchas gracias "Pythonistas" for reading all the way and not quitting before this sentence!
It still seems to me that there's some format in your files:
>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
...     lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
...     print(lst)
[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']
The regex is quite straightforward; the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in the case of the first delimiter. Note that when a line doesn't match, you get an empty list as lst. That would be a red flag to bring up the user interaction described below. You could also skip the header lines by doing:
>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
... # here empty lst indicates issues with regex
Previous variants:
>>> import re
>>> for line in open(fname):
...     lst = re.split(r'\s{2,}', line)
...     l = len(lst)
...     if l in (2,3):
...         lst[l-1:] = lst[l-1].split()
...     print(lst)
['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
Another thing to do is simply to let the user decide what to do with questionable entries:
if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes for the elements of name: ')  # use raw_input in py2k
    idegr = input('enter indexes for the elements of degree: ')
Uhm, I was under the impression all along that the second element might contain spaces. Since that's not the case, you could just do:
>>> for line in open(fname):
...     name, _, rest = line.partition('  ')
...     lst = [name] + rest.split()
...     print(lst)
Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.
import re
for line in open(fname):
    name, rest = re.split(r'\s{2,}', line, maxsplit=1)
    print([name] + rest.split())
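Since the stated goal is a pipe-delimited result, the same two-step split could be wrapped up as in this sketch. The function name to_pipe_delimited and the sample spacing are assumptions, and lines without a wide gap simply fall back to a plain whitespace split:

```python
import re

def to_pipe_delimited(line):
    # Step 1: peel off the name at the first run of 2+ whitespace characters.
    parts = re.split(r'\s{2,}', line.strip(), maxsplit=1)
    if len(parts) == 1:
        # No wide gap found (e.g. the dashed rule): split on any whitespace.
        return '|'.join(line.split())
    name, rest = parts
    # Step 2: the remaining columns contain no spaces, so split() is enough.
    return '|'.join([name] + rest.split())

print(to_pipe_delimited("John Marshall        CSC   78659944  89989  BE   900\n"))
print(to_pipe_delimited("Mary 200 Jones   CIVIL  98993483  32985  BE  898\n"))
```

A column value that itself contains a space (such as "MS Comp" in the sample) would be split in two, which matches the stated condition that only the first column may contain spaces.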
This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).
Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.
Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().
Docs for expandtabs are here.
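To illustrate the difference, here is a small sketch with a made-up tab-separated row. A fixed replacement ignores column positions, while expandtabs() pads each tab to the next tab stop (every 8 columns by default):

```python
line = "John\tCSC\t900"  # hypothetical tab-separated row

fixed = line.replace('\t', ' ' * 3)  # every tab becomes exactly 3 spaces
expanded = line.expandtabs()         # every tab pads to the next multiple of 8

print(repr(fixed))
print(repr(expanded))
print(expanded.index('CSC'), expanded.index('900'))  # column offsets
```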
If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programmatically (if that is possible), maybe from the heading line.
Are you sure that the second line is all "-", or are there spaces between the columns?
The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:
RecordType  ID1                  ID2         Description
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget
4           87654321             654321      Gizmoid
and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:
sizes = map(len, dash_line.split())
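Expanding on that hint, a general reader could be sketched as follows. The function names and the exact spacing of the sample report are assumptions; the idea is only that each run of dashes gives the slice bounds for one column:

```python
import re

def column_spans(dash_line):
    # Each run of dashes marks one column; span() gives its slice bounds.
    return [m.span() for m in re.finditer(r'-+', dash_line)]

def read_report(lines):
    header, dash_line, *rows = lines
    spans = column_spans(dash_line)

    def cut(text):
        return [text[start:end].strip() for start, end in spans]

    return cut(header), [cut(row) for row in rows]

report = [
    "RecordType  ID1                  ID2         Description",
    "----------- -------------------- ----------- ----------------------",
    "1           12345678             123456      Widget",
    "4           87654321             654321      Gizmoid",
]
names, records = read_report(report)
print(names)
print(records)
```

This only works when the dashed line really does have gaps between columns, which is why the question above about the second line matters.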
If expandtabs() doesn't work, edit your question to show exactly what you do have, i.e. show the result of print(repr(line)) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.
