split a line of data with a constraint - python

When I split a line of data I get the following result:
>>> s="MS Dhoni cricket captain 10000"
>>> val=s.split()
>>> print(val)
['MS', 'Dhoni', 'cricket', 'captain', '10000']
But I expect the output to look like this:
['MS Dhoni', 'cricket', 'captain', '10000']
Although there is a space at that position, the string must not be split there. How can I modify the code?

This code does what you want:
import re
s="MS Dhoni cricket captain 10000"
print(re.split(r"\s(?=[a-z0-9])", s))
output:
['MS Dhoni', 'cricket', 'captain', '10000']
Explanation: split on whitespace, but only when it is followed by a lowercase letter or a digit. That following character is not consumed by the split, thanks to the (?=...) lookahead construct.
BUT this is cheating: had MS Dhoni been in the middle of the string, it wouldn't have worked. You assume that Python knows how to recognize a title (Mr, ...) or how to group a word made only of capital letters with the next word. That rule exists only in your mind.
It answers your question, but you have to be more specific if you want the answer to be useful for your projects.
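To make the limitation concrete, here is a small sketch that runs the same pattern on a reordered sentence. The second string is made up for illustration and is not from the question.

```python
import re

pattern = r"\s(?=[a-z0-9])"  # whitespace followed by a lowercase letter or a digit

original = "MS Dhoni cricket captain 10000"
reordered = "cricket captain MS Dhoni 10000"  # hypothetical variant of the input

print(re.split(pattern, original))
print(re.split(pattern, reordered))
```

With the reordered input the name is glued to the word before it ('captain MS Dhoni'), because no split happens before an uppercase letter.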

Related

Multiline text splitting

Sooo, I have this problem in which I have to create a list of lists containing every word from each line that has a length greater than 4. The challenge is to solve this with a one-liner.
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
So far I managed this res = [i for ele in text.splitlines() for i in ele.split(' ') if len(i) > 4] but it returns ['candle', 'burns', 'ends;', 'night;', 'foes,', 'friends—', 'gives', 'lovely', 'light!'] instead of [['candle', 'burns', 'ends;'], ['night;'], ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]
Any ideas? :D
So in this case I would use a regular expression to find your results.
Calling re.findall inside a list comprehension over the lines gives one list of matches per line, so the nesting comes for free.
This particular search pattern matches runs of 4 or more word characters (letters of either case, digits, or underscore). Note that it therefore keeps four-letter words such as 'both'; use {5,} if you want only words strictly longer than 4.
import re
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
results = [re.findall(r'\w{4,}', line) for line in text.split('\n')]
print(results)
Output:
[['candle', 'burns', 'both', 'ends'], ['will', 'last', 'night'], ['foes', 'friends'], ['gives', 'lovely', 'light']]
If you wish to keep the special characters you might want to look into expanding the regular expression so it includes all characters except whitespace.
There are great tools to play around with if you look for "online regular expression tools" so you get some more feedback when trying to build your own patterns.
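As a sketch of the "all characters except whitespace" idea, the pattern \S{5,} keeps the punctuation and only accepts words longer than 4 characters. The plain nested comprehension is shown next to it for comparison; the variable names are illustrative.

```python
import re

text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''

# Nested comprehension: one inner list per line, punctuation kept.
by_split = [[w for w in line.split() if len(w) > 4] for line in text.splitlines()]

# Same idea with a regex: runs of 5 or more non-whitespace characters.
by_regex = [re.findall(r'\S{5,}', line) for line in text.splitlines()]

print(by_split)
print(by_regex)
```

Both variants count trailing punctuation as part of the word length, which is what the expected output in the question does as well.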
IIUC, this oneliner should work for you (without the use of additional packages):
[[w.strip(';,!—') for w in l.split() if len(w)>=4] for l in text.split('\n')]
Output:
[['candle', 'burns', 'both', 'ends'],
['will', 'last', 'night'],
['foes', 'friends'],
['gives', 'lovely', 'light']]

Regex formula to find string between two other strings or characters

I am trying to extract some sub-strings from another string, and I have identified patterns that should yield the correct results, however I think there are some small flaws in my implementation.
s = 'Arkansas BaseballMiami (Ohio) at ArkansasFeb 17, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio) starters: 1/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas starters: 8/dh E. Cole; 9/ss J. Biggers; 17/lf L. Bonfield; 33/c G. Koch; 28/cf D. Fletcher; 20/2b C. Shaddy; 24/1b C Spanberger; 15/rf J. Arledge; 6/3b H. Wilson; 16/p B. Knight;Miami (Ohio) 1st - HALL, D. struck out swinging.'
Here is my attempt at regex formulas to achieve my desired outputs:
teams = re.findall(r'(;|[0-9])(.*?) starters', s)
pitchers = re.findall('/p(.*?);', s)
The pitchers search seems to work, however the teams outputs the following:
[('1', '7, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio)'), ('1', '/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas')]
DESIRED OUTPUTS:
['Miami (Ohio)', 'Arkansas']
[' GNETZ', ' B. Knight']
I can worry about stripping out the leading spaces in the pitchers names later.
(;|[0-9]) can be replaced with [;0-9]. Then what I think you're trying to express is "get me the string before starters and immediately after the last number/semicolon that comes before the starters", for which you can say "there must be no other numbers/semicolons in between", i.e.
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)
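A quick check of that pattern, as a sketch on a shortened excerpt of the string from the question (most starters are omitted here to keep it readable). The \s* in the pitchers pattern is an addition that drops the leading space mentioned in the question.

```python
import re

# Shortened excerpt of the original string; same structure, fewer players.
s = ('Arkansas60000010X781Miami (Ohio) starters: 1/lf HALL, D.; 32/p GNETZ;'
     'Arkansas starters: 8/dh E. Cole; 16/p B. Knight;'
     'Miami (Ohio) 1st - HALL, D. struck out swinging.')

# Text between the last digit/semicolon and " starters".
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)

# Text between "/p" and the next semicolon, without leading whitespace.
pitchers = re.findall(r'/p\s*(.*?);', s)

print(teams)
print(pitchers)
```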

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 characters long but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter, then between 2 and 4 characters of any kind (except a newline), then a literal dot. You can tighten that up if required: for example, [A-Z][a-z]{2,4}\. matches an uppercase letter followed by 2 to 4 lowercase letters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
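A sketch of the tightened pattern on part of the text from the question. Anchoring with ^ and re.MULTILINE is an extra assumption here (that the names are speaker labels at the start of a line), and dict.fromkeys is used instead of set so that the first-seen order is kept.

```python
import re

text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

# Capital letter, 2 to 4 lowercase letters, literal dot, at the start of a line.
names = re.findall(r'^[A-Z][a-z]{2,4}\.', text, flags=re.MULTILINE)

# Drop duplicates while keeping the order of first appearance.
unique_names = list(dict.fromkeys(names))

print(names)
print(unique_names)
```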
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # capitalised, ends with a full stop, 3 to 5 characters before the dot
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)

Counting the number of unique words [duplicate]

I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print(len(set(w.lower() for w in text.split())))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print(len(re.findall(r'\w+', text)))  # every word, duplicates included
Using a regular expression makes this very simple. Just make sure that all the characters are lowercased first, then pass the result to set so that duplicate items are dropped.
print(len(set(re.findall(r'\w+', text.lower()))))
You can use a regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think that you have the right idea of using the Python built-in set type.
I think that it can be done if you first remove the '.' by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char = ",.?!'"
for letter in text:
    if letter == '"' or letter in punc_char:
        text = text.replace(letter, '')
text = set(text.lower().split())  # lower() makes the comparison case-insensitive
len(text)
That should work for you. If you need any other punctuation marks handled, you can easily add them to punc_char and they will be filtered out.
Abraham J.
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall(r'\w+', text.lower())  # lowercase so the count is case-insensitive
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way would be to iterate through the words list and use a dictionary to keep track of the number of times you have seen each word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words by
len(cwords)
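The same bookkeeping can be sketched with collections.Counter from the standard library, which replaces the try/except loop. The text is the example from the question.

```python
import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."

# Lowercase first so that "The" and "the" count as the same word.
counts = Counter(re.findall(r'\w+', text.lower()))

print(len(counts))            # number of unique words
print(counts.most_common(1))  # the most frequent word with its count
```

Compared with a plain set, this keeps the per-word counts around in case they are needed later.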

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name               Degree        CLASS     CODE   EDU      Scores
--------------------------------------------------------------------------------------
John Marshall      CSC           78659944  89989  BE       900
Think Code DB I10  MSC           87782     1231   MS       878
Mary 200 Jones     CIVIL         98993483  32985  BE       898
John G. S          Mech          7653      54     MS       65
Silent Ghost       Python Ninja  788505    88448  MS Comp  887
Conditions :
Runs of more than one space should be compressed to a single delimiter (a pipe would be best; the end goal is to store these files in a database).
Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some alphabets.
First and second columns are both strings. They almost always have more than one space between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!).
The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.
Hope I made sense! I have a feeling that this task can be done in a oneliner. I don't want to loop, loop, loop :(
Muchos gracias "Pythonistas" for reading all the way and not quitting before this sentence!
It still seems to me that there's some format in your files:
>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
...     lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
...     print(lst)
[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']
The regex is quite straightforward; the only things you need to pay attention to are the delimiters (\s) and, for the first delimiter, the word breaks (\b). Note that when a line doesn't match, lst is an empty list. That would be a red flag to bring up the user interaction described below. You could also skip the header lines by doing:
>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
... # here empty lst indicates issues with regex
Previous variants:
>>> import re
>>> for line in open(fname):
...     lst = re.split(r'\s{2,}', line)
...     l = len(lst)
...     if l in (2,3):
...         lst[l-1:] = lst[l-1].split()
...     print(lst)
['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
Another thing to do is simply to let the user decide what to do with questionable entries:
if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes for elements of name: ') # use raw_input in py2k
    idegr = input('enter indexes for elements of degree: ')
Uhm, I was under the impression the whole time that the second element might contain spaces. Since that's not the case, you could just do:
>>> for line in open(fname):
...     name, _, rest = line.partition('  ')  # two spaces: the first run of 2+ spaces ends the name
...     lst = [name] + rest.split()
...     print(lst)
Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.
import re
for line in open(fname):
    name, rest = re.split(r'\s{2,}', line, maxsplit=1)
    print([name] + rest.split())
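Building on that idea, here is a minimal sketch that produces the pipe-delimited output the question asks for and flags lines it cannot split instead of raising. The function name to_pipe and the spacing in the sample lines are assumptions, since the exact whitespace of the real files is not known.

```python
import re

def to_pipe(line):
    """Return the line as pipe-delimited text, or None if it needs manual review."""
    parts = re.split(r'\s{2,}', line.strip(), maxsplit=1)
    if len(parts) != 2:
        return None  # no run of 2+ spaces: cannot tell the name from the rest
    name, rest = parts
    return '|'.join([name] + rest.split())

# Hypothetical lines with uneven spacing, plus a separator line.
sample = [
    "John Marshall     CSC   78659944  89989 BE   900",
    "Mary 200 Jones  CIVIL 98993483    32985  BE 898",
    "--------------------------------",
]
converted = [to_pipe(line) for line in sample]
print(converted)
```

Returning None for the separator line (or any line without a double space) leaves room for the manual handling of questionable entries described earlier.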
This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).
Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.
Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().
Docs for expandtabs are here.
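To illustrate the difference, a small sketch with a made-up tab-separated line: replace() inserts a fixed number of spaces, while expandtabs() pads each field to the next tab stop (every 8 columns by default), which is what keeps columns aligned.

```python
line = "John Marshall\tCSC\t78659944"  # hypothetical raw line containing tabs

naive = line.replace('\t', ' ' * 3)  # every tab becomes exactly 3 spaces
aligned = line.expandtabs()          # pads to the next multiple of 8 columns

print(repr(naive))
print(repr(aligned))
print(aligned.index('CSC'), aligned.index('78659944'))
```

In the expanded line every field starts at a multiple of 8, regardless of how long the previous field was.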
If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programmatically (if that is possible), maybe from the heading line.
Are you sure that the second line is all "-", or are there spaces between the columns?
The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:
RecordType  ID1                  ID2         Description
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget
4           87654321             654321      Gizmoid
and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:
sizes = map(len, dash_line.split())
If expandtabs() doesn't work, edit your question to show exactly what you do have, i.e. show the result of print(repr(line)) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.
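The dash-line reader hinted at above could look roughly like the sketch below. It is not the original reader; the function name read_fixed_width is made up, and the sample report is rebuilt from the widths of the dash line in the example (11, 20, 11, 22).

```python
import re

def read_fixed_width(lines):
    """Slice the heading and data lines at the column spans found in the dash line."""
    heading, dash_line, *rows = lines
    # Each run of dashes marks where one column starts and ends.
    spans = [m.span() for m in re.finditer(r'-+', dash_line)]

    def cut(line):
        return [line[start:end].strip() for start, end in spans]

    return cut(heading), [cut(row) for row in rows]

# Sample report, formatted with the column widths from the example.
fmt = "{:<11} {:<20} {:<11} {:<22}"
report = [
    fmt.format("RecordType", "ID1", "ID2", "Description"),
    fmt.format("-" * 11, "-" * 20, "-" * 11, "-" * 22),
    fmt.format("1", "12345678", "123456", "Widget"),
    fmt.format("4", "87654321", "654321", "Gizmoid"),
]
header, records = read_fixed_width(report)
print(header)
print(records)
```

Slicing by position rather than splitting on whitespace means a value that itself contains spaces stays in one column, as long as it fits the column width.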
