How can I generate a rolling metric like this in Pandas - python

I have a dataframe that initially contains two columns: Home, which is 1 if a game was played at home and 0 otherwise, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate it as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
df=pd.DataFrame({'Home':[1,0,1,0,1,0,1,0], 'PTS':[11, 10, 12, 11, 13, 12, 14, 12]})
df.loc[df['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[df['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']

Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop, do the math on each slice and store it in a list, then assign the list to a new column of the DataFrame (using your df):
sens = []
for i in range(len(df)):
    d = df[:i]  # note: this slice excludes row i itself
    mean_pts = d.PTS.mean()
    home = d[d.Home == 1].PTS.mean()
    away = d[d.Home == 0].PTS.mean()
    #print(home, away, (home - away) / mean_pts)
    sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
Home PTS sens
0 1 11 NaN
1 0 10 NaN
2 1 12 0.095238
3 0 11 0.136364
4 1 13 0.090909
5 0 12 0.131579
6 1 14 0.086957
7 0 12 0.126506
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
To do the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also home and away only have four rows and mean_pts has eight.
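One way to do that manipulation (a sketch): reindex the sparse home and away series onto the full index and forward-fill, so every row carries the most recent expanding average, then apply the formula from the question:
home_full = home.reindex(df.index).ffill()
away_full = away.reindex(df.index).ffill()
df['Metric'] = (home_full - away_full) / mean_pts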
I tried .expanding(1).apply() with the following function and didn't get what I expected. expanding() doesn't pass both columns to the function at once; it appears to pass one column, then the other. So I punted...
def f(thing):
    print(thing, '***')
    return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***
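For what it's worth, here is a sketch of a more direct route built on Series.where: masking the opposite venue as NaN lets the expanding mean simply skip those games, which reproduces the question's Metric column without any helper columns:
home_avg = df['PTS'].where(df['Home'] == 1).expanding().mean()
away_avg = df['PTS'].where(df['Home'] == 0).expanding().mean()
overall = df['PTS'].expanding().mean()
df['Metric'] = (home_avg - away_avg) / overall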


Np Array values are being changed without doing stuff

Why does this code:
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
return:
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
Second Print
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and I can't figure it out. Why would the values change when there is no code that touches the matrix between the three print statements?
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    for i in range(len(AnalyzedMatrix)):         # each row
        for j in range(len(AnalyzedMatrix[i])):  # each column
            if j > 0:
                if AnalyzedMatrix[i][j] > 50:
                    if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                        AnalyzedMatrix[i][j] = 2
                    else:
                        AnalyzedMatrix[i][j] = 1
                else:
                    if AnalyzedMatrix[i][j] < 50:
                        if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                            AnalyzedMatrix[i][j] = 0
                        else:
                            AnalyzedMatrix[i][j] = -1
    return AnalyzedMatrix
The input array is:
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that it calls the function MatrixAnalytics again, but I don't understand why.
Doing this works:
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
I think I got the issue.
In this code:
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    ...
    return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix; it references the same object in memory!
So on the first call of MatrixAnalytics, you are actually modifying the object behind the reference given as the argument (because numpy arrays are mutable).
In the second call, you are passing the same reference as the argument, so the array behind it has already been modified.
Note: the return AnalyzedMatrix statement just returns a new reference to the object referenced by the DataMatrix argument (not a copy).
Try to replace this line:
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics):
AnalyzedMatrix = np.copy(DataMatrix)
For more info:
mutable vs immutable
numpy.delete()
numpy.copy()
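A minimal demonstration of the aliasing described above (plain numpy; the variable names are just illustrative):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a             # b is just another name for the same array
b[0] = 99.0
print(a[0])       # 99.0 -- a changed too

c = np.copy(a)    # c is an independent copy
c[1] = -1.0
print(a[1])       # 2.0 -- a is untouched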
I believe you want the same output in both cases. The catch is that MatrixAnalytics performs its changes in the array itself: it mutates the array passed to it (np.delete, by contrast, returns a new array and leaves its input alone). So the first line, print(np.delete(MatrixAnalytics(Cmp),[0],1)), already overwrote Cmp, and the second call saw the modified values. Never call a mutating function inside a print statement and expect the input to survive; either call it once during an assignment and print the stored result, or make the function copy its input as shown above.

How to extract a DataFrame to obtain a nested array?

I have a sample DataFrame as below:
The first column consists of 2 years; for each year, 2 tracks exist, and each track includes pairs of longitude and latitude coordinates. How can I extract every track for each year separately, to obtain an array of tracks with lat and long?
df = pd.DataFrame(
    {'year': [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
     'track_number': [0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1],
     'lat': [11.7,11.8,11.9,11.9,12.0,12.1,12.2,12.2,12.3,12.3,12.4,12.5,12.6,12.6,12.7,12.8],
     'long': [-83.68,-83.69,-83.70,-83.71,-83.71,-83.73,-83.74,-83.75,-83.76,-83.77,-83.78,-83.79,-83.80,-83.81,-83.82,-83.83]})
You can groupby year and then extract a numpy.array from the created dataframes with .to_numpy().
>>> years = []
>>> for _, df2 in df.groupby(["year"]):
...     years.append(df2.to_numpy()[:, 1:])
>>> years[0]
array([[ 0. , 11.7 , -83.68],
[ 0. , 11.8 , -83.69],
[ 0. , 11.9 , -83.7 ],
[ 0. , 11.9 , -83.71],
[ 1. , 12. , -83.71],
[ 1. , 12.1 , -83.73],
[ 1. , 12.2 , -83.74],
[ 1. , 12.2 , -83.75]])
>>> years[1]
array([[ 0. , 12.3 , -83.76],
[ 0. , 12.3 , -83.77],
[ 0. , 12.4 , -83.78],
[ 0. , 12.5 , -83.79],
[ 1. , 12.6 , -83.8 ],
[ 1. , 12.6 , -83.81],
[ 1. , 12.7 , -83.82],
[ 1. , 12.8 , -83.83]])
years[0] holds the desired information for year 0, and so on. Inside each array, the column positions of the original dataframe are preserved: the first element is the track number; the second, the latitude; and the third, the longitude.
If you wish to do the same per track, i.e. have arrays of only latitude and longitude, you can groupby(["year", "track_number"]) as well, as sketched below.
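For example, a sketch of that nested variant, collecting one (n, 2) array of [lat, long] per (year, track_number) pair into a dict:
>>> tracks = {key: g[['lat', 'long']].to_numpy()
...           for key, g in df.groupby(['year', 'track_number'])}
>>> tracks[(0, 1)]
array([[ 12.  , -83.71],
       [ 12.1 , -83.73],
       [ 12.2 , -83.74],
       [ 12.2 , -83.75]])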

Vectorized equivalent of dict.get

I'm looking for functionality that operates like this:
lookup_dict = {5:1.0, 12:2.0, 39:2.0...}
# this is the missing magic:
lookup = vectorized_dict(lookup_dict)
x = numpy.array([5.0, 59.39, 39.49...])
xbins = numpy.trunc(x).astype(numpy.int_)
y = lookup.get(xbins, 0.0)
# the idea is that we get this as the postcondition:
for (result, input) in zip(y, xbins):
    assert(result == lookup_dict.get(input, 0.0))
Is there some flavor of sparse array in numpy (or scipy) that gets at this kind of functionality?
The full context is that I'm binning some samples of a 1-D feature.
As far as I know, numpy does not support different data types in the same array structure, but you can achieve a similar result if you are willing to separate the keys from the values and maintain the keys (and corresponding values) in sorted order:
import numpy as np
keys = np.array([5,12,39])
values = np.array([1.0, 2.0, 2.0])
valueOf5 = values[keys.searchsorted(5)] # 1.0
k = np.array([5,5,12,39,12])
values[keys.searchsorted(k)] # array([1., 1., 2., 2., 2.])
This may not be as efficient as a hash lookup, but it does support indexing from arrays with any number of dimensions.
Note that this assumes your keys are always present in the keys array; if not, rather than raising an error, you could silently get the value belonging to the next key up.
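If absent keys should instead fall back to a real default, a small sketch (the lookup helper is purely illustrative, not an existing numpy API):
keys = np.array([5, 12, 39])       # must stay sorted
values = np.array([1.0, 2.0, 2.0])

def lookup(k, default=0.0):
    # Vectorized dict.get: clip keeps indices in bounds, and the mask
    # checks whether each queried key actually exists.
    idx = np.clip(keys.searchsorted(k), 0, len(keys) - 1)
    return np.where(keys[idx] == k, values[idx], default)

lookup(np.array([5, 7, 39, 100]))  # array([1., 0., 2., 0.])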
Using np.select with boolean masks over the array ([xbins == k for k in lookup_dict]), the values from the dict (lookup_dict.values()), and a default value of 0:
y = np.select(
    [xbins == k for k in lookup_dict],
    lookup_dict.values(),
    0.0
)
# In [17]: y
# Out[17]: array([1., 0., 2.])
This assumes the dictionary keeps a stable iteration order; I'm not sure what the behaviour would be below Python 3.6.
OR overkill with pandas:
import pandas as pd
s = pd.Series(xbins)
s = s.map(lookup_dict).fillna(0)
Another approach is to use searchsorted to search a numpy array which holds the integer 'keys', returning the loaded value for the range n <= x < n+1. This may be useful to somebody asking a similar question in the future.
import numpy as np
class NpIntDict:
    """ Class to simulate a python dict get for a numpy array. """

    def __init__(self, dict_in, default=np.nan):
        """ dict_in: a dictionary with integer keys.
            default: the value to be returned for keys not in the dictionary.
                     Defaults to np.nan.
                     default must be consistent with the dtype of the values.
        """
        # Create a list of dict items sorted by key.
        list_in = sorted(dict_in.items())
        # Create three empty lists.
        key_list = []
        val_list = []
        is_def_mask = []
        for key, value in list_in:
            key = int(key)
            if key not in key_list:  # key not yet in the key list
                # Update the three lists for key as default.
                key_list.append(key)
                val_list.append(default)
                is_def_mask.append(True)
            # Update the lists for key+1. With searchsorted this gives the required results.
            key_list.append(key + 1)
            val_list.append(value)
            is_def_mask.append(False)
        # Add the entry for keys > max(key) to the val and is_def_mask lists.
        val_list.append(default)
        is_def_mask.append(True)
        self.keys = np.array(key_list, dtype=int)  # plain int; np.int is deprecated
        self.values = np.array(val_list)
        self.default_mask = np.array(is_def_mask)

    def set_default(self, default=0):
        """ Set the default to a new value, using self.default_mask.
            Changes the default value for all future self.get(arr) calls.
        """
        self.values[self.default_mask] = default

    def get(self, arr, default=None):
        """ Return an array looking up the values of arr in the dict.
            default can be used to change the default value returned for this get only.
        """
        if default is None:
            values = self.values
        else:
            values = self.values.copy()
            values[self.default_mask] = default
        return values[np.searchsorted(self.keys, arr, side='right')]
        # side='right' to ensure key[ix] <= x < key[ix+1]
        # side='left' would mean key[ix] < x <= key[ix+1]
This could be simplified if there's no requirement to change the default returned after the NpIntDict is created.
To test it:
d = { 2: 5.1, 3: 10.2, 5: 47.1, 8: -6}
# x <2 Return default
# 2 <= x <3 return 5.1
# 3 <= x < 4 return 10.2
# 4 <= x < 5 return default
# 5 <= x < 6 return 47.1
# 6 <= x < 8 return default
# 8 <= x < 9 return -6.
# 9 <= x return default
test = NpIntDict( d, default = 0.0 )
arr = np.arange( 0., 100. ).reshape(10,10)/10
print( arr )
"""
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
[1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
[2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]
[3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9]
[4. 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9]
[5. 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9]
[6. 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9]
[7. 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9]
[8. 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9]
[9. 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]]
"""
print( test.get( arr ) )
"""
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1]
[10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[-6. -6. -6. -6. -6. -6. -6. -6. -6. -6. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
"""
This could be amended to raise an exception if any of the arr elements aren't in the key list. For me returning a default would be more useful.

Combine two numpy arrays and convert them into a dataframe

I have two DataFrames (X & y) sliced off the main dataframe df as below:
X = df.loc[:, df.columns != 'Class']  # .ix is long deprecated; .loc does the same here
y = df.loc[:, df.columns == 'Class']
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_resampled, y_resampled = sm.fit_sample(X, y.values.ravel())  # fit_resample in newer imblearn
The last line returns numpy arrays for X_resampled and y_resampled, so I would like to know how to convert X_resampled and y_resampled back into a dataframe.
Example data:
X_resampled, dimensions (2, 30): 2 rows, 30 columns
array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
y_resampled, dimensions (2,): corresponding to the two rows of X_resampled.
array([0, 0], dtype=int64)
I believe you need numpy.hstack:
a = np.array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
b = np.array([0, 100])
c = pd.DataFrame(np.hstack((a,b[:, None])))
print (c)
0 1 2 3 4 5 6 7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
8 9 ... 21 22 23 24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
25 26 27 28 29 30
0 0.128539 -0.189115 0.133558 -0.021053 0.244964 0.0
1 0.167170 0.125895 -0.008983 0.014724 -0.342475 100.0
[2 rows x 31 columns]
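If the original column names should survive the round trip, a hedged variant (assuming the X and y frames from before the resampling are still in scope):
X_res_df = pd.DataFrame(X_resampled, columns=X.columns)
y_res_df = pd.DataFrame(y_resampled, columns=y.columns)
df_res = pd.concat([X_res_df, y_res_df], axis=1)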

How to construct regex for this text [closed]

Here's the input:
7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.
And here's expected output:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
I've tried this:
re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)
But of course it's not the right solution: the regex has to match (\d+) 1. but not PRub. 1 1..
What should I do to make it work?
How about this:
In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'
In [2]: import re
In [3]: re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]:
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
'2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO',
'3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO',
'4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']
For your exact output I'd do something like:
In [4]: ns = re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
There might be a better way to do this, but my Python foo isn't as good as my regexp foo.
Regexplanation:
(?<=\s)   # positive look-behind: require a leading space but don't include it
\d        # match a digit
.*?       # match everything up till the next record (lazy)
# The following positive look-ahead is the key. It matches the start of
# each new record, i.e.
#   2 1. S
#   3 1. S
#   4 1. Q
#   0 1.$
# Look-arounds match but don't consume what they match.
(?=\s\d\s\d[.](?=$|\s[A-Z]))
(?=             # positive look-ahead 1
    \s          # space
    \d          # digit
    \s          # space
    \d          # digit
    [.]         # period
    (?=         # positive look-ahead 2
        $       # end of string
        |       # OR
        \s[A-Z] # space followed by an uppercase letter
    )           # close look-ahead 2
)               # close look-ahead 1
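As a final note, the splitting step can be folded into the regex itself by adding two capture groups, so findall returns the (number, body) tuples directly (the same pattern, just grouped; under Python 3 the ZŁ also prints as text rather than escaped bytes):
In [6]: re.findall(r'(?<=\s)(\d)\s(.*?)(?=\s\d\s\d[.](?=$|\s[A-Z]))', s)
Out[6]:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
 ...]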
