How to construct regex for this text [closed] - python

Here's the input:
7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.
And here's expected output:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
I've tried this:
re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)
But of course it's not the right solution - the regex has to match "(\d+) 1." but not "PRub. 1 1.".
What should I do to make it work?

How about this:
In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'
In [2]: import re
In [3]: re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]:
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
'2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']
For your exact output I'd do something like:
In [4]: ns = re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
There might be a better way to do this, but my Python foo isn't as good as my regexp foo.
Regexplanation:
(?<=\s) # Use positive look-behind to match a leading space but don't include it
\d # match digit
.*? # Match everything up till the next record (lazy)
# The following positive look-aheads are the key. They match the start of
# each new record, i.e.
# 2 1. S
# 3 1. S
# 4 1. Q
# 0 1.$
# look-arounds match but don't consume characters.
(?=\s\d\s\d[.](?=$|\s[A-Z]))
(?= # positive look-ahead 1
\s # space
\d # digit
\s # space
\d # digit
[.] # period
(?= # positive look-ahead 2
$ # end of string
| # OR
\s[A-Z] # space followed by uppercase letter
) # close look-ahead 2
) # close look-ahead 1
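If you want the tuples directly out of findall, you can also capture the record number and the body as separate groups - a small variation on the same pattern that should match identically (a sketch, not re-tested against the full data):
In [6]: re.findall(r'(?<=\s)(\d)\s(.*?)(?=\s\d\s\d[.](?=$|\s[A-Z]))', s)
findall returns tuples whenever a pattern has more than one group, so the split step becomes unnecessary.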

Related

NumPy array values are being changed without doing anything

Why does this:
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
return:
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
SecondPrint
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and I can't figure it out. Why would the values change without any line of code between the prints?
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    for i in range(len(AnalyzedMatrix)):        # browse each row
        for j in range(len(AnalyzedMatrix[i])): # browse each column
            if j > 0:
                if AnalyzedMatrix[i][j] > 50:
                    if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                        AnalyzedMatrix[i][j] = 2
                    else:
                        AnalyzedMatrix[i][j] = 1
                else:
                    if AnalyzedMatrix[i][j] < 50:
                        if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                            AnalyzedMatrix[i][j] = 0
                        else:
                            AnalyzedMatrix[i][j] = -1
    return AnalyzedMatrix
The input array is :
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that it calls the function MatrixAnalytics again, but I don't understand why.
Doing this works:
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
I think I got the issue.
In this code:
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    ...
    return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix; it references the same object in memory!
So on the first call of MatrixAnalytics, you are actually modifying the object behind the reference given as argument (because NumPy arrays are mutable).
In the second call, you are passing the same reference as argument, so the array behind it has already been modified.
Note: the return AnalyzedMatrix statement just returns a new reference to the object referenced by the DataMatrix argument (not a copy).
Try to replace this line:
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics):
AnalyzedMatrix = np.copy(DataMatrix)
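To see the aliasing at work, here is a minimal standalone sketch with a toy array:
import numpy as np

a = np.array([1.0, 2.0])
alias = a           # same object in memory
copy = np.copy(a)   # independent buffer
alias[0] = 99.0
print(a)      # [99.  2.] - modified through the alias
print(copy)   # [1. 2.] - unaffected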
For more info:
mutable vs immutable
numpy.delete()
numpy.copy()
I believe you want the same output in both cases.
The thing is, the call MatrixAnalytics(Cmp) modifies the array behind Cmp in place, so the first line, print(np.delete(MatrixAnalytics(Cmp),[0],1)), has already altered the data before the second call runs.
So never call a mutating function like this inside a print statement; call it once during an assignment and print the stored result, otherwise the original values are already lost by the second print.
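A minimal pattern along those lines (a sketch; Cmp stands for your original input array):
result = MatrixAnalytics(np.copy(Cmp))   # work on a copy so Cmp stays intact
trimmed = np.delete(result, [0], 1)      # drop column 0 once and keep the result
print(trimmed)
print("SecondPrint")
print(trimmed)                           # identical output: nothing mutated in between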

How to extract a DataFrame to obtain a nested array?

I have a sample DataFrame as below:
The first column consists of 2 years; for each year, 2 tracks exist, and each track includes pairs of latitude and longitude coordinates. How can I extract every track for each year separately to obtain an array of tracks with lat and long?
df = pd.DataFrame(
    {'year': [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
     'track_number': [0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1],
     'lat': [11.7,11.8,11.9,11.9,12.0,12.1,12.2,12.2,12.3,12.3,12.4,12.5,12.6,12.6,12.7,12.8],
     'long': [-83.68,-83.69,-83.70,-83.71,-83.71,-83.73,-83.74,-83.75,-83.76,-83.77,-83.78,-83.79,-83.80,-83.81,-83.82,-83.83]})
You can groupby year and then extract a numpy.array from the created dataframes with .to_numpy().
>>> years = []
>>> for _, df2 in df.groupby(["year"]):
...     years.append(df2.to_numpy()[:, 1:])
>>> years[0]
array([[ 0. , 11.7 , -83.68],
[ 0. , 11.8 , -83.69],
[ 0. , 11.9 , -83.7 ],
[ 0. , 11.9 , -83.71],
[ 1. , 12. , -83.71],
[ 1. , 12.1 , -83.73],
[ 1. , 12.2 , -83.74],
[ 1. , 12.2 , -83.75]])
>>> years[1]
array([[ 0. , 12.3 , -83.76],
[ 0. , 12.3 , -83.77],
[ 0. , 12.4 , -83.78],
[ 0. , 12.5 , -83.79],
[ 1. , 12.6 , -83.8 ],
[ 1. , 12.6 , -83.81],
[ 1. , 12.7 , -83.82],
[ 1. , 12.8 , -83.83]])
Here years[0] holds the desired information for year 0, and so on. Inside the array, the column positions of the original dataframe are preserved: the first element is the track number; the second, the latitude; and the third, the longitude.
If you wish to do the same per track, i.e., have an array of only latitude and longitude, you can groupby(["year", "track_number"]) as well, as in the sketch below.
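A minimal sketch of that variant (tracks is a hypothetical dict keyed by (year, track_number)):
>>> tracks = {}
>>> for (year, track), g in df.groupby(["year", "track_number"]):
...     tracks[(year, track)] = g[["lat", "long"]].to_numpy()
Each value is then an array of (lat, long) rows for one track of one year.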

How can I generate a rolling metric like this in Pandas

I have a dataframe that initially contains two columns: Home, which is 1 if a game was played at home, else 0, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate this as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
df=pd.DataFrame({'Home':[1,0,1,0,1,0,1,0], 'PTS':[11, 10, 12, 11, 13, 12, 14, 12]})
df.loc[df['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[df['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']
Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop, do the math on each slice and store it in a list, then assign the list to a new column of the DataFrame (using your df):
sens = []
for i in range(len(df)):
    d = df[:i]   # the games before the current one
    mean_pts = d.PTS.mean()
    home = d[d.Home == 1].PTS.mean()
    away = d[d.Home == 0].PTS.mean()
    # print(home, away, (home - away) / mean_pts)
    sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
Home PTS sens
0 1 11 NaN
1 0 10 NaN
2 1 12 0.095238
3 0 11 0.136364
4 1 13 0.090909
5 0 12 0.131579
6 1 14 0.086957
7 0 12 0.126506
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
Doing the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also home and away only have four rows and mean_pts has eight.
I tried .expanding(1).apply() with the following function and didn't get what I expected. Expanding doesn't pass both columns to the function; it appears to pass one column, then the other, so I punted...
def f(thing):
    print(thing, '***')
    return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***
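For completeness, one way to sidestep the index mismatch entirely is to mask with where() instead of filtering, so every series keeps the full eight-row index and expanding().mean() simply skips the NaNs. A sketch (note it includes the current game, unlike the df[:i] loop above):
>>> home_avg = df['PTS'].where(df['Home'] == 1).expanding().mean()
>>> away_avg = df['PTS'].where(df['Home'] == 0).expanding().mean()
>>> df['sens2'] = (home_avg - away_avg) / df['PTS'].expanding().mean()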

Not able to get my head around this python

I just implemented a hierarchical clustering by following the documentation here: http://www.mathworks.com/help/stats/hierarchical-clustering.html?s_tid=doc_12b
So, let me try to put down what I am trying to do.
Take a look at the following figure:
Now, this dendrogram is generated from the following data:
node1 node2 dist(node1,node2) num_elems
assigning index 37 to [ 16. 26. 1.14749118 2. ]
assigning index 38 to [ 4. 7. 1.20402602 2. ]
assigning index 39 to [ 13. 29. 1.44708015 2. ]
assigning index 40 to [ 12. 18. 1.45827365 2. ]
assigning index 41 to [ 10. 34. 1.49607538 2. ]
assigning index 42 to [ 17. 38. 1.52565922 3. ]
assigning index 43 to [ 8. 25. 1.58919037 2. ]
assigning index 44 to [ 3. 40. 1.60231007 3. ]
assigning index 45 to [ 6. 42. 1.65755731 4. ]
assigning index 46 to [ 15. 23. 1.77770844 2. ]
assigning index 47 to [ 24. 33. 1.77771082 2. ]
assigning index 48 to [ 20. 35. 1.81301111 2. ]
assigning index 49 to [ 19. 48. 1.9191061 3. ]
assigning index 50 to [ 0. 44. 1.94238609 4. ]
assigning index 51 to [ 2. 36. 2.0444266 2. ]
assigning index 52 to [ 39. 45. 2.11667375 6. ]
assigning index 53 to [ 32. 43. 2.17132916 3. ]
assigning index 54 to [ 21. 41. 2.2882061 3. ]
assigning index 55 to [ 9. 30. 2.34492327 2. ]
assigning index 56 to [ 5. 51. 2.38383321 3. ]
assigning index 57 to [ 46. 52. 2.42100025 8. ]
assigning index 58 to [ 28. 37. 2.48365024 3. ]
assigning index 59 to [ 50. 53. 2.57305009 7. ]
assigning index 60 to [ 49. 57. 2.69459675 11. ]
assigning index 61 to [ 11. 54. 2.75669475 4. ]
assigning index 62 to [ 22. 27. 2.77163751 2. ]
assigning index 63 to [ 47. 55. 2.79303418 4. ]
assigning index 64 to [ 14. 60. 2.88015327 12. ]
assigning index 65 to [ 56. 59. 2.95413905 10. ]
assigning index 66 to [ 61. 65. 3.12615829 14. ]
assigning index 67 to [ 64. 66. 3.28846304 26. ]
assigning index 68 to [ 31. 58. 3.3282066 4. ]
assigning index 69 to [ 63. 67. 3.47397104 30. ]
assigning index 70 to [ 62. 68. 3.63807605 6. ]
assigning index 71 to [ 1. 69. 4.09465969 31. ]
assigning index 72 to [ 70. 71. 4.74129435 37. ]
So basically, there are 37 points in my data, indexed from 0-36. Now, when a new cluster appears as row i of this list, I assign it the index i + len(thiscompletelist) + 1.
So for example, when the id 37 is seen again in a future iteration, that basically means it is linked into a branch as well.
I used Matlab to generate this image, but I want to query this information as query_node(node_id) so that it returns me a nested structure by level, such that on query_node(37) I get:
{ "left": {"level":1 {"id": 28}} , "right":{"level":0 {"left" :"id":16},"right":{"id":26}}}
Actually, I don't even know what the right data structure for this would be.
Basically I want to query by node and gain some insight into what the structure of this dendrogram looks like when I am standing on that node and looking below. :(
EDIT 1:
Oh, I didn't know that you won't be able to zoom the image. Basically, the fourth element from the left is 28 and the green entry is the first row of the data:
So the fourth vertical line on the dendrogram represents 28.
Next to that line, the first green line represents 16,
and next to that, the second green line represents 26.
Well, it's always good to build upon something already existing, so take a look at dendrogram in scipy.
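For the querying part, a minimal sketch using scipy's to_tree (assuming Z is the linkage matrix printed above; query_node is a hypothetical helper):
from scipy.cluster.hierarchy import to_tree

root = to_tree(Z)   # root ClusterNode of the hierarchy

def query_node(node, node_id):
    # Depth-first search for the ClusterNode with the given id.
    if node is None:
        return None
    if node.id == node_id:
        return node
    return query_node(node.left, node_id) or query_node(node.right, node_id)

n = query_node(root, 37)
print(n.left.id, n.right.id)   # the two children of cluster 37, e.g. 16 and 26
From the node you can keep walking .left and .right to build whatever nested dict you like.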

Rolling median in python

I have some stock data based on daily close values. I need to be able to insert these values into a python list and get a median for the last 30 closes. Is there a python library that does this?
In pure Python, having your data in a Python list a, you could do
median = sum(sorted(a[-30:])[14:16]) / 2.0
(This assumes a has at least 30 items.)
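If you'd rather not hard-code the two middle positions, here is a small helper that handles both odd and even counts (a sketch; median_last is a hypothetical name):
def median_last(a, n=30):
    tail = sorted(a[-n:])   # last n values (or all of them, if fewer), sorted
    mid = len(tail) // 2
    if len(tail) % 2:       # odd count: single middle value
        return tail[mid]
    return (tail[mid - 1] + tail[mid]) / 2.0   # even count: average the two middles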
Using the NumPy package, you could use
median = numpy.median(a[-30:])
Have you considered pandas? It is based on numpy, can automatically associate timestamps with your data, and discards any unknown dates as long as you fill them with numpy.nan. It also offers some rather powerful graphing via matplotlib.
Basically it was designed for financial analysis in python.
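For instance, a sketch with a hypothetical list closes of daily closing prices:
import pandas as pd

s = pd.Series(closes)
rolling = s.rolling(30).median()   # NaN for the first 29 rows, then 30-day medians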
Isn't the median just the middle value in a sorted range?
So, assuming your list is stock_data:
last_thirty = stock_data[-30:]
median = sorted(last_thirty)[15]
Now you just need to find and fix the off-by-one errors, and also handle the case of stock_data having fewer than 30 elements...
Let's try that here:
def rolling_median(data, window):
    if len(data) < window:
        subject = data[:]
    else:
        subject = data[-window:]   # was hard-coded to -30; use the window argument
    return sorted(subject)[len(subject) // 2]   # integer division for Python 3
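Usage for the question at hand would then be something like:
last30_median = rolling_median(stock_data, 30)
Note this still returns the upper middle value for even window sizes rather than averaging the two middles.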
# found this helpful:
import numpy as np

values = [10, 20, 30, 40, 50]   # renamed from list to avoid shadowing the builtin
med = []
for j in range(len(values)):
    sub_set = values[:j + 1]    # expanding window from the start
    med.append(np.median(sub_set))
print(med)
Here is a much faster method with w*|x| space complexity.
def moving_median(x, w):
    shifted = np.zeros((len(x) + w - 1, w))
    shifted[:, :] = np.nan
    for idx in range(w - 1):
        shifted[idx:-w + idx + 1, idx] = x
    shifted[idx + 1:, idx + 1] = x   # fill the last column; relies on idx == w - 2 after the loop
    # print(shifted)
    medians = np.median(shifted, axis=1)
    for idx in range(w - 1):   # fix up the partially filled edge rows
        medians[idx] = np.median(shifted[idx, :idx + 1])
        medians[-idx - 1] = np.median(shifted[-idx - 1, -idx - 1:])
    return medians[(w - 1) // 2:-(w - 1) // 2]
moving_median(np.arange(10), 4)
# Output
array([0.5, 1. , 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8. ])
The output has the same length as the input vector.
Edge rows with too few entries are dropped by the final slice, and of two half-NaN candidate rows (which occur only for an even window width), only the first is kept. Here is the shifted matrix from above with the respective median values:
[[ 0. nan nan nan] -> -
[ 1. 0. nan nan] -> 0.5
[ 2. 1. 0. nan] -> 1.0
[ 3. 2. 1. 0.] -> 1.5
[ 4. 3. 2. 1.] -> 2.5
[ 5. 4. 3. 2.] -> 3.5
[ 6. 5. 4. 3.] -> 4.5
[ 7. 6. 5. 4.] -> 5.5
[ 8. 7. 6. 5.] -> 6.5
[ 9. 8. 7. 6.] -> 7.5
[nan 9. 8. 7.] -> 8.0
[nan nan 9. 8.] -> -
[nan nan nan 9.]]-> -
The behaviour can be changed by adapting the final slice medians[(w-1)//2:-(w-1)//2].
Benchmark:
%%timeit
moving_median(np.arange(1000), 4)
# 267 µs ± 759 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Alternative approach (the results will be shifted):
def moving_median_list(x, w):
    medians = np.zeros(len(x))
    for j in range(len(x)):
        medians[j] = np.median(x[j:j + w])   # one np.median call per element
    return medians
%%timeit
moving_median_list(np.arange(1000), 4)
# 15.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Both algorithms have linear time complexity, but moving_median computes the bulk of the medians in a single vectorized np.median call instead of one call per element, so it is the faster option by a large constant factor.
