I have a DataFrame; a subset of it can be rebuilt from this dict constructor:
df = pd.DataFrame(data = {'K': {8: 3.9274999999999998, 9: 1.9275, 10: 2.9274999999999998, 11: 2.9274999999999998, 12: 2.275, 13: 3.2750000000000004, 14: 2.775, 15: 2.8000000000000003, 16: 1.7999999999999998, 17: 2.8000000000000003, 18: 2.82, 19: 2.82, 20: 2.8000000000000003, 21: 2.8000000000000003, 22: 2.82, 23: 2.82, 24: 1.82, 25: 1.82, 26: 1.7999999999999998}, 'Struct': {8: 'Call', 9: 'Put', 10: 'Straddle', 11: 'Straddle', 12: 'Put', 13: 'Call', 14: 'Straddle', 15: 'Delta', 16: 'Put', 17: 'Put', 18: 'Put', 19: 'Delta', 20: 'Put', 21: 'Delta', 22: 'Delta', 23: 'Put', 24: 'Put', 25: 'Put', 26: 'Put'}, 'MainID': {8: 10, 9: 10, 10: 10, 11: 10, 12: 20, 13: 20, 14: 20, 15: 21, 16: 21, 17: 21, 18: 23, 19: 23, 20: 23, 21: 23, 22: 23, 23: 23, 24: 23, 25: 23, 26: 23}})
As a table:

| Index | K | Struct | MainID |
|-------|--------|----------|--------|
| 8 | 3.9275 | Call | 10 |
| 9 | 1.9275 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | 2.275 | Put | 20 |
| 13 | 3.275 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | 1.8 | Put | 21 |
| 17 | 2.8 | Put | 21 |
| 18 | 2.82 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 2.8 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 2.82 | Put | 23 |
| 24 | 1.82 | Put | 23 |
| 25 | 1.82 | Put | 23 |
| 26 | 1.8 | Put | 23 |
I am trying to find a way to do the following steps:
Groupby("MainID")
For any Call or Put, subtract the "Straddle" (or "Delta") "K" from its "K", if one exists within the same Groupby("MainID") group
In the case that you have multiple Delta/Put/Call rows within a Groupby("MainID") group, subtract based on ascending values; i.e., if K[Struct==Put] = [1,2,3] and K[Struct==Delta] = [2,2,3], the result would be [-1, 0, 0]
The resulting DF would look like:

| Index | K | Struct | MainID |
|-------|--------|----------|--------|
| 8 | 1 | Call | 10 |
| 9 | -1 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | -0.50 | Put | 20 |
| 13 | 0.50 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | -1 | Put | 21 |
| 17 | 0 | Put | 21 |
| 18 | 0 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 0 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 0 | Put | 23 |
| 24 | -1 | Put | 23 |
| 25 | -1 | Put | 23 |
| 26 | -1 | Put | 23 |
Thanks so much! It's a tricky one...
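For what it's worth, one reading of the matching rule that reproduces the desired output is: within each MainID group, sort the Call/Put strikes and the Straddle/Delta strikes in descending order, pair them positionally, and recycle the Straddle/Delta strikes when there are more options than bases. A sketch under that assumption (the `adjust` helper and the recycling rule are my guesses, not from the post), run here on a subset with groups 10 and 21:

```python
import numpy as np
import pandas as pd

# Subset of the frame above: MainID groups 10 and 21.
df = pd.DataFrame({
    "K": [3.9275, 1.9275, 2.9275, 2.9275, 2.8, 1.8, 2.8],
    "Struct": ["Call", "Put", "Straddle", "Straddle", "Delta", "Put", "Put"],
    "MainID": [10, 10, 10, 10, 21, 21, 21],
}, index=[8, 9, 10, 11, 15, 16, 17])

def adjust(group: pd.DataFrame) -> pd.Series:
    """Within one MainID group, subtract matched Straddle/Delta strikes
    from the Call/Put strikes; all other rows are left unchanged."""
    out = group["K"].astype(float)
    bases = np.sort(group.loc[group["Struct"].isin(["Straddle", "Delta"]), "K"].to_numpy())[::-1]
    opt_idx = group.index[group["Struct"].isin(["Call", "Put"])]
    if len(bases) and len(opt_idx):
        # Pair the largest option strike with the largest base strike,
        # recycling base strikes when options outnumber them (an assumption).
        ordered = out.loc[opt_idx].sort_values(ascending=False)
        out.loc[ordered.index] = ordered.to_numpy() - np.resize(bases, len(ordered))
    return out

df["K"] = df.groupby("MainID", group_keys=False).apply(adjust)
```

On this subset the Call at index 8 becomes 1, the Put at index 9 becomes -1, and the Straddle rows keep their original K, matching the desired table above.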
I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.
I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.
But I haven't been able to get it right. I've tried to model on this SO question and answer.
This is my starting dataframe:
df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
"county_fips" : ['06019','06021','06023'] ,
"grade11" : [5000,2000,2000],
"grade12": [5200,2200,2200],
"grade11_chg": [1.01,1.02,1.03],
"grade11_12_ratio": [0.9,0.8,0.87]})
df
This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.
projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)
df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
print('Finished processing')
But I end up with NaN values and repeating years. Below shows my desired output (I built this in Excel, and the numbers reflect rounding). Update: this corrects the original df_final_goal.
df_final_goal=pd.DataFrame({'year': {0: '2020_21', 1: '2021_22', 2: '2022_23', 3: '2023_24', 4: '2024_25', 5: '2025_26',
6: '2026_27', 7: '2027_28', 8: '2028_29', 9: '2029_30', 10: '2030_31', 11: '2020_21', 12: '2021_22', 13: '2022_23',
14: '2023_24', 15: '2024_25', 16: '2025_26', 17: '2026_27', 18: '2027_28', 19: '2028_29', 20: '2029_30', 21: '2030_31',
22: '2020_21', 23: '2021_22', 24: '2022_23', 25: '2023_24', 26: '2024_25', 27: '2025_26', 28: '2026_27', 29: '2027_28',
30: '2028_29', 31: '2029_30', 32: '2030_31'},
'county_fips': {0: '06019', 1: '06019', 2: '06019', 3: '06019', 4: '06019', 5: '06019', 6: '06019', 7: '06019', 8: '06019',
9: '06019', 10: '06019', 11: '06021', 12: '06021', 13: '06021', 14: '06021', 15: '06021', 16: '06021', 17: '06021', 18: '06021',
19: '06021', 20: '06021', 21: '06021', 22: '06023', 23: '06023', 24: '06023', 25: '06023', 26: '06023', 27: '06023',
28: '06023', 29: '06023', 30: '06023', 31: '06023', 32: '06023'},
'grade11': {0: 5000, 1: 5050, 2: 5101, 3: 5152, 4: 5203, 5: 5255, 6: 5308, 7: 5361, 8: 5414, 9: 5468, 10: 5523,
11: 2000, 12: 2040, 13: 2081, 14: 2122, 15: 2165, 16: 2208, 17: 2252, 18: 2297, 19: 2343, 20: 2390, 21: 2438,
22: 2000, 23: 2060, 24: 2122, 25: 2185, 26: 2251, 27: 2319, 28: 2388, 29: 2460, 30: 2534, 31: 2610, 32: 2688},
'grade12': {0: 5200, 1: 4500, 2: 4545, 3: 4590, 4: 4636, 5: 4683, 6: 4730, 7: 4777, 8: 4825, 9: 4873, 10: 4922,
11: 2200, 12: 1600, 13: 1632, 14: 1665, 15: 1698, 16: 1732, 17: 1767, 18: 1802, 19: 1838, 20: 1875, 21: 1912,
22: 2200, 23: 1740, 24: 1792, 25: 1846, 26: 1901, 27: 1958, 28: 2017, 29: 2078, 30: 2140, 31: 2204, 32: 2270}})
Thanks for any assistance.
Creating a helper function for calculating grade11 helps make this a bit easier.
import pandas as pd
def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on current
    `grade11`, `grade11_chg`, and number of
    `projection_years`.
    """
    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)
    return list_of_vals

# initial info
df = pd.DataFrame({
    "year": ['2020_21', '2020_21', '2020_21'],
    "county_fips": ['06019', '06021', '06023'],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])
# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)
# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")
# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
final.grade11_12_ratio.ffill(inplace=True)
# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)
# put it all together
final.grade12.fillna(grade12, inplace=True)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)
There are some bugs in the code. The version below seems to produce the result you expect (note that the desired final dataframe is currently not consistent with the initial one):
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
full_name = []
for i in df['county_fips'].unique():
    print(i)
    df_select = df[df['county_fips']==i]
    grade11_change = df_select.iloc[0]['grade11_chg']
    grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
    full_name.append(df_select)
df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
print('Finished processing')
Fixes:
- full_name is initialized before the outer loop
- df_select is no longer redefined inside the inner loop
- row was initialized twice inside the inner loop; the duplicate is removed
- full_name.append is moved out of the inner loop, after it
- added reset_index() to df_final (mostly cosmetic)
- (edit) the grade change variables (grade11_change and grade11_12_ratio) are now computed from df_select (and not df)
the final result (print(df_final.to_markdown())) with the above code is:
|    | index | year    | county_fips | grade11 | grade12 |
|----|-------|---------|-------------|---------|---------|
| 0  | 0 | 2020_21 | 06019 | 5000 | 5200 |
| 1  | 0 | 2021_22 | 06019 | 5050 | 4500 |
| 2  | 0 | 2022_23 | 06019 | 5100 | 4545 |
| 3  | 0 | 2023_24 | 06019 | 5151 | 4590 |
| 4  | 0 | 2024_25 | 06019 | 5202 | 4635 |
| 5  | 0 | 2025_26 | 06019 | 5254 | 4681 |
| 6  | 0 | 2026_27 | 06019 | 5306 | 4728 |
| 7  | 0 | 2027_28 | 06019 | 5359 | 4775 |
| 8  | 0 | 2028_29 | 06019 | 5412 | 4823 |
| 9  | 0 | 2029_30 | 06019 | 5466 | 4870 |
| 10 | 0 | 2030_31 | 06019 | 5520 | 4919 |
| 11 | 1 | 2020_21 | 06021 | 2000 | 2200 |
| 12 | 0 | 2021_22 | 06021 | 2040 | 1600 |
| 13 | 0 | 2022_23 | 06021 | 2080 | 1632 |
| 14 | 0 | 2023_24 | 06021 | 2121 | 1664 |
| 15 | 0 | 2024_25 | 06021 | 2163 | 1696 |
| 16 | 0 | 2025_26 | 06021 | 2206 | 1730 |
| 17 | 0 | 2026_27 | 06021 | 2250 | 1764 |
| 18 | 0 | 2027_28 | 06021 | 2295 | 1800 |
| 19 | 0 | 2028_29 | 06021 | 2340 | 1836 |
| 20 | 0 | 2029_30 | 06021 | 2386 | 1872 |
| 21 | 0 | 2030_31 | 06021 | 2433 | 1908 |
| 22 | 2 | 2020_21 | 06023 | 2000 | 2200 |
| 23 | 0 | 2021_22 | 06023 | 2060 | 1740 |
| 24 | 0 | 2022_23 | 06023 | 2121 | 1792 |
| 25 | 0 | 2023_24 | 06023 | 2184 | 1845 |
| 26 | 0 | 2024_25 | 06023 | 2249 | 1900 |
| 27 | 0 | 2025_26 | 06023 | 2316 | 1956 |
| 28 | 0 | 2026_27 | 06023 | 2385 | 2014 |
| 29 | 0 | 2027_28 | 06023 | 2456 | 2074 |
| 30 | 0 | 2028_29 | 06023 | 2529 | 2136 |
| 31 | 0 | 2029_30 | 06023 | 2604 | 2200 |
| 32 | 0 | 2030_31 | 06023 | 2682 | 2265 |
note: edited to address the comments
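One more caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above fails on current pandas. A sketch of the same logic using pd.concat instead (trimmed to one county and two projection years for brevity):

```python
import pandas as pd

# Trimmed starting frame: one county, one base year.
df = pd.DataFrame({
    "year": ["2020_21"], "county_fips": ["06019"],
    "grade11": [5000], "grade12": [5200],
    "grade11_chg": [1.01], "grade11_12_ratio": [0.9],
})
projection_years = ["2021_22", "2022_23"]

frames = []
for fips in df["county_fips"].unique():
    df_select = df[df["county_fips"] == fips]
    grade11_change = df_select.iloc[0]["grade11_chg"]
    grade11_12_ratio = df_select.iloc[0]["grade11_12_ratio"]
    for year in projection_years:
        lr = df_select.iloc[-1]  # last (most recent) row for this county
        row = pd.DataFrame([{
            "year": year,
            "county_fips": fips,
            "grade11": int(lr["grade11"] * grade11_change),
            "grade12": int(lr["grade11"] * grade11_12_ratio),
        }])
        # pd.concat replaces the removed DataFrame.append
        df_select = pd.concat([df_select, row], ignore_index=True)
    frames.append(df_select)

df_final = pd.concat(frames, ignore_index=True)[["year", "county_fips", "grade11", "grade12"]]
```

The projected values match the table above: 5050/4500 for 2021_22 and 5100/4545 for 2022_23.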
I am working with a dataset (10000 data points) that provides 100 different account numbers with transaction amounts, date and time of transactions etc.
From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.
I tried to do this by:
group = df.groupby('account_num')
which then gives me
pandas.core.groupby.generic.DataFrameGroupBy
Then, when I want to get the group for a specific account number, say 51234:
group.get_group('51234')
I receive an error:
KeyError: 51234
How can I make a separate data frame containing all the transaction for one single account number?
(Sorry if this is a very basic question, I'm a newbie.)
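One likely cause of the KeyError, shown on a toy frame (the `amount` column here is illustrative): group keys keep the column's dtype, so if account_num is stored as an integer, get_group needs an int, not the string '51234'.

```python
import pandas as pd

df = pd.DataFrame({
    "account_num": [51234, 51234, 13123],  # integers, not strings
    "amount": [10.0, 20.0, 5.0],           # illustrative column
})
group = df.groupby("account_num")

# group.get_group('51234')   # raises KeyError: the group keys are ints
acct = group.get_group(51234)  # works: returns the two rows for this account
```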
IIUC, you can get your output in a slightly different way. Start by making sure your time column (which I assume is a date, based on your description) is actually a datetime object, then filter your dataframe for the specific account number. There are plenty of ways to do this; a common one is loc, but here I use query. Then sort by date with sort_values, and lastly group by the year part of your date column:
# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Filter and sort
>>> print(df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True))
# Equivalently with loc
print(
df.loc[df['account_num'] == 51234]\
.sort_values(by=['date'],ascending=True))
account_num date
0 51234 2020-01-01
1 51234 2020-02-01
2 51234 2020-03-01
7 51234 2020-08-01
9 51234 2020-08-01
11 51234 2020-08-01
13 51234 2020-08-01
3 51234 2021-04-01
4 51234 2021-05-01
5 51234 2023-06-01
6 51234 2023-07-01
8 51234 2023-07-01
10 51234 2023-07-01
12 51234 2023-07-01
# Filter, sort, and get yearly count
>>> print(
df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True)\
.groupby(df['date'].dt.year).account_num.count())
date
2020 7
2021 2
2023 5
Based on the below sample DF:
{'account_num': {0: 51234,
1: 51234,
2: 51234,
3: 51234,
4: 51234,
5: 51234,
6: 51234,
7: 51234,
8: 51234,
9: 51234,
10: 51234,
11: 51234,
12: 51234,
13: 51234,
14: 512346,
15: 512346,
16: 512346,
17: 512346,
18: 512346,
19: 512346,
20: 512346,
21: 512346,
22: 512346,
23: 13123,
24: 13123,
25: 13123,
26: 13123,
27: 13123,
28: 13123,
29: 13123,
30: 13123,
31: 13123},
'date': {0: '01/01/2020',
1: '02/01/2020',
2: '03/01/2020',
3: '04/01/2021',
4: '05/01/2021',
5: '06/01/2023',
6: '07/01/2023',
7: '08/01/2020',
8: '07/01/2023',
9: '08/01/2020',
10: '07/01/2023',
11: '08/01/2020',
12: '07/01/2023',
13: '08/01/2020',
14: '09/01/2020',
15: '10/01/2020',
16: '11/01/2020',
17: '12/01/2020',
18: '13/01/2020',
19: '14/01/2020',
20: '15/01/2020',
21: '16/01/2020',
22: '17/01/2020',
23: '18/01/2020',
24: '19/01/2020',
25: '20/01/2020',
26: '21/01/2020',
27: '22/01/2020',
28: '23/01/2020',
29: '24/01/2020',
30: '25/01/2020',
31: '26/01/2020'}}
I have a list:
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
Further, I do:
import pandas as pd
pd.DataFrame(sorted_info)
     0
0    1: surgery?\n
1    2: Age\n
2    3: Hospital Number\n
3    4: rectal temperature\n
4    5: pulse\n
5    - is a reflection of the heart condi...
6    6: respiratory rate\n
7    7: temperature of extremities\n
8    - possible values:\n
9    8: peripheral pulse\n
10   - possible values are:\n
11   9: mucous membranes\n
12   - possible values are:\n
13   10: capillary refill time\n
14   11: pain - a subjective judgement of the hors...
15   - possible values:\n
16   12: peristalsis\n
17   - possible values:\n
18   13: abdominal distension\n
19   14: nasogastric tube\n
20   - possible values:\n
21   15: nasogastric reflux\n
22   16: nasogastric reflux PH\n
23   17: rectal examination - feces\n
24   18: abdomen\n
25   19: packed cell volume\n
26   20: total protein\n
27   21: abdominocentesis appearance\n
28   - possible values:\n
29   22: abdomcentesis total protein\n
30   23: outcome\n
31   - possible values:\n
32   24: surgical lesion?\n
33   - possible values:\n
34   25, 26, 27: type of lesion\n
35   28: cp_data\n
So I am trying to transform it so that the DF will look like:
Col1 Col2
1: surgery
2: Age
3: Hospital Number
etc.
Any suggestions how to split Series into 2 Cols and clean/delete rest of info?
Try:
import pandas as pd
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
sorted_info = [x.strip() for x in sorted_info]
joined_list = []
for x in sorted_info:
    if x.startswith('-'):
        joined_list[-1] += ' ' + x
    else:
        joined_list.append(x)
df = pd.DataFrame(joined_list)
df[['Number', 'Text']] = df[0].str.split(':', n=1, expand=True)
del df[0]
print(df)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25, 26, 27 type of lesion
25 28 cp_data
Additional:
If you wanted to then go forwards and expand the 25, 26, 27 values in row 24 try:
df = df.apply(lambda x: x.str.split(',').explode()).reset_index(drop=True)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25 type of lesion
25 26 type of lesion
26 27 type of lesion
27 28 cp_data
We can do it this way:
1. Read the list sorted_info into a Pandas Series.
2. Use .str.extract() with a regex to extract the number tag and the main contents of each line of text, giving Col1 and Col2.
3. For continuation lines without a number tag, use .ffill() to forward-fill the missing tag number.
4. Group by the tag number in Col1 and join the texts of continuation lines sharing the same tag number.
Here is the code:
# Read the list `sorted_info` into a Pandas Series:
s = pd.Series(sorted_info)
# Extract the number tag and main contents of a line of text:
df = s.str.extract(r'\s*(?:(?P<Col1>\d+(?:,\s*\d+)*:)|-)\s*(?P<Col2>.*)', expand=True)
# For continuation lines without number tag, forward fill the missing tag number
df['Col1'] = df['Col1'].ffill()
# Group by the tag numbers in `Col1` and join text in continuation line based on the same tag number
df_out = df.groupby('Col1', sort=False, as_index=False).agg(' - '.join)
Result:
print(df_out)
Col1 Col2
0 1: surgery?
1 2: Age
2 3: Hospital Number
3 4: rectal temperature
4 5: pulse - is a reflection of the heart condition: 30 -40 is normal for adults
5 6: respiratory rate
6 7: temperature of extremities - possible values:
7 8: peripheral pulse - possible values are:
8 9: mucous membranes - possible values are:
9 10: capillary refill time
10 11: pain - a subjective judgement of the horse's pain level - possible values:
11 12: peristalsis - possible values:
12 13: abdominal distension
13 14: nasogastric tube - possible values:
14 15: nasogastric reflux
15 16: nasogastric reflux PH
16 17: rectal examination - feces
17 18: abdomen
18 19: packed cell volume
19 20: total protein
20 21: abdominocentesis appearance - possible values:
21 22: abdomcentesis total protein
22 23: outcome - possible values:
23 24: surgical lesion? - possible values:
24 25, 26, 27: type of lesion
25 28: cp_data
You should first clean the data itself. It is unfortunate that it reaches the part where you want to make a DataFrame in such a noisy form. Ideally, the cleanup should come closer to the data collection itself.
Generally speaking, data cleaning is highly domain-dependent. For simple string cleanup, usually a combination of string.split(), re.match() and list comprehensions can go a long way.
For your specific case, the following gives good results (exercise for the reader: understand each bit of the expression, by trying reduced forms of it, starting by e.g. [v.splitlines()[0].split(':', 1) for v in sorted_info] building up toward the final form):
import re
cleaned = [
[s.split(' - ', 1)[0].strip().rstrip('?')
for s in v.splitlines()[0].split(':', 1)]
    for v in sorted_info if re.match(r'^ *\d+:', v)
]
>>> cleaned
[['1', 'surgery'],
['2', 'Age'],
['3', 'Hospital Number'],
['4', 'rectal temperature'],
['5', 'pulse'],
['6', 'respiratory rate'],
['7', 'temperature of extremities'],
['8', 'peripheral pulse'],
['9', 'mucous membranes'],
['10', 'capillary refill time'],
['11', 'pain'],
['12', 'peristalsis'],
['13', 'abdominal distension'],
['14', 'nasogastric tube'],
['15', 'nasogastric reflux'],
['16', 'nasogastric reflux PH'],
['17', 'rectal examination'],
['18', 'abdomen'],
['19', 'packed cell volume'],
['20', 'total protein'],
['21', 'abdominocentesis appearance'],
['22', 'abdomcentesis total protein'],
['23', 'outcome'],
['24', 'surgical lesion'],
['28', 'cp_data']]
# and
df = pd.DataFrame(cleaned, columns=['Col1', 'Col2'])
>>> df
Col1 Col2
0 1 surgery
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse
5 6 respiratory rate
6 7 temperature of extremities
7 8 peripheral pulse
8 9 mucous membranes
9 10 capillary refill time
10 11 pain
11 12 peristalsis
12 13 abdominal distension
13 14 nasogastric tube
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance
21 22 abdomcentesis total protein
22 23 outcome
23 24 surgical lesion
24 28 cp_data
I'm trying to use SciPy's dendrogram method to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are labels.
Alternatively, I've tried using fcluster with the same threshold value I identified in dendrogram; however, this does not render the same result -- it gives me one cluster instead of three.
Here's my code.
import pandas
data = pandas.DataFrame({'total_runs': {0: 2.489857755536053,
1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007,
6: 0.68217502514600825, 7: 0.67963194285399975, 8: 0.64238326692987524, 9: 0.6102581538587678, 10: 0.52588765899448564,
11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,
16: 0.14023670906519603, 17: 0.096817318756050832, 18: 0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,
21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,
26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322, 29: -1.1981351436422796, 30: -1.944118415387774,
31: -2.1899746357945964, 32: -2.9077222144449961},
'total_salaries': {0: 3.5998991340231234,
1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,
6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848,
11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,
16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,
21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,
26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,
31: -1.5512503530547552, 32: -1.6573856443389405}})
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
distanceMatrix = pdist(data)
dend = dendrogram(linkage(distanceMatrix, method='complete'),
                  color_threshold=4,
                  leaf_font_size=10,
                  labels=df.teamID.tolist())
len(dend['color_list'])
Out[169]: 32
len(df.index)
Out[170]: 33
Why is dendrogram only assigning colors to 32 labels, although there are 33 observations in the data? Is this how I extract the labels and their corresponding clusters (colored in blue, green and red above)? If not, how else do I 'cut' the tree properly?
Here's my attempt at using fcluster. Why does it return only one cluster for the set, when the same threshold for dend returns three?
from scipy.cluster.hierarchy import fcluster
fcluster(linkage(distanceMatrix, method='complete'), 4)
Out[175]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Here's the answer: I didn't pass 'distance' as the criterion option to fcluster, so it used the default criterion='inconsistent'. With criterion='distance', the threshold 4 is interpreted as a cophenetic-distance cutoff (the same cut dendrogram colors with color_threshold), and I get the correct (3) cluster assignments.
assignments = fcluster(linkage(distanceMatrix, method='complete'),4,'distance')
print(assignments)
[3 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
cluster_output = pandas.DataFrame({'team':df.teamID.tolist() , 'cluster':assignments})
print(cluster_output)
cluster team
0 3 NYA
1 2 BOS
2 2 PHI
3 2 CHA
4 2 SFN
5 2 LAN
6 2 TEX
7 2 ATL
8 2 SLN
9 2 SEA
10 2 NYN
11 2 HOU
12 1 BAL
13 2 DET
14 1 ARI
15 2 CHN
16 1 CLE
17 1 CIN
18 1 TOR
19 1 COL
20 1 OAK
21 1 MIL
22 1 MIN
23 1 SDN
24 1 KCA
25 1 TBA
26 1 FLO
27 1 PIT
28 1 LAA
29 1 WAS
30 1 ANA
31 1 MON
32 1 MIA
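The criterion difference can be illustrated on a toy dataset (this example is mine, not from the original post): with criterion='distance', fcluster cuts the tree at cophenetic distance t, whereas the default criterion='inconsistent' uses inconsistency coefficients, which is why the same threshold of 4 gave a single cluster above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight, well-separated groups on a line.
rng = np.random.default_rng(0)
pts = np.concatenate([
    rng.normal(0, 0.1, 10),
    rng.normal(5, 0.1, 10),
    rng.normal(10, 0.1, 10),
]).reshape(-1, 1)

Z = linkage(pts, method='complete')

# 'distance' interprets t as a cophenetic-distance cutoff,
# so cutting at t=2 separates the three groups.
labels = fcluster(Z, t=2, criterion='distance')
```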