Why is the CausalNex output in Python wrong?

I am using CausalNex to create a DAG from a dataset in Python.
I got the graph, and the nodes are correct, but the edges are totally off. I tried this in a DataFrame with four random independent variables (Requestor, Risk, Size, Developer) and a single dependent one (Duration), and the graph produced is this:
Am I using the library incorrectly? Why is the figure so distant from the true data-generating process? Could a Bayesian Network model outperform CausalNex?
I tried this code:
# Generate initial data
import numpy as np
import pandas as pd

np.random.seed(42)
fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
df = pd.DataFrame({
    "Requestor": np.random.randint(1, 4, 100),
    "Size": np.random.randint(1, 4, 100),
    "Risk": np.random.randint(1, 4, 100)
})
df["Developer"] = np.random.choice(fib_list, df.shape[0])
df["Duration"] = (
    0.1 * df["Requestor"] +
    0.2 * df["Size"] +
    0.2 * df["Risk"] +
    0.5 * df["Developer"]
)

# Generate graph
from causalnex.structure.notears import from_pandas
import matplotlib.pyplot as plt
import networkx as nx

sm = from_pandas(df)
sm.remove_edges_below_threshold(0.8)
nx.draw_shell(sm, with_labels=True, font_weight="bold")
plt.show()
I was expecting something like this:

I would say that the relations between the variables are not easy to capture (particularly because of the domain size of Developer). The parents of the continuous "Duration" have a combined domain size of 3 × 3 × 3 × 12 (Requestor, Size and Risk each take three values, and Developer is drawn from a 12-element Fibonacci list). And Duration itself is not really continuous: it can take 102 different values.
So a database of size 100 is really not enough for the tests/scores used by the learning algorithms to be accurate.
Note that I multiplied Duration by 10 to keep integer values.
FYI, an inference in the last BN:
The code:
import numpy as np
import pandas as pd
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

gum.config["notebook", "default_graph_size"] = "3!"  # change the default size for BNs

def createDB(N: int):
    # code from Rafaela Medeiros
    np.random.seed(42)
    fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    data = {"Requestor": np.random.randint(1, 4, N),
            "Size": np.random.randint(1, 4, N),
            "Risk": np.random.randint(1, 4, N)}
    df = pd.DataFrame(data)
    df["Developer"] = np.random.choice(fib_list, df.shape[0])
    # coefficients multiplied by 10 to keep Duration integer-valued
    df["Duration"] = (1*df["Requestor"] + 2*df["Size"] + 2*df["Risk"] + 5*df["Developer"])
    return df

def learnForSize(N: int):
    learner = gum.BNLearner(createDB(N))
    learner.useMIIC()  # choosing this algorithm to learn the structure
    bn = learner.learnBN()
    return bn

sizes = [100, 5000, 55000]
gnb.flow.row(*[learnForSize(N) for N in sizes],
             captions=[f"{size=}" for size in sizes])
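To connect this back to CausalNex: the same sample-size effect can be checked by rerunning NOTEARS on a larger database. A minimal sketch reusing the createDB helper above, with the same from_pandas / remove_edges_below_threshold calls as in the question (note that with the ×10 coefficients the learned edge weights change scale, so the 0.8 threshold is only a starting point):

from causalnex.structure.notears import from_pandas
import matplotlib.pyplot as plt
import networkx as nx

df_big = createDB(55000)  # same generator as above, just more rows
sm = from_pandas(df_big)
sm.remove_edges_below_threshold(0.8)  # threshold taken from the question; may need tuning
nx.draw_shell(sm, with_labels=True, font_weight="bold")
plt.show()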

Related

Multiple images numpy array into blocks

I have a numpy array of 1000 RGB images with shape (1000, 90, 90, 3), and I need to work on each image, but sliced into 9 blocks. I've found many solutions for slicing a single image, but how can I obtain a (9000, 30, 30, 3) array and then iteratively send 9 contiguous blocks to a function?
I would do something like the code below. In my example I used parts of images from skimage.data to illustrate the method, and made the shapes and sizes different so that it looks prettier; you can do the same for your data by adjusting those parameters.
from skimage import data
from matplotlib import pyplot as plt
import numpy as np

astronaut = data.astronaut()
coffee = data.coffee()
arr = np.stack([coffee[:400, :400, :], astronaut[:400, :400, :]])

plt.imshow(arr[0])
plt.title('arr[0]')
plt.figure()
plt.imshow(arr[1])
plt.title('arr[1]')

# split each 400x400 image into a 4x4 grid of 100x100 blocks
arr_blocks = arr.reshape(arr.shape[0], 4, 100, 4, 100, 3).swapaxes(2, 3)
arr_blocks = arr_blocks.reshape(-1, 100, 100, 3)

for i, block in enumerate(arr_blocks):
    plt.figure(10 + i//16, figsize=(10, 10))
    plt.subplot(4, 4, i % 16 + 1)
    plt.imshow(block)
    plt.title(f'block {i}')

# batch_size = 9
# some_outputs_list = []
# for i in range(arr_blocks.shape[0]//batch_size + ((arr_blocks.shape[0] % batch_size) > 0)):
#     some_outputs_list.append(some_function(arr_blocks[i*batch_size:(i+1)*batch_size]))
Output: plots of the two source images followed by each 100×100 block.
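For the shape in the question, the same reshape/swapaxes trick applies directly, since 90 = 3 × 30 and each image splits into 9 blocks. A minimal sketch with random data standing in for the 1000 images (some_function is a hypothetical placeholder for the per-batch processing):

import numpy as np

imgs = np.random.randint(0, 256, size=(1000, 90, 90, 3), dtype=np.uint8)  # stand-in data

# (1000, 90, 90, 3) -> (1000, 3, 30, 3, 30, 3) -> (9000, 30, 30, 3)
blocks = imgs.reshape(1000, 3, 30, 3, 30, 3).swapaxes(2, 3).reshape(-1, 30, 30, 3)

for i in range(imgs.shape[0]):
    nine_blocks = blocks[9*i:9*(i+1)]  # the 9 contiguous blocks of image i
    # some_function(nine_blocks)       # hypothetical per-image processing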

Having trouble converting r chisquare goodness of fit test code to python equivalent

UCLA has this great site for statistical tests
https://stats.idre.ucla.edu/r/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-r/#1sampt
but the code is all in R. I am trying to convert the code to Python equivalents, but it is not a straightforward process for some of them, like the chi-square goodness of fit. Here is the R version:
hsb2 <- within(read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv"), {
    race <- as.factor(race)
    schtyp <- as.factor(schtyp)
    prog <- as.factor(prog)
})
chisq.test(table(hsb2$race), p = c(10, 10, 10, 70)/100)
My Python attempt is this:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
# convert to category
df["race"] = df["race"].astype("category")
t_race = pd.crosstab(df.race, columns='race')
p_tests = np.array((10, 10, 10, 70))
p_tests = p_tests / 100
# tried this
stats.chisquare(t_race, p_tests)
# and this
stats.chisquare(t_race.T, p_tests)
but neither stats.chisquare output comes close to the R version. Can anybody steer me in the right direction? TIA
chisq.test takes a vector of probabilities; stats.chisquare takes the expected frequencies (docs).
> results = chisq.test(c(24, 11, 20, 145), p=c(0.1, 0.1, 0.1, 0.7))
> results
Chi-squared test for given probabilities
data: c(24, 11, 20, 145)
X-squared = 5.028571429, df = 3, p-value = 0.169716919
vs.
In [49]: obs = np.array([24, 11, 20, 145])
In [50]: prob = np.array([0.1, 0.1, 0.1, 0.7])
In [51]: stats.chisquare(obs, f_exp=obs.sum() * prob)
Out[51]: Power_divergenceResult(statistic=5.0285714285714285, pvalue=0.16971691923343338)
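Applied to the hsb2 data from the question, the fix is to turn the hypothesized probabilities into expected counts. A short sketch (it assumes the four race categories and the 10/10/10/70 split from the question):

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
obs = df["race"].value_counts().sort_index().to_numpy()  # observed count per race category
p = np.array([0.10, 0.10, 0.10, 0.70])                   # hypothesized probabilities
stats.chisquare(obs, f_exp=obs.sum() * p)                # expected counts, not probabilities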

Solve a non-linear equation with numpy

Edit: Everything is good :)
This is code that works with small values such as t=20 and TR=([[30,20,12,23..],[...]]), but when I put in higher values I get the error "Expect x to be a 1-D sorted array_like.". Do you know how to solve this problem?
import matplotlib.pylab as plt
from scipy.special import erfc
import numpy as np
from scipy.interpolate import interp1d

# The function to inverse (note: scipy's top-level sqrt/exp are deprecated,
# so the numpy equivalents are used instead):
t = 100
alfa = 1.1*10**(-7)
k = 0.18
T1 = 20
Tpow = 180

def F(h):
    p = erfc(h*np.sqrt(alfa*t)/k)
    return T1 + (Tpow-T1)*(1 - np.exp((h**2*alfa*t)/k**2)*p)

# Interpolation
h_eval = np.linspace(-80, 500, 200)  # critical step: define the discretization grid
F_inverse = interp1d(F(h_eval), h_eval, kind='cubic', bounds_error=True)

# Some random data:
TR = np.array([[130, 100, 130, 130, 130],
               [ 90, 101, 100, 120,  90],
               [130, 130, 100, 100, 130],
               [120, 101, 120,  90, 110],
               [110, 130, 130, 110, 130]])

# Compute the array h for a given array TR
h = F_inverse(TR)
print(h)

# Graph to verify the interpolation
plt.plot(h_eval, F(h_eval), '.-', label='discretized F(h)')
plt.plot(h.ravel(), TR.ravel(), 'or', label='interpolated values')
plt.xlabel('h'); plt.ylabel('F(h) or TR'); plt.legend()
Does anyone have an idea how to solve a non-linear, implicit equation in numpy?
I have the array TR and the other values that appear in my equation.
I need to solve it and, as a result, receive a new array with the same shape.
Here is a solution using a 1D interpolation to compute the inverse of the F(h) function. Because a non-standard root-finding method is used, the error is not controlled and the discretization grid has to be chosen with care. However, the interpolated inverse function can be applied directly to an array.
Note: the definition of F is modified; the problem is now: solve h for F(h) = TR.
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pylab as plt

# The function to inverse:
t = 10
alfa = 1.1*10**(-7)
k = 0.18
T1 = 20
Tpow = 100

def F(h):
    A = np.exp(h**2*alfa*t/k**2)
    B = h**3*2/(3*np.sqrt(3))*(alfa*t)**(3/2)/k**3
    return -(Tpow-T1)*(1 - A + B)

# Interpolation
h_eval = np.linspace(40, 100, 50)  # critical step: define the discretization grid
F_inverse = interp1d(F(h_eval), h_eval, kind='cubic', bounds_error=True)

# Some random data:
TR = np.array([[13, 10, 13, 13, 13],
               [ 9, 11, 10, 12,  9],
               [13, 13, 10, 10, 13],
               [12, 11, 12,  9, 11],
               [11, 13, 13, 11, 13]])

# Compute the array h for a given array TR
h = F_inverse(TR)
print(h)

# Graph to verify the interpolation
plt.plot(h_eval, F(h_eval), '.-', label='discretized F(h)')
plt.plot(h.ravel(), TR.ravel(), 'or', label='interpolated values')
plt.xlabel('h'); plt.ylabel('F(h) or TR'); plt.legend()
With the other function, the following lines are changed:
from scipy.special import erf

def F(h):
    return (Tpow-T1)*(1 - np.exp((h**2*alfa*t)/k**2)*(1.0 - erf(h*np.sqrt(alfa*t)/k)))

# Interpolation
h_eval = np.linspace(15, 35, 50)  # the range is changed
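As an alternative with controlled error, each element of TR can also be solved with a standard root finder. A minimal sketch using scipy.optimize.brentq on the erfc version of F from the question; the bracket [-80, 500] is an assumption and must contain a sign change of F(h) - TR for every entry:

import numpy as np
from scipy.optimize import brentq
from scipy.special import erfc

t, alfa, k, T1, Tpow = 100, 1.1*10**(-7), 0.18, 20, 180

def F(h):
    p = erfc(h*np.sqrt(alfa*t)/k)
    return T1 + (Tpow-T1)*(1 - np.exp((h**2*alfa*t)/k**2)*p)

TR = np.array([[130, 100], [90, 120]], dtype=float)  # small stand-in array

# solve F(h) = tr element-wise; the result keeps the shape of TR
h = np.array([brentq(lambda x, v=tr: F(x) - v, -80, 500)
              for tr in TR.ravel()]).reshape(TR.shape)
print(h)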

Summing until time-condition is reached in Python

I want to sum over a certain, but rolling, period within my dynamic model. The formal representation is as follows
A simple code snippet to run the equation is:
import numpy as np
import pandas as pd
import operator
year = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
I tried subtracting list a from m_ with list(map(operator.sub, m_, a)), as found in another post.
My failed attempt looks something like this:
for t in year:
    for i in range(0, 3):
        while t < t + (list(map(operator.sub, m_, a))):
            L_[t] = sum(ARC_[i] / (1+r) ** t)
I am not at all sure that I understood it right; I tried to base my answer on the equation. Even if it is still a bit off from the result you expect, it might help you to solve your issue.
I create a result list to store each value of L[t], i.e. 50 values. Then I compute the start/stop of the sum for every pair (t, i) and compute it.
import numpy as np

years = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03

result = []
for t in years:
    s = 0
    for i in range(3):
        t0 = t
        tf = t + m_[i] - a[i]
        for k in range(int(t0), int(tf + 1)):
            s += ARC_[i] / (1 + r) ** t
    result.append(s)
If what you wanted to do is to compute the element-wise difference between m_ and a, a simple solution is:
[m_[i] - a[i] for i in range(len(m_))]
Hope it helps.
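As a side note: since the summand in the loop above does not depend on the inner index k, each inner loop just adds the same term a fixed number of times, so the whole computation collapses to one expression per t. A vectorized sketch under that reading of the equation:

import numpy as np

years = np.arange(50)
m_ = np.array([50, 30, 15])
a = np.array([25, 15, 7.5])
ARC_ = np.array([38, 255, 837])
r = 0.03

# number of terms in each inner sum: k runs from t to int(t + m_i - a_i)
counts = np.floor(m_ - a).astype(int) + 1
result = (counts * ARC_).sum() / (1 + r) ** years  # one value per year t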

How to print the output value in this example?

I'm trying to test the fuzzy logic tipping example that exists at the following link: click here
My question is: how can I make this control system print the output value (tipping) in terms of ['low', 'medium', 'high'] rather than printing the actual computed value?
The following is the example code
import matplotlib.pyplot as plt
import numpy as np
import skfuzzy as fuzzy
from skfuzzy import control
# Universe variables
quality = control.Antecedent(np.arange(0, 11, 1), 'quality')
service = control.Antecedent(np.arange(0, 11, 1), 'service')
tip = control.Consequent(np.arange(0, 26, 1), 'tip')
# Auto-membership function population (3,5,7)
quality.automf(3)
service.automf(3)
# Custom triangle membership functions
tip['low'] = fuzzy.trimf(tip.universe, [0, 0, 13])
tip['medium'] = fuzzy.trimf(tip.universe, [0, 13, 25])
tip['high'] = fuzzy.trimf(tip.universe, [13, 25, 25])
#view memberships
#quality.view()
#service.view()
#tip.view()
#Fuzzy rules
rule1 = control.Rule(quality['poor'] | service['poor'], tip['low'])
rule2 = control.Rule(service['average'], tip['medium'])
rule3 = control.Rule(service['good'] | quality['good'], tip['high'])
#Control System Creation and Simulation
tipping_ctrl = control.ControlSystem([rule1, rule2, rule3])
tipping = control.ControlSystemSimulation(tipping_ctrl)
# Pass inputs to the ControlSystem & compute
tipping.input['quality'] = 10
tipping.input['service'] = 3
tipping.compute()
#visualize & view
print(tipping.output)
tip.view(sim=tipping)
plt.show()
You have to index the output by the consequent's name, 'tip' in this case:
tipping.output['tip']
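To get a linguistic label ('low', 'medium', 'high') instead of the crisp number, one option is to evaluate the membership of the defuzzified value in each tip term and pick the strongest one. A sketch building on the code above (interp_membership is part of skfuzzy; the argmax-over-terms logic is my assumption, not part of the library):

value = tipping.output['tip']
# membership degree of the crisp output in each tip term
degrees = {name: fuzzy.interp_membership(tip.universe, term.mf, value)
           for name, term in tip.terms.items()}
print(max(degrees, key=degrees.get))  # e.g. 'high'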
