Power BI dataframe table visualization [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
Power BI has a Python visualization element. It creates dataframe from fields of Power BI data source, and then visualize it with matplotlib.pyplot.show() method.
I need to visualize dataframe in table form (with ability to color cells depending on different data conditions)
Problem is that any example of table visualizaions of dataframes doesn't work inside Power BI Py element (and doesn't says what the problem is) even when it works in Anaconda.
Can somebody show a working example of dataframe table visualisation for Power BI?

I created data in a dataframe to keep the example simple. This could also be the output of manipulation.
import pandas as pd
dataset = pd.DataFrame({'a': range(0,20,2), 'b': range(10,30,2)})
print(dataset)
a b
0 0 10
1 2 12
2 4 14
3 6 16
4 8 18
5 10 20
6 12 22
7 14 24
8 16 26
9 18 28
In a new Power BI file,
1. Get Data/More/Other/Python Script
Paste in:
dataset = pandas.DataFrame({'a': range(0,20,2), 'b': range(10,30,2)})
# Note the use of pandas, not pd
In the Navigator window, select 'dataset' under Python
Select Load or Transform Data if you wish to manipulate the data.
Once loaded you can to visualization and use the data just like any other table.
EDIT
While the question is closed because it was not focussed. I think this is what the op was looking for.
In Power BI, create a dataset as follows from python script:
dataset = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))
Use matplotlib.pyplot to create a heatmap from the table. You can control the heatmap more extensively than in this example.
So in visualization in Power BI, add the following python script (taken from Conditional formatting for 2- or 3-scale coloring of cells of a table):
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset = pandas.DataFrame(a, b, c, d, e, f, g, h)
# dataset = dataset.drop_duplicates()
# Paste or type your script code here:
import pandas as pandas
import numpy as np
import matplotlib.pyplot as plt
#Round to two digits to print nicely
vals = np.around(dataset.values, 2)
#Normalize data to [0, 1] range for color mapping below
normal = (dataset - dataset.min()) / (dataset.max() - dataset.min())
fig = plt.figure()
ax = fig.add_subplot(111)
ax.axis('off')
the_table=ax.table(cellText=vals, rowLabels=dataset.index, colLabels=dataset.columns,
loc='center', cellColours=plt.cm.RdYlGn(normal),animated=True)
plt.show()
From this you get:
If you refresh your data, the script will yeild a new heatmap, which is what should happen in your power bi if you update whatever data you are using. Hope this helps.

Related

Matplotlib time-based heatmap [duplicate]

This question already has answers here:
Normalize columns of a dataframe
(23 answers)
Closed 8 months ago.
Background: I picked up Python about a month ago, so my experience level is pretty slim. I'm pretty comfortable with VBA though years of data analysis through excel and PI Processbook.
I have 27 thermocouples that I pull data for in 1s intervals. I would like to heatmap them from hottest to coldest at a given instance in time. I've leveraged seaborn heatmaps, but the problem with those is that they compare temperatures across time as well and the aggregate of these thermocouples changes dramatically over time. See chart below:
Notice how in the attached, the pink one is colder than the rest when all of them are cold, but when they all heat up, the cold spot transfers to the orange and green ones (and even the blue one for a little bit at the peak).
In excel, I would write a do loop to apply conditional formatting to each individual timestamp (row), however in Python I can't figure it out for the life of me. The following is the code that I used to develop the above chart, so I'm hoping I can modify this to make it work.
tsStartTime = pd.Timestamp(strStart_Time)
tsEndTime = pd.Timestamp(strEnd_Time)
t = np.linspace(tsStartTime.value,tsEndTime.value, 150301)
TimeAxis = pd.to_datetime(t)
fig,ax = plt.subplots(figsize=(25,5))
plt.subplots_adjust(bottom = 0.25)
x = TimeAxis
i = 1
while i < 28:
globals()['y' + str(i)] = forceconvert_v(globals()['arTTXD' + str(i)])
ax.plot(x,globals()['y' + str(i)])
i += 1
I've tried to use seaborn heatmaps, but when i slice it by timestamps, the output array is size (27,) instead of (27,1), so it gets rejected.
Ultimately, I'm looking for an output that looks like this:
Notice how the values of 15 in the middle are blue despite being higher than the red 5s in the beginning. I didnt fill out every cell, but hopefully you get the jist of what I'm trying to accomplish.
This data is being pulled from OSISoft PI via the PIConnect library. PI leverages their own classes, but they are essentially either series or dataframes, but I can manipulate them into whatever they need to be if someone has any awesome ideas to handle this.
Here's the link to the data: https://file.io/JS0RoQvDL6AB
Thanks!
You are going the wrong way with globals. In this case, I suggest to use pandas.DataFrame.
What you are looking for can be produced like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Settings
col_number = 5
start = '1/1/2022 10:00:00'
end = '1/1/2022 10:10:00'
# prepare a data frame
index = pd.date_range(start=start, end=end, freq="S")
columns = [f'y{i}' for i in range(col_number)]
df = pd.DataFrame(index=index, columns=columns)
# fill in the data
for n, col in enumerate(df.columns):
df[col] = np.array([n + np.sin(2*np.pi*i/len(df)) for i in range(len(df))])
# drawing a heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
ax1.plot(df)
ax1.legend(df.columns)
ax2.imshow(df.T, aspect='auto', cmap='plasma')
ax2.set_yticks(range(len(df.columns)))
ax2.set_yticklabels(df.columns)
plt.show()
Here:
As far as you didn't supply data to reproduce your case I use sin as illustrative values.
Transposing df.T is needed to put records horizontally. Of course, we can initially write data horizontally, it's up to you.
set_yticks is here to avoid troubles when changing the y-labels on the second figure.
seaborn.heatmap(...) can be used as well:
import seaborn as sns
data = df.T
data.columns = df.index.strftime('%H:%M:%S')
plt.figure(figsize=(15,3))
sns.heatmap(data, cmap='plasma', xticklabels=60)
Update
To compare values at each point in time:
data = (data - data.min())/(data.max() - data.min())
sns.heatmap(data, cmap='plasma', xticklabels=60)

How to I extract objects? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed last year.
This post was edited and submitted for review last year and failed to reopen the post:
Original close reason(s) were not resolved
Improve this question
I want to split images like this in a way that every symbols gets splits up vertically kind of like this input image:
![input image][1]
to this:
![here][2]
The problem is each symbol might have different width so I can't really fix the splitting points like we do in array splitting. If all objects had same width then I could segment the image base on width. In this scenario, what logic I should use to extract these connected objects?
First load the img from the url
import numpy as np
import urllib.request
from PIL import Image
from matplotlib import pyplot as plt
urllib.request.urlretrieve(
'https://i.stack.imgur.com/GRHzg.png',
"img.png")
img = Image.open("img.png")
img.show()
Then consider the black part as "filled" and convert in numpy array
arr = (np.array(img)[:,:,:-1].sum(axis=-1)==0)
If we sum the rows values for each column we can have a simple sum of how much pixel are filled in each column:
plt.subplot(211)
plt.imshow(arr, aspect="auto")
plt.subplot(212)
plt.plot(arr.sum(axis=0))
plt.xlim(0,arr.shape[1])
finally if we compute the differential of this sum over the columns we can obtain the following result:
plt.subplot(211)
plt.imshow(arr, aspect="auto")
plt.subplot(212)
plt.plot(np.diff(arr.sum(axis=0)))
plt.xlim(0,arr.shape[1])
At this point you can simply chose a threshold and cut the image:
threshold = 25
cut = np.abs(np.diff(arr.sum(axis=0)))>threshold
x_lines = np.arange(len(cut))[cut]
plt.imshow(arr, aspect="auto")
plt.vlines(x_lines, 0, arr.shape[0], color="r")
This is my solution and it works fine, but it is sensitive to the chosen threshold and to the columns gradient. I hope it is useful.

K-means clustering on 3 dimensions with sklearn

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd
# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv',index_col=['PM'])
numProjects = len(df)
K = numProjects // 3 # Around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)
for k in range(1, K):
# Create a kmeans model on our data, using k clusters.
# Random_state helps ensure that the algorithm returns the
# same results each time.
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
# These are our fitted labels for clusters --
# the first cluster has label 0, and the second has label 1.
labels = kmeans_model.labels_
# Sum of distances of samples to their closest cluster center
SSE = kmeans_model.inertia_
print("k:",k, " SSE:", SSE)
# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short .iloc is an integer based indexing method for selecting data by position.
Let's say you have the dataframe:
A B C
1 2 3
4 5 6
7 8 9
10 11 12
The use of iloc in the example you provided iloc[:,:] selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error iloc[1:4] selects the rows at index 1-3. This would result in:
A B C
4 5 6
7 8 9
10 11 12
Now, if you think about what you are trying to do and the error you received you will realize that you have selected fewer samples form your data than you are looking for clusters. 3 samples (rows 1, 2, 3) but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and columns 1-3 that correspond to your lat, lng, and z values. To do this just add a colon as the first argument to iloc like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples and the columns at index 1, 2, and 3. Now, assuming you have enough samples, KMeans should work as you intended.

Plotting boxplots for a groupby object

I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
Group M F
0 1 0.465636 0.537723
1 1 0.560537 0.727238
2 1 0.268154 0.648927
3 2 0.722644 0.115550
4 3 0.586346 0.042896
5 2 0.562881 0.369686
6 2 0.395236 0.672477
7 3 0.577949 0.358801
8 1 0.764069 0.642724
9 3 0.731076 0.302369
In this case, I have three groups, so I would like to make a boxplot for each group and for M and F separately having the groups on Y axis and the columns of M and F colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable for larger dataframes with greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
. The desirable output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How to implement groupby to feed the desired data into the boxplot
What is the correct syntax for the box plot if I want to control what is displayed and not just use default settings (which I don't even know what they are, I am finding the documentation rather vague. To be specific,can I have the box covering the mean +/- standard deviation, and keep the vertical line at median value?)
I think you should use Seaborn library that offers to create these type of customize plots.In your case i had first melted your dataframe to convert it into proper format and then created the boxplot of your choice.
import pandas as pd
import matplotlib.pyplot as plt
Import seaborn as sns
dd=pd.melt(df,id_vars=['Group'],value_vars=['M','F'],var_name='sex')
sns.boxplot(y='Group',x='value',data=dd,orient="h",hue='sex')
The plot looks similar to your required plot.
Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
# here I prepare the data (group them manually and then store in lists)
Groups=[1,2,3]
Columns=df.columns.tolist()[1:]
print Columns
Mgroups=[]
Fgroups=[]
for g in Groups:
dfgc = df[df['Group']==g]
m=dfgc['M'].dropna()
f=dfgc['F'].dropna()
Mgroups.append(m.tolist())
Fgroups.append(f.tolist())
fig=plt.figure()
ax = plt.axes()
def setBoxColors(bp,cl):
plt.setp(bp['boxes'], color=cl, linewidth=2.)
plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
plt.setp(bp['caps'], color=cl,linewidth=2)
plt.setp(bp['medians'], color=cl, linewidth=3.5)
bpl = plt.boxplot(Mgroups, positions=np.array(xrange(len(Mgroups)))*3.0-0.4,vert=False,whis='range', sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.array(xrange(len(Fgroups)))*3.0+0.4,vert=False,whis='range', sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')
# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(xrange(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...

Plotting quantiles, median and spread using scipy and matplotlib [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I am new to matplotlib, and I want to create a plot, with the following information:
A line joining the medians of around 200 variable length vectors (input)
A line joining the corresponding quantiles of these vectors.
A line joining the corresponding spread (largest and smallest points).
So basically, its somewhat like a continuous box plot.
Thanks!
Using just scipy and matplotlib (you tagged only those libraries in your question) is a little bit verbose, but here's how you would do it (I'm doing it only for the quantiles):
import numpy as np
from scipy.stats import mstats
import matplotlib.pyplot as plt
# Create 10 columns with 100 rows of random data
rd = np.random.randn(100, 10)
# Calculate the quantiles column wise
quantiles = mstats.mquantiles(rd, axis=0)
# Plot it
labels = ['25%', '50%', '75%']
for i, q in enumerate(quantiles):
plt.plot(q, label=labels[i])
plt.legend()
Which gives you:
Now, I would try to convince you to try the Pandas library :)
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
# Calculate all the desired values
df = pd.DataFrame({'mean': rd.mean(), 'median': rd.median(),
'25%': rd.quantile(0.25), '50%': rd.quantile(0.5),
'75%': rd.quantile(0.75)})
# And plot it
df.plot()
You'll get:
Or you can get all the stats in just one line:
rd.describe().T.drop('count', axis=1).plot()
Note: I dropped the count since it's not a part of the "5 number summary".

Categories