Adjust y axis when using parallel_coordinates - python

I'm working on plotting some data with pandas in the form of parallel coordinates, and I'm not too sure how to go about setting the y-axis scaling.
here is my code:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

def show_means(df: pd.DataFrame):
    plt.figure('9D-parallel_coordinates')
    plt.title('continents & features')
    parallel_coordinates(df, 'continent', color=['blue', 'green', 'red', 'yellow', 'orange', 'black'])
    plt.show()
and I got this plot:
[image: resulting parallel coordinates plot]
As the graph shows, the values of "tempo" are far larger than those of the other features. I want to scale all feature values between 0 and 1 and get a line chart. How can I do that? I also want to rotate the axis labels to vertical so that readers can read them more easily.
This is my data frame:
[image: the data frame]
Thanks

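To normalize your values between 0 and 1, you have several options. One of them is min-max scaling (what scikit-learn's MinMaxScaler does): the lowest value of each column becomes 0 and the highest becomes 1. Apply it to the numeric columns only, since 'continent' holds strings:
num = df.columns.drop('continent')
df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())
To get vertical labels, use df.plot(rot=90), or call plt.xticks(rotation=90) after parallel_coordinates.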
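Putting both steps together, here is a minimal sketch. The sample values and the 'energy' and 'loudness' columns are made up for illustration; only 'continent' and 'tempo' come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical data: one row of feature means per continent
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Africa"],
    "tempo": [120.0, 110.0, 130.0],
    "energy": [0.6, 0.7, 0.5],
    "loudness": [-8.0, -6.0, -10.0],
})

# Min-max scale the numeric columns only; the class column stays untouched
numeric_cols = df.select_dtypes("number").columns
scaled = df.copy()
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
scaled[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)

plt.figure("parallel_coordinates")
ax = parallel_coordinates(scaled, "continent", color=["blue", "green", "red"])
plt.xticks(rotation=90)  # vertical axis labels
plt.tight_layout()
# plt.show() or plt.savefig("parallel.png") to view the result
```

Note that a column whose values are all identical would divide by zero here and produce NaN, so such columns may need to be dropped or handled separately.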

Related

Annotate specific bars with values from Dataframe on Pandas bar plots

I have a bar chart like this:
and this is the code that I use to generate it:
def performance_plot_builder(data: str, ax: pyplot.Axes):
    df = pandas.read_csv(data, header=0, sep=';')
    df[['library', 'function']] = df.name.str.split('_', expand=True, n=1)
    # keyword arguments are required by pivot in recent pandas versions
    df = df.pivot(index='function', columns='library', values='elapsed')
    normalized = df.div(df.max(axis=1), axis=0)
    normalized.plot(ax=ax, kind='bar', color=[c.value for c in Color])
    ax.set_ylabel('execution time (normalized)')
    for p in ax.patches:
        ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
The data is first normalized relative to the maximum value between the two series for each item and then plotted. I've been able to annotate the value on each bar, however I would like several modifications:
I only want the values displayed on the maximum of each of the two values. For example, for array_access, only the stl bar's value will be shown since it is greater than etl.
The biggest thing I need is for the non-normalized values to be displayed instead of the normalized values as it is now (so the df dataframe instead of the normalized dataframe).
I would also like the labels to be rotated 90 degrees so that the labels display on the bars themselves.
This is an example dataframe I have:
library etl stl
function
copy 6.922975e-06 6.319098e-06
copy_if 1.369602e-04 1.423410e-04
count 6.135367e-05 1.179409e-04
count_if 1.332942e-04 1.908408e-04
equal 1.099963e-05 1.102448e-05
fill 5.337406e-05 9.352984e-05
fill_n 6.412923e-05 9.354095e-05
find 4.354274e-08 7.804437e-08
find_if 4.792641e-08 9.206846e-08
iter_swap 4.898631e-08 4.911048e-08
rotate 2.816952e-04 5.219732e-06
swap 2.832723e-04 2.882649e-04
swap_ranges 3.492764e-04 3.576686e-04
transform 9.739075e-05 1.080187e-04
I'm really not sure how to go about this since as far as I can tell, the data is retrieved from the Axes object, however this contains the normalized values.
Edit
I was able to somewhat accomplish all the modifications with this code:
interleaved = [val for pair in zip(df['etl'], df['stl']) for val in pair]
for v, p in zip(interleaved, ax.patches):
    if p.get_height() == 1:
        ax.text(x=p.get_x() + 0.01, y=0.825, s=f'{v:.1E}', rotation=90, color='white')
However, this is somewhat hard coded and only works if the bar chart values are normalized, which they are most likely to be, but not necessarily, so I would like a solution that is generic and is independent from the normalized values.
I was able to figure it out:
size = len(ax.patches) // 2
for v_etl, v_stl, p_etl, p_stl in zip(df['etl'], df['stl'], ax.patches[:size], ax.patches[size:]):
    p, v = (p_etl, v_etl) if v_etl > v_stl else (p_stl, v_stl)
    ax.text(x=p.get_x() + 0.18 * p.get_width(), y=p.get_height() - 0.175, s=f'{v:.1E}', rotation=90, color='white')
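A variation that does not depend on there being exactly two series could look like the sketch below. It assumes that pandas draws one bar container per column, in column order, which is how ax.containers behaves for a DataFrame bar plot. The three rows are a shortened, rounded excerpt of the example dataframe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Rounded excerpt of the example dataframe from the question
df = pd.DataFrame(
    {"etl": [6.9e-06, 1.37e-04, 2.8e-04], "stl": [6.3e-06, 1.42e-04, 5.2e-06]},
    index=pd.Index(["copy", "copy_if", "rotate"], name="function"),
)
normalized = df.div(df.max(axis=1), axis=0)

fig, ax = plt.subplots()
normalized.plot(ax=ax, kind="bar")

# Column holding the largest raw value in each row
winners = df.idxmax(axis=1)

labels_drawn = []
for container, column in zip(ax.containers, normalized.columns):
    for patch, (func, raw) in zip(container.patches, df[column].items()):
        if winners[func] != column:
            continue  # only label the tallest bar of each group
        text = f"{raw:.1E}"
        # Place the raw value in the middle of the bar, rotated 90 degrees
        ax.text(
            patch.get_x() + patch.get_width() / 2,
            patch.get_height() / 2,
            text,
            rotation=90, ha="center", va="center", color="white",
        )
        labels_drawn.append((func, column, text))
```

Because the label position is derived from each bar's own height and width, it works whether or not the plotted values are normalized.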

how to find mean for mixed categorical variables in pandas dataframe?

I have a survey dataset about people of different ages and their use of various social media platforms. I want to calculate the average age of the people who use, and who do not use, each social media app. Here is what the example data looks like:
here is reproducible pandas dataframe:
df=pd.DataFrame({'age': np.random.randint(10,100,size=10),
'web1a': np.random.choice([1, 2], size=(10,)),
'web1b': np.random.choice([1, 2], size=(10,), p=[1./3, 2./3]),
'web1c': np.random.choice([1, 2], size=(10,)),
'web1d': np.random.choice([1, 2], size=(10,))})
here is what I tried:
df.pivot_table(df, values='web1a', index='age', aggfunc='mean')
but it is not efficient and didn't produce my desired output. Any idea to get this done? Thanks
update:
As I see it, the way to do this is to first select the categorical values in each column and compute the mean for each, then repeat that for the other columns. If I do that, how can I plot the results nicely?
Note that in columns web1a, web1b, web1c and web1d, 1 means user and 2 means non-user. I want to compute the average age of the users and non-users. How can I do that? Thanks!
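You can use melt and groupby (the level argument of mean has been removed in recent pandas versions, so groupby is used instead):
df.melt('age').groupby(['variable', 'value'])['age'].mean().unstack().plot(kind='bar')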
This can be done using groupby method:
df.groupby(['web1a', 'web1b', 'web1c', 'web1d']).mean()
You can groupby the 'web*' columns and calculate the mean on the 'age' column.
You can also plot bar charts (colors can be defined in the subplot). I'm not sure pie charts make sense in this case.
I tried with your data, taking only the columns starting with 'web'. There are more values than 1 and 2, so I assumed you only wanted to analyze the users and non-users and nothing else. You can change the values or add other values to the chart in the same way, as long as you know which values you want to draw.
import numpy as np
from matplotlib import pyplot as plt

df = df.filter(regex='web|age', axis=1)
userNr = 1      # 1 means user (use '1' if the column holds strings)
nonUserNr = 2   # 2 means non-user
users = []
nonUsers = []
labels = [x for x in df.columns if 'web' in x]
for col in labels:
    # mean age for each category of this column
    mean_age = df.groupby(col)['age'].mean()
    users.append(mean_age.loc[userNr])
    nonUsers.append(mean_age.loc[nonUserNr])
x = np.arange(1, len(labels) + 1)
ax = plt.subplot(111)
ax.bar(x - 0.1, users, width=0.2, color='g')
ax.bar(x + 0.1, nonUsers, width=0.2, color='r')
plt.xticks(x, labels)
plt.legend(['users', 'non-users'])
plt.show()
df.melt(id_vars='age').groupby(['variable', 'value']).mean()
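To make the melt and groupby approach concrete, here is a small sketch on fixed data, so the numbers can be checked by hand. The column names 'platform' and 'status' and the four sample rows are illustrative choices, not part of the question.

```python
import pandas as pd

# Small deterministic sample: 1 = user, 2 = non-user
df = pd.DataFrame({
    "age":   [20, 30, 40, 50],
    "web1a": [1, 1, 2, 2],
    "web1b": [1, 2, 1, 2],
})

# Long format: one row per (respondent, platform) pair
long = df.melt(id_vars="age", var_name="platform", value_name="status")
long["status"] = long["status"].map({1: "user", 2: "non-user"})

# Mean age per platform and status, with one column per status
mean_age = long.groupby(["platform", "status"])["age"].mean().unstack("status")
print(mean_age)
```

Calling mean_age.plot(kind='bar') on the result would then give one pair of bars per platform.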

How to find the correct condition for my matplotlib scatterplot?

I'm trying to correlate two measures (DD and DRE) from a data set that contains many more columns. I created a data frame and called it 'Data'.
From this data, I want to create a scatterplot of DD (x axis) against DRE (y axis), including only DD values between 0 and 100.
Please help me with the first line of my code, which should select the rows where DD is between 0 and 100.
Also, when I draw the scatterplot, I get dots beyond 100% (the y axis is DRE in %) although I don't have any value above 100%.
Data1= Data[ Data['DD']<100]
plt.scatter(Data1.DD,Data1.DRE)
tick_val = [0,10,20,30,40,50,60,70,80,90,100]
tick_lab = ['0%','10%','20%','30%','40%','50%','60%','70%','80%','90%','100%']
plt.yticks(tick_val,tick_lab)
plt.show()
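One possible way to express the condition is Series.between, which includes both endpoints by default. The sketch below uses made-up values for DD and DRE, and the fixed y-axis limits are an assumption about what the plot should show rather than a diagnosis of the stray dots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import PercentFormatter

# Hypothetical data standing in for the real 'Data' frame
Data = pd.DataFrame({
    "DD":  [-5, 0, 25, 100, 140],
    "DRE": [10, 35, 60, 95, 80],
})

# Keep only rows where 0 <= DD <= 100
Data1 = Data[Data["DD"].between(0, 100)]

fig, ax = plt.subplots()
ax.scatter(Data1["DD"], Data1["DRE"])
ax.set_xlabel("DD")
ax.set_ylabel("DRE")
ax.set_ylim(0, 100)                               # pin the y axis to 0..100
ax.yaxis.set_major_formatter(PercentFormatter())  # show ticks as percentages
```

Checking Data1['DRE'].max() before plotting is a quick way to confirm whether any value really exceeds 100.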

Boxplot and data outliers

I have data in dictionary form that I convert to a pandas DataFrame, and I am trying to box plot the columns that contain values outside the range 68 to 72. Ultimately I want to rotate the labels of the box plot by 90 degrees and also exclude outlier data if possible. In my real-world scenario it is impossible to read the column headers, and it is not necessary to show a box plot if only a few outliers fall outside the range 68 to 72. Any tips are greatly appreciated.
I'll make up some code that mimics my real-world application.
df = pd.DataFrame(dict(a=[71.5,72.8,79.3],b=[70.2,73.3,74.9],c=[63.1,64.9,65.9],d=[70.1,70.9,70.9]))
Flag too hot:
TooHot = df.apply(lambda x: not (x > 72).any())
print('These zones are too warm')
df[TooHot[~TooHot].index].boxplot()
plt.show()
Flag too cool:
TooCool = df.apply(lambda x: not (x < 68).any())
print('These zones are too cool')
df[TooCool[~TooCool].index].boxplot()
plt.show()
The keyword argument showfliers=False in .boxplot() stops the outliers from being displayed on the plot.
Using vert=False will make the boxplots horizontal (which I think is what you are asking for).
The documentation on matplotlib boxplots is a good place to start: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html
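As a sketch of how these options could be combined with the question's data: the boolean column selection below is a shorter equivalent of the apply/lambda approach, and rot=90 is the DataFrame.boxplot parameter for rotating the tick labels. The variable names too_hot and too_cool are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(dict(a=[71.5, 72.8, 79.3], b=[70.2, 73.3, 74.9],
                       c=[63.1, 64.9, 65.9], d=[70.1, 70.9, 70.9]))

# Columns with at least one reading above 72 or below 68
too_hot = df.columns[(df > 72).any()]
too_cool = df.columns[(df < 68).any()]

fig, (ax_hot, ax_cool) = plt.subplots(1, 2)

# rot=90 rotates the column labels; showfliers=False hides outlier points
df[too_hot].boxplot(ax=ax_hot, rot=90, showfliers=False)
ax_hot.set_title("These zones are too warm")

df[too_cool].boxplot(ax=ax_cool, rot=90, showfliers=False)
ax_cool.set_title("These zones are too cool")

fig.tight_layout()
```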

Heatmap with specific axis labels coloured

I am trying to plot a heatmap with 2 columns of data from a pandas dataframe. However, I would like to use a 3rd column to label the x axis, ideally by colour though another method such as an additional axis would be equally suitable. My dataframe is:
MUT SAMPLE VAR GROUP
True s1 1_1334442_T CC002
True s2 1_1334442_T CC006
True s1 1_1480354_GAC CC002
True s2 1_1480355_C CC006
True s2 1_1653038_C CC006
True s3 1_1730932_G CC002
...
Just to give a better idea of the data; there are 9 different types of 'GROUP', ~60,000 types of 'VAR' and 540 'SAMPLE's. I am not sure if this is the best way to build a heatmap in python but here is what I figured out so far:
pivot = pd.crosstab(df_all['VAR'],df_all['SAMPLE'])
sns.set(font_scale=0.4)
g = sns.clustermap(pivot, row_cluster=False, yticklabels=False, linewidths=0.1, cmap="YlGnBu", cbar=False)
plt.show()
I am not sure how to get 'GROUP' to display along the x-axis, either as an additional axis or just colouring the axis labels? Any help would be much appreciated.
I'm not sure if the 'MUT' column being a boolean variable is an issue here, df_all is 'TRUE' on every 'VAR' but as pivot is made, any samples which do not have a particular 'VAR' are filled as 0, others are filled with 1. My aim was to try and cluster samples with similar 'VAR' profiles. I hope this helps.
Please let me know if I can clarify anything further? Many thanks
Take a look at this example. You can pass a list or a dataframe column to the clustermap function. By specifying either the col_colors argument or the row_colors argument, you can colour the columns or the rows based on that list.
In the example below I use the iris dataset and make a pandas series object that specifies which colour the specific row should have. That pandas series is given as an argument for row_colors.
iris = sns.load_dataset("iris")
species = iris.pop("species")
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
g = sns.clustermap(iris, row_colors=row_colors,row_cluster=False)
This code results in the following image.
You may need to tweak a bit further to also include a legend for the colouring for groups.
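Applied to the question's layout, where samples are the columns of the heatmap, the same idea would use col_colors instead. The following is a sketch built from the six rows shown in the question. It assumes each SAMPLE belongs to exactly one GROUP, and the "tab10" palette is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import pandas as pd
import seaborn as sns
from matplotlib.patches import Patch

# The rows shown in the question
df_all = pd.DataFrame({
    "MUT": [True] * 6,
    "SAMPLE": ["s1", "s2", "s1", "s2", "s2", "s3"],
    "VAR": ["1_1334442_T", "1_1334442_T", "1_1480354_GAC",
            "1_1480355_C", "1_1653038_C", "1_1730932_G"],
    "GROUP": ["CC002", "CC006", "CC002", "CC006", "CC006", "CC002"],
})

pivot = pd.crosstab(df_all["VAR"], df_all["SAMPLE"])

# One GROUP per SAMPLE (assumed), then one colour per GROUP
sample_groups = df_all.drop_duplicates("SAMPLE").set_index("SAMPLE")["GROUP"]
groups = sample_groups.unique()
lut = dict(zip(groups, sns.color_palette("tab10", len(groups))))
col_colors = sample_groups.map(lut)

g = sns.clustermap(pivot, col_colors=col_colors, row_cluster=False,
                   yticklabels=False, cmap="YlGnBu")

# Legend explaining the colour strip above the heatmap
handles = [Patch(facecolor=lut[name], label=name) for name in groups]
g.ax_col_dendrogram.legend(handles=handles, title="GROUP", loc="center left")
```

Since col_colors is a Series indexed by sample name, seaborn aligns it with the heatmap columns even after clustering reorders them.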
