Boxplotting X values appearing more than once - python

So I have this data frame df:
Author | Score
A | 10
B | 4
C | 8
A | 9
B | 7
C | 6
D | 4
E | 3
I want to make a box plot with x = author and y = score, limited to authors that appear more than once. So the chart will only display authors A, B, and C. The reason I want this limit is that the actual data frame I'm working with contains a very large number of authors, and the box plot ends up extremely cluttered and impossible to read. Is there a way to do this?

You can use groupby + transform('size') to create a mask that limits your DataFrame to Authors with more than 1 row. Then boxplot this subset.
m = df.groupby('Author')['Score'].transform('size').gt(1)
df.loc[m].boxplot(by='Author', column='Score')
That method allows you to easily generalize to an arbitrary number of rows as your threshold. In this special case of more than 1 row you could also use duplicated to slice the original:
df[df.duplicated('Author', keep=False)].boxplot(by='Author', column='Score')
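As a quick sanity check, here is a minimal runnable sketch of the mask approach, using a small frame built from the question's sample data (column names Author and Score as given):

```python
import pandas as pd

# sample frame matching the question's data
df = pd.DataFrame({
    'Author': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'E'],
    'Score':  [10, 4, 8, 9, 7, 6, 4, 3],
})

# True for every row whose Author appears more than once
m = df.groupby('Author')['Score'].transform('size').gt(1)
subset = df.loc[m]
print(sorted(subset['Author'].unique()))  # ['A', 'B', 'C']
```

Authors D and E each appear once, so their rows are dropped before plotting.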

First count the rows per Author by grouping, then filter the data by that count.
import pandas as pd
import matplotlib.pyplot as plt
# add counts column
df['Counts'] = df.groupby('Author')['Score'].transform('count')
# keep only authors with more than one row
df = df[df['Counts'] > 1]
# plot
df.boxplot(by='Author', column=['Score'])
plt.show()
Output: a box plot of Score grouped by authors A, B, and C (plot image omitted).

Related

Update each row value with constant value plus previous value using pandas

I have a data frame with 4 columns; the 1st column is a counter whose values are in hexadecimal.
Data
counter frequency resistance phase
0 15000.000000 698.617126 -0.745298
1 16000.000000 647.001708 -0.269421
2 17000.000000 649.572265 -0.097540
3 18000.000000 665.282775 0.008724
4 19000.000000 690.836975 -0.011101
5 20000.000000 698.051025 -0.093241
6 21000.000000 737.854003 -0.182556
7 22000.000000 648.586792 -0.125149
8 23000.000000 643.014160 -0.172503
9 24000.000000 634.954223 -0.126519
a 25000.000000 631.901733 -0.122870
b 26000.000000 629.401123 -0.123728
c 27000.000000 629.442016 -0.156490
Expected output
| counter | sampling frequency | time |
| --------| ------------------ |---------|
| 0 | - |t0=0 |
| 1 | 1 |t1=t0+sf |
| 2 | 1 |t2=t1+sf |
| 3 | 1 |t3=t2+sf |
The time column is the new column added to the original data frame. I want to plot time in the x-axis and frequency, resistance, and phase in y-axis.
Because calculating the value of any row requires the value of the previous row, you may have to use a for loop for this problem.
For a constant frequency, you can just compute it in advance; there is no need to operate on the dataframe:
sampling_freq = 1
df['time'] = [sampling_freq * i for i in range(len(df))]
If you do need to operate on the dataframe (let's say the frequency may change at some point), you can address each cell by row number and column name as follows. The syntax would be a lot easier using numbers for both row and column, but I prefer to refer to 'time' instead of 2.
import numpy as np

df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + df.iloc[i, df.columns.get_loc('sampling frequency')]
Or, alternatively, resetting the index so you can iterate through consecutive numbers:
df['time'] = np.zeros(len(df))
df = df.reset_index()
for i in range(1, len(df)):
    df.loc[i, 'time'] = df.loc[i-1, 'time'] + df.loc[i, 'sampling frequency']
df = df.set_index('counter')
Note that, because your sampling frequency is likely constant across the whole experiment, you could simplify it like this:
sampling_freq = 1
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + sampling_freq
But it's not going to be better than just creating the time series as in the first example.
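One more option, not covered above: if the 'sampling frequency' column already holds the per-row increment, the whole loop collapses into a cumulative sum. A minimal sketch, assuming an increment of 0 at t0 (the hypothetical frame below mirrors the expected-output table):

```python
import pandas as pd

# hypothetical frame mirroring the expected-output table above
df = pd.DataFrame({'counter': ['0', '1', '2', '3'],
                   'sampling frequency': [0, 1, 1, 1]})

# time = running total of the sampling-frequency increments, no loop needed
df['time'] = df['sampling frequency'].cumsum()
print(df['time'].tolist())  # [0, 1, 2, 3]
```

This is vectorized, so it stays fast even on long experiments.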

Delete wrong values from an array with help of panda

I have a problem with my data: it is a number of displacements at different times. Unfortunately, in some cases I have 2 or 3 different displacements for a single time, of which only the higher value is acceptable.
I used pandas, and now I have all the values in my Python code, converted to a 2-column array.
I actually need to write an algorithm that finds all duplicate t values, checks their x values, and deletes the whole line whose x is lower than in the other cases.
I would really appreciate any ideas.
t x
1 2
2 3
3 4
4 5
5 5
3 3
1 1
7 5
I have written an example above. Here, for t equal to 3 and 1, I need to delete the whole line that has the lower x.
Let's say you have a pandas dataframe such as:
import pandas as pd
dictionary = {'t': [1, 2, 3, 4, 5, 3, 1, 7],
              'x': [2, 3, 4, 5, 5, 3, 1, 5]}
dataframe = pd.DataFrame(dictionary)
You can select the rows where t equals 3 or 1 with isin (note that dataframe['t'] == 3| 1 would not work: 3 | 1 is a bitwise OR that simply evaluates to 3):
dataframe.loc[dataframe['t'].isin([3, 1])].reset_index(drop = True)
Likewise, you can select the rows where t equals neither 3 nor 1:
dataframe.loc[~dataframe['t'].isin([3, 1])].reset_index(drop = True)
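To actually drop the lower-x duplicates the question asks about, one sketch (using the same dataframe) is to sort by x and keep only the last, i.e. highest, row per t:

```python
import pandas as pd

dataframe = pd.DataFrame({'t': [1, 2, 3, 4, 5, 3, 1, 7],
                          'x': [2, 3, 4, 5, 5, 3, 1, 5]})

# for each t, keep only the row with the largest x
deduped = (dataframe.sort_values('x')
                    .drop_duplicates('t', keep='last')
                    .sort_values('t')
                    .reset_index(drop=True))
print(deduped)
```

For t = 1 this keeps x = 2 (dropping x = 1), and for t = 3 it keeps x = 4 (dropping x = 3), which matches the rule in the question.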

Pandas: how to GROUPBY by number of not NaNs for each row?

I have a result from check-all-that-apply questions:
A | B | C ...
1 | NaN | 1
NaN | 1 | NaN
Where NaN means the responder did not select that option, and 1 if they selected it.
I want to group by the number of not-NaNs in each row. Specifically, I am trying to produce a chart of these grouped counts (image of the intended visualization omitted).
I tried using count():
df.count(axis=1).reset_index()
And I get the number of selected boxes per user, but I don't know what's next.
Say the dataframe is like this (I included 1 more row so that we get values of 4+):
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame({'A': [1, np.nan, 1, np.nan],
                   'B': [np.nan, 1, np.nan, np.nan],
                   'C': [np.nan, 1, 1, np.nan],
                   'D': [np.nan, np.nan, 1, np.nan]})
df.isna().sum(axis=1) gives you the number of NAs per row. But you want to bin these values, and for that you can use pd.cut:
labels = pd.cut(df.isna().sum(axis=1),[-np.inf,1,3,+np.inf],labels=['0-1','2-3','4+'])
labels
0 2-3
1 2-3
2 0-1
3 4+
And just plot this:
ax = (labels.value_counts(sort=False) / len(labels)).plot.bar()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
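Note that isna counts the blanks; if, as the question asks, you want to bin by the number of selected (not-NaN) options instead, a small sketch with the same frame would swap in notna:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 1, np.nan],
                   'B': [np.nan, 1, np.nan, np.nan],
                   'C': [np.nan, 1, 1, np.nan],
                   'D': [np.nan, np.nan, 1, np.nan]})

# number of selected (non-NaN) options per row, then binned
selected = df.notna().sum(axis=1)
labels = pd.cut(selected, [-np.inf, 1, 3, np.inf], labels=['0-1', '2-3', '4+'])
print(labels.tolist())  # ['0-1', '2-3', '2-3', '0-1']
```

The value_counts/plot lines from the answer then work unchanged on these labels.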

How do I create a heatmap from two columns plus the value of those two in python

and thank you for helping!
I would like to generate a heatmap in python, from the data df.
(i am using pandas, seaborn, numpy, and matplotlib in my project)
The dataframe df looks like:
index | a | b | c | year | month
0 | | | | 2013 | 1
1 | | | | 2015 | 4
2 | | | | 2016 | 10
3 | | | | 2017 | 1
In the dataset, each row is a ticket.
The dataset is big (51 columns and 100k+ rows), so a, b, c are just some random placeholder columns. (For month: 1 = Jan, 2 = Feb, ...)
For the heatmap:
x-axis = year,
y-axis = month,
value = the count of rows, i.e. how many tickets were given in that year and month.
The result I imagine should look something like this example from the seaborn documentation:
https://seaborn.pydata.org/_images/seaborn-heatmap-4.png
I am new to coding; I tried a lot of random things I found on the internet and have not been able to make it work.
Thank you for helping!
This should do (with generated data):
import pandas as pd
import seaborn as sns
import random
y = [random.randint(2013,2017) for n in range(2000)]
m = [random.randint(1,12) for n in range(2000)]
df = pd.DataFrame([y,m]).T
df.columns=['y','m']
df['count'] = 1
df2 = df.groupby(['y','m'], as_index=False).count()
df_p = pd.pivot_table(df2, values='count', index='m', columns='y')
sns.heatmap(df_p)
You probably won't need the count column itself, but I added it because the groupby needs an extra column to count.
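A shorter route, if it helps, is pd.crosstab, which counts rows per (month, year) pair directly, so no helper column or groupby is needed (sketched here with the same generated data, plus a fixed seed so it is reproducible):

```python
import pandas as pd
import random

random.seed(0)  # fixed seed so the sketch is reproducible
y = [random.randint(2013, 2017) for n in range(2000)]
m = [random.randint(1, 12) for n in range(2000)]
df = pd.DataFrame({'y': y, 'm': m})

# rows = month, columns = year, cell = number of tickets
table = pd.crosstab(df['m'], df['y'])
print(table.shape)  # (12, 5)
# sns.heatmap(table) would then plot it directly
```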

How to plot 5 largest values obtained by dividing one column by another with pandas?

I have a dataset that looks like this:
variety|points|price
a | 80 | 5
b | 85 | 6
b | 70 | 8
and so on.
I would like to create a barplot using seaborn that has variety on the x-axis and points/price ratio on the y-axis. I have about 150k rows, so I only want to display the 5 best points/price ratios.
This was my idea using another column called result:
df["Result"] = df["points"]/df["price"]
ax = sns.barplot(x="variety", data=df, order=df["Result"].iloc[:5].index)
which does not work.
I will be glad for any advice.
You could filter out the 5 largest values using nlargest:
largest_five = df.nlargest(5, 'Result')
Then plot it:
ax = sns.barplot(x="variety", y='Result', data=largest_five)
Let me know if this works.
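For reference, here is the whole pipeline as a runnable sketch on a tiny made-up frame (the variety/points/price values below are invented to mirror the question's columns):

```python
import pandas as pd

# tiny invented sample mirroring the question's columns
df = pd.DataFrame({'variety': ['a', 'b', 'b', 'c', 'd', 'e', 'f'],
                   'points':  [80, 85, 70, 90, 60, 88, 75],
                   'price':   [5, 6, 8, 3, 10, 4, 5]})

df['Result'] = df['points'] / df['price']
largest_five = df.nlargest(5, 'Result')
print(largest_five['variety'].tolist())  # ['c', 'e', 'a', 'f', 'b']
```

On real data with 150k rows, nlargest avoids sorting the full frame just to take the top 5.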
