I have a DataFrame like this:
Apples Oranges
0 1 1
1 2 1
2 3 2
3 2 3
4 1 2
5 2 3
I'm trying to count the occurrences of the values in both Apples and Oranges (how often the values 1, 2 and 3 occur in the DataFrame for each fruit). I want to draw a bar chart using Matplotlib, but so far I have not been successful. I have tried:
plt.bar(2,['Apples', 'Oranges'], data=df)
plt.show()
But the output is very weird. Could I have some advice? Thanks in advance.
Edit: I'm expecting a result like this:
You can use the value_counts method together with pandas plotting:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
d = {'apples': [1, 2, 3, 2, 1, 2], 'oranges': [1, 1, 2, 3, 2, 3]}
df = pd.DataFrame(data=d)
# Calculate the frequencies of each value
df2 = pd.DataFrame({
    "apples": df["apples"].value_counts(),
    "oranges": df["oranges"].value_counts()
})
# Plot
df2.plot.bar()
plt.show()
You will get:
Here is another option:
import pandas as pd
df = pd.DataFrame({'Apples': [1, 2, 3, 2, 1, 2], 'Oranges': [1, 1, 2, 3, 2, 3]})
df.apply(pd.Series.value_counts).plot.bar()
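Since the question asks for Matplotlib specifically, the same counts can also be drawn with plt.bar directly. The following is a minimal sketch, not the method of any answer here: the helper name plot_value_counts and the bar width of 0.4 are my own assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"Apples": [1, 2, 3, 2, 1, 2], "Oranges": [1, 1, 2, 3, 2, 3]})


def plot_value_counts(frame, width=0.4):
    """Draw one group of bars per distinct value, one bar per column (hypothetical helper)."""
    # Count each value per column; values missing from a column become 0
    counts = frame.apply(pd.Series.value_counts).fillna(0).astype(int).sort_index()
    positions = np.arange(len(counts.index))
    fig, ax = plt.subplots()
    for offset, column in enumerate(counts.columns):
        # Shift each column's bars sideways so they sit next to each other
        ax.bar(positions + offset * width, counts[column], width=width, label=column)
    # Centre the tick labels under each group of bars
    ax.set_xticks(positions + width * (len(counts.columns) - 1) / 2)
    ax.set_xticklabels(counts.index)
    ax.set_ylabel("count")
    ax.legend()
    return counts, ax


counts, ax = plot_value_counts(df)
```

Call plt.show() afterwards to display the figure. The returned counts frame is indexed by the distinct values (1, 2, 3), so it can be inspected before plotting.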
You can use hist from matplotlib:
import matplotlib.pyplot as plt
import pandas as pd
d = {'apples': [1, 2, 3, 2, 1, 2], 'oranges': [1, 1, 2, 3, 2, 3]}
df = pd.DataFrame(data=d)
plt.hist([df.apples, df.oranges], label=['apples', 'oranges'])
plt.legend()
plt.show()
This will give the output:
Whenever you are plotting the count or frequency of something, you should look into a histogram:
from matplotlib import pyplot as plt
import pandas as pd
df = pd.DataFrame({'Apples': {0: 1, 1: 2, 2: 3, 3: 2, 4: 1, 5: 2}, 'Oranges': {0: 1, 1: 1, 2: 2, 3: 3, 4: 2, 5: 3}})
plt.hist([df.Apples,df.Oranges], bins=3,range=(0.5,3.5),label=['Apples', 'Oranges'])
plt.xticks([1,2,3])
plt.yticks([1,2,3])
plt.legend()
plt.show()
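To see why bins=3 with range=(0.5, 3.5) suits integer data, it can help to compute the same bins without plotting. This is a small sketch using np.histogram purely as a check (plt.hist relies on the same binning): each bin is one unit wide and centred on 1, 2 and 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Apples": [1, 2, 3, 2, 1, 2], "Oranges": [1, 1, 2, 3, 2, 3]})

# Three bins over (0.5, 3.5) give edges 0.5, 1.5, 2.5, 3.5,
# so each integer 1, 2, 3 falls in the middle of its own bin
apple_counts, edges = np.histogram(df["Apples"], bins=3, range=(0.5, 3.5))
orange_counts, _ = np.histogram(df["Oranges"], bins=3, range=(0.5, 3.5))

print(edges)          # bin edges
print(apple_counts)   # counts for the values 1, 2, 3
print(orange_counts)
```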
Related
I have a pandas dataframe which looks like this:
car,id
1,1
1,2
2,3
2,4
2,5
and so on
What I want to do is make a line plot in seaborn that shows how many ids there are in each car (I don't care which ids are in the car). So on the x-axis I want the unique car values (here [1, 2]) and on the y-axis the number of times each car is repeated (here [2, 3]). I would like to use seaborn to plot.
What I have tried now is:
import seaborn as sns
#the df is the one above
sns.lineplot(x='car', y='car'.count(), data=df) #which is not working for obvious reasons
Any tips to do this?
If you specifically need a lineplot then this would work:
import pandas as pd
import seaborn as sns
data = {"car": [1, 1, 2, 2, 2], "id": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
sns.lineplot(x="car", y="id", data=df.groupby('car').nunique())
Or you could use value_counts() too:
car_count = df['car'].value_counts()
sns.lineplot(x=car_count.index, y=car_count.values)
import pandas as pd
MyDic = {"car": [1, 1, 2, 2, 2, 3], "id": [1, 2, 3, 4, 5, 6]}
MyDf = pd.DataFrame(MyDic)
print(MyDf)
>> car id
>> 0 1 1
>> 1 1 2
>> 2 2 3
>> 3 2 4
>> 4 2 5
>> 5 3 6
from collections import Counter
carCounter = Counter(MyDf["car"])
x, y = list(carCounter.keys()), list(carCounter.values())
print(f"{x=}, {y=}")
>>x=[1, 2, 3], y=[2, 3, 1]
A line plot in seaborn needs data for both axes. The code below will run fine:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
cars = {"id": [1, 2, 3, 4, 5, 6], "car": [1, 1, 2, 2, 2, 3]}
dataset = pd.DataFrame(cars)
car_counts = dataset["car"].value_counts()
car_counts.sort_index(inplace=True)
sns.lineplot(x=car_counts.index, y=car_counts)
plt.show()
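All of these approaches boil down to building one row per car with its count before plotting. Here is a sketch of that preparation step in pandas only; the column name n_ids is my own choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({"car": [1, 1, 2, 2, 2], "id": [1, 2, 3, 4, 5]})

# One row per car with the number of distinct ids; n_ids is a made-up column name
car_counts = (
    df.groupby("car")["id"]
    .nunique()
    .reset_index(name="n_ids")
)
print(car_counts)
```

The resulting frame has ordinary columns, so it can be handed to seaborn as sns.lineplot(data=car_counts, x="car", y="n_ids").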
I'm trying to change the size of only SOME of the markers in a seaborn pairplot.
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
Prettier:
        num_legs  num_wings  num_specimen_seen  class
falcon         2          2                 10      1
dog            4          0                  2      2
spider         8          0                  1      3
fish           0          0                  8      4
I want to, for example, increase the size of all samples with class=4.
How could this be done with the seaborn pairplot?
What I have so far:
sns.pairplot(data=df,diag_kind='hist',hue='class')
I have tried adding plot_kws={"s": 3}, but that changes the size of all the dots. Cheers!
After checking out how the pairplot is built up, one could iterate through the axes and change the size of the fourth set of scatter dots in each subplot:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
N = 100
classes = np.random.randint(1, 5, N)
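df = pd.DataFrame({'num_legs': 2 * classes % 8,
                   'num_wings': (classes == 1) * 2,
                   'num_specimen_seen': np.random.randint(1, 20, N),
                   'class': classes})
g = sns.pairplot(data=df, diag_kind='hist', hue='class')
# assumes each scatter subplot holds one collection per hue level (4 classes here)
for ax in np.ravel(g.axes):
    if len(ax.collections) == 4:
        ax.collections[3].set_sizes([100])  # 4th collection = class 4
# enlarge the matching legend marker as well
# (newer Matplotlib releases name this attribute legend_handles)
g.fig.legends[0].legendHandles[3].set_sizes([100])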
plt.show()
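The loop works because each set of scatter dots is a Matplotlib PathCollection, whose set_sizes method changes the marker areas after the plot has been drawn. Here is a minimal sketch of that mechanism with plain Matplotlib (no seaborn), assuming one ax.scatter call per class so that the fourth collection is class 4:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots()

# One scatter call per class, so ax.collections holds one PathCollection per class
for cls in range(1, 5):
    ax.scatter(rng.random(10), rng.random(10), s=20, label=f"class {cls}")

# Enlarge only the markers of the fourth collection (class 4)
ax.collections[3].set_sizes([100])
ax.legend()
```

If a plot ends up with a different number of collections per subplot, the index 3 no longer maps to class 4, so it is worth checking len(ax.collections) first, as the answer does.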
I've created a bar chart as described here where I have multiple variables (indicated in the 'value' column) and they belong to repeat groups. I've colored the bars by their group membership.
I want to create a legend ultimately equivalent to the colors dictionary, showing the color corresponding to a given group membership.
Code here:
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
plt.legend(df['group'])
In this way, I just get a legend with one color (1) instead of (1, 2, 3).
Thanks!
You can use seaborn (imported as sns):
sns.barplot(data=df, x=df.index, y='value',
            hue='group', palette=colors, dodge=False)
Output:
With pandas, you could create your own legend as follows:
from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f'group {i}' for i in colors]
plt.legend(handles, labels)
plt.show()
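A small variation, sketched under the assumption that you only want legend entries for groups that actually occur in the data: attach the label to each Patch and pass the handles by keyword. The extra group 4 in the colors dictionary is added here only to show that it gets skipped.

```python
from matplotlib import patches as mpatches
from matplotlib import pyplot as plt
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 4, 5, 7, 10], "group": [1, 2, 3, 2, 2, 3]})
colors = {1: "r", 2: "b", 3: "g", 4: "k"}  # group 4 is extra, to show it is skipped

ax = df["value"].plot(kind="bar", color=[colors[g] for g in df["group"]])

# Build one labelled handle per group that is present in the data
present = sorted(df["group"].unique())
handles = [mpatches.Patch(color=colors[g], label=f"group {g}") for g in present]
legend = ax.legend(handles=handles)
```

Call plt.show() afterwards to display the chart.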
I'm writing a small script to plot material data using plotly, I tried to use a dropdown to select which column of my dataframe to display. I can do this by defining the columns one by one but the dataframe will change in size so I wanted to make it flexible.
I've tried a few things and got to this;
for i in df.columns:
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df[i]))
    fig.update_layout(
        updatemenus=[
            go.layout.Updatemenu(
                buttons=list([
                    dict(
                        args=["values", i],
                        label=i,
                        method="update"
                    ),
                ]),
                direction="down",
                pad={"r": 10, "t": 10},
                showactive=True,
                x=0.1,
                xanchor="left",
                y=1.1,
                yanchor="top"
            ),
        ]
    )
    fig.update_layout(
        annotations=[
            go.layout.Annotation(text="Material:", showarrow=False,
                                 x=0, y=1.085, yref="paper", align="left")
        ]
    )
fig.show()
The chart shows only the final column of the df, and the dropdown holds only the last column header.
My data looks something like this, where i is the index, the chart and dropdown show column 'G';
i A B C D E F G
1 8 2 4 5 4 9 7
2 5 3 7 7 6 7 3
3 7 4 9 3 7 4 6
4 3 9 3 6 3 3 4
5 1 7 6 9 9 1 9
The following suggestion should be what you're asking for. It doesn't go far beyond the built-in functionality, though, since you can already subset the figure by clicking the legend. Anyway, let me know if this is something you can use or if it needs some adjustments.
Plot 1 - State of plot on first execution:
Click the dropdown in the upper left corner and select D, for example, to get:
Plot 2 - State of plot after selecting D from dropdown:
Code:
# imports
import plotly.graph_objs as go
import pandas as pd
# data sample
df = pd.DataFrame({'i': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
                   'A': {0: 8, 1: 5, 2: 7, 3: 3, 4: 1},
                   'B': {0: 2, 1: 3, 2: 4, 3: 9, 4: 7},
                   'C': {0: 4, 1: 7, 2: 9, 3: 3, 4: 6},
                   'D': {0: 5, 1: 7, 2: 3, 3: 6, 4: 9},
                   'E': {0: 4, 1: 6, 2: 7, 3: 3, 4: 9},
                   'F': {0: 9, 1: 7, 2: 4, 3: 3, 4: 1},
                   'G': {0: 7, 1: 3, 2: 6, 3: 4, 4: 9}})
df=df.set_index('i')
# plotly figure setup
fig = go.Figure()
# one trace for each df column
for col in df.columns:
    fig.add_trace(go.Scatter(x=df.index, y=df[col].values,
                             name=col,
                             mode='lines')
                  )
# one button for each df column
updatemenu = []
buttons = []
for col in df.columns:
    buttons.append(dict(method='restyle',
                        label=col,
                        args=[{'y': [df[col].values]}])
                   )
# some adjustments to the updatemenu
updatemenu=[]
your_menu=dict()
updatemenu.append(your_menu)
updatemenu[0]['buttons']=buttons
updatemenu[0]['direction']='down'
updatemenu[0]['showactive']=True
# update layout and show figure
fig.update_layout(updatemenus=updatemenu)
fig.show()
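Instead of replacing the y data, another option is to keep all traces and toggle their visible flag. The sketch below builds the buttons as plain dictionaries so it does not depend on the figure itself; the helper name visibility_buttons and the shortened three-column sample are my own, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": [8, 5, 7, 3, 1], "B": [2, 3, 4, 9, 7], "C": [4, 7, 9, 3, 6]},
                  index=pd.Index([1, 2, 3, 4, 5], name="i"))


def visibility_buttons(columns):
    """Return one 'restyle' button per column that shows only that column's trace."""
    columns = list(columns)
    buttons = []
    for col in columns:
        # One boolean per trace, in the order the traces were added to the figure
        visible = [other == col for other in columns]
        buttons.append(dict(method="restyle", label=col, args=[{"visible": visible}]))
    return buttons


updatemenu = [dict(buttons=visibility_buttons(df.columns), direction="down", showactive=True)]
```

Passing this to fig.update_layout(updatemenus=updatemenu) assumes the figure has exactly one trace per column, added in the same order as df.columns, as in the loop above.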
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples for some reason.
I found the example below for randomly selecting the elements of a single-key groupby, but it does not work with a multi-key groupby. From "How to access pandas groupby dataframe by key":
# create groupby object
grouped = df.groupby('some_key')
# pick N dataframes and grab their indices
# (list(...) because random.sample needs a sequence in Python 3)
sampled_df_i = random.sample(list(grouped.indices), N)
# grab the groups using the groupby object 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values from df.some_key.unique(), use it to slice the df, and finally group the result:
In [337]:
import random
import pandas as pd
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
                   'val':      [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
val
some_key
0 1.000000
2 3.666667
If there is more than one groupby key:
In [358]:
df = pd.DataFrame({'some_key1': [0,1,2,3,0,1,2,3,0,1,2,3],
                   'some_key2': [0,0,0,0,1,1,1,1,2,2,2,2],
                   'val':       [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print(gby.mean().loc[random.sample(list(gby.indices.keys()), 2)])
val
some_key1 some_key2
1 1 5
3 2 8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
In [372]:
# sorted(...) turns the set into a list, which random.sample requires in recent Python versions
idx = random.sample(sorted(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())),
                    2)
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
val
some_key1 some_key2
2 0 3
3 1 5
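For the original multi-key case, the approach from the question can also be adapted by sampling from the group keys themselves. This is a sketch in Python 3, assuming you want N whole groups concatenated back into one DataFrame. The some_key2 values are changed from the example above so that each group has several rows, and the seed is there only to make the run repeatable.

```python
import random

import pandas as pd

# some_key2 differs from the example above so each (key1, key2) group has 3 rows
df = pd.DataFrame({"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   "some_key2": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
                   "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})

N = 2
grouped = df.groupby(["some_key1", "some_key2"])

# grouped.groups maps each key tuple to its row labels; sample N of the keys
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sampled_keys = rng.sample(list(grouped.groups), N)

# Collect the sampled groups and stitch them back together
sampled_df = pd.concat([grouped.get_group(key) for key in sampled_keys])
print(sampled_keys)
print(sampled_df)
```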
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
        "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
    }
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8
Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
    {
        "some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
        "some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
        "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
    }
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
    .groupby(by=["some_key1", "some_key2"]) \
    .sample(n_samples_by_group)
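Note that the groupby sample method (available in pandas 1.1 and later) draws rows within every group rather than picking a subset of groups. Here is a short sketch of that behaviour on a single key, with random_state fixed only for repeatability:

```python
import pandas as pd

df = pd.DataFrame({"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})

# Two random rows from each of the four groups: every key is still present
rows_per_group = df.groupby("some_key").sample(n=2, random_state=0)
per_key = rows_per_group["some_key"].value_counts().sort_index()
print(per_key)
```

So this fits when you want a few rows from each group; to keep only some of the groups, sample the keys first as in the earlier answers.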