I have a pandas dataframe which looks like this:
car,id
1,1
1,2
2,3
2,4
2,5
and so on
What I want to do is make a line plot in seaborn that shows how many ids there are in each car (I don't care which ids are in the car). So on the x-axis I want the unique car values (here [1, 2]) and on the y-axis the number of times each car is repeated (here [2, 3]). I would like to use seaborn for the plot.
What I have tried now is:
import seaborn as sns
#the df is the one above
sns.lineplot(x='car', y='car'.count(), data=df) #which is not working for obvious reasons
Any tips to do this?
If you specifically need a lineplot then this would work:
import pandas as pd
import seaborn as sns
data = {"car": [1, 1, 2, 2, 2], "id": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
sns.lineplot(x="car", y="id", data=df.groupby('car').nunique())
Or you could use value_counts() too:
car_count = df['car'].value_counts()
sns.lineplot(x=car_count.index, y=car_count.values)
import pandas as pd
MyDic = {"car": [1, 1, 2, 2, 2, 3], "id": [1, 2, 3, 4, 5, 6]}
MyDf = pd.DataFrame(MyDic)
print(MyDf)
>> car id
>> 0 1 1
>> 1 1 2
>> 2 2 3
>> 3 2 4
>> 4 2 5
>> 5 3 6
from collections import Counter
carCounter = Counter(MyDf["car"])
x, y = list(carCounter.keys()), list(carCounter.values())
print(f"{x=}, {y=}")
>>x=[1, 2, 3], y=[2, 3, 1]
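These x and y lists can then be handed straight to seaborn, for example (a minimal sketch continuing the snippet above):
import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(x=x, y=y)  # x: unique car values, y: how many rows each car has
plt.show()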
A line plot in seaborn needs both axes. The code below will run fine.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
cars = {"id": [1, 2, 3, 4, 5, 6], "car": [1, 1, 2, 2, 2, 3]}
dataset = pd.DataFrame(cars)
car_counts = dataset["car"].value_counts()
car_counts.sort_index(inplace=True)
sns.lineplot(x=car_counts.index, y=car_counts)
plt.show()
I have the following code
import plotly.express as px
import pandas as pd
import numpy as np
df = pd.DataFrame([1, None, None, 4, 6, None], columns=["y"])
df["x"] = [1, 2, 3, 4, 5, 6]
df["completed"] = [1, 0, 0, 1, 1, 0]
fig = px.line(df, x="x", y="y", markers=True, color="completed")
fig.show()
That results in the following plot
But I have to highlight the cases where the dataframe has a NaN value (change the line color to red and add a dot marker), like in the following plot
Is there any way to do that easily? I have been looking for it but I'm not able to find a suitable solution.
Thanks in advance!
Found a solution using this https://community.plotly.com/t/change-color-of-continuous-line-based-on-value/68938/2
import plotly.express as px
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import itertools as it
df = pd.DataFrame([1, None, None, 4, 6, None, 2, 1], columns=["y"])
df["x"] = [1, 2, 3, 4, 5, 6, 7, 8]
df["completed"] = [1, 0, 0, 1, 1, 0, 1, 1]
fig = go.Figure()
# generate color list
df.loc[df["y"].isna(), "line_color"] = "red"
df.loc[df["y"].isna(), "line_type"] = "dash"
df.loc[df["y"].notna(), "line_color"] = "black"
df.loc[df["y"].notna(), "line_type"] = "solid"
df["y"] = df["y"].interpolate(method="index")
# create coordinate pairs
x_pairs = it.pairwise(df["x"])
y_pairs = it.pairwise(df["y"])
for x, y, line_color, line_type in zip(
    x_pairs,
    y_pairs,
    df["line_color"].values,
    df["line_type"].values,
):
    # create trace
    fig.add_trace(
        go.Scatter(
            x=x,
            y=y,
            mode="lines",
            line=dict(color=line_color, dash=line_type),
        )
    )
fig.show()
This is the new output for the plot.
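One caveat: itertools.pairwise is only available on Python 3.10+. On older versions you could define the same helper yourself (a minimal sketch based on the standard itertools tee recipe) and use it in place of it.pairwise above:
import itertools as it

def pairwise(iterable):
    # pairwise('ABCD') -> ('A','B'), ('B','C'), ('C','D')
    a, b = it.tee(iterable)
    next(b, None)
    return zip(a, b)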
I want to create 3 subplots, with each subplot placed at the coordinates given by the for-loop parameter add_plot. The format of add_plot is (nrows, ncols, index). But I get an error when I try to implement it. How can I modify the contents of the for loop within Graphing() to achieve this?
Error:
ValueError: Single argument to subplot must be a three-digit integer, not (2, 2, 1)
Code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.DataFrame({'col1': [4, 5, 2, 2, 3, 5, 1, 1, 6], 'col2': [6, 2, 1, 7, 3, 5, 3, 3, 9],
'label':['Old','Old','Old','Old','Old','Old','Old','Old','Old'],
'date': ['2022-01-24 10:07:02', '2022-01-27 01:55:03', '2022-01-30 19:09:03', '2022-02-02 14:34:06',
'2022-02-08 12:37:03', '2022-02-10 03:07:02', '2022-02-10 14:02:03', '2022-02-11 00:32:25',
'2022-02-12 21:42:03']})
def Graphing():
    # Size of the figure
    fig = plt.figure(figsize=(12, 7))
    # Creating the dataframe
    df = pd.DataFrame({
        'date': datetime,
        'col1': data['col1']
    })
    for subplot_, add_plot in zip(
            ['sub1', 'sub2', 'sub3'],
            [(2, 2, 1), (2, 2, 1), (2, 1, 2)]):
        subplot_ = fig.add_subplot(add_plot)
    # Show Graph
    plt.show()
Graphing()
This still raises NameError: name 'datetime' is not defined, since you never define datetime in your code, but it shouldn't raise any error for the subplots anymore: add_subplot expects nrows, ncols, index as separate arguments, so the (2, 2, 1) tuple has to be unpacked with *.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.DataFrame({'col1': [4, 5, 2, 2, 3, 5, 1, 1, 6], 'col2': [6, 2, 1, 7, 3, 5, 3, 3, 9],
'label':['Old','Old','Old','Old','Old','Old','Old','Old','Old'],
'date': ['2022-01-24 10:07:02', '2022-01-27 01:55:03', '2022-01-30 19:09:03', '2022-02-02 14:34:06',
'2022-02-08 12:37:03', '2022-02-10 03:07:02', '2022-02-10 14:02:03', '2022-02-11 00:32:25',
'2022-02-12 21:42:03']})
def Graphing():
    # Size of the figure
    fig = plt.figure(figsize=(12, 7))
    # Creating the dataframe
    df = pd.DataFrame({
        'date': datetime,
        'col1': data['col1']
    })
    for subplot_, add_plot in zip(
            ['sub1', 'sub2', 'sub3'],
            [(2, 2, 1), (2, 2, 1), (2, 1, 2)]):
        # unpack (nrows, ncols, index) into separate arguments
        subplot_ = fig.add_subplot(*add_plot)
    # Show Graph
    plt.show()
Graphing()
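For reference, add_subplot accepts either three separate integers or a single three-digit integer, but not a tuple, which is why the unpacking is needed. A quick standalone illustration:
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)  # nrows, ncols, index as three arguments
ax2 = fig.add_subplot(222)      # three-digit shorthand for (2, 2, 2)
ax3 = fig.add_subplot(2, 1, 2)  # bottom half of a 2x1 grid
plt.show()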
I have a DataFrame like this:
Apples Oranges
0 1 1
1 2 1
2 3 2
3 2 3
4 1 2
5 2 3
I'm trying to count the occurrence of values for both Apples and Oranges (how often the values 1, 2 and 3 occur in the data frame for each fruit). I want to draw a bar chart using Matplotlib, but so far I have not been successful. I have tried:
plt.bar(2,['Apples', 'Oranges'], data=df)
plt.show()
But the output is very weird. Could I have some advice? Thanks in advance.
Edit: I'm expecting a result like this:
You can use the value_counts method together with pandas plotting:
import pandas as pd

# Sample data
d = {'apples': [1, 2, 3, 2, 1, 2], 'oranges': [1, 1, 2, 3, 2, 3]}
df = pd.DataFrame(data=d)
# Calculate the frequencies of each value
df2 = pd.DataFrame({
"apples": df["apples"].value_counts(),
"oranges": df["oranges"].value_counts()
})
# Plot
df2.plot.bar()
You will get:
Here is another option:
import pandas as pd
df = pd.DataFrame({'Apples': [1, 2, 3, 2, 1, 2], 'Oranges': [1, 1, 2, 3, 2, 3]})
df.apply(pd.Series.value_counts).plot.bar()
You can use hist from matplotlib:
import pandas as pd
import matplotlib.pyplot as plt

d = {'apples': [1, 2, 3, 2, 1, 2], 'oranges': [1, 1, 2, 3, 2, 3]}
df = pd.DataFrame(data=d)
plt.hist([df.apples, df.oranges], label=['apples', 'oranges'])
plt.legend()
plt.show()
This will give the output:
Whenever you are plotting the count or frequency of something, you should look into a histogram:
from matplotlib import pyplot as plt
import pandas as pd
df = pd.DataFrame({'Apples': {0: 1, 1: 2, 2: 3, 3: 2, 4: 1, 5: 2}, 'Oranges': {0: 1, 1: 1, 2: 2, 3: 3, 4: 2, 5: 3}})
plt.hist([df.Apples,df.Oranges], bins=3,range=(0.5,3.5),label=['Apples', 'Oranges'])
plt.xticks([1,2,3])
plt.yticks([1,2,3])
plt.legend()
plt.show()
I'm trying to change the size of only SOME of the markers in a seaborn pairplot.
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8],
                   'class': [1, 2, 3, 4]},
                  index=['falcon', 'dog', 'spider', 'fish'])
Prettier:
num_legs num_wings num_specimen_seen class
falcon 2 2 10 1
dog 4 0 2 2
spider 8 0 1 3
fish 0 0 8 4
I want to for example increase the size of all samples with class=4.
How could this be done with the seaborn pairplot?
What I have so far:
sns.pairplot(data=df,diag_kind='hist',hue='class')
I have tried adding plot_kws={"s": 3}, but that changes the size of all the dots. Cheers!
After checking how the pairplot is built up, one could iterate through the axes and change the size of the 4th set of scatter dots (the class 4 points) in each subplot:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
N = 100
classes = np.random.randint(1, 5, N)
df = pd.DataFrame({'num_legs': 2 * classes % 8,
'num_wings': (classes == 1) * 2,
'num_specimen_seen': np.random.randint(1,20,N),
'class': classes})
g = sns.pairplot(data=df,diag_kind='hist',hue='class')
for ax in np.ravel(g.axes):
    # off-diagonal axes hold one PathCollection per hue level; index 3 is class 4
    if len(ax.collections) == 4:
        ax.collections[3].set_sizes([100])
# enlarge the matching handle in the figure legend as well
# (legendHandles is renamed legend_handles in newer Matplotlib)
g.fig.legends[0].legendHandles[3].set_sizes([100])
plt.show()
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements, is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples for some reason.
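For illustration, here is a minimal example of what I mean (the column names are just placeholders):
import random
import pandas as pd

df = pd.DataFrame({"some_key": [0, 1, 0, 1], "val": [1, 2, 3, 4]})
groups = list(df.groupby("some_key"))
sampled = random.sample(groups, 1)
print(sampled[0])  # a (key, sub-DataFrame) tuple, not a plain DataFrame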
I found the below example for randomly selecting the elements of a single key groupby, however this does not work with a multi-key groupby. From, How to access pandas groupby dataframe by key
# create groupby object
grouped = df.groupby('some_key')
# pick N dataframes and grab their indices
sampled_df_i = random.sample(grouped.indices, N)
# grab the groups using the groupby object's 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values from df.some_key.unique(), use that to slice the df, and finally groupby on the result:
In [337]:
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
val
some_key
0 1.000000
2 3.666667
If there are more than one groupby keys:
In [358]:
df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print(gby.mean().loc[random.sample(list(gby.indices.keys()), 2)])
val
some_key1 some_key2
1 1 5
3 2 8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
In [372]:
idx = random.sample(sorted(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())),
                    2)
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
val
some_key1 some_key2
2 0 3
3 1 5
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8
Although this question was asked and answered long ago, I think the following is cleaner (note that DataFrameGroupBy.sample was added in pandas 1.1, and it draws a number of rows from within each group rather than picking whole groups at random):
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
    .groupby(by=["some_key1", "some_key2"]) \
    .sample(n_samples_by_group)
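If what's wanted is instead to sample whole groups at random, as in the earlier answers, a minimal sketch along the same lines (reusing the df defined just above):
import random

# list the (some_key1, some_key2) group keys
group_keys = list(df.groupby(["some_key1", "some_key2"]).groups)
# pick 2 keys at random
chosen = random.sample(group_keys, 2)
# keep only the rows belonging to the chosen groups
sampled_groups = df.set_index(["some_key1", "some_key2"]).loc[chosen]
print(sampled_groups)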