I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph with the ids (all unique) on the x-axis and count on the y-axis, drawn as a distribution chart that looks like the following:
How can I achieve this with a pandas DataFrame in Python?
I tried different approaches, but I only get a basic plot, nothing as detailed as the drawing.
What I tried
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments, here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
Output:
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
             color='darkblue', edgecolor='black',
             kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Output:
Personal note: please check all the solutions; if they are not enough, leave a comment and we will work together to find one that fits.
For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
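A histogram is another way to look at the distribution of count. Binning the values first keeps the number of drawn bars fixed regardless of how many rows there are. This is only a minimal sketch: the data is synthetic (a stand-in for the real table), the bin count of 100 is an arbitrary choice, and the Agg backend line is only there so the snippet runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data (assumption): one row per unique id
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "ids": np.arange(1, 1_000_001),
    "count": rng.normal(size=1_000_000).cumsum(),
})

# Bin the count values once; 100 bins is an arbitrary choice
counts_per_bin, bin_edges = np.histogram(df["count"], bins=100)

# Draw one bar per bin instead of one point per id
fig, ax = plt.subplots()
ax.bar(bin_edges[:-1], counts_per_bin, width=np.diff(bin_edges), align="edge")
ax.set_xlabel("count")
ax.set_ylabel("number of ids")
fig.savefig("count_distribution.png")
```

Here the y-axis shows how many ids fall into each range of count, which may or may not match the drawing in the question.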
I am attempting to create a histogram using seaborn and census data that displays 3 subplots for age composition, and I have the data grouped the way that I would like it, but I am struggling to turn that into a histogram.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
filename = "/scratch/%s_class_root/%s_class/materials/data/pums_short.csv.gz"
acs = pd.read_csv(filename)
R65_agg = acs.groupby(["R65", "PUMA"])["HINCP"]
R65_meds = R65_agg.agg(np.median).unstack()
R65_f = R65_meds.dropna()
R65_f = R65_meds.reset_index(drop = True)
I was expecting this code to give me data that I could plug into a histogram, but instead of forming distinct subplots, the "0.0, 1.0, 2.0" groups in the final variable just get added together when I apply the .describe() function. Any advice on how I can convert this into a form that sns.histplot() can read?
I need help plotting some categorical and numerical values in Python. The code is given below:
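One possible approach is to keep the medians in long form (one row per R65/PUMA pair) instead of unstacking them, so the R65 group stays available as a column. The sketch below assumes a synthetic stand-in for the census file, since the real path is not available here. The column names come from the question, but the value ranges are made up. It uses plain matplotlib for the three subplots.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for the ACS extract (values are invented)
rng = np.random.default_rng(0)
acs = pd.DataFrame({
    "R65": rng.choice([0.0, 1.0, 2.0], size=3000),
    "PUMA": rng.integers(100, 160, size=3000),
    "HINCP": rng.lognormal(mean=11, sigma=0.6, size=3000),
})

# One median per (R65, PUMA) pair, kept in long form with R65 as a column
meds = acs.groupby(["R65", "PUMA"])["HINCP"].median().reset_index()

# One histogram per R65 group, side by side
groups = sorted(meds["R65"].unique())
fig, axes = plt.subplots(1, len(groups), figsize=(12, 4), sharex=True, sharey=True)
for ax, r65 in zip(axes, groups):
    ax.hist(meds.loc[meds["R65"] == r65, "HINCP"], bins=20)
    ax.set_title(f"R65 = {r65}")
    ax.set_xlabel("Median HINCP")
axes[0].set_ylabel("Number of PUMAs")
fig.tight_layout()
```

The same long-form frame should also work with seaborn, for example sns.histplot(data=meds, x="HINCP", hue="R65") for overlaid histograms, or sns.displot(data=meds, x="HINCP", col="R65") for one panel per group.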
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I'm not able to make a meaningful plot. Basically, I just want to take the top 5 or top 10 values and plot them as shown below:
To plot this, I'm trying to put the output of the code below into a dataframe and plot it, but I can't get it to work. Can anyone help me out?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to a sample dataset. It is only a representative sample; the original dataset I'm doing the EDA on has around 3 thousand unique stores and 60 thousand rows. Please help! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing
You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
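To see what the groupby, sum, and top-N chain produces, here is a small self-contained sketch on made-up numbers. The frame below and the helper name top_n_stores are hypothetical; only the Store and Size column names come from the question. It uses Series.nlargest as a shorthand for sorting descending and taking the head.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up data with the same column names as the question
df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3, 4, 4],
    "Size": [100, 150, 400, 50, 10, 20, 300, 300],
})

def top_n_stores(frame, n):
    """Total Size per Store, largest n totals first (hypothetical helper)."""
    return frame.groupby("Store")["Size"].sum().nlargest(n)

top = top_n_stores(df, 3)
print(top)

# Bar chart of the top stores, in descending order of total Size
ax = top.plot(kind="bar")
ax.set_ylabel("Total Size")
```

Because the result is indexed by Store, the bars keep the descending order without any extra sorting.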
First of all, looking at the data, it seems to cover locations from Scotland to Kolkata.
Categorize the data by geography first and then visualize.
Regards
Maitryee
The format of the .csv file is shown below. It has gaps because the data arrives in packets. I want to plot the data with timestamp on the x-axis and sensor1 on the y-axis using matplotlib in Python. Is that possible?
This is the data in the CSV file. You can see 4 data points received 4 times, each read at a different timestamp. I tried the usual approach, but it shows a blank plot.
This is the link to the CSV file.
https://docs.google.com/spreadsheets/d/17SIabIYYmSogOdeYTzpEwy9s2pZuVO3ghoChSSgGwAg/edit?usp=sharing
Thanks in advance.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("cantype.csv")
X = data.time_stamp
Y = data.sensor1
# plt.plot(data.time_stamp,data.BMS_01_CellVolt01)
plt.plot(X,Y)
plt.show()
Data
time_stamp,sensor1,sensor2,sensor3,sensor4,sensor5,sensor6,sensor7,sensor8,sensor9,sensor10,sensor11,sensor12,sensor13,sensor14,sensor15,sensor16
1.37E+12,1.50465,1.50405,1.50435,1.5042,,,,,,,,,,,,
1.37E+12,,,,,1.47105,1.5042,1.5045,1.50435,,,,,,,,
1.37E+12,,,,,,,,,1.49115,1.49205,1.4961,1.49865,,,,
1.37E+12,,,,,,,,,,,,,1.50405,1.5042,1.50405,1.50435
1.37E+12,1.50465,1.50405,1.50435,1.5042,,,,,,,,,,,,
1.37E+12,,,,,1.47105,1.5042,1.5045,1.50435,,,,,,,,
1.37E+12,,,,,,,,,1.49115,1.49205,1.4961,1.49865,,,,
1.37E+12,,,,,,,,,,,,,1.50405,1.5042,1.50405,1.50435
Load the csv with pandas:
import pandas as pd
df = pd.read_csv('cantype.csv')
Then either use pandas plotting:
df.plot(x='time_stamp', y='sensor1', marker='.')
Or pure matplotlib:
import matplotlib.pyplot as plt
plt.plot(df.time_stamp, df.sensor1, marker='.')
With your sample data, the plot does not look like a meaningful time series because there are only two (timestamp, sensor1) points and both are located at (1.37E+12, 1.50465):
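A likely reason for the blank plot in the question is that matplotlib breaks a line at every NaN. With no marker set, sensor1 values separated by empty rows never form a drawable segment. Dropping the empty rows for that one sensor before plotting avoids this. The sketch below uses a trimmed copy of the sample data (only sensor1 and sensor5 are kept, as an illustration) read from a string rather than from cantype.csv.

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Trimmed copy of the sample data: each packet fills only some columns
csv_text = """time_stamp,sensor1,sensor5
1.37E+12,1.50465,
1.37E+12,,1.47105
1.37E+12,,
1.37E+12,,
1.37E+12,1.50465,
1.37E+12,,1.47105
1.37E+12,,
1.37E+12,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows where sensor1 actually has a reading
sensor1 = df[["time_stamp", "sensor1"]].dropna()

fig, ax = plt.subplots()
ax.plot(sensor1["time_stamp"], sensor1["sensor1"], marker=".")
ax.set_xlabel("time_stamp")
ax.set_ylabel("sensor1")
```

The timestamps in the sample are all rounded to 1.37E+12, so the remaining points still land on the same x value; exporting the timestamps with full precision would be needed for a real time series.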
I have a .csv file (csv_test_1.csv) that is in this format:
durum_before_length,durum_before_reads,durum_after_length,durum_after_reads
0,0,0,0
10,0,10,0
20,0,20,0
30,0,30,1
40,0,40,4
50,0,50,5
60,0,60,0
70,0,70,1
80,0,80,4
90,0,90,1
100,4840,100,4704
110,4817,110,4706
120,4983,120,4860
130,4997,130,4851
140,5142,140,4980
150,5363,150,5192
160,5756,160,5530
170,6054,170,5725
180,6335,180,5989
190,7051,190,6651
200,9003,200,7157
210,8446,210,7812
220,9088,220,8314
230,9761,230,8955
240,10637,240,9660
250,11659,250,10408
260,12572,260,11178
270,13139,270,11538
280,13985,280,11950
290,113552,290,14304
300,954175,300,16383
,,310,17230
,,320,18368
,,330,19158
,,340,19733
,,350,20754
,,360,21698
,,370,21991
,,380,21937
,,390,22473
,,400,22655
,,410,22497
,,420,22460
,,430,22488
,,440,21941
,,450,21884
,,460,21350
,,470,21066
,,480,20812
,,490,19901
,,500,19716
,,510,19374
,,520,19000
,,530,18245
,,540,17220
,,550,15713
,,560,14042
,,570,11932
,,580,7204
,,590,29
You can see that the second two columns are longer than the first two columns. I would like to plot two overlapping histograms: the first histogram will be the first column as the x values plotted against the second column as the y-values, and the second histogram will be the third column as the x values plotted against the fourth column as the y-values.
I am thinking of using seaborn because it makes nice looking plots. The code I have thus far is as shown below. From here, I have no idea how to specify the x and y values and how to generate two overlapping histograms on the same plot. Any advice would be greatly appreciated.
import numpy as np
import pandas as pd
from pandas import read_csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
read_data = read_csv("csv_test_1.csv")
sns.set(style="white", palette="muted")
sns.despine()
plt.hist(read_data, density=False)
plt.xlabel("Read Length")
plt.ylabel("Number of Reads")
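Since the file already contains bin positions and counts, one option is to draw the counts directly as bars rather than letting a histogram function re-bin them. The sketch below is a minimal example under a few assumptions: it uses a five-row excerpt of the data read from a string, a bar width of 10 to match the spacing of the length column, and an alpha of 0.5 so the two series stay visible where they overlap.

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Short excerpt of csv_test_1.csv; the "before" columns end earlier
csv_text = """durum_before_length,durum_before_reads,durum_after_length,durum_after_reads
280,13985,280,11950
290,113552,290,14304
300,954175,300,16383
,,310,17230
,,320,18368
"""
data = pd.read_csv(io.StringIO(csv_text))

# Split into the two series and drop the padding rows of the shorter one
before = data[["durum_before_length", "durum_before_reads"]].dropna()
after = data[["durum_after_length", "durum_after_reads"]].dropna()

fig, ax = plt.subplots()
ax.bar(before["durum_before_length"], before["durum_before_reads"],
       width=10, alpha=0.5, label="before")
ax.bar(after["durum_after_length"], after["durum_after_reads"],
       width=10, alpha=0.5, label="after")
ax.set_xlabel("Read Length")
ax.set_ylabel("Number of Reads")
ax.legend()
```

With the full file, the very large counts at lengths 290 and 300 in the "before" series will dominate the y-axis, so a logarithmic scale via ax.set_yscale("log") may be worth trying.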