I'm currently struggling with my dataframe in Pandas (new to this).
I have a 3 columns dataframe : Categorical_data1, Categorical_data2,Output. (2400 rows x 3 columns).
Both categorical data (inputs) are strings and output is depending of inputs.
Categorical_data1 = ['type1','type2', ... , 'type6']
Categorical_data2 = ['rain1','rain2', 'rain3','rain4]
So 24 possible pairs of categorical data.
I want to plot a heatmap (using seaborn for instance) of the number of 0 in outputs regarding couples of categorical data (Cat_data1,Cat_data2). I tried several things using boolean.
I tried to figure out how to compute exact amount of 0
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
The output belongs to [0,1] with a large amount of 0 (around 1200 over 2400). My goal is to have something like this Source by jcdoming (I can't upload images...) with months = Categorical Data1, years = Categorical Data2 ; and numbers of 0 in ouputs).
Thank you for your help.
Use a seaborn countplot. It gives counts of categorical data occurrences in a certain feature. Use hue to add in the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
I have a data frame (loading from CSV) file that looks like below one
Data Mean sd time__1 time__2 time__3 time__4 time__5
0 Data_1 0.947667 0.025263 0.501517 0.874750 0.929426 0.953847 0.958375
1 Data_2 0.031960 0.017314 0.377588 0.069185 0.037523 0.024028 0.021532
Now, I wanted to plot 2 time series plots for (data_1, data_2) with (time__1, time__2, etc) as a timepoint. The x axis is (time__1, time__2, etc) and the y axis is their associated values.
The code I am trying
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("file.csv", delimiter=',', header=0)
data = data.drop(["Unnamed: 0"], axis=1)
# Set the date column as the index
data = data.set_index(["time__1", "time__2", "time__3", "time__4", "time__5"])
ax = data.plot(linewidth=2, fontsize=12)
ax.set_xlabel('Data')
ax.legend(fontsize=12)
plt.savefig("series.png")
plt.show()
The figure I am getting is not as expected.
I think I am doing some wrong with set_index() as my time points are in different columns.
How can I plot time-series when time points are in different columns?
Reproducible data as dictionary formate
{'Data': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 'Data_1', (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 'Data_2'}, 'Mean': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.947667360305786, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.031959813088179}, 'sd': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.025263005867601, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.017313838005066}}
IIUC you are getting the index wrong: If time__1, time__2 etc. is supposed to be your x-axis, that's what you want your index to be. The plot data series names are the columns. Therefore, you need to transpose your DataFrame. Using the csv data in your first table:
print(df)
# out:
Data Mean sd time__1 time__2 time__3 time__4 \
0 Data_1 0.947667 0.025263 0.501517 0.874750 0.929426 0.953847
1 Data_2 0.031960 0.017314 0.377588 0.069185 0.037523 0.024028
time__5
0 0.958375
1 0.021532
Changing column names and transposing:
df.drop(["Mean", "sd"], axis=1).set_index("Data").T
yields an appropriately formatted dataframe:
Data Data_1 Data_2
time__1 0.501517 0.377588
time__2 0.874750 0.069185
time__3 0.929426 0.037523
time__4 0.953847 0.024028
time__5 0.958375 0.021532
which can simply be plotted:
df.plot()
Have we numpy function or pandas function which make somthinfg like that:
For me, boundary values are the farthest values from the regression line.
That means for me:
the farthest from the line over the line and the farthest from the line under the line.
If I will have data:
l1 = [0,1,4,3,4,3]
df = pd.DataFrame(l1)
It looks like that:
0
0 1
1 4
2 3
3 4
4 3
How to find data from index 1 and index 4.
I need to recognize from python script and remove them. I know how to remove but i do not know how to find.
What I want to do:
First I am going to calculate linear regression, next I am going to remove outsider values and next i am going to recalculate linear regression one more time without the farthest values.
To remove outliers, you can use Series.quantile:
Suppose the following dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2022)
df = pd.DataFrame({'A': np.random.normal(5, 2, size=50)})
df.plot.hist(bins=25)
plt.xlim(0, 10)
plt.show()
Now filter out your dataframe:
df1 = df.loc[df['A'].between(*df['A'].quantile([0.25, 0.75]).values)]
df1.plot.hist(bins=10)
plt.xlim(0, 10)
plt.show()
I have a dataframe that looks like this:
I want to have one bar for old freq and one for new freq. Currently I have graph that looks like this:
This is what the code looks like:
freq_df['date'] = pd.to_datetime(freq_df['date'])
freq_df['hour'] = freq_df['hour'].astype(str)
fig = px.bar(freq_df, x="hour", y="old freq",hover_name = "date",
animation_frame= freq_df.date.dt.day)
fig.update_layout(transition = {'duration': 2000})
How do I add another bar?
Explanation about DF:
It has frequencies relevant to each hour in a specific date.
Edit:
One approach could be to create a category column and add old and new freq and assign values in another freq column. How do I do that :p ?
Edit:
Here is the DF
,date,hour,old freq,new freq
43,2020-09-04,18,273,224.0
44,2020-09-04,19,183,183.0
45,2020-09-04,20,99,111.0
46,2020-09-04,21,130,83.0
47,2020-09-04,22,48,49.0
48,2020-09-04,23,16,16.0
49,2020-09-05,0,8,6.0
50,2020-09-05,1,10,10.0
51,2020-09-05,2,4,4.0
52,2020-09-05,3,7,7.0
53,2020-09-05,4,25,21.0
54,2020-09-05,5,114,53.0
55,2020-09-05,6,284,197.0
56,2020-09-05,7,343,316.0
57,2020-09-05,8,418,419.0
58,2020-09-05,9,436,433.0
59,2020-09-05,10,469,396.0
60,2020-09-05,11,486,300.0
61,2020-09-05,12,377,140.0
62,2020-09-05,13,552,103.0
63,2020-09-05,14,362,117.0
64,2020-09-05,15,512,93.0
65,2020-09-05,16,392,41.0
66,2020-09-05,17,268,31.0
67,2020-09-05,18,223,30.0
68,2020-09-05,19,165,24.0
69,2020-09-05,20,195,15.0
70,2020-09-05,21,90,
71,2020-09-05,22,46,1.0
72,2020-09-05,23,17,1.0
The answer in two steps:
1. Perform a slight transformation of your data using pd.wide_to_long:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix='\w+').reset_index()
2. Plot two groups of bar traces using:
fig1 = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
This is the result:
The details:
If I understand your question correctly, you'd like to animate a bar chart where you've got one bar for each hour for your two frequencies freq_old and freq_new like this:
If that's the case, then you sample data is no good since your animation critera is hour per date and you've only got four observations (hours) for 2020-09-04 and then 24 observations for 2020-09-05. But don't worry, since your question triggered my interest I just as well made some sample data that will in fact work the way you seem to want them to.
The only real challenge is that px.bar will not accept y= [freq_old, freq_new], or something to that effect, to build your two bar series of different categories for you. But you can make px.bar build two groups of bars by providing a color argument.
However, you'll need a column to identify your different freqs like this:
0 new
1 old
2 new
3 old
4 new
5 old
6 new
7 old
8 new
9 old
In other words, you'll have to transform your dataframe, which originally has a wide format, to a long format like this:
date hour type day freq
0 2020-01-01 0 new 1 7.100490
1 2020-01-01 0 old 1 2.219932
2 2020-01-01 1 new 1 7.015528
3 2020-01-01 1 old 1 8.707323
4 2020-01-01 2 new 1 7.673314
5 2020-01-01 2 old 1 2.067192
6 2020-01-01 3 new 1 9.743495
7 2020-01-01 3 old 1 9.186109
8 2020-01-01 4 new 1 3.737145
9 2020-01-01 4 old 1 4.884112
And that's what this snippet does:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix='\w+').reset_index()
stubnames uses a prefix to identify the columns you'd like to stack into a long format. And that's why I've renamed new_freq and old_freq to freq_new and freq_old, respectively. j='type' simply takes the last parts of your cartegory names using sep='_' and produces the column that we need to tell the freqs from eachother:
type
old
new
old
...
suffix='\w+' tells pd.wide_to_long that we're using non-integers as suffixes.
And that's it!
Complete code:
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random
# sample data
observations = 24*5
np.random.seed(5); cols = list('a')
freq_old = np.random.uniform(low=-1, high=1, size=observations).tolist()
freq_new = np.random.uniform(low=-1, high=1, size=observations).tolist()
date = [t[:10] for t in pd.date_range('2020', freq='H', periods=observations).format()]
hour = [int(t[11:13].lstrip()) for t in pd.date_range('2020', freq='H', periods=observations).format()]
# sample dataframe of a wide format such as yours
freq_df=pd.DataFrame({'date': date,
'hour':hour,
'freq_new':freq_new,
'freq_old':freq_old})
freq_df['day']=pd.to_datetime(freq_df['date']).dt.day
# attempt to make my random data look a bit
# like your real world data.
# but don't worry too much about that...
freq_df.freq_new = abs(freq_df.freq_new.cumsum())
freq_df.freq_old = abs(freq_df.freq_old.cumsum())
# sample dataframe of a long format that px.bar likes
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix='\w+').reset_index()
# plotly express bar chart with multiple bar groups.
fig = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
# set up a sensible range for the y-axis
fig.update_layout(yaxis=dict(range=[df_long['freq'].min()*0.8,df_long['freq'].max()*1.2]))
fig.show()
I was able to create the bars for both the old and new frequencies, however using a separate plot for each day (Plotly Express Bar Charts don't seem to have support for multiple series). Here is the code for doing so:
# Import packages
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import plotly
import plotly.express as px
from plotly.offline import init_notebook_mode, plot, iplot, download_plotlyjs
init_notebook_mode(connected=True)
plotly.offline.init_notebook_mode(connected=True)
# Start formatting data
allDates = np.unique(df.date)
numDates = allDates.shape[0]
print(numDates)
for i in range(numDates):
df = original_df.loc[original_df.date == allDates[i]]
oldFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["old_freq"].to_numpy(), name="Old Frequency")
newFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["new_freq"].to_numpy(), name="New Frequency")
fig = go.Figure(data=[oldFreqData,newFreqData])
fig.update_layout(title=allDates[i])
fig.update_xaxes(title='Hour')
fig.update_yaxes(title='Frequency')
fig.show()
where df is the dataframe DF from your question.
Here is the output:
However, if you prefer the use of the animation frame from Plotly Express, you can have two separate plots: one for old frequencies and one for new using this code:
# Reformat data
df = original_df
dates = pd.to_datetime(np.unique(df.date)).strftime('%Y-%m-%d')
numDays = dates.shape[0]
print(numDays)
hours = np.arange(0,24)
numHours = hours.shape[0]
allDates = []
allHours = []
oldFreqs = []
newFreqs = []
for i in range(numDays):
for j in range(numHours):
allDates.append(dates[i])
allHours.append(j)
if (df.loc[df.date == dates[i]].loc[df.hour == j].shape[0] != 0): # If data not missing
oldFreqs.append(df.loc[df.date == dates[i]].loc[df.hour == j].old_freq.to_numpy()[0])
newFreqs.append(df.loc[df.date == dates[i]].loc[df.hour == j].new_freq.to_numpy()[0])
else:
oldFreqs.append(0)
newFreqs.append(0)
d = {'Date': allDates, 'Hour': allHours, 'Old_Freq': oldFreqs, 'New_Freq': newFreqs, 'Comb': combined}
df2 = pd.DataFrame(data=d)
# Create px plot with animation
fig = px.bar(df2, x="Hour", y="Old_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig.show()
fig2 = px.bar(df2, x="Hour", y="New_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig2.show()
and here is the plot from that code:
I had a very ambitious project (for my novice level) to use on numpy array, where I load a series of data, and make different plots based on my needs - I have uploaded a slim version of my data file input_data and wanted to make plots based on: F (where I would like to choose the desired F before looping), and each series will have the data from E column (e.g. A12 one data series, A23 another data series in the plot, etc) and on the X axis I would like to use the corresponding values in D.
so to summarize for a chosen value on column F I want to have 4 different data series (as the number of variables on column E) and the data should be reference (x-axis) on the value of column D (which is date)
I stumbled in the first step (although spend too much time) where I wanted to plot all data with F column identifier as one plot.
Here is what I have up to now:
import os
import numpy as np
N = 8 #different values on column F
M = 4 #different values on column E
dataset = open('array_data.txt').readlines()[1:]
data = np.genfromtxt(dataset)
my_array = data
day = len(my_array)/M/N # number of measurement sets - variation on column D
for i in range(0, len(my_array), N):
plt.xlim(0, )
plt.ylim(-1, 2)
plt.plot(my_array[i, 0], my_array[i, 2], 'o')
plt.hold(True)
plt.show()
this does nothing.... and I still have a long way to go..
With pandas you can do:
import pandas as pd
dataset = pd.read_table("toplot.txt", sep="\t")
#make D index (automatically puts it on the x axis)
dataset.set_index("D", inplace=True)
#plotting R vs. D
dataset.R.plot()
#plotting F vs. D
dataset.F.plot()
dataset is a DataFrame object and DataFrame.plot is just a wrapper around the matplotlib function to plot the series.
I'm not clear on how you are wanting to plot it, but it sound like you'll need to select some values of a column. This would be:
# get where F == 1000
maskF = dataset.F == 1000
# get the values where F == 1000
rows = dataset[maskF]
# get the values where A12 is in column E
rows = rows[rows.E == "A12"]
#remove the we don't want to see
del rows["E"]
del rows["F"]
#Plot the result
rows.plot(xlim=(0,None), ylim=(-1,2))