Plotting values above a threshold in Python

Having issues with plotting values above a set threshold using a pandas dataframe.
I have a dataframe that has 21453 rows and 20 columns, and one of the columns is just 1 and 0 values. I'm trying to plot this column using the following code:
lst1 = []
for x in range(0, len(df)):
    if(df_smooth['Active'][x] == 1):
        lst1.append(df_smooth['Time'][x])
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst1)
But I get the following error:
x and y must have same first dimension, but have shapes (21453,) and (9,)
Any suggestions on how to fix this?

The error is probably caused by the line plt.plot(df_smooth['Time'], lst1). lst1 holds only the subset of df_smooth['Time'] where Active is 1, while df_smooth['Time'] is the full series, so x and y have different lengths.
One solution is to also build a filtered x list, for example -
lst_X = []
lst_Y = []
for x in range(0, len(df_smooth)):
    if(df_smooth['Active'][x] == 1):
        lst_X.append(df_smooth['Time'][x])
        lst_Y.append(df_smooth['Time'][x])
plt.plot(lst_X, lst_Y)
Another option is to build a sub-dataframe -
sub_df = df_smooth[df_smooth['Active']==1]
plt.plot(sub_df['Time'], sub_df['Time'])
(assuming the correct column as Y column is Time, otherwise just replace it with the correct column)
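As a minimal sketch of the sub-dataframe idea, the snippet below filters both axes with the same boolean mask so they always have equal length. The sample values are made up, and using CH1 as the y column is an assumption; swap in whichever column you actually want to mark.

```python
import pandas as pd

# Made-up stand-in for df_smooth; only the column names come from the question.
df_smooth = pd.DataFrame({
    "Time": [0.0, 0.1, 0.2, 0.3, 0.4],
    "CH1": [1.0, 3.0, 2.0, 5.0, 4.0],
    "Active": [0, 1, 0, 1, 1],
})

# One boolean mask selects the rows; x and y are taken from the same rows.
active = df_smooth[df_smooth["Active"] == 1]
x_active = active["Time"]
y_active = active["CH1"]  # assumption: CH1 is the series to highlight

print(len(x_active), len(y_active))
```

The two filtered series can then be passed to plt.plot(x_active, y_active, 'o') on top of the full plt.plot(df_smooth['Time'], df_smooth['CH1']) line.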

It seems you are passing two data series of different lengths to plt.plot(). This causes the error, because plt.plot() expects x and y to have the same length.
You will need to make sure that both data series have the same length before plotting them. One way to do this is to create a new list with as many elements as df_smooth['Time'], and then fill it with the values from lst1.
# Create a new list with the same length as the 'Time' data series
lst2 = [0] * len(df_smooth['Time'])
# Copy the values of 'lst1' into the first len(lst1) positions of 'lst2';
# the remaining positions stay 0
for x in range(0, len(lst1)):
    lst2[x] = lst1[x]
# Plot the 'Time' and 'lst2' data series using the plt.plot() function
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst2)
I think this should work.
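Another way to keep the lengths equal, sketched here with made-up data, is to keep every row but blank out the inactive ones with NaN via Series.where. Using CH1 as the plotted column is an assumption. Matplotlib does not draw NaN points, so only the active samples would show up.

```python
import pandas as pd

# Made-up stand-in for df_smooth; only the column names come from the question.
df_smooth = pd.DataFrame({
    "Time": [0.0, 0.1, 0.2, 0.3, 0.4],
    "CH1": [1.0, 3.0, 2.0, 5.0, 4.0],
    "Active": [0, 1, 0, 1, 1],
})

# Keep CH1 where Active == 1, replace everything else with NaN.
# The result has the same length and index as df_smooth['Time'].
marked = df_smooth["CH1"].where(df_smooth["Active"] == 1)

print(marked.tolist())
```

Because marked lines up row by row with df_smooth['Time'], it can be passed straight to plt.plot(df_smooth['Time'], marked, 'o').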

Related

Assigning values to a multi-index pandas data frame

This is my multiindex dataframe.
import pandas as pd
import numpy as np
arrays=[['A','A','A','A','B','B','B','B',
'C','C','C','C','D','D','D','D','E','E','E','E'],[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4]]
hier_index=list(zip(*arrays))
#hier_index=pd.MultiIndex.from_tuples(hier_index)
hier_index
index=pd.MultiIndex.from_tuples(hier_index,names=["Camera","position"])
df1=pd.DataFrame(np.zeros([3,20]),index=["x","y","z"], columns=index)
I also have two for loops:
for i in range(0,20):
    x=[]
    for j in range(0,4):
        x_r = ...  # an equation
        x.append(x_r)
    print(x)
The output of the for loops is 5 lists with 4 elements each. Each list is associated with A, B, C, D, E respectively, e.g. [0,0.1,0.2,0.3], [0,0.4,0.5,0.6], and so on.
I want to assign elements of each list to 1,2,3,4 sub columns for all columns A,B,C,D,E. I don't know how to do that.
I assume that you will need both i and j in the for loops for something, so I did not remove them. You can use enumerate to iterate through the top-level index (e.g. A, B, C, ...) and still keep the numeric index i. With this you can simply assign x to a slice of your data frame using loc. The loop below assigns your x (created here with the equation 20*i+j) to the slice, df1.loc['x', (c,1):(c,4)] = x, which specifies the row 'x' and the columns (c,1):(c,4), where c is one of your labels (e.g. A, B, C, ...).
for i,c in enumerate(df1.columns.levels[0]):
    x=[]
    for j in range(0,4):
        x_r= 20*i+j
        x.append(x_r)
    df1.loc['x', (c,1):(c,4)] = x
    print(x)
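If all the lists are available at once, the row can also be filled in a single assignment. This is only a sketch under that assumption: it reuses the placeholder equation 20*i+j from above, and builds the column index with pd.MultiIndex.from_product instead of from_tuples, which gives the same A1..E4 column order.

```python
import numpy as np
import pandas as pd

cameras = ["A", "B", "C", "D", "E"]
positions = [1, 2, 3, 4]
columns = pd.MultiIndex.from_product(
    [cameras, positions], names=["Camera", "position"]
)
df1 = pd.DataFrame(np.zeros((3, 20)), index=["x", "y", "z"], columns=columns)

# One list of 4 values per camera (placeholder equation: 20*i + j).
values = np.array([[20 * i + j for j in range(4)] for i in range(len(cameras))])

# Flatten row by row so the order matches the columns A1..A4, B1..B4, ...
df1.loc["x"] = values.ravel()

print(df1.loc["x", "B"])
```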

How do I access the integers given by nunique in Pandas?

I am trying to access the items in each column that is outputted given the following code. It outputs two columns, 'Accurate_Episode_Date' values, and the count (the frequency of each Date). My goal is to plot the date on the x axis, and the count on the y axis using a scatterplot, but first I need to be able to access the actual count values.
data = pd.read_csv('CovidDataset.csv')
Barrie = data.loc[data['Reporting_PHU_City'] == 'Barrie']
dates_barrie = Barrie[['Accurate_Episode_Date']]
num = data.groupby('Accurate_Episode_Date')['_id'].nunique()
print(num.tail(5))
The code above outputs the following:
2021-01-10T00:00:00 1326
2021-01-11T00:00:00 1875
2021-01-12T00:00:00 1274
2021-01-13T00:00:00 492
2021-01-14T00:00:00 8
Again, I want to plot the dates on the x axis, and the counts on the y axis in scatterplot form. How do I access the count and date values?
EDIT: I just want a way to plot dates like 2021-01-10T00:00:00 and so on on the x axis, and the corresponding count: 1326 on the Y-axis.
Turns out this was mainly a data type issue. Basically all that was needed was accessing the datetime index and typecasting it to string with num.index.astype(str).
You could probably change it "in-place" and use the plot like below.
num.index = num.index.astype(str)
num.plot()
If you only want to access the values of a DataFrame or Series you just need to access them like this: num.values
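As a small illustration with made-up rows (the column names are taken from the question), the result of groupby(...).nunique() is a Series: the dates are its index and the counts are its values.

```python
import pandas as pd

# Made-up stand-in for CovidDataset.csv.
data = pd.DataFrame({
    "Accurate_Episode_Date": [
        "2021-01-10T00:00:00",
        "2021-01-10T00:00:00",
        "2021-01-11T00:00:00",
    ],
    "_id": [1, 2, 3],
})

num = data.groupby("Accurate_Episode_Date")["_id"].nunique()

dates = pd.to_datetime(num.index)  # x values, parsed as real datetimes
counts = num.to_numpy()            # y values, one count per date

print(list(zip(dates, counts)))
```

Both arrays have the same length, so they can be passed to a scatter plot call such as plt.scatter(dates, counts).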
If you want to plot the date column on X, you don't need to access that column separately, just use pandas internals:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
# some dummy dates + counts
dates = [datetime.now() + timedelta(hours=i) for i in range(1, 6)]
values = np.random.randint(1, 10, 5)
df = pd.DataFrame({
    "Date": dates,
    "Values": values,
})
# if you only have 1 other column you can skip `y`
df.plot(x="Date", y="Values")
You need to convert the date column using pd.to_datetime(df['dates']); then you can plot.
Updated answer:
Here there is no need to convert with pd.to_datetime(df['dates']):
ax=df[['count']].plot()
ax.set_xticks(df['count'].index)
ax.set_xticklabels(df['date'])

Modifying Dataframes Stored in a List of Dataframes

I am segmenting data into a set of Pandas dataframes that have identical structure. For each dataframe, there are a total of cnames columns that have unique names, and a total of nrows rows, that are identified by an integer-valued index running from 0 to nrows-1. There are a total of nframes segments, each containing 3 dataframes.
The goal is, within each segment, calculate a quotient of two of the dataframes and send the result to the third. I've implemented and tested a process that works, but have a question as to why a slight variation of the process doesn't.
The steps (and variation) are as follows:
1. Initialize data frames:
Ldf_num = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_den = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_quo = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
2. Populate data frames:
#For loop over a set of data-records stored as a list of lists:
#Determine x, the index of the data frame related to this record, from the data
df_num = Ldf_num[x]
df_den = Ldf_den[x]
#Derive values (including row) for each column of the data frame, and store them as...
df_num[cname][row] += derived_value1
df_den[cname][row] += derived_value2
3. Determine quotient for each set of dataframes:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    Ldf_quo[x] = df_num.div(df_den)
The above version of step 3 worked, i.e. I can print each dataframe in the quotient dataframe, and see that they have different values that match the numerator and denominator values.
3b. However, the version below did not work:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    df_quo = Ldf_quo[x]
    df_quo = df_num.div(df_den)
...as all entries in all dataframes in the list Ldf_quo contained their initial value of 0.
Can anyone explain why when I assign a variable to a single dataframe stored in a list of dataframes, and I change values of the assigned variable, it changes the values in the original dataframe in the list in step 2...
...but when I send the output of the "div" method to a variable assigned to a single dataframe in a list of dataframes as in step 3b, the values in the original dataframe do not change (but I can get the desired result by sending the output from the "div" method explicitly to the right slot in the list of dataframes, as in step 3)
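In 3b, df_quo = Ldf_quo[x] only makes the name df_quo refer to the dataframe stored in the list. The next line, df_quo = df_num.div(df_den), does not write into that dataframe: div returns a new dataframe, and the assignment rebinds the name df_quo to it, so Ldf_quo[x] still holds its initial zeros. In step 2, df_num[cname][row] += ... modifies the existing dataframe in place, which is why the change is visible through the list. In step 3, Ldf_quo[x] = df_num.div(df_den) stores the new dataframe at index x of the list.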
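The difference between rebinding a name and changing the object it refers to can be seen in a small sketch. The frames and values here are made up for illustration.

```python
import pandas as pd

frames = [pd.DataFrame({"a": [0.0, 0.0]}) for _ in range(2)]
num = pd.DataFrame({"a": [1.0, 4.0]})
den = pd.DataFrame({"a": [2.0, 8.0]})

# Rebinding (as in 3b): quo first names frames[0], then names a new object.
quo = frames[0]
quo = num.div(den)
after_rebinding = frames[0]["a"].tolist()  # the list entry is untouched

# Storing into the list slot (as in step 3) replaces the entry itself.
frames[0] = num.div(den)

# Writing through the alias modifies the same object the list holds.
alias = frames[1]
alias.loc[:, "a"] = num["a"] / den["a"]

print(after_rebinding, frames[0]["a"].tolist(), frames[1]["a"].tolist())
```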

How do you filter rows in a dataframe based on the column numbers from a Python list?

I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.
Wavelength from dataframe
I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, they are not the actual x-values of the signal, but just the index of their corresponding row:
from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))
So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were on an index found in the peak list. I did this iteratively, however, since it is 3000000 rows, it is extremely slow:
peak_index = 0
for data_index in list(data.index):
    if data_index != peaks[peak_index]:
        data[data_index, 1] = float('NaN')
    else:
        peak_index += 1
Does anyone know what a faster method of filtering a Pandas dataframe might be?
Looping over rows is in most cases extremely inefficient in pandas. Assuming you just need a filtered DataFrame that contains the x and y values only for the rows where y is a peak, you can index by position with the peaks array:
df.iloc[peaks]
Alternatively, if you are hoping to retrieve an original DataFrame with y column retaining its peak values and having NaN otherwise, then please use:
df.y = df.y.where(df.y.iloc[peaks] == df.y.iloc[peaks])
Finally, since you seem to care about just the x values of the peaks, you might just rework the first piece in the following way:
df.iloc[peaks].x
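A minimal sketch of both variants on made-up data follows. The peaks array here is written by hand and stands in for the positional indices that scipy.signal.find_peaks returns; the boolean mask is one possible way to get the NaN-filled column and is not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.arange(10.0, 17.0),
    "y": [0.0, 2.0, 1.0, 5.0, 3.0, 4.0, 0.0],
})
# Hand-written stand-in for the positions returned by find_peaks.
peaks = np.array([1, 3, 5])

# Variant 1: keep only the peak rows (positional selection).
peak_rows = df.iloc[peaks]

# Variant 2: keep all rows, but set y to NaN everywhere except at the peaks.
is_peak = np.zeros(len(df), dtype=bool)
is_peak[peaks] = True
masked = df.assign(y=df["y"].where(is_peak))

print(peak_rows)
print(masked)
```

Both variants work on whole arrays at once, so they avoid the Python-level loop over 3 million rows.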

Getting rid of outliers rows in multiple columns pandas dataframe

I have a pandas data frame with many columns (>100). I standardized all the column values so every column is centered at 0 (mean 0 and std 1). I want to get rid of all the rows that have a value below -2 or above 2 in any column. For example, if rows 2,3,4 are outliers in the first column and rows 3,4,5,6 are outliers in the second column, then I would like to get rid of the rows [2,3,4,5,6].
What I am trying to do is use a for loop to go through every column, collect the row indexes that are outliers and store them in a list. At the end I have a list of lists with the outlier row indexes of every column. I take the unique values to obtain the row indexes I should get rid of. My problem is that I don't know how to slice the data frame so it doesn't contain these rows. I was thinking of using an %in% operator, but it doesn't accept a list inside a list. I show my code below.
### Getting rid of the outliers
'''
We are going to get rid of the outliers who are outside the range of -2 to 2.
'''
aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []
for i in range(n_cols):
    variable = aux_features[:,i] # We take one column at a time
    condition = (variable < -2) | (variable > 2) # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)
outliers = [j for i in outliers_index for j in i]
outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2)) # This is the final list with all the index that contain outliers.
total_index = list(range(n_rows))
aux = (total_index in unique_index)
outliers_2 contains all the row indexes (including repetitions); unique_index then keeps only the unique values, so I end up with all the row indexes that have outliers. I am stuck at this part. Does anyone know how to complete it, or have a better idea of how to get rid of these outliers? (I guess my method would be very time consuming for really large datasets.)
df = pd.DataFrame(np.random.standard_normal(size=(1000, 5))) # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]
Explanation:
Flag the values above 2 or below -2. Returns a dataframe of booleans:
np.abs(df) > 2
Check if a row contains outliers. Evaluates to True for each row where an outlier exists:
(np.abs(df) > 2).any(axis=1)
Finally select all rows without an outlier using the ~ operator:
df[~(np.abs(df) > 2).any(axis=1)]
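Here is a small deterministic sketch of the same filter, with made-up values. It also shows one way the list of row positions from the question could be used to drop rows, which is an assumption about how the asker's approach might be finished.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0.1, 2.5, -0.3, 0.0],
    "b": [0.2, 0.1, -3.0, 1.9],
})

# Boolean-mask version: True for rows with at least one |value| > 2.
has_outlier = (df.abs() > 2).any(axis=1)
cleaned = df[~has_outlier]

# Position-list version: collect the unique outlier row positions,
# then drop the matching index labels.
values = df.to_numpy()
rows, _ = np.where((values < -2) | (values > 2))
unique_index = np.unique(rows)
cleaned_by_position = df.drop(index=df.index[unique_index])

print(cleaned)
```

Both versions keep rows 0 and 3 here; the mask version avoids building the intermediate index lists.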
