I am segmenting data into a set of Pandas dataframes that have identical structure. Each dataframe has the same set of uniquely named columns, given by cnames, and nrows rows identified by an integer index running from 0 to nrows-1. There are nframes segments in total, each containing 3 dataframes.
The goal is, within each segment, to calculate the quotient of two of the dataframes and store the result in the third. I've implemented and tested a process that works, but I have a question about why a slight variation of it doesn't.
The steps (and variation) are as follows:
1. Initialize data frames:
Ldf_num = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_den = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_quo = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
2. Populate data frames:
# For each data record (stored as a list of lists):
#   determine x, the index of the dataframe pair this record belongs to, from the data
df_num = Ldf_num[x]
df_den = Ldf_den[x]
# Derive values (including row) for each column of the dataframe, and accumulate them as:
df_num[cname][row] += derived_value1
df_den[cname][row] += derived_value2
3. Determine the quotient for each set of dataframes:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    Ldf_quo[x] = df_num.div(df_den)
The above version of step 3 worked, i.e. I can print each dataframe in Ldf_quo and see that they hold different values matching the corresponding numerator and denominator values.
3b. However, the version below did not work:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    df_quo = Ldf_quo[x]
    df_quo = df_num.div(df_den)
...as all entries in all dataframes in the list Ldf_quo contained their initial value of 0.
Can anyone explain why, when I bind a variable to a single dataframe stored in a list of dataframes and change values through that variable, the values in the original dataframe in the list change (as in step 2)...
...but when I assign the output of the "div" method to a variable bound to a single dataframe in a list of dataframes (as in step 3b), the values in the original dataframe do not change? (I do get the desired result by assigning the output of "div" directly to the right slot in the list of dataframes, as in step 3.)
In 3b, df_quo = Ldf_quo[x] merely binds the name df_quo to the dataframe object stored at index x of the list. The subsequent df_quo = df_num.div(df_den) rebinds that name to a brand-new dataframe returned by div, leaving the list untouched. In step 2, by contrast, you mutate the dataframe in place (df_num[cname][row] += ...), so the change is visible through the list. In step 3, Ldf_quo[x] = df_num.div(df_den) assigns the new dataframe into the list slot itself, which is why it works.
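A minimal sketch of the difference between mutating an object in place and rebinding a name (toy data, not the asker's frames):
import pandas as pd

L = [pd.DataFrame({'a': [0.0, 0.0]}) for _ in range(2)]

df = L[0]                   # df and L[0] now refer to the SAME object
df['a'] += 1.0              # in-place mutation: visible through the list
print(L[0]['a'].tolist())   # [1.0, 1.0]

df = df.div(2.0)            # rebinding: df now names a NEW dataframe
print(L[0]['a'].tolist())   # still [1.0, 1.0] -- the list is untouched

L[0] = df                   # assigning to the list slot replaces the stored object
print(L[0]['a'].tolist())   # [0.5, 0.5]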
This is my code snippet:
# Define array entries
array = ['RegCreateKeyA', 'RegCreateKeyExA', 'RegCreateKeyExW', 'RegCreateKeyTransactedA', ...]
# Print rows that have the specified APIs using the location2 column
rels = df.loc[df['location2'].isin(array)]
rel = rels.assign(Index = range(len(rels))).set_index('Index')
# Count number of times a process calls each API
count = rel.groupby(['Process_Name', 'location2']).size()
# convert from long to wide
temp = count.unstack()
print(temp)
The array contains over 100 entries. I compare the values in the location2 column against the array: if a value exists in the array, the row is saved to rels. Then count takes the frequency of each of the entries in location2. It is in long format, so I convert it to wide format, save it to temp, and print that out.
The problem is that my code only prints counts for the array entries that occur in the data. I also want it to print a count of zero for entries that don't occur, but I'm stumped on the best way to do that.
As stated, the column headers only contain the values from the array that appear in the dataframe, but I want them to include all the entries from the array, with a value of zero for those that are not included.
I've attached a screenshot of my output.
Try this code:
import numpy as np
import pandas as pd

array = ['RegCreateKeyA', 'RegCreateKeyExA', 'RegCreateKeyExW', 'RegCreateKeyTransactedA']
df = pd.DataFrame(data={
    "location2": ['RegCreateKeyA', 'RegCreateKeyA', 'RegCreateKeyExA', 'RegCreateKeyExW', 'RegCreateKeyTransactedA', 1, 2, 3, "Match"],
    "Process_Name": ['exa', 'exa', 'exa', 'exw', 'ta', 1, 2, 3, "Match"]
})
# Rows whose location2 is in the array get 1; others get NaN so they are not counted
df["Counts"] = df[["location2", "Process_Name"]].apply(lambda x: 1 if x["location2"] in array else np.nan, axis=1)
pivot = df.pivot_table(index="Process_Name", columns="location2", aggfunc="count").fillna(0).astype(int)
print(pivot)
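That said, if the requirement is that every entry of array appears as a column even when it never occurs in the data, one option (a sketch, not part of the original answer) is to reindex the wide table produced by unstack:
import pandas as pd

# Toy stand-in for the asker's filtered frame rel (hypothetical values)
rel = pd.DataFrame({
    'Process_Name': ['exa', 'exa', 'exw'],
    'location2': ['RegCreateKeyA', 'RegCreateKeyA', 'RegCreateKeyExA'],
})
array = ['RegCreateKeyA', 'RegCreateKeyExA', 'RegCreateKeyExW', 'RegCreateKeyTransactedA']

count = rel.groupby(['Process_Name', 'location2']).size()
temp = count.unstack().reindex(columns=array).fillna(0).astype(int)
print(temp)  # all four API names appear as columns, with 0 where a process never calls them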
Having issues with plotting values above a set threshold using a pandas dataframe.
I have a dataframe that has 21453 rows and 20 columns, and one of the columns is just 1 and 0 values. I'm trying to plot this column using the following code:
lst1 = []
for x in range(0, len(df)):
    if df_smooth['Active'][x] == 1:
        lst1.append(df_smooth['Time'][x])
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst1)
But I get the following error:
x and y must have same first dimension, but have shapes (21453,) and (9,)
Any suggestions on how to fix this?
The error is probably the result of this line: plt.plot(df_smooth['Time'], lst1). While lst1 is a subset of df_smooth['Time'], df_smooth['Time'] is the full series.
The solution I would suggest is to also build a filtered x version, for example:
lst_X = []
lst_Y = []
for x in range(0, len(df_smooth)):
    if df_smooth['Active'][x] == 1:
        lst_X.append(df_smooth['Time'][x])
        lst_Y.append(df_smooth['Time'][x])  # replace 'Time' with the desired Y column
Another option is to build a sub-dataframe:
sub_df = df_smooth[df_smooth['Active'] == 1]
plt.plot(sub_df['Time'], sub_df['Time'])
(assuming the correct Y column is Time; otherwise just replace it with the correct column)
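Putting the second option together (a sketch reusing df_smooth from the question, and assuming 'CH1' is the signal wanted on the y-axis):
import matplotlib.pyplot as plt

sub_df = df_smooth[df_smooth['Active'] == 1]   # keep only the active rows
plt.plot(df_smooth['Time'], df_smooth['CH1'])  # full signal
plt.plot(sub_df['Time'], sub_df['CH1'], 'o')   # markers at the active samples only
plt.show()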
It seems like you are trying to plot two data series of different lengths using the plt.plot() function; this causes the error because plt.plot() expects both series to have the same length.
You will need to ensure that both data series have the same length before trying to plot them. One way to do this is to create a new list that contains the same number of elements as the df_smooth['Time'] data series, and then fill it with the corresponding values from the lst1 data series.
# Create a new list with the same length as the 'Time' data series
lst2 = [0] * len(df_smooth['Time'])
# Loop through the lst1 data series and copy the values to the
# corresponding indices in lst2
for x in range(0, len(lst1)):
    lst2[x] = lst1[x]
# Plot the 'Time' and 'lst2' data series using plt.plot()
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst2)
I think this should work.
I am trying to extract a section (matrix) of the numbers in a pandas dataframe, as marked in the attached picture.
Can anyone assist me? I want to perform analytics on a section (matrix) of a bigger dataframe. Thank you in advance!
You can use the .iloc[] function to select the rows and columns you want.
dataframe.iloc[5:15,6:15]
This should select rows 5-14 and columns 6-14.
Not sure if the numbers are correct but I think this method is what you were looking for.
edit: changed .loc[] to .iloc[] because we're using index values, and cleaned it up a bit
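A quick demonstration of the half-open .iloc slice on a toy frame (hypothetical data):
import numpy as np
import pandas as pd

dataframe = pd.DataFrame(np.arange(400).reshape(20, 20))
section = dataframe.iloc[5:15, 6:15]  # rows 5..14, columns 6..14
print(section.shape)                  # (10, 9)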
Here is the code to iterate over the whole dataframe:
# df = big dataframe
shape = (10, 10)  # shape of the matrix to be analyzed, here 10x10
step = 1   # step size: iterate over every position
# or
step = 10  # step size: iterate block by block
# Keep in mind that iterating block by block will leave some data out at the end of the rows and columns.
# You can set step = shape if you are working with a matrix that isn't square; just be sure to change
# step in the code below to step[0] and step[1] respectively.
for row in range(0, df.shape[0] - shape[0] + 1, step):      # rows of the big dataframe minus rows of the matrix to be analyzed
    for col in range(0, df.shape[1] - shape[1] + 1, step):  # columns of the big dataframe minus columns of the matrix to be analyzed
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]  # slice out the matrix
        # analyze matrix here
This is basically the same as @dafmedinama said; I just added more comments, simplified specifying the shape of the matrix, and included a step variable so you don't have to iterate over every single number each time the matrix moves.
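A quick way to sanity-check the loop on toy data (a sketch with a hypothetical 20x20 frame, 10x10 blocks, step 10):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(400).reshape(20, 20))
shape, step = (10, 10), 10
windows = [
    df.iloc[r:r + shape[0], c:c + shape[1]]
    for r in range(0, df.shape[0] - shape[0] + 1, step)
    for c in range(0, df.shape[1] - shape[1] + 1, step)
]
print(len(windows))  # 4 non-overlapping 10x10 blocks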
Let sub_rows and sub_cols be the dimensions of the sub-dataframe to be extracted:
import pandas as pd

sub_rows = 10  # number of rows to be extracted
sub_cols = 3   # number of columns to be extracted
if sub_rows > len(df.index):
    print("Defined sub dataframe rows are more than in the original dataframe")
elif sub_cols > len(df.columns):
    print("Defined sub dataframe columns are more than in the original dataframe")
else:
    for i in range(0, len(df.index) - sub_rows + 1):
        for j in range(0, len(df.columns) - sub_cols + 1):
            sub_df = df.iloc[i:i + sub_rows, j:j + sub_cols]  # extracted sub-dataframe
            # Put here the code you need for your analysis
I have a pandas dataframe with many columns (>100). I standardized all the columns so that every column is centered at 0 (mean 0 and std 1). I want to get rid of all the rows that have values below -2 or above 2 in any column. By this I mean: say in the first column rows 2, 3, 4 are outliers and in the second column rows 3, 4, 5, 6 are outliers; then I would like to get rid of rows [2, 3, 4, 5, 6].
What I am trying to do is use a for loop to pass over every column and collect the row indices that are outliers, storing them in a list. At the end I have a list containing lists with the outlier row indices of every column, and I take the unique values to obtain the row indices I should get rid of. My problem is I don't know how to slice the dataframe so it doesn't contain these rows. I was thinking of using an %in%-style operator (as in R), but it doesn't accept a list of lists. I show my code below.
### Getting rid of the outliers
'''
We are going to get rid of the outliers that are outside the range of -2 to 2.
'''
aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []
for i in range(n_cols):
    variable = aux_features[:, i]                 # take one column at a time
    condition = (variable < -2) | (variable > 2)  # establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)
outliers = [j for i in outliers_index for j in i]
outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2))  # final list with all the row indices that contain outliers
total_index = list(range(n_rows))
aux = (total_index in unique_index)
outliers_2 contains all the row indices (including repetition); in unique_index I then keep only the unique values, so I end up with every row index that has outliers. I am stuck at this point. If anyone knows how to complete it, or has a better idea of how to get rid of these outliers, please share (I guess my method would be very time consuming for really large datasets).
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal(size=(1000, 5)))  # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]
Explanation:
Compare the absolute values against 2. Returns a dataframe of booleans:
np.abs(df) > 2
Check if row contains outliers. Evaluates to True for each row where an outlier exists:
(np.abs(df) > 2).any(axis=1)
Finally, select all rows without outliers using the ~ operator:
df[~(np.abs(df) > 2).any(axis=1)]
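Applied to the asker's own variables (a sketch, assuming features_scaled is the standardized frame from the question):
mask = (features_scaled.abs() > 2).any(axis=1)  # True for rows containing at least one outlier
cleaned = features_scaled[~mask]                # rows whose values all lie within [-2, 2]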
I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the results in another dataframe.
df_empty = []
m = daily.iloc[:, -1]        # columns = stocks, rows = daily returns; m is the market column
stocks = daily.iloc[:, :-1]
for col in range(len(stocks.columns)):
    s = daily.iloc[:, col]
    covmat = np.cov(s, m)
    beta = covmat[0, 1] / covmat[1, 1]
    return (beta)
print(beta)
In the above example, I first want to calculate a covariance matrix between s (the columns representing stocks' daily returns, which I want to iterate through one by one) and m (the market daily return, my reference column and the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return(beta) gives me a single numerical result for one stock, while print(beta) prints the betas for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code, but it returns None, as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop ends the loop the first time it is encountered. Moreover, you are not saving the beta values anywhere: the for-loop itself does not return a value in Python (it only has side effects), so each iteration simply overwrites beta.
Apart from that, you may choose a more pandas-like approach using apply on the dataframe, which basically iterates over the columns of the dataframe, passes each column to a supplied function as the first parameter, and collects the results of the function calls. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np

# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define the reference (market) column
cov_column = daily.iloc[:, -1]

# set up the computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0, 1] / covmat[1, 1]

# use apply to iterate over the stock columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
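Since the goal was to end up with a dataframe of betas, the resulting Series can be converted directly via Series.to_frame:
beta_df = result.to_frame(name='beta')  # one row per stock column, one 'beta' column
print(beta_df)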