Counting number of null values below and placing them in new df - python

df
I'm attempting to count the number of null values below each non-null cell in a dataframe and put the number into a new variable (size) and data frame.
I have included a picture of the dataframe I'm trying to count. I'm only interested in the Arrival Date Column for now. The new data frame should have a column that has 1,1,3,7..etc as it's first observations.
##Loops through all of rows in DOAs
for i in range(0, DOAs.shape[0]):
j=0
if DOAs.iloc[int(i),3] != None: ### the rest only runs if the current, i, observation isn't null
newDOAs.iloc[int(j),0] = DOAs.iloc[int(i),3] ## sets the jth i in the new dataframe to the ith (currently assessed) row of the old
foundNull = True #Sets foundNull equal to true
k=1 ## sets the counter of people
while foundNull == True and (k+i) < 677:
if DOAs.iloc[int(i+k),3] == None: ### if the next one it looks at is null, increment the counter to add another person to the family
k = k+1
else:
newDOAs.iloc[int(j),1] = k ## sets second column in new dataframe equal to the size
j = j+1
foundNull = False
j=0

What you can do is get the indices of non-null entries in whatever column of your dataframe, then get the distances between each. Note: This is assuming they are nicely ordered and/or you don't mind calling .reset_index() on your dataframe.
Here is a sample:
df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index
null_counts = {}
for i in range(len(not_null_index)):
if i < len(not_null_index) - 1:
null_counts[not_null_index[i]] = not_null_index[i + 1] - 1 - not_null_index[i]
else:
null_counts[not_null_index[i]] = len(df.a) - 1 - not_null_index[i]
null_counts_df = pd.DataFrame({'nulls': list(null_counts.values())}, index=null_counts.keys())
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
Essentially all this code does is get the indices of non-null values in the dataframe, then gets the difference between each index and the next non-null index and puts that in the column. Then sticks those null_counts in a dataframe and merges it with the original.
After running this snippet, df_with_null_counts is equal to:
a nulls
0 1.0 3
4 2.0 2
7 3.0 2
Alternatively, you can use numpy instead of using a loop, which would be much faster for large dataframes. Here's a sample:
df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index
offset_index = np.array([*not_null_index[1:], len(df.a)])
null_counts = offset_index - np.array(not_null_index) - 1
null_counts_df = pd.DataFrame({'nulls': null_counts}, index=not_null_index)
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
And the output will be the same.

Related

Remove following rows that are above or under by X amount from the current row['x']

I am calculating correlations and the data frame I have needs to be filtered.
I am looking to remove the rows under the current row from the data frame that are above or under by X amount starting with the first row and looping through the dataframe all the way until the last row.
example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
if X = 10 it would start at 50 and see 51,52,53,54,55 as within that 10+- range and delete the rows. 70 would stay as it is not within that range and the same test would start again at 70 where 71,72,73,74,75 and respective rows would be deleted
the filter if X=10 would thus leave us with the rows including 50,75 for df.
It would leave me with a clean dataframe that deletes the instances that are linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that but I am left with the wrong result and desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index)/3
#Taking initial comparison values from first row
c = df6.iloc[0]['index']
#Including first row in result
filters = [True]
#Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
if c-boom <= row['index'] <= c+boom:
filters.append(False)
else:
filters.append(True)
# Updating values to compare based on latest accepted row
c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
OUTPUT BEFORE
OUTPUT AFTER
IIUC, your main issue is to filter consecutive values within a threshold.
You can use a custom function for that that acts on a Series (=column) to return the list of valid indices:
def consecutive(s, threshold = 10):
prev = float('-inf')
idx = []
for i, val in s.iteritems():
if val-prev > threshold:
idx.append(i)
prev = val
return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
variant
If you prefer the function to return a boolean indexer, here is a varient:
def consecutive(s, threshold = 10):
prev = float('-inf')
idx = [False]*len(s)
for i, val in s.iteritems():
if val-prev > threshold:
idx[i] = True
prev = val
return idx

Vectorized solution for pandas nested iterrows

Given an example dataframe:
example_df = pd.DataFrame({"app_id": [1,2,3,4,5,6] ,
"payment_date":["2021-01-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05","2020-01-04"],
"user_id": [12,12,12,13,13,13],
"application_date":["2021-02-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05", "2020-01-04"] , "flag": [1,0,0,1,0,1], "order_column": [1,2,3,4,5, 6]})
What should be done is:
I will explain what I want to do with an example:
Iterate through all rows
If the flag column is equal to 1 do as stated below
For the first row flag column is 1 and the user_id for the row is 12. Look at all instances with user_id= 12 and compare their application_date with the payment_date of the first row. We see that the second row has an application_date greater than the payment_date of the first row. Then the label of the first row is 1. Third row also belongs to user_id= 12 but its application_date is not greater than the payment_date of the first row. If there is one or more observation that has application_date greater than payment_date of the first row, overall label of the first row is 1. If there are no such observations the overall label is 0.
I wrote the code for this with iterrows, but I want a more compact vectorized solution since iterrows can be slow with larger datasets. Like
example_df.groupby("something").filter(lambda row: row. ...)
My code is:
labels_dict = {}
for idx, row in example_df.iterrows():
if row.flag == 1:
app_id = row.app_id
user_id = row.user_id
user_df = example_df[example_df.user_id == user_id]
labelss = []
for idx2, row2 in user_df.iterrows():
if (row2.order_column != row.order_column) & (row.payment_date < row2.application_date):
label = 1
labelss.append(label)
elif (row2.order_column != row.order_column) & (row.payment_date >= row2.application_date):
label = 0
labelss.append(label)
labels_dict[app_id] = labelss
final_labels = {}
for key, value in labels_dict.items():
if 1 in value:
final_labels[key] = 1
else:
final_labels[key] = 0
final_labels is the expected output. Basically, I am asking for all rows with flag= 1 to be labelled as 1 or 0 given the criteria I explained.
Desired output :
{1: 1, 4: 0, 6: 1}
Here keys are app_id and values are labels (either 0 or 1)
I would first built a temp dataframe with the only rows having 1 in flag and merge it with the full dataframe on user_id.
Then I will add a new boolean column being true if application_date is greater than payment_date and if the original app_id is different from the on from temp (ie different rows)
Finally it will be enough to count the number of true values per app_id and give a 1 if the number is greater than 0.
Pandas code could be:
tmp = example_df.loc[example_df['flag'] == 1,
['app_id', 'user_id', 'payment_date']]
tmp = tmp.merge(example_df.drop(columns = 'payment_date'), on='user_id')
tmp['k'] = ((tmp['app_id_x'] != tmp['app_id_y'])
& (tmp['application_date'] > tmp['payment_date']))
d = (tmp.groupby('app_id_x')['k'].sum() != 0).astype('int').to_dict()
With your data, it gives as expected:
{1: 1, 4: 0, 6: 1}
(i) Convert all dates to datetime objects
(ii) groupby "user_id" and for each group find the first "payment_date" using first and transform it for the entire DataFrame. Then compare it with the "application_date"s using lt (less than).
(iii) groupby "user_id" again to find how many entries satisfy the condition and assign values depending on whether the sum is greater than 1 or not.
example_df['payment_date'] = pd.to_datetime(example_df['payment_date'])
example_df['application_date'] = pd.to_datetime(example_df['application_date'])
example_df['flag_cumsum'] = example_df['flag'].cumsum()
example_df['first_payment_date < application_date'] = (example_df
.groupby(['flag_cumsum','user_id'])['payment_date']
.transform('first')
.lt(example_df['application_date']))
out = (example_df.groupby('flag_cumsum').agg({'app_id':'first',
'first_payment_date < application_date':'sum'})
.set_index('app_id')['first_payment_date < application_date']
.gt(0).astype(int)
.to_dict())
Output:
{1: 1, 4: 0}

How to iterate over rows of each column in a dataframe

My current code functions and produces a graph if there is only 1 sensor, i.e. if col2, and col3 are deleted in the example data provided below, leaving one column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)
index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3
baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = DataFrame(baseline)
baseline_num = DataFrame(baseline_num)
v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)
sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000
for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.
v[i] = dfn.loc[i].to_numpy().sum() - sensors
for colname,colitems in df.iteritems():
for rownum,rowitem in colitems.iteritems():
#d[rownum] = dfone.loc[rownum].to_numpy()
#d[colname][rownum] = df.loc[colname][rownum]
if v[rownum] >= sensitivity:
sensors_on = True
off_count = 0
baseline_num[rownum] = 0
else:
sensors_on = False
off_count += 1
if off_count == off_require:
for x in range(0, (off_require)):
baseline_num[colname][rownum] += df[colname][rownum - x]
elif off_count > off_require:
baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points
dfx = DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)
dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)
Basically, instead of 1 graph as this produces, I would like to iterate over each column only for this loop:
for colname,colitems in df.iteritems():
for rownum,rowitem in colitems.iteritems():
#d[rownum] = dfone.loc[rownum].to_numpy()
#d[colname][rownum] = df.loc[colname][rownum]
if v[rownum] >= sensitivity:
sensors_on = True
off_count = 0
baseline_num[rownum] = 0
else:
sensors_on = False
off_count += 1
if off_count == off_require:
for x in range(0, (off_require)):
baseline_num[colname][rownum] += df[colname][rownum - x]
elif off_count > off_require:
baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
I've tried some other solutions from other questions but none of them seem to solve this case.
As of now, I've tried multiple conversions to things like lists and tuples, and then calling them something like this:
baseline_num[i,column] += d[i - x,column]
as well as
baseline_num[i][column += d[i - x][column]
while iterating over the loop using
for column in columns
However no matter how I seem to arrange the solution, there is always some keyerror of expecting integer or slice indices, among other errors.
See pictures for expected/possible outputs of one column on actual data.with varying input parameters (sensitivity value, and off_require is varied in different cases.)
One such solution which didn't work was the looping method from this link:
https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/
I've also tried creating a loop using iteritems as the outer loop. This did not function as well.
Below are links to possible graph outputs for various sensitivity values, and windows in my actual dataset, with only one column. (i.e i manually deleted other columns, and plotted just the one using the current program)
sensitivity 1000, window 8
sensitivity 800, window 5
sensitivity 1500, window 5
If there's anything I've left out that would be helpful to solving this, please let me know so I can rectify it immediately.
See this picture for my original df.head:
df.head
Did you try,
for colname,colitems in df.iteritems():
for rownum,rowitem in colitems.iteritems():
print(df[colname][rownum])
The first loop iterates over all the columns, and the 2nd loops iterates over all the rows for that column.
edit:
From our conversation below, I think that your baseline and df dataframes don't have the same column names because of how you created them and how you are accessing the elements.
My suggestion is that you create the baseline dataframe to be a copy of your df dataframe and edit the information within it from there.
Edit:
I have managed to make your code work for 1 loop, but I run into an index error, I am not sure what your optimisation function does but i think that is what is causing it, take a look.
It is this part baseline_num[colname][rownum - 1], in the second loop i guess because you do rownum (0) -1, you get index -1. You need to change it so that in the first loop rownum is 1 or something, I am not sure what you are trying to do there.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)
index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3
baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = pd.DataFrame(df)
baseline_num = pd.DataFrame(df)
#print(baseline_num)
v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)
sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000
for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.
v[i] = dfn.loc[i].to_numpy().sum() - sensors
for colname,colitems in df.iteritems():
#print(colname)
for rownum,rowitem in colitems.iteritems():
#print(rownum)
#display(baseline[colname][rownum])
#d[rownum] = dfone.loc[rownum].to_numpy()
#d[colname][rownum] = df.loc[colname][rownum]
if v[rownum] >= sensitivity:
sensors_on = True
off_count = 0
baseline_num[rownum] = 0
else:
sensors_on = False
off_count += 1
if off_count == off_require:
for x in range(0, (off_require)):
baseline_num[colname][rownum] += df[colname][rownum - x]
elif off_count > off_require:
baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points
print(baseline[colname][rownum])
dfx = pd.DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)
dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)
My output looks like this,
-324.0
-238.0
-314.0
-276.0
-264.0
-306.0
-371.0
-806.0
638.0
-412.0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
354 try:
--> 355 return self._range.index(new_key)
356 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
355 return self._range.index(new_key)
356 except ValueError as err:
--> 357 raise KeyError(key) from err
358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: -1
I don't have enough rep to comment, but below is what I was able to work out. Hope it helps!
I tried to use the to_list() function while working out an answer, and it threw me an error:
AttributeError: 'DataFrame' object has no attribute 'to_list'
So, I decided to circumvent that method and came up with this:
indexes = [x for x in df.index]
row_vals = []
for index in indexes :
for val in df.iloc[i].values:
row_vals.append(val)
The object row_vals will contain all values in row order.
If you only want to get the row values for a particular row or set of rows, you would need to do this:
indx_subset = [`list of row indices`] #(Ex. [1, 2, 5, 6, etc...])
row_vals = []
for indx in indx_subset:
for val in df.loc[indx].values:
row_vals.append(val)
row_vals will then have all the row values from the specified indices.

Cover all columns using the least amount of rows in a pandas dataframe

I have a pandas dataframe looking like the following picture:
The goal here is to select the least amount of rows to have a "1" in all columns. In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Use sum per rows, then compare by Series.ge (>=) for greater or equal and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
It want test 1 or 0 values first compare by DataFrame.eq for equal ==:
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
def _getBestRowsFinalSelection(self, df, cols):
"""
Get the selected rows for the final selection
Parameters:
1. df: Dataframe to use
2. cols: Columns of the binary variables in the Dataframe object (df)
RETURNS -> DataFrame : dfSelected
"""
isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
lstIsOne = isOne.loc[:, cols].values.tolist()
lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
winningComb = None
stopFlag = False
for i in range(1, isOne.shape[0] + 1):
if stopFlag:
break;
combs = combinations(lstIsOne, i) #from itertools
for c in combs:
data = [x[1] for x in c]
index = [x[0] for x in c]
dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
if (dfTmp.sum() > 0).all():
dfTmp["Final Selection"] = "Yes"
winningComb = dfTmp
stopFlag = True
break;
return winningComb

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you see, there is no value 0.21 seconds exactly after another value, so I'll put the 1 in the outputTics column two rows after : an example would be at the index 3, there is a 1 at 11.4 seconds so I'm putting an 1 in the output column at 11.6 seconds
If there is a 1 in the "inputTics" column 0.21 second of earlier, do not put a one in the output column : an example would be at the index 1 in the input column
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta if you can to avoid python rounded numbers if you want
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
if row['inputTics'] == 0:
return 0
index = df[df == row].dropna().index
# check for a 1 in input within 0.11 seconds
t = row['Timestamp'] + pd.TimeDelta(seconds = 0.11)
indices = df[df.Timestamp <= t].index
c = 0
for i in indices:
if df.loc[i,'inputTics'] == 0:
c = c + 1
else:
c = 0
break
if c > 0:
df.loc[indices[-1] + 1, 'outputTics'] = 1
return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
# Compare indices to full dataframe's timestamps
# and return index of nearest timestamp
oi = np.argmax((A.index - ii)>=0)
output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = outputTics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A[A['outputTic']<0] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0

Categories