How to time-efficiently remove values next to 'NaN' values? - python
I'm trying to remove wrong values from my data (a series of 15 million values, ~700 MB). The values to be removed are the ones next to 'nan' values, e.g.:
Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9
Numbers surrounded by slashes, i.e. /1/, /2/, /4/, /8/, are the values that should be removed.
The problem is that the following code takes far too long to run:
%%time
import numpy as np
import pandas as pd
# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]
# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)
# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i] - df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences
# add 'nan' to data in form of a string.
for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'
#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan': wrong, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
How to make it more time-efficient?
This is still a work in progress for me. I knocked 100x off your dummy data size to get down to something I could stand to wait for.
I also added this code at the top of my version:
import time
current_milli_time = lambda: int(round(time.time() * 1000))
def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))
This just prints a string with a time-mark in front of it, to see what's taking so long.
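(As an aside, and not part of the original answer: time.perf_counter() is a monotonic clock with finer resolution than time.time(), so a roughly equivalent helper could be sketched like this.)
import time
_t0 = time.perf_counter()
def mark(s):
    # seconds elapsed since the script started, printed with millisecond precision
    print("[{:.3f}] {}".format(time.perf_counter() - _t0, s))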
With that done, in your 'difference' column computation, you can replace the manual list generation with a vector operation. This code:
df = pd.DataFrame(data_dict)
mark("Got DataFrame")
# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences
mark("difference 1")
df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')
print(df[:10])
Produces this output:
[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
next_speed speed difference difference2
0 18.008314 20.182982 -2.174669 -2.174669
1 14.736095 18.008314 -3.272219 -3.272219
2 5.352993 14.736095 -9.383102 -9.383102
3 5.854199 5.352993 0.501206 0.501206
4 2.003826 5.854199 -3.850373 -3.850373
5 12.736061 2.003826 10.732236 10.732236
6 2.512623 12.736061 -10.223438 -10.223438
7 18.224716 2.512623 15.712093 15.712093
8 14.023848 18.224716 -4.200868 -4.200868
9 15.991590 14.023848 1.967741 1.967741
Notice that the two difference columns are the same, but the second version took about 8 seconds less time. (Presumably 800 seconds when you have 100x more data.)
I did the same thing in the 'nanify' process:
df.difference2[df.difference2 < -2] = np.nan
The idea here is that a comparison operator applied to a Series produces a boolean Series of the same length. That boolean Series can be used as an index, so df.difference2 < -2 becomes (in essence) a list of the positions where the condition is true, and you can use it to index either df (the whole table) or any single column of df, like df.difference2. It's a fast, vectorized shorthand for the otherwise-slow Python for loop.
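(A note that is not part of the original answer: newer pandas versions warn about this kind of chained assignment; the same update written with .loc avoids the warning. A minimal sketch:)
# equivalent to the line above, but avoids the SettingWithCopyWarning from chained indexing
df.loc[df['difference2'] < -2, 'difference2'] = np.nan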
Update
Okay, finally, here is a version that vectorizes the "Time-inefficient Loop". I'm just pasting the whole thing in at the bottom, for copying.
The premise is that the Series.isnull() method returns a boolean Series (column) that is true if the contents are "missing" or "invalid" or "bogus." Generally, this means NaN, but it also recognizes Python None, etc.
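(A tiny illustration of that premise, added here and not part of the original answer:)
pd.Series([1.0, np.nan, 3.0]).isnull()
# 0    False
# 1     True
# 2    False
# dtype: bool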
The tricky part, in pandas, is shifting that column up or down by one to reflect "around"-ness.
That is, I want another boolean column, where col[n-1] is true if col[n] is null. That's my "before a nan" column. And likewise, I want another column where col[n+1] is true if col[n] is null. That's my "after a nan" column.
It turns out I had to take the damn thing apart! I had to reach in and extract the underlying numpy array using the Series.values attribute, so that the pandas index would be discarded. Then a new index is created, starting at 0, and everything works again. (If you don't strip the index, the columns "remember" what their numbers are supposed to be. So even if you delete column[0], the column doesn't shift down. Instead, it knows "I am missing my [0] value, but everyone else is still in the right place!")
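(An aside, not from the original answer: Series.shift() can express the same "move the mask up or down by one" idea without reaching for the raw numpy array. Treat this as a sketch of an alternative, not the answer's method.)
missing = df.difference2.isnull()
df['before_nan'] = missing.shift(-1, fill_value=False)  # True on the row just before a NaN
df['after_nan'] = missing.shift(1, fill_value=False)    # True on the row just after a NaN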
Anyway, with that figured out, I was able to build three columns (needlessly - they could probably be parts of an expression) and then merge them together into a fourth column that indicates what you want: the column is True when the row is before, on, or after a nan value.
missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
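(A possible last step, my addition rather than part of the original answer: once the combined mask exists, the flagged rows can be filtered or marked in one vectorized operation.)
# keep only rows that are neither NaN nor adjacent to a NaN
clean = df[~df.around_nan]
# or, to mirror the original loop, mark just the neighbours; 'status' is an illustrative new column
df.loc[df.before_nan | df.after_nan, 'status'] = 'wrong'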
Here's the whole thing:
import numpy as np
import pandas as pd
import time
current_milli_time = lambda: int(round(time.time() * 1000))
def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))
# sample data
speed = np.random.uniform(0,25,150000)
next_speed = speed[1:]
# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)
mark("Got DataFrame")
# calculate difference between the current speed and the next speed
list_of_differences = []
#for i in df.index:
#difference = df.next_speed[i]-df.speed[i]
#list_of_differences.append(difference)
#df['difference'] = list_of_differences
#mark("difference 1")
df['difference'] = df['next_speed'] - df['speed']
mark('difference 2')
df['difference2'] = df['next_speed'] - df['speed']
# add 'nan' to data in form of a string.
#for i in range(len(df.difference)):
## arbitrary condition
#if df.difference[i] < -2:
#df.difference[i] = 'nan'
df.difference[df.difference < -2] = np.nan
mark('nanify')
df.difference2[df.difference2 < -2] = np.nan
mark('nanify 2')
missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
mark('looped')
#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan': wrong, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
mark('time-inefficient loop done')
I am assuming that you don't want either the 'nan' or the wrong values, and that the nan values are few compared to the size of the data. Please try this:
nan_idx = df[df['difference']=='nan'].index.tolist()
from copy import deepcopy
drop_list = deepcopy(nan_idx)
for i in nan_idx:
    if (i+1) not in drop_list and (i+1) < len(df):
        drop_list.append(i+1)
    if (i-1) not in drop_list and (i-1) >= 0:
        drop_list.append(i-1)
df = df.drop(df.index[drop_list])
If nan is not a string but an actual NaN (the marker for missing values), then use this to get its indexes:
nan_idx = df[pd.isnull(df['difference'])].index.tolist()
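(Not part of the original answer: with real NaN markers, the same "drop a value and its neighbours" idea can also be written without the Python loop. A minimal sketch, assuming the difference column from the question.)
mask = df['difference'].isnull()
drop_mask = mask | mask.shift(1, fill_value=False) | mask.shift(-1, fill_value=False)
df = df[~drop_mask]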
Related
Different ways of iterating through pandas DataFrame
I am currently working on a short pandas project. The project assessment keeps marking this task as incorrect for me even though the resulting list appears to be the same as when the provided correct code is used. Is my code wrong and it just happens to give the same results for this particular DataFrame?
My code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children':
        colors.append('red')
    elif t == 'Documentaries':
        colors.append('blue')
    elif t == 'Stand-up':
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[:10])
Provided code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for lab, row in netflix_movies_col_subset.iterrows():
    if row['genre'] == 'Children':
        colors.append('red')
    elif row['genre'] == 'Documentaries':
        colors.append('blue')
    elif row['genre'] == 'Stand-up':
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[0:10])
I've always been told that the best way to iterate over a dataframe row by row is NOT TO DO IT. In your case, you could very nicely use df.eq(). First create a dataframe that holds all genres (df_genres), then use netflix_movies_col_subset['genre'].eq(df_genres, axis=0). This should create a dataframe that has a row for every movie and a column for every genre. If a certain movie is a documentary, the values in all columns would be False and only the Documentaries column would be True. This method is multiple orders of magnitude faster than iterating with multiple if statements.
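(An illustrative sketch, not taken from this answer: another common vectorized route for a genre-to-color mapping like this is Series.map with a fallback, using the netflix_movies_col_subset name from the question.)
color_map = {'Children': 'red', 'Documentaries': 'blue', 'Stand-up': 'green'}
colors = netflix_movies_col_subset['genre'].map(color_map).fillna('black').tolist()
print(colors[:10])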
Does this help? I haven't tested it yet.
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children':
        x = 'red'
    elif t == 'Documentaries':
        x = 'blue'
    elif t == 'Stand-up':
        x = 'green'
    else:
        x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])
Or you can use a match/case statement (Python 3.10+):
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    match t:
        case 'Children':
            x = 'red'
        case 'Documentaries':
            x = 'blue'
        case 'Stand-up':
            x = 'green'
        case _:
            x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])
python pandas: attempting to replace value in row updates ALL rows
I have a simple CSV file named input.csv as follows:
name,money
Dan,200
Jimmy,xd
Alice,15
Deborah,30
I want to write a python script that sanitizes the data in the money column: every value that has non-numerical characters needs to be replaced with 0.
This is my attempt so far:
import pandas as pd

df = pd.read_csv(
    "./input.csv",
    sep = ","
)

# this line is the problem: it doesn't update on a row by row basis, it updates all rows
df['money'] = df['money'].replace(to_replace=r'[^0‐9]', value=0, regex=True)

df.to_csv("./output.csv", index = False)
The problem is that when the script runs, because the invalid money value xd exists on one of the rows, it changes ALL money values to 0 for ALL rows. I want it to ONLY change the money value for the second data row (Jimmy), which has the invalid value.
This is what it gives at the end:
name,money
Dan,0
Jimmy,0
Alice,0
Deborah,0
but what I need it to give is this:
name,money
Dan,200
Jimmy,0
Alice,15
Deborah,30
What is the problem?
You can use:
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)
The above assumes all valid values are integers. You can leave off the .astype(int) if you want float values.
Another option would be to use a converter function in the read_csv method. Again, this assumes integers. You can use float(x) in place of int(x) if you expect float money values:
def convert_to_int(x):
    try:
        return int(x)
    except ValueError:
        return 0

df = pd.read_csv(
    'input.csv',
    converters={'money': convert_to_int}
)
Some list comprehension could work for this (given the "money" column has no decimals):
df.money = [x if type(x) == int else 0 for x in df.money]
If you are dealing with decimals, then something like:
df.money = [x if (type(x) == int) or (type(x) == float) else 0 for x in df.money]
... will work. Just know that pandas will convert the entire "money" column to float (decimals).
Different results from interpolation if (same data) is done with timeindex
I get different results from interpolation if the same data is interpolated with a time index. How can that be? The pandas docs say:
The 'krogh', 'piecewise_polynomial', 'spline', 'pchip' and 'akima' methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation and SciPy tutorial.
The sub-methods of interpolate(method=...) where I noticed this strange behavior are (among others): ['krogh', 'spline', 'pchip', 'akima', 'cubicspline'].
Reproducible sample (with comparison):
import numpy as np, pandas as pd
from math import isclose

# inputs:
no_timeindex = False       # reset both dataframes' indices to numerical indices, for comparison.
no_timeindex_for_B = True  # reset only the dataframe indices of the first approach to numerical indices; the other one stays datetime, for comparison.
holes = True               # create a date-timeindex that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17, instead of a perfectly frequent one.
o_ = 2                     # order parameter for interpolation.
method_ = 'cubicspline'
#------------------+
n = np.nan
arr = [n,n,10000000000,10,10,10000,10,10,10,40,4,4,9,4,4,n,n,n,4,4,4,4,4,4,18,400000000,4,4,4,n,n,n,n,n,n,n,4,4,4,5,6000000000,4,5,4,5,4,3,n,n,n,n,n,n,n,n,n,n,n,n,n,4,n,n,n,n,n,n,n,n,n,n,n,n,n,n,2,n,n,n,10,1000000000,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,1,n,n,n,n,n,n,n,n,n]
#--------------------------------------------------------------------------------+
df = pd.DataFrame(arr)  # create dataframe from array.
if holes:
    # create a date-timeindex that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17.
    ix = pd.date_range("01.01.2000", periods=len(df)+(2+5), freq="T")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else:
    # create a perfectly frequent datetime-index without any holes.
    ix = pd.date_range("01.01.2000", periods=len(df)+2, freq="T")[2:]
    df.index = ix
# if wanted, drop the timeindex and use integer indices instead
if no_timeindex == True:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside')  # interpolate.
df.index = ix  # set index equal to the second approach, for comparing later.
A = df.copy(deep=True)  # create a copy, to compare the result with the second approach later.
#------------------------------+
# second approach with a numerical index instead of a time index
df = pd.DataFrame(arr)  # create dataframe from array.
if holes:
    # create a date-timeindex that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17.
    ix = pd.date_range("01.01.2000", periods=len(df)+(2+5), freq="T")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else:
    # create a perfectly frequent datetime-index without any holes.
    ix = pd.date_range("01.01.2000", periods=len(df)+2, freq="T")[2:]
    df.index = ix
# if wanted, drop the timeindex and use integer indices instead
if no_timeindex == True or no_timeindex_for_B == True:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside')  # interpolate.
df.index = ix  # set index equal to the first approach, for comparing later.
B = df.copy(deep=True)  # create a copy, to compare the result with the first approach later.
#--------------------------------------------------------------------------------+
# compare:
if A.equals(B) == False:
    # if the values aren't equal, count the ones that aren't.
    i = 0
    for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
        if x != y and not (np.isnan(x) and np.isnan(y)):
            print(x, " ?= ", y, " ", (x == y), abs(x - y))
            i += 1
    # if there are no different values, ...
    if i == 0:
        print(" both are the same. ")
    else:
        # if there are different values, ...
        # count those different values that are NOT almost the same.
        not_almost = 0
        for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
            if not (np.isnan(x) and np.isnan(y)):
                if isclose(x, y, abs_tol=0.000001) == False:
                    not_almost += 1
        # if all values are almost the same, ...
        if not_almost == 0:
            print(" both are not, but almost the same. ")
        else:
            print(" both are definitely not the same. ")
else:
    print(" both are the same. ")
This shouldn't be the case, since the pandas docs state otherwise. Why does it happen anyway?
if and for loop in one line
I am approaching an excel file via openpyxl and I need to do an elif statement and a for loop in the same line of code. What I want to achieve is this: check if the value is not None; if it is not None, loop over a column looking for the index of a matched value. If you do not find the value in this column, check the second column where the value is. I have to create an elif statement which would guide the machine to do a similar thing as it is doing in the 'else:' statement, which I can handle since I can write it as a chain over multiple lines.
The code I have:
for each in sheet['G'][1:]:
    indexing_no = int(sheet['G'].index(each) + 1)
    indexing_column = int(sheet['G'].index(each))
    if each.value == None:
        pass
    else:
        for search_value in sheet['A'][1:]:
            if each.value == search_value.value:
                index_no = int(sheet['A'].index(search_value) + 1)
                sheet['H{name}'.format(name = indexing_no)].value = sheet['B{name}'.format(name = index_no)].value
You could try this:
columns = ("G", "A", "D")  # column letters to search in parallel
values = ( ((c, v) for v in sheet[c][1:]) for c in columns )
for row, colVal in enumerate(zip(*values), 1):
    col, value = next( ((c, v) for c, v in colVal if v is not None), ("", None) )
    if not col:
        continue  # or break when no column has any value
    # ...
    # perform common work on the first column that has a non-None value,
    # using col as the column letter and value for the value of the cell
    # at sheet[col][row]
Randomization of a list with conditions using Pandas
I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code, and that's actually my problem. My goal is to create a somewhat-automated script, probably including a for-loop (I've tried unsuccessfully). The main aim is to create a randomization loop which takes the original dataset, shown in the attached "dataset" screenshot. From this data set it should pick randomly row by row and save the rows one by one to another excel list. The point is that the values in the columns called position01 and position02 should always be selected so that they do not match the previous pick in either of those two column values. That should eventually create an excel sheet with randomized rows, where each row never includes the position values of the previous pick. So row02 should not include any of the values in columns position01 and position02 of row01, row03 should not contain values of row02, etc. It should also iterate over the range of the list length, which is 0-11. The excel output is also important, since I need the rest of the columns; I just need to shuffle the order. I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware of the fact that there is probably a much neater solution than this.)
import pandas as pd
import random

dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")

# original data set used for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()

randomPick = imageDataset.sample()              # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick)      # append a row to the empty df
randomPickIndex = randomPick.index.tolist()     # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex)  # delete the row with the index selected before

# getting raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]

# getting a dataset not including the row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]

# pick another row from the dataset not including the row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in the empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in the next step
randomPick2Index = randomPick2.index.tolist()
# delete the other row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)

# AND REPEAT the procedure of comparing the raw values with the dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]

isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]

# AND REPEAT with another pick - save - match - pick again... until the end of the length of the dataset (which is 0-11)
So in the end I've used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is a link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! And hopefully I did not spam it over here too much.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])

df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)

while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty:  # looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)