Efficient way to reference the previous row in Python (pandas)

I want to substitute the previous row's value whenever a 0 is found in a column of the dataframe. I used the following code,
if not a[j]:
    a[j] = a[j-1]
and also
if a[j] == 0:
    a[j] = a[j-1]
Update:
Complete code updated:
for i in pd.unique(r.a):
    sub = r[r.vehicle_id == i]
    sub = pd.DataFrame(sub, columns=['a', 'b', 'c', 'd', 'e'])
    sub = sub.drop_duplicates(["a", "b", "c", "d"])
    sub['c'] = pd.to_datetime(sub['c'], unit='s')
    for j in range(1, len(sub[1:])):
        if not sub.d[j]:
            sub.d[j] = sub.d[j-1]
        if not sub.e[j]:
            sub.e[j] = sub.e[j-1]
    sub = sub.drop_duplicates(["lash_angle", "lash_check_count"])
This is the start of my code; it is only the sub.d[j] line that is slowing it down.
Both of these seem to work well with integer values. But one of the columns contains decimal values, and when I use this code on that column, the statement takes a huge amount of time (nearly 15-20 seconds) to complete. I am looping through nearly 10,000 ids, and wasting 15 seconds at this step makes my entire code inefficient. Is there a better way to do this for the float (decimal) values, so that it runs much faster?
Thanks

Assuming that by "column of the dataframe" you mean a column (Series) of a pandas DataFrame, then one trick is to replace the 0 with NaN and then forward-fill. For example:
>>> df = pd.DataFrame(np.random.randint(0,4, 10**6))
>>> df.head(10)
   0
0  0
1  3
2  3
3  0
4  1
5  2
6  3
7  2
8  0
9  3
>>> df[0] = df[0].replace(0, np.nan).ffill()
>>> df.head(10)
     0
0  NaN
1    3
2    3
3    3
4    1
5    2
6    3
7    2
8    2
9    3
where you can decide for yourself how you want to handle the case of a 0 at the start, where you have no value to fill. This assumes that there aren't already NaN values you want to leave alone, but if there are, you can just use a mask with .loc to select only the ones you want to change.
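As a minimal sketch of that masked approach (using a made-up Series s, not the asker's data): forward-fill a copy in which only the zeros are masked, then write the filled values back at just the zero positions, so any pre-existing NaN survive.
import numpy as np
import pandas as pd

# hypothetical column: the zeros should be filled, the existing NaN left alone
s = pd.Series([1.0, 0.0, np.nan, 0.0, 5.0, 0.0])
zeros = s == 0
# ffill a masked copy, then assign back only at the zero positions;
# the pre-existing NaN at position 2 is untouched
s.loc[zeros] = s.mask(zeros).ffill()[zeros]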

Related

Is there an easy way to zero time with each new condition in a pandas dataframe?

I have a big-ass time series data frame where one condition changes at variable intervals. I would like to zero the time with each new condition, so I converted the categories into integers and created a new column using the .diff() to indicate the rows where the switch occurs with non-zero values. Then I made a new column, "Mod_time" as a container for the new time values that zero at each new condition. This is what I want the table to look like:
Time  Condition  Numerical Condition  Fruit_switch  Mod_time
0     Apples     6                    nan           0
1     Apples     6                    0             1
2     Apples     6                    0             2
3     Apples     6                    0             3
4     Oranges    2                    -4            0
5     Oranges    2                    0             1
I tried iterrows:
for index, row in gas_df.iterrows():
    if row['gas_switch'] != 0:
        gas_df.loc[[index], ["Mod_time"]] = 0
    else:
        gas_df.loc[[index], ["Mod_time"]] = gas_df.loc[[str(int(index)-1)], ["Mod_time"]] + 1
But I got the error "None of [Index(['0'], dtype='object')] are in the [index]". It seems that iterrows is blind to everything but the one row it's looking at.
I also tried using enumerate instead of iterrows and got the same error.
Any suggestions or search terms would be appreciated.
There is a whole family of problems that involve a cumulative sum with reset. This one can be seen as such: you'd like to take the cumulative sum of the time differences, resetting whenever the "Numerical Condition" changes.
import numpy as np

def cumsum_reset(v, reset):
    v = v.copy()
    c = np.cumsum(~reset)
    v[reset] = -np.diff(np.r_[0, c[reset]])
    return np.cumsum(v)

# application
cond = df['Numerical Condition']
df['Mod_time'] = cumsum_reset(np.diff(np.r_[0, df['Time']]), cond != cond.shift())
On your data:
   Time Condition  Numerical Condition  Fruit_switch  Mod_time
0     0    Apples                    6           NaN         0
1     1    Apples                    6           0.0         1
2     2    Apples                    6           0.0         2
3     3    Apples                    6           0.0         3
4     4   Oranges                    2          -4.0         0
5     5   Oranges                    2           0.0         1
Edit
From the comments, it sounds like the reset should really come from when df['Condition'] (the fruit name) changes. Also, the time difference between the rows is always one. Therefore, the following should work as well:
c = df['Condition']
df['Mod_time'] = cumsum_reset(np.ones_like(c), c.shift() != c)
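As a further sketch (not part of the original answer): if the reset is driven purely by consecutive runs of Condition and every row is one time step, labelling the runs and numbering the rows within each run does the same job in plain pandas:
# label consecutive runs of Condition, then count rows within each run
# (assumes one row per time step, as in the example above)
run_id = (df['Condition'] != df['Condition'].shift()).cumsum()
df['Mod_time'] = df.groupby(run_id).cumcount()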

Pandas: Replace/ Change Duplicate values within a Time Range

I have a pandas dataframe where I am trying to replace/change the duplicate values to 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this to look at a value from a row up or down (or several rows, specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have an identical value to one of the 2 rows above and then replace them with 0.
Something like this should work :
(edited the code to make it flexible for endless number of columns and flexible for the number of days)
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
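As a usage sketch (reusing the set_zeros function above and assuming the frame has a DatetimeIndex, as in the question), the window can be pulled out into a parameter so that trying 5D, 10D or 20D becomes a one-line change:
def zero_duplicates(df, window='3D'):
    # hypothetical wrapper around the loop above; `window` is any pandas offset alias
    out = df.copy()
    for c in out.columns:
        out[c] = out.groupby([c, pd.Grouper(freq=window)])[c].transform(set_zeros)
    return out

result = zero_duplicates(df, window='5D')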

Problems transforming a pandas dataframe

I have trouble converting a pandas dataframe into the format I need in order to analyze it further. The data comes from a survey where we asked people to rank their preferred means of communication (1 = highest, 4 = lowest). Every row is a respondent.
The current dataframe:
A B C D
0 1 2 4 3
1 2 3 1 4
2 2 1 4 3
3 2 1 4 3
4 1 3 4 2
...
For data analysis I want to transform this into the following dataframe, where every row is a different means of communication and the columns count how often it was ranked in that spot.
1st 2nd 3rd 4th
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
I have tried applying self-defined functions to the original dataframe, and I have tried .groupby and .T on the dataframe, but I don't seem to come closer to the result I actually want.
This is the function I wrote but I can't figure out how to apply it correctly to give me the desired result.
def count_values_rank(column, rank):
    total_count_n1 = 0
    for i in column:
        if i == rank:
            total_count_n1 += 1
    return total_count_n1
Running this piece of code on a single column of my dataframe gets the desired result, but I'm having trouble writing it so that I can apply it to the whole dataframe and get the result I am looking for. The line of code below would return 2.
count_values_rank(df.iloc[:,0],'1')
It is probably a really obvious solution, but I'm having trouble seeing the easiest way to solve this.
Thanks a lot!
melt with crosstab
pd.crosstab(df.melt().variable,df.melt().value).add_suffix('st')
Out[107]:
value 1st 2st 3st 4st
variable
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
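An alternative sketch (not from the answer above) counts the ranks per column and transposes; renaming the columns this way assumes all four ranks actually occur at least once in the data:
# count how often each rank appears in each column, then flip to one row per option
out = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
out.columns = ['1st', '2nd', '3rd', '4th']  # assumes every rank 1-4 appears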

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1's and -1's
    global df1, df2
    deletions = []
    for i in xrange(len(results)-1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting in the if statement instead of appending to a list first.)
Your index values are not unique and when you use drop it is removing all rows with those index values. to_delete may have been of length 50 but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
use np.in1d to find complement of to_del
This is more self-explanatory than the others. I'm taking an array of positions from 0 to n and checking whether each one is in to_del. The result will be a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where it is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9

shift columns of a dataframe without looping?

Consider this toy example. I need to shift each column down by one times its position in the array, so a kind of diagonal shift:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(5,5)),columns=list("ABCDE"))
for i, k in enumerate(df):
    df[k] = df[k].shift(i)
transforms:
A B C D E
0 6 1 6 3 1
1 2 7 5 9 7
2 6 6 6 9 8
3 7 8 8 2 8
4 5 2 9 9 2
into
A B C D E
0 6 NaN NaN NaN NaN
1 2 1 NaN NaN NaN
2 6 7 6 NaN NaN
3 7 6 5 3 NaN
4 5 8 6 9 1
which is what I want.
However, for larger dataframes with hierarchical indexes, this looping method does not seem feasible; in fact, I've got an IPython notebook that has been running for almost an hour now with no end in sight.
This makes me think that there must be an easier, perhaps vectorized way, perhaps using some kind of "apply", but I'm not sure how to do that when each column needs to be shifted down as a function of its position in the array.
Unless you really have a lot of data (dozens of gigabytes), shifting it does not take hours. It seems that the time is spent in rebuilding the indices. Especially with hierarchical indexing it is possible that the complex indices are rebuilt after each shift. If your tables are large, this may take a lot of time.
One possible approach (at least to isolate the problem) is to just extract the data into a np.array (take the .values), shift it, and recreate the DataFrame. In numpy shifting the data is rather trivial by, e.g.:
for c in range(1, a.shape[1]):
    a[c:, c] = a[:-c, c]
    a[:c, c] = np.nan
Shifting a float array with 500 columns and a million rows (4 GB array) with this code took my computer approximately 12 seconds, but the total time will depend heavily on your indexing and the cost of recreating it.
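Put together, a minimal sketch of that extract-shift-rebuild idea (writing into a fresh array rather than shifting in place, which avoids any worry about overlapping views) might look like this:
import numpy as np
import pandas as pd

def diagonal_shift(df):
    # shift column i down by i rows, padding the top with NaN
    a = df.values.astype(float)    # float so the NaN padding is representable
    out = np.full_like(a, np.nan)
    out[:, 0] = a[:, 0]            # the first column is not shifted
    for c in range(1, a.shape[1]):
        out[c:, c] = a[:-c, c]
    # rebuild the frame once, keeping the original index and columns
    return pd.DataFrame(out, index=df.index, columns=df.columns)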
