Hi, I have a DataFrame and I would like to find the index whenever the cumulative sum of one of its columns reaches a threshold. The running sum should then reset and start accumulating again.
For example:
import numpy as np
import pandas as pd

d = np.random.randn(10, 1) * 2
df = pd.DataFrame(d.astype(int), columns=['data'])
pd.concat([df,df.cumsum()],axis=1)
Output:
Out[34]:
data data1
0 1 1
1 2 3
2 3 6
3 2 8
4 0 8
5 1 9
6 0 9
7 -1 8
8 1 9
9 2 11
So in the sample data above, data1 is the cumulative sum of the data column. If I set thres=5, then whenever the running sum of the data column is greater than or equal to 5, I save the index. After that, the running sum resets and starts again until the next time it reaches 5 or more.
Right now I am doing a loop, keeping track of the running sum and manually resetting it (roughly like the sketch below). I was wondering if there is a fast vectorized way to do this in pandas, as my DataFrame is millions of rows long.
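For reference, a minimal sketch of the kind of loop I mean (names are just illustrative):

def threshold_indices_loop(s, thres=5):
    # Track a running sum, save the index when it reaches thres, then reset.
    hits, running = [], 0
    for idx, val in s.items():
        running += val
        if running >= thres:
            hits.append(idx)
            running = 0
    return hits

threshold_indices_loop(df['data'], thres=5)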
Thanks
I'm not familiar with pandas but my understanding is that it is based on numpy. Using numpy you can define custom functions that can be used with accumulate.
Here is one that I think is close to what you're looking for:
import numpy as np
def capsum(array, cap):
    capAdd = np.frompyfunc(lambda a, b: a + b if a < cap else b, 2, 1)
    return capAdd.accumulate(array, dtype=object)

values = np.random.rand(1000000) * 3 // 1
result = capsum(values, 5)  # --> produces the result in 0.17 sec.
I believe (or I hope) you can use numpy functions on dataframes.
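Tying this back to the question's DataFrame, here is a rough sketch (assuming the capsum above and thres=5 from the question): apply it to the column and take the positions where the capped running sum reaches the threshold.

capped = pd.Series(capsum(df['data'].to_numpy(), 5), index=df.index).astype(int)
reset_idx = df.index[capped >= 5]  # indices where the running sum hit the threshold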
I am trying to implement a small algorithm that creates a new column in my DataFrame depending on whether a certain quantity computed from another column exceeds a threshold or not. The formula looks like this:
df.loc[:, 'new_column'] = 0
df.loc[sum(abs(np.diff(df.loc[:, 'old_column']) / df.loc[:, 'old_column'].mean())) > threshold, 'new_column'] = 1
However, I don't want to apply this formula over the whole height of the DataFrame; instead I'd like to use a rolling window, i.e. the mean value in the formula should roll through the rows of the DataFrame. I found this page in the documentation but don't know how to apply it to a formula like mine. How could I do something like this?
Using column names a and b instead of old_column and new_column:
df = pd.DataFrame(np.random.randint(10, size=10), columns=['a'])
window = 3
val = df['a'].diff().abs() / df['a'].rolling(window, 1).mean()
threshold = 1
condition = (val > threshold)
df['b'] = 0
df.loc[condition, 'b'] = 1
Example results:
a b
0 1 0
1 7 1
2 6 0
3 1 1
4 8 1
5 2 1
6 3 0
7 1 0
8 0 0
9 8 1
Pay close attention to NaN values in the intermediate results. diff() returns NaN in the first row, unlike np.diff(), which returns an array one element shorter than the input.
And rolling().mean() will return NaN values depending on your min_periods parameter.
The final result contains no NaN values because (val > threshold) is always False for NaN inputs.
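If you prefer a single assignment, here is a sketch of the same logic using numpy.where (comparisons against NaN evaluate to False, so those rows get 0):

import numpy as np

df['b'] = np.where(df['a'].diff().abs() / df['a'].rolling(window, min_periods=1).mean() > threshold, 1, 0)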
I am trying to add a running count to a pandas df.
For the values in Column A, I want to add '5' and for values in Column B I want to add '1'.
So for the df below I'm hoping to produce:
A B Total
0 0 0 0
1 0 0 0
2 1 0 5
3 1 1 6
4 1 1 6
5 2 1 11
6 2 2 12
So every increment of 1 in Column A adds 5 to the total, while every increment in Column B adds 1.
I tried:
df['Total'] = df['A'].cumsum(axis = 0)
But this doesn't include Column B.
df['Total'] = df['A'] * 5 + df['B']
As far as I can tell, you are simply doing row-wise operations, not a cumulative sum. This snippet multiplies each row's value of A by 5 and adds that row's value of B. Please don't make it any more complicated than it really is.
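A quick sketch reproducing the expected output with that one-liner, using the values from the question:

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1, 1, 2, 2],
                   'B': [0, 0, 0, 1, 1, 1, 2]})
df['Total'] = df['A'] * 5 + df['B']
# Total: 0, 0, 5, 6, 6, 11, 12 -- matching the table above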
What is a cumulative sum (also called running total)?
From Wikipedia:
Consider the sequence < 5 8 3 2 >. What is the total of this sequence?
Answer: 5 + 8 + 3 + 2 = 18. This is arrived at by simple summation of the sequence.
Now we insert the number 6 at the end of the sequence to get < 5 8 3 2 6 >. What is the total of that sequence?
Answer: 5 + 8 + 3 + 2 + 6 = 24. This is arrived at by simple summation of the sequence. But if we regarded 18 as the running total, we need only add 6 to 18 to get 24. So, 18 was, and 24 now is, the running total. In fact, we would not even need to know the sequence at all, but simply add 6 to 18 to get the new running total; as each new number is added, we get a new running total.
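In pandas, a running total is exactly what cumsum gives you; a tiny illustration with the sequence above:

import pandas as pd

s = pd.Series([5, 8, 3, 2, 6])
print(s.cumsum().tolist())  # [5, 13, 16, 18, 24]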
I am trying to do some data manipulation using pandas. I have an Excel file with two columns, x and y. The number of elements in column x corresponds to the number of connections (n_arrows) each makes with an element in column y, and the number of unique elements in column x corresponds to the number of unique points (n_nodes). What I want to do is generate a random DataFrame (10^4 times) from the unique elements in column x and the elements in column y. The code I was trying to work on is attached. Any suggestions will be appreciated.
import pandas as pd
import numpy as np
df = pd.read_csv('/home/amit/Desktop/playing_with_pandas.csv')
num_nodes = df.drop_duplicates(subset='x', keep="last")
n_arrows = [32] #32 rows corresponds to 32
n_nodes = [10]
n_arrows_random = np.random.randn(df.x)
Here are 2 methods:
Solution 1: If you need the x and y values to be independently random:
Given a sample df (thanks @AmiTavory):
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 2, 3, 4]})
Using numpy.random.choice, you can do this to select random values from your x column and random values from your y column:
def simulate_df(df, size_of_simulated_df):
    return pd.DataFrame({'x': np.random.choice(df.x, size_of_simulated_df),
                         'y': np.random.choice(df.y, size_of_simulated_df)})
>>> simulate_df(df, 10)
x y
0 1 3
1 1 3
2 1 4
3 1 4
4 2 1
5 2 3
6 1 2
7 1 4
8 1 2
9 1 3
The function simulate_df returns random values sampled from your original dataframe in the x and y columns. The size of your simulated dataframe can be controlled by the argument size_of_simulated_df, which should be an integer representing the number of rows you want.
Solution 2: As per your comments, you might want to return a dataframe of random rows that maintains the x->y correspondence. Here is a vectorized pandas way to do that:
def simulate_df(df=df, size_of_simulated_df=10):
    return df.sample(size_of_simulated_df, replace=True).reset_index(drop=True)
>>> simulate_df()
x y
0 1 2
1 2 4
2 2 4
3 2 4
4 1 1
5 1 3
6 1 3
7 1 1
8 1 1
9 1 3
Assigning your simulated Dataframes for future reference:
In the likely scenario you want to do some sort of calculation on your simulated dataframes, I'd recommend saving them to some sort of dictionary structure using a loop like this:
dict_of_dfs = {}
for i in range(100):
    dict_of_dfs['df'+str(i)] = simulate_df(df, len(df))
Or a dictionary comprehension like this:
dict_of_dfs = {'df'+str(i): simulate_df(df, (len(df))) for i in range(100)}
You can then access any one of your simulated dataframes in the same way you would access any dictionary value:
# Access the 48th simulated dataframe:
>>> dict_of_dfs['df47']
x y
0 1 4
1 2 1
2 1 4
3 2 3
I want to substitute the previous row's value whenever a 0 value is found in a column of the DataFrame in Python. I used the following code,
if not a[j]:
    a[j] = a[j-1]
and also
if a[j] == 0:
    a[j] = a[j-1]
Update:
Complete code:
for i in pd.unique(r.vehicle_id):
    sub = r[r.vehicle_id == i]
    sub = pd.DataFrame(sub, columns=['a', 'b', 'c', 'd', 'e'])
    sub = sub.drop_duplicates(["a", "b", "c", "d"])
    sub['c'] = pd.to_datetime(sub['c'], unit='s')
    for j in range(1, len(sub)):
        if not sub.d[j]:
            sub.d[j] = sub.d[j-1]
        if not sub.e[j]:
            sub.e[j] = sub.e[j-1]
    sub = sub.drop_duplicates(["d", "e"])
This is the start of my code; it is only the sub.d[j] line that is getting delayed.
Both of these approaches seem to work well with integer values. One of the columns contains decimal values, and when using the code for that column the statement takes a huge amount of time to complete (nearly 15-20 secs). I am looping through nearly 10000 ids, and wasting 15 secs at this step makes my entire code inefficient. Is there a better way to do this for the float (decimal) values, so that it would be much faster?
Thanks
Assuming that by "column of the dataframe" you mean you're actually talking about a column (Series) of a pandas DataFrame, then one trick is to replace the 0 by nan and then forward-fill. For example:
>>> df = pd.DataFrame(np.random.randint(0,4, 10**6))
>>> df.head(10)
0
0 0
1 3
2 3
3 0
4 1
5 2
6 3
7 2
8 0
9 3
>>> df[0] = df[0].replace(0, np.nan).ffill()
>>> df.head(10)
0
0 NaN
1 3
2 3
3 3
4 1
5 2
6 3
7 2
8 2
9 3
where you can decide for yourself how you want to handle the case of a 0 at the start, where you have no value to fill. This assumes that there aren't already NaN values you want to leave alone, but if there are, you can just use a mask with .loc to select only the ones you want to change.
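For example, here is a sketch of that masked approach, which forward-fills only the positions that were originally 0 and leaves any pre-existing NaN values untouched:

zero_mask = df[0] == 0
filled = df[0].mask(zero_mask).ffill()     # turn the zeros into NaN, then forward-fill
df.loc[zero_mask, 0] = filled[zero_mask]   # write back only at the original zero positions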
In Pandas, I am trying to manually code a chi-square test. I am comparing row 0 with row 1 in the dataframe below.
data
2 3 5 10 30
0 3 0 6 5 0
1 33324 15833 58305 54402 38920
For this, I need to calculate the expected cell counts for each cell as: cell(i,j) = rowSum(i)*colSum(j) / sumAll. In R, I can do this simply by taking the outer() products:
Exp_counts <- outer(rowSums(data), colSums(data), "*")/sum(data) # Expected cell counts
I used numpy's outer product function to imitate the outcome of the above R code:
import numpy as np
pd.DataFrame(np.outer(data.sum(axis=1),data.sum(axis=0))/ (data.sum().sum()), index=data.index, columns=data.columns.values)
2 3 5 10 30
0 2 1 4 3 2
1 33324 15831 58306 54403 38917
Is it possible to achieve this with a Pandas function?
A complete solution using only pandas built-in methods:
def outer_product(row):
    numerator = df.sum(1).mul(row.sum(0))
    denominator = df.sum(0).sum(0)
    return numerator.floordiv(denominator)

df.apply(outer_product)
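For instance, applied to the question's frame (a sketch; note that outer_product closes over a global df, so bind the question's data to that name first):

df = data
expected_counts = df.apply(outer_product)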
Timings: For 1 million rows of DF.