I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from Python's built-in min: comparing two Series produces a boolean Series, and Python then tries to collapse it into a single True/False, which pandas refuses; this is the same reason pandas prefers the bitwise operators & and | over the keywords and/or. The built-in min function simply isn't going to work row-wise.
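You can reproduce the error in isolation (a minimal sketch with made-up numbers, just to show the mechanism):
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 2, 1])
# built-in min compares the two Series, producing a boolean Series,
# and then tries to reduce that Series to a single True/False:
min(s1, s2)  # ValueError: The truth value of a Series is ambiguous.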
A potential solution is to make two new calculated columns and then use the pandas DataFrame .min method.
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The min(axis=1) call finds the minimum of the two columns row by row, and the result is then assigned to the new column. This way is efficient because you're using numpy vectorization, and it is easier to read.
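If you'd rather not keep the intermediate columns, numpy's element-wise minimum does the same thing in one step (a sketch using the column names from your frame):
import numpy as np

df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))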
I'm setting up a dataframe by reading a csv file in pandas. The columns represent one-dimensional positional values for different samples, and each row represents a 0.01 s time segment. I want to create a new dataframe to represent velocity and acceleration, so basically apply the operation [point(i) - point(i-1)]/0.01 to every cell in the dataframe.
I'm having trouble using pandas.applymap or other approaches because I don't quite know how to refer to multiple arguments in the dataframe for every operation, if that makes sense.
import pandas as pd
import numpy as np
data = pd.read_csv("file_name")
def velocity(xf, xi):
    v = (xf - xi) * 100
    return v

velocity = data.applymap(velocity)
This is what the first few column and rows of the original data frame look like:
X LFHD Y LFHD Z LFHD X RFHD Y RFHD
0 700.003 -1769.61 1556.05 811.922 -1878.46
1 699.728 -1769.50 1555.99 811.942 -1878.14
2 699.465 -1769.38 1555.99 811.980 -1877.81
3 699.118 -1769.38 1555.83 812.005 -1877.48
4 699.017 -1768.78 1556.19 812.003 -1877.11
For every positional value in each column, I want to calculate the velocity where the initial positional value is the cell above (xi as the input in the velocity function) and the final positional value is the cell in question (xf).
When I try to run the above code, it gives me an error because only one argument is provided to velocity when it expects two. I don't know how to go about providing the second argument so that it outputs the proper new dataframe with the velocity calculated in each cell.
pandas' DataFrame.diff() takes the row-to-row difference of every column at once, so applymap isn't needed; multiplying by 100 divides by the 0.01 s time step:
df_velocity = data.diff()*100
df_velocity
Out[6]:
X_LFHD Y_LFHD Z_LFHD X_RFHD Y_RFHD
0 NaN NaN NaN NaN NaN
1 -27.5 11.0 -6.0 2.0 32.0
2 -26.3 12.0 0.0 3.8 33.0
3 -34.7 0.0 -16.0 2.5 33.0
4 -10.1 60.0 36.0 -0.2 37.0
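The same idea extends to the acceleration you mentioned: difference the velocity frame once more over the 0.01 s step (a short follow-up sketch; the first two rows will naturally be NaN):
df_acceleration = df_velocity.diff() * 100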
I have a pandas dataframe containing about 2 Million rows which looks like the following example
ID V1 V2 V3 V4 V5
12 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
01 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
02 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
07 3.8 2.9 1.1 1.6 1.5
19 0.9 1.2 1.8 2.6 9.0
19 0.5 0.4 0.6 0.7 1.8
06 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
18 0.9 1.2 1.8 2.6 9.0
I want to create three subsets of this data such that they are mutually exclusive on the ID column, and each subset includes all rows from the main dataframe for the IDs it contains.
As of now, I am collecting the unique IDs into a list and shuffling it randomly. Using a fraction of this list, I'm selecting all rows from the dataframe whose ID belongs to that fraction.
import numpy as np
import random
distinct = list(set(df.ID.values))
random.shuffle(distinct)
X1, X2 = distinct[:1000000], distinct[1000000:2000000]
df_X1 = df.loc[df['ID'].isin(list(X1))]
df_X2 = df.loc[df['ID'].isin(list(X2))]
This works as expected for smaller data; however, for larger data the run doesn't complete even after many hours. Is there a more efficient way to do this? I appreciate any responses.
I think the slowdown is coming from the nested isin list inside the loc slice. I tried a different approach using numpy and a boolean index that seems to double the speed.
First, to set up the dataframe: I wasn't sure how many unique IDs you had, so I selected 50. I was also unsure how many columns, so I arbitrarily selected 10,000 columns and rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 10000))
ID = np.random.randint(0, 50, 10000)
df['ID'] = ID
Then I use mostly numpy arrays and avoid the nested list by using a boolean index.
# Create a numpy array from the ID column
a_ID = np.array(df['ID'])
# use the numpy unique method to get an array of unique IDs
a = np.unique(a_ID)
# shuffle the unique array
np.random.seed(100)
np.random.shuffle(a)
# take the first half of the shuffled array
X1 = a[0:25]
# create a boolean mask marking the rows whose ID is in X1
mask = np.isin(a_ID, X1)
# set the index to the mask, then select the True rows
df.index = mask
df.loc[True]
When I ran your code on my sample df, it took 817 ms; the code above runs in 445 ms.
Not sure if this helps. Good question, thanks.
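As a small variation on the code above (not benchmarked here), you can keep the original index and use the boolean mask directly instead of assigning it to the index; the complement of the mask gives the second, mutually exclusive subset:
df_X1 = df[mask]    # rows whose ID landed in X1
df_X2 = df[~mask]   # rows whose ID landed in the other half of the shuffled IDs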
I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns by applying a formula to each already existing column in my dataframe (in short, pretty much doubling the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100 - [5*12.0])*0.2 = 8.0, and for December for Sonic, (100 - [5*3.0])*0.2 = 17.0. My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to build one column. This is what I would do if only one month were in consideration:
for w in range(1, len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to numpy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
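The same result can be had without leaving pandas, since the arithmetic broadcasts over the whole frame; this is just a variant of the above, assuming the column labels can take the 'Weighted_' prefix:
weighted = ((100 - 5 * sega_df) * 0.2).add_prefix('Weighted_')
sega_df = sega_df.join(weighted)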
I'm trying to find a way to run a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for just one column and concatenates the value to a numpy array called series. Here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression
import numpy as np

series = np.array([])  # blank array to collect the results
df2 = df1[~np.isnan(df1['A1'])]  # remove NaN values for this column so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am reusing this slice of code, replacing "A1" with each new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but I have the drawback of all these intermediate NaN values in the time series, so it seems like I'm limited to this method or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example, with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the slope is (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and its one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we could do it all together? You have to deal with the NaNs. How would you imagine dealing with them? Only doing it over the times where you had data? That is equivalent to placing zeroes in the NaN spots. So that's what I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
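If you'd rather not regress Time against itself, the same one-liner works with Time dropped from the right-hand side (a small variation on the code above, assuming a pandas version that supports drop(columns=...)):
Y = df.drop(columns='Time').fillna(0)
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(Y),
             ['Slope'], Y.columns)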
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the time column itself
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
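To keep the results lined up with their column names, you can wrap them in a Series afterwards (a small add-on to the loop above, assuming every non-Time column was fitted):
slope_series = pd.Series(np.concatenate(slopes),
                         index=[c for c in df1.columns if c != 'Time'])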
I have a very big pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over the groups, do an operation on each group, and union all those groups back into one dataframe, but this is slow and I feel like there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the package numpy-groupies.
# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id)-n_for_id
# convert each row's global sort position into a rank within its ID group
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank+1
It takes 2 s for 5M rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5M rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30 s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
It takes about 30 s on a dataset of 5M rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
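rank returns floats, as in the output above; if you want the integer order column from the question, you can cast it (a one-line addition, assuming no NaN prices):
df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)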