I started with Pandas and NumPy a couple of months ago and have already learned quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I created an Excel sheet that calculates some figures to be used for re-ordering inventory. To practice, and maybe actually use it, I wanted to try replicating the functionality in Python. Later I might want to add some more sophisticated calculations with the help of scikit-learn.
So far I've managed to load a CSV with sales figures from our ERP into a DataFrame and calculate the mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to specific columns. The CSV also contains, for example, product codes and lead times, and these should not be used for the mean and std calculations. I'm also not sure yet how to merge this subset back into the original DataFrame.
The reason I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far is working fine (though it can probably be done much more easily/efficiently):
import numpy as np
import timeit
import csv
import pandas as pd
sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Use Pandas
df = pd.read_csv(csv_in, dtype={'code': str})
# Get the number of columns and subtract 2 for the code and leadtime columns
cols = df.shape[1] - 2
# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]
# Add columns for std dev and average
df_subset = df_subset.assign(mean=df_subset.mean(axis=1),
                             stddev=df_subset.std(axis=1, ddof=0))
# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                             maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However, for my next calculation I'm stuck again:
I want to calculate the average over the values from the "month" columns, but only over the values that satisfy >= minSD and <= maxSD.
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but this doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks
EDIT: With help I ended up using this to get further with my program:
import numpy as np
import timeit
import csv
import pandas as pd
# sd will determine if range will be SD1 or SD2
sd = 1
# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Function to select the values within the range between minSD and maxSD
# (their mean is taken afterwards with np.nanmean)
def CalcMeanSD(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Define the month/data columns and cast them to float values
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
# Add columns for stddev and mean. Based on these values set new range between minSD and maxSD
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# Add column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
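Since csv_out is defined above but not used yet, writing the finished frame back out would just be one more line, e.g.:
df.to_csv(csv_out, index=False)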
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
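For completeness, the same avgwithSD column can also be computed without apply at all. (The attempt in the question fails because .values is an ndarray, not a callable, where would need the np. prefix, and the chained & comparisons need parentheses.) A vectorized sketch, assuming months_cols, df['minSD'] and df['maxSD'] as defined above:
months = df.loc[:, months_cols]
in_range = months.ge(df['minSD'], axis=0) & months.le(df['maxSD'], axis=0)
df['avgwithSD'] = months.where(in_range).mean(axis=1)  # mean skips the NaNs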
Here is an example of the data we want to process:
import time
import numpy as np
import pandas as pd

df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this:
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame.
start_time = time.time()
N = 10
col_names = map(lambda x: 'X'+str(x), range(N))
compil = pd.DataFrame(columns=col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My questions are:
How do I do something every x number of lines and then merge the results?
Does there exist a way to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
Then a final apply creates the new lines. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X'+str(x), range(len_of_chunks)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])

# we group by ID (boat)
# we divide into chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(
    lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))
# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()
print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
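A further thought: the chunking itself can be vectorized, avoiding the nested apply entirely. A sketch of the idea, assuming df_random and len_of_chunks as defined above (like the solution above, it builds non-overlapping chunks and keeps the last target_Y of each chunk):
pos_in_boat = df_random.groupby('boat_id').cumcount()
df_random['chunk'] = pos_in_boat // len_of_chunks  # which chunk a row belongs to
df_random['pos'] = pos_in_boat % len_of_chunks     # position of the row inside its chunk
# unstack the within-chunk position into the X0..X9 columns
wide = df_random.set_index(['boat_id', 'chunk', 'pos'])['X'].unstack('pos')
wide.columns = ['X%d' % c for c in wide.columns]
# keep the last target_Y of each chunk and drop incomplete chunks
last_y = df_random.groupby(['boat_id', 'chunk'])['target_Y'].last()
compil = wide.join(last_y).dropna().reset_index()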
I am trying to find a good way to calculate mean values from values in a DataFrame. It contains measured data from an experiment and is imported from an Excel sheet. The columns contain the elapsed time, the electric current and the corresponding voltage.
The current is changed in steps and then held for some time (the current values vary a little, so they are not exactly the same for each step). Now I want to calculate the mean voltage for each current step. Since it takes some time for the voltage to become stable after a step, I also want to leave out the first few voltage values after each step.
Currently I am doing this with loops, but I was wondering whether there is a nicer way using the groupby function (or maybe others).
Just say if you need more details or clarification.
Example of data:
s [A] [V]
0 6.0 -0.001420 0.780122
1 12.0 -0.002484 0.783297
2 18.0 -0.001478 0.785870
3 24.0 -0.001256 0.793559
4 30.0 -0.001167 0.806086
5 36.0 -0.000982 0.815364
6 42.0 -0.003038 0.825018
7 48.0 -0.001174 0.831739
8 54.0 0.000478 0.838861
9 60.0 -0.001330 0.846086
10 66.0 -0.001456 0.851556
11 72.0 0.000764 0.855950
12 78.0 -0.000916 0.859778
13 84.0 -0.000916 0.859778
14 90.0 -0.001445 0.863569
15 96.0 -0.000287 0.864303
16 102.0 0.000056 0.865080
17 108.0 -0.001119 0.865642
18 114.0 -0.000843 0.866434
19 120.0 -0.000997 0.866809
20 126.0 -0.001243 0.866964
21 132.0 -0.002238 0.867180
22 138.0 -0.001015 0.867177
23 144.0 -0.000604 0.867505
24 150.0 0.000507 0.867571
25 156.0 -0.001569 0.867525
26 162.0 -0.001569 0.867525
27 168.0 -0.001131 0.866756
28 174.0 -0.001567 0.866884
29 180.0 -0.002645 0.867240
.. ... ... ...
242 1708.0 24.703866 0.288902
243 1714.0 26.469208 0.219226
244 1720.0 26.468838 0.250437
245 1726.0 26.468681 0.254972
246 1732.0 26.468173 0.271525
247 1738.0 26.468260 0.247282
248 1744.0 26.467666 0.296894
249 1750.0 26.468085 0.247300
250 1756.0 26.468085 0.247300
251 1762.0 26.467808 0.261096
252 1768.0 26.467958 0.259615
253 1774.0 26.467828 0.260871
254 1780.0 28.232325 0.185291
255 1786.0 28.231697 0.197642
256 1792.0 28.231170 0.172802
257 1798.0 28.231103 0.170685
258 1804.0 28.229453 0.184009
259 1810.0 28.230816 0.181833
260 1816.0 28.230913 0.188348
261 1822.0 28.230609 0.178440
262 1828.0 28.231144 0.168507
263 1834.0 28.231144 0.168507
264 1840.0 8.813723 0.641954
265 1846.0 8.814301 0.652373
266 1852.0 8.818517 0.651234
267 1858.0 8.820255 0.637536
268 1864.0 8.821443 0.628136
269 1870.0 8.823643 0.636616
270 1876.0 8.823297 0.635422
271 1882.0 8.823575 0.622253
Output:
s [A] [V]
0 303.000000 -0.000982 0.857416
1 636.000000 0.879220 0.792504
2 699.000000 1.759356 0.752446
3 759.000000 3.519479 0.707161
4 816.000000 5.278372 0.669020
5 876.000000 7.064800 0.637848
6 939.000000 8.828799 0.611196
7 999.000000 10.593054 0.584402
8 1115.333333 12.357359 0.556127
9 1352.000000 14.117167 0.528826
10 1382.000000 15.882287 0.498577
11 1439.000000 17.646748 0.468379
12 1502.000000 19.410817 0.437342
13 1562.666667 21.175572 0.402381
14 1621.000000 22.939826 0.365724
15 1681.000000 24.704600 0.317134
16 1744.000000 26.468235 0.256047
17 1807.000000 28.231037 0.179606
18 1861.000000 8.819844 0.638190
The current approach:
import math
import pandas as pd

df = df[['s', '[A]', '[V]']]
# Looping over the rows to separate current points
b = df['[A]'].iloc[0]
start = 0
list = []
for index, row in df.iterrows():
    if not math.isclose(row['[A]'], b, abs_tol=1e-02):
        b = row['[A]']
        list.append(df.iloc[start:index])
        start = index
list.append(df.iloc[start:])
# Deleting the first few points after each current change
list_b = []
for l in list:
    list_b.append(l.iloc[3:])
# Calculating mean values for each current point
list_c = []
for l in list_b:
    list_c.append(l.mean())
result = pd.DataFrame(list_c)
Does this help?
df.groupby(['Columnname', 'Columnname2']).mean()
You may need to create intermediate dataframes for each step. Can you provide an example of the output you want?
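To make that concrete, the step detection from the loop can be expressed without loops. A sketch, assuming df has the 's', '[A]' and '[V]' columns shown above and using the same 1e-2 tolerance as the math.isclose call:
# label each current step, trim the first 3 settling samples, then average per step
step_id = (df['[A]'].diff().abs() > 1e-2).cumsum().rename('step')
trimmed = df.groupby(step_id, group_keys=False).apply(lambda g: g.iloc[3:])
result = trimmed.groupby(step_id).mean().reset_index(drop=True)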
I have a function saved and defined in a different script called TechAnalisys.py. This function just outputs a scalar, so I plan to use pd.rolling_apply() to generate a new column in the original dataframe (df).
The function works fine when executed on its own, but I have problems when using it with rolling_apply(). This link, Passing arguments to rolling_apply, shows how it should be done, and I think that is how my code works, but the error "TypeError: int object is not iterable" still appears.
This is the function (located in the script TechAnalisys.py):
def hurst(df, days):
    import pandas as pd
    import numpy as np
    df2 = pd.DataFrame()
    df2 = df[-days:]
    rango = lambda x: x.max() - x.min()
    df2['ret'] = 1 - df.PX_LAST/df.PX_LAST.shift(1)
    df2 = df2.dropna()
    ave = pd.expanding_mean(df2.ret)
    df2['desvdeprom'] = df2.ret - ave
    df2['acum'] = df2['desvdeprom'].cumsum()
    df2['rangorolled'] = pd.expanding_apply(df2.acum, rango)
    df2['datastd'] = pd.expanding_std(df2.ret)
    df2['rango_rangostd'] = np.log(df2.rangorolled/df2.datastd)
    df2['tiempo1'] = np.log(range(1, len(df2.index)+1))
    df2 = df2.dropna()
    model1 = pd.ols(y=df2['rango_rangostd'], x=df2['tiempo1'], intercept=False)
    return model1.beta
and now this is the main script:
import pandas as pd
import numpy as np
import TechAnalysis as ta
df = pd.DataFrame(np.log(np.cumsum(np.random.randn(100000)+1)+1000),columns =['PX_LAST'])
The following works:
print ta.hurst(df,50)
This doesn't work:
df['hurst_roll'] = pd.rolling_apply(df, 15 , ta.hurst, args=(50))
What's wrong with the code?
If you check the type of df within the hurst function, you'll see that rolling_apply passes it in as a numpy.array.
If you create a DataFrame from this numpy.array inside the function, it works. I also used a longer window because there were only 15 values per array, but you seemed to be planning to use the last 50 days.
def hurst(df, days):
    df = pd.DataFrame(df, columns=['PX_LAST'])
    df2 = pd.DataFrame()
    df2 = df.loc[-days:, :]
    rango = lambda x: x.max() - x.min()
    df2['ret'] = 1 - df.loc[:, 'PX_LAST']/df.loc[:, 'PX_LAST'].shift(1)
    df2 = df2.dropna()
    ave = pd.expanding_mean(df2.ret)
    df2['desvdeprom'] = df2.ret - ave
    df2['acum'] = df2['desvdeprom'].cumsum()
    df2['rangorolled'] = pd.expanding_apply(df2.acum, rango)
    df2['datastd'] = pd.expanding_std(df2.ret)
    df2['rango_rangostd'] = np.log(df2.rangorolled/df2.datastd)
    df2['tiempo1'] = np.log(range(1, len(df2.index)+1))
    df2 = df2.dropna()
    model1 = pd.ols(y=df2['rango_rangostd'], x=df2['tiempo1'], intercept=False)
    return model1.beta

def rol_apply():
    df = pd.DataFrame(np.log(np.cumsum(np.random.randn(1000)+1)+1000), columns=['PX_LAST'])
    df['hurst_roll'] = pd.rolling_apply(df, 100, hurst, args=(50, ))
PX_LAST hurst_roll
0 6.907911 NaN
1 6.907808 NaN
2 6.907520 NaN
3 6.908048 NaN
4 6.907622 NaN
5 6.909895 NaN
6 6.911281 NaN
7 6.911998 NaN
8 6.912245 NaN
9 6.912457 NaN
10 6.913794 NaN
11 6.914294 NaN
12 6.915157 NaN
13 6.916172 NaN
14 6.916838 NaN
15 6.917235 NaN
16 6.918061 NaN
17 6.918717 NaN
18 6.920109 NaN
19 6.919867 NaN
20 6.921309 NaN
21 6.922786 NaN
22 6.924173 NaN
23 6.925523 NaN
24 6.926517 NaN
25 6.928552 NaN
26 6.930198 NaN
27 6.931738 NaN
28 6.931959 NaN
29 6.932111 NaN
.. ... ...
970 7.562284 0.653381
971 7.563388 0.630455
972 7.563499 0.577746
973 7.563686 0.552758
974 7.564105 0.540144
975 7.564428 0.541411
976 7.564351 0.532154
977 7.564408 0.530999
978 7.564681 0.532376
979 7.565192 0.536758
980 7.565359 0.538629
981 7.566112 0.555789
982 7.566678 0.553163
983 7.566364 0.577953
984 7.567587 0.634843
985 7.568583 0.679807
986 7.569268 0.662653
987 7.570018 0.630447
988 7.570375 0.659497
989 7.570704 0.622190
990 7.571009 0.485458
991 7.571886 0.551147
992 7.573148 0.459912
993 7.574134 0.463146
994 7.574478 0.463158
995 7.574671 0.535014
996 7.575177 0.467705
997 7.575374 0.531098
998 7.575620 0.540611
999 7.576727 0.465572
[1000 rows x 2 columns]
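A side note on the original error: args=(50) is just the integer 50, since a one-element tuple needs a trailing comma, and rolling_apply unpacking that integer as extra arguments is what raises the TypeError. The comma alone fixes that particular error (the ndarray-vs-DataFrame point above still applies):
df['hurst_roll'] = pd.rolling_apply(df, 15, ta.hurst, args=(50,))  # (50,) is a tuple, (50) is an int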
I am trying to figure out how I can combine daily dates into specific months, summing the data for each day that falls within a specific month.
Note: I have a huge list with daily dates, but I put a small sample here to simplify the example.
File name: (test.xlsx)
For example, (sheet1) contains the following in DataFrame form:
DATE 51 52 53 54 55 56
0 20110706 28.52 27.52 26.52 25.52 24.52 23.52
1 20110707 28.97 27.97 26.97 25.97 24.97 23.97
2 20110708 28.52 27.52 26.52 25.52 24.52 23.52
3 20110709 28.97 27.97 26.97 25.97 24.97 23.97
4 20110710 30.5 29.5 28.5 27.5 26.5 25.5
5 20110711 32.93 31.93 30.93 29.93 28.93 27.93
6 20110712 35.54 34.54 33.54 32.54 31.54 30.54
7 20110713 33.02 32.02 31.02 30.02 29.02 28.02
8 20110730 35.99 34.99 33.99 32.99 31.99 30.99
9 20110731 30.5 29.5 28.5 27.5 26.5 25.5
10 20110801 32.48 31.48 30.48 29.48 28.48 27.48
11 20110802 31.04 30.04 29.04 28.04 27.04 26.04
12 20110803 32.03 31.03 30.03 29.03 28.03 27.03
13 20110804 34.01 33.01 32.01 31.01 30.01 29.01
14 20110805 27.44 26.44 25.44 24.44 23.44 22.44
15 20110806 32.48 31.48 30.48 29.48 28.48 27.48
What I would like is to edit ("test.xlsx", 'sheet1') so that it results in what is below:
DATE 51 52 53 54 55 56
0 201107 313.46 303.46 293.46 283.46 273.46 263.46
1 201108 189.48 183.48 177.48 171.48 165.48 159.48
How would I go about implementing this?
Here is my code thus far:
import pandas as pd
from pandas import ExcelWriter

df = pd.read_excel('thecddhddtestquecdd.xlsx')

def sep_yearmonths(x):
    x['month'] = str(x['DATE'])[:-2]
    return x

df = df.apply(sep_yearmonths, axis=1)
df.groupby('month').sum()

writer = ExcelWriter('thecddhddtestquecddMERGE.xlsx')
df.to_excel(writer, 'Sheet1', index=False)
writer.save()
This will work if 'DATE' is a column of strings and not your index.
Example dataframe - shortened for clarity:
df = pd.DataFrame({'DATE': {0: '20110706', 1:'20110707', 2: '20110801'},
52: {0: 28.52, 1: 28.97, 2: 28.52},
55: { 0: 24.52, 1: 24.97, 2:24.52 }
})
Which yields:
52 55 DATE
0 28.52 24.52 20110706
1 28.97 24.97 20110707
2 28.52 24.52 20110801
Apply the following function over the dataframe to generate a new column:
def sep_yearmonths(x):
    x['month'] = x['DATE'][:-2]
    return x
Like this:
df = df.apply(sep_yearmonths,axis=1)
Over which you can then groupby and sum:
df.groupby('month').sum()
Resulting in the following:
52 55
month
201107 57.49 49.49
201108 28.52 24.52
If 'DATE' is your index, simply call reset_index first. If it's not a column of string values, you need to convert it beforehand.
Finally, you can rename the 'month' column to 'DATE'. I suppose you could just substitute the 'DATE' column in place, but I chose to do things explicitly. You can do that like so:
df['DATE'] = df['DATE'].apply(lambda x: x[:-2])
Then 'groupby' 'DATE' instead of month.
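One more detail worth flagging in the code from the question: groupby('month').sum() returns a new DataFrame rather than modifying df in place, so the result has to be captured and written out instead of the original df. A sketch, reusing the writer from the question:
monthly = df.groupby('month').sum()
writer = ExcelWriter('thecddhddtestquecddMERGE.xlsx')
monthly.to_excel(writer, 'Sheet1')
writer.save()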
Use resample
import pandas as pd

myTable = pd.read_excel('test.xlsx')
myTable['DATE'] = pd.to_datetime(myTable['DATE'], format="%Y%m%d")
myTable = myTable.set_index('DATE')
myTable.resample("M").sum()
I have a pandas DataFrame structured in the following way
0 1 2 3 4 5 6 7 8 9
0 42 2012 106 1200 0.112986 -0.647709 -0.303534 31.73 14.80 1096
1 42 2012 106 1200 0.185159 -0.588728 -0.249392 31.74 14.80 1097
2 42 2012 106 1200 0.199910 -0.547780 -0.226356 31.74 14.80 1096
3 42 2012 106 1200 0.065741 -0.796107 -0.099782 31.70 14.81 1097
4 42 2012 106 1200 0.116718 -0.780699 -0.043169 31.66 14.78 1094
5 42 2012 106 1200 0.280035 -0.788511 -0.171763 31.66 14.79 1094
6 42 2012 106 1200 0.311319 -0.663151 -0.271162 31.78 14.79 1094
Columns 4, 5 and 6 are actually the components of a vector. I want to apply a matrix multiplication to these columns, that is, to replace columns 4, 5 and 6 with the vector that results from multiplying the previous vector by a matrix.
What I did was:
DC = [[ .. definition of multiplication matrix .. ]]

def rotate(vector):
    return dot(DC, vector)

data[[4,5,6]] = data[[4,5,6]].apply(rotate, axis='columns')
Which I thought should work, but the returned DataFrame is exactly the same as the original.
What am I missing here?
Your code is correct but very slow. You can use the values property to get the ndarray and use dot() to transform all the vectors at once:
import numpy as np
import pandas as pd
DC = np.random.randn(3, 3)
df = pd.DataFrame(np.random.randn(1000, 10))
df2 = df.copy()
df[[4,5,6]] = np.dot(DC, df[[4,5,6]].values.T).T
def rotate(vector):
    return np.dot(DC, vector)

df2[[4,5,6]] = df2[[4,5,6]].apply(rotate, axis='columns')
df.equals(df2)
On my PC, it's about 90x faster: np.dot transforms the whole block in a single vectorized call, whereas apply invokes the Python-level rotate function once per row.