Replace column with some rows of another column - python
I have the following dataframe:
midPrice Change % Spike New Oilprice
92.20000 0.00 0 92.043405
92.26454 0.07 0 92.049689
91.96950 -0.32 0 91.979751
91.73958 -0.25 0 91.844369
91.78985 0.05 0 91.724690
91.41000 -0.41 0 91.568880
91.18148 -0.25 0 91.690812
91.24257 0.07 0 91.858391
90.95352 -0.32 0 92.016806
93.24000 2.51 1 92.139872
93.31013 0.08 0 92.321622
93.00690 -0.32 0 92.542687
92.77438 -0.25 0 92.727070
92.86400 0.10 0 92.949655
Whenever there is a spike (Spike == 1), I want to replace the 5 rows starting at the spike row (the spike itself plus the 4 rows after it) with the values from New Oilprice. The rest of the rows should be kept as they are.
Any ideas how to solve that?
I tried writing code based on the following idea:
Iterate through the df (for loop)
If Spike == 1, replace the following 5 rows with the values of New Oilprice; else keep the existing mid prices
def spike(i):
    for i in df['Spike']:
        if i.loc == 1:
            df['midPrice'].replace(df['New Oilprice'][i:5])
It unfortunately doesn't work, and I'm not so strong with pandas. I also tried mapping the function onto the dataframe, which didn't work either. I would appreciate any help.
Assuming the df is sorted by time in ascending order (as I've seen in the edit history of your question that you had a time column), you could use a mask like so:
mask = df['Spike'].eq(1).where(df['Spike'].eq(1)).fillna(method='ffill', limit=4).fillna(False)
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.242570 0.07 0 91.858391
8 90.953520 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
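To see how this mask is built, here is a minimal step-by-step sketch on a toy Spike series (the values are illustrative only):

import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0])

step1 = s.eq(1)                                # True only at the spike row
step2 = step1.where(step1)                     # keep True, everything else becomes NaN
step3 = step2.fillna(method='ffill', limit=4)  # carry True forward into the next 4 rows
mask = step3.fillna(False)                     # remaining NaN -> False

print(mask.tolist())
# [False, False, True, True, True, True, True, False]

Rows 2 through 6 end up flagged: the spike row plus the four rows after it, which is exactly the 5-row window asked for.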
EDIT - 2 rows before, 3 rows after:
You can adjust the mask with another fillna:
mask = df['Spike'].eq(1).where(df['Spike'].eq(1)).fillna(method='bfill', limit=2).fillna(method='ffill', limit=3).fillna(False)
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.858391 0.07 0 91.858391
8 92.016806 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
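As an aside, the forward-looking mask from the first snippet can also be built with a trailing rolling window. A sketch, assuming Spike only ever contains 0 and 1:

mask = df['Spike'].rolling(5, min_periods=1).max().astype(bool)
df.loc[mask, 'midPrice'] = df['New Oilprice']

rolling(5) at row i looks back over rows i-4 to i, so the max is 1 whenever a spike occurred in the current row or in any of the four rows before it.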
Related
Add incremental values following an id
I'm trying to add incremental values for each id in this pandas dataframe.

Initial table:

id  col1  col2
1   0.12    10
1   0.23    20
1   1.1     30
2   0.25    10
2   2.1     20
2   1.2     30

What I want to achieve:

id  col1  col2
1   0.12    10
1   0.23    20
1   1.1     30
1   0       40
1   0       50
2   0.25    10
2   2.1     20
2   1.2     30
2   0       40
2   0       50

I tried:

def func(row):
    for i in row["id"]:
        for j in range(40, 50+1, 10):
            row["id"] = i
            row["col1"] = 0
            row["col2"] = j

df = df.apply(lambda row: func(row))

but this raises an error that id doesn't exist: KeyError: 'id'
No need for a loop, you can approach this by using MultiIndex.from_product:

N = 50  # <- adjust here the limit to reach

gr = df.groupby("id", as_index=False).count()
idx = pd.MultiIndex.from_product([gr["id"], range(10, N+10, 10)], names=["id", "col2"])
out = (df.set_index(["id", "col2"]).reindex(idx, fill_value=0).reset_index()[df.columns])

Output:

print(out)

   id  col1  col2
0   1  0.12    10
1   1  0.23    20
2   1  1.10    30
3   1  0.00    40
4   1  0.00    50
5   2  0.25    10
6   2  2.10    20
7   2  1.20    30
8   2  0.00    40
9   2  0.00    50
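For context, pd.MultiIndex.from_product builds the full Cartesian product of its inputs, which is what guarantees that every (id, col2) pair exists before the reindex fills the gaps with 0. A quick illustration:

import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2], [10, 20, 30]], names=["id", "col2"])
print(idx.tolist())
# [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (2, 30)]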
You can group by id and append the values to the needed columns via pandas concatenation:

df_to_append = pd.DataFrame([[0, 40], [0, 50]], columns=['col1', 'col2'])

df = df.groupby('id', as_index=False).apply(lambda x: pd.concat([x, df_to_append], ignore_index=True))\
       .reset_index(drop=True).fillna(method='ffill')

    id  col1  col2
0  1.0  0.12    10
1  1.0  0.23    20
2  1.0  1.10    30
3  1.0  0.00    40
4  1.0  0.00    50
5  2.0  0.25    10
6  2.0  2.10    20
7  2.0  1.20    30
8  2.0  0.00    40
9  2.0  0.00    50

(The appended rows have no id value, so fillna(method='ffill') fills it forward from each group's existing rows; the introduced NaNs are also why id comes back as a float column.)
Sort the rows of a dataframe and get the column values in pandas dataframe
My dataframe looks like this:

df

      5     1     2    4     3    0  pred_val  true_value  rank
0  0.30  0.20  0.10  0.5  0.25  0.4         4           2     6
1  0.36  0.24  0.12  0.5  0.45  0.4         4           3     2

I want to predict the values of the rank column based on my true value. If the predicted value (pred_val) is the same as the true_value, then rank = 1, which can be achieved by using np.where. But if they do not match, then the true_value is searched for among the columns named 0-5, and the rank is assigned according to the cell value under that column. For example, in the 0th row the true value is 2 and the pred_val is 4, so they do not match; we then look at column 2, which holds 0.1. That is the lowest among all the 0-5 column values for the 0th row, hence rank 6. How can I do this?
I think this is what you may be looking for:

df

      5     1     2    4     3    0  pred_val  true_value
0  0.30  0.20  0.10  0.5  0.25  0.4         4           2
1  0.36  0.24  0.12  0.5  0.45  0.4         4           3

df['rank'] = df.apply(lambda row: row[[0, 1, 2, 3, 4, 5]].sort_values(ascending=False).index.get_loc(row.true_value) + 1, axis=1)

df

      5     1     2    4     3    0  pred_val  true_value  rank
0  0.30  0.20  0.10  0.5  0.25  0.4         4           2     6
1  0.36  0.24  0.12  0.5  0.45  0.4         4           3     2
If you want to use a list comprehension:

# set true_value to a string so it matches the column labels
df['truevalue'] = df['truevalue'].astype(str)

# list comprehension to get (index, col) pairs
vals = [x for x in enumerate(df['truevalue'])]

# use rank and a list comprehension
df['rank'] = [int(df[df.columns[:6].values].rank(1, ascending=False).loc[val]) for val in vals]

      5     1     2    4     3    0  predval  truevalue  rank
0  0.30  0.20  0.10  0.5  0.25  0.4        4          2     6
1  0.36  0.24  0.12  0.5  0.45  0.4        4          3     2
Input:

      5     1     2    4     3    0  pred_val  true_value  rank
0  0.30  0.20  0.10  0.5  0.25  0.4         4           2     0
1  0.36  0.24  0.12  0.5  0.45  0.4         4           3     0

Do this:

for i in range(len(df)):
    t_val = df['true_value'][i]
    cols_vals = sorted(list(df.loc[i, ['5', '1', '2', '4', '3', '0']].values), reverse=True)
    rank = cols_vals.index(df[str(t_val)][i]) + 1
    df.loc[i, 'rank'] = rank

Output:

      5     1     2    4     3    0  pred_val  true_value  rank
0  0.30  0.20  0.10  0.5  0.25  0.4         4           2     6
1  0.36  0.24  0.12  0.5  0.45  0.4         4           3     2
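All three answers boil down to ranking the row's score columns in descending order and reading off the rank at the true_value column. A vectorized sketch of the same idea, assuming string column labels '0'-'5', a true_value column, and a default RangeIndex (as in the last answer):

ranks = df[['0', '1', '2', '3', '4', '5']].rank(axis=1, ascending=False)  # per-row descending ranks
df['rank'] = [int(ranks.at[i, str(v)]) for i, v in df['true_value'].items()]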
Missing values in .csv after writing by pandas dataframe error
I have a .csv file:

20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1

I am normalizing it using a pandas dataframe, but I get missing values in the output .csv file:

.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1

My code:

import pandas as pd

df = pd.read_csv('pooja.csv', index_col=False)
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean()) / (df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min())
rslt = pd.concat([df_norm, df.ix[:, -1]], axis=1)
rslt.to_csv('example.csv', index=False, header=False)

What's wrong in the code? Why are values missing in the .csv file?
You get many NaN values because you divide 0 by 0. See the pandas broadcasting behaviour; a better explanation is here. I use the code from your previous question, because I think the slicing with df.ix[:, 1:-1] is not necessary - after normalizing with that slicing I get an empty DataFrame.

import pandas as pd
import numpy as np
import io

temp = u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col=None, header=None)

# filter only the first 5 columns for testing
df = df.iloc[:, :5]
print df

           0          1     2  3  4
0  20376.650  22398.290  4.80  0  1
1   7048.842   8421.754  1.44  0  1
2   3716.890   4927.620  0.12  0  1
3   3716.890   4927.620  0.00  0  1
4   3716.890   4927.620  0.00  0  1

# get max values by column
print df.max()

0    20376.65
1    22398.29
2        4.80
3        0.00
4        1.00
dtype: float64

# get min values by column
print df.min()

0    3716.89
1    4927.62
2       0.00
3       0.00
4       1.00
dtype: float64

# the difference is 0 for columns 3 and 4
print (df.max() - df.min())

0    16659.76
1    17470.67
2        4.80
3        0.00
4        0.00
dtype: float64

print df - df.mean()

            0           1      2  3  4
0  12661.4176  13277.7092  3.528  0  0
1   -666.3904   -698.8268  0.168  0  0
2  -3998.3424  -4192.9608 -1.152  0  0
3  -3998.3424  -4192.9608 -1.272  0  0
4  -3998.3424  -4192.9608 -1.272  0  0

# you get NaN because the 0 values in columns 3 and 4 are divided by a (max - min) difference that is also 0
df_norm = (df - df.mean()) / (df.max() - df.min())
print df_norm

      0     1      2   3   4
0  0.76  0.76  0.735 NaN NaN
1 -0.04 -0.04  0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN

Finally, when you write the result with to_csv, each NaN becomes "", because the parameter na_rep has the default value "":

print df_norm.to_csv(index=False, header=False, na_rep="")

0.76,0.76,0.735,,
-0.04,-0.04,0.035,,
-0.24,-0.24,-0.24,,
-0.24,-0.24,-0.265,,
-0.24,-0.24,-0.265,,

If you change the value of na_rep:

# change na_rep to * for testing
print df_norm.to_csv(index=False, header=False, na_rep="*")

0.76,0.76,0.735,*,*
-0.04,-0.04,0.035,*,*
-0.24,-0.24,-0.24,*,*
-0.24,-0.24,-0.265,*,*
-0.24,-0.24,-0.265,*,*
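One way to avoid the 0/0 division is to neutralize the zero-range (constant) columns before dividing, so they normalize to 0 instead of NaN. A minimal sketch, continuing from the df above:

rng = df.max() - df.min()
df_norm = (df - df.mean()) / rng.replace(0, 1)  # constant columns become 0 instead of NaN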
Slicing based on dates Pandas Dataframe
I have a large dataframe with dates, store number, units sold, and rain precipitation totals. It looks like this:

date        store_nbr  units  preciptotal
2014-10-11          1      0         0.00
2014-10-12          1      0         0.01
2014-10-13          1      2         0.00
2014-10-14          1      1         2.13
2014-10-15          1      0         0.00
2014-10-16          1      0         0.87
2014-10-17          1      3         0.01
2014-10-18          1      0         0.40

I want to select a three-day window around any date that has a precipitation total greater than 1. For this small example, I would want to get back the first 7 rows: the 3 days before 2014-10-14, the 3 days after it, and 2014-10-14 itself, because it has a preciptotal greater than 1.
Here are two ways you could build the selection mask without looping over the index values.

You could find the rows where preciptotal is greater than 1:

mask = (df['preciptotal'] > 1)

and then use scipy.ndimage.binary_dilation to expand the mask to a 7-day window:

import scipy.ndimage as ndimage
import pandas as pd

df = pd.read_table('data', sep='\s+')
mask = (df['preciptotal'] > 1)
mask = ndimage.binary_dilation(mask, iterations=3)
df.loc[mask]

yields

         date  store_nbr  units  preciptotal
0  2014-10-11          1      0         0.00
1  2014-10-12          1      0         0.01
2  2014-10-13          1      2         0.00
3  2014-10-14          1      1         2.13
4  2014-10-15          1      0         0.00
5  2014-10-16          1      0         0.87
6  2014-10-17          1      3         0.01

Or, using NumPy (without the scipy dependency), you could use mask.shift with np.logical_and.reduce:

mask = (df['preciptotal'] > 1)
mask = ~np.logical_and.reduce([(~mask).shift(i) for i in range(-3, 4)]).astype(bool)
# array([ True,  True,  True,  True,  True,  True,  True, False], dtype=bool)
For a specific value you can do this:

In [84]:
idx = df[df['preciptotal'] > 1].index[0]
df.iloc[idx-3: idx+4]

Out[84]:
         date  store_nbr  units  preciptotal
0  2014-10-11          1      0         0.00
1  2014-10-12          1      0         0.01
2  2014-10-13          1      2         0.00
3  2014-10-14          1      1         2.13
4  2014-10-15          1      0         0.00
5  2014-10-16          1      0         0.87
6  2014-10-17          1      3         0.01

For the more general case you can get an array of indices where the condition is met:

idx_vals = df[df['preciptotal'] > 1].index

Then you can generate slices or iterate over the array values:

for idx in idx_vals:
    df.iloc[idx-3: idx+4]

This assumes your index is a 0-based int64 index, which your sample is.
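If the windows from that loop should be collected into a single frame, a hedged sketch (the max() guard keeps a slice from starting below 0, and the index-based filter drops rows shared by overlapping windows):

import pandas as pd

windows = [df.iloc[max(idx - 3, 0): idx + 4] for idx in idx_vals]
result = pd.concat(windows)
result = result[~result.index.duplicated()]  # de-duplicate overlapping windows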
Pandas: trouble understanding how merge works
I'm doing something wrong with merge and I can't understand what it is. I've done the following to estimate a histogram of a series of integer values:

import pandas as pnd
import numpy as np

series = pnd.Series(np.random.poisson(5, size=100))
tmp = {"series": series, "count": np.ones(len(series))}
hist = pnd.DataFrame(tmp).groupby("series").sum()
freq = (hist / hist.sum()).rename(columns={"count": "freq"})

If I print hist and freq, this is what I get:

> print hist
        count
series
0           2
1           4
2          13
3          15
4          12
5          16
6          18
7           7
8           8
9           3
10          1
11          1

> print freq
        freq
series
0       0.02
1       0.04
2       0.13
3       0.15
4       0.12
5       0.16
6       0.18
7       0.07
8       0.08
9       0.03
10      0.01
11      0.01

They're both indexed by "series", but if I try to merge:

> df = pnd.merge(freq, hist, on="series")

I get a KeyError: 'no item named series' exception. If I omit on="series", I get an IndexError: list index out of range exception. I don't get what I'm doing wrong. Maybe "series" is an index and not a column, so I must do it differently?
From the docs:

on: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

I don't know why this is not in the docstring, but it explains your problem. You can either give left_index and right_index:

In : pnd.merge(freq, hist, right_index=True, left_index=True)
Out:
        freq  count
series
0       0.01      1
1       0.04      4
2       0.14     14
3       0.12     12
4       0.21     21
5       0.14     14
6       0.17     17
7       0.07      7
8       0.05      5
9       0.01      1
10      0.01      1
11      0.03      3

Or you can make your index a column and use on:

In : freq2 = freq.reset_index()
In : hist2 = hist.reset_index()
In : pnd.merge(freq2, hist2, on='series')
Out:
    series  freq  count
0        0  0.01      1
1        1  0.04      4
2        2  0.14     14
3        3  0.12     12
4        4  0.21     21
5        5  0.14     14
6        6  0.17     17
7        7  0.07      7
8        8  0.05      5
9        9  0.01      1
10      10  0.01      1
11      11  0.03      3

Alternatively and more simply, DataFrame has a join method which does exactly what you want:

In : freq.join(hist)
Out:
        freq  count
series
0       0.01      1
1       0.04      4
2       0.14     14
3       0.12     12
4       0.21     21
5       0.14     14
6       0.17     17
7       0.07      7
8       0.05      5
9       0.01      1
10      0.01      1
11      0.03      3