Slicing based on dates in a pandas DataFrame - python

I have a large dataframe with dates, store number, units sold, and rain precipitation totals. It looks like this...
date store_nbr units preciptotal
2014-10-11 1 0 0.00
2014-10-12 1 0 0.01
2014-10-13 1 2 0.00
2014-10-14 1 1 2.13
2014-10-15 1 0 0.00
2014-10-16 1 0 0.87
2014-10-17 1 3 0.01
2014-10-18 1 0 0.40
I want to select a three-day window around any date that has a precipitation total greater than 1. For this small example, I would want to get back the first seven rows: the three days before 2014-10-14, the three days after 2014-10-14, and 2014-10-14 itself, because it has a preciptotal greater than 1.

Here are two ways you could build the selection mask without looping over the index values:
You could find the rows where preciptotal is greater than 1:
mask = (df['preciptotal'] > 1)
and then use scipy.ndimage.binary_dilation to expand the mask to a 7-day window:
import scipy.ndimage as ndimage
import pandas as pd

df = pd.read_table('data', sep=r'\s+')
mask = (df['preciptotal'] > 1)
mask = ndimage.binary_dilation(mask, iterations=3)
df.loc[mask]
yields
date store_nbr units preciptotal
0 2014-10-11 1 0 0.00
1 2014-10-12 1 0 0.01
2 2014-10-13 1 2 0.00
3 2014-10-14 1 1 2.13
4 2014-10-15 1 0 0.00
5 2014-10-16 1 0 0.87
6 2014-10-17 1 3 0.01
Or, using NumPy (but without the scipy dependency), you could use mask.shift with np.logical_and.reduce:
import numpy as np

mask = (df['preciptotal'] > 1)
mask = ~np.logical_and.reduce([(~mask).shift(i) for i in range(-3, 4)]).astype(bool)
# array([ True, True, True, True, True, True, True, False], dtype=bool)
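A pure-pandas variant of the same dilation (a sketch, assuming the 0-based RangeIndex of the sample): a centered rolling window of width 7 is True wherever any row within 3 positions of it exceeds the threshold.
mask = (df['preciptotal'] > 1).astype(int)
mask = mask.rolling(7, center=True, min_periods=1).max().astype(bool)
df.loc[mask]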

For a specific value you can do this:
In [84]:
idx = df[df['preciptotal'] > 1].index[0]
df.iloc[idx-3: idx+4]
Out[84]:
date store_nbr units preciptotal
0 2014-10-11 1 0 0.00
1 2014-10-12 1 0 0.01
2 2014-10-13 1 2 0.00
3 2014-10-14 1 1 2.13
4 2014-10-15 1 0 0.00
5 2014-10-16 1 0 0.87
6 2014-10-17 1 3 0.01
For the more general case you can get an array of indices where the condition is met
idx_vals = df[df['preciptotal'] > 1].index
then you can generate slices or iterate over the array values:
for idx in idx_vals:
    df.iloc[idx - 3: idx + 4]
This assumes your index is a 0-based int64 index, which your sample is.
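If two qualifying days sit close together, their windows overlap; here is a small sketch (assuming the same 0-based integer index) that collects the row positions and deduplicates them before slicing:
import numpy as np

idx_vals = df[df['preciptotal'] > 1].index
rows = np.unique(np.concatenate([np.arange(i - 3, i + 4) for i in idx_vals]))
rows = rows[(rows >= 0) & (rows < len(df))]  # clip window edges to valid positions
df.iloc[rows]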

Related

Replace column with some rows of another column

I have following dataframe:
midPrice Change % Spike New Oilprice
92.20000 0.00 0 92.043405
92.26454 0.07 0 92.049689
91.96950 -0.32 0 91.979751
91.73958 -0.25 0 91.844369
91.78985 0.05 0 91.724690
91.41000 -0.41 0 91.568880
91.18148 -0.25 0 91.690812
91.24257 0.07 0 91.858391
90.95352 -0.32 0 92.016806
93.24000 2.51 1 92.139872
93.31013 0.08 0 92.321622
93.00690 -0.32 0 92.542687
92.77438 -0.25 0 92.727070
92.86400 0.10 0 92.949655
and whenever I have a Spike (1) in the column, I want to replace the 5 rows after the spike (inclusive of the spike row) with the new oil prices. The rest of the rows are kept as they are.
Any ideas how to solve that?
I tried the code based on following:
Iterate through the df (for loop)
If/else statement if spike == 1 then replace following 5 rows with values of new oil prices / else: keep oil prices
def spike(i):
    for i in df['Spike']:
        if i.loc == 1:
            df['midPrice'].replace(df['New Oilprice'][i:5])
It unfortunately doesn't work, and I'm not so strong with pandas. I tried mapping the function onto the dataframe as well, which didn't work either. I would appreciate any help.
Assuming the df is sorted by time in ascending order (as I've seen in the edit history of your question that you had a time column), you could use a mask like so:
mask = df['Spike'].eq(1).where(df['Spike'].eq(1)).fillna(method='ffill', limit=4).fillna(False)
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.242570 0.07 0 91.858391
8 90.953520 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
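An equivalent sketch for the same 5-row mask, assuming Spike only ever holds 0 or 1: a backward-looking rolling max is 1 on the spike row and on the 4 rows after it.
mask = df['Spike'].rolling(5, min_periods=1).max().eq(1)
df.loc[mask, 'midPrice'] = df['New Oilprice']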
EDIT - 2 rows before, 3 rows after:
You can adjust the mask with another fillna:
mask = df['Spike'].eq(1).where(df['Spike'].eq(1)).fillna(method='bfill', limit=2).fillna(method='ffill', limit=3).fillna(False)
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.858391 0.07 0 91.858391
8 92.016806 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
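If the window keeps changing, a small reusable helper may be clearer than chaining fillna calls. A sketch (the helper name and parameters are hypothetical):
def expand_flags(flags, before, after):
    """Boolean mask marking each flagged row plus `before` rows above it
    and `after` rows below it."""
    m = flags.eq(1)
    out = m.copy()
    for i in range(1, before + 1):
        out |= m.shift(-i, fill_value=False)  # rows before the flag
    for i in range(1, after + 1):
        out |= m.shift(i, fill_value=False)   # rows after the flag
    return out

mask = expand_flags(df['Spike'], before=2, after=3)
df.loc[mask, 'midPrice'] = df['New Oilprice']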

Selectively use df.div() to divide only a certain column based on index match

I have 2 DataFrames, one is a monthly total and the other contains values by which I want to divide the first in order to get monthly percentage contributions.
Here are some example DataFrames:
MonthlyTotals = pd.DataFrame(data={'Month': [1, 2, 3], 'Value': [100, 200, 300]})
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})
I am using df.div() so I set the index like so
MonthlyTotals.set_index('Month', inplace=True)
Data.set_index('Month', inplace=True)
Then I do the division
Contributions = Data.div(MonthlyTotals, axis='index')
The resulting DataFrame is what I want but I cannot see the ID that the Value relates to as this isn't in the MonthlyTotals frame. How would I use df.div() but only selectively on certain columns?
Here is an example dataframe of the result I am looking for
result = pd.DataFrame(data={'ID':[1,2,3,1,2,3,1,2,3],'Value':[0.4,0.3,0.3,0.3,0.35,0.35,0.5,0.2,0.3]})
You may not need MonthlyTotals if Data is complete. You can calculate MonthlyTotal using groupby with transform and then calculate Contributions.
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})

Data['MonthlyTotal'] = Data.groupby('Month')['Value'].transform('sum')
Data['Contributions'] = Data['Value'] / Data['MonthlyTotal']
Output
ID Month Value MonthlyTotal Contributions
0 1 1 40 100 0.40
1 2 1 30 100 0.30
2 3 1 30 100 0.30
3 1 2 60 200 0.30
4 2 2 70 200 0.35
5 3 2 70 200 0.35
6 1 3 150 300 0.50
7 2 3 60 300 0.20
8 3 3 90 300 0.30
Also, if you would like to use only pandas, you can fix your code with reindex + update:
Data.update(Data['Value'].div(MonthlyTotals['Value'].reindex(Data.index), axis=0))
Data
ID Value
Month
1 1 0.40
1 2 0.30
1 3 0.30
2 1 0.30
2 2 0.35
2 3 0.35
3 1 0.50
3 2 0.20
3 3 0.30
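As a side note, once both frames share the Month index, plain Series division aligns the unique monthly totals against the repeated Month labels automatically, so the "selective" division can be a sketch as short as:
Data['Value'] = Data['Value'] / MonthlyTotals['Value']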

Pandas: How to maintain the type of columns with nan?

For example, I have a df with NaN values and use the following method to fillna.
import pandas as pd
a = [[2.0, 10, 4.2], ['b', 70, 0.03], ['x', ]]
df = pd.DataFrame(a)
print(df)
df.fillna(int(0),inplace=True)
print('fillna df\n',df)
dtype_df = df.dtypes.reset_index()
OUTPUT:
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x NaN NaN
fillna df
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x 0.0 0.00
col type
0 0 object
1 1 float64
2 2 float64
Actually, I want column 1 to keep the type int instead of float.
My desired output:
fillna df
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
col type
0 0 object
1 1 int64
2 2 float64
So how to do it?
Try adding downcast='infer' to downcast any eligible columns:
df.fillna(0, downcast='infer')
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
And the corresponding dtypes are
0 object
1 int64
2 float64
dtype: object
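If the column should stay integer even while it still holds missing values, a sketch using pandas' nullable Int64 dtype (in recent pandas versions) is an alternative:
df[1] = df[1].astype('Int64')  # 10, 70, <NA> -- an integer dtype that allows missing values
df = df.fillna(0)              # column 1 stays Int64 after filling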

Sklearn changing string class label to int

I have a pandas dataframe and I'm trying to change the values in a given column which are represented by strings into integers. For instance:
df = index fruit quantity price
0 apple 5 0.99
1 apple 2 0.99
2 orange 4 0.89
4 banana 1 1.64
...
10023 kiwi 10 0.92
I would like it to look like:
df = index fruit quantity price
0 1 5 0.99
1 1 2 0.99
2 2 4 0.89
4 3 1 1.64
...
10023 5 10 0.92
I can do this using
df["fruit"] = df["fruit"].map({"apple": 1, "orange": 2,...})
which works if I have a small list to change, but I'm looking at a column with over 500 different labels. Is there any way of changing this from a string to an int?
You can use sklearn.preprocessing
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.fruit)
df['categorical_label'] = le.transform(df.fruit)
Transform labels back to original encoding.
le.inverse_transform(df['categorical_label'])
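If you also want to see the mapping the encoder learned (the analogue of the manual map() dict), a small sketch:
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
# e.g. {'apple': 0, 'banana': 1, 'kiwi': 2, 'orange': 3}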
Use factorize and then convert to categorical if necessary:
df.fruit = pd.factorize(df.fruit)[0]
print (df)
fruit quantity price
0 0 5 0.99
1 0 2 0.99
2 1 4 0.89
3 2 1 1.64
4 3 10 0.92
df.fruit = pd.Categorical(pd.factorize(df.fruit)[0])
print (df)
fruit quantity price
0 0 5 0.99
1 0 2 0.99
2 1 4 0.89
3 2 1 1.64
4 3 10 0.92
print (df.dtypes)
fruit category
quantity int64
price float64
dtype: object
Also, if you need to count from 1:
df.fruit = pd.Categorical(pd.factorize(df.fruit)[0] + 1)
print (df)
fruit quantity price
0 1 5 0.99
1 1 2 0.99
2 2 4 0.89
3 3 1 1.64
4 4 10 0.92
You can use the factorize method:
In [13]: df['fruit'] = pd.factorize(df['fruit'])[0].astype('uint16')
In [14]: df
Out[14]:
index fruit quantity price
0 0 0 5 0.99
1 1 0 2 0.99
2 2 1 4 0.89
3 4 2 1 1.64
4 10023 3 10 0.92
In [15]: df.dtypes
Out[15]:
index int64
fruit uint16
quantity int64
price float64
dtype: object
Alternatively, you can do it this way:
In [21]: df['fruit'] = df.fruit.astype('category').cat.codes
In [22]: df
Out[22]:
index fruit quantity price
0 0 0 5 0.99
1 1 0 2 0.99
2 2 3 4 0.89
3 4 1 1 1.64
4 10023 2 10 0.92
In [23]: df.dtypes
Out[23]:
index int64
fruit int8
quantity int64
price float64
dtype: object
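One difference worth knowing between the two approaches: factorize numbers labels in order of first appearance, while category codes follow the sorted category order, which is why the two outputs above can disagree. A quick sketch:
import pandas as pd

s = pd.Series(['orange', 'apple', 'orange', 'banana'])
print(pd.factorize(s)[0])                       # [0 1 0 2] -- order of appearance
print(s.astype('category').cat.codes.tolist())  # [2, 0, 2, 1] -- sorted categories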

Missing values in .csv after writing a pandas DataFrame

I have a .csv file:
20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
I am normalizing it using a pandas DataFrame, but I get missing values in the output .csv file:
.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
My code :
import pandas as pd
df = pd.read_csv('pooja.csv',index_col=False)
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean()) / (df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min())
rslt = pd.concat([df_norm, df.ix[:,-1]], axis=1)
rslt.to_csv('example.csv',index=False,header=False)
What's wrong in the code? Why are values missing in the .csv file?
You get many NaN values because you divide 0 by 0. See the broadcasting behaviour; a better explanation is here. I use the code from your previous question, because I think the slicing with df.ix[:, 1:-1] is not necessary; after normalizing with that slicing I get an empty DataFrame.
import pandas as pd
import numpy as np
import io
temp=u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),index_col=None, header=None)
#print(df)
#filter only the first 5 columns for testing
df = df.iloc[:, :5]
print(df)
0 1 2 3 4
0 20376.650 22398.290 4.80 0 1
1 7048.842 8421.754 1.44 0 1
2 3716.890 4927.620 0.12 0 1
3 3716.890 4927.620 0.00 0 1
4 3716.890 4927.620 0.00 0 1
#get max values by column
print(df.max())
0 20376.65
1 22398.29
2 4.80
3 0.00
4 1.00
dtype: float64
#get min values by column
print(df.min())
0 3716.89
1 4927.62
2 0.00
3 0.00
4 1.00
dtype: float64
#the difference -- columns 3 and 4 come out as 0
print(df.max() - df.min())
0 16659.76
1 17470.67
2 4.80
3 0.00
4 0.00
dtype: float64
print(df - df.mean())
0 1 2 3 4
0 12661.4176 13277.7092 3.528 0 0
1 -666.3904 -698.8268 0.168 0 0
2 -3998.3424 -4192.9608 -1.152 0 0
3 -3998.3424 -4192.9608 -1.272 0 0
4 -3998.3424 -4192.9608 -1.272 0 0
#you get NaN because for columns 3 and 4 both the deviation and the
#max-min difference are 0, and 0/0 is NaN
df_norm = (df - df.mean()) / (df.max() - df.min())
print(df_norm)
0 1 2 3 4
0 0.76 0.76 0.735 NaN NaN
1 -0.04 -0.04 0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN
Finally, when you generate the CSV with to_csv, each NaN becomes an empty string, because the parameter na_rep has the default value "":
print(df_norm.to_csv(index=False, header=False, na_rep=""))
0.76,0.76,0.735,,
-0.04,-0.04,0.035,,
-0.24,-0.24,-0.24,,
-0.24,-0.24,-0.265,,
-0.24,-0.24,-0.265,,
If you change the value of na_rep:
#change na_rep to * for testing
print(df_norm.to_csv(index=False, header=False, na_rep="*"))
0.76,0.76,0.735,*,*
-0.04,-0.04,0.035,*,*
-0.24,-0.24,-0.24,*,*
-0.24,-0.24,-0.265,*,*
-0.24,-0.24,-0.265,*,*
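To avoid the empty cells altogether, one possible fix (a sketch, not the only option) is to neutralize the zero ranges before dividing, so constant columns come out as 0 instead of NaN:
rng = (df.max() - df.min()).replace(0, np.nan)  # avoid dividing by a zero range
df_norm = ((df - df.mean()) / rng).fillna(0)    # constant columns become 0
print(df_norm.to_csv(index=False, header=False))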
