How to compute mean absolute deviation row-wise in pandas - python

A snippet of the dataframe is as follows, but the actual dataset is 200000 x 130.
ID   1-jan  2-jan  3-jan  4-jan
1.   4      5      7      8
2.   2      0      1      9
3.   5      8      0      1
4.   3      4      0      0
I am trying to compute the Mean Absolute Deviation for each row, like this.
ID      1-jan  2-jan  3-jan  4-jan  mean
1.      4      5      7      8      12.5
1_MAD   8.5    7.5    5.5    4.5
2.      2      0      1      9      6
2_MAD   4      6      5      3
.
.
I tried this,
new_df = pd.DataFrame()
for row in df['ID']:
    new_df[str(row) + '_mad'] = mad(df.loc[row][1:])
new_df = new_df.T
where mad is a function that computes the absolute difference between each value and the row mean.
But this is very time consuming since I have a large dataset, and I need to do it as quickly as possible.
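For reference, a minimal sketch of what such a mad helper might look like (an assumption, not the asker's actual function); calling it once per row from a Python loop is exactly what makes the approach slow, and the answers below vectorise the same idea:

def mad(row):
    # absolute deviation of each value in the row from the row's mean
    return (row - row.mean()).abs()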

pd.concat([df1.assign(mean1=df1.mean(axis=1)).set_index(df1.index.astype('str')),
           df1.assign(mean1=df1.mean(axis=1))
              .apply(lambda ss: ss.mean1 - ss, axis=1)
              .T.add_suffix('_MAD').T.assign(mean1='')
          ]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75

IIUC use:
#convert ID to index
df = df.set_index('ID')
#row means as a Series
mean = df.mean(axis=1)
from toolz import interleave
#subtract the mean from every column, take absolute values, add suffix to the index
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
#concat with the original (plus a mean column) and interleave the row order
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print (df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN

It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)
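From there the absolute deviations follow from a single vectorised subtraction; a minimal sketch, assuming ID is already the index so that only the date columns remain:

row_mean = df.mean(axis=1)                    # per-row mean across the date columns
mad_values = df.sub(row_mean, axis=0).abs()   # absolute deviation of each cell from its row mean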


Splitting a dataframe into chunks and naming each new chunk into a dataframe

Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records; split by 200 and create df1, df2, ..., df5.
Any guidance would be much appreciated.
I've looked on other boards and there is no guidance for a function that can automatically create new dataframes.
Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes with 2 rows each.
For your case (1000 records split into chunks of 200), np.split(df, 5) gives 5 dataframes of 200 rows each.
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
Just use df = df.values first to convert the dataframe to a numpy array.
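If the goal is to end up with named chunks, a small sketch that collects them in a dict instead of creating separate df1, df2, ... variables (the name dfmaster and the chunk count come from the question; np.array_split also tolerates lengths that do not divide evenly):

import numpy as np

chunks = np.array_split(dfmaster, 5)                          # 1000 rows -> 5 chunks of 200
dfs = {f'df{i + 1}': chunk for i, chunk in enumerate(chunks)}
print(dfs['df1'].shape)                                       # (200, <number of columns>)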

Pandas rowwise ffilling

Is it possible to specify an ffill for an entire row? What I mean by this is to condition on one value [Check] in the row to see if the row should be forward-filled.
My main goal is to keep row integrity intact (i.e. I only want to forward-fill an entire row into the next one). For the sake of simplicity, assume that each row corresponds to an event; I want to forward-fill the data from the past event if the new event does not have data (in Val1). I do not want to mix data from past events as I forward-fill. It should be noted that NaN values might be legitimate values for an event and should be forward-filled as well.
First Example:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 nan nan nan nan
3 2.00 2.00 4.00 3.00 3.00
Should become
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 4.00 nan 3.00 4.00
3 2.00 2.00 4.00 3.00 3.00
and not:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 2.00 3.00 4.00
2 2.00 4.00 2.00 3.00 4.00
3 2.00 2.00 4.00 3.00 3.00
Second example:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 4.00 nan nan nan
3 2.00 2.00 4.00 3.00 3.00
Should remain unchanged.
To replace only one NaN per column: first forward-fill all values, then check for consecutive NaNs, which the mask sets back to NaN:
df = df.ffill().mask((df.ffill(limit=1) * df.bfill(limit=1)).isnull())
print (df)
0 1 2 3 4
0 2.0 3.0 2.0 2.0 3.0
1 2.0 4.0 NaN 3.0 4.0
2 2.0 4.0 NaN 3.0 4.0
3 2.0 2.0 4.0 3.0 3.0
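An alternative sketch that works row-wise rather than column-wise, copying the whole previous row whenever Val1 is missing; this assumes an empty event never directly follows another empty event, which matches the examples:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Check': [2.0, 2.0, 2.0, 2.0],
                   'Val1':  [3.0, 4.0, np.nan, 2.0],
                   'Val2':  [2.0, np.nan, np.nan, 4.0],
                   'Val3':  [2.0, 3.0, np.nan, 3.0],
                   'Val4':  [3.0, 4.0, np.nan, 3.0]})

# rows whose Val1 is missing take the entire previous row, legitimate NaNs included
missing = df['Val1'].isna()
df.loc[missing] = df.shift(1).loc[missing]
print(df)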

Data split over 2 rows for each row entry - read in with pandas

I'm dealing with a dataset where each 'entry' is split over a varying number of rows,
i.e.
yyyymmdd hhmmss lat lon name nprt depth ubas udir cabs cdir
hs tp lp theta sp wf
20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9 0.00 0.0
0 0.10 1.48 3.40 183.19 30.16 0.89
1 0.10 1.48 3.40 183.21 29.66 0.90
20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7 0.00 0.0
0 0.63 4.24 28.02 105.05 32.71 0.85
1 0.60 4.21 27.68 110.42 27.04 0.95
2 0.20 5.78 52.18 43.73 17.98 0.01
3 0.06 6.55 66.86 176.86 11.04 0.10
20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00
I'm only interested in the rows that begin with a date so the rest can be discarded. However, the number of additional rows varies throughout the data set (see code snippet for an example).
Ideally, I'd like to use pandas read_csv but I'm open to suggestions if that's not possible/there are easier ways.
So my question is how do you read data into a dataframe where the row begins with a date?
Thanks
You can use read_csv first, then cast the first and second columns with to_datetime and the parameter errors='coerce' - it adds NaT where there are no dates. Last, filter the rows with boolean indexing and a mask created by notnull:
import pandas as pd
from io import StringIO
temp=u"""yyyymmdd hhmmss lat lon name nprt depth ubas udir cabs cdir
hs tp lp theta sp wf
20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9 0.00 0.0
0 0.10 1.48 3.40 183.19 30.16 0.89
1 0.10 1.48 3.40 183.21 29.66 0.90
20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7 0.00 0.0
0 0.63 4.24 28.02 105.05 32.71 0.85
1 0.60 4.21 27.68 110.42 27.04 0.95
2 0.20 5.78 52.18 43.73 17.98 0.01
3 0.06 6.55 66.86 176.86 11.04 0.10
20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), delim_whitespace=True)
print (pd.to_datetime(df.iloc[:,0] + df.iloc[:,1], errors='coerce', format='%Y%m%d%H%M%S'))
0 NaT
1 2014-07-01
2 NaT
3 NaT
4 2014-07-01
5 NaT
6 NaT
7 NaT
8 NaT
9 2014-07-01
10 NaT
dtype: datetime64[ns]
mask = pd.to_datetime(df.iloc[:,0] + df.iloc[:,1], errors='coerce', format='%Y%m%d%H%M%S').notnull()
print (mask)
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
dtype: bool
print (df[mask])
yyyymmdd hhmmss lat lon name nprt depth ubas udir \
1 20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9
4 20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7
9 20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0
cabs cdir
1 0.0 0.0
4 0.0 0.0
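An alternative sketch that pre-filters the raw lines with a regular expression before handing them to read_csv, so the non-date rows never reach pandas (the filename and the column names are assumptions copied from the sample header):

import re
from io import StringIO
import pandas as pd

cols = ['yyyymmdd', 'hhmmss', 'lat', 'lon', 'name', 'nprt',
        'depth', 'ubas', 'udir', 'cabs', 'cdir']

with open('filename.csv') as fh:
    # keep only lines that start with an 8-digit date and a 6-digit time
    wanted = [line for line in fh if re.match(r'\d{8}\s+\d{6}\s', line)]

df = pd.read_csv(StringIO(''.join(wanted)), delim_whitespace=True,
                 header=None, names=cols)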

Reading csv file using pandas where columns are separated by varying amounts of whitespace and commas

I want to read the csv file as a pandas dataframe. CSV file is here: https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0
In particular,
I want to skip the first row
The column headers are in row 2. In this case, they are: 1, 1, 2 and TOT. I do not want to hardcode them though. It is ok if the only column that gets extracted is TOT
I do not want to use a non-pandas approach if possible.
Here is what I am doing:
df = pandas.read_csv('https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0', skiprows=1, skipinitialspace=True, sep=' ')
But this gives the error:
*** CParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6
The output should look something like this:
1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52
1 BD 33kpa(t/m3) 1.6 1.6 1.6
2 SAND(%) 42.1 42.1 65.1
3 SILT(%) 37.9 37.9 16.9
4 CLAY(%) 20 20 18
5 ROCK(%) 12 12 12
6 WLS(kg/ha) 0 5 0.1 5.1
7 WLM(kg/ha) 0 5 0.1 5.1
8 WLSL(kg/ha) 0 4 0.1 4.1
9 WLSC(kg/ha) 0 2.1 0 2.1
10 WLMC(kg/ha) 0 2.1 0 2.1
11 WLSLC(kg/ha) 0 1.7 0 1.7
12 WLSLNC(kg/ha) 0 0.4 0 0.4
13 WBMC(kg/ha) 9 1102.1 250.9 1361.9
14 WHSC(kg/ha) 69 8432 1920 10420
15 WHPC(kg/ha) 146 18018 4102 22266
16 WOC(kg/ha) 224 27556 6272 34
17 WLSN(kg/ha) 0 0 0 0
18 WLMN(kg/ha) 0 0.2 0 0.2
19 WBMN(kg/ha) 0.9 110.2 25.1 136.2
20 WHSN(kg/ha) 7 843 192 1042
21 WHPN(kg/ha) 15 1802 410 2227
22 WON(kg/ha) 22 2755 627 3405
23 CFEM(kg/ha) 0
You can specify a regular expression to be used as your delimiter; in your case it will work with [\s,]{2,20}, i.e. a run of 2 to 20 spaces or commas:
In [180]: pd.read_csv('aaaa.csv',
skiprows = 1,
sep='[\s,]{2,20}',
index_col=0)
Out[180]:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 0.00 5.00 0.10 5.1
8 WLM(kg/ha) 0.00 5.00 0.10 5.1
9 WLSL(kg/ha) 0.00 4.00 0.10 4.1
10 WLSC(kg/ha) 0.00 2.10 0.00 2.1
11 WLMC(kg/ha) 0.00 2.10 0.00 2.1
12 WLSLC(kg/ha) 0.00 1.70 0.00 1.7
13 WLSLNC(kg/ha) 0.00 0.40 0.00 0.4
14 WBMC(kg/ha) 9.00 1102.10 250.90 1361.9
15 WHSC(kg/ha) 69.00 8432.00 1920.00 10420.0
16 WHPC(kg/ha) 146.00 18018.00 4102.00 22266.0
17 WOC(kg/ha) 224.00 27556.00 6272.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.20 0.00 0.2
20 WBMN(kg/ha) 0.90 110.20 25.10 136.2
21 WHSN(kg/ha) 7.00 843.00 192.00 1042.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2755.00 627.00 3405.0
24 CFEM(kg/ha) 0.00 NaN NaN NaN
25, None NaN NaN NaN NaN
26, None NaN NaN NaN NaN
You need to specify the names of the columns. Notice the trick I used to get two columns called 1 (one is an integer name and the other is text).
Given how badly the data is structured, this is not perfect (note row 2 where BD and 33kpa got split because of the space between them).
pd.read_csv('/Downloads/aaaa.csv',
skiprows=2,
skipinitialspace=True,
sep=' ',
names=['Index', 'Description',1,"1",2,'TOT'],
index_col=0)
Description 1 1 2 TOT
Index
1, DEPTH(m) 0.01 1.24 1.52 NaN
2, BD 33kpa(t/m3) 1.60 1.60 1.6
3, SAND(%) 42.1 42.10 65.10 NaN
4, SILT(%) 37.9 37.90 16.90 NaN
5, CLAY(%) 20.0 20.00 18.00 NaN
6, ROCK(%) 12.0 12.00 12.00 NaN
7, WLS(kg/ha) 0.0 5.00 0.10 5.1
8, WLM(kg/ha) 0.0 5.00 0.10 5.1
9, WLSL(kg/ha) 0.0 4.00 0.10 4.1
10, WLSC(kg/ha) 0.0 2.10 0.00 2.1
11, WLMC(kg/ha) 0.0 2.10 0.00 2.1
12, WLSLC(kg/ha) 0.0 1.70 0.00 1.7
13, WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
14, WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
15, WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
16, WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
17, WOC(kg/ha) 224. 27556.00 6272.00 34.0
18, WLSN(kg/ha) 0.0 0.00 0.00 0.0
19, WLMN(kg/ha) 0.0 0.20 0.00 0.2
20, WBMN(kg/ha) 0.9 110.20 25.10 136.2
21, WHSN(kg/ha) 7. 843.00 192.00 1042.0
22, WHPN(kg/ha) 15. 1802.00 410.00 2227.0
23, WON(kg/ha) 22. 2755.00 627.00 3405.0
24, CFEM(kg/ha) 0. NaN NaN NaN
25, NaN NaN NaN NaN NaN
26, NaN NaN NaN NaN NaN
Or you can reset the index.
>>> (pd.read_csv('/Downloads/aaaa.csv',
skiprows=2,
skipinitialspace=True,
sep=' ',
names=['Index', 'Description',1,"1",2,'TOT'],
index_col=0)
.reset_index(drop=True)
.dropna(axis=0, how='all'))
Description 1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52 NaN
1 BD 33kpa(t/m3) 1.60 1.60 1.6
2 SAND(%) 42.1 42.10 65.10 NaN
3 SILT(%) 37.9 37.90 16.90 NaN
4 CLAY(%) 20.0 20.00 18.00 NaN
5 ROCK(%) 12.0 12.00 12.00 NaN
6 WLS(kg/ha) 0.0 5.00 0.10 5.1
7 WLM(kg/ha) 0.0 5.00 0.10 5.1
8 WLSL(kg/ha) 0.0 4.00 0.10 4.1
9 WLSC(kg/ha) 0.0 2.10 0.00 2.1
10 WLMC(kg/ha) 0.0 2.10 0.00 2.1
11 WLSLC(kg/ha) 0.0 1.70 0.00 1.7
12 WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
13 WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
14 WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
15 WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
16 WOC(kg/ha) 224. 27556.00 6272.00 34.0
17 WLSN(kg/ha) 0.0 0.00 0.00 0.0
18 WLMN(kg/ha) 0.0 0.20 0.00 0.2
19 WBMN(kg/ha) 0.9 110.20 25.10 136.2
20 WHSN(kg/ha) 7. 843.00 192.00 1042.0
21 WHPN(kg/ha) 15. 1802.00 410.00 2227.0
22 WON(kg/ha) 22. 2755.00 627.00 3405.0
23 CFEM(kg/ha) 0. NaN NaN NaN
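If, as the question allows, only the TOT column needs to come out, a small sketch built on the regex-separator idea; engine='python' is included because the C parser cannot handle regular-expression separators:

import pandas as pd

df = pd.read_csv('aaaa.csv',
                 skiprows=1,
                 sep=r'[\s,]{2,20}',
                 engine='python',
                 index_col=0)
tot = df['TOT']            # NaN where the row has no total
print(tot.dropna())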

Error in getting specific row from python pandas

I want to extract a row by name from the following dataframe:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 2.60 8.20 0.10 10.9
8 WLM(kg/ha) 5.00 8.30 0.00 13.4
9 WLSL(kg/ha) 0.00 3.80 0.10 3.9
10 WLSC(kg/ha) 1.10 3.50 0.00 4.6
11 WLMC(kg/ha) 2.10 3.50 0.00 5.6
12 WLSLC(kg/ha) 0.00 1.60 0.00 1.6
13 WLSLNC(kg/ha) 1.10 1.80 0.00 2.9
14 WBMC(kg/ha) 3.40 835.10 195.20 1033.7
15 WHSC(kg/ha) 66.00 8462.00 1924.00 10451.0
16 WHPC(kg/ha) 146.00 18020.00 4102.00 22269.0
17 WOC(kg/ha) 219.00 27324.00 6221.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.10 0.00 0.1
20 WBMN(kg/ha) 0.50 92.60 19.30 112.5
21 WHSN(kg/ha) 7.00 843.00 191.00 1041.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2738.00 621.00 3381.0
I want to extract the row containing info on WOC(kg/ha). Here is what I am doing:
df.loc['WOC(kg/ha)']
but I get the error:
*** KeyError: 'the label [WOC(kg/ha)] is not in the [index]'
You don't have that label in your index; it's in your first column. The following should work:
df.loc[df['Unnamed: 1'] == 'WOC(kg/ha)']
otherwise set the index to that column and your code would work fine:
df.set_index('Unnamed: 1', inplace=True)
Also, this can be used to set the index without explicitly specifying the column name: df.set_index(df.columns[0], inplace=True)
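Putting the two together, a short sketch of looking up a single value once the first column is the index (the column label 'TOT' is taken from the dataframe shown above):

# make the text column the index, then label-based lookup works
df = df.set_index('Unnamed: 1')
woc_row = df.loc['WOC(kg/ha)']            # the whole row as a Series
woc_total = df.loc['WOC(kg/ha)', 'TOT']   # just the TOT value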
