Restore index and append zeros after subtracting dataframe values - python

I am calculating the difference of a dataframe's values at different lags.
The following dataframe is my input:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=list('AB'))
To compute the difference between the last three rows and the first three rows, I am doing the following:
df2 = df.iloc[1:, :]
df3 = df.iloc[:-1, :]
df_out = pd.DataFrame(df2.values - df3.values, index=df2.index)
The calculation is as expected, but I want to retain index 0 with zeros in that row.
df_expected_out = pd.DataFrame([[0, 0], [2, 2], [2, 2], [2, 2]], columns=list('AB'))
Please suggest the way forward. Thanks for your time.

I think you need to reindex by the original index:
df_out = pd.DataFrame(df2.values - df3.values, index=df2.index).reindex(df.index, fill_value=0)
print (df_out)
0 1
0 0 0
1 2 2
2 2 2
3 2 2
Another solution:
df_out = df.diff().fillna(0).astype(int)
Or append a first row of zeros to both arrays:
import numpy as np
a1 = np.zeros((1, len(df.columns)), dtype=int)
arr = np.append(a1, df2.values, axis=0) - np.append(a1, df3.values, axis=0)
df_out = pd.DataFrame(arr, index=df.index)
print (df_out)
0 1
0 0 0
1 2 2
2 2 2
3 2 2
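A NumPy-only variant of the same idea, as a sketch (np.diff with the prepend argument needs NumPy >= 1.16):
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=list('AB'))
# prepend the first row to itself so the first difference is zero
arr = np.diff(df.values, axis=0, prepend=df.values[:1])
df_out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df_out)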

You can use the shift function:
(df - df.shift()).fillna(0)
Out[9]:
A B
0 0.0 0.0
1 2.0 2.0
2 2.0 2.0
3 2.0 2.0
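The NaN introduced by shift() forces a float result (hence the 2.0 values above). If you want the integer output from df_expected_out, a small sketch casting back after the fill:
df_out = (df - df.shift()).fillna(0).astype(int)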

Related

Subtract with value in previous row to create a new column by subject

Using python and this data set https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv I would like to create a new column for each subject that starts with 0 (in the first row) and then contains the difference between consecutive SCALE values (row 2 minus row 1, row 3 minus row 2, row 4 minus row 3, etc.).
However, if this produces a negative value, the output should be 0.
Edit: Thank you for the response. That worked perfectly. The only remaining issue is that I'd like to start again from 0 with each subject (SUBJECT column). The number of values per subject is not fixed, so something that checks the SUBJECT column and then starts again from 0 would be ideal.
You can use .shift(1) to create a new column with the values moved down from the previous rows - then you have both values in the same row and you can subtract the columns.
Later you can select all negative results and assign zero.
import pandas as pd

data = {
    'A': [1, 3, 2, 5, 1],
}
df = pd.DataFrame(data)

df['previous'] = df['A'].shift(1)
df['result'] = df['A'] - df['previous']
print(df)

# the same in one line, without the helper column:
# df['result'] = df['A'] - df['A'].shift(1)
# print(df)

df.loc[df['result'] < 0, 'result'] = 0
print(df)
Result:
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 -1.0
3 5 2.0 3.0
4 1 5.0 -4.0
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 0.0
3 5 2.0 3.0
4 1 5.0 0.0
EDIT:
If you use df['result'] = df['A'] - df['A'].shift(1), then you get the result column without creating the previous column.
And if you use .shift(1, fill_value=0), then it will put 0 instead of NaN in the first row.
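A minimal sketch combining both remarks (same toy data as above):
import pandas as pd
df = pd.DataFrame({'A': [1, 3, 2, 5, 1]})
# fill_value=0 puts 0 instead of NaN in the first shifted row,
# so the result stays numeric with nothing to clean up afterwards
df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)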
EDIT:
You can use groupby("SUBJECT") to group by subject and later, in every group, put 0 in the first row.
import pandas as pd

data = {
    'S': ['A', 'A', 'A', 'B', 'B', 'B'],
    'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)

df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)

df.loc[df['result'] < 0, 'result'] = 0
print(df)

# put 0 in the first row of every group
all_groups = df.groupby('S')
first_index = all_groups.apply(lambda grp: grp.index[0])
df.loc[first_index, 'result'] = 0
print(df)
Results:
S A result
0 A 1 1
1 A 3 2
2 A 2 -1
3 B 1 -1
4 B 5 4
5 B 1 -4
S A result
0 A 1 1
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
S A result
0 A 1 0
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
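For completeness, a more compact spelling of the same per-subject logic, as a sketch (it assumes the toy columns S and A used above; with the real data the group key would be SUBJECT and the value column SCALE):
import pandas as pd
data = {
    'S': ['A', 'A', 'A', 'B', 'B', 'B'],
    'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)
# diff() within each group puts NaN in each group's first row;
# fillna(0) turns those into the required starting zeros, and
# clip(lower=0) replaces negative differences with 0
df['result'] = df.groupby('S')['A'].diff().fillna(0).clip(lower=0)
print(df)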

Append function in python (pandas)

import pandas as pd
df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')
df_app = df.append(ser)
print('{}\n'.format(df_app))  # has 3 rows
df_app = df.append(ser, ignore_index=True)
print('{}\n'.format(df_app))  # has 3 rows
df2 = pd.DataFrame([[0, 0], [9, 9]])
df_app = df.append(df2)
print(format(df_app))  # didn't understand this part; where did the series row go?
OUTPUT
0 1
0 5.0 6
1 1.2 3
r3 0.0 0
0 1
0 5.0 6
1 1.2 3
2 0.0 0
0 1
0 5.0 6
1 1.2 3
0 0.0 0
1 9.0 9
I didn't understand where the appended series went in the last append.
df has 2 rows; after the [0, 0] series is appended there are 3 rows.
df2 also has 2 rows, and after appending it there is a total of 4 rows. Where did the series row go?
You have been appending to the df DataFrame, which stays the same every time; append returns a new object rather than modifying df in place.
You are printing df_app, which is rebuilt from df on every call. If you want to accumulate all the rows, append to df_app itself instead of df.
You should change the code to the following:
df_app = df_app.append(ser)
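Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current versions the same accumulation can be written with pd.concat; a sketch with the data from the question:
import pandas as pd
df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')
df2 = pd.DataFrame([[0, 0], [9, 9]])
# convert the Series to a one-row frame, then keep concatenating
# onto the accumulated result instead of onto the original df
df_app = pd.concat([df, ser.to_frame().T])
df_app = pd.concat([df_app, df2])
print(df_app)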

Convert dataframe to pivot table with booleans(0, 1) with Pandas [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies where to do the one-hot encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
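A side note for newer pandas (not part of the original answer): since 2.0, get_dummies returns boolean columns by default, so pass dtype=int if you want the 0/1 integers shown above. A sketch:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1, 2, 3]})
# dtype=int forces 0/1 integer dummies instead of the bool default
print(pd.get_dummies(df, columns=['A', 'B'], dtype=int))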
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assume you have a dataframe named df with the columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
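A runnable sketch of the patsy approach, assuming the package is installed and using made-up Name/Year values:
import pandas as pd
import patsy
df = pd.DataFrame({'Name': ['x', 'y', 'x'], 'Year': [2000, 2001, 2000]})
# C(...) marks a column as categorical; the result includes an Intercept
# column and treatment-coded dummies (one reference level dropped per factor)
dm = patsy.dmatrix('~ C(Name) + C(Year)', df, return_type='dataframe')
print(dm)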
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop.
First separate the categorical data from the DataFrame using select_dtypes(include="object"),
then apply get_dummies to each column iteratively in the loop,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in list(train_cate.columns):
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
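One caveat: assigning new columns into a frame sliced out of another can trigger SettingWithCopyWarning on some pandas versions. A defensive sketch, using a hypothetical train_data, copies first:
import pandas as pd
# hypothetical stand-in for train_data
train_data = pd.DataFrame({'color': ['red', 'blue'], 'size': ['S', 'M'], 'price': [1, 2]})
# .copy() avoids SettingWithCopyWarning; list(...) snapshots the
# column names so newly added dummy columns are not iterated over
train_cate = train_data.select_dtypes(include="object").copy()
for col in list(train_cate.columns):
    dummies = pd.get_dummies(train_cate[col])
    train_cate[dummies.columns] = dummies
print(train_cate)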

concat two dataframes using python

We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both dataframes like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And, the second like,
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create a new column in df1 using df2
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
The assumption, however, is that both dataframes share a common index.
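If the indexes do not line up, one hedge is to discard them and align positionally; a sketch with a deliberately mismatched index:
import pandas as pd
df1 = pd.DataFrame({'a': [-0.140447, 0.062166, -0.002989]})
df2 = pd.DataFrame({'d': [1, 1, 2]}, index=[10, 11, 12])  # mismatched index
# reset_index(drop=True) renumbers df2's rows 0..n-1 so the
# assignment aligns by position instead of by the old labels
df1['d_new'] = df2['d'].reset_index(drop=True)
print(df1)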
Use pd.concat and specify axis=1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
0 1 2 0
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2

Why does a pandas Series of DataFrame mean() fail, but sum() does not, and how to make it work?

There may be a smarter way to do this in Python Pandas, but the following example should work, yet doesn't:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
df2 = df1.copy()
df3 = df1.copy()
idx = pd.date_range("2010-01-01", freq='H', periods=3)
s = pd.Series([df1, df2, df3], index=idx)
# This causes an error
s.mean()
I won't post the whole traceback, but the main error message is interesting:
TypeError: Could not convert a b
0 3 0
1 3 6
2 6 0 to numeric
It looks like the dataframe was successfully summed, but not divided by the length of the series.
However, we can take the sum of the dataframes in the series:
s.sum()
... returns:
a b
0 3 0
1 3 6
2 6 0
Why wouldn't mean() work when sum() does? Is this a bug or a missing feature? This does work:
(df1 + df2 + df3)/3.0
... and so does this:
s.sum()/3.0
a b
0 1.0 0.0
1 1.0 2.0
2 2.0 0.0
But this of course is not ideal.
You could (as suggested by @unutbu) use a hierarchical index, but when you have a three-dimensional array you should consider using a pandas Panel, especially when one of the dimensions represents time, as in this case.
The Panel is often overlooked, but it is after all where the name pandas comes from (panel data system). Note, though, that Panel was deprecated in pandas 0.20 and removed in 0.25, so on modern versions the hierarchical-index approach is the one that still works.
Data slightly different from your original so there are not two dimensions with the same length:
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=['a', 'b'])
df2 = df1 + 1
df3 = df1 + 10
Panels can be created in a couple of different ways, but one is from a dict. You can create the dict from your index and the dataframes with:
s = pd.Panel(dict(zip(idx,[df1,df2,df3])))
The mean you are looking for is simply a matter of operating on the correct axis (axis=0 in this case):
s.mean(axis=0)
Out[80]:
a b
0 4.666667 3.666667
1 4.666667 5.666667
2 5.666667 3.666667
3 5.666667 6.666667
With your data, sum(axis=0) returns the expected result.
EDIT: OK, too late for panels, as the hierarchical index approach is already "accepted". I will say that that approach is preferable if the data is known to be "ragged", with an unknown and different number of rows in each grouping. For "square" data, the panel is absolutely the way to go and will be significantly faster, with more built-in operations. Pandas 0.15 has many improvements for multi-level indexing but still has limitations and dark edge cases in real-world apps.
When you define s with
s = pd.Series([df1, df2, df3], index=idx)
you get a Series with DataFrames as items:
In [77]: s
Out[77]:
2010-01-01 00:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 a b
0 1 0
1 1 2
2 2 0
Freq: H, dtype: object
The sum of the items is a DataFrame:
In [78]: s.sum()
Out[78]:
a b
0 3 0
1 3 6
2 6 0
but when you take the mean, nanops.nanmean is called:
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
    ...
Notice that _ensure_numeric (source code) is called on the resultant sum.
An error is raised because a DataFrame is not numeric.
Here is a workaround. Instead of making a Series with DataFrames as items,
you can concatenate the DataFrames into a new DataFrame with a hierarchical index:
In [79]: s = pd.concat([df1, df2, df3], keys=idx)
In [80]: s
Out[80]:
a b
2010-01-01 00:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 0 1 0
1 1 2
2 2 0
Now you can take the sum and the mean:
In [82]: s.sum(level=1)
Out[82]:
a b
0 3 0
1 3 6
2 6 0
In [84]: s.mean(level=1)
Out[84]:
a b
0 1 0
1 1 2
2 2 0
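A side note for newer pandas (an update, not part of the original answer): the level= argument to sum/mean was later deprecated, and the same aggregation is spelled with groupby. A sketch:
import pandas as pd
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
idx = pd.date_range("2010-01-01", freq='H', periods=3)
# hierarchical-index frame as in the workaround above
s = pd.concat([df1, df1.copy(), df1.copy()], keys=idx)
# groupby(level=1) replaces the deprecated s.sum(level=1) / s.mean(level=1)
print(s.groupby(level=1).sum())
print(s.groupby(level=1).mean())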
