I'm trying to find out how to get the sum of all columns, or of a specific column.
I found a way to do it; here is the relevant part of the code:
data.loc['total'] = data.select_dtypes(np.number).sum()
This works correctly, but I get a warning:
C:\Users\AAR\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
The attached image shows what I want to get.
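One common cause of this warning is that `data` was itself produced by slicing or filtering another DataFrame, so pandas cannot tell whether the assignment should also affect the parent. The question does not show how `data` was created, so this is an assumption. A minimal sketch of the usual remedy, taking an explicit `.copy()` before adding the row; the frame `raw` and its columns are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical source frame; the real data is not shown in the question.
raw = pd.DataFrame({"name": ["a", "b", "c"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Taking an explicit copy makes `data` an independent DataFrame,
# so adding a row to it should no longer trigger SettingWithCopyWarning.
data = raw[raw["x"] > 1].copy()

# Sum only the numeric columns; non-numeric columns get NaN in the new row.
data.loc["total"] = data.select_dtypes(np.number).sum()
print(data)
```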
If you just want a row of column sums, you can do this:
Dataframe:
0 1 2 3 4 5 6 7 8 9
0 12 32 45 67 89 54 23.0 56.0 78.0 98.0
1 34 76 34 89 34 3 NaN NaN NaN NaN
2 76 34 54 12 43 78 56.0 NaN NaN NaN
3 76 56 45 23 43 45 67.0 76.0 67.0 8.0
4 87 9 9 0 89 90 6.0 89.0 NaN NaN
5 23 90 90 32 23 34 56.0 9.0 56.0 87.0
6 23 56 34 3 5 8 7.0 6.0 98.0 NaN
7 32 23 34 6 65 78 67.0 87.0 89.0 87.0
8 12 23 34 32 43 67 45.0 NaN NaN NaN
9 343 76 56 7 8 9 4.0 5.0 8.0 68.0
df = df.append(df.sum(numeric_only=True), ignore_index=True)
Output:
0 1 2 3 4 5 6 7 8 9
0 12.0 32.0 45.0 67.0 89.0 54.0 23.0 56.0 78.0 98.0
1 34.0 76.0 34.0 89.0 34.0 3.0 NaN NaN NaN NaN
2 76.0 34.0 54.0 12.0 43.0 78.0 56.0 NaN NaN NaN
3 76.0 56.0 45.0 23.0 43.0 45.0 67.0 76.0 67.0 8.0
4 87.0 9.0 9.0 0.0 89.0 90.0 6.0 89.0 NaN NaN
5 23.0 90.0 90.0 32.0 23.0 34.0 56.0 9.0 56.0 87.0
6 23.0 56.0 34.0 3.0 5.0 8.0 7.0 6.0 98.0 NaN
7 32.0 23.0 34.0 6.0 65.0 78.0 67.0 87.0 89.0 87.0
8 12.0 23.0 34.0 32.0 43.0 67.0 45.0 NaN NaN NaN
9 343.0 76.0 56.0 7.0 8.0 9.0 4.0 5.0 8.0 68.0
10 718.0 475.0 435.0 271.0 442.0 466.0 331.0 328.0 396.0 348.0
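Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the `df.append` call above raises an AttributeError. A sketch of the same idea with `pd.concat`, using a small made-up frame rather than the data above:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the data from the answer).
df = pd.DataFrame({0: [12, 34, 76], 1: [32.0, np.nan, 34.0]})

# df.sum() returns a Series indexed by column label; turn it into a
# one-row DataFrame so it can be stacked under the original rows.
totals = df.sum(numeric_only=True).to_frame().T

# ignore_index=True renumbers the rows, so the totals land in the last row.
df = pd.concat([df, totals], ignore_index=True)
print(df)
```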
I have a dataframe like this:
original = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=["P1_day", "P1_week", "P1_month"])
print(original)
P1_day P1_week P1_month
0 50 17 55
1 45 3 10
2 93 79 84
3 99 38 33
4 44 35 35
5 25 43 87
6 38 88 56
7 20 66 6
8 4 23 6
9 39 75 3
I need to generate a new dataframe, starting from the row at index 3 of the original dataframe, and add 9 new columns based on a rolling window defined as the 3 previous rows, with the corresponding suffixes [_0, _1, _2]. For the row at index 3, these are the rows with index [0, 1, 2] of the original dataframe.
For example, the next 3 columns will come from original.iloc[0],
the 3 columns after that from original.iloc[1],
and the last 3 columns from original.iloc[2].
I tried to solve it with the following code:
subset_shifted = original[["P1_day", "P1_week", "P1_month"]].shift(3)
subset_shifted.columns = ["P1_day_0", "P1_week_0", "P1_month_0"]
original_ = pd.concat([original, subset_shifted], axis = 1)
print(original_)
As a result, I have 3 additional columns holding the values from 3 rows earlier (original.iloc[0] for the row at index 3):
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 50 17 55 NaN NaN NaN
1 45 3 10 NaN NaN NaN
2 93 79 84 NaN NaN NaN
3 99 38 33 50.0 17.0 55.0
4 44 35 35 45.0 3.0 10.0
5 25 43 87 93.0 79.0 84.0
6 38 88 56 99.0 38.0 33.0
7 20 66 6 44.0 35.0 35.0
8 4 23 6 25.0 43.0 87.0
9 39 75 3 38.0 88.0 56.0
In the next iteration I used shift(2) with the same approach and got the columns from original.iloc[1].
In the last iteration I used shift(1) and got the expected result:
result = original_.iloc[3:]
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 99 38 33 50.0 17.0 55.0 45.0 3.0 10.0 93.0 79.0 84.0
4 44 35 35 45.0 3.0 10.0 93.0 79.0 84.0 99.0 38.0 33.0
5 25 43 87 93.0 79.0 84.0 99.0 38.0 33.0 44.0 35.0 35.0
6 38 88 56 99.0 38.0 33.0 44.0 35.0 35.0 25.0 43.0 87.0
7 20 66 6 44.0 35.0 35.0 25.0 43.0 87.0 38.0 88.0 56.0
8 4 23 6 25.0 43.0 87.0 38.0 88.0 56.0 20.0 66.0 6.0
9 39 75 3 38.0 88.0 56.0 20.0 66.0 6.0 4.0 23.0 6.0
Question:
Is there a better way to solve this task than the approach I described? Thanks.
Unless you want all these extra DataFrames, you could just add the new columns to your original df directly:
import pandas as pd
import numpy as np
original = pd.DataFrame(
    np.random.randint(0, 100, size=(10, 3)),
    columns=["P1_day", "P1_week", "P1_month"],
)
original[
    ["P1_day_0", "P1_week_0", "P1_month_0"]
] = original[
    ["P1_day", "P1_week", "P1_month"]
].shift(3)
print(original)
output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 2 35 26 NaN NaN NaN
1 99 4 96 NaN NaN NaN
2 4 67 6 NaN NaN NaN
3 76 33 31 2.0 35.0 26.0
4 84 60 98 99.0 4.0 96.0
5 57 1 58 4.0 67.0 6.0
6 35 70 96 76.0 33.0 31.0
7 81 32 39 84.0 60.0 98.0
8 25 4 38 57.0 1.0 58.0
9 83 4 60 35.0 70.0 96.0
Edit: OP asked the follow-up question:
Yes, for the first row it makes sense. But my task is to add the first 3 rows (index 0, 1, 2) as 9 new columns for the corresponding rows, starting from index 3. In your output, the row at index 1 is not added to the row at index 3 as 3 columns. That's why I used shift(2) and shift(1) iteratively in my code.
Here is how this could be done iteratively:
import pandas as pd
import numpy as np
original = pd.DataFrame(
    np.random.randint(0, 100, size=(10, 3)),
    columns=["P1_day", "P1_week", "P1_month"],
)
for shift, n in ((3, 0), (2, 1), (1, 2)):
    original[
        [f"P1_day_{n}", f"P1_week_{n}", f"P1_month_{n}"]
    ] = original[
        ["P1_day", "P1_week", "P1_month"]
    ].shift(shift)
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 \
3 58 43 74 26.0 56.0 82.0 56.0
4 44 27 40 56.0 87.0 38.0 31.0
5 2 90 4 31.0 32.0 87.0 58.0
6 90 70 6 58.0 43.0 74.0 44.0
7 1 31 57 44.0 27.0 40.0 2.0
8 96 22 69 2.0 90.0 4.0 90.0
9 13 98 47 90.0 70.0 6.0 1.0
P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 87.0 38.0 31.0 32.0 87.0
4 32.0 87.0 58.0 43.0 74.0
5 43.0 74.0 44.0 27.0 40.0
6 27.0 40.0 2.0 90.0 4.0
7 90.0 4.0 90.0 70.0 6.0
8 70.0 6.0 1.0 31.0 57.0
9 31.0 57.0 96.0 22.0 69.0
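The same loop can also be written without mutating `original`, by building each shifted block with `add_suffix` and joining everything in one `pd.concat` call. This is only a sketch; the names `lags`, `blocks`, and `result` and the fixed seed are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only to make the example repeatable
original = pd.DataFrame(
    rng.integers(0, 100, size=(10, 3)),
    columns=["P1_day", "P1_week", "P1_month"],
)

lags = 3  # number of previous rows to attach to each row

# Suffix _0 is the oldest row (shift by `lags`), suffix _{lags-1} the most recent.
blocks = [original.shift(lags - n).add_suffix(f"_{n}") for n in range(lags)]

# Join the original columns and all shifted blocks side by side, then drop
# the first `lags` rows, which have no complete history.
result = pd.concat([original, *blocks], axis=1).iloc[lags:]
print(result)
```

Changing `lags` is then enough to widen or narrow the window, at the cost of holding the intermediate blocks in memory.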
Edit 2: Not to make any assumptions here, but if your end goal is something like the 4-period moving average of the data in all of these new columns, then you might not need them at all. You can use pandas.DataFrame.rolling instead:
import pandas as pd
import numpy as np
original = pd.DataFrame(
    np.random.randint(0, 100, size=(10, 3)),
    columns=["P1_day", "P1_week", "P1_month"],
)
original[
    ["P1_day_4PMA", "P1_week_4PMA", "P1_month_4PMA"]
] = original[
    ["P1_day", "P1_week", "P1_month"]
].rolling(4).mean()
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_4PMA P1_week_4PMA P1_month_4PMA
3 1 13 48 31.25 38.00 55.00
4 10 4 40 22.00 21.00 45.75
5 7 76 0 5.50 23.75 37.00
6 5 69 9 5.75 40.50 24.25
7 63 31 82 21.25 45.00 32.75
8 26 67 22 25.25 60.75 28.25
9 89 41 40 45.75 52.00 38.25
I have a dataframe like this with some NaNs:
df:
0 1
1 11.0 111.0
2 12.0 112.0
3 13.0 113.0
4 NaN 114.0
4 15.0 NaN
5 16.0 116.0
6 17.0 117.0
7 18.0 118.0
So what should I do to it to get the following:
0 1
1 11.0 111.0
2 12.0 112.0
3 13.0 113.0
4 15.0 114.0
4 15.0 114.0
5 16.0 116.0
6 17.0 117.0
7 18.0 118.0
In other words, how can the NaN values in the rows with index 4 be filled with the non-NaN values from the other rows that share index 4?
You can group by the index and apply ffill() + bfill() inside each group:
In [165]: df.groupby(level=0).apply(lambda x: x.ffill().bfill())
Out[165]:
0 1
1 11.0 111.0
2 12.0 112.0
3 13.0 113.0
4 15.0 114.0
4 15.0 114.0
5 16.0 116.0
6 17.0 117.0
7 18.0 118.0
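On recent pandas versions, `groupby(...).apply` may prepend the group key to the result index, so the output can look different from the one above. A sketch of an alternative that keeps the original index, using the built-in `ffill` and `bfill` methods of the groupby object; the frame is rebuilt from the question's data:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; note the duplicated index label 4.
df = pd.DataFrame(
    {0: [11.0, 12.0, 13.0, np.nan, 15.0, 16.0, 17.0, 18.0],
     1: [111.0, 112.0, 113.0, 114.0, np.nan, 116.0, 117.0, 118.0]},
    index=[1, 2, 3, 4, 4, 5, 6, 7],
)

# Fill forward, then backward, but only among rows sharing an index label.
filled = df.groupby(level=0).ffill().groupby(level=0).bfill()
print(filled)
```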
I have Python code that computes the cumulative sum of 14 elements in a column (the example below uses 4), starting from different elements, and writes this sum to another column. Does anyone know a way to do it without loops?
import pandas as pd
import numpy as np
a = pd.DataFrame({"A": [i for i in range(25)]})
b = pd.DataFrame({"B": [np.nan for i in range(25)]})
for i in range(4, len(b)):
    cumsum = 0
    for k in range(i - 4, i):
        cumsum += a.A[k]
    b.B[k] = cumsum
pd.concat([a,b], axis=1)
IIUC you are looking for rolling(4) + sum():
In [83]: a['new'] = a.A.rolling(4).sum()
In [84]: a
Out[84]:
A new
0 0 NaN
1 1 NaN
2 2 NaN
3 3 6.0
4 4 10.0
5 5 14.0
6 6 18.0
7 7 22.0
8 8 26.0
9 9 30.0
10 10 34.0
11 11 38.0
12 12 42.0
13 13 46.0
14 14 50.0
15 15 54.0
16 16 58.0
17 17 62.0
18 18 66.0
19 19 70.0
20 20 74.0
21 21 78.0
22 22 82.0
23 23 86.0
24 24 90.0
check:
In [86]: pd.concat([a,b], axis=1)
Out[86]:
A new B
0 0 NaN NaN
1 1 NaN NaN
2 2 NaN NaN
3 3 6.0 6.0
4 4 10.0 10.0
5 5 14.0 14.0
6 6 18.0 18.0
7 7 22.0 22.0
8 8 26.0 26.0
9 9 30.0 30.0
10 10 34.0 34.0
11 11 38.0 38.0
12 12 42.0 42.0
13 13 46.0 46.0
14 14 50.0 50.0
15 15 54.0 54.0
16 16 58.0 58.0
17 17 62.0 62.0
18 18 66.0 66.0
19 19 70.0 70.0
20 20 74.0 74.0
21 21 78.0 78.0
22 22 82.0 82.0
23 23 86.0 86.0
24 24 90.0 NaN
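To see why the two columns agree, note that `rolling(4).sum()` at row i adds rows i-3 through i. A small sketch that checks this against explicit window sums; the name `expected` is only for illustration:

```python
import pandas as pd

a = pd.DataFrame({"A": range(25)})

# Rolling sum over a window of 4 rows; the first 3 rows have no full window.
a["new"] = a["A"].rolling(4).sum()

# Explicit window sums for rows 3..24, for comparison.
expected = [a["A"].iloc[i - 3 : i + 1].sum() for i in range(3, len(a))]

print(a["new"].iloc[3:].tolist() == expected)
```

In the check above, row 24 differs only because the loop in the question stops before writing a value there.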
I'm learning Pandas and trying things out with it. However, I got stuck adding a new column, because the new columns have larger index values. I would like to add more than 3 columns.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
example=pd.read_excel("C:/Users/ömer sarı/AppData/Local/Programs/Python/Python35-32/data_analysis/pydata-book-master/example.xlsx",names=["a","b","c","d","e","f","g"])
dropNan=example.dropna()
#print(dropNan)
Fillup=example.fillna(-99)
#print(Fillup)
Countt=Fillup.dtypes.value_counts()  # get_dtype_counts() was removed in pandas 1.0
#print(Countt)
date=pd.date_range("2014-01-01","2014-01-15",freq="D")
#print(date)
mm=date.month
yy=date.year
dd=date.day
df=pd.DataFrame(example)
print(df)
x=[i for i in yy]
print(x)
df["year"]=df[x]
Here is the example dataset:
a b c d e f g
0 1 1.0 2.0 5 3 11.0 57.0
1 2 4.0 6.0 10 6 22.0 59.0
2 3 9.0 12.0 15 9 33.0 61.0
3 4 16.0 20.0 20 12 44.0 63.0
4 5 25.0 NaN 25 15 NaN 65.0
5 6 NaN 42.0 30 18 66.0 NaN
6 7 49.0 56.0 35 21 77.0 69.0
7 8 64.0 72.0 40 24 88.0 71.0
8 9 81.0 NaN 45 27 99.0 73.0
9 10 NaN 110.0 50 30 NaN 75.0
10 11 121.0 NaN 55 33 121.0 77.0
11 12 144.0 156.0 60 36 132.0 NaN
12 13 169.0 182.0 65 39 143.0 81.0
13 14 196.0 NaN 70 42 154.0 83.0
14 15 225.0 240.0 75 45 165.0 85.0
Here is the error message:
IndexError: indices are out-of-bounds
After that, I tried the following and got a new error:
df=pd.DataFrame(range(len(x)),index=x, columns=["a","b","c","d","e","f","g"])
pandas.core.common.PandasError: DataFrame constructor not properly called!
This is just an experiment for learning. How can I add the date, split into its parts, as new columns such as ["date", "year", "month", "day", ...]?
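`df[x]` fails because a list inside `df[...]` is interpreted as a list of column labels, and there is no column named 2014. A minimal sketch of one way to add the date parts, assuming the frame has exactly as many rows as the date range (15 here); the small stand-in frame replaces the Excel file, which is not available:

```python
import pandas as pd

# Stand-in for the Excel data: 15 rows, like the example dataset.
df = pd.DataFrame({"a": range(1, 16), "d": range(5, 80, 5)})

# One date per row; the lengths must match for direct assignment.
date = pd.date_range("2014-01-01", "2014-01-15", freq="D")

df["date"] = date
# The .dt accessor exposes the components of a datetime column.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
print(df.head())
```

If the frame and the date range had different lengths, the assignment would raise an error, so the range would need to be built from `len(df)` instead.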