Cumsum of fixed amount of elements without loops - python

I have Python code which computes the cumulative sum of 14 elements in a column (4 in the example below), starting from different positions, and writes this sum into another column. Does anyone know a way to do it without loops?
import pandas as pd
import numpy as np

a = pd.DataFrame({"A": [i for i in range(25)]})
b = pd.DataFrame({"B": [np.nan for i in range(25)]})
for i in range(4, len(b)):
    cumsum = 0
    for k in range(i - 4, i):
        cumsum += a.A[k]
        b.B[k] = cumsum
pd.concat([a, b], axis=1)

IIUC you are looking for rolling(4) + sum():
In [83]: a['new'] = a.A.rolling(4).sum()
In [84]: a
Out[84]:
A new
0 0 NaN
1 1 NaN
2 2 NaN
3 3 6.0
4 4 10.0
5 5 14.0
6 6 18.0
7 7 22.0
8 8 26.0
9 9 30.0
10 10 34.0
11 11 38.0
12 12 42.0
13 13 46.0
14 14 50.0
15 15 54.0
16 16 58.0
17 17 62.0
18 18 66.0
19 19 70.0
20 20 74.0
21 21 78.0
22 22 82.0
23 23 86.0
24 24 90.0
check:
In [86]: pd.concat([a,b], axis=1)
Out[86]:
A new B
0 0 NaN NaN
1 1 NaN NaN
2 2 NaN NaN
3 3 6.0 6.0
4 4 10.0 10.0
5 5 14.0 14.0
6 6 18.0 18.0
7 7 22.0 22.0
8 8 26.0 26.0
9 9 30.0 30.0
10 10 34.0 34.0
11 11 38.0 38.0
12 12 42.0 42.0
13 13 46.0 46.0
14 14 50.0 50.0
15 15 54.0 54.0
16 16 58.0 58.0
17 17 62.0 62.0
18 18 66.0 66.0
19 19 70.0 70.0
20 20 74.0 74.0
21 21 78.0 78.0
22 22 82.0 82.0
23 23 86.0 86.0
24 24 90.0 NaN
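As an aside (not part of the original answer), the same fixed-window sum can also be computed in plain NumPy with np.convolve; a minimal sketch:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": range(25)})
window = 4
# 'valid' mode yields one sum per complete window of 4 consecutive values.
sums = np.convolve(a["A"].to_numpy(), np.ones(window, dtype=int), mode="valid")
# Pad the first window-1 positions with NaN to line up with rolling(4).sum().
a["new"] = np.concatenate([np.full(window - 1, np.nan), sums])
```

This matches `a.A.rolling(4).sum()` exactly, and avoids pandas overhead for very long arrays.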

Pandas rolling but involves last rows value

I have this dataframe
hour = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
visitor = [4,6,2,4,3,7,5,7,8,3,2,8,3,6,4,5,1,8,9,4,2,3,4,1]
df = {"Hour":hour, "Total_Visitor":visitor}
df = pd.DataFrame(df)
print(df)
I applied a 6-window rolling sum:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
The first 5 rows will give you NaN values.
The problem is I want to know the sum of total visitors from 9pm to 3am, so I have to sum total visitors from hour 21 wrapping around through hour 0 up to hour 3.
How do you do that automatically with rolling?
I think you need to prepend the last N rows, then use rolling and keep only the last len(df) rows:
N = 6
df_roll = df.iloc[-N:].append(df).rolling(N).sum().iloc[-len(df):]
print (df_roll)
Hour Total_Visitor
0 105.0 18.0
1 87.0 20.0
2 69.0 20.0
3 51.0 21.0
4 33.0 20.0
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
Compare with the original solution:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
Hour Total_Visitor
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
A NumPy alternative with strides is more complicated, but faster for a large Series:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# fv is assumed to be the input Series
N = 3
x = np.concatenate([fv[-N+1:], fv.to_numpy()])
cv = pd.Series(rolling_window(x, N).sum(axis=1), index=fv.index)
print(cv)
0 5
1 4
2 4
3 6
4 5
dtype: int64
Though you have mentioned a Series, see if this is helpful:
import pandas as pd

def cyclic_roll(s, n):
    s = s.append(s[:n-1])
    result = s.rolling(n).sum()
    return result[-n+1:].append(result[n-1:-n+1])

fv = pd.DataFrame([1, 2, 3, 4, 5])
cv = fv.apply(cyclic_roll, n=3)
cv.reset_index(inplace=True, drop=True)
print(cv)
Output
0
0 10.0
1 8.0
2 6.0
3 9.0
4 12.0
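Note that Series.append was removed in pandas 2.0; a sketch of the same cyclic window sum written with pd.concat instead (producing the same output as above):

```python
import pandas as pd

def cyclic_roll_sum(s, n):
    # Prepend the last n-1 values so the first windows wrap around the end.
    padded = pd.concat([s.iloc[-(n - 1):], s], ignore_index=True)
    # Drop the padded positions and restore the original index.
    return padded.rolling(n).sum().iloc[n - 1:].set_axis(s.index)

fv = pd.Series([1, 2, 3, 4, 5])
print(cyclic_roll_sum(fv, 3).tolist())  # [10.0, 8.0, 6.0, 9.0, 12.0]
```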

Column totals on end column [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I'm trying to find out how to get the sum of all columns, or of a specific column.
I found it, and here is that part of the code:
data.loc['total'] = data.select_dtypes(np.number).sum()
This works correctly, but I get a warning:
C:\Users\AAR\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
[image: the attached table shows the desired output, a totals row appended at the bottom]
If you just want to do a sum, you should just do this:
Dataframe:
0 1 2 3 4 5 6 7 8 9
0 12 32 45 67 89 54 23.0 56.0 78.0 98.0
1 34 76 34 89 34 3 NaN NaN NaN NaN
2 76 34 54 12 43 78 56.0 NaN NaN NaN
3 76 56 45 23 43 45 67.0 76.0 67.0 8.0
4 87 9 9 0 89 90 6.0 89.0 NaN NaN
5 23 90 90 32 23 34 56.0 9.0 56.0 87.0
6 23 56 34 3 5 8 7.0 6.0 98.0 NaN
7 32 23 34 6 65 78 67.0 87.0 89.0 87.0
8 12 23 34 32 43 67 45.0 NaN NaN NaN
9 343 76 56 7 8 9 4.0 5.0 8.0 68.0
df = df.append(df.sum(numeric_only=True), ignore_index=True)
Output:
0 1 2 3 4 5 6 7 8 9
0 12.0 32.0 45.0 67.0 89.0 54.0 23.0 56.0 78.0 98.0
1 34.0 76.0 34.0 89.0 34.0 3.0 NaN NaN NaN NaN
2 76.0 34.0 54.0 12.0 43.0 78.0 56.0 NaN NaN NaN
3 76.0 56.0 45.0 23.0 43.0 45.0 67.0 76.0 67.0 8.0
4 87.0 9.0 9.0 0.0 89.0 90.0 6.0 89.0 NaN NaN
5 23.0 90.0 90.0 32.0 23.0 34.0 56.0 9.0 56.0 87.0
6 23.0 56.0 34.0 3.0 5.0 8.0 7.0 6.0 98.0 NaN
7 32.0 23.0 34.0 6.0 65.0 78.0 67.0 87.0 89.0 87.0
8 12.0 23.0 34.0 32.0 43.0 67.0 45.0 NaN NaN NaN
9 343.0 76.0 56.0 7.0 8.0 9.0 4.0 5.0 8.0 68.0
10 718.0 475.0 435.0 271.0 442.0 466.0 331.0 328.0 396.0 348.0
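As a side note (not part of the original answer), DataFrame.append was removed in pandas 2.0; the same totals row can be added with loc. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, np.nan, 6.0]})
# Assign the column sums as a new row labelled 'total' (NaNs are skipped).
df.loc["total"] = df.sum(numeric_only=True)
print(df)
```

Unlike append, this modifies the frame in place and keeps the row label explicit.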

Python add columns to DataFrame with rolling window based on 3 previous rows

I have a dataframe like this:
original = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=["P1_day", "P1_week", "P1_month"])
print(original)
P1_day P1_week P1_month
0 50 17 55
1 45 3 10
2 93 79 84
3 99 38 33
4 44 35 35
5 25 43 87
6 38 88 56
7 20 66 6
8 4 23 6
9 39 75 3
I need to generate a new dataframe starting from the 3rd row of the original dataframe, adding 9 new columns based on a rolling window over the 3 previous rows, with the corresponding suffixes [_0, _1, _2]. For the row at index 3, these are the rows with index [0, 1, 2] of the original dataframe.
For example, the next 3 columns will be from the original.iloc[0],
and after the next 3 columns will be from the original.iloc[1],
and the last 3 columns will be from the original.iloc[2]
I tried to solve it by the next code:
subset_shifted = original[["P1_day", "P1_week", "P1_month"]].shift(3)
subset_shifted.columns = ["P1_day_0", "P1_week_0", "P1_month_0"]
original_ = pd.concat([original, subset_shifted], axis = 1)
print(original_)
As a result, I have 3 additional columns whose values come from 3 rows back (row 0 for row 3):
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 50 17 55 NaN NaN NaN
1 45 3 10 NaN NaN NaN
2 93 79 84 NaN NaN NaN
3 99 38 33 50.0 17.0 55.0
4 44 35 35 45.0 3.0 10.0
5 25 43 87 93.0 79.0 84.0
6 38 88 56 99.0 38.0 33.0
7 20 66 6 44.0 35.0 35.0
8 4 23 6 25.0 43.0 87.0
9 39 75 3 38.0 88.0 56.0
In the next iteration I did shift(2) with the same approach and received the columns from original.iloc[1].
On the last iteration I did shift(1) and got the expected result in the form of:
result = original_.iloc[3:]
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 99 38 33 50.0 17.0 55.0 45.0 3.0 10.0 93.0 79.0 84.0
4 44 35 35 45.0 3.0 10.0 93.0 79.0 84.0 99.0 38.0 33.0
5 25 43 87 93.0 79.0 84.0 99.0 38.0 33.0 44.0 35.0 35.0
6 38 88 56 99.0 38.0 33.0 44.0 35.0 35.0 25.0 43.0 87.0
7 20 66 6 44.0 35.0 35.0 25.0 43.0 87.0 38.0 88.0 56.0
8 4 23 6 25.0 43.0 87.0 38.0 88.0 56.0 20.0 66.0 6.0
9 39 75 3 38.0 88.0 56.0 20.0 66.0 6.0 4.0 23.0 6.0
Question:
Is there a better approach to this task than the one I described? Thanks.
Unless you want all these extra DataFrames, you could just add the new columns to your original df directly:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
    ["P1_day_0", "P1_week_0", "P1_month_0"]
] = original[
    ["P1_day", "P1_week", "P1_month"]
].shift(3)
print(original)
output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 2 35 26 NaN NaN NaN
1 99 4 96 NaN NaN NaN
2 4 67 6 NaN NaN NaN
3 76 33 31 2.0 35.0 26.0
4 84 60 98 99.0 4.0 96.0
5 57 1 58 4.0 67.0 6.0
6 35 70 96 76.0 33.0 31.0
7 81 32 39 84.0 60.0 98.0
8 25 4 38 57.0 1.0 58.0
9 83 4 60 35.0 70.0 96.0
python tutor link to example
Edit: OP asked the follow up question:
yes, for the first row it makes sense. But, my task is to add first 3 rows with index 0-1-2 as new 9 columns for the respected rows started from 3rd index. In your output row with index 1st is not added to the 3rd row as 3 columns. In my code that's why I used shift(2) and shift(1) iteratively.
Here is how this could be done iteratively:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
for shift, n in ((3, 0), (2, 1), (1, 2)):
    original[
        [f"P1_day_{n}", f"P1_week_{n}", f"P1_month_{n}"]
    ] = original[
        ["P1_day", "P1_week", "P1_month"]
    ].shift(shift)
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 \
3 58 43 74 26.0 56.0 82.0 56.0
4 44 27 40 56.0 87.0 38.0 31.0
5 2 90 4 31.0 32.0 87.0 58.0
6 90 70 6 58.0 43.0 74.0 44.0
7 1 31 57 44.0 27.0 40.0 2.0
8 96 22 69 2.0 90.0 4.0 90.0
9 13 98 47 90.0 70.0 6.0 1.0
P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 87.0 38.0 31.0 32.0 87.0
4 32.0 87.0 58.0 43.0 74.0
5 43.0 74.0 44.0 27.0 40.0
6 27.0 40.0 2.0 90.0 4.0
7 90.0 4.0 90.0 70.0 6.0
8 70.0 6.0 1.0 31.0 57.0
9 31.0 57.0 96.0 22.0 69.0
python tutor link
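As another aside (an alternative sketch, not from the answer above), the three shifts can also be built in a single pd.concat call with add_suffix, avoiding the explicit loop:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(
    np.arange(30).reshape(10, 3),
    columns=["P1_day", "P1_week", "P1_month"],
)
# shift(3) gets suffix _0, shift(2) gets _1, shift(1) gets _2, all in one concat.
shifted = pd.concat(
    [original.shift(s).add_suffix(f"_{n}") for s, n in ((3, 0), (2, 1), (1, 2))],
    axis=1,
)
result = pd.concat([original, shifted], axis=1).iloc[3:]
print(result.shape)  # (7, 12)
```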
Edit 2: Not to make any assumptions here, but if your end goal is to get something like the 4 period moving average from the data in all of these new columns then you might not need them at all. You can use pandas.DataFrame.rolling instead:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
    ["P1_day_4PMA", "P1_week_4PMA", "P1_month_4PMA"]
] = original[
    ["P1_day", "P1_week", "P1_month"]
].rolling(4).mean()
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_4PMA P1_week_4PMA P1_month_4PMA
3 1 13 48 31.25 38.00 55.00
4 10 4 40 22.00 21.00 45.75
5 7 76 0 5.50 23.75 37.00
6 5 69 9 5.75 40.50 24.25
7 63 31 82 21.25 45.00 32.75
8 26 67 22 25.25 60.75 28.25
9 89 41 40 45.75 52.00 38.25
another python tutor link

Pandas Merge Rows with Duplicate Ids Conditionally, Suitable for Writing to CSV

I have the following df and I want to merge the lines that have the same Ids, unless there are duplicates
Ids A B C D E F G H I J
4411 24 2 55 26 1
4411 24 2 54 26 0
4412 22 4 54 26 0
4412 18 8 54 26 0
7401 12 14 54 26 0
7401 0 25 53 26 0
7402 24 2 54 26 0
7402 25 1 54 26 0
10891 16 10 54 26 0
10891 3 23 54 26 0
10891 5 10 6 15 0
Example output
Ids A B C D E F G H I J
4411 24 2 55 26 1 24 2 54 26 0
4412 22 4 54 26 0 18 8 54 26 0
7401 12 14 54 26 0 0 25 53 26 0
7402 24 2 54 26 0 25 1 54 26 0
10891 16 10 54 26 0 3 23 54 26 0
10891 5 10 6 15 0
I tried groupby but that throws errors when you write to csv.
This solution uses Divakar's justify function. If needed, convert to numeric in advance:
df = df.apply(pd.to_numeric, errors='coerce', axis=1)
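For reference, Divakar's justify function (adapted from his widely cited Stack Overflow answer, reproduced here as a sketch) pushes the valid values to one side of an array along a given axis:

```python
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Mask of valid entries (entries not equal to invalid_val).
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans pushes the True values to the high end of the axis.
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
```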
Now, call groupby + transform:
df.set_index('Ids')\
  .groupby(level=0)\
  .transform(justify, invalid_val=np.nan, axis=0, side='up')\
  .dropna(how='all')
A B C D E F G H I J
Ids
4411 24.0 2.0 55.0 26.0 1.0 24.0 2.0 54.0 26.0 0.0
4412 22.0 4.0 54.0 26.0 0.0 18.0 8.0 54.0 26.0 0.0
7401 12.0 14.0 54.0 26.0 0.0 0.0 25.0 53.0 26.0 0.0
7402 24.0 2.0 54.0 26.0 0.0 25.0 1.0 54.0 26.0 0.0
10891 16.0 10.0 54.0 26.0 0.0 3.0 23.0 54.0 26.0 0.0
10891 NaN NaN NaN NaN NaN 5.0 10.0 6.0 15.0 0.0
This should be slow, but it can achieve what you need:
df.replace('', np.nan)\
  .groupby('Ids')\
  .apply(lambda x: pd.DataFrame(x).apply(lambda x: sorted(x, key=pd.isnull), 0))\
  .dropna(axis=0, thresh=2)\
  .fillna('')
Out[539]:
Ids A B C D E F G H I J
0 7402 24.0 2.0 54.0 26.0 0.0 25.0 1.0 54.0 26.0 0.0
2 10891 16.0 10.0 54.0 26.0 0.0 3.0 23.0 54.0 26.0 0.0
3 10891 5.0 10.0 6.0 15.0 0.0
Assuming all the blank values are NaN, another option using groupby and dropna:
df.loc[:,'A':'E'] = df.groupby('Ids').apply(lambda x: x.loc[:,'A':'E'].ffill(limit=1))
df.dropna(subset=['F','G','H','I','J'])

pandas adding new column?

I'm learning Pandas and trying things out in it. However, I got stuck adding a new column, as new columns have a larger index number. I would like to add more than 3 columns.
here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
example=pd.read_excel("C:/Users/ömer sarı/AppData/Local/Programs/Python/Python35-32/data_analysis/pydata-book-master/example.xlsx",names=["a","b","c","d","e","f","g"])
dropNan=example.dropna()
#print(dropNan)
Fillup=example.fillna(-99)
#print(Fillup)
Countt=Fillup.get_dtype_counts()
#print(Countt)
date=pd.date_range("2014-01-01","2014-01-15",freq="D")
#print(date)
mm=date.month
yy=date.year
dd=date.day
df=pd.DataFrame(example)
print(df)
x=[i for i in yy]
print(x)
df["year"]=df[x]
here is example dataset:
a b c d e f g
0 1 1.0 2.0 5 3 11.0 57.0
1 2 4.0 6.0 10 6 22.0 59.0
2 3 9.0 12.0 15 9 33.0 61.0
3 4 16.0 20.0 20 12 44.0 63.0
4 5 25.0 NaN 25 15 NaN 65.0
5 6 NaN 42.0 30 18 66.0 NaN
6 7 49.0 56.0 35 21 77.0 69.0
7 8 64.0 72.0 40 24 88.0 71.0
8 9 81.0 NaN 45 27 99.0 73.0
9 10 NaN 110.0 50 30 NaN 75.0
10 11 121.0 NaN 55 33 121.0 77.0
11 12 144.0 156.0 60 36 132.0 NaN
12 13 169.0 182.0 65 39 143.0 81.0
13 14 196.0 NaN 70 42 154.0 83.0
14 15 225.0 240.0 75 45 165.0 85.0
here is error message:
IndexError: indices are out-of-bounds
After that, I tried this and got a new error:
df=pd.DataFrame(range(len(x)),index=x, columns=["a","b","c","d","e","f","g"])
pandas.core.common.PandasError: DataFrame constructor not properly called!
It is just a trial to learn. How can I add the date with its split parts as new columns, like ["date", "year", "month", "day", ...]?
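A minimal sketch of what is being asked for (with made-up data, not the OP's Excel file): a date range of the same length as the frame can be assigned directly, and its parts become new columns:

```python
import pandas as pd

date = pd.date_range("2014-01-01", "2014-01-15", freq="D")
# Any DataFrame with the same number of rows as the date range.
df = pd.DataFrame({"a": range(len(date))})

# Assigning array-likes of matching length creates the new columns directly.
df["date"] = date
df["year"] = date.year
df["month"] = date.month
df["day"] = date.day
print(df.head())
```

The original error happened because df[x] treats the list of years in x as column labels, which do not exist in the frame.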
