I have a Pandas dataframe that looks like the following:
c1 c2 c3 c4
p1 q1 r1 20
p2 q2 r2 10
p3 q3 r1 30
The desired output looks like this:
c1 c2 c3 c4 NewColumn(c1.1)
p1 q1 r1 20 0
p2 q2 r2 10 p2-p1
p3 q3 r1 30 p3-p2
The shape of my dataset is (333650, 665) and I want to do that for all columns. Is there any way to achieve this?
The code I am using:
data = pd.read_csv('Mydataset.csv')
i = 0
j = 1
while j < len(data['columnname']):
    j = data['columnname'][i+1] - data['columnname'][i]
    i += 1  # Next value of column.
    j += 1  # Next value of new column.
    print(j)
Is this what you want? It finds the difference between the rows of a particular column using the shift method and assigns it to a new column.
Note that I am using the data from Dave's answer.
df['New Column'] = df.a.sub(df.a.shift()).fillna(0)
a b c New Column
0 1 1 1 0.0
1 2 1 4 1.0
2 3 2 9 1.0
3 4 3 16 1.0
4 5 5 25 1.0
5 6 8 36 1.0
For multiple columns, this may suffice:
M = df.diff().fillna(0).add_suffix('_1')
#concatenate along the columns axis
pd.concat([df,M], axis = 1)
a b c a_1 b_1 c_1
0 1 1 1 0.0 0.0 0.0
1 2 1 4 1.0 0.0 3.0
2 3 2 9 1.0 1.0 5.0
3 4 3 16 1.0 1.0 7.0
4 5 5 25 1.0 2.0 9.0
5 6 8 36 1.0 3.0 11.0
You want the diff function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html
df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
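Tying this back to the question's own frame, a minimal sketch (assuming the CSV and column layout described in the question, and that all 665 columns are numeric) would be:

data = pd.read_csv('Mydataset.csv')
# diff every column at once, zero out the first row, and append with a suffix
out = pd.concat([data, data.diff().fillna(0).add_suffix('_1')], axis=1)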
I have a dataframe that looks something like this:
ID  hiqual  Wave
1   1.0     g
1   NaN     i
1   NaN     k
2   1.0     g
2   NaN     i
2   NaN     k
3   1.0     g
3   NaN     i
4   5.0     g
4   NaN     i
This is a long-format dataframe and I have my hiqual variable for my first measurement wave (g). I would like to populate the NaN values for the subsequent measurement waves (i and k) with the same value given in wave g for each ID.
I tried using fillna(), but I am not sure how to provide the two conditions of ID and Wave and how to populate based on them. I would be grateful for any help/suggestions on this.
The exact expected output is unclear, but I think you might want to forward-fill hiqual within each ID group:
m = df['hiqual'].isna()
df.loc[m, 'hiqual'] = df.groupby('ID')['hiqual'].ffill()[m]
If your dataframe is already ordered by the ID and Wave columns, you can simply forward-fill values:
>>> df.sort_values(['ID', 'Wave']).ffill()
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
You can also use the g values explicitly:
g_vals = df[df['Wave']=='g'].set_index('ID')['hiqual']
df['hiqual'] = df['hiqual'].fillna(df['ID'].map(g_vals))
print(df)
print(g_vals)
# Output
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
# g_vals
ID
1 1.0
2 1.0
3 1.0
4 5.0
Name: hiqual, dtype: float64
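As another hedged sketch (assuming only the wave-g row carries a non-null hiqual within each ID, as in the sample data), a groupby-transform broadcasts that value to every row of the ID without requiring any particular row order:

# 'first' skips NaN, so each ID gets its g-row hiqual value
df['hiqual'] = df.groupby('ID')['hiqual'].transform('first')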
I have a dataframe like this:
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789,1235471792,1235471793,1235471829,1235471833,1235471835,1235471838,1235471844,1235471847,1235471848,1235471852,1235471855,1235471859,1235471900,1235471904,1235471911,1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it as an origin) in the "TimeDistance" column, do a cumulative sum over its values, and whenever this summation reaches 10, restart the cumsum and continue this procedure until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important point is that the row after the one that reaches the threshold must be zero in the "cumu" column, and the summation restarts from that new origin.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield
    while True:
        if val < maxval:
            val += yield val
        else:
            yield val
            val = 0

def fn(x):
    a = my_accumulate(10)
    next(a)
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
Another solution, using assignment expressions (Python 3.8+):
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := (val + v if val < 10 else 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
Andrej's answer is better, as mine is probably not as efficient, and it depends on the df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
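For reference, the same previous-total rule can also be written with itertools.accumulate; a minimal sketch (my own restatement, assuming the column names from the question):

from itertools import accumulate

# within each trip: keep adding while the previous total is below 10,
# otherwise restart at 0; the NaN at each trip start is treated as 0
df["cumu"] = df.groupby("trip")["TimeDistance"].transform(
    lambda s: list(accumulate(s.fillna(0), lambda tot, v: 0 if tot >= 10 else tot + v))
)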
I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add the extra columns to the dataframe for both groups A and B, then copied the data into the x and y columns for group A only, and finally dropped the columns that came from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
We can use loc to extract/update only the part we want. And since you are merging on one column, which also has unique values in dfB, you can use set_index with loc/reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some of dfA.label is not in dfB.uniqid, in which case we need to use reindex:
(dfB.set_index('uniqid')
    .reindex(dfA.loc[mask, 'label'])
    [['horizontal', 'vertical']].values
)
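Alternatively, a small map-based sketch (assuming uniqid values are unique, as in the sample) sidesteps the positional .values alignment and simply yields NaN for labels missing from dfB:

lookup = dfB.set_index('uniqid')
mask = dfA['group'] == 'A'
# map each group-A label to its horizontal/vertical counterpart
dfA.loc[mask, 'x'] = dfA.loc[mask, 'label'].map(lookup['horizontal'])
dfA.loc[mask, 'y'] = dfA.loc[mask, 'label'].map(lookup['vertical'])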
dict={"asset":["S3","S2","E4","E1","A6","A8"],
"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
I want to do a cumulative sum of the column "cards_plus1". How can I do this?
The output of the cumsum column should be:
0
2
5
8
10
13
I want the running total to start with zero instead of 2, giving a cards_plus1_cumsum column of 0 2 5 8 10 13.
We can just pad a zero in front of the sums; slicing with [:-1] drops the blank final entry, and the leading zero restores the length:
dframe["cumsum"] = np.pad(dframe["cards_plus1"][:-1].cumsum(), (1, 0), 'constant')
Try this:
First, replace the blank values with NaN:
import pandas as pd
import numpy as np
dict={"asset":["S3","S2","E4","E1","A6","A8"],"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
## replace blank values with NaN
dframe.replace(r'^\s*$', np.nan, regex=True, inplace=True)
print(dframe)
>>> asset Rank number_of_attributes number_of_cards cards_plus1
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
Now the data type of the cards_plus1 column is object, so change it to numeric:
### convert data type of the cards_plus1 to numeric
dframe['cards_plus1'] = pd.to_numeric(dframe['cards_plus1'])
Now calculate cumulative sum
### now we can calculate cumsum
dframe['cards_plus1_cumsum'] = dframe['cards_plus1'].cumsum()
print(dframe)
>>>
asset Rank number_of_attributes number_of_cards cards_plus1 \
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
cards_plus1_cumsum
1 2.0
2 5.0
3 8.0
4 10.0
5 13.0
6 NaN
Instead of replacing the blank values with NaN, you can replace them with zero, depending on what you want. Hope this helped.
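A short sketch of that zero-filling variant (an interpretation of the desired semantics, not part of the original answer):

# treat blanks as zero rather than missing, so the cumsum has no NaN
dframe['cards_plus1'] = pd.to_numeric(dframe['cards_plus1'], errors='coerce').fillna(0)
dframe['cards_plus1_cumsum'] = dframe['cards_plus1'].cumsum()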
I have a data frame as below:
df = pd.DataFrame({'A':[1,4,7,1,4,7],'B':[2,5,8,2,5,8],'C':[3,6,9,3,6,9],'D':[1,2,3,1,2,3]})
A B C D
0 1 2 3 1
1 4 5 6 2
2 7 8 9 3
3 1 2 3 1
4 4 5 6 2
5 7 8 9 3
How can I find the difference between columns A and B and save it as AB, and do the same with C and D, saved as CD, within the data frame?
Expected output:
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
I tried using:
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).diff()
As explained here, this works well for sum(), but does not work as expected for diff(). Can someone please explain why?
The difference is that diff does not aggregate values like sum; for each group it returns two new columns, the first filled with NaN and the second with the values.
So a possible solution here is to remove the all-NaN columns with DataFrame.dropna:
d = dict(A='AB', B='AB', C='CD', D='CD')
df1 = df.rename(columns=d).groupby(level=0, axis=1).diff().dropna(axis=1, how='all')
print (df1)
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
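For just two fixed pairs, a plain column-arithmetic sketch produces the same values without any groupby:

# subtract within each pair explicitly
df1 = pd.DataFrame({'AB': df['B'] - df['A'], 'CD': df['D'] - df['C']})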