Doing .diff() on pandas column(s) gives wrong output? [duplicate] - python

This question already has answers here:
Subtract consecutive columns in a Pandas or Pyspark Dataframe
(2 answers)
Closed 2 years ago.
I am trying to take the difference of a column using .diff() in a dataframe with a date column and a value column.
import pandas as pd
d = {'Date':['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3,4]}
df1 = pd.DataFrame(data=d)
df1.diff(axis = 1)
Pandas gives me this output:
Date a
0 11/11/2011 2
1 11/12/2011 3
2 11/13/2011 4
That is just df1 unchanged, not the difference. The output I expect is:
Date a
0 11/11/2011 NaN
1 11/12/2011 1
2 11/13/2011 1

Use df1.set_index('Date').diff(axis=0). With 'Date' moved into the index, only column a is differenced, row by row.

axis=1 subtracts across columns (each column minus the one to its left), not down rows. Your target result is row-wise, so use axis=0 (the default) instead.
Second, the Date column holds strings, and subtracting one string from another is not supported in Python: it raises a TypeError. Keep that column out of the diff.
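A minimal sketch of both points, using the question's data. The names by_index and result are only illustrative. The first option moves the string column into the index. The second keeps 'Date' as a regular column and differences only column a, which matches the layout of the expected output.

```python
import pandas as pd

d = {'Date': ['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3, 4]}
df1 = pd.DataFrame(data=d)

# Option 1: put the string column in the index so only 'a' is differenced
by_index = df1.set_index('Date').diff(axis=0)

# Option 2: keep 'Date' as a column and difference only the numeric column
result = df1.copy()
result['a'] = df1['a'].diff()
print(result)
```

The first row of a is NaN in both cases because there is no previous row to subtract.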


Drop rows and reset_index in a dataframe [duplicate]

This question already has answers here:
Pandas reset index is not taking effect [duplicate]
(4 answers)
Closed 5 days ago.
I was wondering why reset_index() has no effect in the following piece of code.
data = [0,10,20,30,40,50]
df = pd.DataFrame(data, columns=['Numbers'])
df.drop(df.index[2:4], inplace=True)
df.reset_index()
df
Numbers
0 0
1 10
4 40
5 50
UPDATE:
If I use df.reset_index(inplace=True), I see a new column which is not desired.
index Numbers
0 0 0
1 1 10
2 4 40
3 5 50
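reset_index() defaults to inplace=False, so it returns a new DataFrame and leaves df unchanged. Either assign the result back (df = df.reset_index()) or pass inplace=True. To stop the old index from being added as a new column, also pass drop=True.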
Please try this code
import pandas as pd
# create a sample DataFrame
df = pd.DataFrame({'column_name': [1, 2, 0, 4, 0, 6]})
# drop rows where column 'column_name' has value of 0
df = df[df['column_name'] != 0]
# reset the index of the resulting DataFrame
df = df.reset_index(drop=True)
print(df)
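Applied to the question's own data, the same fix might look like this (a minimal sketch that assigns results back instead of using inplace=True):

```python
import pandas as pd

data = [0, 10, 20, 30, 40, 50]
df = pd.DataFrame(data, columns=['Numbers'])

# drop the rows at positions 2 and 3, keeping the returned DataFrame
df = df.drop(df.index[2:4])

# drop=True discards the old index instead of adding it as an 'index' column
df = df.reset_index(drop=True)
print(df)
```

After this, the index runs from 0 to 3 and the only column is Numbers.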

Merging 2 dataframe using update index but after running below code, index column is missing from dataframe1 [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 months ago.
I have two dataframes, and I want to update the "var1" column of dataframe1 with the "var1" column of dataframe2, matching rows on the unique column "respid".
This is just an example: df1 has more columns than shown, while dataframe2 is exactly as shown.
I used the code below and it works fine for var1, but my index column "respid" is missing after it runs.
df1.set_index(['respid'], inplace=True)
df1.update(df2.set_index(['respid']))
df1.reset_index()
with pd.ExcelWriter("path"+ ".xlsx") as writer:
    df1.to_excel(writer, sheet_name='sheet2', index=False)
Please let me know why the "respid" column is missing from df1 and, if possible, how to correct it.
Try this way
df = pd.merge(df1,df2,on = ['respid'],how ='inner')
dfs = pd.merge(df,df1,on = ['respid'],how ='outer')
dfs =dfs.drop(columns=['var1_x','var1'])
dfs = dfs.fillna('')
dfs.columns = ['respid', 'var1']
which gives
respid var1
0 27217 screened
1 27211 screened
2 27214 screened
3 25402
4 1111
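As for why respid goes missing in the question's code: df1.reset_index() returns a new DataFrame rather than modifying df1, so respid is still the index when to_excel(..., index=False) runs, and index=False tells pandas not to write the index. Below is a minimal sketch of the update-based approach with the result assigned back. The sample values are assumptions, since the original tables are not shown in the question.

```python
import pandas as pd

# Hypothetical data standing in for the question's tables
df1 = pd.DataFrame({'respid': [27217, 27211, 27214, 25402, 1111],
                    'var1': ['', '', '', '', '']})
df2 = pd.DataFrame({'respid': [27217, 27211, 27214],
                    'var1': ['screened', 'screened', 'screened']})

df1 = df1.set_index('respid')
df1.update(df2.set_index('respid'))  # overwrites var1 where respid matches

# reset_index() is not in-place by default, so keep the returned DataFrame
df1 = df1.reset_index()
print(df1)
```

With respid back as a regular column, writing with index=False keeps it in the output file.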

How to calculate the average of a column where the row meets a certain condition in Pandas [duplicate]

This question already has answers here:
Pandas Groupby: Count and mean combined
(6 answers)
Closed last year.
Basically I have this Dataframe:
import pandas as pd
data = {'number': [1,1,1,1,1,2,2,2,4,4,4,4,6,6], 'time':[34,33,41,36,43,22,24,32,29,28,33,32,55,51]}
df = pd.DataFrame(data)
print(df)
Output:
I want to transform df (or create a new DataFrame) so that each 'number' appears in only one row, with the 'time' column holding the average of the rows that shared that 'number'. There should also be a third column, 'count', showing how many rows each 'number' had.
The output expected is:
Thanks.
Simply use groupby + agg:
agg = df.groupby('number')['time'].agg(['count', 'mean']).reset_index()
Output:
>>> agg
number count mean
0 1 5 37.4
1 2 3 26.0
2 4 4 30.5
3 6 2 53.0
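If the averaged column should keep the name 'time', as the question describes, named aggregation is one option. A minimal sketch, assuming the question's data; the variable name result is only illustrative:

```python
import pandas as pd

data = {'number': [1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 4, 6, 6],
        'time': [34, 33, 41, 36, 43, 22, 24, 32, 29, 28, 33, 32, 55, 51]}
df = pd.DataFrame(data)

# each keyword names an output column: (input column, aggregation)
result = df.groupby('number', as_index=False).agg(
    time=('time', 'mean'),
    count=('time', 'count'),
)
print(result)
```

This yields the columns number, time and count, with the same values as the output above.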

create a new dataframe based on given dataframe [duplicate]

This question already has answers here:
Group dataframe and get sum AND count?
(4 answers)
Closed 1 year ago.
I have a table that looks like this:
user id  observation
25       2
25       3
25       2
23       1
23       3
the desired outcome is:
user id  observation  retention
25       7            3
23       4            2
I want one row per unique user id, with the observation column holding the sum of that user's observation values and a retention column showing how many times the id appears in the dataset.
any help will be appreciated
thanks
Use groupby() method and chain agg() method to it:
outputdf=df.groupby('user id',as_index=False).agg(observation=('observation','sum'),retention=('observation','count'))
Now if you print outputdf you will get your desired output:
user id observation retention
0 23 4 2
1 25 7 3
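A self-contained version of this approach, using the question's data. The sort=False argument is an addition here, not part of the answer above: it keeps groups in order of first appearance, so 25 comes before 23 as in the desired outcome.

```python
import pandas as pd

df = pd.DataFrame({'user id': [25, 25, 25, 23, 23],
                   'observation': [2, 3, 2, 1, 3]})

# sort=False keeps groups in first-appearance order instead of sorting the keys
outputdf = df.groupby('user id', as_index=False, sort=False).agg(
    observation=('observation', 'sum'),
    retention=('observation', 'count'),
)
print(outputdf)
```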
You have to use groupby:
import pandas as pd
d = {'user id': [25,25,25,33,33], 'observation': [2,3,2,1,3]}
# get the dataframe
df = pd.DataFrame(data=d)
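# select the column and pass a list so the result has flat columns in a fixed order (sum, then count)
df_new = df.groupby('user id')['observation'].agg(['sum', 'count']).reset_index()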
# rename the columns as you desire
df_new.columns = ['user id', 'observation', 'retention']
df_new
Output:

Count the sum of a subset of the index in a pandas series [duplicate]

This question already has answers here:
Sum only certain rows in a given column of pandas dataframe
(2 answers)
Closed 3 years ago.
I have a pandas.core.series.Series with some data, and I want to calculate the sum of the values at index 0 to 13. How would I do that?
This is what I tried so far:
#preg.prglngth.value_counts().sort_index()
prglnght_var = preg['prglngth']
prglnght_var.ser[:14]
The series data looks like this:
0 15
1 9
....
47 1
48 7
50 2
Name: prglngth, dtype: int64
You can try:
prglnght_var.loc[:13].sum()
.loc is a label-based indexer on the Series class.
It selects the rows or columns (the rows, in this case) that match the criteria you choose (in this case, all labels from 0 to 13). Unlike ordinary Python slicing, a .loc slice includes its end label, so .loc[:14] would also include label 14.
It returns a Series.
.sum is a Series method that sums all the values in it.
As the series is already filtered down to the rows you want, it sums exactly those values.
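The difference between label-based and position-based slicing matters when some labels are missing from the index, as in the question's data (there is no label 49). A minimal sketch with made-up counts; the series s and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical counts with a sorted index that has gaps (no label 2, for example)
s = pd.Series([15, 9, 5, 2, 4, 2], index=[0, 1, 3, 13, 14, 50], name='prglngth')

by_label = s.loc[:13].sum()          # labels 0..13, end label included
one_too_far = s.loc[:14].sum()       # also includes label 14
by_position = s.iloc[:4].sum()       # first four rows, whatever their labels
by_mask = s[s.index <= 13].sum()     # equivalent boolean filter on the index

print(by_label, one_too_far, by_position, by_mask)
```

Here .iloc[:4] happens to match .loc[:13] only because exactly four labels are at or below 13. With gaps in the index, .loc or the boolean mask is the safer way to select by label.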
