Adding a first difference column to a pandas dataframe - python

I have a dataframe df with two columns date and data. I want to take the first difference of the data column and add it as a new column.
It seems that df.set_index('date').shift() or df.set_index('date').diff() give me the desired result. However, when I try to add it as a new column, I get NaN for all the rows.
How can I fix this command:
df['firstdiff'] = df.set_index('date').shift()
to make it work?
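One likely cause of the all-NaN column is index alignment: the result of set_index('date') is labelled by date, while df still carries its original index, so no labels match when the column is assigned. A minimal sketch, assuming df has the default integer index and the columns are named date and data as in the question (the sample values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "data": [10.0, 13.0, 11.0],
})

# diff() on the column keeps df's own index, so the assignment aligns row by row.
df["firstdiff"] = df["data"].diff()

# Same result via the date index: to_numpy() drops the index to avoid realignment.
df["firstdiff_alt"] = df.set_index("date")["data"].diff().to_numpy()
```

The first row stays NaN in both columns because it has no previous value to subtract.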

Related

How to reshape dataframe with pandas?

I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete the rows that do not have data in ProductCategory. I would like to do this in Python with pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use ffill (equivalent to fillna with method='ffill', which newer pandas versions deprecate), which propagates the last valid observation forward to the next valid one. Then drop the rows that still contain NAs.
df['Date'] = df['Date'].ffill()
df['Place'] = df['Place'].ffill()
df.dropna(inplace=True)
You are going to use the forward-filling method to replace null values with the nearest value above them: df[['Date', 'Place']] = df[['Date', 'Place']].ffill(). Next, drop the rows with a missing ProductCategory: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
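The two steps can be sketched end to end as follows. The column names follow the question; the sample rows are invented, since the original sample data is not shown:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the question.
df = pd.DataFrame({
    "Date": ["2018-01-01", np.nan, np.nan, "2018-01-02"],
    "Place": ["Berlin", np.nan, np.nan, "Paris"],
    "ProductCategory": [np.nan, "Food", "Toys", np.nan],
    "Sales": [np.nan, 12.0, 7.0, np.nan],
})

# Forward-fill only the two key columns.
df[["Date", "Place"]] = df[["Date", "Place"]].ffill()

# Drop the rows that still lack a product category, then renumber the index.
df = df.dropna(subset=["ProductCategory"]).reset_index(drop=True)
```

Restricting dropna to ProductCategory keeps rows that are missing values only in other columns.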
Compute the frequency of the categories in the column and plot it;
in the plot, the tallest bars represent the most repeated values.
df['column'].value_counts().plot.bar()
Get the most frequent value from the index: index[0] gives the most repeated value,
index[1] gives the second most repeated, and you can choose as your requirement dictates.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with it:
df['column'] = df['column'].fillna(most_frequent_attribute)
To fill multiple columns the same way, define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent_category)
for feature in ['column1', 'column2']:
    impute_nan(df, feature)
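As an alternative to the loop, the same imputation can probably be done for several columns at once, because DataFrame.fillna accepts a Series that maps column names to fill values. A sketch with invented sample data (column1 and column2 are the placeholder names used above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "column1": ["a", "b", "a", np.nan],
    "column2": ["x", np.nan, "y", "y"],
})
cols = ["column1", "column2"]

# mode() returns one row per tied mode; iloc[0] takes the first mode of each column.
modes = df[cols].mode().iloc[0]

# fillna with a Series fills each column with the value stored under its own name.
df[cols] = df[cols].fillna(modes)
```

When a column has several equally frequent values, this picks the first one that mode() returns.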

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR. The two dataframes are similar (not equal) only in this ID column; the remaining columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' values also appear in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
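Because IDs can repeat, the two approaches return different row counts. A small sketch with invented stand-ins for MR and DT (the mr_value and dt_value columns are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for MR and DT; only the ID column name comes from the answer.
MR = pd.DataFrame({"ID": [1, 2, 2, 3, 4], "mr_value": ["a", "b", "c", "d", "e"]})
DT = pd.DataFrame({"ID": [2, 2, 4], "dt_value": [10, 20, 30]})

# isin keeps each MR row at most once, however often its ID occurs in DT.
filtered = MR[MR["ID"].isin(DT["ID"])]

# merge pairs every matching MR row with every matching DT row.
merged = pd.merge(MR, DT, on="ID")
```

Here filtered has 3 rows, while merged has 5, since ID 2 occurs twice in each frame and produces 2 x 2 combinations.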

Filling In Empty Dates With Previous Data in Pandas

I have an Excel file with a date column in which, as can be seen, not every row has a date.
I read this Excel file into a pandas dataframe, rename the column and get the following:
My question is: how can I fill every empty date in the dataframe with the last date encountered before it? For example, all of the blanks between 04/03/2021 and 05/03/2021 should be replaced with 04/03/2021, so that every row in my dataframe has a date associated with it.
Thanks!
After reading the data into a dataframe, you can fill the missing values with a forward fill (ffill, equivalent to fillna with method='ffill').
Just use the built-in pandas way:
duplicate_df['StartDate'] = duplicate_df['StartDate'].ffill()
This replaces each NaN in the StartDate column with the last preceding value that was not missing.
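Forward fill only touches cells that pandas treats as missing. If some blanks arrive as empty strings rather than NaN (an assumption; it depends on how the sheet was read), they need converting first. A sketch with invented rows, where Item is a hypothetical second column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; StartDate is the column name used in the answer above.
duplicate_df = pd.DataFrame({
    "StartDate": ["04/03/2021", None, "", "05/03/2021", None],
    "Item": ["a", "b", "c", "d", "e"],
})

# Turn empty strings into NaN so that ffill treats them as gaps, then fill forward.
duplicate_df["StartDate"] = duplicate_df["StartDate"].replace("", np.nan).ffill()
```

A leading blank would stay missing, because there is no earlier date to copy down.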

Dividing one column in a dataframe by a number while bringing back all other columns in the dataframe

I am trying to divide one column in a dataframe by a number while bringing back all other columns in the dataframe unchanged. The code below works for the division, but I don't know how to bring back the other columns unchanged as well:
df= df[['C']].div(4, axis = 0)
the data looks like this
The output that I am looking for would be:
The closest I could find to an answer is here: Divide multiple columns by another column in pandas
However, the last quote says to use pd.set_index after the division, but I am not sure how that syntax is supposed to look.
Right now I am only getting the output column C not the two other columns.
You can do:
df['C'] = df['C'] / 4
This divides column C by 4 while keeping the other columns the same.
You are getting only column C in the output because you assign the one-column result back to df, which replaces the whole dataframe. Assign it to the column instead:
df['C'] = df['C'].div(4)
This should give you the exact result.
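If the original dataframe should stay untouched, one option is DataFrame.assign, which returns a copy with the column replaced. A sketch with invented data (only the column name C comes from the question; A and B are hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "C": [8.0, 12.0]})

# assign returns a new frame with C replaced and A, B untouched; df itself is unchanged.
result = df.assign(C=df["C"] / 4)
```

This avoids overwriting df, which is what caused the other columns to disappear in the question's code.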

How to feed new columns every time in a loop to a spark dataframe?

I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. For example, if the table has 5 columns, I want to feed the data like this:
first column in the first iteration
first and second column in the second iteration to the same dataframe
and likewise.
I need generic code. Has anyone tried something similar? Please help me out with an example.
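This will work for a pandas dataframe:
import pandas as pd
df2 = pd.DataFrame()
for i in range(len(df.columns)):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same row-wise stacking
    df2 = pd.concat([df2, df.iloc[:, 0:i+1]], sort=True)
Because the same column names repeat across iterations, the dataframe cannot hold a column name twice, so each iteration keeps adding rows.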
You can extract the names from the dataframe's schema, then access each column and use it the way you want.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
    # df[columns] now selects the columns collected so far; use it the way you want
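The growing column lists can also be generated separately from the dataframe. The helper below, column_prefixes, is a hypothetical name introduced for illustration; it is plain Python and only assumes that the column names are available as a list, for example from df.schema.names:

```python
from typing import Iterator, List

def column_prefixes(names: List[str]) -> Iterator[List[str]]:
    """Yield [first], [first, second], ... for a list of column names."""
    for end in range(1, len(names) + 1):
        yield names[:end]

# With a PySpark DataFrame this could be driven by the schema, for example:
#     for cols in column_prefixes(df.schema.names):
#         subset = df.select(*cols)
prefixes = list(column_prefixes(["c1", "c2", "c3"]))
```

Each iteration then works on a dataframe that contains one more column than the previous one.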
