I have a data frame that contains product sales for each day starting from 2018 to 2021 year. Dataframe contains four columns (Date, Place, Product Category and Sales). From the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete rows that do not have data in ProductCategory. I would like to do in python pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use fillna with method 'ffill' that propagates last valid observation forward to next valid backfill. Then drop the rows that contain NAs.
df['Date'].fillna(method='ffill',inplace=True)
df['Place'].fillna(method='ffill',inplace=True)
df.dropna(inplace=True)
You are going to use the forward-filling method to replace null values with the value of the nearest one above it df['Date', 'Place'] = df['Date', 'Place'].fillna(method='ffill'). Next, to drop rows with missing values df.dropna(subset='ProductCategory', inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
compute the frequency of catagories in the column by plotting,
from plot you can see bars reperesenting the most repeated values
df['column'].value_counts().plot.bar()
and get the most frequent value using index, index[0] gives most repeated and
index[1] gives 2nd most repeated and you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
then fill missing values by above method
df['column'].fillna(df['column'].most_freqent_attribute,inplace=True)
to fill multiple columns with same method just define this as funtion, like this
def impute_nan(df,column):
most_frequent_category=df[column].mode()[0]
df[column].fillna(most_frequent_category,inplace=True)
for feature in ['column1','column2']:
impute_nan(df,feature)
Related
I have a dataframe that looks like this:
Input dataframe
I want to find the contribution of each category to the Price(USD) column by day. So far I've tried aggregating by Timestamp and Category, with the sum of Price(USD):
df3 = df.groupby(["Timestamp", "Category"]).sum()
Obtaining the following dataset:
Dataset grouped by Timestamp and Category
After this point, I haven't been able to apply a function to each row to divide each Price(USD) by the sum of all different categories in each day and create a new column with these values.
Ideally, a new column "Percentage" would contain :
Percentage
0.3/(0.3+0.2+0.1)
0.2/(0.3+0.2+0.1)
0.1/(0.3+0.2+0.1)
With the same pattern for the rest of the dataframe.
Thank you
Seems like you need
>>> df.groupby(["Timestamp", "Category"]).sum() / df.groupby(["Timestamp"]).sum()
here is another way about it
df.groupby(['Timestamp','Category'])['price'].transform(sum) / df.groupby(['Timestamp'])['price'].transform(sum)
I have a data frame that I made the transpose of it looking like this
I would like to know how I can transform this group into filled lines, follow an example below
Where the first column is filled with the first value until the last empty row.
how can i do this if the column is grouped
In your case, repeat the indices of your data frame five times, save them in a new column, and then make the column entries original indices.
ibov_transpose['index'] = ibov_transpose.index.repeat(5)
ibov_transpose.set_index('index')
del(ibov_transpose['index'])
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR, they both are just similar (not equal) in this ID column, the rest of the columns are different as well as the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
The DT is 1538 rows and MR is 2060 rows).
I tried some lines proposed here >https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results as I don't fully understand the methods they proposed (and the goal is little different)
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
I need to group by and apply a pandas df with the next rows
['CpuEff',
'my_remote_host',
'GLIDEIN_CMSSite',
'BytesRecvd',
'BytesSent',
'CMSPrimaryPrimaryDataset',
'CMSPrimaryDataTier',
'DESIRED_CMSDataset',
'DESIRED_CMSPileups',
'type_prefix',
'CMS_Jobtype',
'CMS_Type',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount',
'LastRemoteHost']
Then, I apply the group by and calculate the mean of each field and passing into a new df
grouped = df.groupby(['DESIRED_CMSDataset'])
df_mean=grouped.mean()
df_mean
And check the new df fields,
list(df_mean.columns)
['CpuEff',
'BytesRecvd',
'BytesSent',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount']
The issue is, I want to plot a histogram showing 'DESIRED_CMSDataset' and the respective mean values of each row, but it does not allow me as long as in new dataframe this row disappear.
Is there any way to perform the same operation without losing the gropued row?
I think (i am on mobile rn) if you aggregate this way your group column becomes the index of the new df. Try running df = df.reset_index(). I think adding as_index=False during groupby also works. Will confirm and edit answer tomorrow. You could also plot df.index if you want to keep it that way
I want to aggregate my data based off a field known as COLLISION_ID and a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs since they have the same Coordinates, but retain a count of occurrences in original data-set.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns such:
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data which are not shown here(~40 additional columns that will be filtered later)
If you are talking about filter , we should do transform
df1['count_col']=df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
Then you can filter the df1 with column count