I have multiple dataframes with time-series indexes in dfList (an example dataframe is shown below).
I tried to concatenate these dataframes into one dataframe with the following command:
db = pd.concat(dfList)
and got the following dataframe.
The time-series index values are duplicated (many of them are 2012-10-12 20:00:00), since the time series in the source dataframes overlap each other.
I want to remove these duplicates. Does anyone know how to do this?
Some example dataframes in which the time-series indexes overlap are shown below.
Thank you!!
You can simply drop the duplicates by a particular column value as mentioned in the docs here. You may do something like this:
db = db.drop_duplicates(subset="Timestamp")
which drops all rows with a duplicated value in the column "Timestamp", keeping only the first occurrence (older pandas versions used the keyword cols instead of subset).
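Note that in the question the duplicated timestamps are in the index rather than in a regular column. If that is your situation, a minimal sketch using pandas' Index.duplicated() (keeping the first occurrence, with made-up values) would be:

import pandas as pd

# Toy frame with a duplicated timestamp in the index, as in the question
idx = pd.to_datetime(['2012-10-12 20:00:00', '2012-10-12 20:00:00',
                      '2012-10-12 21:00:00'])
db = pd.DataFrame({'value': [1, 1, 2]}, index=idx)

# Keep only the first row for each duplicated index entry
db = db[~db.index.duplicated(keep='first')]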
I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python with pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use fillna with method 'ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that still contain NAs.
df['Date'] = df['Date'].ffill()
df['Place'] = df['Place'].ffill()
df.dropna(inplace=True)
You are going to use the forward-filling method to replace null values with the value of the nearest non-null one above it: df[['Date', 'Place']] = df[['Date', 'Place']].ffill(). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
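For reference, a minimal runnable sketch of the forward-fill-then-drop approach on made-up data (all values here are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-01-01', None, None, '2018-01-02'],
    'Place': ['NY', None, None, 'Boston'],
    'ProductCategory': ['A', 'B', None, 'A'],
    'Sales': [10, 20, 30, 40],
})

# Forward-fill the gaps in Date and Place, then drop the rows
# that still have no ProductCategory
df[['Date', 'Place']] = df[['Date', 'Place']].ffill()
df = df.dropna(subset=['ProductCategory'])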
Compute the frequency of the categories in the column by plotting; from the plot you can see the bars representing the most repeated values:
df['column'].value_counts().plot.bar()
and get the most frequent value via the index: index[0] gives the most repeated value, index[1] gives the second most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
then fill the missing values with that value:
df['column'] = df['column'].fillna(most_frequent_attribute)
To fill multiple columns with the same method, just define it as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent_category)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)
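A minimal usage sketch on made-up data, assuming the impute_nan function above (the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'column1': ['a', 'a', None, 'b'],
    'column2': ['x', None, 'x', 'y'],
})

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

# The NaN in column1 becomes 'a', the NaN in column2 becomes 'x'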
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to values in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
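For illustration, a minimal sketch of both approaches on made-up frames (the IDs and extra columns are hypothetical):

import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'mr_col': ['a', 'b', 'c', 'd', 'e']})
DT = pd.DataFrame({'ID': [2, 3], 'dt_col': ['x', 'y']})

# Filter MR down to the rows whose ID also occurs in DT
filtered = MR.loc[MR.ID.isin(DT.ID), :]   # keeps the rows with ID 2, 2 and 3

# Or combine the records of both frames for matching IDs
combined = pd.merge(MR, DT, on='ID')      # columns ID, mr_col, dt_col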
I have a pandas dataframe where I have done a groupby. The groupby results look like this:
As you can see this dataframe has a multilevel index ('ga:dimension3','ga:data') and a single column ('ga:sessions').
I am looking to create a dataframe with the first level of the index ('ga:dimension3') and the first date for each first-level index value:
I can't figure out how to do this.
Guidance appreciated.
Thanks in advance.
Inspired by @ggaurav's suggestion of using first(), I think that the following should do the work (df is the data you provided, after the groupby):
result = df.reset_index(1).groupby('ga:dimension3').first()
You can directly use first(). As you need the data based on just 'ga:dimension3', you need to group by it (or by level=0):
df.groupby(level=0).first()
Without groupby, you can get the level-0 index values, delete the duplicated ones and keep the first occurrence:
df[~df.index.get_level_values(0).duplicated(keep='first')]
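For illustration, a minimal sketch of these approaches on a made-up two-level index (the level names follow the question; the values are hypothetical):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('dimA', '2021-01-01'), ('dimA', '2021-01-02'), ('dimB', '2021-01-05')],
    names=['ga:dimension3', 'ga:date'])
df = pd.DataFrame({'ga:sessions': [10, 20, 30]}, index=idx)

# Move the date level into a column, then keep the first row per dimension
result = df.reset_index(1).groupby('ga:dimension3').first()

# Or keep the full MultiIndex and drop duplicated level-0 entries
result2 = df[~df.index.get_level_values(0).duplicated(keep='first')]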
I have two DataFrames that I want to merge. I have read about merging on multiple columns, and preserving the index when merging. My problem needs to cater for both, and I am having difficulty figuring out the best way to do this.
The first DataFrame looks like this
and the second looks like this
I want to merge these based on the Date and the ID. In the first DataFrame the Date is the index and the ID is a column; in the second DataFrame both Date and ID are part of a MultiIndex.
Essentially, as a result I want a DataFrame that looks like DataFrame 2 with an additional column for the Events from DataFrame 1.
I'd suggest resetting the index (reset_index) and then merging the DataFrames, as you've read. Then you can set the index (set_index) to reproduce your desired MultiIndex.
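A minimal sketch of that reset-merge-set round trip on made-up frames (the column names besides Date, ID and Events are hypothetical):

import pandas as pd

# DataFrame 1: Date index, with ID and Events as columns
df1 = pd.DataFrame({'ID': [1, 2], 'Events': [5, 7]},
                   index=pd.to_datetime(['2021-01-01', '2021-01-02']))
df1.index.name = 'Date'

# DataFrame 2: (Date, ID) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2021-01-01'), 1), (pd.Timestamp('2021-01-02'), 2)],
    names=['Date', 'ID'])
df2 = pd.DataFrame({'Value': [0.1, 0.2]}, index=idx)

# Reset both indexes, merge on Date and ID, then rebuild the MultiIndex
merged = (df2.reset_index()
             .merge(df1.reset_index(), on=['Date', 'ID'])
             .set_index(['Date', 'ID']))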
I have two Series that I need to join in one DataFrame.
Each series has a date index and corresponding price.
When I use concat I get a DataFrame that has one index (good) but two columns that have the same values (bad).
zee_nbp = pd.concat([zee_da_df, nbp_da_df], axis=1)
The values are correct for zee_da_df but are duplicated for nbp_da_df. Any ideas? I have checked, and each series has different values before they are concatenated.
Thanks in advance
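For comparison, a minimal sketch of how pd.concat(..., axis=1) normally behaves on two date-indexed Series (the names and values here are made up); giving each Series a distinct name keeps the resulting columns apart:

import pandas as pd

dates = pd.to_datetime(['2021-01-01', '2021-01-02'])
zee_da_df = pd.Series([10.0, 11.0], index=dates, name='zee_da')
nbp_da_df = pd.Series([20.0, 21.0], index=dates, name='nbp_da')

# Aligns both Series on the shared date index; each becomes its own column
zee_nbp = pd.concat([zee_da_df, nbp_da_df], axis=1)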