Pandas: Use a dataframe to index another one and fill the gaps?

I have two dataframes: df_0 is a complete list of dates, and df_1 is a generic register indexed by an incomplete set of dates. I need to build a dataframe that has df_0's complete dates as its index, filled with df_1's register values on the matching dates. For dates without a register entry, I just need to repeat the last available date's register data as a filler. Any ideas on how to do this?
Thanks in advance.

Use DataFrame.reindex with the method parameter:
df = df_1.reindex(df_0.index, method="ffill")
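A minimal sketch of how this behaves, with hypothetical toy data (df_0 carrying the complete date index, df_1 the sparse register):
import pandas as pd

# Hypothetical data: df_0 spans every date, df_1 only some of them
df_0 = pd.DataFrame(index=pd.date_range("2023-01-01", "2023-01-05"))
df_1 = pd.DataFrame({"register": [10, 30]},
                    index=pd.to_datetime(["2023-01-01", "2023-01-03"]))

# Reindex to the full date range; method="ffill" repeats the last
# known register value for dates without an entry of their own
df = df_1.reindex(df_0.index, method="ffill")
print(df)
#             register
# 2023-01-01        10
# 2023-01-02        10
# 2023-01-03        30
# 2023-01-04        30
# 2023-01-05        30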

Use 'reindex' to expand df_1 and 'fillna' to fill in the missing values:
df_2 = df_1.reindex(df_0.index).fillna(method="ffill")
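Note that in newer pandas versions fillna(method=...) is deprecated, so the chained ffill() call is the safer spelling of the same idea:
df_2 = df_1.reindex(df_0.index).ffill()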

Related

How to drop rows in python that don't have date values?

I need help cleaning a very large dataframe. One of the columns, "PostingTimeUtc", should contain only dates, but several rows were inserted incorrectly and contain strings of text instead. How can I select all the rows whose "PostingTimeUtc" contains strings instead of dates and drop them?
I'm new to this site and to coding, so please let me know if I'm being vague.
Please remember to add examples, even if short.
This may work in your case:
from pandas.api.types import is_datetime64_any_dtype as is_datetime
df[df['column name'].map(is_datetime)]
Here map applies the is_datetime function (which returns True or False) to each value, and the resulting Boolean mask is used to filter the dataframe.
Don't forget to assign the result back to df to retain it, as the operation is not done in place.
df = df[df['column name'].map(is_datetime)]
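If the map(is_datetime) check doesn't behave as expected on your data (it tests dtypes rather than individual values), a common alternative is to coerce the column with pd.to_datetime and keep only the rows that parse. A minimal sketch, using the "PostingTimeUtc" column from the question:
import pandas as pd

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(df["PostingTimeUtc"], errors="coerce")

# Keep only the rows whose value parsed as a date
df = df[parsed.notna()]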
I am assuming it's a pandas DataFrame. You can use this to build a Boolean mask for filtering rows on the basis of a regex:
df.column_name.str.contains('your regex here')

MultiIndex filtering of grouped data

I have a pandas dataframe where I have done a groupby. The groupby results look like this:
As you can see, this dataframe has a multilevel index ('ga:dimension3', 'ga:date') and a single column ('ga:sessions').
I am looking to create a dataframe with the first level of the index ('ga:dimension3') and the first date for each first-level index value:
I can't figure out how to do this.
Guidance appreciated.
Thanks in advance.
Inspired by ggaurav's suggestion to use first(), I think the following should do the job (df is the data you provided, after the groupby):
result = df.reset_index(1).groupby('ga:dimension3').first()
You can directly use first. Since you need data based on just 'ga:dimension3', you can group by it (or by level=0):
df.groupby(level=0).first()
Without groupby, you can get the level-0 index values and drop the duplicated ones, keeping the first:
df[~df.index.get_level_values(0).duplicated(keep='first')]
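A toy example of all three, assuming a structure like the one described (a ('ga:dimension3', 'ga:date') MultiIndex with a 'ga:sessions' column):
import pandas as pd

# Hypothetical grouped data mimicking the described structure
idx = pd.MultiIndex.from_tuples(
    [("A", "2023-01-01"), ("A", "2023-01-02"), ("B", "2023-01-05")],
    names=["ga:dimension3", "ga:date"],
)
df = pd.DataFrame({"ga:sessions": [5, 3, 7]}, index=idx)

# 1) Move the date level into a column, then take the first row per group
print(df.reset_index(1).groupby("ga:dimension3").first())

# 2) Group on level 0 directly (keeps only the data columns)
print(df.groupby(level=0).first())

# 3) Keep the first row for each level-0 value without grouping
print(df[~df.index.get_level_values(0).duplicated(keep="first")])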

Plot the grouped fields from a pandas groupby

I need to group by and aggregate a pandas df with the following columns:
['CpuEff',
'my_remote_host',
'GLIDEIN_CMSSite',
'BytesRecvd',
'BytesSent',
'CMSPrimaryPrimaryDataset',
'CMSPrimaryDataTier',
'DESIRED_CMSDataset',
'DESIRED_CMSPileups',
'type_prefix',
'CMS_Jobtype',
'CMS_Type',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount',
'LastRemoteHost']
Then I apply the groupby, calculate the mean of each field, and pass the result into a new df:
grouped = df.groupby(['DESIRED_CMSDataset'])
df_mean=grouped.mean()
df_mean
And check the new df's fields:
list(df_mean.columns)
['CpuEff',
'BytesRecvd',
'BytesSent',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount']
The issue is, I want to plot a histogram showing 'DESIRED_CMSDataset' and the respective mean values of each column, but I can't, because that column disappears in the new dataframe.
Is there any way to perform the same operation without losing the grouped column?
If you aggregate this way, your group column becomes the index of the new df. Try running df_mean = df_mean.reset_index(). Adding as_index=False during the groupby also works. You could also plot df_mean.index directly if you want to keep it as the index.
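A minimal sketch of that suggestion, continuing from df_mean above (the column choice, CpuEff, is just one of the asker's fields picked for illustration):
import matplotlib.pyplot as plt

# Bring 'DESIRED_CMSDataset' back as a regular column instead of the index
df_mean = df_mean.reset_index()

# One bar per dataset for a chosen mean value
df_mean.plot(x="DESIRED_CMSDataset", y="CpuEff", kind="bar")
plt.tight_layout()
plt.show()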

How to drop duplicated rows in data frame based on certain criteria?

Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (games played). What code can we use to achieve this?
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence (assign the result back, as it is not done in place).
df = df.drop_duplicates(['Player'], keep='first')
There are 2 ways that I can think of
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method uses groupby to group values by Player and keeps, for each player, the maximum of G (dropping any other columns). The second uses Pandas' drop_duplicates method to achieve the same while preserving whole rows.
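A quick check of both on hypothetical toy data:
import pandas as pd

df = pd.DataFrame({"Player": ["A", "A", "B", "B"],
                   "G": [10, 82, 75, 40]})

# Keeps the full row with the highest G per player
print(df.sort_values("G").drop_duplicates(["Player"], keep="last"))

# Same G values, but any columns other than Player and G are dropped
print(df.groupby("Player", as_index=False)["G"].max())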
Try this. Assume your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know whether this works for you.

How can I set the index of a generated pandas Series to a column from a DataFrame?

In pandas this operation creates a Series:
q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1)
I would like to be able to set the index to a list of values from a df column, i.e.
list(df['Colname'])
I've tried to create the series and then update it with the series generated from the first code snippet. I've also searched the docs and don't see a method that will let me do this. I would prefer not to iterate over it manually.
Help is appreciated.
You can simply store that series in a variable, say S, and set the index accordingly, as shown below:
S = (q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1))
S.index = df['Colname']
The code assumes the lengths of the series and the column from the dataframe are equal. Hope this helps!
If you want to replace the index of a series s, you can do:
s.index = new_index_list
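Putting it together on hypothetical toy data (assuming, as above, that the series and the column have equal lengths):
import pandas as pd

q7 = pd.DataFrame({"a": [1, 5], "b": [4, 2]})
df = pd.DataFrame({"Colname": ["x", "y"]})

# Column-wise range (max - min), then re-labelled by df['Colname']
S = q7.max(axis=1) - q7.min(axis=1)
S.index = df["Colname"]
print(S)
# Colname
# x    3
# y    3
# dtype: int64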
