I have filtered a pandas dataframe by grouping and taking the sum, and now I want all the details and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown. Currently the amount column is the sum of all transactions done by an individual on a specific date; I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work: for example, "David" could be in my groupby df on Sept 15, but in the larger df he has made transactions on other days as well, and those slip through when using isin().
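One way to express this requirement is a merge on the group keys, since a merge matches the whole (Date, Name) combination per row rather than each column independently the way isin() does. A minimal sketch, with hypothetical names (filtered stands for the filtered groupby/sum result, df for the full dataframe):
import pandas as pd

# reset_index() puts the group keys back into ordinary columns
keys = filtered.reset_index()[['Date', 'Name']].drop_duplicates()

# An inner merge keeps only rows of the full df whose (Date, Name) pair
# appears in keys, so David's transactions on other days no longer slip through.
details = df.merge(keys, on=['Date', 'Name'], how='inner')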
Hello there and welcome,
first of all, as I've learned myself, always try:
to give some data (in text or code form) as your input
to share your expected output, to avoid follow-up questions
to have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at the very least, people can use my code below to reproduce your df.
# From the picture
import pandas as pd

data = {'Date': ['2014-06-30', '2014-07-02', '2014-07-02', '2014-07-03', '2014-07-09', '2014-07-14', '2014-07-17', '2014-07-25', '2014-07-29', '2014-07-29', '2014-08-06', '2014-08-11', '2014-08-22'],
        'LastName': ['Cow', 'Kind', 'Lion', 'Steel', 'Torn', 'White', 'Goth', 'Hin', 'Hin', 'Torn', 'Goth', 'Hin', 'Hin'],
        'FirstName': ['C', 'J', 'K', 'J', 'M', 'D', 'M', 'G', 'G', 'M', 'M', 'G', 'G'],
        'Vendor': ['Jail', 'Vet', 'TGI', 'Dept', 'Show', 'Still', 'Turf', 'Glass', 'Sup', 'Ref', 'Turf', 'Lock', 'Brenn'],
        'Amount': [5015.70, 6293.27, 7043.00, 7600, 9887.08, 5131.74, 5037.55, 5273.55, 9455.48, 5003.71, 6675, 7670.5, 8698.18]}
df = pd.DataFrame(data)

# what I believe you did to get the Date-grouped view
incoming = df.groupby(['Date', 'LastName', 'FirstName', 'Vendor', 'Amount']).count()
incoming
Now here is my answer:
First I merged FirstName and LastName:
df['CompleteName'] = df[['FirstName', 'LastName']].agg('.'.join, axis=1)  # combining the names for df
Then I computed some statistics for the amount, for different groups:
# creating new columns with the summed Amount per group (CompleteName, Date, Vendor, etc.)
df['AmountSumName'] = df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate'] = df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor'] = df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just group by whatever you wish. Since transform keeps one row per original transaction, the individual amounts stay visible right next to the group sums.
Hope I could answer your question.
A few rows of my dataframe
The third column shows the completion time of my data. Ideally, I'd want that column to just show the date, removing the second half of each element, but I'm not sure how to change the elements. I was able to change the (second) column of strings into a column of floats without the pound symbol in order to find the sum of costs. However, this column has no specific keyword I can select on to remove part of every element.
The second part of my question: is it possible to easily create another dataframe that contains only 2021-05-xx or 2021-06-xx? I know there's a way to make another dataframe by selecting certain rows, like the top 15 or bottom 7, but I don't know if there's a way to make a dataframe matching what I described. I'm thinking it involves Series.str.contains(), but when I put '2021-05' in the parentheses, it shows an entire dataframe of Falses.
Extracting just the date (ignoring the time) from the datetime column can be done by converting the column:
df['date'] = pd.to_datetime(df['date']).dt.date
For the second part of the question, creating a new dataframe filtered down to only rows between 2021-05-xx and 2021-06-xx, we can use pandas boolean filtering. (Run the comparison while the column still holds datetimes, or re-wrap it with pd.to_datetime, since the plain date objects produced by .dt.date don't compare cleanly against pandas Timestamps.)
df_filtered = df[(df['date'] >= pd.to_datetime('2021-05-01')) & (df['date'] <= pd.to_datetime('2021-06-30'))]
Here we take advantage of two things: 1) pandas makes it easy to compare the chronology of different dates using comparison operators, and 2) any date of the form 2021-05-xx or 2021-06-xx must fall on/after the first day of May and on/before the last day of June.
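If you do want the str.contains() route from the question: it showed all Falses because the column holds datetimes, not strings, so a string pattern has nothing to match. Casting to strings first makes it work; a sketch assuming the column is named date:
# hypothetical string-based filter: cast the datetimes to 'YYYY-MM-DD...' strings first
mask = pd.to_datetime(df['date']).astype(str).str.contains('2021-05|2021-06')
df_filtered = df[mask]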
There are also a few GUIs that make it easy to change the formatting of columns and to filter data without actually having to write the code yourself. I'm the creator of one of these tools, Mito. To filter dates in Mito, you can just enter the dates using our calendar input fields and Mito will generate the equivalent pandas code for you!
I have this data set that I have been able to organise to the best of my abilities. I'm stuck on the next step. Here is a picture of the df:
My goal is to organise it in a way so that I have the columns month, genres, and time_watched_hours.
If I do the following:
df = df.groupby(['month']).sum().reset_index()
It only sums down the 1s in the genre columns, whereas I need to add up the time_watched_hours for each occurrence of a genre. For example, in the first row it would add 4.84 hours for genre_Comedies. In the third row, 0.84 hours for genre_Crime, and so on.
Once that's organised, I will use the following to get it into the format I need:
df_cleaned = df.melt(id_vars='month',value_name='time_watched_hours',var_name='Genres').rename(columns=str.title)
Any advice on how to tackle this problem would be greatly appreciated! Thanks!
EDIT: Looking at this further, it would also work to replace the "1" in each row with the time_watched_hours value, then I can groupby().sum() down. Note there may be more than one value of "1" per row.
Ended up finding and using mask for each column, which worked perfectly. The downside was that I had to spell it out for each column:
# replace each 1 flag with that row's hours (plain assignment avoids calling mask with inplace=True on a column selection)
df['genre_Action & Adventure'] = df['genre_Action & Adventure'].mask(df['genre_Action & Adventure'] == 1, df['time_watched_hours'])
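If you'd rather avoid listing every column, a vectorized sketch (assuming all the one-hot columns share the genre_ prefix) does the same replacement for every genre column at once:
# select every one-hot genre column by its shared prefix (assumed naming convention)
genre_cols = [c for c in df.columns if c.startswith('genre_')]

# multiplying scales each 0/1 flag by that row's hours: 1s become the hours, 0s stay 0
df[genre_cols] = df[genre_cols].mul(df['time_watched_hours'], axis=0)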
I have a list of company names, dates, and pe ratios.
I need to find the average of the previous 10 years of data as of a given date, such that only month-end dates are considered.
For example, if I need to find the average as of 31 Dec 2015, I first need the data for all month-ends from 31/12/2005 to 31/12/2015, and then their average.
sample data I have
required output:
here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
but this method takes the mean of all the daily values within each month and shows the result monthly, unlike my sample output; I want only the month-end value.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
this code works; now I just have one problem: exporting the dataframe with to_csv. Please help.
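For the export, to_csv can be called on the aggregated frame directly; a small sketch continuing from the groupby above (the file name is illustrative):
month_end = df3.first().reset_index()  # flatten the group keys back into columns
month_end.to_csv('month_end_values.csv', index=False)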
A DataFrame has a special method called groupby that groups the rows by the values in a column; each group can then be aggregated.
So if you were to run data.groupby('Company Name'), the rows would be bucketed by company.
Now if you were to tack on .describe(), you would get the standard deviation/mean/min/etc. per group.
Example:
data.groupby('Company Name').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
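Applied to this question, that might look like the following (assuming columns named 'Company Name' and 'pe'):
data.groupby('Company Name')['pe'].mean()  # average pe per company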
I am trying to search through a large dataframe for a specific date. The date may have multiple values in the data_value column. After finding the date, I extract the maximum value from the set of values associated with that date.
Is there a way to make this more efficient? It runs slowly now.
max_temps = []
for date in dates:
    value = data_w[data_w['Date'] == date]['Data_Value'].max()
    max_temps.append(value)
If I understood your problem properly, then you need something like this:
temp = data_w[data_w['Date'].isin(dates)]
print(temp.groupby('Date')['Data_Value'].max())
Explanation:
First apply isin() to your large dataframe to keep only the rows whose date is in dates, then apply groupby and take the max of each group. This computes every date's maximum in one vectorized pass instead of filtering the frame once per date.
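If you also need the results in the same order as your dates list, the way the original loop produced them, you can reindex the grouped result; a small sketch:
# optional: restore the original order of `dates` and get a plain list
max_temps = temp.groupby('Date')['Data_Value'].max().reindex(dates).tolist()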