I have filtered a pandas DataFrame by grouping and taking the sum, and now I want all the details back and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown: currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it doesn't work because, for example, "David" could be in my groupby df on Sept 15, while in the larger df he has made transactions on other days as well, and those slip through when using isin().
Hello there and welcome,
First of all, as I've learned myself, always try to:
give some data (in text or code form) as your input
share your expected output, to avoid follow-up questions
have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at least people can use the code below to reproduce your df.
import pandas as pd

# Reconstructed from the picture
data = {'Date': ['2014-06-30', '2014-07-02', '2014-07-02', '2014-07-03', '2014-07-09', '2014-07-14', '2014-07-17', '2014-07-25', '2014-07-29', '2014-07-29', '2014-08-06', '2014-08-11', '2014-08-22'],
        'LastName': ['Cow', 'Kind', 'Lion', 'Steel', 'Torn', 'White', 'Goth', 'Hin', 'Hin', 'Torn', 'Goth', 'Hin', 'Hin'],
        'FirstName': ['C', 'J', 'K', 'J', 'M', 'D', 'M', 'G', 'G', 'M', 'M', 'G', 'G'],
        'Vendor': ['Jail', 'Vet', 'TGI', 'Dept', 'Show', 'Still', 'Turf', 'Glass', 'Sup', 'Ref', 'Turf', 'Lock', 'Brenn'],
        'Amount': [5015.70, 6293.27, 7043.00, 7600, 9887.08, 5131.74, 5037.55, 5273.55, 9455.48, 5003.71, 6675, 7670.5, 8698.18]}
df = pd.DataFrame(data)

# What I believe you did to get Date grouped
incoming = df.groupby(['Date', 'LastName', 'FirstName', 'Vendor', 'Amount']).count()
incoming
Now, here's my answer:
First, I merged FirstName and LastName:
df['CompleteName'] = df[['FirstName', 'LastName']].agg('.'.join, axis=1)  # full names for df
Then I computed some statistics for the Amount column, over different groups:
# Create a new column with statistics per group (CompleteName, Date, Vendor, etc.)
df['AmountSumName'] = df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate'] = df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor'] = df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just groupby as you wish.
Hope I could answer your question.
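To filter the larger df by the keys that survive in a groupby result (the thing isin() per column can't do), an inner merge on the grouping columns works. Here is a minimal sketch with made-up data; the column names, dates, and threshold are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical individual transactions (names and values are made up)
df = pd.DataFrame({
    'Date': ['2014-09-15', '2014-09-15', '2014-09-16', '2014-09-15'],
    'Name': ['David', 'David', 'David', 'Alice'],
    'Amount': [100.0, 250.0, 75.0, 300.0],
})

# The grouped/filtered frame: e.g. only (Date, Name) pairs whose
# summed Amount exceeds some threshold
grouped = df.groupby(['Date', 'Name'], as_index=False)['Amount'].sum()
keep = grouped[grouped['Amount'] > 200][['Date', 'Name']]

# An inner merge keeps only the individual rows whose (Date, Name) pair
# appears in the filtered groupby result -- unlike isin(), which checks
# each column independently and lets other dates slip through
detail = df.merge(keep, on=['Date', 'Name'])
print(detail)
```

David's 2014-09-16 transaction is excluded even though "David" appears in the kept pairs, because the merge matches Date and Name together.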
I have a pandas dataframe containing a large number of answers given by users in response to a survey, and I need to re-structure it. There are up to 105 questions asked each year, but I only need maybe 20 of them.
The current structure is as below.
What I want to do is re-structure it so that the row values become column names and the answer given by the user is then the value in that column. In a picture (from Excel), what I want is the below (I know I'll need to re-name my columns, but that's fine once I can create the structure in the first place):
Is it possible to re-structure my dataframe this way? The outcome of this is to use some predictive analytics to predict a target variable, so I need to re-structure before I can use Random Forest, kNN, and so on.
You might want to try pivoting your table (passing a list to index requires pandas >= 1.1):
df = df.pivot(index=['SurveyID', 'UserID'], columns=['QuestionID'], values=['AnswerText']).reset_index()
df.columns = [x[0] if x[1] == "" else "Answer_{}".format(x[1]) for x in df.columns.to_flat_index()]
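For a concrete sketch of what the pivot produces, here is a small runnable example with made-up survey data (all values, and the exact column names, are assumptions):

```python
import pandas as pd

# Hypothetical long-format survey data: one row per answer
df = pd.DataFrame({
    'SurveyID': [1, 1, 1, 1],
    'UserID': [10, 10, 11, 11],
    'QuestionID': [1, 2, 1, 2],
    'AnswerText': ['yes', 'no', 'no', 'maybe'],
})

# One row per (SurveyID, UserID), one column per QuestionID
# (list-valued index in pivot() requires pandas >= 1.1)
wide = df.pivot(index=['SurveyID', 'UserID'],
                columns='QuestionID',
                values='AnswerText')

# Rename the question columns, then bring the index back as columns
wide.columns = ['Answer_{}'.format(c) for c in wide.columns]
wide = wide.reset_index()
print(wide)
```

Each user's answers now sit on a single row, which is the shape scikit-learn estimators such as Random Forest or kNN expect.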
I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
1. Create the empty dataframe:
emptydf = Claims[0:0]
2. Parse the dataframe by one Part ID and append it to the empty dataframe:
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)
As you can see, I can only hard-code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID and the second is the subset of Claims for that Part ID.
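A minimal sketch of this approach with made-up claims data (the Amount column and the Part ID values are assumptions):

```python
import pandas as pd

# Hypothetical claims data; 'Part_ID' matches the column in the question
Claims = pd.DataFrame({
    'Part_ID': [1009, 1009, 2044, 2044, 3077],
    'Amount': [10.0, 12.5, 7.0, 3.2, 99.9],
})

# One (part_id, sub-frame) tuple per unique Part_ID -- no hard-coding,
# so it adapts when the Part IDs change from week to week
groups = list(Claims.groupby('Part_ID'))

for part_id, subset in groups:
    print(part_id, len(subset))

# If a single stacked frame is wanted back, concatenate the pieces
stacked = pd.concat(subset for _, subset in groups)
```

This avoids the query-and-append loop entirely: groupby discovers the unique Part IDs for you.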
I have some data with 4 features of interest: account_id, location_id, date_from and date_to. Each entry corresponds to a period where a customer account was associated with a particular location.
There are some pairs of account_id and location_id which have multiple entries, with different dates. This means that the customer is associated with the location for a longer period, covered by multiple consecutive entries.
So I want to create an extra column with the total length of time that a customer was associated with a given location. I am able to use groupby and apply to calculate this for each pair (see code below). This works fine, but I don't understand how to then add the result back into the original dataframe as a new column.
lengths = non_zero_df.groupby(['account_id','location_id'], group_keys=False).apply(lambda x: x.date_to.max() - x.date_from.min())
Thanks
I think Mephy is right that this should probably go to StackOverflow.
You're going to have a shape incompatibility because there will be fewer entries in the grouped result than in the original table. You'll need to do the equivalent of an SQL left outer join with the original table and the results, and you'll have the total length show up multiple times in the new column -- every time you have an equal (account_id, location_id) pair, you'll have the same value in the new column. (There's nothing necessarily wrong with this, but it could cause an issue if people are trying to sum up the new column, for example)
Check out pandas.DataFrame.join (you can also use merge). You'll want to join the old table with the results, on (account_id, location_id), as a left (or outer) join.
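A small sketch of joining the grouped result back, using made-up data with the question's column names (the dates and IDs are assumptions):

```python
import pandas as pd

# Hypothetical association periods (column names match the question)
non_zero_df = pd.DataFrame({
    'account_id': [1, 1, 2],
    'location_id': [7, 7, 8],
    'date_from': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'date_to':   pd.to_datetime(['2020-01-31', '2020-03-15', '2020-04-01']),
})

# Total span per (account_id, location_id) pair, as in the question
lengths = non_zero_df.groupby(['account_id', 'location_id']).apply(
    lambda x: x.date_to.max() - x.date_from.min())
lengths.name = 'total_length'  # the Series needs a name to become a column

# Left join broadcasts each pair's value onto every matching row,
# so the same total appears on every row of that pair
result = non_zero_df.join(lengths, on=['account_id', 'location_id'])
print(result)
```

As noted above, the total is repeated across rows of the same pair, so summing the new column naively would over-count.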
I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that trade isn't taken if the strategy is already in a trade - but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.
Here's my problem for the data I have:
I need to do a row-by-row comparison of the dataframe to determine whether the EntryDate of rowX is less than the ExitDate of rowX-1.
A sample of my data:
       EntryDate    ExitDate
Row 1  2012-07-25   2012-07-27
Row 2  2012-07-26   2012-07-29
Row 2 needs to be deleted because it is a trade that should not have occurred.
I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.
for i in range(0, len(df) - 1):  # stop before the last row
    EntryDate = df['EntryDate'].iloc[i]
    ExitDate = df['ExitDate'].iloc[i]
    EntryNextTrade = df['EntryDate'].iloc[i + 1]
    if EntryNextTrade > ExitDate:
        line = {'EntryDate': EntryDate, 'ExitDate': ExitDate}
        df_trades = df_trades.append(line, ignore_index=True)
Any thoughts or ideas on how to more efficiently accomplish this?
You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.
You should use some kind of boolean mask to do this kind of operation.
One way is to create a dummy column holding the previous trade's exit date, aligned with each row:
df['ExitPrevTrade'] = df['ExitDate'].shift()
Use this to create the mask (the first row has no previous trade, so keep it):
msk = (df['EntryDate'] > df['ExitPrevTrade']) | df['ExitPrevTrade'].isna()
And use loc to look at the subDataFrame where msk is True, and only the specified columns:
df.loc[msk, ['EntryDate', 'ExitDate']]
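Applied to the two sample trades from the question, a self-contained sketch of the shift-and-mask approach looks like this (the keep/drop rule, entry must come after the previous exit, is inferred from the question):

```python
import pandas as pd

# The two sample trades from the question
df = pd.DataFrame({
    'EntryDate': pd.to_datetime(['2012-07-25', '2012-07-26']),
    'ExitDate':  pd.to_datetime(['2012-07-27', '2012-07-29']),
})

# Previous trade's exit date, aligned with each row (first row gets NaT)
prev_exit = df['ExitDate'].shift()

# Keep a trade only if it starts after the previous one ended;
# the first row has no previous trade, so keep it explicitly
msk = (df['EntryDate'] > prev_exit) | prev_exit.isna()
clean = df.loc[msk]
print(clean)
```

Row 2 is dropped because its entry (2012-07-26) falls before Row 1's exit (2012-07-27), with no explicit Python loop needed.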