I have a dataframe that looks like this:
Input dataframe
I want to find the contribution of each category to the Price(USD) column by day. So far I've tried aggregating by Timestamp and Category, with the sum of Price(USD):
df3 = df.groupby(["Timestamp", "Category"]).sum()
Obtaining the following dataset:
Dataset grouped by Timestamp and Category
After this point, I haven't been able to apply a function to each row that divides each Price(USD) value by the sum across all categories for that day and stores the result in a new column.
Ideally, a new column "Percentage" would contain:
Percentage
0.3/(0.3+0.2+0.1)
0.2/(0.3+0.2+0.1)
0.1/(0.3+0.2+0.1)
With the same pattern for the rest of the dataframe.
Thank you
Seems like you need
>>> df.groupby(["Timestamp", "Category"]).sum() / df.groupby(["Timestamp"]).sum()
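A minimal, self-contained sketch of that division, using hypothetical data shaped like the question's (timestamps, categories, and a Price(USD) column); `div(..., level=...)` makes the per-day alignment explicit:

```python
import pandas as pd

# Hypothetical data shaped like the question's input
df = pd.DataFrame({
    "Timestamp": ["2021-01-01"] * 3 + ["2021-01-02"] * 3,
    "Category": ["A", "B", "C"] * 2,
    "Price(USD)": [0.3, 0.2, 0.1, 0.5, 0.3, 0.2],
})

per_category = df.groupby(["Timestamp", "Category"])["Price(USD)"].sum()
per_day = df.groupby("Timestamp")["Price(USD)"].sum()

# Divide, aligning the per-day totals on the Timestamp level
pct = per_category.div(per_day, level="Timestamp")
```

Each value is then that category's share of the day's total, e.g. 0.3/(0.3+0.2+0.1) for category A on the first day.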
Here is another way to go about it:
df.groupby(['Timestamp','Category'])['Price(USD)'].transform('sum') / df.groupby(['Timestamp'])['Price(USD)'].transform('sum')
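A sketch of the transform approach on hypothetical data; because transform returns a Series aligned with the original rows, the ratio can be assigned straight back as the new Percentage column:

```python
import pandas as pd

# Hypothetical frame matching the question's columns
df = pd.DataFrame({
    "Timestamp": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "Category": ["A", "B", "A"],
    "Price(USD)": [0.3, 0.2, 0.4],
})

# transform('sum') repeats each day's total on every row of that day,
# so the ratio lines up row-for-row with the original frame
daily_total = df.groupby("Timestamp")["Price(USD)"].transform("sum")
df["Percentage"] = df["Price(USD)"] / daily_total
```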
I have filtered a pandas dataframe by grouping and taking the sum; now I want all the details and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown: currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work because, for example, "David" could be in my groupby df on Sept 15, but in the larger df he has made transactions on other days as well, and those slip through when using isin().
Hello there and welcome,
First of all, as I've learned myself, always try:
to give some data (in text or code form) as your input
to share your expected output, to avoid more questions
have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at least people can use my code to get your df.
#From the picture
data={'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
'LastName':['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
'FirstName':['C','J','K','J','M','D','M','G','G','M','M','G','G'],
'Vendor':['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]
}
df=pd.DataFrame(data)
incoming=df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
#what I believe you did to get Date grouped
incoming
Now here is my answer:
First, I merged FirstName and LastName:
df['CompleteName']=df[['FirstName','LastName']].agg('.'.join,axis=1) # getting Names for df
Then I computed some statistics for the amount, for different groups:
#creating new columns with the Amount summed per group (CompleteName, Date, Vendor)
df['AmountSumName']=df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate']=df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor']=df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just groupby as you wish.
Hope I could answer your question.
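On the filtering part of the question: one way to recover the detail rows behind a filtered groupby result is an inner merge on both key columns, which matches (Date, Name) pairs together rather than testing each column independently the way isin() does. A sketch with made-up data (the column names and the `> 60` filter are hypothetical):

```python
import pandas as pd

# Hypothetical full transaction log
df = pd.DataFrame({
    "Date": ["2014-09-15", "2014-09-15", "2014-09-20", "2014-09-15"],
    "Name": ["David", "David", "David", "Alice"],
    "Amount": [100.0, 50.0, 75.0, 30.0],
})

# Grouped summary, then whatever filter produced the smaller df
summary = df.groupby(["Date", "Name"], as_index=False)["Amount"].sum()
keep = summary[summary["Amount"] > 60]

# Inner merge keeps only detail rows whose (Date, Name) pair
# appears in the filtered summary
detail = df.merge(keep[["Date", "Name"]], on=["Date", "Name"], how="inner")
```

David's Sept 15 rows come back individually, while Alice's group (which failed the filter) is excluded as a pair, not name-by-name.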
I have this data set that I have been able to organise to the best of my abilities. I'm stuck on the next step. Here is a picture of the df:
My goal is to organise it in a way so that I have the columns month, genres, and time_watched_hours.
If I do the following:
df = df.groupby(['month']).sum().reset_index()
It only sums down the 1s in the genre columns, whereas I need to add up the time_watched_hours for each instance of that genre occurring. For example, in the first row, it would add 4.84 hours for genre_Comedies; in the third row, 0.84 hours for genre_Crime; and so on.
Once that's organised, I will use the following to get it in the format I need:
df_cleaned = df.melt(id_vars='month',value_name='time_watched_hours',var_name='Genres').rename(columns=str.title)
Any advice on how to tackle this problem would be greatly appreciated! Thanks!
EDIT: Looking at this further, it would also work to replace the "1" in each row with the time_watched_hours value, then I can groupby().sum() down. Note there may be more than one value of "1" per row.
Ended up finding and using mask for each column, which worked perfectly. The downside was that I had to list it for each column:
df['genre_Action & Adventure'] = df['genre_Action & Adventure'].mask(df['genre_Action & Adventure'] == 1, df['time_watched_hours'])
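An alternative sketch that avoids the per-column mask calls: since the genre columns are 0/1 indicators, multiplying them all by time_watched_hours at once performs the same replacement in one step (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one-hot genre columns
df = pd.DataFrame({
    "month": [1, 1, 2],
    "genre_Comedies": [1, 0, 1],
    "genre_Crime": [0, 1, 1],
    "time_watched_hours": [4.84, 0.84, 2.0],
})

genre_cols = [c for c in df.columns if c.startswith("genre_")]
# Multiply every indicator column by the hours, row-aligned (axis=0);
# 1s become the hours, 0s stay 0 -- no per-column mask needed
df[genre_cols] = df[genre_cols].mul(df["time_watched_hours"], axis=0)

# Then sum per month and melt into month / Genres / time_watched_hours
tidy = (
    df.drop(columns="time_watched_hours")
      .groupby("month").sum()
      .reset_index()
      .melt(id_vars="month", value_name="time_watched_hours", var_name="Genres")
)
```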
I'm working with a crypto-currency data sample; each cell contains a dictionary holding the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order for me to find the correlation between different currencies and between highest price and volume for example. How can this be done in python (pandas)...also how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
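For example, after converting with pd.to_datetime, a date range can be selected by slicing the index with .loc (a sketch with made-up prices; the btc_close column name is hypothetical):

```python
import pandas as pd

# Hypothetical daily close prices with string dates as the index
df = pd.DataFrame(
    {"btc_close": [100.0, 110.0, 105.0, 120.0]},
    index=["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
)

# Convert the index so label-based date slicing works
df.index = pd.to_datetime(df.index)

# .loc slicing on a DatetimeIndex is inclusive of both endpoints
subset = df.loc["2018-01-02":"2018-01-03"]
```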
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns after. Unfortunately I can't think of a non-iterative method. This method goes column by column: it creates a DataFrame for each coin, adds the coin name as a prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
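For what it's worth, the loop can be collapsed into a single pd.concat over a list comprehension, which avoids re-copying result on every iteration; a sketch with a tiny hypothetical frame of dict-valued cells:

```python
import pandas as pd

# Hypothetical frame: columns are coins, rows are dates, cells are dicts
df = pd.DataFrame({
    "btc": [{"open": 1.0, "close": 2.0}, {"open": 2.0, "close": 3.0}],
    "eth": [{"open": 0.1, "close": 0.2}, {"open": 0.2, "close": 0.3}],
}, index=["2018-01-01", "2018-01-02"])

# Build every per-coin frame first, then concatenate once
result = pd.concat(
    [df[coin].apply(pd.Series).add_prefix(coin + "_") for coin in df.columns],
    axis=1,
)
```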
I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
#1. Create an empty dataframe
emptydf = Claims[0:0]
#2. Parse the dataframe by each unique Part ID number and append to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)
As you can see, I can only hard code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID, and the second is the subset of Claims for that Part ID.
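A small sketch of what that returns, with made-up Claims data; a dict comprehension over the same groupby additionally gives keyed access to each subset:

```python
import pandas as pd

# Hypothetical Claims data
Claims = pd.DataFrame({
    "Part_ID": [1009, 1009, 2044, 3501],
    "Amount": [10.0, 20.0, 5.0, 7.5],
})

# List of (part_id, subset) tuples, one per unique Part_ID,
# sorted by key by groupby's default
parts = list(Claims.groupby("Part_ID"))

# Keyed access to each subset, without hard-coding any Part ID
by_part = {part_id: subset for part_id, subset in Claims.groupby("Part_ID")}
```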