I have a dataframe that looks like this:
Input dataframe
I want to find the contribution of each category to the Price(USD) column by day. So far I've tried aggregating by Timestamp and Category, with the sum of Price(USD):
df3 = df.groupby(["Timestamp", "Category"]).sum()
Obtaining the following dataset:
Dataset grouped by Timestamp and Category
After this point, I haven't been able to apply a function to each row that divides each Price(USD) value by the sum across all categories for that day and stores the result in a new column.
Ideally, a new column "Percentage" would contain:
Percentage
0.3/(0.3+0.2+0.1)
0.2/(0.3+0.2+0.1)
0.1/(0.3+0.2+0.1)
With the same pattern for the rest of the dataframe.
Thank you
Seems like you need
>>> df.groupby(["Timestamp", "Category"]).sum() / df.groupby(["Timestamp"]).sum()
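A minimal, self-contained sketch of that division, using hypothetical data shaped like the question's (timestamps, categories, and a Price(USD) column); `div(..., level=...)` makes the per-day alignment explicit:

```python
import pandas as pd

# Hypothetical data shaped like the question's input
df = pd.DataFrame({
    "Timestamp": ["2021-01-01"] * 3 + ["2021-01-02"] * 3,
    "Category": ["A", "B", "C"] * 2,
    "Price(USD)": [0.3, 0.2, 0.1, 0.5, 0.3, 0.2],
})

per_category = df.groupby(["Timestamp", "Category"])["Price(USD)"].sum()
per_day = df.groupby("Timestamp")["Price(USD)"].sum()

# Divide, aligning the per-day totals on the Timestamp level
pct = per_category.div(per_day, level="Timestamp")
```

Each value is then that category's share of the day's total, e.g. 0.3/(0.3+0.2+0.1) for category A on the first day.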
Here is another way to go about it:
df.groupby(['Timestamp','Category'])['Price(USD)'].transform('sum') / df.groupby(['Timestamp'])['Price(USD)'].transform('sum')
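A sketch of the transform approach on hypothetical data; because transform returns a Series aligned with the original rows, the ratio can be assigned straight back as the new Percentage column:

```python
import pandas as pd

# Hypothetical frame matching the question's columns
df = pd.DataFrame({
    "Timestamp": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "Category": ["A", "B", "A"],
    "Price(USD)": [0.3, 0.2, 0.4],
})

# transform('sum') repeats each day's total on every row of that day,
# so the ratio lines up row-for-row with the original frame
daily_total = df.groupby("Timestamp")["Price(USD)"].transform("sum")
df["Percentage"] = df["Price(USD)"] / daily_total
```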
I have filtered a pandas dataframe by grouping and taking the sum; now I want all the details and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown: currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work because, for example, "David" could be in my groupby df on Sept 15, but in the larger df he has made transactions on other days as well, and those slip through when using isin().
Hello there and welcome,
First of all, as I've learned myself, always try:
to give some data (in text or code form) as your input
to share your expected output, to avoid more questions
have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at least people can use my code to get your df.
#From the picture
data={'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
'LastName':['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
'FirstName':['C','J','K','J','M','D','M','G','G','M','M','G','G'],
'Vendor':['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]
}
df=pd.DataFrame(data)
incoming=df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
#what I believe you did to get Date grouped
incoming
Now here is my answer:
First, I merged FirstName and LastName:
df['CompleteName']=df[['FirstName','LastName']].agg('.'.join,axis=1) # getting Names for df
Then I computed some statistics for the amount, for different groups:
#creating new columns with the Amount summed per group (CompleteName, Date, Vendor)
df['AmountSumName']=df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate']=df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor']=df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just groupby as you wish.
Hope I could answer your question.
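On the filtering part of the question: one way to recover the detail rows behind a filtered groupby result is an inner merge on both key columns, which matches (Date, Name) pairs together rather than testing each column independently the way isin() does. A sketch with made-up data (the column names and the `> 60` filter are hypothetical):

```python
import pandas as pd

# Hypothetical full transaction log
df = pd.DataFrame({
    "Date": ["2014-09-15", "2014-09-15", "2014-09-20", "2014-09-15"],
    "Name": ["David", "David", "David", "Alice"],
    "Amount": [100.0, 50.0, 75.0, 30.0],
})

# Grouped summary, then whatever filter produced the smaller df
summary = df.groupby(["Date", "Name"], as_index=False)["Amount"].sum()
keep = summary[summary["Amount"] > 60]

# Inner merge keeps only detail rows whose (Date, Name) pair
# appears in the filtered summary
detail = df.merge(keep[["Date", "Name"]], on=["Date", "Name"], how="inner")
```

David's Sept 15 rows come back individually, while Alice's group (which failed the filter) is excluded as a pair, not name-by-name.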
I have this data set that I have been able to organise to the best of my abilities. I'm stuck on the next step. Here is a picture of the df:
My goal is to organise it in a way so that I have the columns month, genres, and time_watched_hours.
If I do the following:
df = df.groupby(['month']).sum().reset_index()
It only sums down the 1s in the genre columns, whereas I need to add up the time_watched_hours for each instance of that genre occurring. For example, in the first row, it would add 4.84 hours for genre_Comedies; in the third row, 0.84 hours for genre_Crime; and so on.
Once that's organised, I will use the following to get it in the format I need:
df_cleaned = df.melt(id_vars='month',value_name='time_watched_hours',var_name='Genres').rename(columns=str.title)
Any advice on how to tackle this problem would be greatly appreciated! Thanks!
EDIT: Looking at this further, it would also work to replace the "1" in each row with the time_watched_hours value, then I can groupby().sum() down. Note there may be more than one value of "1" per row.
Ended up finding and using mask for each column, which worked perfectly. The downside was that I had to list it for each column:
df['genre_Action & Adventure'] = df['genre_Action & Adventure'].mask(df['genre_Action & Adventure'] == 1, df['time_watched_hours'])
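An alternative sketch that avoids the per-column mask calls: since the genre columns are 0/1 indicators, multiplying them all by time_watched_hours at once performs the same replacement in one step (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one-hot genre columns
df = pd.DataFrame({
    "month": [1, 1, 2],
    "genre_Comedies": [1, 0, 1],
    "genre_Crime": [0, 1, 1],
    "time_watched_hours": [4.84, 0.84, 2.0],
})

genre_cols = [c for c in df.columns if c.startswith("genre_")]
# Multiply every indicator column by the hours, row-aligned (axis=0);
# 1s become the hours, 0s stay 0 -- no per-column mask needed
df[genre_cols] = df[genre_cols].mul(df["time_watched_hours"], axis=0)

# Then sum per month and melt into month / Genres / time_watched_hours
tidy = (
    df.drop(columns="time_watched_hours")
      .groupby("month").sum()
      .reset_index()
      .melt(id_vars="month", value_name="time_watched_hours", var_name="Genres")
)
```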
I'm working with a crypto-currency data sample; each cell contains a dictionary holding the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order for me to find the correlation between different currencies and between highest price and volume for example. How can this be done in python (pandas)...also how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
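For example, after converting with pd.to_datetime, a date range can be selected by slicing the index with .loc (a sketch with made-up prices; the btc_close column name is hypothetical):

```python
import pandas as pd

# Hypothetical daily close prices with string dates as the index
df = pd.DataFrame(
    {"btc_close": [100.0, 110.0, 105.0, 120.0]},
    index=["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
)

# Convert the index so label-based date slicing works
df.index = pd.to_datetime(df.index)

# .loc slicing on a DatetimeIndex is inclusive of both endpoints
subset = df.loc["2018-01-02":"2018-01-03"]
```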
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns after. Unfortunately I can't think of a non-iterative method. This method goes column by column: it creates a DataFrame for each coin, adds the coin name as a prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
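For what it's worth, the loop can be collapsed into a single pd.concat over a list comprehension, which avoids re-copying result on every iteration; a sketch with a tiny hypothetical frame of dict-valued cells:

```python
import pandas as pd

# Hypothetical frame: columns are coins, rows are dates, cells are dicts
df = pd.DataFrame({
    "btc": [{"open": 1.0, "close": 2.0}, {"open": 2.0, "close": 3.0}],
    "eth": [{"open": 0.1, "close": 0.2}, {"open": 0.2, "close": 0.3}],
}, index=["2018-01-01", "2018-01-02"])

# Build every per-coin frame first, then concatenate once
result = pd.concat(
    [df[coin].apply(pd.Series).add_prefix(coin + "_") for coin in df.columns],
    axis=1,
)
```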
I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
#1. Create an empty dataframe
emptydf = Claims[0:0]
#2. Parse the dataframe by each unique Part ID number and append to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)
As you can see, I can only hard code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID, and the second is the subset of Claims for that Part ID.
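A small sketch of what that returns, with made-up Claims data; a dict comprehension over the same groupby additionally gives keyed access to each subset:

```python
import pandas as pd

# Hypothetical Claims data
Claims = pd.DataFrame({
    "Part_ID": [1009, 1009, 2044, 3501],
    "Amount": [10.0, 20.0, 5.0, 7.5],
})

# List of (part_id, subset) tuples, one per unique Part_ID,
# sorted by key by groupby's default
parts = list(Claims.groupby("Part_ID"))

# Keyed access to each subset, without hard-coding any Part ID
by_part = {part_id: subset for part_id, subset in Claims.groupby("Part_ID")}
```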