I have a data frame with 101 columns. The first column is called "Country/Region" and the other 100 are dates in MM/DD/YY format from 1/22/20 to 4/30/20, like the example below. I would like to combine repeat country entries such as 'Australia' below, adding together their values in the date columns so that there is one row per country. I would like to keep ALL date columns.
I have tried to use the groupby() and agg() functions, but I do not know how to sum() together that many columns without calling every single one. Is there a way to do this without naming all 100 columns individually?
Country/Region | 1/22/20 | 1/23/20 | ... | 4/29/20 | 4/30/20
Afghanistan    | 0       | 0       | ... | 1092    | 1176
Australia      | 0       | 0       | ... | 10526   | 12065
Australia      | 0       | 0       | ... | 56289   | 4523
This should work:
df.pivot_table(index='Country/Region', aggfunc='sum')
Did you already try this? It should also give the expected result.
df.groupby('Country/Region').sum()
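For reference, a minimal sketch of that on a cut-down version of the sample data above (column names assumed to match your frame; as_index=False keeps Country/Region as a regular column instead of the index):

import pandas as pd

df = pd.DataFrame({
    'Country/Region': ['Afghanistan', 'Australia', 'Australia'],
    '1/22/20': [0, 0, 0],
    '4/30/20': [1176, 12065, 4523],
})

# one row per country, every date column summed, no columns listed by name
out = df.groupby('Country/Region', as_index=False).sum()
print(out)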
Note that
df.iloc[:, 1:].sum(axis=1)
only gives you one total per row across all the date columns; to get one summed row per country you still need the groupby above.
I have a pandas dataframe which includes a timestamp and 71 other columns, something like this:
timestamp           | close_price | highest_price | volume | ...
2018-09-29 00:00:20 | 1809        | 1811          | ...    |
2018-09-29 00:00:34 | 1823        | 1832          |        |
2018-09-29 00:00:59 | 1832        | 1863          |        |
2018-09-29 00:01:09 | 1800        | 1802          |        |
2018-09-29 00:01:28 | 1832        | 1845          |        |
.
.
.
I want to put the data into 10-minute intervals and do a separate operation on each column: for example, the 10-minute intervals of the close_price column should show the last value of the corresponding range in the original table, the highest_price column should show the max value in that range, and volume should show the mean of the values in that range. I already tried
dataTable = dataframe.resample("10min").agg({'first_price': 'first',
                                             'close_price': 'last',
                                             'highest_price': 'max',
                                             'volume': 'mean',
                                             # other attributes...
                                             })
but the result seems to be incorrect.
Are there any other ways to do what I want to do?
I would appreciate any comments or thoughts.
Note that there is no specific pattern in timestamp values. In 1 minute, we can have 0 to 60 rows.
Your approach is correct: dataframe.resample("10min").agg() does the calculations for you.
You might get more output rows than you expect, and that is because of this: resample keeps adding 10 minutes to the time and doing the calculations you asked for, but if there is no data in one of the 10-minute intervals, it creates a NULL row. Maybe your data is not continuous and that causes these NULL rows.
You can simply delete the NULL rows by using dataframe.dropna()
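To illustrate the idea, a minimal sketch assuming timestamp is the index and using the column names from your example (the agg mapping is the same as yours; the gap in the toy data produces the empty bins):

import pandas as pd

# toy data with a gap, so a couple of 10-minute bins end up empty
dataframe = pd.DataFrame(
    {'close_price': [1809, 1823, 1832],
     'highest_price': [1811, 1832, 1863],
     'volume': [10, 20, 30]},
    index=pd.to_datetime(['2018-09-29 00:00:20',
                          '2018-09-29 00:00:34',
                          '2018-09-29 00:31:10']),
)

dataTable = dataframe.resample("10min").agg({'close_price': 'last',
                                             'highest_price': 'max',
                                             'volume': 'mean'})
print(dataTable)           # the 00:10 and 00:20 bins are all NaN
print(dataTable.dropna())  # the same table with the empty bins removed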
If your data spans multiple days, or periods where you don't have any data points, calling resample() can result in lots of additional rows with NaN values. I think your code is actually correct; you just got the wrong impression from seeing all the extra rows.
I have a python pandas dataframe of stock data, and I'm trying to filter some of those tickers.
There are companies that have 2 or more tickers (a different ticker per share class, e.g. one for preferred shares and one for common shares).
I want to drop the rows for those additional tickers and keep only the one with the highest volume. The dataframe also has the company name, so maybe there is a way to use it as a condition, compare the volumes within the same company, and drop the rest? How can I do this?
Use groupby and idxmax:
Suppose this dataframe:
>>> df
  ticker  volume
0  CEBR3     123
1  CEBR5     456
2  CEBR6     789    # <- keep for group CEBR
3  GOAU3      23    # <- keep for group GOAU
4  GOAU4      12
5  CMIN3     135    # <- keep for group CMIN
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
sort=False)['volume'].idxmax().tolist()]
  ticker  volume
2  CEBR6     789
3  GOAU3      23
5  CMIN3     135
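Since you mention the dataframe also has the company name, an alternative sketch that uses it directly (the column names company and volume here are assumptions, not taken from your data):

# keep only the highest-volume ticker per company
out = (df.sort_values('volume')
         .drop_duplicates(subset='company', keep='last')
         .sort_index())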
I have a pandas dataframe as follows:
You will note that there are many rows with the same code_module, code_presentation, id_student combination.
What I want to do is merge all of these duplicate rows and, in doing so, sum the sum_click values within each group.
An example of this is for the top rows they would be merged into one row looking as follows:
  code_module code_presentation  id_student  sum_click
0         AAA             2013J       28400         18
In SQL terms, the primary key would be the (code_module, code_presentation, id_student) combination.
In my progress on this, I tried to use groupby in the following way:
df.groupby(['id_student', 'code_presentation', 'code_module']).aggregate({'sum_click': 'sum'})
But this didn't work, as it gave student ids that aren't even in my dataset, which I don't understand.
Also, groupby doesn't seem to be quite what I'm looking for, as its result has a structure different from a standard pandas dataframe, which is what I would like to end up with.
The problem can be seen in the following output
                                          sum_click
id_student code_presentation code_module
6516       2014J             AAA               2791
8462       2013J             DDD                646
           2014J             DDD                 10
11391      2013J             AAA                934
Rows 1 and 2 (indexing from 0) should be distinct rows, instead of grouped together as they appear here.
Try this -
df.groupby(['code_module', 'code_presentation', 'id_student']).agg(sum_clicks=('sum_click', 'sum')).reset_index()
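If the MultiIndex in the grouped output is what was throwing you off, an equivalent sketch that keeps the keys as ordinary columns the whole way through (column names taken from your example):

out = df.groupby(['code_module', 'code_presentation', 'id_student'],
                 as_index=False)['sum_click'].sum()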
I am a Java developer finding it a bit tricky to switch to Python and pandas. I'm trying to iterate over dates of a pandas DataFrame which looks like this:
sender_user_id created
0 1 2016-12-19 07:36:07.816676
1 33 2016-12-19 07:56:07.816676
2 1 2016-12-19 08:14:07.816676
3 15 2016-12-19 08:34:07.816676
What I am trying to get is a dataframe which gives me a count of the total number of transactions that have occurred per week. From the forums I have only been able to find syntax for 'for loops' which iterate over indexes. Basically, I need a result dataframe which looks like this: the value field contains the count of sender_user_id, and the date needs to be modified to show the starting date of each week.
date value
0 2016-12-09 20
1 2016-12-16 36
2 2016-12-23 56
3 2016-12-30 32
Thanks in advance for the help.
I think you need to resample by week and aggregate with size:
# cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need another offset:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created')['sender_user_id'].nunique()
.reset_index(name='value'))
created value
0 2016-12-23 3
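If you also want the label to show the first day of each week rather than the last, as in your desired output, one option is to shift the resampled label back by six days. A sketch assuming df and the W-FRI anchor used above (so each week runs Saturday through Friday):

weekly = df.resample('W-FRI', on='created').size().reset_index(name='value')
# each 'W-FRI' label is the Friday that ends the week; subtracting six days
# shows the Saturday that starts that week instead
weekly['created'] = weekly['created'] - pd.Timedelta(days=6)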
I have two dataframes, each of them looking like this:
date country value
20100101 country1 1
20100102 country1 2
20100103 country1 3
date country value
20100101 country2 4
20100102 country2 5
20100103 country2 6
I want to merge them into one dataframe looking like this:
date country1 country2
20100101 1 4
20100102 2 5
20100103 3 6
Is there any clever way to do this in pandas?
This looks like a pivot table, which in pandas (for some bizarre reason) is done with unstack when you start from a groupby.
An example analogous to the one used in Wes McKinney's "Python for Data Analysis" book:
bytz = df.groupby(['tz', 'opersystem'])
counts = bytz.size().unstack().fillna(0)
(group by time zone and operating system in rows, which is then pivoted so that operating system becomes the columns, just like your "country*" values).
P.S. For concatenating dataframes you can use pandas.concat. It's also often good to call .reset_index on the resulting dataframe, because in some (many?) cases duplicate values in the index can make pandas go haywire, throwing strange exceptions on .apply used on the dataframe and the like.
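Applied to your example, a minimal sketch (df1 and df2 are your two frames; the column names date, country and value are taken from your sample):

import pandas as pd

combined = pd.concat([df1, df2])

# make each country its own column, indexed by date
result = combined.set_index(['date', 'country'])['value'].unstack('country')
# equivalently: combined.pivot(index='date', columns='country', values='value')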