Using groupby on already grouped data in Pandas - python

I would like to achieve the result described below in Python using Pandas: a Total per (id, Group) pair, plus an Overall_Total per id.
I tried groupby and sum on the id and Group columns with the code below:
df.groupby(['id','Group'])['Time'].sum()
That gives me the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping):
id Group Time
1  a     2
1  a     2
1  a     1
1  b     1
1  b     1
1  c     1
2  e     2
2  a     4
2  e     1
2  a     5
3  c     1
3  e     4
3  a     3
3  e     4
3  a     2
3  h     4

Assuming df is your initial dataframe, please try this:
df_group = df.groupby(['id','Group'])[['Time']].sum().rename(columns={'Time':'Total'})
df_group['Overall_Total'] = df_group.groupby('id')['Total'].transform('sum')
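With the sample data above, this should produce output along these lines:
          Total  Overall_Total
id Group
1  a          5              8
   b          2              8
   c          1              8
2  a          9             12
   e          3             12
3  a          5             18
   c          1             18
   e          8             18
   h          4             18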

Related

Sum cells with duplicate column headers in pandas during import - python

I am trying to do some basic dimensional reduction. I have a CSV file that looks something like this:
A B C A B B A C
1 1 2 2 1 3 1 1
1 2 3 0 0 1 1 2
0 2 1 3 0 1 2 2
I want to import as a pandas DF but without renaming the headers to A.1 A.2 etc. Instead I want to sum the duplicates and keep the columns names. Ideally my new DF should look like this:
A B C
4 5 3
2 3 5
5 3 3
Is it possible to do this easily or would you recommend a different way? I can also use bash, R, or anything that can do the trick with a file that is 1 million lines and 1000 columns.
Thank you!
Try splitting the column names on '.' and grouping by the first part:
df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Output:
A B C
0 4 5 3
1 2 3 5
2 5 3 3
Just load the dataframe normally, group by the first letter of each column name, and sum the values:
df.groupby(lambda colname: colname[0], axis=1).sum()
which gives
A B C
0 4 5 3
1 2 3 5
2 5 3 3
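Note that the axis=1 form of groupby is deprecated as of pandas 2.1. On recent versions, a sketch of the same idea, assuming the same df with duplicate headers mangled to A.1, B.1, etc. by read_csv, is to transpose, group the rows, and transpose back:
# axis=1 groupby is deprecated in pandas 2.x: group the transposed frame instead
df.T.groupby(df.columns.str.split('.').str[0]).sum().T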

How to use the values of a column as column references? - python

How would you use the values in one column of a pandas DataFrame to iteratively select values from other columns whose names match those values? That is, if one column holds letters and the other columns are titled with those letters, how could you use the letter column's values as column references?
Take this example data set
letter a b c d
a 1 3 4 2
d 4 3 2 1
c 2 1 4 3
d 3 4 2 1
desired results
letter a b c d correct answer
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
How do you create the correct answer variable seen in the desired results?
Assuming that you have set letter as your index, I suppose you could do the following:
df['correct_answer'] = df.apply(lambda x: x.loc[x.name], axis=1)
Yields:
a b c d correct_answer
letter
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
Here is something that might help: pandas DataFrame diagonal
This is basically searching for an identity diagonal, i.e. where rowname == colname. It should be enough to get you started.
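If letter is an ordinary column rather than the index, a vectorised sketch using NumPy positional indexing (column names a-d taken from the example above) should give the same result:
import numpy as np

vals = df[['a', 'b', 'c', 'd']]
# for each row, find the position of the column named by that row's letter
pos = vals.columns.get_indexer(df['letter'])
df['correct_answer'] = vals.to_numpy()[np.arange(len(df)), pos]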

Pandas Counting Each Column with its Specific Thresholds

If I have the following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold values for each column. For each column, I want to count how many values fall below that column's threshold, in Python pandas.
The desired output is:
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one written for these specific columns, because I have a large dataset and cannot spell out each column name in the code.
Could you please help me with this?
Select all rows except the last by indexing and compare them against the last row with DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose via DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
import numpy as np

arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
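For completeness, a minimal sketch reconstructing the sample frame above, with the threshold row last under the index label 'T':
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, -3, 2, 1],
                   'B': [2, 0, 1, 4, 4, 2],
                   'C': [0, 0, 3, 2, 1, 2],
                   'D': [1, 1, -5, 6, 9, 4],
                   'E': [0, -1, 2, 0, -1, 1]},
                  index=[1, 2, 3, 4, 5, 'T'])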

Applying operations on groups without aggregating

I want to apply an operation to multiple groups of a data frame and then fill all values of each group with the result. Let's take mean and np.cumsum as examples, with the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
a b
0 1 1
1 3 1
2 2 2
3 4 2
Now I want to group the dataframe by b, take the mean of a in each group, apply np.cumsum to the means, and then replace all values of a with the (group-dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
a
b
1 2
2 5
But what I want to get is
a b
0 2 1
1 2 1
2 5 2
3 5 2
Any ideas how this can be solved in a nice way?
You can map the cumulated group means back onto the frame with Series.map:
df1 = df.groupby("b").mean().cumsum()
print (df1)
a
b
1 2
2 5
df['a'] = df['b'].map(df1['a'])
print (df)
a b
0 2 1
1 2 1
2 5 2
3 5 2
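Since the intermediate df1 is only needed for the mapping, this should collapse to a single line:
df['a'] = df['b'].map(df.groupby('b')['a'].mean().cumsum())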

Inserting multiple sets of data for a single index value in a dataframe

I have a dataframe with the following structure:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
The index and column C are both set to have the same value. This is because I have created a dataframe that uses dates as its index to cover every day of the year, and I have a large collection of data whose dates are deposited in column C. In practice I can deposit as much data as possible, and this can cover the majority of the year, but there will be some days with no data, and my dataframe is structured this way to account for that.
What I wish to do is enable support for multiple readings on one day. Currently my program selects which row to put data into by matching the raw data's date with the date in the index column, so if I had the following:
A B C
2 3 2
The row would be selected by the value in column C and inserted into the data frame like so:
A B C
0 1 1 0
1 2 2 1
2 2 3 2
3 4 4 3
How would I handle the case where I have two sets of readings on one day, whilst keeping the indexing the same and inserting the data based on the column C value?
Like so:
A B C
4 3 1
2 4 1
And I want to be able to have the following:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 2 3 2
3 4 4 3
I wish to keep the indexing the same so that the dataframe still covers every day of the year, while on days with multiple readings the extra rows are inserted under the same index value.
This should do it for you:
Setup:
import pandas as pd
import io
a = io.StringIO(u'''
A B C
1 1 0
2 2 1
3 3 2
4 4 3
''')
df = pd.read_csv(a, sep=r'\s+')
b = io.StringIO(u'''
A B C
4 3 1
2 4 1
''')
dfX = pd.read_csv(b, sep=r'\s+')
Processing:
df = df.loc[~df['C'].isin(dfX['C'])]           # drop rows whose day is being replaced
df = pd.concat([df, dfX]).sort_values(by='C')  # add the new readings (DataFrame.append was removed in pandas 2.0)
df.index = df['C'].values                      # re-key the index on column C, allowing duplicates
Output:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 3 3 2
3 4 4 3
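One caveat: sort_values is not guaranteed to be stable by default, so if the relative order of several readings on the same day matters, request a stable sort explicitly (same df and dfX as above):
df = pd.concat([df, dfX]).sort_values(by='C', kind='stable')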
