I have a dataframe:
df =

col1  Num
   1    4
   1    4
   2    5
   2    1
   2    1
   3    2
I want to add all the numbers and show the total.
So I will get:
col1  Sum
   1    8
   2    7
   3    2
Try this:
df.groupby('col1').sum()
If you wanted the new column to have the name 'sum' as in your example you could do the following:
df1 = df.groupby('col1').sum()
df1.columns = ['Sum']
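A minimal end-to-end sketch of the same idea, assuming the sample data above with the value column named 'Num':

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 2, 3],
                   'Num':  [4, 4, 5, 1, 1, 2]})

# group by col1, sum the Num column, and rename the result to 'Sum'
df1 = (df.groupby('col1')['Num']
         .sum()
         .rename('Sum')
         .reset_index())
print(df1)
#    col1  Sum
# 0     1    8
# 1     2    7
# 2     3    2
```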
I have the following DataFrame in my Python project:
df1 = pd.DataFrame({"Col A":[1,2,3],"Col B":[3,2,2]})
I wish to order it in this kind of way:
df2 = pd.DataFrame({"Col A":[1,3,2],"Col B":[3,2,2]})
My goal is that each value in Col A matches the previous row's value in Col B.
Do you have any idea how to make this work properly, with as little effort as possible?
I tried to work with .sort_values(by=...), but that's also where my current knowledge stops.
If you need to roll Col A by one position within each Col B group, use a lambda function:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Col A": [1,2,3,7,4,8], "Col B": [3,2,2,1,1,1]})
print(df1)

   Col A  Col B
0      1      3
1      2      2
2      3      2
3      7      1
4      4      1
5      8      1
df1['Col A'] = df1.groupby('Col B')['Col A'].transform(lambda x: np.roll(x, -1))
print(df1)

   Col A  Col B
0      1      3
1      3      2
2      2      2
3      4      1
4      8      1
5      7      1
Yes, you can achieve the desired output by using sort_values() together with a mapping dictionary, like so:
import pandas as pd
df1 = pd.DataFrame({"Col A":[1,2,3],"Col B":[3,2,2]})
# mapping_dict assigns each Col A value its desired position in the output
mapping_dict = {1: 0, 3: 1, 2: 2}
df1["sort_order"] = df1["Col A"].map(mapping_dict)
df2 = df1.sort_values(by="sort_order").drop(columns=["sort_order"])
print(df2)
Output:

   Col A  Col B
0      1      3
2      3      2
1      2      2
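Since pandas 1.1, the same idea works without a helper column by passing key= to sort_values; a sketch, using the same assumed position mapping:

```python
import pandas as pd

df1 = pd.DataFrame({"Col A": [1, 2, 3], "Col B": [3, 2, 2]})

# desired output position of each Col A value (assumed ordering)
order = {1: 0, 3: 1, 2: 2}

# sort by Col A, but compare the mapped positions instead of the raw values
df2 = df1.sort_values(by="Col A", key=lambda s: s.map(order))
print(df2)
#    Col A  Col B
# 0      1      3
# 2      3      2
# 1      2      2
```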
Let's assume the input dataset:
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])
that corresponds to:
   A  B   C
0  0  7  50
1  0  3  51
2  0  3  45
3  1  5  50
4  1  0  50
5  2  6  50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A  most_freq  freq
0          3     2
1          5     1
2          6     1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
   A  most_freq  freq
0  0          3     2
2  1          0     1
4  2          6     1
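Run end to end on the sample data, the chained version looks like this (note that for A == 1 the counts of 5 and 0 are tied, so either is a valid mode):

```python
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])

out = (df_test.groupby('A')['B']
              .value_counts()                 # counts per (A, B), sorted descending
              .rename_axis(['A', 'most_freq'])
              .reset_index(name='freq')
              .drop_duplicates('A')           # keep the top row of each group
              .reset_index(drop=True))        # optional: flat 0..n index
print(out)
```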
I have a data frame including number of sold tickets in different price buckets for each flight.
For each record/row, I want to use the value in one column as an index in iloc function, to sum up values in a specific number of columns.
That is, for each row, I want to sum up values from column index 5 through the column index given in ['iloc_index'].
I tried df.iloc[:, 5:df['iloc_index']].sum(axis=1) but it did not work.
sample data:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1
1  1  3  4  2           2
2  4  6  3  2           1
for each row, I want to sum up the number of columns based on the value in ['iloc_value']
for example,
for row0, I want the total to be 1+2
for row1, I want the total to be 1+3+4
for row2, I want the total to be 4+6
EDIT:
I quickly got the results this way:
First define a function that can do it for one row:
def sum_till_iloc_value(row):
    # positional slice: sum the first iloc_value + 1 entries of the row
    return sum(row[:row['iloc_value'] + 1])
Then apply it to all rows to generate your output:
df_flights['sum'] = df_flights.apply(sum_till_iloc_value, axis=1)
   A  B  C  D  iloc_value  sum
0  1  2  3  2           1    3
1  1  3  4  2           2    8
2  4  6  3  2           1   10
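Putting the question's sample data and the helper together (a sketch; the slice is positional, so it assumes the data columns come first and iloc_value holds 0-based positions):

```python
import pandas as pd

df_flights = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6], 'C': [3, 4, 3],
                           'D': [2, 2, 2], 'iloc_value': [1, 2, 1]})

def sum_till_iloc_value(row):
    # sum the first iloc_value + 1 entries of the row
    return sum(row[:row['iloc_value'] + 1])

df_flights['sum'] = df_flights.apply(sum_till_iloc_value, axis=1)
print(df_flights['sum'].tolist())  # [3, 8, 10]
```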
PREVIOUSLY:
Assuming you have information that looks like:
df_flights = pd.DataFrame({'flight':['f1', 'f2', 'f3'], 'business':[2,3,4], 'economy':[6,7,8]})
df_flights
  flight  business  economy
0     f1         2        6
1     f2         3        7
2     f3         4        8
you can sum the columns you want as below:
df_flights['seat_count'] = df_flights['business'] + df_flights['economy']
This will create a new column that you can later select:
df_flights[['flight', 'seat_count']]
  flight  seat_count
0     f1           8
1     f2          10
2     f3          12
Here's a fully vectorized way to do it: melt the dataframe, keep only the cells up to each row's iloc_value, and bring the per-row total back into the dataframe (this assumes df has the columns A, B, C, D, iloc_value, with no total column yet):
# map each data column name to its positional index
d = dict([[y, x] for x, y in enumerate(df.columns[:-1])])
temp_df = df.copy()
temp_df = temp_df.rename(columns=d)
temp_df = temp_df.reset_index().melt(id_vars=["index", "iloc_value"])
# keep only the cells whose column position is <= the row's iloc_value
temp_df = temp_df[temp_df.variable <= temp_df.iloc_value]
df["total"] = temp_df.groupby("index").value.sum()
The output is:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1      3
1  1  3  4  2           2      8
2  4  6  3  2           1     10
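Another vectorized option (a sketch, assuming the data columns A..D come first and iloc_value holds 0-based positions) is a broadcasted NumPy mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6], 'C': [3, 4, 3],
                   'D': [2, 2, 2], 'iloc_value': [1, 2, 1]})

vals = df[['A', 'B', 'C', 'D']].to_numpy()
# True where the column position is <= the row's iloc_value
mask = np.arange(vals.shape[1]) <= df['iloc_value'].to_numpy()[:, None]
df['total'] = (vals * mask).sum(axis=1)
print(df['total'].tolist())  # [3, 8, 10]
```

This avoids both apply and the melt round-trip, so it should scale well to larger frames.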
I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object out of columns labels that I want as index named cols.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
also tried the following to see if I can get sum but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multiindex as long as it contains NaNs (link to github in comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael will work as intended; hence that answer is accepted.
I believe you need to convert the column names to a list and then aggregate size with unstack:
import pandas as pd

df = pd.DataFrame({'B': [4,4,4,5,5,4],
                   'C': [1,1,9,4,2,3],
                   'D': [1,1,5,7,1,0],
                   'E': [0,0,6,9,2,4],
                   'id': list('aaabbb')})
print(df)

   B  C  D  E id
0  4  1  1  0  a
1  4  1  1  0  a
2  4  9  5  6  a
3  5  4  7  9  b
4  5  2  1  2  b
5  4  3  0  4  b
cols = df.columns[:-1].tolist()  # every column except 'id'
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print(df1)

id       a  b
B C D E
4 1 1 0  2  0
  3 0 4  0  1
  9 5 6  1  0
5 2 1 2  0  1
  4 7 9  0  1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
print(df1)

id       a  b
B C D E
4 1 1 0  2  0
  3 0 4  0  1
  9 5 6  1  0
5 2 1 2  0  1
  4 7 9  0  1
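As a cross-check, pd.crosstab gives the same counts in one call; a sketch on the small example, where cols is every column except 'id':

```python
import pandas as pd

df = pd.DataFrame({'B': [4, 4, 4, 5, 5, 4],
                   'C': [1, 1, 9, 4, 2, 3],
                   'D': [1, 1, 5, 7, 1, 0],
                   'E': [0, 0, 6, 9, 2, 4],
                   'id': list('aaabbb')})

cols = df.columns[:-1].tolist()  # ['B', 'C', 'D', 'E']

# cross-tabulate the index columns against 'id'; cell values are counts
df1 = pd.crosstab([df[c] for c in cols], df['id'])
print(df1)
```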
Starting from a simple DataFrame like this in PySpark:
col1  col2  count
   A     1      4
   A     2      8
   A     3      2
   B     1      3
   C     1      6
I would like to duplicate the rows so that each value of col1 appears with each value of col2, with the count column filled with 0 for the combinations missing from the original. It would look like this:
col1  col2  count
   A     1      4
   A     2      8
   A     3      2
   B     1      3
   B     2      0
   B     3      0
   C     1      6
   C     2      0
   C     3      0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
from pyspark.sql import functions as F

# all distinct combinations of col1 + col2
all_combinations = df.select('col1').distinct().crossJoin(df.select('col2').distinct())

# left-join the original count column back in; missing pairs get null, filled with 0
result = (all_combinations.alias('a')
          .join(df.alias('b'),
                on=(F.col('a.col1') == F.col('b.col1')) & (F.col('a.col2') == F.col('b.col2')),
                how='left')
          .select('a.col1', 'a.col2', F.col('b.count'))
          .fillna(0, subset=['count']))
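For reference, the same densification can be sketched in pandas (not PySpark) with a pivot/stack round-trip; the column and value names match the example above:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'C'],
                   'col2': [1, 2, 3, 1, 1],
                   'count': [4, 8, 2, 3, 6]})

# pivot to a col1 x col2 grid (missing pairs become 0), then stack back to rows
full = (df.pivot(index='col1', columns='col2', values='count')
          .fillna(0).astype(int)
          .stack()
          .rename('count')
          .reset_index())
print(full)
```

The B/2, B/3, C/2 and C/3 rows come out with count 0, matching the desired output.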