I have the following dataframe:
Group from to
1 2 1
1 1 2
1 3 2
1 3 1
2 1 4
2 3 1
2 1 2
2 3 1
I want to create a 4th column that counts the occurrences of each unique (from, to) combination within each group, and drops any repeated combination within each group (leaving only one).
Expected output:
Group from to weight
1 2 1 1
1 1 2 1
1 3 2 1
1 3 1 1
2 1 4 1
2 3 1 2
2 1 2 1
In the expected output, the second (from=3, to=1) row in group 2 was dropped because it is a duplicate.
In your case, we just need groupby with size:
out = df.groupby(df.columns.tolist()).size().to_frame(name='weight').reset_index()
Out[258]:
Group from to weight
0 1 1 2 1
1 1 2 1 1
2 1 3 1 1
3 1 3 2 1
4 2 1 2 1
5 2 1 4 1
6 2 3 1 2
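For reference, a self-contained sketch of the above, with the frame rebuilt from the question's data:
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2, 2, 2],
                   'from':  [2, 1, 3, 3, 1, 3, 1, 3],
                   'to':    [1, 2, 2, 1, 4, 1, 2, 1]})

# group by all columns, count rows per unique combination
out = df.groupby(df.columns.tolist()).size().to_frame(name='weight').reset_index()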
You can group by the 3 columns using .groupby() and take the group sizes with GroupBy.size(), as follows:
df_out = df.groupby(['Group', 'from', 'to'], sort=False).size().reset_index(name='weight')
Result:
print(df_out)
Group from to weight
0 1 2 1 1
1 1 1 2 1
2 1 3 2 1
3 1 3 1 1
4 2 1 4 1
5 2 3 1 2
6 2 1 2 1
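If your pandas is 1.1 or newer, DataFrame.value_counts can do the same counting in one step; a sketch (note the row order may differ from the sort=False output above):
df_out = df.value_counts(subset=['Group', 'from', 'to'], sort=False).reset_index(name='weight')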
I have a df that looks like this:
time val
0 1
1 1
2 2
3 3
4 1
5 2
6 3
7 3
8 3
9 3
10 1
11 1
How do I create new columns that hold the number of times a condition occurs without changing? In this case, I want a column for each unique value in val that holds the cumulative count of occurrences at the given row, but does not increment while the value stays the same.
Expected outcome below:
time val sum_1 sum_2 sum_3
0 1 1 0 0
1 1 1 0 0
2 2 1 1 0
3 3 1 1 1
4 1 2 1 1
5 2 2 2 1
6 3 2 2 2
7 3 2 2 2
8 3 2 2 2
9 3 2 2 2
10 1 3 2 2
11 1 3 2 2
EDIT
To be more specific with the condition:
I want to count the number of times a unique value appears in val. For example, using the code below, I could get this result:
df['sum_1'] = (df['val'] == 1).cumsum()
df['sum_2'] = (df['val'] == 2).cumsum()
df['sum_3'] = (df['val'] == 3).cumsum()
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 2 0 0
2 2 2 2 1 0
3 3 3 2 1 1
4 4 1 3 1 1
5 5 2 3 2 1
However, this code counts EVERY occurrence of a condition. For example, in the rows shown, val has 1 occurring 3 times in total. I want to treat consecutive occurrences of 1 as a single group instead, counting only the number of consecutive groupings: in the example above, 1 occurs 3 times in total, but only 2 times as a consecutive grouping.
You can test for the first value of each consecutive run by comparing the column with its shifted values via Series.ne with Series.shift, then chain that mask by & (bitwise AND) with an equality test for each unique value of column val:
uniq = df['val'].unique()
m = df['val'].ne(df['val'].shift())   # True at the first row of each consecutive run
for c in uniq:
    df[f'sum_{c}'] = (df['val'].eq(c) & m).cumsum()
print(df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
For better performance (I hope), here is a NumPy alternative:
import numpy as np

a = df['val'].to_numpy()
uniq = np.unique(a)
m = np.concatenate(([False], a[:-1])) != a   # shifted comparison: True at the start of each run
arr = np.cumsum((a[:, None] == uniq) & m[:, None], axis=0)
df = df.join(pd.DataFrame(arr, index=df.index, columns=uniq).add_prefix('sum_'))
print(df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
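If you want to avoid the Python loop while staying in pandas, here is a loop-free sketch along the same lines (my own variant, assuming the same sample data):
import pandas as pd

df = pd.DataFrame({'time': range(12),
                   'val': [1, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 1]})

m = df['val'].ne(df['val'].shift())              # True at the first row of each run
counts = (pd.get_dummies(df['val']).astype(int)  # one 0/1 indicator column per value
            .mul(m.astype(int), axis=0)          # zero out everything but run starts
            .cumsum()                            # running count of runs per value
            .add_prefix('sum_'))
print(df.join(counts))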
I have a dataframe and want to group by 2 columns, which is working fine.
df.groupby(["Sektor", "CustomerID"]).count().head(10)
_Order_ID_ Order_timezone Order_weight
AE 1298772 1 1 1
1298788 1 1 1
1298840 2 2 2
1298912 1 1 1
AT 1038570 1 1 1
1040424 1 1 1
1040425 3 3 3
1040426 2 2 2
1040427 1 1 1
1040428 1 1 1
1040429 2 2 2
Now the grouped dataframe is sorted by the CustomerID values, but I want to sort it by the count(), so that I have the Sektor first, then the CustomerIDs, with the CustomerIDs that occur most often at the top. So, descending.
Expected Output should be:
_Order_ID_ Order_timezone Order_weight
AE 1298840 2 2 2
1298772 1 1 1
1298788 1 1 1
1298912 1 1 1
AT 1040425 3 3 3
1040426 2 2 2
1040429 2 2 2
1038570 1 1 1
1040424 1 1 1
1040427 1 1 1
1040428 1 1 1
How do I do that?
Use:
df1 = df.groupby(["Sektor", "CustomerID"]).count()
If you need 10 rows in the output:
df1 = df1.sort_values(['Sektor', '_Order_ID_'], ascending=[True, False]).head(10)
print(df1)
_Order_ID_ Order_timezone Order_weight
Sektor CustomerID
AE 1298840 2 2 2
1298772 1 1 1
1298788 1 1 1
1298912 1 1 1
AT 1040425 3 3 3
1040426 2 2 2
1040429 2 2 2
1038570 1 1 1
1040424 1 1 1
1040427 1 1 1
If you need 10 rows (if they exist) per Sektor group:
df1 = df1.sort_values(['Sektor', '_Order_ID_'], ascending=[True, False]).groupby('Sektor').head(10)
print(df1)
_Order_ID_ Order_timezone Order_weight
Sektor CustomerID
AE 1298840 2 2 2
1298772 1 1 1
1298788 1 1 1
1298912 1 1 1
AT 1040425 3 3 3
1040426 2 2 2
1040429 2 2 2
1038570 1 1 1
1040424 1 1 1
1040427 1 1 1
1040428 1 1 1
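As an aside, if a single count column is enough (every counted column carries the same number here), a groupby/size variant may read more simply; a sketch, where n_orders is a made-up column name:
counts = (df.groupby(['Sektor', 'CustomerID'])
            .size()
            .reset_index(name='n_orders')
            .sort_values(['Sektor', 'n_orders'], ascending=[True, False]))
print(counts.head(10))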
I have a DataFrame with values 1 and 2.
I want to add a row at the end of the DataFrame that counts the number of 1s in each column. It should be similar to
COUNTIF(A:A,1) dragged across all columns in Excel.
I tried something like df.loc['lastrow'] = df.count()[1], but the result is not correct.
How can I do it, and what does this function (count()[1]) do?
You can do the append after a sum over an equality comparison:
df.append(df.eq(1).sum(), ignore_index=True)
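Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same result can be obtained with pd.concat, for example:
import pandas as pd

row = df.eq(1).sum().to_frame().T          # one-row frame holding the counts of 1s
out = pd.concat([df, row], ignore_index=True)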
You can just compare your dataframe to the value you are interested in (1 for example), and then perform a sum on these booleans, like:
>>> df
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
>>> (df == 1).sum()
0 3
1 3
2 5
3 5
4 6
dtype: int64
You can thus append that row, like:
>>> df.append((df == 1).sum(), ignore_index=True)
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
10 3 3 5 5 6
The last row here thus contains, for each column, the number of 1s in the rows above.
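If you prefer to assign in place, as the df.loc['lastrow'] attempt in the question suggests, this also works (the row label 'count_1' is arbitrary):
df.loc['count_1'] = (df == 1).sum()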
I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
assume, this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for each unique combination. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check for every row, using .query, whether the row's combination of values occurs more than once in the dataframe:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether there already is an ID assigned for a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!
Step 1: Assign a new column filled with 0:
df['new'] = 0
Step 2: Create a mask for combinations repeated more than once, i.e.
mask = df.groupby(['A', 'B', 'C'])['new'].transform(lambda x: len(x) > 1)
Step 3: Assign the values by factorizing, based on the mask, i.e.
df.loc[mask, 'new'] = df.loc[mask, ['A', 'B', 'C']].astype(str).sum(1).factorize()[0] + 1
# or (sort=False keeps group ids in order of first appearance, matching factorize):
# df.loc[mask, 'new'] = df.loc[mask, ['A', 'B', 'C']].groupby(['A', 'B', 'C'], sort=False).ngroup() + 1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
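As an aside, the mask in Step 2 can also be built without a Python-level lambda, which is usually faster on larger frames:
mask = df.groupby(['A', 'B', 'C'])['new'].transform('size') > 1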
A feature added in pandas version 0.20.2, GroupBy.ngroup, creates a column of unique ids automatically for you.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
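Note that ngroup numbers every combination, while the question wants 0 for combinations that occur only once; a sketch combining it with a duplicated mask to reproduce the expected output (my own combination, not from either answer):
dup = df.duplicated(['A', 'B', 'C'], keep=False)   # True where the combination repeats
ids = df.loc[dup].groupby(['A', 'B', 'C'], sort=False).ngroup() + 1
df['unique_combination'] = ids.reindex(df.index, fill_value=0)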
I have a pandas data frame and group it by two columns (for example, col1 and col2). For fixed values of col1 and col2 (i.e. for a group) I can have several different values in col3. I would like to count the number of distinct values in the third column.
For example, If I have this as my input:
1 1 1
1 1 1
1 1 2
1 2 3
1 2 3
1 2 3
2 1 1
2 1 2
2 1 3
2 2 3
2 2 3
2 2 3
I would like to have this table (data frame) as the output:
1 1 2
1 2 1
2 1 3
2 2 1
df.groupby(['col1','col2'])['col3'].nunique().reset_index()
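For example, with the sample input from the question (assuming the columns are named col1, col2 and col3):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'col2': [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
                   'col3': [1, 1, 2, 3, 3, 3, 1, 2, 3, 3, 3, 3]})

print(df.groupby(['col1', 'col2'])['col3'].nunique().reset_index())
#    col1  col2  col3
# 0     1     1     2
# 1     1     2     1
# 2     2     1     3
# 3     2     2     1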
In [17]: df
Out[17]:
0 1 2
0 1 1 1
1 1 1 1
2 1 1 2
3 1 2 3
4 1 2 3
5 1 2 3
6 2 1 1
7 2 1 2
8 2 1 3
9 2 2 3
10 2 2 3
11 2 2 3
In [19]: df.groupby([0,1])[2].apply(lambda x: len(x.unique()))
Out[19]:
0  1
1  1    2
   2    1
2  1    3
   2    1
dtype: int64
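Note that df.groupby([0, 1])[2].nunique() is the idiomatic equivalent of the apply(lambda x: len(x.unique())) call above, and is generally faster.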