I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Bravo
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
How can I group these numbers based on names?
for Number, Names in zip(dsa['Number'], dsa['Names']):
    print(Number, Names)
The above code gives me the following output:
1 Josh
2 Jon
3 Adam
4 Barsa
5 Fekse
6 Bravo
7 Barsa
8 Talyo
9 Jon
10 Zidane
How can I get an output like the one below?
1 Josh
2,9 Jon
3 Adam
4,7 Barsa
5 Fekse
6 Bravo
8 Talyo
10 Zidane
I want to group the numbers based on names
Something like this?
df.groupby("Names")["Number"].unique()
This will return you a series and then you can transform as you wish.
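For example, a minimal sketch of one such transform (assuming the dataframe is called dsa as in the question) that joins each group's numbers into a comma-separated string:
out = (dsa.groupby("Names")["Number"]
          .apply(lambda s: ",".join(map(str, s)))   # e.g. Jon -> "2,9"
          .reset_index())
print(out)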
Use pandas' groupby with agg, which aggregates columns. Assuming your dataframe is called df:
grouped_df = df.groupby(['Names']).agg({'Number' : ['unique']})
This is grouping by Names and within those groups reporting the unique values of Number.
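Note that passing a list of aggregations gives grouped_df MultiIndex columns (here ('Number', 'unique')); a small follow-up sketch, in case you want flat column names back:
grouped_df.columns = ['_'.join(col) for col in grouped_df.columns]  # e.g. 'Number_unique'
grouped_df = grouped_df.reset_index()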
Let's say the DF is:
A = pd.DataFrame({'n':[1,2,3,4,5], 'name':['a','b','a','c','c']})
n name
0 1 a
1 2 b
2 3 a
3 4 c
4 5 c
You can use groupby to group by name, and then apply list to the n values of each name:
A.groupby('name')['n'].apply(list)
name
a [1, 3]
b [2]
c [4, 5]
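If you prefer the result as a regular dataframe rather than a series of lists, a small follow-up sketch using the same frame A; the print looks roughly like this:
result = A.groupby('name')['n'].apply(list).reset_index()
print(result)
#   name       n
# 0    a  [1, 3]
# 1    b     [2]
# 2    c  [4, 5]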
Suppose I have a dataframe dataset as the following:
dataset = pd.DataFrame({'id':list('123456'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3]})
print(dataset)
id B C
0 1 4 7
1 2 5 8
2 3 4 9
3 4 5 4
4 5 5 2
5 6 4 3
Now I slice it using iloc and get:
dataset = dataset.iloc[2:5]
id B C
2 3 4 9
3 4 5 4
4 5 5 2
Now I set the id as the new index due to my needs in my project, so I do
dataset.set_index("id", inplace=True)
print(dataset)
B C
id
3 4 9
4 5 4
5 5 2
I would like to select from the new dataset using iloc with the original index. So if I do dataset.iloc[3] I would like to see the first row. However, if I do that it throws an out-of-bounds error; if I do dataset.iloc[0] it gives me the first row.
Is there any way I can preserve the original index? Thanks.
iloc slices by position, so you can subtract the offset introduced by the slice:
n = 2  # the slice started at position 2 (iloc[2:5])
dataset.iloc[3 - n - 1]  # -1 converts the 1-based id to a 0-based position, -n removes the slice offset
Out[648]:
B 4
C 9
Name: 3, dtype: int64
In this case it is recommended to use loc instead of iloc (the ids were created as strings, so the index is cast to int first):
dataset.index = dataset.index.astype('int')
dataset.loc[3]
>>>
B 4
C 9
Name: 3, dtype: int64
I have a pandas dataframe that, after sorting, looks like below (something like a few people working shifts at a shop):
A B C D
1 1 1 Anna
2 3 1 Anna
3 1 2 Anna
4 3 2 Tom
5 3 2 Tom
6 3 2 Tom
7 3 2 Tom
8 1 1 Anna
9 3 1 Anna
10 1 2 Tom
...
I want to loop over the dataframe, split it into subsets, and then call another function on each subset, e.g.:
first subset df would be
A B C D
1 1 1 Anna
2 3 1 Anna
3 1 2 Anna
second subset df would be
4 3 2 Tom
5 3 2 Tom
6 3 2 Tom
7 3 2 Tom
third subset df would be
8 1 1 Anna
9 3 1 Anna
Is there a good way to loop over the main dataframe and split it?
for x in some_magic_here:
    sub_df = some_magic_here_too()
    my_fun(sub_df)
Thanks!
You need to loop over a groupby object whose groups are consecutive runs of D, created by comparing D with its shifted values for inequality and taking the cumulative sum:
for i, sub_df in df.groupby(df.D.ne(df.D.shift()).cumsum()):
    print(sub_df)
    my_fun(sub_df)
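For clarity, the grouping key evaluated on the sample rows shown above looks roughly like this:
key = df.D.ne(df.D.shift()).cumsum()
print(key.tolist())
# roughly [1, 1, 1, 2, 2, 2, 2, 3, 3, 4] for the rows shown:
# the counter increases each time D differs from the previous row,
# so every consecutive run of the same name gets its own group number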
Say I have something like this in a pandas dataframe:
Entity        Type  Doc  Proj
Daniel        PER   1    1
Daniel        PER   4    2
Daniel        PER   5    3
Daniel        PER   9    6
Daniel        LOC   7    4
905-888-8988  ID    3    1
905-888-8988  ID    4    2
905-888-8988  ID    14   8
For each combo of Entity and Type that recurs, I'd like to add two new columns for the Doc and Proj corresponding to the match. I'd like to do this for every possible match between combos of Entity.
Edit 1: A more detailed explanation of how to get to the expected outcome:
Step 1 - Identify whether an Entity and Type combo has more than one occurrence in the dataframe.
Step 2 - For each combo that has more than one occurrence, represent all possible combinations of Doc and Proj for that combo.
Step 3 - All these possible combinations should be represented as pairs of Doc and Proj.
So the result would look like this in the pandas dataframe:
Entity        Type  Doc1  Proj1  Doc2  Proj2
Daniel        PER   1     1      4     2
Daniel        PER   1     1      5     3
Daniel        PER   1     1      9     6
Daniel        PER   4     2      5     3
Daniel        PER   4     2      9     6
Daniel        PER   5     3      9     6
905-888-8988  ID    3     1      4     2
905-888-8988  ID    3     1      14    8
905-888-8988  ID    4     2      14    8
Thanks all for the help
Here is one way:
reset_index to copy the index as a column
use merge to join df with itself on the Entity and Type columns
remove the duplicate pairs by keeping only the rows where the left index is smaller than the right index
df = df.reset_index()
res = pd.merge(df, df, on=['Entity', 'Type'],suffixes=['1', '2'])
res = res.loc[res.index1 < res.index2].drop(columns=['index1', 'index2']).reset_index(drop=True)
output:
>>
Entity Type Doc1 Proj1 Doc2 Proj2
0 Daniel PER 1 1 4 2
1 Daniel PER 1 1 5 3
2 Daniel PER 1 1 9 6
3 Daniel PER 4 2 5 3
4 Daniel PER 4 2 9 6
5 Daniel PER 5 3 9 6
6 905-888-8988 ID 3 1 4 2
7 905-888-8988 ID 3 1 14 8
8 905-888-8988 ID 4 2 14 8
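For completeness, a self-contained sketch that rebuilds the question's frame from the table above and runs the same self-merge:
import pandas as pd

df = pd.DataFrame({
    'Entity': ['Daniel'] * 5 + ['905-888-8988'] * 3,
    'Type':   ['PER', 'PER', 'PER', 'PER', 'LOC', 'ID', 'ID', 'ID'],
    'Doc':    [1, 4, 5, 9, 7, 3, 4, 14],
    'Proj':   [1, 2, 3, 6, 4, 1, 2, 8],
})

df = df.reset_index()                       # keep the row position as a column
res = pd.merge(df, df, on=['Entity', 'Type'], suffixes=['1', '2'])
res = (res.loc[res.index1 < res.index2]     # drop self-pairs and mirrored duplicates
          .drop(columns=['index1', 'index2'])
          .reset_index(drop=True))
print(res)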
I am trying to set up a dataframe that contains a column called Frequency.
This column should show, for every row, how often that row's value appears in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example
I already tried it with value_counts(), however I only get a value in the last row of each repeated number.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, preferably appended to the same dataframe
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
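transform('count') returns one value per row of each group, so the result already lines up with the original dataframe; a minimal sketch of what this produces on the example data:
import pandas as pd

df = pd.DataFrame({'Category': [1, 3, 3, 4, 7, 7, 7, 8]})
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
print(df)
#    Category  Frequency
# 0         1          1
# 1         3          2
# 2         3          2
# 3         4          1
# 4         7          3
# 5         7          3
# 6         7          3
# 7         8          1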
Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the Category values and whose values are the counts. You can then use map or pandas.Series.replace to build a Series with each Category value replaced by its count, and finally assign this Series to the Frequency column.
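For illustration, the intermediate map step on the example data looks roughly like this:
df['Category'].map(df['Category'].value_counts())
# 0    1
# 1    2
# 2    2
# 3    1
# 4    3
# 5    3
# 6    3
# 7    1
# Name: Category, dtype: int64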
You can also do it using groupby, as below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)
I suppose this is something rather simple, but I can't find out how to do it. I've been searching tutorials and Stack Overflow.
Suppose I have a dataframe df looking like this:
Group Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
I would like to select only the lines belonging to groups with at least 4 objects (so there are at least 4 rows sharing the same Group number) and for which SomeQuantity of the 4th object, when sorted within the group by ascending SomeQuantity, is at least 20 (for example).
In the given dataframe, for example, it would only return the 3rd group, since it has 5 (>= 4) members and its 4th SomeQuantity (after sorting) is 22 (>= 20), so it should produce the dataframe:
Group Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(being or not sorted by SomeQuantity, whatever).
Could somebody be kind enough to help me? :)
I would use .groupby() + .filter() methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
A slightly different approach using map, value_counts, groupby, filter:
import numpy as np  # needed for np.any below

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts to compute the total counts of the distinct elements present in the Group column.
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map, which functions like a dictionary (the index becomes the keys and the series elements become the values), to map these results back to the original DF:
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then, we check for the elements having a value of 4 or more, which is our threshold limit, and take only that subset of the entire DF.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 18
10 4 3 14
11 4 4 7
In order to use the groupby.filter operation on this, we must make sure we return a single boolean value for each grouped key when we sort the values and compare the fourth element to the threshold, which is 20.
np.any reduces that comparison to a single boolean per group, which is what filter expects.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we take the fourth element with .iloc[3] (indexing is 0-based) and keep all such favourable matches.
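Putting the breakdown back together, a self-contained version of the whole chain (it needs numpy imported as np; the sample data is rebuilt from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group':        [1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5],
    'Id_In_Group':  [1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 1],
    'SomeQuantity': [10, 20, 7, 16, 22, 5, 12, 28, 1, 18, 14, 7, 36],
})

result = (df[df.Group.map(df.Group.value_counts().ge(4))]
          .groupby('Group')
          .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
print(result)   # only group 3 survives both conditions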
This is how I have worked through your question, warts and all. I'm sure there are much nicer ways to do this.
Find groups with "4 objects in the group"
import collections
groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3}); groups
Out: [3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity',ascending=False)
(Updated as per comment below...)
df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28