Split DataFrame every x unique values into new DataFrames - python

I need to slice a long-format DataFrame by every x unique values for the purpose of visualizing. My actual dataset has ~90 variables for 20 individuals, so I would like to split it into 9 separate df's, each containing the entries for all 20 individuals for a subset of the variables.
I have created this simple example to help explain:
import pandas as pd

df = pd.DataFrame({'ID': [1,1,1,2,2,2,3,3,3,4,4,4],
                   'Period': [1,2,3,1,2,3,1,2,3,1,2,3],
                   'Food': ['Ham','Ham','Ham','Cheese','Cheese','Cheese','Egg','Egg','Egg','Bacon','Bacon','Bacon']})
df
''' ******* PSEUDOCODE *******
df1 = unique entries [:2]
df2 = unique entries [2:4] '''
# desired outcome:
df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
                    'Period': [1,2,3,1,2,3],
                    'Food': ['Ham','Ham','Ham','Cheese','Cheese','Cheese']})
df2 = pd.DataFrame({'ID': [3,3,3,4,4,4],
                    'Period': [1,2,3,1,2,3],
                    'Food': ['Egg','Egg','Egg','Bacon','Bacon','Bacon']})
print(df1)
print(df2)
In this case, the DataFrame would be split at the end of every 2 sets of unique entries in the df['Food'] column to create df1 and df2. Best case scenario would be a loop that creates a new DataFrame for every x unique entries. Given the lack of info I can find, I'm unfortunately struggling to write even good pseudocode for that.

Let us try with factorize and groupby
n = 2
d = {x : y for x , y in df.groupby(df.Food.factorize()[0]//n)}
d[0]
Out[132]:
ID Period Food
0 1 1 Ham
1 1 2 Ham
2 1 3 Ham
3 2 1 Cheese
4 2 2 Cheese
5 2 3 Cheese
d[1]
Out[133]:
ID Period Food
6 3 1 Egg
7 3 2 Egg
8 3 3 Egg
9 4 1 Bacon
10 4 2 Bacon
11 4 3 Bacon
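If the goal is one visualization per chunk, the dictionary above can be iterated directly; a minimal sketch (the plotting call is only a placeholder, not part of the original answer):
for key, chunk in d.items():
    print(f'chunk {key}:')
    print(chunk)
    # chunk.plot(...)  # e.g. one figure per chunk of n unique Food values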

A possible solution is the following:
# pip install pandas
import pandas as pd
df = pd.DataFrame({'ID': [1,1,1,2,2,2,3,3,3,4,4,4],
                   'Period': [1,2,3,1,2,3,1,2,3,1,2,3],
                   'Food': ['Ham','Ham','Ham','Cheese','Cheese','Cheese','Egg','Egg','Egg','Bacon','Bacon','Bacon']})
dfs = [y for x, y in df.groupby('Food', as_index=False)]
The separated dfs can be accessed by list index (see below) or in a loop:
dfs[0]
dfs[1]
and so on.
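Note that this creates one DataFrame per unique Food value (four frames here) rather than one per every x values. A minimal sketch of the loop form mentioned above:
for piece in dfs:
    print(piece, '\n')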

We could use groupby + ngroup + floordiv to create groups; then use another groupby to separate:
out = [x for _, x in df.groupby(df.groupby('Food', sort=False).ngroup().floordiv(2))]
Output:
[ ID Period Food
0 1 1 Ham
1 1 2 Ham
2 1 3 Ham
3 2 1 Cheese
4 2 2 Cheese
5 2 3 Cheese,
ID Period Food
6 3 1 Egg
7 3 2 Egg
8 3 3 Egg
9 4 1 Bacon
10 4 2 Bacon
11 4 3 Bacon]

From what I understand, this may help:
for x in df['ID'].unique():
    print(df[df['ID'] == x], '\n')

for x in df['Food'].unique():
    print(df[df['Food'] == x], '\n')

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
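For the row_number() part of the question, groupby().cumcount() gives a 0-based counter within each group; a minimal sketch on the example df:
rn = df.groupby('id').cumcount()   # per-group row number, like row_number() - 1
df[rn <= 1][['id', 'value']]       # top 2 records per id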
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
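Putting those two steps together, a minimal sketch of the suggested cleanup:
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)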
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group by first and take the top k within each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest and ascending=True like nsmallest.
The value passed to head is the same as the value given to nlargest: the number of rows to keep per group.
The reset_index is optional.
This works for duplicated values.
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k, and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the first 2 rows of each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Filter pandas column of lists based on a list

Having a large DataFrame as follows:
userid user_mentions
1 [2, 3, 4]
1 [3]
2 NaN
2 [1,3]
3 [1,4,5]
3 [4]
The user_mentions column is a list of userids that have been mentioned by each user. For example, the first line means:
user 1 has mentioned users 2, 3, and 4.
I need to create a mention network among the users in the userid column. That is, I want the number of times each user in the userid column has been mentioned by other users in the userid column. So basically, first I need something like this:
filtered = df[df['user_mentions'].isin(df['userid'].unique())]
But this doesn't work on a column of lists.
If I resolve the above issue, then I can groupby(['userid', 'user_mentions']).
EDIT
The final output should be:
Source Target Number
1 2 1
1 3 2
2 1 1
2 3 1
3 1 1
3 5 1
This isn't a task well suited to Pandas / NumPy. So I suggest you use collections.defaultdict to create a dictionary of counts, then construct a dataframe from the dictionary:
from collections import defaultdict

dd = defaultdict(lambda: defaultdict(int))

for row in df.itertuples(index=False):
    vals = row.user_mentions
    if vals == vals:          # NaN != NaN, so this skips the NaN rows
        for val in vals:
            dd[row.userid][val] += 1

df = pd.DataFrame([(k, w, dd[k][w]) for k, v in dd.items() for w in v],
                  columns=['source', 'target', 'number'])
print(df)
source target number
0 1 2 1
1 1 3 2
2 1 4 1
3 2 1 1
4 2 3 1
5 3 1 1
6 3 4 2
7 3 5 1
Of course, you shouldn't put lists in Pandas series in the first place. It's a nested layer of pointers, which should be avoided if at all possible.
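If restructuring the data is an option, here is a hedged sketch using DataFrame.explode (available since pandas 0.25) to flatten the lists first and then count the pairs with groupby; it assumes the same df as in the question:
# one row per (userid, mentioned user); the NaN rows are dropped
pairs = df.explode('user_mentions').dropna(subset=['user_mentions'])

# count how often each (source, target) pair occurs
counts = (pairs.groupby(['userid', 'user_mentions'])
               .size()
               .rename('number')
               .reset_index()
               .rename(columns={'userid': 'source', 'user_mentions': 'target'}))
print(counts)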
Following your edit, I would have to agree with @jpp.
To your (unedited) original question, in terms of gathering the number of mentions of each user, you can do:
df['counts'] = df['userid'].apply(lambda x: df['user_mentions'].dropna().sum().count(x))
df[['userid','counts']].groupby('userid').first()
Yields:
counts
userid
1 2
2 1
3 3
Here's one way.
# Remove the `NaN` rows
df = df.dropna()

# Construct a new DataFrame
df2 = pd.DataFrame(df.user_mentions.tolist(),
                   index=df.userid.rename('source')
                   ).stack().astype(int).to_frame('target')

# Groupby + size
df2.groupby(['source', 'target']).size().rename('counts').reset_index()
source target counts
0 1 2 1
1 1 3 2
2 1 4 1
3 2 1 1
4 2 3 1
5 3 1 1
6 3 4 2
7 3 5 1

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a Dataframe that has multi index.
It looks something like this:
                          cost
location         season
Thorp park       autumn   £12
                 spring   £13
                 summer   £22
Sea life centre  summer   £34
                 spring   £43
Alton towers     ...      and so on
location and season are index columns. I want to go through the data and remove any locations that don't have "season" values of all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
col
a 1 0
2 1
b 1 3
2 4
3 5
c 2 7
3 8
v = df.groupby(level=0).transform('count').values
df = df[v == 3]
df
col
b 1 3
2 4
3 5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.count() == 3)
col
b 1 3
2 4
3 5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Option 2
Robustify for the general case with an undetermined number of seasons.
This uses the groupby.pipe method from pandas 0.21:
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
col
b 1 3
2 4
3 5

Python pandas: Append rows of DataFrame and delete the appended rows

import pandas as pd

df = pd.DataFrame({
    'id': [1,2,3,4,5,6,7,8,9,10,11],
    'text': ['abc','zxc','qwe','asf','efe','ert','poi','wer','eer','poy','wqr']})
I have a DataFrame with columns:
id text
1 abc
2 zxc
3 qwe
4 asf
5 efe
6 ert
7 poi
8 wer
9 eer
10 poy
11 wqr
I have a list L = [1,3,6,10] which contains id values.
Using this list, I want to join the text column in consecutive chunks: taking 1 and 3 (the first two values in the list), append the text of id 2 to id 1 and delete the row with id 2; then, taking 3 and 6, append the text of ids 4 and 5 to id 3 and delete those rows; and so on for each consecutive pair of values in the list.
My final output would look like this:
id text
1 abczxc # joining id 1 and 2
3 qweasfefe # joining id 3,4 and 5
6 ertpoiwereer # joining id 6,7,8,9
10 poywqr # joining id 10 and 11
You can use isin with where and ffill to create a grouping Series, then groupby with apply to join the text:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
df1 = df.groupby(s)['text'].apply(''.join).reset_index()
print (df1)
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
It works because:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
print (s)
0 1
1 1
2 3
3 3
4 3
5 6
6 6
7 6
8 6
9 10
10 10
Name: id, dtype: int32
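Since cumsum also comes up here, an equivalent grouping Series can be built with isin + cumsum; the group labels then run 1..k instead of carrying the leading id values (a minimal sketch):
s2 = df.id.isin(L).cumsum()
df.groupby(s2)['text'].apply(''.join)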
I changed the values not in the list to np.nan and then used ffill and groupby. Though @jezrael's approach is much better. I need to remember to use cumsum :)
import numpy as np

l = [1, 3, 6, 10]
df.id[~df.id.isin(l)] = np.nan
df = df.ffill().groupby('id').sum()
text
id
1.0 abczxc
3.0 qweasfefe
6.0 ertpoiwereer
10.0 poywqr
Use pd.cut to create your bins, then groupby with a lambda function to join the text in each group.
df.groupby(pd.cut(df.id,L+[np.inf],right=False, labels=[i for i in L])).apply(lambda x: ''.join(x.text))
EDIT:
(df.groupby(pd.cut(df.id, L + [np.inf],
                   right=False,
                   labels=[i for i in L]))
   .apply(lambda x: ''.join(x.text)).reset_index().rename(columns={0: 'text'}))
Output:
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr

Pandas summing up different data types using counts and conditions

I am working with a large data set so I am going to create similar conditions below:
Lets say we are using this data set:
import pandas as pd

df = pd.DataFrame({'Location': ['NY', 'SF', 'NY', 'NY', 'SF', 'SF', 'TX', 'TX', 'TX', 'DC'],
                   'Class': ['H','L','H','L','L','H','H','L','L','M'],
                   'Address': ['12 Silver','10 Fak','12 Silver','1 North','10 Fak','2 Fake','1 Red','1 Dog','2 Fake','1 White'],
                   'Score': ['4','5','3','2','1','5','4','3','2','1']})
So I want the rows to be the unique values in df.Location.
The first column will be the number of data entries for each location. I can get this separately by:
df[df['Location'] =='SF'].count()['Location']
df[df['Location'] =='NY'].count()['Location']
df[df['Location'] =='TX'].count()['Location']
df[df['Location'] =='DC'].count()['Location']
For the second, third, and fourth columns I want to count the different Class types (H, L, M). I know I can do this by:
#Second Col for NY
print (df[(df.Location =='NY') & (df.Class=='H')].count()['Class'])
#Third Col for NY
print (df[(df.Location =='NY') & (df.Class=='L')].count()['Class'])
#Fourth Col for NY
print (df[(df.Location =='NY') & (df.Class=='M')].count()['Class'])
I am guessing this would work with a pivot table but since I was using a dataframe everything got mixed up.
For the fifth column I want to count the number of unique values for each Address. For example, in NY the value should be 2 since there are two unique values and a duplicate of '12 Silver':
print (df[(df.Location =='NY')].Address)
>>>
0 12 Silver
2 12 Silver
3 1 North
Name: Address, dtype: object
I guess this can be done by groupby, but I always get confused when using it. I could also use .drop_duplicates and then count to get a numerical value.
The sixth column should be the count of Score values less than the integer 4. So the value for NY would be:
print (df[(df.Location =='NY') & (df.Score.astype(float) < 4)].count()['Score'])
So what is a good way to make a dataframe like this, where the rows are the unique locations and the columns are as described above?
It should look something like:
Pop H L M HH L4
DC 1 0 0 1 1 1
NY 3 2 1 0 2 2
SF 3 1 2 0 2 1
TX 3 1 2 0 3 2
Since I know more or less how to get each individual component I can use a for loop through an array but there should be easier ways of doing this.
While with enough stacking tricks you might be able to do this all in one go, I don't think it'd be worth it. You have a pivot operation and a bunch of groupby operations. So do them separately -- which is easy -- and then combine the results.
Step #1 is to make Score a float column; it's better to get the types right before you start processing.
>>> df["Score"] = df["Score"].astype(float)
Then we'll make a new frame with the groupby-like columns. We could do this by passing .agg a dictionary but we'd have to rename the columns afterwards anyway, so there's not much point.
>>> gg = df.groupby("Location")
>>> summ = pd.DataFrame({"Pop": gg.Location.count(),
... "HH": gg.Address.nunique(),
... "L4": gg.Score.apply(lambda x: (x < 4).sum())})
>>> summ
HH L4 Pop
Location
DC 1 1 1
NY 2 2 3
SF 2 1 3
TX 3 2 3
[4 rows x 3 columns]
Then we can pivot:
>>> class_info = df.pivot_table(index="Location", columns="Class", aggfunc='size', fill_value=0)
>>> class_info
Class H L M
Location
DC 0 0 1
NY 2 1 0
SF 1 2 0
TX 1 2 0
[4 rows x 3 columns]
and combine:
>>> new_df = pd.concat([summ, class_info], axis=1)
>>> new_df
HH L4 Pop H L M
Location
DC 1 1 1 0 0 1
NY 2 2 3 2 1 0
SF 2 1 3 1 2 0
TX 3 2 3 1 2 0
[4 rows x 6 columns]
You can reorder this as you like.
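For example, to match the column order shown in the question, a one-line reindex (column names as built above):
new_df = new_df[['Pop', 'H', 'L', 'M', 'HH', 'L4']]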
