New column with unique groupby results in data frame - python

I have a data frame with duplicate 'id' rows.
I want to aggregate the data, but first I need to count the unique sessions per id.
id session
123 X
123 X
123 Y
123 Z
234 T
234 T
This code computes the counts correctly, but it fails when I try to add the result as a new column 'ncount' to my data frame:
df['ncount'] = df.groupby('id')['session'].nunique().reset_index()
I tried using transform and it didn't work.
df['ncount'] = df.groupby('id')['session'].transform('nunique')
This is the result from the transform code (my data has duplicate ids):
id session ncount
123 X 1
123 X 1
123 Y 1
123 Z 1
234 T 1
234 T 1
This is the result I'm interested in:
id session ncount
123 X 3
123 X 3
123 Y 3
123 Z 3
234 T 1
234 T 1

Use the following steps:
1. Group the data and store the result in a separate variable.
2. Merge it back into the original data frame.
Code:
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
x = df.groupby(["id"])['session'].nunique().reset_index()
res = pd.merge(df,x,how="left",on="id")
print(res)
You can rename the columns if required.
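For reference, a minimal sketch of that rename step (assuming merge's default _x/_y suffixes on the overlapping 'session' column):
# rename the merged columns so the unique count is called 'ncount'
res = res.rename(columns={"session_x": "session", "session_y": "ncount"})
print(res)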

Using .count()
Steps:
1. Group the data by "id" and count the rows per id.
2. Subtract one from the count and merge the two DataFrames. (Note: this matches the unique-session count only because each id in the sample has exactly one duplicated session; for the general case, use nunique as above.)
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
uniq_df = df.groupby(["id"])["session"].count().reset_index()
uniq_df["session"] = uniq_df["session"] - 1
result = pd.merge(df,uniq_df,how="left",on="id")
print(result)

Related

split dataframe based on column value

I have a df that contains several IDs. I'm trying to run a regression on the data, and I need to be able to split the df by ID so I can apply the regression to each ID:
Sample DF (this is only a sample; the real data is larger)
I tried to save the IDs in a list like this:
id_list = []
for data in df['id'].unique():
    id_list.append(data)
The list output is [1,2,3]
Then I tried to use that list to split the DF:
def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df
When I call the function, I only get the result for the first ID in the list; the other two IDs [2, 3] don't return any DF, which means that at some point the loop breaks.
Here is the entire code:
import pandas as pd

df = pd.read_csv('budget.csv')
id_list = []
for unique_id in df['id'].unique():
    id_list.append(unique_id)

def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df

print(create_dataframe(df))
You can use df.loc[df['id'] == item] to extract sub-dataframes based on a particular value of a column in the dataframe.
Please refer to the full code below:
import pandas as pd

df_dict = {"id": [1,1,1,2,2,2,3,3,3],
           "value": [12,13,14,22,23,24,32,33,34]}
df = pd.DataFrame(df_dict)
print(df)

id_list = []
for data in df['id'].unique():
    id_list.append(data)
print(id_list)

for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")
This generates the following output, with a sub-dataframe for each distinct id:
id value
0 1 12
1 1 13
2 1 14
3 2 22
4 2 23
5 2 24
6 3 32
7 3 33
8 3 34
[1, 2, 3]
id value
0 1 12
1 1 13
2 1 14
****
id value
3 2 22
4 2 23
5 2 24
****
id value
6 3 32
7 3 33
8 3 34
****
Now, in your code snippet the issue is that create_dataframe() is called only once, and inside the function, after the loop fetches the sub-df for id = 1, the return statement immediately returns that df. Hence you only get the sub-df for id = 1.
You seem to be overwriting the df value in the for loop. I would recommend creating the result container outside of the for loop and appending to it on each iteration instead of overwriting df.
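As a rough sketch of that idea (I'm assuming the IDs live in the 'id' column from your budget.csv; swap in 'Campaign ID' if that is the real column name):
def create_dataframes(df):
    # collect one sub-frame per id instead of overwriting df
    frames = {}
    for unique_id in df['id'].unique():
        frames[unique_id] = df[df['id'] == unique_id]
    return frames

for unique_id, sub_df in create_dataframes(df).items():
    print(unique_id)
    print(sub_df)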
You can use numpy.split:
import numpy as np

df.sort_values('id', inplace=True)
np.split(df, df.index[df.id.diff().fillna(0).astype(bool)])
or pandas groupby:
grp = df.groupby('id')
[grp.get_group(g) for g in grp.groups]
Although I think you can run the regression directly with pandas groupby, since it can apply any function you want, taking each group as a distinct dataframe.
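For example, a rough sketch of a per-group regression with groupby/apply; the column names x and y here are placeholders for whatever variables your regression actually uses:
import numpy as np
import pandas as pd

def fit(group):
    # fit y ~ x within one id and return the coefficients as a Series
    slope, intercept = np.polyfit(group['x'], group['y'], 1)
    return pd.Series({'slope': slope, 'intercept': intercept})

params = df.groupby('id').apply(fit)
print(params)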

Create binary column in pandas dataframe based on priority

I have a pandas dataframe that looks something like this:
Item Status
123 B
123 BW
123 W
123 NF
456 W
456 BW
789 W
789 NF
000 NF
And I need to create a new column Value which will be either 1 or 0 depending on the values in the Item and Status columns. The assignment of the value 1 is prioritized by this order: B, BW, W, NF. So, using the sample dataframe above, the result should be:
Item Status Value
123 B 1
123 BW 0
123 W 0
123 NF 0
456 W 0
456 BW 1
789 W 1
789 NF 0
000 NF 1
Using Python 3.7.
Taking your original dataframe as the input df, the following code will produce your desired output:
import numpy as np

# dictionary assigning order of priority to status values
priority_map = {'B':1,'BW':2,'W':3,'NF':4}
#new temporary column that converts Status values to order of priority values
df['rank'] = df['Status'].map(priority_map)
#create dictionary with Item as key and lowest rank value per Item as value
lowest_val_dict = df.groupby('Item')['rank'].min().to_dict()
#new column that assigns the same Value to all rows per Item
df['Value'] = df['Item'].map(lowest_val_dict)
#replace Values where rank is different with 0's
df['Value'] = np.where(df['Value'] == df['rank'],1,0)
#delete rank column
del df['rank']
I would prefer an approach where the status is an ordered pd.Categorical, because a) that's what it is and b) it's much more readable: if you have that, you just compare if a value is equal to the max of its group:
df['Status'] = pd.Categorical(df['Status'], categories=['NF', 'W', 'BW', 'B'],
                              ordered=True)
df['Value'] = df.groupby('Item')['Status'].apply(lambda x: (x == x.max()).astype(int))
# Item Status Value
#0 123 B 1
#1 123 BW 0
#2 123 W 0
#3 123 NF 0
#4 456 W 0
#5 456 BW 1
#6 789 W 1
#7 789 NF 0
#8 0 NF 1
I might be able to help you conceptually, by explaining some steps that I would do:
Create the new column Value and fill it with zeros, e.g. df['Value'] = 0
Group the dataframe by Item with grouped = df.groupby('Item')
Iterate through all the groups found: for name, group in grouped:
Using a simple function with ifs, a custom priority queue, custom sorting criteria, or any other preferred method, determine which entry has the highest priority ("the value 1 is prioritized by this order: B, BW, W, NF") and assign 1 to its Value column, e.g. df.loc[entry, 'Value'] = 1
Let's say we are looking at group '123':
Item Status Value
-------------------------
123 B 0 (before 0, after 1)
123 BW 0
123 W 0
123 NF 0
Because the row [123, 'B', 0] had the highest priority based on your criteria, you change it to [123, 'B', 1]
When finished, rebuild the dataframe from the groupby object, and you're done. You have a lot of possibilities for doing that; you might check here: Converting a Pandas GroupBy object to DataFrame
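If it helps, here is a rough sketch of those steps in code; it is just one way to fill in the outline, using the priority order from the question:
priority = ['B', 'BW', 'W', 'NF']  # highest priority first

df['Value'] = 0
for name, group in df.groupby('Item'):
    # flag the first row whose Status is the highest-priority one present
    for status in priority:
        match = group.index[group['Status'] == status]
        if len(match) > 0:
            df.loc[match[0], 'Value'] = 1
            break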

Count mutual followers in a relation table using pandas

I have a pandas DataFrame like so:
from_user to_user
0 123 456
1 894 135
2 179 890
3 456 123
Where each row contains two IDs that reflect whether the from_user "follows" the to_user. How can I count the total number of mutual followers in the DataFrame using pandas?
In the example above, the answer should be 1 (users 123 & 456).
One way is to use MultiIndex set operations:
In [11]: i1 = df.set_index(["from_user", "to_user"]).index
In [12]: i2 = df.set_index(["to_user", "from_user"]).index
In [13]: (i1 & i2).levels[0]
Out[13]: Int64Index([123, 456], dtype='int64')
To get the count you have to divide the length of this index by 2:
In [14]: len(i1 & i2) // 2
Out[14]: 1
Another way is to concatenate the values as strings and sort the characters.
Then count how many times each value occurs:
# concat the values as string type
df['concat'] = df.from_user.astype(str) + df.to_user.astype(str)
# sort the string values of the concatenation
df['concat'] = df.concat.apply(lambda x: ''.join(sorted(x)))
# count the occurrences of each and subtract 1
count = (df.groupby('concat').size() -1).sum()
Out[64]: 1
Here is another slightly more hacky way to do this:
(df.loc[df.to_user.isin(df.from_user)]
   .assign(hacky=df.from_user * df.to_user)
   .drop_duplicates(subset='hacky', keep='first')
   .drop('hacky', axis=1))
from_user to_user
0 123 456
The whole multiplication hack exists to ensure we don't return both 123 --> 456 and 456 --> 123, since both rows satisfy the condition we pass to loc.
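If you prefer to see the idea without the multiplication trick, a hedged alternative (not from the answers above) is a self-merge on the reversed pair; every mutual pair then shows up twice:
# sketch: join each (from_user, to_user) edge with its reversed counterpart
mutual = df.merge(df, left_on=['from_user', 'to_user'],
                  right_on=['to_user', 'from_user'])
print(len(mutual) // 2)  # each mutual pair is counted twice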

python dask dataframes - concatenate groupby.apply output to a single data frame

I am using dask dataframe.groupby().apply()
and get a dask series as a return value.
I am mapping each group to a list of triplets such as (a, b, 1) and then wish to turn all the triplets into a single dask data frame.
I am using this code at the end of the mapping function to return the triplets as a dask df:
# assume here that trip is a generator of triplets such as you would produce from itertools.product([l1,l2,l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df,npartitions=1)
Then, when I try to do something similar to pandas concat with dask's concatenate.
Assume the result of the apply function is the variable result.
I am trying to use
import dask.dataframe as dd
dd.concat(result, axis=0)
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print type(result)
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
In order to reproduce the use case, assume this fake data generation:
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g. all the people with the same first name & last name), and then I need to get a new dask data frame which contains how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know this particular use case can be solved in another way, but please trust me, my real use case is more complex. I cannot share the data and cannot generate any similar data, so let's use dummy data :-)
The challenge is to combine all the results of the apply into a single dask data frame (a pandas data frame would be a problem here; the data will not fit in memory, so going through a pandas data frame is not an option).
This works for me if the output of apply is a pandas DataFrame; at the end, convert to a dask DataFrame if necessary:
def f(x):
    trip = ((1,2,x) for x in range(3))
    df = pd.DataFrame.from_records(trip)
    return df
df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
# only to remove the MultiIndex
df1 = df1.reset_index()
print (df1)
cars level_1 x y z
0 1 0 1 2 0
1 1 1 1 2 1
2 1 2 1 2 2
3 2 0 1 2 0
4 2 1 1 2 1
5 2 2 1 2 2
6 3 0 1 2 0
7 3 1 1 2 1
8 3 2 1 2 2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
import numpy as np
import dask.array as da

L = []
def f(x):
    trip = ((1,2,x) for x in range(3))
    # append each group's triplets as a dask array
    L.append(da.from_array(np.array(list(trip)), chunks=(1,3)))

ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print (dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64

Pandas DataFrame: How to print single row horizontally?

A single row of a DataFrame prints vertically, i.e. column_name then column_value on one line, with the next line containing the next column_name and column_value. For example, the code below
import pandas as pd
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    # other operations go here....
    print(row)
Output for first row comes as
0 100
1 200
2 300
Name: 0, dtype: int64
Is there a way to have each row printed horizontally and ignore the datatype, Name? Example for the first row:
0 1 2
100 200 300
Use the to_frame method, then transpose with T:
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    print(row.to_frame().T)
0 1 2
0 100 200 300
0 1 2
1 400 500 600
Note: this is similar to @JohnE's answer, in that the to_frame method is syntactic sugar around pd.DataFrame.
In fact, if we follow the code:
def to_frame(self, name=None):
    """
    Convert Series to DataFrame

    Parameters
    ----------
    name : object, default None
        The passed name should substitute for the series name (if it has
        one).

    Returns
    -------
    data_frame : DataFrame
    """
    if name is None:
        df = self._constructor_expanddim(self)
    else:
        df = self._constructor_expanddim({name: self})
    return df
it points to _constructor_expanddim:
@property
def _constructor_expanddim(self):
    from pandas.core.frame import DataFrame
    return DataFrame
which, as you can see, simply returns the DataFrame class.
Use the transpose property:
df.T
0 1 2
0 100 200 300
It seems like there should be a simpler answer to this, but try turning it into another DataFrame with one row.
data = {x: y for x, y in zip(df.columns, df.iloc[0])}
sf = pd.DataFrame(data, index=[0])
print(sf.to_string())
Sorta combining the two previous answers, you could do:
for index, ser in df.iterrows():
    print( pd.DataFrame(ser).T )
0 1 2
0 100 200 300
0 1 2
1 400 500 600
Basically what happens is that if you extract a row or column from a dataframe, you get a Series, which displays as a column. It doesn't matter whether you do ser or ser.T, it still "looks" like a column. I mean, Series are one dimensional, not two, but you get the point...
So anyway, you can convert the series to a dataframe with one row. (I changed the name from "row" to "ser" to emphasize what is happening above.) The key is you have to convert to a dataframe first (which will be a column by default), then transpose it.
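To make the "column by default, then transpose" point concrete, a tiny sketch:
import pandas as pd

df = pd.DataFrame([[100, 200, 300], [400, 500, 600]])
ser = df.iloc[0]            # extracting a row gives a Series
print(pd.DataFrame(ser))    # displays as a single column
print(pd.DataFrame(ser).T)  # one horizontal row, as desired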
