I have a sample dataframe in which every number is a user ID:

from  to
1     3
1     2
2     3
How do I count the occurrences of each value across both columns, sum them up for identical values, and display the result in a new dataframe in the following format?
UserID  Occurences
1       2
2       2
3       2
Thank you.
IIUC, you can stack then value_counts:
out = (df.stack()
         .value_counts()
         .to_frame('Occurences')
         .rename_axis('UserID')
         .reset_index())
print(out)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
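For reference, the intermediate df.stack() collapses both columns into a single Series, which is why a plain value_counts then works; a quick illustration on the sample frame above:

print(df.stack())
# 0  from    1
#    to      3
# 1  from    1
#    to      2
# 2  from    2
#    to      3
# dtype: int64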
Use DataFrame.melt with GroupBy.size:
df = (df.melt(value_name='UserID')
        .groupby('UserID')
        .size()
        .reset_index(name='Occurences'))
print(df)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
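For clarity, the intermediate melt step stacks both columns into one, keeping the source column name in 'variable'; a quick illustration on the sample frame:

print(df.melt(value_name='UserID'))
#   variable  UserID
# 0     from       1
# 1     from       1
# 2     from       2
# 3       to       3
# 4       to       2
# 5       to       3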
The pd.Series.value_counts method can be used to count the occurrences of each user ID in the "from" and "to" columns, and pd.concat can combine the results. At the end, create a dataframe from the resulting series using pd.DataFrame.reset_index:
import pandas as pd

df = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})
occur = pd.concat([df['from'].value_counts(), df['to'].value_counts()])
result_df = occur.reset_index()
result_df.columns = ['UserID', 'Occurences']
result_df = result_df.groupby(['UserID'])['Occurences'].sum().reset_index()
print(result_df)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
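Equivalently, the two value counts can be folded together with Series.add instead of a groupby; a minimal sketch of that variant:

# align the two count Series on their index and sum; fill_value=0
# handles IDs that appear in only one column
occur = df['from'].value_counts().add(df['to'].value_counts(), fill_value=0).astype(int)
result_df = occur.rename_axis('UserID').reset_index(name='Occurences')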
Related
I have the following pandas dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
                    'col_a': [0,1,1,0,1,1,1,0,1,1]})
I would like to create a new column (col_a_new) that is the same as col_a, except that the first of every two consecutive 1s in col_a is replaced with 0, within each id.
The resulting dataframe looks like this:
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
                    'col_a': [0,1,1,0,1,1,1,0,1,1],
                    'col_a_new': [0,0,1,0,1,0,1,0,0,1]})
Any ideas?
Another approach: just group by id and define the new values using the appropriate conditions.
foo["col_a_new"] = (foo.groupby("id").col_a
                       .transform(lambda series: [0 if i < len(series) - 1
                                                  and series.iat[i + 1] == 1
                                                  else x
                                                  for i, x in enumerate(series)]))
import numpy as np

# group by id and by non-consecutive clusters of 0/1 in col_a
group = foo.groupby(["id", foo["col_a"].ne(foo["col_a"].shift()).cumsum()])

# get the cumulative count and the size of each group
foo_cumcount = group.cumcount()
foo_count = group.col_a.transform(len)

# set to zero the first 1 of every group holding two ones, otherwise keep the original value
foo["col_a_new"] = np.where(foo_cumcount.eq(0)
                            & foo_count.gt(1)
                            & foo.col_a.eq(1),
                            0, foo.col_a)
# result
   id  col_a  col_a_new
0   1      0          0
1   1      1          0
2   1      1          1
3   1      0          0
4   1      1          1
5   2      1          0
6   2      1          1
7   2      0          0
8   2      1          0
9   2      1          1
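For what it's worth, the same rule can also be vectorized with a grouped shift: zero out every 1 whose next value within the same id is also 1. A sketch, equivalent to the approaches above on this data:

# keep col_a where the mask is True, otherwise write 0; shift(-1)
# looks at the next row within each id group
foo["col_a_new"] = foo["col_a"].where(
    ~(foo["col_a"].eq(1) & foo.groupby("id")["col_a"].shift(-1).eq(1)), 0)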
I have a pandas dataframe that contains a user id and the ad clicks (if any) by that user across several days:
df = pd.DataFrame([['A',0], ['A',1], ['A',0], ['B',0], ['B',0], ['B',0], ['B',1], ['B',1], ['B',1]],
                  columns=['user_id', 'click_count'])
Out[8]:
  user_id  click_count
0       A            0
1       A            1
2       A            0
3       B            0
4       B            0
5       B            0
6       B            1
7       B            1
8       B            1
I would like to convert this dataframe into a dataframe with one row per user, where 'click_cnt' is the sum of click_count across all of that user's rows in the original dataframe, i.e.
Out[18]:
  user_id  click_cnt
0       A          1
1       B          3
What you're after is the function groupby:
df = df.groupby('user_id', as_index=False).sum()
Adding the flag as_index=False will add the keys as a separate column instead of using them for the new index.
groupby is super useful - have a read through the documentation for more info.
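If you want the output to match the asked-for click_cnt header exactly, you can select the column explicitly and rename it; a minimal sketch:

out = (df.groupby('user_id', as_index=False)['click_count'].sum()
         .rename(columns={'click_count': 'click_cnt'}))
print(out)
#   user_id  click_cnt
# 0       A          1
# 1       B          3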
I'm new to Python and I'm working with a pandas dataframe.
So I have a dataframe like:
Client_id  Nb_Products
1          2
2          3
3          1
And I need to 'explode' each row Nb_Products times for each Client_id.
So I need to output the following table:
Client_id  Product_Nb
1          1
1          2
2          1
2          2
2          3
3          1
At first I thought I should create a range of numbers for Nb_Products, like:
Client_id  Nb_Products_rng
1          [1,2]
2          [1,2,3]
3          [1]
And then explode it.
But I couldn't manage to create this.
I'd be grateful for any answer or partial answer.
Thank you
Methodology
First I set an index to speed things up and to get the unique client ids:
df = df.set_index('Client_id')
client_ids = df.index.get_level_values('Client_id').unique()
Then I just reconstruct the DataFrame by iterating over all products per client:
res = pd.DataFrame(
    [
        [client, prod]
        for client in client_ids
        for prod in range(1, df.loc[client, 'Nb_Products'].max() + 1)
    ],
    columns=['Client_id', 'Nb_Products']
)
Example / Test
The test data I used
import pandas as pd
df = pd.DataFrame(
    [[1, 2], [2, 3], [3, 3]],
    columns=['Client_id', 'Nb_Products']
)
Initial DataFrame
   Client_id  Nb_Products
0          1            2
1          2            3
2          3            3
Result
   Client_id  Nb_Products
0          1            1
1          1            2
2          2            1
3          2            2
4          2            3
5          3            1
6          3            2
7          3            3
You can do it simply by repeating each Client_id value Nb_Products times to 'explode' your dataset. Repeating the Client_id value in a row by the value against it in the Nb_Products column produces the Client_id column of the new dataframe; I do this with a list comprehension.
To get the second column, Product_Nb, you simply need a sequence starting from 1 for each client.
from io import StringIO

import pandas as pd

TESTDATA = StringIO("""Client_id Nb_Products
1 2
2 3
3 1""")
df = pd.read_csv(TESTDATA, sep=" ")

# repeat each Client_id Nb_Products times
col1 = []
_ = [col1.extend([a] * b) for a, b in zip(df.iloc[:, 0].values.tolist(),
                                          df.iloc[:, 1].values.tolist())]
# build the 1..Nb_Products sequence for each client
col2 = []
_ = [col2.extend(list(range(1, i + 1))) for i in df.iloc[:, 1].values.tolist()]

df2 = pd.DataFrame(list(zip(col1, col2)), columns=['Client_id', 'Product_Nb'])
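As a side note, newer pandas versions (0.25+) offer DataFrame.explode, which avoids the manual list building; a minimal sketch on the same data:

import pandas as pd

df = pd.DataFrame({'Client_id': [1, 2, 3], 'Nb_Products': [2, 3, 1]})
# attach a list [1..Nb_Products] per row, then expand it to one row per element
out = (df.assign(Product_Nb=df['Nb_Products'].apply(lambda n: list(range(1, n + 1))))
         .explode('Product_Nb')[['Client_id', 'Product_Nb']]
         .reset_index(drop=True))
print(out)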
I'm trying to generate a timeseries from a dataframe, but the solutions I've found here don't really address my specific problem. I have a dataframe which is a series of ids that iterate from 1 to n, then repeat, like this:
key  ID  Var_1
0    1   1
0    2   1
0    3   2
1    1   3
1    2   2
1    3   1
I want to reshape it into a timeseries in which ID is the index:
ID  Var_1_0  Var_2_0
1   1        3
2   1        2
3   2        1
I have tried the stack() method but it doesn't generate the result I want. Generating an index from ID seems to be the right approach, but ID is not a proper date, so I'm not sure how to proceed. Pointers much appreciated.
Try this:
import pandas as pd
df = pd.DataFrame([[0,1,1], [0,2,1], [0,3,2], [1,1,3], [1,2,2], [1,3,1]], columns=('key', 'ID', 'Var_1'))
Use the pivot function:
df2 = df.pivot(index='ID', columns='key', values='Var_1')
You can rename the columns by:
df2.columns = ('Var_1_0', 'Var_2_0')
Result:
Out:
    Var_1_0  Var_2_0
ID
1         1        3
2         1        2
3         2        1
Consider the following DataFrame:
df2 = pd.DataFrame({
    'VAR_1': [1, 1, 1, 3, 3],
    'GROUP': [1, 1, 1, 2, 2],
})
My goal is to create a separate column "GROUP_MEAN" which holds the arithmetic mean of the column "VAR_1".
But it should always be computed with respect to the row's value in "GROUP":
   GROUP  VAR_1  GROUP_MEAN
0      1      1  Mean Value GROUP = 1
1      1      1  Mean Value GROUP = 1
2      1      1  Mean Value GROUP = 1
3      2      3  Mean Value GROUP = 2
4      2      3  Mean Value GROUP = 2
I can easily access the overall mean:
df2['GROUP_MEAN'] = df2['VAR_1'].mean()
How do I go about making this conditional on another column's value?
I think this is a perfect use case for transform:
>>> df2 = pd.DataFrame({'VAR_1' : [1,2,3,4,5], 'GROUP': [1,1,1,2,2]})
>>> df2["GROUP_MEAN"] = df2.groupby('GROUP')['VAR_1'].transform('mean')
>>> df2
   GROUP  VAR_1  GROUP_MEAN
0      1      1         2.0
1      1      2         2.0
2      1      3         2.0
3      2      4         4.5
4      2      5         4.5

[5 rows x 3 columns]
Typically you use transform when you want to broadcast the result across all entries of the group.
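To see the contrast, plain aggregation collapses to one row per group, while transform broadcasts the group result back to every original row; using the frame above:

df2.groupby('GROUP')['VAR_1'].mean()
# GROUP
# 1    2.0
# 2    4.5
# Name: VAR_1, dtype: float64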
Assuming that the actual dataframe has columns in addition to VAR_1:
import numpy as np

ts = df2.groupby('GROUP')['VAR_1'].aggregate(np.mean)
df2['GROUP_MEAN'] = ts[df2.GROUP].values
Alternatively, the last line could also be:
df2 = df2.join(ts, on='GROUP', rsuffix='_MEAN')
(note the joined column then comes out as VAR_1_MEAN, since the suffix is applied to the overlapping VAR_1 column).