pandas groupby changes column into series

pandas groupby changes column into series - python

df = sample.groupby('id')['user_id'].apply(list).reset_index(name='new') this gives me:
id new
0 429 [659500]
1 1676 [2281394]
2 2389 [3973559]
3 2810 [4382598]
4 3104 [4733375]
5 3447 [5519461]
6 3818 [4453354]
7 3846 [4514870]
8 4283 [6378476]
9 4626 [6670089]
10 5022 [1116244]
11 5213 [6913646]
12 5899 [8213945, 8210403]
13 5962 [8733646]
However new is a series, how can I get 'new' into a list of strings in a dataframe?
I've tried df['new_id'] = df.loc[:, ['new']] thinking that this would at least solve my series issue... since print(type(df.loc[:, ['new']])) retuns a dataframe.

Try this:
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame with finding a string "M22" in the column "Run number". I am able to do so. However, I would like to count the number of unique rows that contained the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
here is the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contained "M22" : so I need to get 3 as output.

Use the following approach with pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a more faster alternative using numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3

How to create a feature based on an average of X rows before? [duplicate]

This question already has answers here:
Moving Average Pandas
(4 answers)
Closed 2 years ago.
I have a dataframe with years of data and many features.
For each of those features I want to create a new feature that averages the last 12 weeks of data.
So say I have weekly data. I want a datapoint for feature1B to give me the average of the last 12 rows of data from feature1A. And if the data is hourly, I want the same done but for the last 2016 rows (24 hours * 7 days * 12 weeks)
So for instance, say the data looks like this:
Week Feature1
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318
14 8642
15 4181
16 3871
17 7919
18 2468
19 4981
20 9871
I need the code to loop through the multiple feature, create a feature name such as 'TARGET.'+feature and spit the averaged data based on my criteria (last 12 rows... last 2016 rows... depends on the format).
Week Feature1 Feature1-B
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318 5717.333333
14 8642 5590
15 4181 6102.083333
16 3871 6284.916667
17 7919 6166.333333
18 2468 6619
19 4981 6659.583333
20 9871 6326.916667
Appreciate any help.

Solved with the helpful comment from Chris A. Can't seem to mark that comment as an answer.
import pandas as pd
df = pd.read_csv('data.csv')
cols = df.iloc[:,2:].columns
for c in cols:
df['12W_AVG.'+c] = df[c].rolling(2016).mean()
df['12W_AVG.'+c] = df['12W_AVG.'+c].fillna(df['12W_AVG.'+c][2015])
df['12W_AVG.'+c+'_LAL'] = df['12W_AVG.'+c]*0.9
df['12W_AVG.'+c+'_UAL'] = df['12W_AVG.'+c]*1.1
df.drop(c, axis=1, inplace=True)

Does this work for you?
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=["week", "data"], data=[
[1, 8846],
[2,2497],
[3,1987],
[4,5294],
[5,2487],
[6,1981],
[7,8973],
[8,9873],
[9,8345],
[10,5481],
[11,4381],
[12,8463],
[13,7318],
[14,8642],
[15,4181],
[16,3871],
[17,7919],
[18,2468],
[19,4981],
[20,9871]])
df.insert(2, "average",0, True)
for length in range(12, len(df.index)):
values = df.iloc[length-12:index, 1]
weekly_sum = np.sum(values, axis=0)
df.at[length, 'average'] = weekly_sum / 12
print(df)
mind you, this is very bad code and requires you to do some work on it yourself

How to reindex a spesific column in Pandas?

gd = df.groupby(['subID'])['Accuracy'].std()
print(gd)
subID
4 0.810423
5 0.841364
6 0.881007
8 0.763175
9 0.760102
10 0.905283
12 0.928358
14 0.779291
15 0.912377
1018 0.926683
It displays like this and I assume it is a Series, not a DataFrame. I want to change the last index from 1018 to 13.

Use rename with dictionary, because first column here is index of Series:
gd = gd.rename({1018:13})
Working like:
gd = gd.rename(index={1018:13})

Python replace all values in dataframe with values from other dataframe

I'm quite new to python (and pandas) and a have a replace task for a large dataframe i couldn't find a solution for.
So i have two dataframes, one (df1) which looks something like this:
Id Id Id
4954733 3929949 515674
2950086 1863885 4269069
1241018 3711213 4507609
3806276 2035233 4968071
4437138 1248817 1167192
5468160 4726010 2851685
1211786 2604463 5172095
2914539 5235788 4130808
4730974 5835757 1536235
2201352 5779683 5771612
3864854 4784259 2928288
the other dataframe (df2) containing all the 'old' id's and the corresponding new ones in the next column (from 1 to 20,000), which looks something like this:
Id Id_new
5774290 1
761000 2
3489755 3
1084156 4
2188433 5
3456900 6
4364416 7
3518181 8
3926684 9
5797492 10
4435820 11
what i would like to do is replace all the id's (all columns) in df1 with the corresponding Id_new from df2. I guess ideally without having to do a merge or join for each column, given the size of the dataset?
The result should be like this: df_new
Id_new Id_new Id_new
8 12 22
16 9 8
21 25 10
10 15 13
29 6 4
22 7 22
30 3 3
11 31 29
32 29 27
12 3 4
14 6 24
Any tips would be great, thanks in advance!

I think you need replace by Series created by set_index:
print (df1)
Id Id.1 Id.2
0 4954733 3929949 515674 <-first value changed for match data
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
df = df1.replace(df2.set_index('Id')['Id_new'])
print (df)
Id Id.1 Id.2
0 1 3929949 515674
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of id and the category of item they used at a particular Time. I use df.groupby['id'] and then I want to see if they used the same category or different and assign True or False respectively (or NaN if that was the first item for that particular id. I also filtered out the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sorts to the Category column after groupby but I am having trouble figuring out the right function.

you don't need a groupby here, you just need sort and shift.
df.sort(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
i haven't tested this, but it should do the trick

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas groupby changes column into series - python

Try this: sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

How to create a feature based on an average of X rows before? [duplicate]

How to reindex a spesific column in Pandas?

Python replace all values in dataframe with values from other dataframe

Column operations on pandas groupby object

Categories

Resources