Create pivot table in pandas [duplicate]

This question already has answers here:
Use groupby in Pandas to count things in one column in comparison to another
(4 answers)
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a dataframe, say t:
data = [['00637', 'rew_A'], ['5644', 'rew_A'], ['564', 'rew_A'],
['2218', 'rew_C'], ['990', 'rew_C'], ['17', 'rew_A'],
['5565', 'rew_C'], ['121', 'rew_A'], ['76700', 'rew_B'],
['00637', 'rew_C']]
t = pd.DataFrame(data, columns = ['emp_id', 'reward'])
t
emp_id reward
0 00637 rew_A
1 5644 rew_A
2 564 rew_A
3 2218 rew_C
4 990 rew_C
5 17 rew_A
6 5565 rew_C
7 121 rew_A
8 76700 rew_B
9 00637 rew_C
My output should contain 4 columns (emp_id, rew_A, rew_B, and rew_C), a basic pivot table which should look like this:
emp_id rew_A rew_B rew_C
0 00637 1 1
1 5644 1
2 564 1
Please help me out to create this.
Thanks!!! :)
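One way to build this (a minimal sketch, treating the blank cells in the desired output as zero counts) is pd.crosstab:
import pandas as pd

data = [['00637', 'rew_A'], ['5644', 'rew_A'], ['564', 'rew_A'],
        ['2218', 'rew_C'], ['990', 'rew_C'], ['17', 'rew_A'],
        ['5565', 'rew_C'], ['121', 'rew_A'], ['76700', 'rew_B'],
        ['00637', 'rew_C']]
t = pd.DataFrame(data, columns=['emp_id', 'reward'])

# Count how often each reward appears per employee; one column per reward.
out = pd.crosstab(t['emp_id'], t['reward']).reset_index()
out.columns.name = None
print(out)
pd.crosstab counts co-occurrences directly; t.pivot_table(index='emp_id', columns='reward', aggfunc='size', fill_value=0) would give the same table.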

Related

group rows based on a string in a column in pandas and count the number of occurrences of unique rows that contain the string

I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "RUN_NUMBER". I am able to do so. However, I would like to count the number of unique rows that contain the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contain "M22", so I should get 3 as the output.
Use the following approach with the pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
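For reference, a self-contained sketch of both approaches, with the frame rebuilt from the question's RUN_NUMBER column:
import pandas as pd
import numpy as np

runs = ['335991M', 'M220621', 'M220678', '510091M', 'M220500',
        '335991M', 'M220621', 'M220678', '335991M', 'M220621', 'M220678']
df = pd.DataFrame({'RUN_NUMBER': runs})

# Pandas: keep rows containing "M22", then count the distinct values.
print(df[df['RUN_NUMBER'].str.contains('M22')]['RUN_NUMBER'].unique().size)  # 3

# NumPy: np.char.find returns -1 where "M22" is absent.
print((np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum())  # 3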

pandas groupby changes column into series

df = sample.groupby('id')['user_id'].apply(list).reset_index(name='new')
This gives me:
id new
0 429 [659500]
1 1676 [2281394]
2 2389 [3973559]
3 2810 [4382598]
4 3104 [4733375]
5 3447 [5519461]
6 3818 [4453354]
7 3846 [4514870]
8 4283 [6378476]
9 4626 [6670089]
10 5022 [1116244]
11 5213 [6913646]
12 5899 [8213945, 8210403]
13 5962 [8733646]
However, 'new' is a Series; how can I get 'new' into a list of strings in a dataframe?
I've tried df['new_id'] = df.loc[:, ['new']], thinking that this would at least solve my Series issue, since print(type(df.loc[:, ['new']])) returns a DataFrame.
Try this:
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
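A minimal runnable sketch of that mapping (hypothetical data, with one id that maps to two user_ids):
import pandas as pd

# Hypothetical sample: id 5899 appears twice with different user_ids.
sample = pd.DataFrame({
    'id':      [429, 1676, 5899, 5899],
    'user_id': [659500, 2281394, 8213945, 8210403],
})

# Aggregate user_ids into a list per id, then map that list back onto each row.
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
print(sample)
Unlike reset_index, map keeps the original row count of sample and attaches the full list to every row with that id.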

How to create a feature based on an average of X rows before? [duplicate]

This question already has answers here:
Moving Average Pandas
(4 answers)
Closed 2 years ago.
I have a dataframe with years of data and many features.
For each of those features I want to create a new feature that averages the last 12 weeks of data.
Say I have weekly data: I want a datapoint for feature1B to give me the average of the last 12 rows of data from feature1A. And if the data is hourly, I want the same done but for the last 2016 rows (24 hours * 7 days * 12 weeks).
So for instance, say the data looks like this:
Week Feature1
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318
14 8642
15 4181
16 3871
17 7919
18 2468
19 4981
20 9871
I need the code to loop through the multiple features, create a feature name such as 'TARGET.' + feature, and output the averaged data based on my criteria (last 12 rows, last 2016 rows, depending on the format).
Week Feature1 Feature1-B
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318 5717.333333
14 8642 5590
15 4181 6102.083333
16 3871 6284.916667
17 7919 6166.333333
18 2468 6619
19 4981 6659.583333
20 9871 6326.916667
Appreciate any help.
Solved with the helpful comment from Chris A. Can't seem to mark that comment as an answer.
import pandas as pd

df = pd.read_csv('data.csv')
cols = df.iloc[:, 2:].columns
for c in cols:
    # Rolling mean over the previous 2016 rows (12 weeks of hourly data).
    df['12W_AVG.' + c] = df[c].rolling(2016).mean()
    # The first 2015 rows have no full window; fill them with the first average.
    df['12W_AVG.' + c] = df['12W_AVG.' + c].fillna(df['12W_AVG.' + c][2015])
    # Lower and upper limits at 10% below/above the rolling average.
    df['12W_AVG.' + c + '_LAL'] = df['12W_AVG.' + c] * 0.9
    df['12W_AVG.' + c + '_UAL'] = df['12W_AVG.' + c] * 1.1
    df.drop(c, axis=1, inplace=True)
Does this work for you?
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=["week", "data"], data=[
    [1, 8846], [2, 2497], [3, 1987], [4, 5294], [5, 2487],
    [6, 1981], [7, 8973], [8, 9873], [9, 8345], [10, 5481],
    [11, 4381], [12, 8463], [13, 7318], [14, 8642], [15, 4181],
    [16, 3871], [17, 7919], [18, 2468], [19, 4981], [20, 9871]])
df.insert(2, "average", 0.0)  # float column so the averages are not truncated
for length in range(12, len(df.index)):
    # Slice the 12 rows preceding the current one and average them.
    values = df.iloc[length - 12:length, 1]
    weekly_sum = np.sum(values, axis=0)
    df.at[length, 'average'] = weekly_sum / 12
print(df)
Mind you, this is very bad code and requires you to do some work on it yourself.
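For comparison, a more idiomatic sketch (assuming the same 20-week frame as above): the rolling mean already used in the accepted solution, shifted one row so each week averages only the 12 weeks before it:
import pandas as pd

feature1 = [8846, 2497, 1987, 5294, 2487, 1981, 8973, 9873, 8345, 5481,
            4381, 8463, 7318, 8642, 4181, 3871, 7919, 2468, 4981, 9871]
df = pd.DataFrame({'Week': range(1, 21), 'Feature1': feature1})

window = 12  # use 2016 instead for hourly data (24 * 7 * 12)
for c in ['Feature1']:
    # shift(1) excludes the current row, matching the desired output above.
    df['TARGET.' + c] = df[c].rolling(window).mean().shift(1)
print(df.tail(8))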

transform rows into columns in a pandas dataframe with serial data [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
I want to transform distinct rows in a dataframe into columns and the values assigned to each column.
I have a pandas dataframe with this structure (coming from a json file):
Key Value
0 _id 1
1 type house
2 surface 156
3 county andr
4 _id 2
5 type apartment
6 surface 95
7 county sprl
8 _id 3
9 type house
10 surface 234
11 county ilm
..
I expect a dataframe similar to:
_id type surface county
0 1 house 156 andr
1 2 apartment 95 sprl
2 3 house 234 ilm
...
df = pd.read_json(your_json, orient='records')
This should read it in the format you want.
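If the data is already in the two-column Key/Value frame shown above, here is a sketch of one way to reshape it (assuming every record is a run of rows starting at _id):
import pandas as pd

kv = pd.DataFrame({
    'Key':   ['_id', 'type', 'surface', 'county'] * 3,
    'Value': [1, 'house', 156, 'andr',
              2, 'apartment', 95, 'sprl',
              3, 'house', 234, 'ilm'],
})

# Each '_id' row starts a new record; cumsum gives a record number to pivot on.
kv['record'] = (kv['Key'] == '_id').cumsum()
out = (kv.pivot(index='record', columns='Key', values='Value')
         [['_id', 'type', 'surface', 'county']]
         .reset_index(drop=True))
print(out)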

Python - duplicated lines [duplicate]

This question already has answers here:
How to analyze all duplicate entries in this Pandas DataFrame?
(3 answers)
Closed 7 years ago.
I am new to Python. I would like to find the duplicated lines in a data frame.
To explain myself, I have the following data frame
type(data)
pandas.core.frame.DataFrame
data.head()
User Hour Min Day Month Year Latitude Longitude
0 0 1 48 17 10 2010 39.75000 -105.000000
1 0 6 2 16 10 2010 39.90625 -105.062500
2 0 3 48 16 10 2010 39.90625 -105.062500
3 0 18 25 14 10 2010 39.75000 -105.000000
I would like to find the duplicated lines in this data frame and return the 'User' that corresponds to each such line.
Thanks a lot,
Is this what you are looking for?
user = data[data.duplicated()]['User']
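A quick sketch on a cut-down version of the frame (hypothetical values) showing what this returns, plus keep=False if you want every copy of a duplicated row rather than only the later ones:
import pandas as pd

data = pd.DataFrame({
    'User': [0, 0, 1, 1],
    'Hour': [1, 6, 3, 3],
    'Min':  [48, 2, 48, 48],
})

# Rows that repeat an earlier row (the first occurrence is not flagged).
print(data[data.duplicated()]['User'])

# All members of each duplicate group, including the first occurrence.
print(data[data.duplicated(keep=False)]['User'])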
