I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "RUN_NUMBER". I am able to do so. However, I would also like to count the number of unique rows that contain the string "M22".
Here is what I have done for the example table below:
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the unique run numbers that contain "M22", so I should get 3 as output.
Use the following approach with the pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
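For reference, here is a minimal runnable version of both approaches, with the sample table built inline (only the RUN_NUMBER column matters here):

import numpy as np
import pandas as pd

df = pd.DataFrame({'RUN_NUMBER': ['335991M', 'M220621', 'M220678', '510091M',
                                  'M220500', '335991M', 'M220621', 'M220678',
                                  '335991M', 'M220621', 'M220678']})

# pandas: keep the rows containing "M22", then count the unique run numbers
print(df[df['RUN_NUMBER'].str.contains('M22')]['RUN_NUMBER'].unique().size)  # 3

# numpy: deduplicate first, then search each unique string for "M22"
print((np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum())  # 3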
I ran:
df = sample.groupby('id')['user_id'].apply(list).reset_index(name='new')
which gives me:
id new
0 429 [659500]
1 1676 [2281394]
2 2389 [3973559]
3 2810 [4382598]
4 3104 [4733375]
5 3447 [5519461]
6 3818 [4453354]
7 3846 [4514870]
8 4283 [6378476]
9 4626 [6670089]
10 5022 [1116244]
11 5213 [6913646]
12 5899 [8213945, 8210403]
13 5962 [8733646]
However, 'new' is a Series; how can I get 'new' into a list of strings in a dataframe?
I've tried df['new_id'] = df.loc[:, ['new']], thinking that this would at least solve my Series issue, since print(type(df.loc[:, ['new']])) returns a DataFrame.
Try this:
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
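A quick sketch of what this produces, on a toy frame (hypothetical data, just to show the shape of the result):

import pandas as pd

sample = pd.DataFrame({'id': [429, 5899, 5899],
                       'user_id': [659500, 8213945, 8210403]})

# map each id to the list of all user_ids seen for that id
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
print(sample['new_id'].tolist())
# [[659500], [8213945, 8210403], [8213945, 8210403]]

Unlike apply(list).reset_index(), this keeps one row per original row and writes the lists back onto sample directly.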
I have a dataframe with years of data and many features.
For each of those features I want to create a new feature that averages the last 12 weeks of data.
So say I have weekly data: I want each datapoint of Feature1-B to give me the average of the last 12 rows of Feature1. And if the data is hourly, I want the same done for the last 2016 rows (24 hours * 7 days * 12 weeks).
So for instance, say the data looks like this:
Week Feature1
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318
14 8642
15 4181
16 3871
17 7919
18 2468
19 4981
20 9871
I need the code to loop through the multiple features, create a feature name such as 'TARGET.'+feature, and spit out the averaged data based on my criteria (the last 12 rows, the last 2016 rows... depending on the format).
Week Feature1 Feature1-B
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318 5717.333333
14 8642 5590
15 4181 6102.083333
16 3871 6284.916667
17 7919 6166.333333
18 2468 6619
19 4981 6659.583333
20 9871 6326.916667
Appreciate any help.
Solved with the helpful comment from Chris A. I can't seem to mark that comment as an answer.
import pandas as pd

df = pd.read_csv('data.csv')
cols = df.iloc[:, 2:].columns

for c in cols:
    # 2016 rows = 24 hours * 7 days * 12 weeks of hourly data
    df['12W_AVG.' + c] = df[c].rolling(2016).mean()
    # the first 2015 rows have no full window; fill them with the first computed average
    df['12W_AVG.' + c] = df['12W_AVG.' + c].fillna(df['12W_AVG.' + c].iloc[2015])
    df['12W_AVG.' + c + '_LAL'] = df['12W_AVG.' + c] * 0.9
    df['12W_AVG.' + c + '_UAL'] = df['12W_AVG.' + c] * 1.1
    df.drop(c, axis=1, inplace=True)
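For weekly data the same pattern applies; only the window size changes (a sketch under that assumption):

for c in cols:
    df['12W_AVG.' + c] = df[c].rolling(12).mean()  # 12 rows = 12 weeks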
Does this work for you?
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=["week", "data"], data=[
    [1, 8846], [2, 2497], [3, 1987], [4, 5294], [5, 2487],
    [6, 1981], [7, 8973], [8, 9873], [9, 8345], [10, 5481],
    [11, 4381], [12, 8463], [13, 7318], [14, 8642], [15, 4181],
    [16, 3871], [17, 7919], [18, 2468], [19, 4981], [20, 9871]])

df.insert(2, "average", 0.0)  # float column to hold the averages

for length in range(12, len(df.index)):
    # average the 12 rows preceding this one
    values = df.iloc[length - 12:length, 1]
    weekly_sum = np.sum(values, axis=0)
    df.at[length, 'average'] = weekly_sum / 12

print(df)
Mind you, this is fairly crude code and requires some more work on your side.
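For comparison, pandas' built-in rolling window gives the same result in one line; the shift(1) makes each week's value the average of the 12 preceding weeks, matching the expected output above:

df["average"] = df["data"].rolling(12).mean().shift(1)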
I want to transform distinct rows of a dataframe into columns, with the values assigned to each column.
I have a pandas dataframe with this structure (coming from a json file):
Key Value
0 _id 1
1 type house
2 surface 156
3 county andr
4 _id 2
5 type apartment
6 surface 95
7 county sprl
8 _id 3
9 type house
10 surface 234
11 county ilm
..
I expect a dataframe similar to:
_id type surface county
0 1 house 156 andr
1 2 apartment 95 sprl
2 3 house 234 ilm
...
df = pd.read_json(your_json, orient='records')
This should read it in the format you want.
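If you only have the flattened two-column frame rather than the original JSON, here is a pivot-based sketch (assuming every record starts with an '_id' row):

import pandas as pd

# number the records: a new record begins at every '_id' row
df['record'] = (df['Key'] == '_id').cumsum()

# pivot the keys into columns, one row per record
out = (df.pivot(index='record', columns='Key', values='Value')
         .reset_index(drop=True)[['_id', 'type', 'surface', 'county']])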
I am new to Python. I would like to find the duplicated rows in a data frame.
To illustrate, I have the following data frame:
type(data)
pandas.core.frame.DataFrame
data.head()
User Hour Min Day Month Year Latitude Longitude
0 0 1 48 17 10 2010 39.75000 -105.000000
1 0 6 2 16 10 2010 39.90625 -105.062500
2 0 3 48 16 10 2010 39.90625 -105.062500
3 0 18 25 14 10 2010 39.75000 -105.000000
I would like to find the duplicated rows in this data frame and return the 'User' values that correspond to those rows.
Thanks a lot,
Is this what you are looking for?
user = data[data.duplicated()]['User']
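Note that duplicated() leaves the first occurrence unmarked by default; pass keep=False if you want the 'User' of every row in each duplicate group:

# flag all rows that have at least one identical twin, first occurrences included
user = data[data.duplicated(keep=False)]['User']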