Python: adding columns to dataframe using calculated values from specific rows - python

Hi I am pretty new to python and would like to start working in it and move away from excel. My problem is two fold:
First part is that I have a csv file which looks like this
row 1: 52.78 52.52 53.2 51.98 53.22 50.85 51.44 52.38 52.21 52.09 51.5 51.92
row2 : 6.89 5.47 5.8 5.89 6.56 5.69 5.48 4.9 6.39 5.12 3.61 4.48
row3: 156 126 185 363 197 261 417 298 292 150 102 303
row4: 0 0 0 0 0 0 0 0 0 0 0 0
row5: 0 3 5 8 0 0 10 0 12 0 13 0
...
...
...
row195: 0 5 5 7 1 2 11 0 12 0 13 0
it goes on like this till row 195
I want to create new columns which start on row 4 and use the following formula:
the first column should be
[(row3,column1)*(row4,column1)]+[(row3,column1)*(row101,column1)]
the second column should be
[(row3,column2)*(row4,column2)]+[(row3,column2)*(row101,column2)]
it goes on like this till row 100 for all 12 columns
the formula in row 100 for the first column should be
[(row3,column1)*(row98,column1)]+[(row3,column1)*(row195,column1)]
how do I go about doing this in Pandas?
The second part of my problem is that I have 365 different files with similar data (the values change per file but the format is the same) and I would like to apply this same formula in all the files.
Appreciate any help I can get
Thanks

You are trying to use pandas like Excel if I understand it correctly.
If your dataset/dataframe is called df and you would like to append a new column. You would right something like:
df['first_col']=float(df.iloc[2,0])*float(df.iloc[3,0])+float(df.iloc[2,0])*float(df.ilo[100,0])
... and the same for the other 2 columns. Be aware that python starts counting from 0. Hence, your row 1 is actually row 0 and column 1 is column 0 etc. Hope this helps.

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame with finding a string "M22" in the column "Run number". I am able to do so. However, I would like to count the number of unique rows that contained the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
here is the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contained "M22" : so I need to get 3 as output.
Use the following approach with pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a more faster alternative using numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3

pandas groupby single column but add contents of multiple columns

I have a dataframe like this
Time
Buy
Sell
Bin
0
09:15:01
3200
0
3573.0
1
09:15:01
0
4550
3562.0
2
09:15:01
4250
0
3565.0
3
09:15:01
0
5150
3562.0
4
09:15:01
1200
0
3563.0
..
...
...
...
...
292
09:15:01
375
0
3564.0
293
09:15:01
175
0
3564.0
294
09:15:01
0
25
3564.0
295
09:15:01
400
0
3564.0
(Disregard 'Time' currently just using a static value)
what would be the most efficient way to
sum up all the Buys and sells within each bin and remove duplicates
Currently im using
Step1.
final_df1['Buy'] = final_df1.groupby(final_df1['Bin'])['Buy'].transform('sum')
Step2.
final_df1['Sell'] = final_df1.groupby(final_df1['Bin'])['Sell'].transform('sum')
Step3.
##remove duplicates
final_df1 = final_df1.groupby('Bin', as_index=False).max()
using agg or sum or cumsum just removed all the other columns from the resulting df
Ideally there should be distinct bins with sum of buy and/or sell
The output must be
Time
Buy
Sell
Bin
0
09:15:01
3200
0
3573.0
1
09:15:01
450
4550
3562.0
2
09:15:01
4250
3625
3565.0
292
09:15:01
950
25
3564.0
This can also be achieved by using pivot_table from pandas.
Here is the simple recreated example for your code:
import numpy as np
import pandas as pd
df=pd.DataFrame({'buy':[1,2,3,0,3,2,1],
'sell':[2,3,4,0,5,4,3],
'bin':[1,2,1,2,1,2,1],
'time': [1,1,1,1,1,1,1]
})
df_output=df.pivot_table(columns='bin', values=['buy','sell'], aggfunc=np.sum)
Output will look like this:
bin 1 2
buy 8 4
sell 14 7
In case you want the output you mentioned:
we can take the transpose of the above dataframe output:
df_output.T
Or use a groupby as below to input dataframe:
df.groupby(['bin'])[['sell', 'buy']].sum()
The output is as below:
sell buy
bin
1 14 8
2 7 4
If you also need time in the dataframe, we could do it by using and separating the aggregate function for each column:
df.groupby("bin").agg({ "sell":"sum", "buy":"sum", "time":"first"})
The output is as below:
sell buy time
bin
1 14 8 1
2 7 4 1

Separate Pandas DataFrame into sections between rows that satisfy a condition

I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That go me stuck though. I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
# split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on 2 conditions: That the original TripID changes from one row to the next, or that the difference in your time column is greater than one
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking for, it's unnecessary

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I cant seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that i start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of id and the category of item they used at a particular Time. I use df.groupby['id'] and then I want to see if they used the same category or different and assign True or False respectively (or NaN if that was the first item for that particular id. I also filtered out the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sorts to the Category column after groupby but I am having trouble figuring out the right function.
you don't need a groupby here, you just need sort and shift.
df.sort(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
i haven't tested this, but it should do the trick

Categories