How to group specific items in a column and calculate the mean

I am trying to figure out how to group specific tag names in a column and then to calculate the mean of the raw data that have the same time. My dataframe looks something like this except with 10000+ rows:
tag_name time raw_data
happy 5 300
8 340
angry 5 315
8 349
sad 5 400
8 480
etc.
.
.
I wish to keep the mean in the dataframe table, but I can't figure out how. I found that I can create a pivot table to calculate the mean, but I can't figure out how to single out the specific tag names I want. Below is what I have so far:
output = pd.pivot_table(data=dataset,index=['Timestamp'],columns=['Tag_Name'],values='Raw_Data',aggfunc='mean')
I am trying to get one of these outputs when I calculate the average of sad and happy:
1. optimal output:
tag_name time raw_data sad_happy_avg
happy 5 300 350
8 340 410
sad 5 400
8 480
angry 5 315
8 349
2. alright output:
tag_name happy sad avg
time
5 300 400 350
8 340 480 410

Try as follows:
Use Series.isin to keep only "happy" and "sad", and apply df.pivot to get the data in the correct shape.
Next, add a column for the mean on axis=1:
res = df[df.tag_name.isin(['happy','sad'])].pivot(index='time', columns='tag_name',
values='raw_data')
res['avg'] = res.mean(axis=1)
print(res)
tag_name happy sad avg
time
5 300 400 350.0
8 340 480 410.0
Your "optimal" output doesn't seem a very logical way to present/store this data, but you can achieve it as follows:
# assuming a default index (0, 1, 2, ...) with the "happy" rows first:
# the per-time means are re-indexed 0..n-1, so they align with those rows
df['sad_happy_avg'] = (df[df.tag_name.isin(['happy','sad'])]
                       .groupby('time')['raw_data'].mean().reset_index(drop=True))
print(df)
tag_name time raw_data sad_happy_avg
0 happy 5 300 350.0
1 happy 8 340 410.0
2 angry 5 315 NaN
3 angry 8 349 NaN
4 sad 5 400 NaN
5 sad 8 480 NaN
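If relying on index positions feels fragile, a possible alternative is groupby(...).transform, which returns a result aligned with the original rows. The sketch below rebuilds the sample data from the question; note that it deliberately differs from the "optimal" layout by writing the mean on both the "happy" and the "sad" rows, which is an assumption about what is useful, not something the question asked for.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'tag_name': ['happy', 'happy', 'angry', 'angry', 'sad', 'sad'],
    'time': [5, 8, 5, 8, 5, 8],
    'raw_data': [300, 340, 315, 349, 400, 480],
})

# Rows whose tag is one of the tags to average
mask = df['tag_name'].isin(['happy', 'sad'])

# transform('mean') keeps the original row labels, so the assignment
# aligns by index label instead of by position
df.loc[mask, 'sad_happy_avg'] = (
    df[mask].groupby('time')['raw_data'].transform('mean')
)
print(df)
```

Rows with other tags (here "angry") are left as NaN in the new column.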

Related

Create new column with multiple values in Python

I have a dataframe that holds station names and, for each station, a link to its measured values over 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Getting the data out of the JSON isn't a big problem.
But then I realized that the JSON link for each station returns multiple values from different times, so I don't know how to attach the values for each time to a specific station.
I tried to fetch all the values from the JSON, but then I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608

Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import json
import pandas as pd
import requests
Define the function to be used by apply
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        # build a frame from the JSON records and tag it with the station name
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
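A variation on the same idea, sketched under some assumptions: instead of a global data frame that grows inside apply, collect one frame per station in a list and concatenate once at the end. The names collect_measurements, fetch and fake_fetch are hypothetical, and fake_fetch returns made-up records so the example runs without network access.

```python
import pandas as pd

def collect_measurements(stations, fetch):
    """Build one long frame with a 'station' column.

    `fetch` is a hypothetical callable that takes a URL and returns a list
    of {'timestamp': ..., 'value': ...} records.
    """
    frames = []
    for row in stations.itertuples(index=False):
        frame = pd.DataFrame(fetch(row.Link))
        frame['station'] = row.Station  # remember which station the rows belong to
        frames.append(frame)
    # a single concat at the end avoids re-copying the data on every iteration
    return pd.concat(frames, ignore_index=True)

def fake_fetch(url):
    # Stand-in for the HTTP call; the records are invented for illustration
    return [
        {'timestamp': '2022-07-31T02:00:00+02:00', 'value': 220.0},
        {'timestamp': '2022-07-31T02:15:00+02:00', 'value': 220.0},
    ]

stations = pd.DataFrame({
    'Station': ['EITZE', 'RETHEM'],
    'Link': ['https://example.invalid/EITZE', 'https://example.invalid/RETHEM'],
})
result = collect_measurements(stations, fake_fetch)
print(result)
```

With real data, fetch could be something like lambda url: requests.get(url, timeout=10).json(), plus whatever error handling the situation needs.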

Pandas: Select row pairs based on specific combination of strings in one column

I'm fairly new to python/pandas and have struggled to find an example specific enough for me to work with.
Say I have the following pandas dataframe, consisting of a column of event markers and a column displaying the time each marker was presented:
df = pd.DataFrame({'Marker': ['S200', 'S4', 'S44', 'Tone', 'S200', 'S1', 'S44', 'Tone'],
'Time': [0, 100, 150, 230, 300, 340, 380, 400]})
Marker Time
0 S200 0
1 S4 100
2 S44 150
3 Tone 230
4 S200 300
5 S1 340
6 S44 380
7 Tone 400
I would like to extract pairs of rows where S44 is followed by a Tone. The resulting output should be:
newdf = pd.DataFrame({'Marker': ['S44', 'Tone', 'S44', 'Tone'],
'Time': [150, 230, 380, 400]})
Marker Time
0 S44 150
1 Tone 230
2 S44 380
3 Tone 400
Any ideas would be appreciated!

One way about it is to use shift to get the indexes, add 1 and pull with loc. Note that this assumes a default integer index (0, 1, 2, ...), so that index + 1 is the label of the following row:
index = df.loc[df.Marker.shift(-1).eq('Tone') & (df.Marker.eq('S44'))].index
df.loc[index.union(index + 1)]
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400

Another way:
s = df.Marker.eq('S44') & df.Marker.shift(-1).eq('Tone')
# fill_value=False keeps the shifted mask boolean (no NaN in the first row)
df = df[s | s.shift(fill_value=False)]
Output:
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400

pandas group by multiple columns and remove rows based on multiple conditions

I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a CSV dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are both less than 10, I want to remove the second line.
For example, from the first two lines I want to delete the second line, similarly from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr group by, lag and lead functions. However, I am not sure how to combine different functions in python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
I'm not sure which function I should call within transform, or how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130

You can process each group's frame separately and apply the conditions through the apply function:
#df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any()) else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
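A row-wise alternative, offered as a sketch and not as the method above: keep the first row of every (imagename, brandname) group, and drop any later row whose xdiff and ydiff are both small. Comparing absolute values is an assumption (the question only says "less than 10"), and only a subset of the columns is rebuilt here to keep the example short.

```python
import pandas as pd

img = '95-20180407-215120-235505-00050.jpg'
df = pd.DataFrame({
    'imagename': [img] * 8,
    'brandname': ['SAMSUNG', 'SAMSUNG', 'DHFL', 'DHFL',
                  'VIVO', 'DHFL', 'DHFL', 'GOIBIBO'],
    'x': [0, 1, 3, 5, 167, 446, 455, 559],
    'y': [490, 491, 450, 451, 319, 349, 342, 212],
    'xdiff': [0, 1, 2, 2, 162, 279, 9, 104],
    'ydiff': [0, 1, -41, 1, -132, 30, -7, -130],
})

# True for the first row of each (imagename, brandname) group
is_first = df.groupby(['imagename', 'brandname']).cumcount().eq(0)

# Assumption: "less than 10" is meant in absolute terms for both columns
near_duplicate = df['xdiff'].abs().lt(10) & df['ydiff'].abs().lt(10)

# Keep group leaders and every row that is not a near duplicate
result = df[is_first | ~near_duplicate]
print(result)
```

On the sample data this keeps rows 0, 2, 4, 5 and 7 in their original order, which matches the rows in the expected output.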

Pandas: Using Append Adds New Column and Makes Another All NaN

I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. If they weren't NY or CA, in the example, I summed them and put them in an 'Other' category. The years here were made from a normalized list of dates (originally mm/dd/yyyy and yyyy-mm-dd) as such, if this is contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year' : 'Total',
'NY': my_df.NY.sum(),
'CA': my_df.CA.sum(),
'Other': my_df.Other.sum(),
'Total': my_df.Total.sum()},
ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.

I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another idea, from a previous answer: add the parameters margins=True and margins_name='Total' to crosstab:
df1 = df.assign(**dct)
out = (pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total'))
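Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the first solution only runs on older versions. A rough equivalent with pd.concat might look like the sketch below; the sample frame is rebuilt from the question, and the variable name total_row is made up for the example.

```python
import pandas as pd

# Sample data rebuilt from the question, with the years as the index
my_df = pd.DataFrame(
    {'NY': [450, 300, 500, 250],
     'CA': [50, 75, 100, 50],
     'Other': [25, 5, 100, 100],
     'Total': [525, 380, 700, 400]},
    index=pd.Index([2003, 2004, 2005, 2006], name='Year'),
)

# Column sums as a one-row frame whose index label is 'Total'
total_row = my_df.sum().rename('Total').to_frame().T

# concat stacks the frames; rename_axis restores the index name
final_df = pd.concat([my_df, total_row]).rename_axis('Year')
print(final_df)
```

The loc-based solution is unaffected by the removal and works unchanged.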

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition. The condition needs to take input from both the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo

You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
After testing it a bit, I changed the conditions and moved them into a custom function:
def f(x):
    # to check the boolean mask, uncomment:
    # print((df.height < x.height) & (df.weight > x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against every row in the dataframe, then sum the resulting boolean mask to count the True values.
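For larger frames, the same pairwise comparison can be sketched without apply by using NumPy broadcasting. This is only an illustration of the idea and it assumes the whole n x n boolean matrix fits in memory; the sample data is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Adam', 'Bill', 'Chris', 'Eric', 'Fred', 'Gary', 'Henry'],
    'height': [70, 65, 71, 72, 74, 75, 61],
    'weight': [180, 190, 150, 210, 160, 220, 230],
})

h = df['height'].to_numpy()
w = df['weight'].to_numpy()

# mask[i, j] is True when person j is shorter AND heavier than person i,
# the same condition as in f above
mask = (h[None, :] < h[:, None]) & (w[None, :] > w[:, None])

# Summing each row of the matrix counts the matches for each person
df['new_column'] = mask.sum(axis=1)
print(df)
```

A more complicated condition can be expressed the same way, as long as it can be written with element-wise operations on the column arrays.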

For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)

I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller AND heavier like so:
>>> for i, height, weight in zip(df.index, df.height, df.weight):
...     cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
...     df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a |, or to switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
