pandas dataframe group with condition - python

I have a 3D dataframe with x and y and time as the 3rd dimension.
The data are 5 indices of satellite images that were taken at different times.
The x and y describe each pixel.
x         y           time            SIPI  classif
7.620001  -77.849990  2018-04-07  1.011107      2.0
                      2018-10-14  1.023407      2.0
                      2018-12-28  0.045107      3.0
                      2020-01-10  0.351107      2.0
                      2018-06-29  0.351107      2.0
          -77.849899  2018-04-07  1.010777      8.0
                      2018-10-14  0.510562      2.0
                      2018-12-28  1.410766      4.0
                      2020-01-10  1.010666      8.0
                      2018-06-29  2.057068      8.0
          -77.849809  2018-04-07  0.986991      1.0
                      2018-10-14  0.986991      8.0
                      2018-12-28  0.986991      5.0
                      2020-01-10  0.984791      5.0
                      2018-06-29  0.986991      3.0
          -77.849718  2018-04-07  0.975965     10.0
                      2018-10-14  0.964765      7.0
                      2018-12-28  0.975965     10.0
                      2020-01-10  0.975965     10.0
                      2018-06-29  0.975965      3.0
          -77.849627  2018-04-07  1.957747      2.0
                      2018-10-14  0.132445      6.0
                      2018-12-28  0.589677      2.0
                      2020-01-10  1.982445      2.0
                      2018-06-29  3.334456      7.0
I need to group the data and, as a new column, take the value from the 'classif' column that is most frequent across the 5 dates. The values are integers between 1 and 10. I want to add a condition that only keeps a value whose frequency is at least 3.
x         y           classif
7.620001  -77.849990      2.0
          -77.849899      8.0
          -77.849809      NaN
          -77.849718     10.0
          -77.849627      2.0
So as a result I need a dataframe where each pixel has the value with the highest frequency, and where there is a NaN whenever that frequency is lower than 3.
Can pandas.groupby do that? I thought about value_counts(), but I'm not sure how to apply it to my dataset.
Thank you in advance!

Here is a clunky way to do it:
import numpy as np
import pandas as pd

# Get the mode per group and count how often it occurs
df_modes = df.groupby(["x", "y"]).agg(
    {
        "classif": [lambda x: pd.Series.mode(x)[0],
                    lambda x: sum(x == pd.Series.mode(x)[0])]
    }
).reset_index()

# Rename the columns to something a bit more readable
df_modes.columns = ["x", "y", "classif_mode", "classif_mode_freq"]

# Discard modes whose frequency was less than 3
df_modes.loc[df_modes["classif_mode_freq"] < 3, "classif_mode"] = np.nan
Now df_modes.drop("classif_mode_freq", axis=1) will return
          x           y  classif_mode
0  7.620001  -77.849990           2.0
1  7.620001  -77.849899           8.0
2  7.620001  -77.849809           NaN
3  7.620001  -77.849718          10.0
4  7.620001  -77.849627           2.0
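Since the question mentions value_counts(), here is a more direct sketch built on it; the helper name mode_if_frequent and the min_count parameter are mine, not from the question:
import numpy as np
import pandas as pd

def mode_if_frequent(s, min_count=3):
    # value_counts() sorts by frequency, so the first entry is the mode
    counts = s.value_counts()
    return counts.index[0] if counts.iloc[0] >= min_count else np.nan

result = df.groupby(["x", "y"])["classif"].agg(mode_if_frequent).reset_index()
This returns one row per pixel directly, with NaN wherever the mode occurs fewer than 3 times.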

Related

Convert summary data (cumulative cases) to daily cases pandas

I have case data that is presented as a time series. The counts are cumulative from one day to the next; what can be used to turn them into daily case counts?
My dataframe in pandas:
data sum_cases (cumulative)
0 2020-05-02 4.0
1 2020-05-03 21.0
2 2020-05-04 37.0
3 2020-05-05 51.0
I want them to look like this:
data sum_cases(cumulative) daily_cases
0 2020-05-02 4.0 4.0
1 2020-05-03 21.0 17.0
2 2020-05-04 37.0 16.0
3 2020-05-05 51.0 14.0
If your DF indeed has the data in date order, then you might be able to get away with:
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)
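A minimal reproducible sketch, assuming the column names from the question:
import pandas as pd

df = pd.DataFrame({
    "data": pd.to_datetime(["2020-05-02", "2020-05-03", "2020-05-04", "2020-05-05"]),
    "sum_cases": [4.0, 21.0, 37.0, 51.0],
})

# shift(fill_value=0) treats the day before the series as zero cases,
# so the first daily value equals the first cumulative value
df["daily_cases"] = df["sum_cases"] - df["sum_cases"].shift(fill_value=0)
An equivalent spelling is df["sum_cases"].diff().fillna(df["sum_cases"]), since diff() leaves NaN in the first row.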

Python: Array-based equation

I have a dataframe 500 rows long by 4 columns. I need Python code that divides each row by the row below it and then multiplies that result by the value in the last row, for every value in each column. Basically, I need to replicate an Excel formula.
It's not clear if your data is stored in a NumPy array. If it is, with the original data contained in an array a, you'd write
b = a[-1] * (a[:-1] / a[1:])
where a[-1] is the last row, a[:-1] is the array without the last row, and a[1:] is the array without the first (index zero) row.
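A quick runnable check of that expression; the array values are made up for illustration:
import numpy as np

a = np.array([[10., 78., 27., 23.],
              [72., 42., 77., 86.],
              [60., 57., 89., 100.]])

# each row is divided by the row below it, then scaled by the last row
b = a[-1] * (a[:-1] / a[1:])
print(b)  # shape (2, 4): one row fewer than the input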
Assuming you are talking about a pandas DataFrame:
import random
import pandas as pd

# sample DataFrame object filled with random floats
df = pd.DataFrame((float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)))
                  for _ in range(10))

def function(col):
    # divide each value by the one below it, scaled by the last value;
    # the last two rows stay unchanged by construction
    result = col.copy()
    for i in range(len(col) - 1):
        result[i] = (col[i] / col[i + 1]) * col[len(col) - 1]
    return result

print(df)  # before the formula is applied
df = df.apply(function)
print(df)  # after the formula is applied
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
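For what it's worth, the same formula can also be vectorized in pandas; this sketch with shift(-1) is my own variation, not the loop above:
# shift(-1) aligns each row with the row below it, and df.iloc[-1]
# (the last row) broadcasts across the columns
result = (df / df.shift(-1)) * df.iloc[-1]
# shifting leaves NaN in the last row, so restore the original values there
result.iloc[-1] = df.iloc[-1]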

Python Stacked bar chart from DF with index dates?

I have created a data frame in python using pandas that has the following output with date being the index:
Date        Daily Anger  Daily Haha  Daily Like  Daily Love  Daily Sad  Daily WoW
2019-08-31            1         2.0       132.0         8.0        0.0        5.0
2019-09-30            0         1.0        41.0         4.0        0.0        0.0
2019-10-31           15         1.0       117.0         4.0        0.0        2.0
2019-11-30            0         3.0        84.0         4.0        0.0        4.0
2019-12-31            2        17.0        98.0        20.0        5.0        7.0
I'm trying to get these values into a stacked bar chart where the x axis is the date and the y axis shows the totals stacked across these metrics.
I've spent the last couple of hours trying to get this to work with Google, with no success. Could anyone help me?
If Date is a column, use the x parameter of DataFrame.plot.bar:
df.plot.bar(x='Date', stacked=True)
If Date is a DatetimeIndex, use only the stacked parameter:
df.plot.bar(stacked=True)
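A minimal self-contained sketch, assuming a DatetimeIndex named Date and a few of the reaction columns from the question:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {"Daily Anger": [1, 0, 15],
     "Daily Haha": [2.0, 1.0, 1.0],
     "Daily Like": [132.0, 41.0, 117.0]},
    index=pd.to_datetime(["2019-08-31", "2019-09-30", "2019-10-31"]),
)
df.index.name = "Date"

# with Date as the index, stacked=True is all that's needed
df.plot.bar(stacked=True)
plt.tight_layout()
plt.show()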

Problems understanding the logic when creating code using groupby, list comprehensions and custom functions

I want to calculate a rolling mean of different window sizes for each ticker in my dataframe. Ideally I could pass a list of window sizes and for each ticker I would get new columns (one for each rolling mean size). So if I wanted a rolling mean of 2 and one of 3, the output would be two columns for each ticker.
import datetime as dt
import numpy as np
import pandas as pd
Dt_df = pd.DataFrame({"Date":pd.date_range('2018-07-01', periods=5, freq='D')})
Tick_df = pd.DataFrame({"Ticker":['ABC',"HIJ","XYZ"]})
Mult_df = pd.merge(Tick_df.assign(key='x'), Dt_df.assign(key='x'), on='key').drop('key', axis=1)
df2 = pd.DataFrame(np.random.randint(low=5, high=10, size=(15, 1)), columns=['Price'])
df3 = Mult_df.join(df2, how='outer')
df3.set_index(['Ticker','Date'],inplace = True)
Running that produces the example dataset: a MultiIndex of Ticker and Date with a Price column.
When I try to apply this function:
def my_RollMeans(x):
    w = [1, 2, 3]
    s = pd.Series(x)
    Bob = pd.DataFrame([s.rolling(w1).mean() for w1 in w]).T
    return Bob
to my dataframe df3 using various versions of apply or transform I get errors.
NewDF = df3.groupby('Ticker').Price.transform(my_RollMeans).fillna(0)
The latest error is:
Data must be 1-dimensional
IIUC, try using apply. I made a modification to your custom function:
def my_RollMeans(x):
    w = [1, 2, 3]
    s = pd.Series(x)
    Bob = pd.DataFrame([s.rolling(w1).mean().rename('Price_' + str(w1)) for w1 in w]).T
    return Bob

df3.groupby('Ticker').apply(lambda x: my_RollMeans(x.Price)).fillna(0)
Output:
Price_1 Price_2 Price_3
Ticker Date
ABC 2018-07-01 9.0 0.0 0.000000
2018-07-02 8.0 8.5 0.000000
2018-07-03 7.0 7.5 8.000000
2018-07-04 8.0 7.5 7.666667
2018-07-05 8.0 8.0 7.666667
HIJ 2018-07-01 8.0 0.0 0.000000
2018-07-02 9.0 8.5 0.000000
2018-07-03 5.0 7.0 7.333333
2018-07-04 6.0 5.5 6.666667
2018-07-05 7.0 6.5 6.000000
XYZ 2018-07-01 9.0 0.0 0.000000
2018-07-02 5.0 7.0 0.000000
2018-07-03 9.0 7.0 7.666667
2018-07-04 8.0 8.5 7.333333
2018-07-05 6.0 7.0 7.666667
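If you'd rather keep the result as ordinary columns aligned with df3's index, a transform-based sketch (my variation, not the answer above) avoids building nested frames:
# one new column per window size; transform preserves the MultiIndex
for w in [1, 2, 3]:
    df3[f'Price_{w}'] = df3.groupby(level='Ticker')['Price'] \
                           .transform(lambda s, w=w: s.rolling(w).mean())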

Pandas — match last identical row and compute difference

With a DataFrame like the following:
timestamp value
0 2012-01-01 3.0
1 2012-01-05 3.0
2 2012-01-06 6.0
3 2012-01-09 3.0
4 2012-01-31 1.0
5 2012-02-09 3.0
6 2012-02-11 1.0
7 2012-02-13 3.0
8 2012-02-15 2.0
9 2012-02-18 5.0
What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 5 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 10 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
The important part of the problem is not necessarily the usage of time delays. Any solution that matches one particular row with the previous row of identical value, and computes something out of those two rows (here, a difference) will be valid.
Note: not interested in apply or loop-based approaches.
A simple, clean and elegant groupby will do the trick:
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
Gives:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
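A reproducible sketch of that one-liner with the first few rows of the question's data; note that timestamp must be a datetime column for diff() to produce Timedeltas:
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-01", "2012-01-05", "2012-01-06", "2012-01-09"]),
    "value": [3.0, 3.0, 6.0, 3.0],
})

# diff() runs within each value group, so every row is compared with
# the previous row that holds the same value
df["time_since_last_identical"] = df.groupby("value")["timestamp"].diff()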
Here is a solution using pandas groupby:
out = df.groupby(df['value'])\
        .apply(lambda x: pd.to_datetime(x['timestamp'], format="%Y-%m-%d").diff())\
        .reset_index(level=0, drop=False)\
        .reindex(df.index)\
        .rename(columns={'timestamp': 'time_since_last_identical'})
out = pd.concat([df['timestamp'], out], axis=1)
That gives the following output:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
It does not exactly match your desired output, but I guess that is a matter of convention (e.g. whether to count the current day or not). Happy to refine if you provide more details.
