python pandas: How dropping items in dateframe - python

I have a huge amount of points in my dateframe, so I would want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some amount of items, based on sampling?
To give more details. My problem is number of values for very close intervals e.g. 1491928756421062 and 1491928756421187
So I have a chart like
And instead I wanted to somehow have a mean value for those close intervals. But maybe grouped by a second...

I would use sample(), but as you said it selects randomly. If you want to take sample according to some logic, for instance, only keeping rows whose value is mean *.9 < value < mean * 1.1, you can try the following code. Actually, it all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
df = pd.read_csv("test.csv", ",", header=None)
mean = df[1].mean()
my_sample = df[(mean *.90 < df[1]) & (df[1] < mean * 1.10)]

You're looking for resample
df.set_index(pd.to_datetime(df.date)).calltime.resample('s').mean()
This is a more complete example
tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()

Related

Pandas: Optimal subtract every nth row

I'm writing a function for a special case of row-wise subtraction in pandas.
First the user should be able to specify rows either by regex (i.e. "_BL[0-9]+") or by regular index i.e every 6th row
Then we must subtract every matching row from rows preceding it, but not past another match
[Optionally] Drop selected rows
Column to match on should be user-defined by either index or label
For example if:
Samples
var1
var1
something
10
20
something
20
30
something
40
30
some_BL20_thing
100
100
something
50
70
something
90
100
some_BL10_thing
100
10
Expected output should be:
Samples
var1
var1
something
-90
-80
something
-80
-70
something
-60
-70
something
-50
60
something
-10
90
My current (incomplete) implementation relies heavily on looping:
def subtract_blanks(data:pd.DataFrame, num_samples:int)->pd.DataFrame:
'''
Accepts a data dataframe and a mod int and
subtracts each blank from all mod preceding samples
'''
expr = compile(r'(_BL[0-9]{1})')
output = data.copy(deep = True)
for idx,row in output.iterrows():
if search(expr,row['Sample']):
for i in range(1,num_samples+1):
output.iloc[idx-i,data_start:] = output.iloc[idx-i,6:]-row.iloc[6:]
return output
Is there a better way of doing this? This implementation seems pretty ugly. I've also considered maybe splitting the DataFrame to chucks and operating on them instead.
Code
# Create boolean mask for matching rows
# m = np.arange(len(df)) % 6 == 5 # for index match
m = df['Samples'].str.contains(r'_BL\d+') # for regex match
# mask the values and backfill to propagate the row
# values corresponding to match in backward direction
df['var1'] = df['var1'] - df['var1'].mask(~m).bfill()
# Delete the matching rows
df = df[~m].copy()
Samples var1 var1
0 something -90.0 -80.0
1 something -80.0 -70.0
2 something -60.0 -70.0
4 something -50.0 60.0
5 something -10.0 90.0
Note: The core logic is specified in the code so I'll leave the function implementation upto the OP.

Create new Pandas Dataframe from observations which meets specific criteria

I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is ideal value,
UL - upper limit,
LL - lower limit
And another one original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm creating a function which desired output is get ID and features which are below or above the threshold (limits from first Dataframe) Till now I'm able to recognise which features are out of limits but I'm getting full output of original Dataframe...
def table(df_limits, df_to_check, column):
UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
UL_index = df_to_check.loc[df_to_check[column] > UL].index
LL_index = df_to_check.loc[df_to_check[column] < LL].index
if UL_index is not None:
above_limit = {'ID': df_to_check['ID'],
'column': df_to_check[column],
'target': df_limits[column].loc['target']}
return pd.DataFrame(above_limit)
What I should change so my desired output would be:
(showing only ID and column where observations are out of limit)
The best if it would show also how many percent of original value is deviate from ideal value target (I would be glad for advices how to add such a column)
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
Now after running this function its returning whole dataset because statement says if not null... which is not null... I understand why I have this issue but I don't know how to change it
Issue is with statement if UL_index is not None: since it returning whole dataset and I'm looking for way how to replace this part
Approach
reshape
merge
calculate
new_df = (df_to_check.set_index("ID").unstack().reset_index()
.rename(columns={"level_0":"column",0:"value"})
.merge(df_limits.T, left_on="column", right_index=True)
.assign(deviate=lambda dfa: (dfa.value-dfa.target)/dfa.target)
)
column
ID
value
target
UL
LL
deviate
feat_1
123
12.5
12
15
9
0.0416667
feat_1
456
18
12
15
9
0.5
feat_1
789
9
12
15
9
-0.25
feat_2
123
9.6
9
10
8
0.0666667
feat_2
456
3
9
10
8
-0.666667
feat_2
789
11
9
10
8
0.222222
feat_3
123
100
90
120
60
0.111111
feat_3
456
100
90
120
60
0.111111
feat_3
789
100
90
120
60
0.111111
First of all, you have not provided a reproducible example https://stackoverflow.com/help/minimal-reproducible-example because you have not shared the code which produces the two initial dataframes. Next time you ask a question, please keep it in mind, Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call dataframe_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc, and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np
df_limits = pd.DataFrame(index =['min val','max val','target'])
df_limits['a']=[2,4,3]
df_limits['b']=[3,5,4.5]
df =pd.DataFrame(columns = df_limits.columns, data =np.random.rand(100,2)*6 )
df_unpiv = pd.melt( df.reset_index().rename(columns ={'index':'id'}), id_vars='id', var_name ='feature', value_name = 'value' )
# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge( df_unpiv, df_limits.transpose().reset_index().rename(columns = {'index':'feature'}), how='left', on ='feature' )
df_joined['abs diff from target'] = abs( df_joined['value'] - df_joined['target'] )
df_joined['outside range'] = (df_joined['value'] < df_joined['min val'] ) | (df_joined['value'] > df_joined['max val'])
df_outside_range = df_joined.query(" `outside range` == True " )
df_inside_range = df_joined.query(" `outside range` == False " )
I solved my issue maybe in bit clumsy way but it works as desired...
If someone have better answer I will still appreciate:
Example how to get only observations above limits, to have both just concatenate observation from UL_index and LL_index
def table(df_limits,df_to_check,column):
above_limit = []
df_above_limit = pd.DataFrame()
UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
UL_index = df_to_check.loc[df_to_check[column] > UL].index
LL_index = df_to_check.loc[df_to_check[column] < LL].index
df_to_check_UL = df_to_check.loc[UL_index]
df_to_check_LL = df_to_check.loc[LL_index]
above_limit = {
'ID': df_to_check_UL['ID'],
'feature value': df_to_check[column],
'target': df_limits[column].loc['target']
}
df_above_limit = pd.DataFrame(above_limit, index = df_to_check_UL.index)
return df_above_limit

How to plot Matplotlib chart which takesvalues from different columns

This is my dataframe
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
So the way I want it to work is plot a chart where X is the time, Y is the absolute price(positive or negative) so we need to have 2 bars. Now, the time should not be from the same row, but from the first row with the same order number.
For ex. The -296.0 would be under time 106, not 111 because 106 was the first under Order nr.1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)
Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this: Group the DataFrame by Order, and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])
If you rather want to rely on position in the data table you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want min value for time you'd better use solution provided by Arne since it would be more safe and correct (provided that you only have one profit value for each order number).

Group pandas dataframe by quantile of single column

Sorry if this is duplicate post - I can't find a related post though
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc of column A and then calculate a aggregate statistic (such as mean) by group. I can define deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
For presentation, below you have just a printout of P limited to
20 rows:
for key, grp in gr:
print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
As a supplement
If you are interested in borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
Then cut returns 2 results (a tuple). The first element (number 0) is
the result returned before, but we are this time interested in the
second element (number 1) - the bin borders.
For your data they are:
array([ 4. , 12.25, 40.5 , 59.5 , 98. ])
So e.g. the first quartile is between 4 and 12.35.
You can use the quantile Series to make another column, to marking each row with its quantile label, and then group by that column. numpy searchsorted is very useful to do this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower values of the quantile intervals, be sure to pass the parameter side='right' to np.searchsorted to not get 0 (the minimum should be 1 or you have one index more than you need).
Now you can elaborate your statistics by doing, for example:
P.groupby('G').agg(['sum', 'mean']) #add to the list all the statistics method you wish

How can I create a grouped bar chart with Matplotlib or Seaborn from a multi-indexed data frame?

I have a problem regarding how I can plot multi-indexed data in a single bar chart. I started with a DataFrame with three columns (artist, genre and miscl_count) and 195 rows. I then grouped the data by two of the columns, which resulted in the table below. My question is, how can I create a bar plot from this, so that the each group in "miscl_count" are shown as three separate bar plots across all five genres (i.e. a total amount of 3x5 bars)? I would also like the genre to identify what color a bar is assigned.
I know that there is unstacking, but I don't understand how I can get this to work with Matplotlib or Seaborn.
The head of the DataFrame, that I perform the groupby method on looks like this:
print(miscl_df.head())
artist miscl_count genre
0 band1 5 a
1 band2 6 b
2 band3 5 b
3 band4 4 b
4 band5 5 b
5 band6 5 c
miscl_df_group = miscl_df.groupby(['genre', 'miscl_count']).count()
print(miscl_df_group)
After group by, the output looks like this:
artist
miscl_count 4 5 6
genre
a 11 9 9
b 19 13 16
c 13 14 16
d 10 9 12
e 21 14 10
Just to make sure I made myself clear, the output should be shown as a single chart (and not as subplots)!
Working solution to be used on the grouped data:
miscl_df_group.unstack(level='genre').plot(kind='bar')
Alternatively, it can also be used this way:
miscl_df_group.unstack(level='miscl_count').plot(kind='bar')
with seaborn, no need to group the data, this is done under the hood:
import seaborn as sns
sns.barplot(x="artist", y="miscl_count", hue="genre", data=miscl_df)
(change the column names at will, depending on what you want)
# full working example
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame()
df["artist"] = list(map(lambda i: f"band{i}", np.random.randint(1,4,size=(100,))))
df["genre"] = list(map(lambda i: f"genre{i}", np.random.randint(1,6,size=(100,))))
df["count"] = np.random.randint(50,100,size=(100,))
# df
# count genre artist
# 0 97 genre9 band1
# 1 95 genre7 band1
# 2 65 genre3 band2
# 3 81 genre1 band1
# 4 58 genre10 band1
# .. ... ... ...
# 95 61 genre1 band2
# 96 53 genre9 band2
# 97 55 genre9 band1
# 98 94 genre1 band2
# 99 85 genre8 band1
# [100 rows x 3 columns]
sns.barplot(x="artist", y="count", hue="genre", data=df)

Categories