Python, Pandas: Reindex/Slice DataFrame with duplicate Index values

Let's consider a DataFrame that contains one row of two values for each day of January 2010:
date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data = np.random.rand(len(date_range),2), index = date_range)
and another time series with sparser data and duplicate index values:
observations = pd.DataFrame(data=np.random.rand(7, 2),
                            index=(dt(2010,1,12), dt(2010,1,18), dt(2010,1,20),
                                   dt(2010,1,20), dt(2010,1,22), dt(2010,1,22), dt(2010,1,28)))
I split the first DataFrame df into a list of 5 DataFrames, each containing one week's worth of data from the original: df_weeks = [g for n, g in df.groupby(pd.TimeGrouper('W'))]
Now I would like to split the data of the second DataFrame by the same 5 weeks, i.e. in this specific case, end up with a variable obs_weeks containing 5 DataFrames spanning the same time ranges as df_weeks, 2 of them being empty.
I tried using reindex such as in this question: Python, Pandas: Use the GroupBy.groups description to apply it to another grouping
and Periods:
p1 =[x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)
dfs=[]
for p in p1:
    dff = observations.truncate(p.start_time, p.end_time)
    dfs.append(dff)
(see this question: Python, Pandas: Boolean Indexing Comparing DateTimeIndex to Period)
The problem is that if some values in the index of observations are duplicates (and this is the case), none of those methods work. I also tried to change the index of observations to a normal column and do the slicing on that column, but I received an error message as well.
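For illustration, here is a minimal sketch of the failure mode (the exact error message varies between pandas versions):
import pandas as pd
from datetime import datetime as dt

obs = pd.DataFrame({'a': [1, 2, 3]},
                   index=[dt(2010,1,20), dt(2010,1,20), dt(2010,1,22)])
# Reindexing an object whose index holds duplicate labels raises, e.g.:
# obs.reindex(pd.date_range(dt(2010,1,18), dt(2010,1,24)))
# ValueError: cannot reindex from a duplicate axis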

You can achieve this by doing a simple filter:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)
dfs = []
for p in p1:
    # boolean masks work fine even when the index holds duplicate labels
    dff = observations.loc[
        (observations.index >= p.start_time) &
        (observations.index < p.end_time)]
    dfs.append(dff)
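For reference, a self-contained sketch of the full round trip; it uses pd.Grouper, which later pandas versions provide in place of the deprecated TimeGrouper, and an explicit to_period('W') since the group keys are plain Timestamps:
import numpy as np
import pandas as pd
from datetime import datetime as dt

date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data=np.random.rand(len(date_range), 2), index=date_range)
observations = pd.DataFrame(data=np.random.rand(7, 2),
                            index=[dt(2010,1,12), dt(2010,1,18), dt(2010,1,20),
                                   dt(2010,1,20), dt(2010,1,22), dt(2010,1,22),
                                   dt(2010,1,28)])

# Weekly periods derived from df, then applied to observations via plain
# boolean masks, which are unaffected by duplicate index labels.
p1 = sorted(x.to_period('W') for x in df.groupby(pd.Grouper(freq='W')).groups.keys())
obs_weeks = [observations[(observations.index >= p.start_time) &
                          (observations.index < p.end_time)]
             for p in p1]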

Related

Pandas - How to groupby, calculate difference between first and last row, calculate max, and select the corresponding group in original frame

I have a Pandas dataframe with two columns I am interested in: A categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time-window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference and selecting the maximum (longest time-window) successfully; however, I'm having trouble finding an expression to select the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
     'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data = d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category that gave rise to the maximum peak-to-peak value within groups (np.ptp computes maximum minus minimum). Then you can index with loc as you said, or query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
"B"
>>> df.query("cat == #max_cat") # or df.loc[df.cat == max_cat]
  cat  val
6   B    0
7   B    5
8   B   10
9   B   20
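The same selection also fits on one line if you prefer; reusing df from the example above:
# Rows of the category with the widest value span (peak-to-peak)
df[df.cat == df.groupby('cat').val.apply(lambda s: s.max() - s.min()).idxmax()]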

PANDAS: Slice pandas series n rows into the past with a datetime index

upper_lower_bound returns 2 datetime indices from my dataframe. I only use one at a time and they have no relation to each other.
I want to get the max() value of the previous 6 rows of data from a dataframe highP, but I get an error if I try to subtract 6 from its index. dt.timedelta(6) subtracts 6 days instead, but there are missing days in the df, so it doesn't give the correct answer.
How can I slice highP so that it gives me the previous six values in that series, e.g. highP.loc[i - 6:i].max(), given that i is a datetime index?
Any help would be greatly appreciated!
upper_lower_bound = df[(isoHL['IH'] >= 1) | (isoHL['IL'] >= 1)].index[-3:-1]
if isoHL.loc[upper_lower_bound[-1]]['IH'] == 1 and isoHL.loc[upper_lower_bound[-1]]['IL'] == 0:
    upper_bound = highP.loc[upper_lower_bound[-1] - dt.timedelta(6):upper_lower_bound[-1]].max()
else:
    pass
How about sorting by time and taking last elements with tail?
For generality, I will use the following notation:
dataframe = df
time column = t
value column = val
certain location = i
First we obtain the rows whose time is <= the time at i:
before_df = df[df['t'] <= df.loc[i, 't']]
Then we sort by time:
before_df = before_df.sort_values(by=['t'])
Then we take the last six with tail (the question asks for six rows):
six_before = before_df.tail(6)
Then we take the max over the value column:
val = six_before['val'].max()
Does this solve your question?
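As a side note, if highP already has a sorted DatetimeIndex, you can skip the filtering and sorting entirely; a minimal sketch, assuming i is a label present in the index:
window = highP.loc[:i].tail(6)        # the six most recent rows at or before i
# window = highP.loc[:i].iloc[-7:-1]  # variant that excludes row i itself
upper_bound = window.max()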

How to fill pandas dataframes in a loop?

I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put those columns which contain that wellname into a new dataframe named the wellname.
I am able to build my new sub dataframes but the dataframes come up empty of size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe for every heading that contains the "well", i.e. a new dataframe for all columns & data whose column name contains N1, another for N2, etc.
The new dataframes populate correctly inside the loop but disappear when the loop ends. A bit of the output of print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go if you want to populate something. In this case, then, if you input well_dict['N1'], you'll get your first dataframe, and so on.
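For example (the shape shown assumes the data described in the question):
n1 = well_dict['N1']   # every column whose name contains 'N1'
print(n1.shape)        # e.g. (21973, 10)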
Rebinding the loop variable does not modify the list you are iterating over. That is, here's what the code is doing, based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the list, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and Groupby.
You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:
   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10
You can use df.columns.str.split('_', expand=True) to parse the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
  N1     N2
   a  b   a   b
0  2  2   3   4
1  7  8   9  10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is rather unnecessary because the data can be accessed from the grouped object grouped.
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
# Group by well name
grouped = df.T.groupby(level=0)
# build list of sub-dataframes
wells = [group.T for _, group in grouped]
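A side benefit of the MultiIndex columns is that each per-well sub-frame can be pulled straight off df, without grouping at all:
df['N1']   # columns a and b for well N1
df['N2']   # likewise for well N2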
Using str.contains to pick out the matching columns in one shot:
df.loc[:, df.columns.str.contains('|'.join(wells))]

Taking a proportion of a dataframe based on column values

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.
You can make a list of the unique values in the column with list(df['type of use'].unique()) and iterate like below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[(df['type of use'] == list(df['type of use'].unique())[i])].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[(df['type of use'] == list(df['type of use'].unique())[i])].sample(frac=0.2)
    i = i + 1
print(dicdf)
This will print a dictionary of the dataframes.
You can print whichever one you'd like to see; for example, for the housing sample: print(dicdf['dfhousing'])
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received on a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
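If you are on pandas 1.1 or newer, note that GroupBy has since gained its own sample method, which performs the per-group draw in one call (rounding of group sizes may differ slightly from the math.ceil version above):
# 20% of the rows within each 'type of use' group, drawn in one call
df_sample = df.groupby('type of use').sample(frac=0.2, random_state=41)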

Pandas Dataframe filtering groups based on dates frequency

I have a pandas data frame with the following columns:
User_id (numeric)| Day (DateTime)| Data (numeric)
What I want is to group by user_id so that I keep just those users for whom I have Data in a period of 15 consecutive days.
Say, if I had data from 01-05 (dd-mm) to 16-05 (dd-mm) the rows referring to that user would be kept.
Ex.:
df1 = pd.DataFrame([['13-01-2018',1], ['14-01-2018',2], ['15-01-2018',3],
                    ['13-02-2018',1], ['14-02-2018',2], ['15-02-2018',3]])
#Apply solution to extract data of first N consecutive dates with N = 3
result.head()
0 1
0 13-01-2018 1
1 14-01-2018 2
2 15-01-2018 3
Don't be afraid to ask for further details! Sorry I couldn't be more specific.
I finally managed to solve it.
At first I thought my solution wouldn't be optimal and would take too much time, but it turned out to work pretty well.
I defined a function that, given an n_days time window, traverses the data looking for a delta fitting the window size (n_days); that is, it looks for n_days consecutive dates.
def consecutive(x, n_days):
    result = None
    if len(x) >= n_days:
        stop = False
        i = n_days
        while stop == False and i < len(x) + 1:
            window = x[i-n_days:i]
            delta = (window[len(window)-1] - window[0]).days
            if delta == n_days - 1:
                stop = True
                result = window
            i = i + 1
    return result
And then call it by using apply.
b=a.groupby('user_id')['day'].unique().apply(lambda x: consecutive(x,15))
df=b.loc[b.apply(lambda x: x is not None)].reset_index()
The next step is to transform the dataframe into one with a row per returned date.
import itertools
import pandas as pd
import numpy as np
def melt_series(s):
    lengths = s.str.len().values
    flat = [i for i in itertools.chain.from_iterable(s.values.tolist())]
    idx = np.repeat(s.index.values, lengths)
    return pd.Series(flat, idx, name=s.name)
df = melt_series(df.day).to_frame().join(df.drop(columns='day')).reindex(columns=df.columns)
And merging with the actual dataframe.
final = pd.merge(a, df[['user_id','day']], on=['user_id','day'])
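On pandas 0.25 or newer, the melt_series helper can be replaced by DataFrame.explode, which performs the same flattening natively; a sketch under the same variable names:
df = b.loc[b.apply(lambda x: x is not None)].reset_index()
df = df.explode('day')   # one row per date in each user's consecutive window
final = pd.merge(a, df[['user_id','day']], on=['user_id','day'])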
