How do I keep the column used as groupby criteria (column 0)? - python

I am grouping a pandas data frame by their value in column 0, which happens to be year, month column (formatted as a float64 like yy,mm).
Before using the groupby function, by dataframe is as follows:
0 1 2 3 4 5 6 7 8 9
0 13,09 0.00 NaN 26.0 5740.0 NaN NaN NaN NaN 26
1 13,09 0.02 NaN 26.0 5738.0 NaN NaN NaN NaN 26
2 13,09 0.00 NaN 26.0 5738.0 NaN NaN NaN NaN 26
3 13,09 0.00 NaN 29.0 NaN NaN NaN NaN NaN 29
4 13,09 0.00 NaN 25.0 NaN NaN NaN NaN NaN 25
After running my groupby code (seen here)
month_year_total = month_year.groupby(0).sum()
I am given the following dataframe
1 2 3 4 5 6 7 8 9
0
13,09 1.55 0.0 383.0 51583.0 0.0 0.0 0.0 0.0 383
13,10 12.56 0.0 2039.0 142426.0 0.0 0.0 0.0 0.0 2039
13,11 0.65 1890.0 1663.0 170038.0 0.0 0.0 0.0 0.0 3553
13,12 1.43 7014.0 1055.0 176217.0 0.0 0.0 0.0 0.0 8069
14,01 1.53 7284.0 856.0 101971.0 0.0 0.0 0.0 0.0 8140
I wish to keep column 0 when converting to numpy, as I intend it to be the x axis of my graph; however, the column is dropped when I convert data types. In fact, I cannot manipulate the column at all, even within the pandas dataframe.
How do I keep this column or add an identical column?

Related

How to extract or split some days from date when date is index or string?

I have 6000 rows and 8 columns, where 'Date' is like index or I can reset index and it would be like first column with string type. I need to Extract the list of 'Lake_Level' values where date of a record is second and seventh day of a month ( and provide top 3 and bottom 3 values of the 'Lake_Level' feature). Please show me how to make it. Thank you in advance.
Date Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Temp Lake_Level Flow_Rate
03/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
04/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
05/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
06/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
07/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
26/06/2021 0.0 0.0 0.0 0.0 0.0 22.50 250.85 0.60
27/06/2021 0.0 0.0 0.0 0.0 0.0 23.40 250.84 0.60
28/06/2021 0.0 0.0 0.0 0.0 0.0 21.50 250.83 0.60
29/06/2021 0.0 0.0 0.0 0.0 0.0 23.20 250.82 0.60
30/06/2021 0.0 0.0 0.0 0.0 0.0 22.75 250.80 0.60
Why don't you just filter rows with your ideal condition?
You can run queries on your dataset using pandas DataFrame like below:
If datetimes are in column
df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
If datetimes are as indexes
df[pd.to_datetime(df.index, dayfirst=True).day.isin([2,7])]
Here is an example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'Date': [random_date() for _ in range(100)],
...: 'Lake_Level': [random.randint(240, 260) for _ in range(100)]
...: })
In [3]: df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
Out[3]:
Date Lake_Level
2 07/08/2004 245
27 02/12/2017 249
30 02/06/2012 252
51 07/10/2013 257

Imputing missing timestamp indices in Pandas?

Here is a snapshot of a dataframe df I'm working with
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
With Timestamp indices spaced by 30 seconds. I'm trying to concatenate a number of rows populated by np.nan values, while keeping with the pattern of 30 second separated Timestamp indices, i.e. something that would look like
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
However, when I apply
df = pd.concat(df, pd.DataFrame(np.array([np.nan, np.nan])))
I'm instead left with
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
My question is: how I can get the timestamp index pattern to continue? Is there something I should specify in the creation of the dataframe to be concatenated, or can I re-index the dataframe shown above?
For a more complete problem statement, I'm working with several time-series dataframes, each of which is hundreds of thousands of rows with the same initial time, varying ending times, and some missing values for each dataframe; I'm trying to get them to match lengths so I can interpolate NaN values by an np.nanmean() at that element's index over all dataframes, which I'm doing by stacking the associated numpy arrays for each dataframe -- applying this averaging procedure across the arrays requires them to have the same dimensions, hence I am filling them with NaNs and interpolating.
After you done concatinating:
Make use of pd.date_range() and index attribute:
df.index=pd.date_range(df.index[0],periods=len(df),freq='30S')
output of df:
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
If it is the case that you have, say, two DF's - df_big and df_small. And the row indices in df_small match the beginning row indices of df_big you could:
Add the NaN rows as you describe above so that the number of rows in df_small matches the number of rows in df_big.
Then copy the index from df_big to df_small.
df_small.index = df_big.index
A different idea:
You could use the time delta between the last two rows to generate new index entries.
Set number of entries to add.
rows_to_add = 2
Create a new and extended index based on your original DF - before you add the NaN rows:
ext_index = list(df.index) + \
[df.index[-1] + (df.index[-1] - df.index[-2]) * x for x in range(1,rows_to_add+1)]
[Timestamp('2014-02-01 09:58:03'),
Timestamp('2014-02-01 09:58:33'),
Timestamp('2014-02-01 09:59:03'),
Timestamp('2014-02-01 09:59:33'),
Timestamp('2014-02-01 10:00:03')]
Then add your NaN rows as in your question. (The same number of rows as the constant rows_to_add).
Then set your new index:
df.index = ext_index
Another idea. (Doesn't directly answer your question but might help). This would be useful in situations where not all of your missing data is at the end of the frame.
Create a DF with date range in index:
df_nan = pd.DataFrame(
index=pd.date_range('2014-02-01 09:58:03',periods=5,freq='30S')
)
Outer join with your smaller DF:
df.join(df_nan, how='outer')
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN

How to i transform a very large dataframe to get the count of values in all columns (without using df.stack or df.apply)

I am working with a very large dataframe (~3 million rows) and i need the count of values from multiple columns, grouped by time related data.
I have tried to stack the columns but the resulting dataframe was very long and wouldn't fit in the memory. Similarly df.apply gave memory issues.
For example if my sample dataframe is like,
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
I have written the following script that does the needed for a small sample but fails in a large dataframe.
df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])
I also have a solution via apply but it has the same complications.
The expected result is,
date 1/1/2014 1/4/2014 ... 1/5/2014 1/4/2014 1/7/2014
abc xyz qwe qwe ... xyz xyz abc qwe
id ...
1 4.0 2.0 1.0 NaN ... NaN NaN NaN NaN
2 2.0 NaN NaN 4.0 ... NaN NaN NaN NaN
3 NaN NaN NaN NaN ... 2.0 NaN NaN NaN
4 NaN NaN NaN NaN ... NaN 1.0 1.0 1.0
I am looking for a more optimized version of what I have written.
Thanks for the help !!
You don't want to use stack. Therefore, another solution is using crosstab on id with each date and fields columns. Finally, concat them together, groupby() the index and sum. Use listcomp on df.columns[2:] to create each crosstab (note: I assume the first 2 columns is id and date as your sample):
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum()
Out[497]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 0.0 0.0 0.0 1.0 4.0 0.0 2.0 0.0 0.0 0.0
3 0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
4 0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
I think showing 0 is better than NaN. However, if you want NaN instead of 0, you just need to chain additional replace as follows:
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum().replace({0: np.nan})
Out[501]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4.0 1.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 NaN NaN NaN 1.0 4.0 NaN 2.0 NaN NaN NaN
3 NaN NaN NaN 2.0 NaN NaN NaN NaN 2.0 NaN NaN
4 NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN 1.0 1.0

Pandas Dataframe interpolating in sections delimited by indexes

My sample code is as follow:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce and with parts that do not provide the information, as the e.g., but there're segments which I know for a fact that they are stopped, such as dawn, but the information come as 'nan' as well.
What I need:
I've been experimenting with dataframe.interpolate() parameters (limit and limit_diretcion) but came up short. If I set df.interpolate(limit=2) I will not only interpolate the data that I need but the data where it shouldn't. So I need to interpolate between sections defined by a limit
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic that I've been trying to apply is basically trying to find nan's and calculating the difference between their indexes and so createing a new dataframe_temp to interpolate and only than add it to another creating a new dataframe_final. But this has become hard to achieve due to the fact that 'nan'=='nan' return False
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
d = df.notna().rolling(limit + 1).agg(any).fillna(1)
d = pd.concat({
i: d.shift(-i).fillna(1)
for i in range(limit + 1)
}).prod(level=1)
return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
This is a partial answer.
for i in list(df):
for x in range(len(df[i])):
if not df[i][x] > -100:
df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True

Indexing columns based on cell value in pandas

I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts that by the average of all the stages before that. Here is a small slice for the df (could have more stages, countries and rows)
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This wont work, but I was thinking something like this:
new_series = []
for country in country_list:
num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
differnce = df.ix[df['race_location'] == country, num_stages] -
df.iloc[:, 0:num_stages-1].mean(axis=1)
new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just he stage columns, then stack and groupby
stages = df.filter(regex='^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack will automatically drop the NaN values when converting to a series.
Now, position -1 is the last value within each group if we grouped by the first level of the new multiindex
So, we use a lambda and calculate the mean with every thing up to the last value x.iloc[:-1].mean()
And subtract that from the last value x.iloc[-1]
subtracts that by the average of all the stages before that
It's not a big deal but I'm just curious! Unlike your desired output but along to your description, if one of the racers finished only one race, shouldn't their result be inf or nan instead of 0? (to specify them from the one who has already done 2~3 race but last race result is exactly same with average of races? like racer #1 vs racer #11~20)
df_sp = df.filter(regex='^stage\d+.*')
df['last'] = df_sp.T.fillna(method='ffill').T.iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN

Categories