I have a Dataset which I need to modify using pandas. Below is the detail of the particular column I need to work on:
df["Dependents"].value_counts()
0 345
1 102
2 101
3+ 51
Name: Dependents, dtype: int64
df["Dependents"].notnull().value_counts()
True 599
False 15
Name: Dependents, dtype: int64
I need to assign the null values 0, 1 or 2 one by one, cycling through them. For example, if the first null row gets 0, the next should get 1, the next 2, and then start again from 0 until all null values are filled.
How can I achieve it?
IIUC you can do it this way:
assuming you have the following DF:
In [214]: df
Out[214]:
Dependents
0 NaN
1 0
2 0
3 0
4 NaN
5 1
6 NaN
7 3+
8 NaN
9 3+
10 2
11 3+
12 1
13 NaN
Solution:
In [215]: idx = df.index[df.Dependents.isnull()]
In [216]: idx
Out[216]: Int64Index([0, 4, 6, 8, 13], dtype='int64')
In [217]: df.loc[idx, 'Dependents'] = np.take(list('012'), [x%3 for x in range(len(idx))])
In [218]: df
Out[218]:
Dependents
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 3+
8 0
9 3+
10 2
11 3+
12 1
13 1
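A more compact way to write the same fill, as a sketch that keeps the values as strings to match the existing '3+' entries, is to let np.resize repeat the three values cyclically to the required length:
idx = df.index[df.Dependents.isnull()]
# np.resize repeats ['0', '1', '2'] over and over until it has len(idx) elements
df.loc[idx, 'Dependents'] = np.resize(list('012'), len(idx))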
Similar to MaxU's answer, but using numpy put with 'wrap' mode.
Sample dataframe (df):
Dependents
0 NaN
1 0
2 0
3 0
4 NaN
5 1
6 NaN
7 3+
8 NaN
9 3+
10 2
11 3+
12 1
13 NaN
idx = df.index[df.Dependents.isnull()]
np.put(df.Dependents, idx, [0, 1, 2], mode='wrap')
Dependents
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 3+
8 0
9 3+
10 2
11 3+
12 1
13 1
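Note that np.put modifies the Series in place and works with flat positions rather than index labels; here the two happen to coincide because the frame has a default RangeIndex. A positional sketch that stays safe for any index (hypothetical, not part of the original answer):
# positions (not labels) of the nulls
pos = np.flatnonzero(df.Dependents.isnull())
col = df.columns.get_loc('Dependents')
df.iloc[pos, col] = np.take(['0', '1', '2'], np.arange(len(pos)) % 3)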
Related
I have a pandas data frame that consists of 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example, the second column is something like (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3), and I want to sort it to look like (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using to sort is df = res.sort([2], ascending=True), but it sorts the column to (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks
How about this: sort by the cumcount and then by the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
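For what it's worth, the same interleaved order can be produced in one step with numpy's lexsort, whose last key is the primary sort key (a sketch, assuming the column is named "s" as above):
import numpy as np
# primary key: per-value cumulative count; secondary key: the value itself
order = np.lexsort((df["s"].values, df.groupby("s").cumcount().values))
df_sorted = df.iloc[order].reset_index(drop=True)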
I'm attempting to do a rolling count on a dataframe. The problem that I am having is specifying the condition since it is a string, not an integer. The dataframe below is a snippet, along with a snippet of a dictionary.
GameID Event
0 100 NaN
1 100 NaN
2 100 Ben
3 100 NaN
4 100 Steve
5 100 Ben
6 100 NaN
7 100 Steve
8 100 NaN
9 100 NaN
10 101 NaN
11 101 NaN
12 101 Joe
13 101 NaN
14 101 Will
15 101 Joe
16 101 NaN
17 101 Will
18 101 NaN
19 101 NaN
gamedic = {'100':['Ben','Steve'], '101':['Joe','Will']}
Ultimately, I would want the dataframe to look like the following. I named the columns Ben and Steve for this example but in reality they will be First and Second, corresponding to their place in the dictionary.
GameID Event Ben Steve
0 100 NaN 0 0
1 100 NaN 0 0
2 100 Ben 0 0
3 100 NaN 1 0
4 100 Steve 1 0
5 100 Ben 1 1
6 100 NaN 2 1
7 100 Steve 2 1
8 100 NaN 2 2
9 100 NaN 2 2
10 101 NaN 0 0
11 101 NaN 0 0
12 101 Joe 0 0
13 101 NaN 1 0
14 101 Will 1 0
15 101 Joe 1 1
16 101 NaN 2 1
17 101 Will 2 1
18 101 NaN 2 2
19 101 NaN 2 2
pd.rolling_count(df.Event, 1000,0).shift(1)
ValueError: could not convert string to float: Steve
I'm not sure if this is a complicated problem or if I'm missing something obvious in pandas. The whole string concept makes it tough for me to even get going.
First you want to use your dictionary to get a column containing just "First" and "Second". I can't think of a clever way to do this, so let's just iterate over the rows:
import numpy as np
df['Winner'] = np.nan
for i, row in df.iterrows():
    key = str(row.GameID)  # gamedic keys are strings in the example
    if row.Event == gamedic[key][0]:
        df.loc[i, 'Winner'] = 'First'
    elif row.Event == gamedic[key][1]:
        df.loc[i, 'Winner'] = 'Second'
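A vectorized alternative to the loop, as a sketch (again assuming the gamedic keys are strings, as in the example dictionary):
# look up the first and second name for each row's game
first = df['GameID'].astype(str).map(lambda g: gamedic[g][0])
second = df['GameID'].astype(str).map(lambda g: gamedic[g][1])
winner = pd.Series(np.nan, index=df.index, dtype=object)
winner[df.Event == first] = 'First'
winner[df.Event == second] = 'Second'
df['Winner'] = winner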
You can use pd.get_dummies to convert a string column (representing a categorical variable) to indicator variables; in your case this will give you
pd.get_dummies(df.Winner)
Out[46]:
First Second
0 0 0
1 0 0
2 1 0
3 0 0
4 0 1
5 1 0
6 0 0
7 0 1
8 0 0
9 0 0
10 0 0
11 0 0
12 1 0
13 0 0
14 0 1
15 1 0
16 0 0
17 0 1
18 0 0
19 0 0
You can add these onto your original dataframe with pd.concat:
df = pd.concat([df,pd.get_dummies(df.Winner)],axis=1)
Then you can get your cumulative sums with groupby.cumsum as in #Brian's answer
df.groupby('GameID').cumsum()
Out[60]:
First Second
0 0 0
1 0 0
2 1 0
3 1 0
4 1 1
5 2 1
6 2 1
7 2 2
8 2 2
9 2 2
10 0 0
11 0 0
12 1 0
13 1 0
14 1 1
15 2 1
16 2 1
17 2 2
18 2 2
19 2 2
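One caveat: the question's desired table counts events before the current row, while the cumulative sums above include the current row. If that distinction matters, a per-game shift lines things up (a sketch, assuming the dummy columns are named First and Second as above):
counts = df.groupby('GameID')[['First', 'Second']].cumsum()
# shift within each game so each row only sees prior events
df[['First', 'Second']] = counts.groupby(df['GameID']).shift(1).fillna(0).astype(int)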
Is this what you're looking for?
df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
                  columns=['A'])
df
A
0 a
1 a
2 a
3 b
4 b
5 a
df.groupby('A').cumcount()
0 0
1 1
2 2
3 0
4 1
5 3
dtype: int64
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html
Consider the following dataframe:
index count signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I need to 'ffill' the NaNs in 'signal', and rows with different 'count' values should not affect each other, so that I get the following dataframe:
index count signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
Right now I iterate through each data frame in group by object and fill NAN value and then copy to a new data frame:
new_table = np.array([])
for key, group in df.groupby('count'):
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if new_table.shape[0] == 0:
        new_table = group1
    else:
        new_table = pd.concat([new_table, group1])
which kind of works, but is really slow, considering the data frame is large. I am wondering if there is any other method to do it, with or without groupby. Thanks!
EDITED:
Thanks to Alexander and jwilner for providing alternative methods. However, both methods are very slow for my big dataframe, which has 800,000 rows of data.
Use the apply method.
In [56]: df = pd.DataFrame({"count": [1] * 4 + [2] * 5 + [3] * 2 , "signal": [1] + [None] * 4 + [-1] + [None] * 5})
In [57]: df
Out[57]:
count signal
0 1 1
1 1 NaN
2 1 NaN
3 1 NaN
4 2 NaN
5 2 -1
6 2 NaN
7 2 NaN
8 2 NaN
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
In [58]: def ffill_signal(df):
....: df["signal"] = df["signal"].ffill()
....: return df
....:
In [59]: df.groupby("count").apply(ffill_signal)
Out[59]:
count signal
0 1 1
1 1 1
2 1 1
3 1 1
4 2 NaN
5 2 -1
6 2 -1
7 2 -1
8 2 -1
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
However, be aware that groupby reorders things. If the count column doesn't always stay the same or increase, but can instead have values repeated in it, groupby might be problematic. That is, given a count series like [1, 1, 2, 2, 1], groupby will group like so: [1, 1, 1], [2, 2], which could have undesirable effects on your forward filling. If that were undesired, you'd have to create a new series to use with groupby that always stays the same or increases according to changes in the count series, probably using pd.Series.diff and pd.Series.cumsum, as in the sketch below.
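A sketch of that run-based grouping key (each contiguous run of identical count values gets its own block id, so repeated but non-adjacent values stay in separate groups):
block = (df['count'] != df['count'].shift()).cumsum()
df['signal'] = df.groupby(block)['signal'].ffill()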
I know it's very late, but I found a solution that is much faster than those proposed, namely to collect the updated dataframes in a list and do the concatenation only at the end. To take your example:
new_table = []
for key, group in df.groupby('count'):
    group = group.copy()
    group['signal'] = group['signal'].fillna(method='ffill')
    new_table.append(group)
new_table = pd.concat(new_table).reset_index(drop=True)
An alternative solution is to create a pivot table, forward fill values, and then map them back into the original DataFrame.
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c]
                for i, c in zip(df2.index, df['count'].tolist())]
>>> df
count index signal
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 -1
4 1 5 -1
5 2 6 NaN
6 2 7 -1
7 2 8 -1
8 3 9 NaN
9 3 10 NaN
10 3 11 NaN
11 4 12 1
12 4 13 1
13 4 14 1
With 800k rows of data, the efficacy of this approach depends on how many unique values are in 'count'.
Compared to my prior answer:
%%timeit
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df.loc[df['count'] == c, 'signal'].ffill()
100 loops, best of 3: 4.1 ms per loop
%%timeit
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] for i, c in zip(df2.index, df['count'].tolist())]
1000 loops, best of 3: 1.32 ms per loop
Lastly, you can simply use groupby, although it is slower than the previous method:
df.groupby('count').ffill()
Out[191]:
index signal
0 1 1
1 2 1
2 3 1
3 4 -1
4 5 -1
5 6 NaN
6 7 -1
7 8 -1
8 9 NaN
9 10 NaN
10 11 NaN
11 12 1
12 13 1
13 14 1
%%timeit
df.groupby('count').ffill()
100 loops, best of 3: 3.55 ms per loop
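As with the other approaches, groupby().ffill() returns a new object, so the result still has to be assigned back to the original frame, e.g. (a sketch):
df['signal'] = df.groupby('count')['signal'].ffill()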
Assuming the data has been pre-sorted on df['index'], try using loc instead:
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df.loc[df['count'] == c, 'signal'].ffill()
>>> df
index count signal
0 1 1 1
1 2 1 1
2 3 1 1
3 4 1 -1
4 5 1 -1
5 6 2 NaN
6 7 2 -1
7 8 2 -1
8 9 3 NaN
9 10 3 NaN
10 11 3 NaN
11 12 4 1
12 13 4 1
13 14 4 1
I have a pandas series that looks like this:
>>> x.sort_index()
2 1
5 2
6 3
8 4
I want to fill out this series so that the "missing" index rows are represented, filling in the data values with a 0.
So that when I list the new series, it looks like this:
>>> z.sort_index()
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
I have tried creating a "dummy" Series
>>> y = pd.Series([0 for i in range(0,8)])
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
And then concat'ing them together - but the results are either:
>>> pd.concat([x,y],axis=0)
2 1
5 2
6 3
8 4
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Or
>>> pd.concat([x,y],axis=1)
0 1
0 NaN 0
1 NaN 0
2 1 0
3 NaN 0
4 NaN 0
5 2 0
6 3 0
7 NaN 0
8 4 NaN
Neither of which is my target structure listed above.
I could try performing some arithmetic on the axis=1 version, taking a sum of columns 1 and 2, but I am looking for a neater, one-line version of this. Does such an index filling/cleansing operation exist, and if so, what is it?
What you want is a reindex. First create the index as you want (in this case just a range), and then reindex with it:
In [64]: x = pd.Series([1,2,3,4], index=[2,5,6,8])
In [65]: x
Out[65]:
2 1
5 2
6 3
8 4
dtype: int64
In [66]: x.reindex(range(9), fill_value=0)
Out[66]:
0 0
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
dtype: int64
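If the target index is exactly 1 through 8 as in the question, the same call works with range(1, 9); the range passed to reindex just has to list the labels wanted in the result (a sketch):
z = x.reindex(range(1, 9), fill_value=0)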
Apologies, slightly embarrassing situation, but having read here about what to do in this situation, I am offering an answer to my own question.
I read the documentation here; one way of doing what I'm looking for is this:
>>> x.combine_first(y)
0 0
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
dtype: float64
N.B. in the above,
>>> y = pd.Series([0 for i in range(0,8)])
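Note that combine_first returns float64 here (visible in the dtype above) because the index alignment introduces intermediate NaNs; if integers are needed, the result can simply be cast back (a sketch):
z = x.combine_first(y).astype(int)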
Background:
I'm working with 8-band multispectral satellite imagery and estimating water depth from reflectance values. Using statsmodels, I've come up with an OLS model that will predict depth for each pixel based on the 8 reflectance values of that pixel. In order to work easily with the OLS model, I've stuck all the pixel reflectance values into a pandas dataframe formatted like the one in the example below, where each row represents a pixel and each column is a spectral band of the multispectral image.
Due to some pre-processing steps, all the on-shore pixels have been transformed to all zeros. I don't want to try and predict the 'depth' of those pixels so I want to restrict my OLS model predictions to the rows that are NOT all zero values.
I will need to reshape my results back to the row x col dimensions of the original image so I can't just drop the all zero rows.
Specific Question:
I've got a Pandas dataframe. Some rows contain all zeros. I would like to mask those rows for some calculations but I need to keep the rows. I can't figure out how to mask all the entries for rows that are all zero.
For example:
In [1]: import pandas as pd
In [2]: import numpy as np
# my actual data has about 16 million rows so
# I'll simulate some data for the example.
In [3]: cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
In [4]: rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
In [5]: zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
In [6]: df = pd.concat((rdf,zdf)).reset_index(drop=True)
In [7]: df
Out[7]:
band1 band2 band3 band4 band5 band6 band7 band8
0 9 9 8 7 2 7 5 6
1 7 7 5 6 3 0 9 8
2 5 4 3 6 0 3 8 8
3 6 4 5 0 5 7 4 5
4 8 3 2 4 1 3 2 5
5 9 7 6 3 8 7 8 4
6 6 2 8 2 2 6 9 8
7 9 4 0 2 7 6 4 8
8 1 3 5 3 3 3 0 1
9 4 2 9 7 3 5 5 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
[13 rows x 8 columns]
I know I can get just the rows I'm interested in by doing this:
In [8]: df[df.any(axis=1)==True]
Out[8]:
band1 band2 band3 band4 band5 band6 band7 band8
0 9 9 8 7 2 7 5 6
1 7 7 5 6 3 0 9 8
2 5 4 3 6 0 3 8 8
3 6 4 5 0 5 7 4 5
4 8 3 2 4 1 3 2 5
5 9 7 6 3 8 7 8 4
6 6 2 8 2 2 6 9 8
7 9 4 0 2 7 6 4 8
8 1 3 5 3 3 3 0 1
9 4 2 9 7 3 5 5 0
[10 rows x 8 columns]
But I need to reshape the data again later so I'll need those rows to be in the right place. I've tried all sorts of things including df.where(df.any(axis=1)==True) but I can't find anything that works.
Fails:
df.any(axis=1)==True gives me True for the rows I'm interested in and False for the rows I'd like to mask, but when I try df.where(df.any(axis=1)==True) I just get back the whole data frame, complete with all the zeros. I want the whole data frame, but with all the values in those zero rows masked; as I understand it, they should show up as NaN, right?
I tried getting the indexes of the rows with all zeros and masking by row:
mskidxs = df[df.any(axis=1)==False].index
df.mask(df.index.isin(mskidxs))
That didn't work either; it gave me:
ValueError: Array conditional must be same shape as self
The .index is just giving an Int64Index back. I need a boolean array the same dimensions as my data frame and I just can't figure out how to get one.
Thanks in advance for your help.
-Jared
The process of clarifying my question has led me, in a roundabout way, to finding the answer. This question also helped point me in the right direction. Here's what I figured out:
import numpy as np
import pandas as pd
# Set up my fake test data again. My actual data is described
# in the question.
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
df = pd.concat((zdf,rdf)).reset_index(drop=True)
# View the dataframe. (sorry about the alignment, I don't
# want to spend the time putting in all the spaces)
df
band1 band2 band3 band4 band5 band6 band7 band8
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 6 3 7 0 1 7 1 8
4 9 2 6 8 7 1 4 3
5 4 2 1 1 3 2 1 9
6 5 3 8 7 3 7 5 2
7 8 2 6 0 7 2 0 7
8 1 3 5 0 7 3 3 5
9 1 8 6 0 1 5 7 7
10 4 2 6 2 2 2 4 9
11 8 7 8 0 9 3 3 0
12 6 1 6 8 2 0 2 5
13 rows × 8 columns
# This is essentially the same as item #2 under Fails
# in my question. It gives me the indexes of the rows
# I want unmasked as True and those I want masked as
# False. However, the result is not the right shape to
# use as a mask.
df.apply(lambda row: any([i != 0 for i in row]), axis=1)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
dtype: bool
# This is what actually works. By setting broadcast to
# True, I get a result that's the right shape to use.
land_rows = df.apply(lambda row: any([i != 0 for i in row]), axis=1,
                     broadcast=True)
land_rows
Out[92]:
band1 band2 band3 band4 band5 band6 band7 band8
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1
11 1 1 1 1 1 1 1 1
12 1 1 1 1 1 1 1 1
13 rows × 8 columns
# This produces the result I was looking for:
df.where(land_rows)
Out[93]:
band1 band2 band3 band4 band5 band6 band7 band8
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 6 3 7 0 1 7 1 8
4 9 2 6 8 7 1 4 3
5 4 2 1 1 3 2 1 9
6 5 3 8 7 3 7 5 2
7 8 2 6 0 7 2 0 7
8 1 3 5 0 7 3 3 5
9 1 8 6 0 1 5 7 7
10 4 2 6 2 2 2 4 9
11 8 7 8 0 9 3 3 0
12 6 1 6 8 2 0 2 5
13 rows × 8 columns
Thanks again to those who helped. Hopefully the solution I found will be of use to somebody at some point.
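For what it's worth, a shorter route to the same masked frame, assuming a copy is acceptable, is to select the all-zero rows with a boolean Series and assign NaN to them directly (a sketch):
masked = df.copy()
masked.loc[~df.any(axis=1)] = np.nan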
I found another way to do the same thing. There are more steps involved but, according to %timeit, it is about 9 times faster. Here it is:
def mask_all_zero_rows_numpy(df):
    """
    Take a dataframe, find all the rows that contain only zeros
    and mask them. Return a dataframe of the same shape with all
    NaN rows in place of the all-zero rows.
    """
    no_data = -99
    arr = df.as_matrix().astype(np.int16)  # df.values in newer pandas
    # make a row full of the 'no data' value
    replacement_row = np.array([no_data for x in range(arr.shape[1])], dtype=np.int16)
    # find out which rows are all zeros
    mask_rows = ~arr.any(axis=1)
    # replace those all-zero rows with all 'no_data' rows
    arr[mask_rows] = replacement_row
    # create a masked array with the no_data value masked
    marr = np.ma.masked_where(arr == no_data, arr)
    # turn the masked array into a data frame
    mdf = pd.DataFrame(marr, columns=df.columns)
    return mdf
The result of mask_all_zero_rows_numpy(df) should be the same as Out[93]: above.
It is not clear to me why you cannot simply perform the calculations on only a subset of the rows:
np.average(df[1][:11])
to exclude the zero rows.
Or you can just make calculations on a slice and read the computed values back into the original dataframe:
dfs = df[:10]
dfs['1_deviation_from_mean'] = pd.Series([abs(np.average(dfs[1]) - val) for val in dfs[1]])
df['deviation_from_mean'] = dfs['1_deviation_from_mean']
Alternatively, you could create a list of the index points you want to mask, and then make calculations using numpy masked arrays, created by using the np.ma.masked_where() method and specifying to mask the values in the index positions:
row_for_mask = [row for row in df.index if all(df.loc[row] == 0)]
masked_array = np.ma.masked_where(df[1].index.isin(row_for_mask), df[1])
np.mean(masked_array)
The masked array looks like this:
0      5
1      0
2      0
3      4
4      4
5      4
6      3
7      1
8      0
9      9
10    --
11    --
12    --
Name: 1, dtype: object