Python Pandas Select Index Value Referencing String Value in a Column

I have a dataframe which is date-sequenced and has 'x' values in one column when there is new information on a particular date.
I want to get the index value of the row for the date before the most recent new-information date, so I can reference that row's data for further operations.
So my dataframe looks like this:
original_df
index  date          value  newinfo
0      '2007-12-01'  75     NaN
1      '2007-12-02'  75     NaN
2      '2007-12-03'  83     x
3      '2007-12-04'  83     NaN
4      '2007-12-05'  83     NaN
5      '2007-12-06'  47     x
6      '2007-12-07'  47     NaN
7      '2007-12-08'  47     NaN
8      '2007-12-09'  47     NaN
So I'm interested in referencing the row where original_df.index == 4 for some further operations.
The only way I can think of doing it is very 'clunky'. Basically I create another dataframe by filtering my original for rows where newinfo == 'x', take the index value of the last row, subtract 1, and use that value to access various columns in that row of the original dataframe using iloc. Code looks like this:
interim_df = original_df[original_df['newinfo']=='x']
index_ref_value = interim_df.index[-1] - 1
This returns an index_ref_value of 4.
I can then access value in original_df as follows:
original_df.iloc[index_ref_value,1]
In other words, I'm accessing the value for 2007-12-05, the day before the most recent newinfo.
This gets the job done but strikes me as complicated and sloppy. Is there a cleaner, easier, more Pythonic way to find the index_ref_value I'm looking for?

you can combine iloc and loc into one statement:
original_df.iloc[original_df.loc[original_df['newinfo'] == 'x'].index - 1]
The loc statement selects the rows where the condition holds (where newinfo is 'x') and .index takes their index values. iloc then shifts those indexes back by one and gives you the rows you are looking for.
Judging from your question, you may need a list of these values in the future. Try:
original_df.iloc[original_df.loc[original_df['newinfo'] == 'x'].index - 1].index.tolist()
edit, to get the desired output:
original_df.iloc[original_df.loc[original_df['newinfo'] == 'x'].index[-1] - 1]
# the index expression alone already evaluates to the index_ref_value of 4,
# so if you only want that number, no iloc is needed:
original_df.loc[original_df['newinfo'] == 'x'].index[-1] - 1
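Putting it together, a minimal self-contained sketch of the cleaner approach (reconstructing the example frame from the question; note that subtracting 1 from an index label only works because the index is the default 0..n-1 RangeIndex):

import numpy as np
import pandas as pd

original_df = pd.DataFrame({
    'date': pd.date_range('2007-12-01', periods=9).strftime('%Y-%m-%d'),
    'value': [75, 75, 83, 83, 83, 47, 47, 47, 47],
    'newinfo': [np.nan, np.nan, 'x', np.nan, np.nan, 'x', np.nan, np.nan, np.nan],
})

# index label of the row just before the most recent 'x'
index_ref_value = original_df[original_df['newinfo'] == 'x'].index[-1] - 1
print(index_ref_value)                             # 4
print(original_df.loc[index_ref_value, 'value'])   # 83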

Related

Pandas apply function and update copy of dataframe

I have data frames
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
dfcopy = df.copy()
I want to apply a function to the values in df columns 'B' and 'C' based on the value in col 'A', and then store the result in the corresponding rows of dfcopy.
For example, for each row where 'A' is 1, get the 'B' and 'C' values for that row, apply the function, and store the results in dfcopy. For the first row where 'A'==2 (row 1), the value for 'B' is 21 and 'C' is 5. Assume the function is multiplication by a 2x2 ones matrix: np.dot(np.ones((2,2)), np.array([[21],[5]])). Then we want dfcopy.loc[1,'B'] = 26 and dfcopy.loc[1,'C'] = 26. Then I want to repeat for each distinct value in A until the function has been applied once per value of A.
Lastly, I don't want to iterate row by row, check the value in A, and apply the function each time. This is because there is one operation to do per value of A (i.e. the np.ones((2,2)) will be replaced by values read from a file corresponding to the value in A), and I don't want to repeat that lookup for every row.
I'm sure I can force a solution (e.g. by looping and setting values), but I'm guessing there is an elegant way to do this with the Pandas API. I just can't find it.
In the example below I picked different matrices so it's obvious that I have applied them.
import pandas as pd

df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
# column labels on the matrices keep the "B"/"C" names after .dot()
matrices = [None, pd.DataFrame([[1,0],[0,0]], index=["B","C"], columns=["B","C"]), pd.DataFrame([[0,0],[0,1]], index=["B","C"], columns=["B","C"])]
df[["B","C"]] = pd.concat(df[df["A"] == i][["B","C"]].dot(matrices[i]) for i in set(df["A"]))
   A   B  C   D
0  1  20  0  99
1  2   0  5  98
2  2   0  6  97
3  1  32  0  96
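If the values in A are not small consecutive integers, a dictionary keyed by the values of A reads better than a positional list; a minimal sketch of the same idea under that assumption:

import pandas as pd

df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
matrices = {1: pd.DataFrame([[1,0],[0,0]], index=["B","C"], columns=["B","C"]),
            2: pd.DataFrame([[0,0],[0,1]], index=["B","C"], columns=["B","C"])}

# apply each group's matrix to its B/C block, then let assignment align on the index
df[["B","C"]] = pd.concat(df.loc[df["A"] == a, ["B","C"]].dot(m) for a, m in matrices.items())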

How to select certain values based on a condition in a data frame?

I have a dataframe called df that looks like this:
Date        Reading1  Reading2  Reading3  Reading4
2000-05-01        15        13        14        11
2000-05-02        15        14        18         9
2000-05-03        14        12        15         8
2000-05-04        17        11        16        13
I used df.set_index('Date') to make the date the index.
I have 3 questions.
1) How do I display the number of days that had a reading greater than 13 in the entire data frame not just in a single column?
I tried df.[(df.Reading1:df.Reading4>13)].shape[0] but obviously the syntax is wrong.
2) How do I display the values that happened on 2000-05-03 for columns Readings 1, 3, and 4?
I tried df.loc[["20000503"],["Reading1","Reading3","Reading4"]]
but i got the error "None of the Index(['20000503'],dtype='object')] are in the [index]"
3) How do I display the dates for which the values in column Reading1 are twice as much as those in column Reading2? And how do I display those values (the ones in Reading1 that are twice as big) as well?
I have no idea where to even start this one.
Try this:
1. (df > 13).any(axis=1).sum()
Create a boolean dataframe, then check whether any value in each row is True, and sum the rows to get the number of days.
2. df.loc['2000-05-03', ['Reading1', 'Reading3', 'Reading4']]
Use partial string indexing on the DatetimeIndex to get a day, then column filtering with a list of column headers.
3. df.loc[df['Reading1'] > (df['Reading2'] * 2)].index
df.loc[df['Reading1'] > (df['Reading2'] * 2)].to_numpy().tolist()
Create a boolean series for boolean indexing, and take the index to get the dates. Then convert the dataframe to a numpy array and call tolist to get the values.
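For reference, a minimal self-contained sketch of all three answers (assuming Date is parsed into a DatetimeIndex, which partial string indexing requires):

import pandas as pd

df = pd.DataFrame(
    {'Reading1': [15, 15, 14, 17], 'Reading2': [13, 14, 12, 11],
     'Reading3': [14, 18, 15, 16], 'Reading4': [11, 9, 8, 13]},
    index=pd.to_datetime(['2000-05-01', '2000-05-02', '2000-05-03', '2000-05-04']))
df.index.name = 'Date'

print((df > 13).any(axis=1).sum())                                 # 1) number of days: 4
print(df.loc['2000-05-03', ['Reading1', 'Reading3', 'Reading4']])  # 2) one day, three columns
mask = df['Reading1'] > df['Reading2'] * 2                         # 3) Reading1 more than twice Reading2
print(df.loc[mask].index, df.loc[mask].to_numpy().tolist())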

Accessing Pandas groupby() function

I have the below data frame after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
                           amount
property month date room
1        6     6    2    3195.000
               12   3    2977.000
               18   2    3195.000
               24   3    3581.000
               36   2    3146.000
                    3    3321.500
               42   2    3096.000
                    3    3580.000
               54   2    3195.000
                    3    3580.000
               60   2    3000.000
               66   3    3810.000
               78   2    3000.000
               84   2    3461.320
                    3    2872.800
               90   2    3461.320
                    3    3580.000
               96   2    3534.000
                    3    2872.800
               102  3    3581.000
               108  3    3580.000
               114  2    3195.000
My objective is to track the median amount based on (property, month, date, room).
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list in big_list:
    # I want to get the median amount for the row that matches test_list
How do I do this?
What I did was try the below...
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
    if test_list == j:
        break
    count += 1
Now, after getting the count, how do I access the median amount by count in the dataframe? Is there a way to access a dataframe by index?
Please note:
big_list is the list of lists where each list is [property, month, date, room] from the above dataframe
test_list is an incoming list to be matched against big_list.
Answering the last question:
Is there a way to access a dataframe by index?
Of course there is: you should use df.iloc or df.loc.
If you want to access purely by integer position (I guess this is the situation), use iloc; if by label, for example a string index, use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume that 'amount' is the median you are searching for, then.
You can use reset_index() method on grouped dataframe, like
train_frame_reset = train_frame.reset_index()
and then you can again access your column names, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount'] <- will give you median
If I understand your problem correctly, you don't need to count at all; you can access the values via loc directly.
Look at:
A = pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]], columns=['lev0','lev1','val'])
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median for the group lev0=6 and lev1=5 can be done via:
test.loc[(6, 5)]
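Applied to the grouped frame from the question, the whole test_list works as a tuple key on the four-level index (values taken from the printed output above):

test_list = [1, 6, 36, 2]          # [property, month, date, room]
median_amount = train_frame.loc[tuple(test_list), 'amount']   # 3146.0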

Index a DataFrame with a list and return NaN for out-of-bounds indices in Pandas?

Normally when I index a DataFrame (or a Series) with a list of integer indices, I get back a subset of the rows, unless some of my indices are out of bounds, in which case I get an IndexError:
s = pd.Series(range(4))
0    0
1    1
2    2
3    3
s.iloc[[1,3]]
1    1
3    3
s.iloc[[1,3,5]]
IndexError
But I'd like to get back a DataFrame (or Series) having an index identical to the list I queried with (i.e., parallel to the query list), with (the rows corresponding to) any out-of-bounds indices filled in with NaN :
s.something[[1,3,5]]
1      1
3      3
5    NaN
I don't think join tricks work because those want to operate on the DataFrame index (or columns). As far as I can tell there's not even an "iget" integer-based get method if I wanted to manually loop over the indices myself. That leaves something like:
indices = [1,3,5]
pd.Series([s.iloc[i] if 0 <= i < len(s) else np.nan for i in indices], index=indices)
Is that the best Pandas 0.18 can do?
You can use reindex to achieve this:
In [119]: s.reindex([1,3,5])
Out[119]:
1      1
3      3
5    NaN
dtype: float64
This will use the passed index and return existing values or NaN.
Thanks to @EdChum for inspiration, the general solution is:
s.reset_index(drop=True).reindex([1,3,5])
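A quick sketch of why the reset_index step matters: reindex matches labels, not positions, so it only behaves positionally once the index is the default 0..n-1:

import pandas as pd

s = pd.Series(range(4), index=[10, 11, 12, 13])
print(s.reindex([1, 3, 5]))                         # all NaN: 1, 3 and 5 are not labels here
print(s.reset_index(drop=True).reindex([1, 3, 5]))  # 1 -> 1.0, 3 -> 3.0, 5 -> NaN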

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to a certain userid, filling missing etime values with zeros, as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is, for some ids, there exists exactly one entry, and thus the .loc method returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation, this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (columns are: 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' values all of the N corresponding entries exist, but for most of them, entries are missing for some 'etime'.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a Series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing the index fails. Checking dimensions and the index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
full_etime_range = range(10)
df = DataFrame(index=[0,0,1],
               columns=['etime','requested'],
               data=[[0,1],[1,1],[1,1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print tmp
So, starting with df being your dataframe, we can do the following safely:
In [215]: df.set_index([df.index, 'etime'], inplace=True)
In [216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In [225]: df0.join(df.loc[0])
Out[225]:
   requested
0          1
1          1
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8        NaN
9        NaN
In [230]: df1 = DF.copy()
In [231]: df1.join(df.loc[1])
Out[231]:
   requested
0        NaN
1          1
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8        NaN
9        NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(0).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
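My guess at why: inside apply, each group x still carries the full two-level (id, etime) index, so the join cannot align it with DF's flat etime index. Dropping the group level first seems to work:

df.groupby(level=0).apply(lambda x: DF.copy().join(x.reset_index(level=0, drop=True)))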
Are you just trying to fill NaNs? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)
Should do the trick. But if you want to fill NaNs just for the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve
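For what it's worth, the one-row shape problem can also be avoided at selection time: passing a one-element list to .loc returns a one-row DataFrame instead of a Series. A minimal sketch with the names from the question:

user_entries = D.loc[[userid]]   # note the list: the result is a DataFrame even for a single row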
