Accessing Pandas groupby() function - python

I have the below data frame with me after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
                          amount
property month date room
1        6    6    2     3195.000
              12   3     2977.000
              18   2     3195.000
              24   3     3581.000
              36   2     3146.000
                   3     3321.500
              42   2     3096.000
                   3     3580.000
              54   2     3195.000
                   3     3580.000
              60   2     3000.000
              66   3     3810.000
              78   2     3000.000
              84   2     3461.320
                   3     2872.800
              90   2     3461.320
                   3     3580.000
              96   2     3534.000
                   3     2872.800
              102  3     3581.000
              108  3     3580.000
              114  2     3195.000
My objective is to look up the median amount for a given (property, month, date, room) combination.
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list in big_list:
    # I want to get the median amount for the row that matches test_list
How do I do this?
What I did is, tried the below...
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
    if test_list == j:
        break
    count += 1
Now, after getting the count, how do I access the median amount by count in the dataframe? Is there a way to access a dataframe by index?
Please note:
big_list is the list of lists where each list is [property, month, date, room] from the above dataframe
test_list is an incoming list to be matched against big_list.

Answering the last question:
Is there a way to access a dataframe by index?
Of course there is: you should use df.iloc or df.loc.
It depends on how you want to access the rows: purely by integer position (I guess this is your situation), in which case you should use iloc, or by a label index (for example a string), in which case you can use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume that 'amount' is the median you are searching for, then.
You can use the reset_index() method on the grouped dataframe, like:
train_frame_reset = train_frame.reset_index()
Then you can access your column names again, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount']  # will give you the median
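A minimal runnable sketch of that lookup. The frame here is a two-row stand-in modeled on the question's data (the values are copied from the (2, 6, 36, *) rows above), and the matching is done with a vectorized comparison instead of the manual count loop:

```python
import pandas as pd

# Small stand-in for the question's frame (values taken from the posted output)
icon = pd.DataFrame({
    'property': [2, 2], 'month': [6, 6],
    'date': [36, 36], 'room': [2, 3],
    'amount': [3146.0, 3321.5],
})
train_frame = icon.groupby(['property', 'month', 'date', 'room']).median()

# Flatten the MultiIndex back into ordinary columns
train_frame_reset = train_frame.reset_index()

# Find the row matching an incoming [property, month, date, room] list
test_list = [2, 6, 36, 2]
mask = (train_frame_reset[['property', 'month', 'date', 'room']] == test_list).all(axis=1)
j = train_frame_reset.index[mask][0]
print(train_frame_reset.iloc[j]['amount'])  # 3146.0
```

Comparing the four key columns against the list broadcasts element-wise per column, so `.all(axis=1)` marks exactly the matching row.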

If I understand your problem correctly you don't need to count at all, you can access the values via loc directly.
Look at:
A=pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]],columns=(['lev0','lev1','val']))
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median for the group lev0=6 and lev1=5 can be done via:
test.loc[6,5]
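For reference, here is that lookup end to end as a self-contained snippet:

```python
import pandas as pd

# The answer's example frame, reproduced
A = pd.DataFrame([[5, 6, 9], [5, 7, 10], [6, 3, 11], [6, 5, 12]],
                 columns=['lev0', 'lev1', 'val'])
test = A.groupby(['lev0', 'lev1']).median()

# loc on the MultiIndex: the group lev0=6, lev1=5
median_val = test.loc[(6, 5), 'val']
print(median_val)  # 12.0
```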


changing index of 1 row in pandas

I have the below df built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
here is the table
Since you can't insert the row and push others back directly, a clever trick you can use is create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
       numbers
weeks
1      181519.23
2       18507.58
3       11342.63
4        6064.06
53       4597.90
After:
       numbers
weeks
53       4597.90
1      181519.23
2       18507.58
3       11342.63
4        6064.06
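The trick can be checked end to end with a small self-contained frame (values copied from the tables above; note the index here is integer, so the comparison uses 53 rather than '53'):

```python
import pandas as pd

# The 'Before' table
df = pd.DataFrame({'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
                  index=pd.Index([1, 2, 3, 4, 53], name='weeks'))

# Temporary ordering column, with week 53 forced to the front
df['new'] = range(1, len(df) + 1)
df.loc[df.index == 53, 'new'] = 0
df = df.sort_values('new').drop('new', axis=1)
print(df.index.tolist())  # [53, 1, 2, 3, 4]
```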
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])

Conditionally dropping columns in a pandas dataframe

I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df I know I have 880 unique well_id's with entries ranging from 4 to 60k+. I know I should end up with 102 well_id's.
I tried to accomplish this in a very naïve way by collecting the wells that I am trying to remove in an array and using a loop but I keep getting a 'TypeError: Level type mismatch' but when I just use del without a for loop it works.
# this works
del df[164301.0]
del df['TB-0071']

# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis=1, thresh=1000)  # keep only columns with at least 1000 non-NA values
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace=True if you want the changes to be made in place.
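A self-contained sketch of dropping sparse columns with dropna (the frame, column names, and counts here are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: 'c' has only 500 non-null entries
df = pd.DataFrame({
    'a': np.arange(2000.0),
    'b': [1.0] * 1500 + [np.nan] * 500,
    'c': [np.nan] * 1500 + [2.0] * 500,
})

# Keep only columns with at least 1000 non-NA values
filtered = df.dropna(axis=1, thresh=1000)
print(list(filtered.columns))  # ['a', 'b']
```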
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
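That expression can be checked on a made-up frame where one column has few unique values and the others have many:

```python
import pandas as pd

# Toy frame: 'from' has 1 unique value, 'to' and 'freq' have 1500 each
df = pd.DataFrame({
    'from': ['A'] * 1500,
    'to': list(range(1500)),
    'freq': list(range(1500, 3000)),
})

# Columns with more than 1000 unique values, in the answer's style
counts = (df.nunique()[df.nunique() > 1000]
            .to_frame('uCounts').reset_index()
            .rename(columns={'index': 'colName'}))
print(counts['colName'].tolist())  # ['to', 'freq']
```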

Finding the next value corresponding to a criteria

First, here's part of my data frame
Dates Order Value
0 2010-12-07T10:00:00.000000000Z In 70
1 2010-12-07T14:00:00.000000000Z Out 70
2 2010-12-08T06:00:00.000000000Z In 31
3 2010-12-09T02:00:00.000000000Z In 48
4 2010-12-09T10:00:00.000000000Z In 29
5 2010-12-09T10:00:00.000000000Z In 59
6 2010-12-09T10:00:00.000000000Z Out 31
7 2010-12-09T14:00:00.000000000Z Out 29
8 2010-12-09T14:00:00.000000000Z In 32
9 2010-12-10T06:00:00.000000000Z In 1
10 2010-12-10T10:00:00.000000000Z Out 48
In this code, I'm trying to find a few things:
The first occurrence of an 'In' in the dataframe. For that, I'm using:
index_1 = df[df.Order=='In'].first_valid_index()
This will result in 0, that's correct.
Then, I'll find the corresponding Value for that index with:
order_1 = df.at[index_1,'Value']
This will result in 70, also correct.
Find the NEXT time the value 70 appears in this dataframe. This is the part I'm struggling with. The values in Value only repeat once, and the second time it appears will always be on a Out.
Can anyone help me finish this part of the code?
Using idxmax with boolean indexing:
val = df.loc[df['Order'].eq('In').idxmax(), 'Value']
df[df['Value'].eq(val) & df['Order'].eq('Out')]
Dates Order Value
1 2010-12-07T14:00:00.000000000Z Out 70
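As a runnable check on the first rows of the question's frame:

```python
import pandas as pd

# First rows of the question's data
df = pd.DataFrame({
    'Order': ['In', 'Out', 'In', 'In'],
    'Value': [70, 70, 31, 48],
})

# Value of the first 'In' row, then the 'Out' row where it reappears
val = df.loc[df['Order'].eq('In').idxmax(), 'Value']
out_rows = df[df['Value'].eq(val) & df['Order'].eq('Out')]
print(val, out_rows.index.tolist())  # 70 [1]
```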
IIUC, we can use index filtering with isin
val = df[
(df["Value"].isin(df[df["Order"].eq("In")]["Value"].head(1)))
& (df["Order"].eq("Out"))
]
print(val)
Dates Order Value
1 2010-12-07T14:00:00.000000000Z Out 70
You can do the following, given that you succeeded in extracting the first index and its value:
import pandas as pd

df = pd.DataFrame({'value': [70, 70, 10, 10, 50, 60, 70]})
index_1 = 0
order_1 = 70
indices = df.index[df['value'] == order_1].tolist()  # [0, 1, 6]
next_index = indices[indices.index(index_1) + 1]  # index of the next occurrence

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine:
df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code below retains the columns and gives the size of each group:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to keep it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
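A deterministic variant of the same calculation, with toy values small enough to check by hand:

```python
import pandas as pd

# Toy data: task 0 has durations 10 and 30, task 1 has 5
df = pd.DataFrame({'task_id': [0, 0, 1], 'duration': [10.0, 30.0, 5.0]})

# Merge each row with its group sum (suffixes _x/_y), then divide
merged = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
merged['pct'] = merged['duration_x'].divide(merged['duration_y']) * 100
merged = merged.drop('duration_y', axis=1)
print(merged['pct'].tolist())  # [25.0, 75.0, 100.0]
```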
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})

# find each task_id as a relative percentage of the whole
df1['pct'] = df1['duration'] / (df1['duration'].sum())
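An alternative sketch that keeps every original row, using transform to broadcast each group's sum back onto the rows (column names follow the question; the data is made up):

```python
import pandas as pd

# Hypothetical task data
df1 = pd.DataFrame({'task_id': [1, 1, 2], 'duration': [10.0, 20.0, 30.0]})

# transform('sum') returns one value per row, so no columns are dropped
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')
df1['pct'] = df1['duration'] / df1['duration'].sum() * 100
print(df1['total'].tolist())  # [30.0, 30.0, 30.0]
```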

search for a set of values on rows (& not |) from a column

I'm new to python and I'm trying to find the entries from the first column that have, in the second column, all of the entries I'm searching for. For example: if I want entries {155, 137}, I expect to get 5 and 6 from column id1 in return.
id1 id2
----------
1. 10
2. 10
3. 10
4. 9
5. 137
5. 150
5. 155
6. 10
6. 137
6. 155
....
I've searched a lot on Google but couldn't solve it. I read these entries from an Excel file. I tried multiple for loops, but it doesn't look nice because I'm searching for a lot of entries.
I tried this:
df = pd.read_excel('path/temp.xlsx') #now I have two Columns and many rows
d1 = df.values.T[0].tolist()
d2 = df.values.T[1].tolist()
d1[d2.index(115) & d2.index(187)& d2.index(276) & d2.index(239) & d2.index(200) & d2.index(24) & d2.index(83)]
and it returned 1
I started to work this week, so I'm very new
Assume you have two lists for the IDs (one for id1 and one for id2), and the lists correspond to each other (that is, the value at index i in list1 corresponds to the value at index i of list2).
In that case, you simply have to find the index of the element you want to search for, and the corresponding index in the other list will be the answer to your query.
To get the index of the element, you can use Python's inbuilt feature to get an index, namely:
list.index(<element>)
It will return the zero-based index of the element you wanted in the list.
To get the corresponding ID from id1, you can simply use this index (because of one-one correspondence). In your case, it can be written as:
id1[id2.index(137)] #it will return 5
NOTE:
index() method will return the index of the first matching entry from the list.
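With the question's table as two parallel lists, that looks like:

```python
# id1/id2 pairs copied from the question's table
id1 = [1, 2, 3, 4, 5, 5, 5, 6, 6, 6]
id2 = [10, 10, 10, 9, 137, 150, 155, 10, 137, 155]

# index() finds the first position of 137 in id2;
# the same position in id1 gives the corresponding entry
print(id1[id2.index(137)])  # 5
```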
best to use pandas
import pandas as pd
import numpy as np
Random data:
n = 10
I = [i for i in range(1, 7)]
df1 = pd.DataFrame({'Id1': [I[np.random.randint(len(I))] for i in range(n)],
                    'Id2': np.random.randint(0, 1000, n)})
df1.head(5)
Id1 Id2
0 4 170
1 6 170
2 6 479
3 4 413
4 6 52
Query using
df1.loc[~df1.Id2.isin([170,479])].Id1
Out[344]:
3 4
4 6
5 6
6 3
7 1
8 5
9 6
Name: Id1, dtype: int64
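Since the question asks for id1 values whose id2 entries contain all of the searched values (expecting 5 and 6 for {155, 137}), the AND semantics can be sketched with groupby and set containment:

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    'id1': [1, 2, 3, 4, 5, 5, 5, 6, 6, 6],
    'id2': [10, 10, 10, 9, 137, 150, 155, 10, 137, 155],
})

wanted = {155, 137}

# Keep the ids whose set of id2 values covers every wanted value
matches = df.groupby('id1')['id2'].apply(lambda s: wanted.issubset(set(s)))
result = matches[matches].index.tolist()
print(result)  # [5, 6]
```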
for now, I've solved it by doing this
