I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df, I know I have 880 unique well_id's, with entry counts ranging from 4 to 60k+. I know I should end up with 102 well_id's.
I tried to accomplish this in a very naïve way by collecting the wells I want to remove in an array and deleting them in a loop, but I keep getting a 'TypeError: Level type mismatch'. When I just use del without a for loop, it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(thresh=1000, axis=1)  # thresh specifies how many non-NA values a column needs in order to be kept
The advantage of this method is that you don't need to create a list.
Also, don't forget to add the usual inplace=True if you want the changes made in place.
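For concreteness, here is a minimal sketch on a toy frame (the column names and the threshold of 5 are made up for the example; you would use 1000 on your data):
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': range(10), 'b': [1] + [np.nan]*9})
trimmed = toy.dropna(axis=1, thresh=5)  # keep only columns with at least 5 non-NA values
print(trimmed.columns.tolist())  # ['a']; column 'b' has only 1 non-NA value and is dropped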
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_id = [164301.0, 'TB-0071']
df.drop(columns=unwanted_id, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
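As a side note, if some labels in unwanted_id might not be present in the frame, drop also accepts errors='ignore' so missing labels are skipped instead of raising (a small sketch reusing the list from above):
df.drop(columns=unwanted_id, errors='ignore', inplace=True)  # silently skips labels that aren't in df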
And to get those column names with more than 1000 unique values, you can use something like this:
ucounts = df.nunique()
counts = ucounts[ucounts > 1000].to_frame('uCounts').reset_index().rename(columns={'index': 'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
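And if the goal is literally to remove every column with fewer than 1000 non-null entries (as in the original question), you can skip building a list of names entirely; a minimal sketch, assuming df is the pivoted frame:
df = df.loc[:, df.count() >= 1000]  # df.count() gives the number of non-NA entries per column
# or, equivalently:
# df = df.dropna(axis=1, thresh=1000)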
Related
I have the below data frame with me after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
amount
property month date room
1 6 6 2 3195.000
12 3 2977.000
18 2 3195.000
24 3 3581.000
36 2 3146.000
3 3321.500
42 2 3096.000
3 3580.000
54 2 3195.000
3 3580.000
60 2 3000.000
66 3 3810.000
78 2 3000.000
84 2 3461.320
3 2872.800
90 2 3461.320
3 3580.000
96 2 3534.000
3 2872.800
102 3 3581.000
108 3 3580.000
114 2 3195.000
My objective is to track the median amount based on the (property, month, date, room)
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list in big_list:
    # I want to get the median amount wrt the row that matches test_list
How do I do this?
What I did was try the below...
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
    if test_list == j:
        break
    count += 1
Now, after getting the count, how do I access the median amount by that count in the dataframe? Is there a way to access a dataframe by index?
Please note:
big_list is the list of lists where each list is [property, month, date, room] from the above dataframe
test_list is an incoming list to be matched against big_list.
Answering the last question:
Is there a way to access a dataframe by index?
Of course there is: you can use df.iloc or df.loc.
It depends on whether you want to access purely by integer position (I guess this is your situation), in which case you should use iloc, or by label (for example a string index), in which case you can use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume that 'amount' is the median you are searching for, then.
You can use the reset_index() method on the grouped dataframe, like
train_frame_reset = train_frame.reset_index()
and then you can again access your column names, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount']  # gives you the median
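Putting it together, a minimal sketch under the assumption that big_list is built from the train_frame_reset above and that the index levels are plain integers as in the printout:
big_list = train_frame_reset[['property', 'month', 'date', 'room']].values.tolist()
test_list = [1, 6, 36, 2]  # a combination that appears in the sample output above
if test_list in big_list:
    j = big_list.index(test_list)  # positional index of the matching row
    median_amount = train_frame_reset.iloc[j]['amount']  # 3146.0 for this row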
If I understand your problem correctly, you don't need to count at all; you can access the values via loc directly.
Look at:
A = pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]], columns=['lev0','lev1','val'])
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median for the group lev0=6 and lev1=5 can be done via:
test.loc[6,5]
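Applied to your train_frame, which has a four-level index (property, month, date, room), the same idea would look roughly like this, using a combination visible in the printout above:
train_frame.loc[(1, 6, 36, 2), 'amount']  # loc on a MultiIndex takes a tuple, one value per level; 3146.0 here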
I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since it has duplicates, the edit will affect the original rows as well.
Is there any method to first create the duplicates and then modify data only in the duplicates created?
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only on the duplicates created, in this case the bottom two rows. But now I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd
df = pd.read_excel('so.xlsx')
a = df['code'] == 1234          # whatever condition selects the rows you want to duplicate
b = df[a].copy()                # copy, so editing b does not warn or touch the original rows
b['code'] = 9999                # edit the "code" column on the duplicates only (example value)
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
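One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the same pattern is written with pd.concat instead (same assumptions as above):
df = pd.concat([df, b], ignore_index=True)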
I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine: df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code below takes care of retaining the columns by using reset_index; note that as written it returns group counts rather than a sum:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to see it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
Disclaimer: all of the data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
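For what it's worth, the merge can also be avoided entirely: groupby().transform broadcasts the per-group sum back onto the original rows, so the percentage becomes a one-liner (a sketch using the same df as in the setup above):
df['pct'] = df['duration'] / df.groupby('task_id')['duration'].transform('sum') * 100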
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)  # just to check the column names
# find each task_id as a relative percentage of the whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)  # this last line is a no-op; df1 is already a DataFrame
I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but without collapsing the track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that refers to some function, which there isn't unless you defined or imported one. transform will take either a function or a string that it maps to a known method, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred to passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
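To get the new column the question asks for, just assign the transformed series back (this is the line above combined with the assignment from the question):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')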
If I have a pandas dataframe such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm fairly new to Python and coding, so I don't know whether this is even possible.
Edit: I should also mention this is not for a class or anything; it's just something I'm doing on my own, and it will be run on a very large dataset. I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-two average, so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a is in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # mean of the last two rows in this group
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe with just your results, if you need it, plus the same values merged back into the main dataframe as a new column. Someone else posted a more succinct way to get the results dataframe; there's probably no reason to do it the longer way shown here unless you also need to perform more operations like this inside the same loop.
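As a final note, the merge step can also be skipped: groupby().transform with a small lambda broadcasts each group's last-two mean straight onto the original rows (a sketch using the same df as above):
df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())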