I have a dataset of the form:
A B C D label
6 2 6 8 0
2 5 3 6 0
4 3 4 9 1
5 7 5 5 1
6 4 5 8 0
Each row carries a label, and the labels repeat in blocks further down the data: there are 7 distinct labels across 7,000 rows. If I do
df.loc[df['label'] == 0]
I get every row labeled 0. What I want is only the first block of rows labeled 0: if the first 10 rows are labeled 0, I want just those 10 rows and not the later rows that are also labeled 0.
We may need a helper column here that numbers each run of consecutive equal labels:
df=df.assign(new=df.label.diff().ne(0).cumsum())
df[df.new==df.groupby('label').new.transform('min')]
Out[206]:
A B C D label new
0 6 2 6 8 0 1
1 2 5 3 6 0 1
2 4 3 4 9 1 2
3 5 7 5 5 1 2
Save each label's first block to a list:
s=df[df.new==df.groupby('label').new.transform('min')]
l=[df1 for _, df1 in s.groupby('label')]
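The same idea can be sketched end to end without adding a column to df. This is only an illustration on the five sample rows from the question; the names block, first_block, first_runs and first_zero_run are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [6, 2, 4, 5, 6],
    "B": [2, 5, 3, 7, 4],
    "C": [6, 3, 4, 5, 5],
    "D": [8, 6, 9, 5, 8],
    "label": [0, 0, 1, 1, 0],
})

# Number each run of consecutive equal labels: 1, 1, 2, 2, 3
block = df["label"].ne(df["label"].shift()).cumsum()

# For every row, the number of the earliest block that carries its label
first_block = block.groupby(df["label"]).transform("min")

# Keep only rows that sit in the first block of their label
first_runs = df[block == first_block]
print(first_runs)

# Only the first run of label 0
first_zero_run = first_runs[first_runs["label"] == 0]
print(first_zero_run)
```

Comparing against shift() instead of using diff() also works when the labels are strings rather than numbers.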
Related
Given a pandas data frame, how can I get the first row for each unique value in a column?
for example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
The result, when taking the first row for each value of column key, should be:
a b key
0 1 2 1
3 4 5 2
8 9 2 3
P.S. Code to build the DataFrame:
df = pd.DataFrame([{'a':1,'b':2,'key':1},
{'a':2,'b':3,'key':1},
{'a':3,'b':3,'key':1},
{'a':4,'b':5,'key':2},
{'a':5,'b':6,'key':2},
{'a':6,'b':6,'key':2},
{'a':7,'b':2,'key':1},
{'a':8,'b':2,'key':1},
{'a':9,'b':2,'key':3}])
drop_duplicates does this. By default it keeps the first row of each set of duplicates; the keep parameter changes that (keep='last' keeps the last row, keep=False drops every duplicated row).
df = df.drop_duplicates('key')
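A quick check on the sample frame from the question, as a minimal sketch. The variable names first_rows, last_rows and via_groupby are only for illustration; the groupby/head variant is an alternative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "b": [2, 3, 3, 5, 6, 6, 2, 2, 2],
    "key": [1, 1, 1, 2, 2, 2, 1, 1, 3],
})

# First row for each distinct key (the default, keep="first")
first_rows = df.drop_duplicates("key")
print(first_rows)

# Last row for each distinct key instead
last_rows = df.drop_duplicates("key", keep="last")

# Alternative: take the first row of every group, keeping original order
via_groupby = df.groupby("key", sort=False).head(1)
```

Note that rows 6 and 7 (key 1 again) are dropped as well, because drop_duplicates looks at the whole column and not at consecutive runs.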
The image shows the test dataset I am using to verify that the right averages are being calculated.
I want to get the average of the values in the 'G' column that correspond to the filtered values in the 'T' column.
So I set a condition on the 'T' column, sum the matching values in the 'G' column, and divide the total by the count to get an average, which is appended to a list.
However, the average is not calculated correctly. See below:
total=0
g_avg=[]
output=[]
counter=0
for i, row in df_new.iterrows():
    if (row['T'] > 2):
        counter+=1
        total+=row['G']
    if (counter != 0 and row['T']==10):
        g_avg.append(total/counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data: the 'T' values repeat, so I need a counter to average the G values whenever T falls in a certain range, e.g. from 2 am to 10 am.
Sorry, it won't let me paste the dataset, so I've taken a screenshot of it.
If you want the average of the column G values where T is strictly between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
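For instance, on a small frame (the values here are assumed purely for illustration, since the original data was only posted as an image), the mask selects the rows with T equal to 3, 4, 5 and 6:

```python
import pandas as pd

# Illustrative data, not the asker's dataset
df_new = pd.DataFrame({
    "T": [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "G": [0, 0, 0, 1, 3, 0, 4, 5, 0, 6, 7],
})

# Rows where T is strictly between 2 and 7
in_range = (df_new["T"] > 2) & (df_new["T"] < 7)
avg = df_new.loc[in_range, "G"].mean()
print(avg)

# For integer T the same rows can be selected with inclusive bounds
same = df_new.loc[df_new["T"].between(3, 6), "G"].mean()
```

The parentheses around each comparison matter, because & binds more tightly than > and <.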
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
# inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False
m = df['T'].between(0,5,inclusive='neither')
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g)['G'].mean().tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0,5,inclusive='neither')
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1
1 2 2 2
2 3 3 2
3 3 1 2
4 3 2 2
5 10 4 4
6 2 5 5
7 2 5 5
8 2 5 5
9 10 5 5
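Both methods can be put together in one self-contained sketch. To stay independent of the pandas version, this variant builds the mask with gt/lt instead of between; the names in_range, run_id and run_means are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "T": [0, 2, 3, 3, 3, 10, 2, 2, 2, 10],
    "G": [1, 2, 3, 1, 2, 4, 5, 5, 5, 5],
})

# True where T is strictly between 0 and 5
in_range = df["T"].gt(0) & df["T"].lt(5)

# A new run starts wherever the mask flips, so cumsum labels each run
run_id = in_range.ne(in_range.shift()).cumsum()

# Method 1: one mean per run of in-range rows
run_means = df.loc[in_range, "G"].groupby(run_id[in_range]).mean().tolist()
print(run_means)

# Method 2: broadcast each run's mean back onto its rows
df["G_new"] = df.groupby(run_id)["G"].transform("mean")
print(df)
```

Because the mean is a float, G_new comes back as a float column even when every group mean is a whole number.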
I'm trying to create a historical time series for a number of identifiers and a number of different metrics. As part of that, I'm trying to create a MultiIndex DataFrame and then "fill it" with the individual DataFrames.
Multi Index:
ID1 ID2
ITEM1 ITEM2 ITEM1 ITEM2
index
Dataframe to insert
ITEM1 ITEM2
Date
a
b
c
Looking through the official docs and this website, I found the following relevant:
"Add single index data frame to multi index data frame, Pandas, Python", and the associated official pandas docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
I've managed with something like:
for i in df1.index:
    for j in df2.columns:
        df1.loc[i,(ID,j)]=df2.loc[i,j]
but it seems highly inefficient when I need to do this across roughly 100 DataFrames.
For some reason a simple
df1.loc[i,(ID)]=df2.loc[i]
doesn't seem to work, and neither does:
df1[ID1]=df1.append(df2)
which raises "Cannot set a frame with no defined index and a value that cannot be converted to a Series".
My understanding from looking around is that this is because I'm effectively leaving half the DataFrame empty (a ragged list?).
Any help on how to iteratively populate my MultiIndex DataFrame would be greatly appreciated.
Let me know if I've missed any relevant information.
Cheers.
Setup
df1 = pd.DataFrame(
[[1, 2, 3, 4, 5, 6] * 2] * 3,
columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 1 2 3 4 5 6
1 1 2 3 4 5 6 1 2 3 4 5 6
2 1 2 3 4 5 6 1 2 3 4 5 6
df2
0 1 2 3
0 2 4 6 8
1 2 4 6 8
2 2 4 6 8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that loc accepts a plain array as long as its shape matches the target. We do need to make sure the shape matches and that the rows and columns are in the right order, so I use align with join='right' to conform df2 to df1.ID1 and then assign the values of the aligned df2.
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 2
Or, we can give df2 the additional level of column indexing that it needs to line up with df1, then use update to replace the relevant cells in place.
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 3
A creative way using stack and assign with unstack
df1.stack().assign(ID1=df2.stack()).unstack()
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
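If the goal is to assemble the whole frame from many per-identifier frames (the "circa 100 dataframes" in the question), it may be simpler to skip the empty MultiIndex frame and build the result in one step with pd.concat, whose dict keys become the outer column level. This is a sketch under assumptions, not one of the options above: the frames dict, the dates and the values are all made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=3, name="Date")

# Hypothetical per-identifier frames, each with the same metric columns
frames = {
    "ID1": pd.DataFrame({"ITEM1": [1, 2, 3], "ITEM2": [4, 5, 6]}, index=dates),
    "ID2": pd.DataFrame({"ITEM1": [7, 8, 9], "ITEM2": [10, 11, 12]}, index=dates),
}

# Dict keys become the outer column level, the frames' columns the inner one
combined = pd.concat(frames, axis=1)
print(combined)

# Selecting an identifier returns its original single-level frame
print(combined["ID2"])
```

Frames whose dates do not fully overlap are aligned on the union of the indexes, with NaN where a frame has no row for a date.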
I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
3 2 1 2
4 2 2 3
5 2 5 6
6 3 8 9
7 3 4 0
8 3 9 7
I want to group by id, so rows 0-2 would form group 1, rows 3-5 group 2, and rows 6-8 group 3. I know I can use df.groupby() to group rows by id. In this case, group 2 is a repetition of group 1 (columns A and B are identical in both).
What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
This is one way using a helper column and pd.Series.drop_duplicates.
The idea is to first create a mapping from id to a tuple of values representing all rows for that id. Then drop duplicates and extract the index of the remainder.
df['C'] = list(zip(df['A'], df['B']))
s = df.groupby('id')['C'].apply(tuple)\
.drop_duplicates().index
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]
print(res)
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
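The same "signature per group" idea can also be written as an explicit loop, which avoids the helper column. This is a sketch of an alternative rather than the code above; seen, keep_ids and signature are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "A": [1, 2, 5, 1, 2, 5, 8, 4, 9],
    "B": [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

seen = set()      # signatures of groups already kept
keep_ids = []     # ids whose group content has not appeared before

for group_id, group in df.groupby("id", sort=False):
    # A hashable fingerprint of the group's rows, in order
    signature = tuple(group[["A", "B"]].itertuples(index=False, name=None))
    if signature not in seen:
        seen.add(signature)
        keep_ids.append(group_id)

res = df[df["id"].isin(keep_ids)]
print(res)
```

Two groups count as duplicates only if they have the same rows in the same order; sort each group first if row order should not matter.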
Check pd.crosstab
s=pd.crosstab(df.id,[df.A,df.B]).drop_duplicates().unstack()
s[s!=0].reset_index().drop(columns=0)
Out[128]:
A B id
0 1 2 1
1 2 3 1
2 4 0 3
3 5 6 1
4 8 9 3
5 9 7 3
I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
issuer_id winner_id gov
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 2), ( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill in these values from the test_out list where the entry in the gov column determines the labeled dataframe in test_out to draw from. Then I look up the winner_id and issuer_id in the id-partition dataframe and write them into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
*edit - added another sentence and fixed test_out code
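One possible way to use merge here, as a sketch under assumptions: stack the labeled frames into a single lookup table that carries the label as a gov column, then left-merge once per role. The names lookup, role_lookup and result are made up for the example, and the empty issuer_partition and winner_partition columns are not pre-created, because the merges add them.

```python
import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.array([np.arange(10)] * 3).T,
                         columns=["issuer_id", "winner_id", "gov"])
test_out = [
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=["id", "partition"]), 2),
    (pd.DataFrame(np.array([np.arange(10)] * 2).T, columns=["id", "partition"]), 7),
]

# Stack the labeled frames into one lookup table, with the label as 'gov'
lookup = pd.concat([frame.assign(gov=label) for frame, label in test_out],
                   ignore_index=True)

result = test_data
for role in ("issuer", "winner"):
    # Rename so the key and value columns match this role
    role_lookup = lookup.rename(columns={"id": f"{role}_id",
                                         "partition": f"{role}_partition"})
    # Left merge keeps every row of test_data; unmatched rows get NaN
    result = result.merge(role_lookup, on=["gov", f"{role}_id"], how="left")

print(result)
```

With this test data only the rows with gov equal to 2 or 7 find a match. The other rows get NaN, which also turns the partition columns into floats.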