I'm looking for a more automated approach to subset this dataframe by rank and put the subsets in a list, because if there happen to be 150 ranks I can't create each subset individually.
ID | GROUP | RANK
1 | A | 1
2 | B | 2
3 | C | 3
2 | A | 1
2 | E | 2
2 | G | 3
How can I subset the dataframe by Rank and then put every subset in a list? (Not using group by)
I know how to subset them individually, but I'm not sure how to do this if there are more ranks.
Output:
ranks = [df1,df2,df3....and so on]
Just use groupby directly in a list comprehension
>>> [df for rank, df in df.groupby('RANK')]
This will generate a list of dataframes, each a sub-dataframe related to the corresponding rank.
You can also do a dict comprehension:
>>> dic = {rank: df for rank, df in df.groupby('RANK')}
such that you can access your df via dic[1] for rank == 1.
In more detail, pd.DataFrame.groupby is a method that returns a DataFrameGroupBy object. A DataFrameGroupBy object is an iterable, which means you can iterate over it with a for loop. This iterable generates tuples with two values, where the first is whatever you used to group by (in this case, an integer rank), and the second is the sub-dataframe.
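For reference, a minimal self-contained sketch (building a small frame with the column names from the question) showing both comprehensions:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 2, 2, 2],
    'GROUP': ['A', 'B', 'C', 'A', 'E', 'G'],
    'RANK': [1, 2, 3, 1, 2, 3],
})

# one sub-dataframe per rank, collected in a list
ranks = [sub for rank, sub in df.groupby('RANK')]

# or keyed by rank, so dic[1] is the rank == 1 subset
dic = {rank: sub for rank, sub in df.groupby('RANK')}

print(len(ranks))  # 3
print(dic[1])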
I'm new to pandas and I'm trying to understand if there is a method to find out if two values from one row in df1 are between two values from one row in df2.
Basically my df1 looks like this:
start | value | end
1 | TEST | 5
2 | TEST | 3
...
and my df2 looks like this:
start | value | end
2 | TEST2 | 10
3 | TEST2 | 4
...
Right now I've got it working with two loops:
for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row2[1]["start"] >= row[1]["start"] and row2[1]["end"] <= row[1]["end"]:
            print(row2)
but this doesn't feel like it's the pandas way to me.
What I expect is that row number 2 of df2 gets printed, because 3 > 1 and 4 < 5, i.e.:
3 | TEST2 | 4
Is there a method to do this in a more pandas-like way?
You could use a cross merge to get all combinations of df1 and df2 rows, and filter using classical comparisons. Finally, get the indices and slice:
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique()
       )
df2.loc[idx]
NB: I am using unique here to ensure that a row is selected only once, even if there are several matches.
Output:
start value end
1 3 TEST2 4
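For completeness, a self-contained sketch (constructing df1 and df2 from the tables above, with column names assumed from the question) that reproduces this output:
import pandas as pd

df1 = pd.DataFrame({'start': [1, 2], 'value': ['TEST', 'TEST'], 'end': [5, 3]})
df2 = pd.DataFrame({'start': [2, 3], 'value': ['TEST2', 'TEST2'], 'end': [10, 4]})

# cross merge: every df1 row paired with every df2 row, then keep contained rows
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique())

print(df2.loc[idx])
#    start  value  end
# 1      3  TEST2    4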
Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them based on df1_id, but the date column from the second table should be transposed into the resultant table.
The date columns for the result table will be a range between the min date and max date from the second table.
The column values for those dates in the result table will come from the data column of the second table.
Also, the test column from the second table will only take its value for the latest date in the result table.
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then merging the pivoted second table with df1, but it's not working. I do not know how to get only one row for the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to serially parse through each row
You need to pivot your df2 into two separate tables, since we need both the data and test values, and then merge both resulting pivot tables with df1:
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['test'])
data_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['data'])
max_test = test_piv['test'].ffill(axis=1).iloc[:,-1].rename('test')
final = df1.merge(data_piv['data'],left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test,left_on=df1.df1_id, right_index=True, how='left')
The resulting final dataframe is shown below:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |
Here is a solution to the question:
I first sort df2 by df1_id and date to ensure that the table entries are in order.
Then I drop duplicates based on df1_id, keeping the last row, to ensure I have the latest values for test and test2.
Then I pivot df2 to get the dates as columns and data as their values.
Then I merge the latest-values table with df2_pivoted to combine the latest values of test and test2.
Then I merge with df1 to get the resultant table.
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
df2=df2.sort_values(by=['df1_id','date'])
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'],keep='last')
df2_pivoted = df2.pivot_table(index=['df1_id'],columns=['date'],values=['data'])
df2_pivoted = df2_pivoted.droplevel(0,axis=1).reset_index()
df2_pivoted = pd.merge(df2_pivoted,df2_latest_vals,on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the empty values as NaN. If anyone can help, please edit the answer.
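One possible way to fill that gap (a sketch only, assuming the dd-mm-YYYY date strings can be parsed and the pivoted columns reindexed against a full pd.date_range):
# parse dates, pivot, then reindex the columns against the full daily range
df2['date'] = pd.to_datetime(df2['date'], format='%d-%m-%Y')
df2_pivoted = df2.pivot_table(index='df1_id', columns='date', values='data')

full_range = pd.date_range(df2['date'].min(), df2['date'].max(), freq='D')
df2_pivoted = df2_pivoted.reindex(columns=full_range)  # missing dates become NaN

# optionally restore the original dd-mm-YYYY column labels
df2_pivoted.columns = df2_pivoted.columns.strftime('%d-%m-%Y')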
I have a MultiIndex dataframe with the top level columns named:
Col1_1 | Col1_2 | Col2_1 | Col2_2 | ... |
I'm looking to combine Col1_1 with Col1_2 as Col1. I could also do this before creating the MultiIndex, but the original data is more drawn out as:
Col1_1.aspect1 | Col1_1.aspect2 | Col1_2.aspect1 | Col1_2.aspect2 | ... |
where 'aspect1' and 'aspect2' become subcolumns in the MultiIndex.
Please let me know if I can clarify anything, and many thanks in advance.
The expected result combines the two (Col1_1 and Col1_2) into a single column; any number of ways is fine, including stacking/concatenating the data, outputting a summary stat (e.g. mean()), etc.
You can use groupby and apply an aggregation function such as mean against it.
You must group along axis 1 (columns) at level 1 (the lower MultiIndex column level). This applies the grouping across all samples. Then simply take the mean, if that's what you want to achieve:
df.groupby(level=1, axis=1).mean()
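A minimal sketch (with made-up values, assuming the two-level column layout described above):
import pandas as pd
import numpy as np

# two-level columns: (sample, aspect)
cols = pd.MultiIndex.from_product([['Col1_1', 'Col1_2'], ['aspect1', 'aspect2']])
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols)

# average the same aspect across all top-level columns
combined = df.groupby(level=1, axis=1).mean()
print(combined)
#    aspect1  aspect2
# 0      1.0      2.0
# 1      5.0      6.0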
I have a table of the form:
item_code | attribute | time_offset | mean | median | description | ...
The attribute column has one of 40 possible values and the time_offset column can be an integer from 0 to 20.
I want to transform this table to a wide one of the form:
item_code | <attribute1>_<time_offset1>_mean | <attribute1>_<time_offset1>_median | <attribute1>_<time_offset1>_description | <attribute1>_<time_offset1>_... | <attribute2>...
I can do this either in SQL or in Pandas but I'm having difficulty with the fact that some of the columns are not numeric, so it is hard to come up with an aggregation function for them.
I can guarantee that each combination of item_code, attribute and time_offset will have only one row, so I do not need an aggregation function. Is there something like a transpose operation that will allow me to do what I am looking for?
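One possible approach, sketched here with assumed column names: pivot needs no aggregation function when each key combination is unique, and the resulting column MultiIndex can then be flattened into the desired names.
import pandas as pd

# toy data with the assumed column names
df = pd.DataFrame({
    'item_code':   ['X', 'X', 'Y'],
    'attribute':   ['attr1', 'attr1', 'attr1'],
    'time_offset': [0, 1, 0],
    'mean':        [1.0, 2.0, 3.0],
    'median':      [1.5, 2.5, 3.5],
    'description': ['a', 'b', 'c'],
})

# pivot keeps non-numeric columns as-is; no aggregation is applied
wide = df.pivot(index='item_code',
                columns=['attribute', 'time_offset'],
                values=['mean', 'median', 'description'])

# flatten (value, attribute, time_offset) into <attribute>_<time_offset>_<value> names
wide.columns = [f'{attr}_{off}_{val}' for val, attr, off in wide.columns]
wide = wide.reset_index()
print(wide)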