I want to group a pandas DataFrame and get the last n elements from each group, but with an offset. For example, after grouping by column 'A', a certain value of 'A' has the values (1,2,3,4,5,6,7) in column 'B'. I want to take the last 10 elements while excluding the most recent one or two. How can I do that?
I've tried tail(), as in df.groupby('A').tail(10), but that doesn't cover my case.
input: 'A': [1,1,1,1,1,1,1,1,1], 'B': [1,2,3,4,5,6,7,8,9]
output (last 3 excluding the most recent 2): 'A': [1], 'B': [5,6,7]
First of all, this is an unusual task, since all your "A" values are the same, so it is odd to group by such a column.
This leads to 2 solutions that came to my mind...
1]
data = {'A': [1,2,3,4,5,6,7,8,9], 'B': [1,2,3,4,5,6,7,8,9]}
df_dict = pd.DataFrame.from_dict(data)
no_of_unwanted_values = 2
df_dict.groupby('A').agg(lambda a: a).head(-no_of_unwanted_values)#.tail(1)
This solution works if the values you group by in column A are row-specific. head(-x) selects all the values from the top except the last x.
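As an aside, applied back to the example in the question, head(-k) can be chained with tail(n) per group. A minimal sketch; the names n and k are mine, and in recent pandas versions GroupBy.head accepts a negative n just like DataFrame.head:

import pandas as pd

df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1], 'B': [1,2,3,4,5,6,7,8,9]})
k = 2  # most recent rows to drop per group
n = 3  # rows to keep after dropping them
# drop the last k rows of each group, then take the last n of what remains
out = df.groupby('A').head(-k).groupby('A').tail(n)
print(out)  # rows with B = 5, 6, 7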
I think what you are looking for is the second solution:
2]
data = {'A': [1,2,1,3,1,2,1,2,3], 'B': [1,2,3,4,5,6,7,8,9]}
df_dict = pd.DataFrame.from_dict(data)
no_of_unwanted_values = 2
df_dict.groupby('A').sum().head(-no_of_unwanted_values)#.tail(1)
Here you have 3 distinct values to group by, and you then apply some operation to those groups (in this case sum). Finally you again select all but the last x rows with head(-x). Optionally, if you also want to drop some of the top rows from that result, you can append .tail() to the query and specify the number of rows to keep. The head(-x) could also be rewritten using len(df_dict) - no_of_unwanted_values (but in that case the number of unwanted values would have to be x + 1). The same len(x) - 1 logic can also be applied when slicing plain Python lists.
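For illustration, chaining .tail() onto the grouped result from solution 2 keeps only the middle group (a quick sketch on the data above):

# grouped sums are A=1 -> 16, A=2 -> 16, A=3 -> 13
df_dict.groupby('A').sum().head(-1).tail(1)  # keeps only the row for A=2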
PS.:
Beware when using sort_values, for example:
data.sort_values(['col_1','col_2']).groupby(['col_3','col_2']).head(x)
Here head(x) corresponds to the col_1 values. That is, if you want all but the last value when len(data.col_1.unique()) == 100, use head(99).
I have a column in a DataFrame that contains IATA codes (abbreviations) for airports, such as LAX, SFO, ... However, if I inspect the column values a little further (column.unique()), it turns out there are also 4-digit numbers in it.
How can I filter the column so that my DataFrame only consists of rows containing a real airport code?
My idea was to filter by length (an airport code is always 3 characters long, while the numbers are always 4 digits), but I don't know how to implement this idea.
array(['LFT', 'HYS', 'ELP', 'DVL', 'ISP', 'BUR', 'DAB', 'DAY', 'GRK',
'GJT', 'BMI', 'LBE', 'ASE', 'RKS', 'GUM', 'TVC', 'ALO', 'IMT',
...
10170, 11577, 14709, 14711, 12255, 10165, 10918, 15401, 13970,
15497, 12265, 14254, 10581, 12016, 11503, 13459, 14222, 14025,
'10333', '14222', '14025', '13502', '15497', '12265'], dtype=object)
You can use .str.len on the column (after casting it to string) to get the length of each value, and pass that boolean mask to df.loc:
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
Another possibility is to use a lambda expression:
df[df['IATA_Codes'].apply(lambda x: len(str(x)) == 3)]['IATA_Codes'].unique()
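As a quick check on made-up data (the values below are a toy sample, not the asker's real column), both approaches reduce the column to the 3-character codes:

import pandas as pd

df = pd.DataFrame({'IATA_Codes': ['LFT', 'HYS', 10170, '14222', 'ELP']})
mask = df['IATA_Codes'].astype(str).str.len() == 3
print(df.loc[mask, 'IATA_Codes'].unique())  # ['LFT' 'HYS' 'ELP']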
I have a DataFrame with two columns ['A', 'B']. The columns are already sorted. I want to build a list of A values taken at the first row where B passes each successive multiple of 100, and add the A values for the min and max of B to the list. Here n is fixed at 3 multiples of 100.
d = {'A': [15,16,17,19,20,21,25,26,27,28,29,30],
     'B': [25,90,101,137,140,190,202,207,290,304,355,367]}
df = pd.DataFrame(data=d)
The end result is to create a list [15,17,25,28,30] based on this list of B values: [25, 101, 202, 304, 367].
I previously set colA = [min(df.A)], and I'm trying to append the other three items based on the indices of the colB list, then add the maximum column A value as the last item of colA.
So back to the other 3 items in colB: I would iterate over range(3). When n = 0, the first item is the first B value that is close to but greater than (n+1)*100; the same applies to the remaining two values.
If I understand your problem right, you can use df.groupby:
out = df.groupby(df["B"] // 100)["A"].first().to_list() + [df["A"].max()]
print(out)
Prints:
[15, 17, 25, 28, 30]
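To see why this works: the integer division df["B"] // 100 buckets the rows by hundreds, and .first() picks the A value of the first row in each bucket; appending df["A"].max() adds the final 30. A quick sketch of the intermediate steps:

print((df["B"] // 100).tolist())
# [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(df.groupby(df["B"] // 100)["A"].first().to_list())
# [15, 17, 25, 28]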
I have a Pandas dataframe with two columns I am interested in: A categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time-window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference and selecting the maximum (longest time-window) successfully; however, I'm having trouble finding an expression to select the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data = d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category that gave rise to the maximum peak-to-peak value within groups (np.ptp computes maximum minus minimum). Then you can index with loc as you said, or use query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
"B"
>>> df.query("cat == #max_cat") # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
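An equivalent selection without the intermediate variable, as a sketch on the same data, uses transform to broadcast each group's peak-to-peak value back onto the rows:

ptp = df.groupby("cat")["val"].transform(lambda s: s.max() - s.min())
df[ptp == ptp.max()]  # the four rows of category B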
I am new to pandas (but not to data science and Python). This question is not only about how to solve this specific problem but about how to handle problems like this the pandas way.
Please feel free to improve the title of this question, because I am not sure what the correct terms are here.
Here is my MWE
#!/usr/bin/env python3
import pandas as pd
data = {'A': [1, 2, 3, 3, 1, 4],
'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']}
df = pd.DataFrame(data)
print(df)
Resulting in
A B
0 1 One
1 2 Two
2 3 Three
3 3 Three
4 1 Eins
5 4 Four
My assumption is that when the value in column A is 1, the value in column B is always One. And so on...
I want to prove that assumption.
Secondly, I also assume that if my first assumption is incorrect, this is not an error but that there are valid (human) reasons for it, e.g. see row index 4, where the A value 1 is paired with Eins (and not One) in column B.
Because of that I also need to see and explore the cases where my assumption does not hold.
Update of the question:
This data is only an example. In the real world I do not know the pairing of the two columns in advance, so solutions like the following do not work in my case:
df.loc[df['A'] == 1, 'B']
I do not know how many distinct values are in column A, nor which ones.
I do not know how to do that with pandas. How would a pandas professional solve this?
My approach would be to use pure Python code with list(), set() and some iterations. ;)
You can filter your data frame this way:
df.loc[df['A'] == 1, 'B']
This gives you the values of B where A is 1. Next you can add an equality check:
df.loc[df['A'] == 1, 'B'] == 'One'
Which results in a boolean series (True, False in this case). If you want to check if all are true, you add:
all(df.loc[df['A'] == 1, 'B'] == 'One')
And the answer is False because of the Eins.
EDIT
If you want to create a new column which says if your criterion is met (always the same value for B if A) then you can do this:
df['C'] = df['A'].map(df.groupby('A')['B'].nunique() < 2)
Which results in a bool column. It creates column C by mapping each value of A through the Series in the brackets: a groupby on A counting the number of unique values of B per group. If that count is below 2, B is unique for that A and the mapped value is True.
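For the example frame above this yields False for A == 1 (which maps to both One and Eins) and True everywhere else; a quick check:

print(df['C'].tolist())
# [False, True, True, True, False, True]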
If the solution should test whether there is only one unique value of B per A and return all rows that fail, use DataFrameGroupBy.nunique to count the unique values inside GroupBy.transform, which repeats the aggregated value across each group. You can then filter the rows where the count is not 1, meaning there are 2 or more unique values per A:
df1 = df[df.groupby('A').B.transform('nunique').ne(1)]
print (df1)
A B
0 1 One
4 1 Eins
if df1.empty:
    print ('My assumption is good')
else:
    print ('My assumption is wrong')
    print (df1)
I'm using pandas groupby on my DataFrame df which has columns type, subtype, and 11 others. I'm then calling an apply with my combine_function (needs a better name) on the groups like:
grouped = df.groupby('type')
reduced = grouped.apply(combine_function)
where my combine_function checks whether the group contains any row with a given subtype, say 1, and looks like:
def combine_function(group):
    if 1 in group.subtype:
        return aggregate_function(group)
    else:
        return group
combine_function can then call aggregate_function, which calculates summary statistics, stores them in the first row, and then reduces the group to that row. It looks like:
def aggregate_function(group):
    first = group.first_valid_index()
    group.value1[group.index == first] = group.value1.mean()
    group.value2[group.index == first] = group.value2.max()
    group.value3[group.index == first] = group.value3.std()
    group = group[(group.index == first)]
    return group
I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that is somehow related to a group that I don't want to aggregate having exactly 2 rows:
ValueError: Shape of passed values is (13,), indices imply (13, 5)
where the group sizes in one example were:
In [4]: grouped.size()
Out[4]:
type
1 9288
3 7667
5 7604
11 2
dtype: int64
It processed the first three fine, and then gave the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)] (so I update but don't aggregate) or call my aggregate_function on all groups, it works fine.
Does anyone know the proper way to be doing this kind of aggregation of some groups but not others?
Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do it manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more usual approach would look like this:
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': 'mean', 'value2': 'max', 'value3': 'std'}
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
Note: agg_condition is messy because (1) built-in Python in refers to the index of a Series, not its values, and (2) the result has to be reduced to a scalar by any().