Splitting a dataframe into multiple dataframes using groupby - python

I have a question about splitting a dataframe into multiple dataframes with groupby such that each iteration covers more than one grouped item. I looked at the forum and found the example below, which is very close to my problem. However, I was wondering whether it is possible to print all the rows of more than one grouped item per iteration of the loop. So, from the example below, in my first iteration, is it possible to print all the rows of Regions A, B and C, and then iterate again for the next 3 regions?
for region, df_region in df.groupby('Region'):
    print(df_region)
Competitor Region ProductA ProductB
0 Comp1 A £10 £15
3 Comp2 A £9 £16
6 Comp3 A £11 £16
Competitor Region ProductA ProductB
1 Comp1 B £11 £16
4 Comp2 B £12 £14
7 Comp3 B £10 £15
Competitor Region ProductA ProductB
2 Comp1 C £11 £15
5 Comp2 C £14 £17
8 Comp3 C £12 £15
I am learning and implementing Python/pandas, so I'm still a beginner with the language. Any help would be really appreciated. Thanks

The technical terminology for this is batching: you return values from your dataframe in batches of some size in order to avoid getting everything at once (a batch size equal to the length of your dataframe) or one item at a time (a batch size of 1). Under certain conditions this can improve performance. Here's one way you might go about it:
import pandas as pd

df = pd.DataFrame({"Region": ["A", "B", "C", "C", "D", "E", "E", "E", "F", "F"],
                   "Product A": [1, 2, 1, 2, 2, 1, 1, 1, 1, 3]})
Don't use the lines above; they are just there so I can replicate a dataframe like yours. Here's the approach, feel free to change batch_size as you wish:
batch_size = 3
regions = df["Region"].unique()
b = 0
while True:
    r = regions[b * batch_size:(b + 1) * batch_size]
    if len(r) == 0:
        break  # already went through all the regions
    else:
        b += 1  # increment so that the next iteration gets the next set of regions
    print(f"batch: {b}")
    sub_df = df.loc[df["Region"].isin(r)]
    print(sub_df)
sub_df will hold the batched result, containing the rows for only batch_size regions on each iteration.
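For completeness, here is a more compact sketch of the same idea (assuming the df built above), stepping through the unique regions with a plain range instead of a while loop and counter:

batch_size = 3
regions = df["Region"].unique()

for start in range(0, len(regions), batch_size):
    chunk = regions[start:start + batch_size]   # next batch_size region labels
    sub_df = df[df["Region"].isin(chunk)]       # all rows belonging to those regions
    print(f"batch: {start // batch_size + 1}")
    print(sub_df)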

Related

Aggregate a dataframe column based on a hierarchical condition from another column

I have an interesting problem and thought I would share it here for everyone. Let's assume we have a pandas DataFrame like this (dummy data):
Category   Samples
A,B,123    6
A,B,456    3
A,B,789    1
X,Y,123    18
X,Y,456    7
X,Y,789    2
P,Q,123    1
P,Q,456    2
P,Q,789    2
L,M,123    1
L,M,456    3
S,T,123    5
S,T,456    5
S,T,789    3
The values in Category are hierarchical in nature. Think of A as country, B as state, and 123 as zip code. What I want is to greedily match each category that has fewer than 5 samples and merge it with the nearest one. The final example DataFrame should be like:
Category   Samples
A,B,123    10
X,Y,123    18
X,Y,456    9
P,Q,456    5
L,M,456    4
S,T,123    8
S,T,456    5
These are the rules I see as needed:
Case A,B: Sub-categories 456 and 789 have fewer than 5 samples, so we merge them, but the merged one still has only 4, which is less than 5, so it gets merged further with 123 and we finally get A,B,123 with 10.
Case X,Y: Sub-category 789 is the only one with fewer than 5, so it merges with category 456 (the one closest to 5 samples) to become X,Y,456 with 9, while X,Y,123 always had more than 5 so it remains as is.
Case P,Q: Here all the sub-categories have fewer than 5, but the idea is to merge them one at a time, and it has nothing to do with the sequence. 123 has one sample, so it merges with 789 to give a sample size of 3, which is still less than 5, so it then merges with 456 to form P,Q,456 with a sample size of 5 (it could equally end up as P,Q,789 with 5; either is fine).
Case L,M: There are only two sub-categories, and even merged they stay below 5, but that's the best we can do, so it should be L,M,456 with 4.
Case S,T: Only 789 has fewer than 5, so it can go with either 123 or 456 (both have the same samples), but not both. So the answer should be either S,T,123 as 5 and S,T,456 as 8, or S,T,123 as 8 and S,T,456 as 5.
What happens if there is a third column with values that we want merged by the same logic - summed up if it's an integer, concatenated if it's a string - based on whatever condition we use on these columns?
I have been trying to split the Category column and then work with Samples to add things up, but so far no luck. Any help is greatly appreciated.
Very tricky question, especially with the structure of your data, because your grouper (which is really the "A,B", "X,Y", etc. part) is not in a separate column. But I think you can do:
df.sort_values(by='Samples', inplace=True, ignore_index=True)
# grouper containing the groupby keys ('A,B', 'X,Y', etc.)
g = df['Category'].str.extract("(.*),+")[0]
# create a column that keeps the sample count and the category together
df['sample_category'] = list(zip(df['Samples'], df['Category']))
Then use functools.reduce to reduce the list, iteratively absorbing the next tuple while the accumulated sample count is less than 5 (this needs import functools):
import functools

df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s:
        functools.reduce(lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})
Then do some munging to turn each element into a list:
df2['sample_category'] = df2['sample_category'].apply(
    lambda x: [x] if isinstance(x[0], int) else list(x))
Then explode, extract the columns, and finally drop the intermediate 'sample_category' column:
df2 = df2.explode('sample_category', ignore_index=True)
df2['Sample'] = df2['sample_category'].str[0]
df2['Category'] = df2['sample_category'].str[1]
df2.drop('sample_category', axis=1, inplace=True)
print(df2)
Sample Category
0 10 A,B,123
1 4 L,M,456
2 5 P,Q,789
3 8 S,T,123
4 5 S,T,456
5 9 X,Y,456
6 18 X,Y,123
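For reference, the steps above assume df already holds the question's data. A minimal sketch to build it (column names taken from the question) so everything can be run end to end:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A,B,123', 'A,B,456', 'A,B,789',
                 'X,Y,123', 'X,Y,456', 'X,Y,789',
                 'P,Q,123', 'P,Q,456', 'P,Q,789',
                 'L,M,123', 'L,M,456',
                 'S,T,123', 'S,T,456', 'S,T,789'],
    'Samples': [6, 3, 1, 18, 7, 2, 1, 2, 2, 1, 3, 5, 5, 3],
})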

What is the optimal method for multiple if/then's to evaluate a column's value in a dataframe and conditionally modify another column value?

I have a dataframe (1580 rows x 48 columns) where each column contains answers to questions, but not every row contains an answer to every question (leaving it NaN). Groups of questions are related, and I'd like to tabulate the answers to the group of questions into new columns (c_answers and i_answers). I have generated lists of the correct answers for each group of questions. Here is an example of the data:
import numpy as np
import pandas as pd

ex_df = pd.DataFrame([["a", "b", "d"], [np.nan, "a", "b"], ["e", "c", np.nan]],
                     columns=["q1", "q2", "q3"])
correct_answers = ["a", "b", "c"]
ex_df
which generates the following dataframe:
q1 q2 q3
0 a b d
1 NaN a b
2 e c NaN
What I would like to do, ideally, is to create a function that would score each column in a group: for each answer in a row that appears in the correct_answers list it would increment a c_answers column by 1; for each answer that is not in correct_answers, it would increment an i_answers column by 1; and if the provided answer is NaN, it would do neither (not counted as correct or incorrect). This function could then be applied to each group of questions, calculating the number of correct and incorrect answers in that group for each row.
What I have been able to make a bit of progress with instead is something like this:
ex_df['q1score'] = np.where(ex_df['q1'].isna(), np.nan,
                            np.where(ex_df['q1'].isin(correct_answers), 1, 100))
which updates the dataframe like so:
q1 q2 q3 q1score
0 a b d 1.0
1 NaN a b NaN
2 e c NaN 100.0
I could then reuse this code to score q2 and q3 into their own new columns, sum those into another column, and from that column generate two more columns that count the correct and incorrect scores. Finally, I could go back and drop the four intermediate columns and keep only the two I wanted in the first place.
Looking around and trying different methods for the last two hours, I'm finding a lot of answers that deal with one or another of the different issues I'm trying to deal with, but nothing that I could finagle to actually work for my situation. Maybe the solution I've kludged together is the best one, but I'm still relatively new to programming (<18 months) and it didn't seem like the most efficient or most Pythonic method to solve this problem. Hoping someone else has a better answer out there. Thank you!
Edit, for more information regarding output: I'd like the final output to look like this:
q1 q2 q3 c_answers i_answers
0 a b d 2 1
1 NaN a b 2 0
2 e c NaN 1 1
Like I said, I can kind of finagle that using the nested np.where() to create numeric columns that I can then sum up and reverse-engineer to get a raw count from. While this is a solution, it's cumbersome and probably not the optimal one, especially with the amount of repetition involved (I'll have to repeat the process for 9 different groups of columns, each being a cluster of questions).
Use sum to count True values for correct and incorrect answers per row:
m1 = ex_df.isin(correct_answers)
m2 = ex_df.notna() & ~m1
df = ex_df.assign(c_answers=m1.sum(axis=1), i_answers=m2.sum(axis=1))
print (df)
q1 q2 q3 c_answers i_answers
0 a b d 2 1
1 NaN a b 2 0
2 e c NaN 1 1
Possible solution for multiple groups:
groups = {'g1': ['q1', 'q2'], 'g2': ['q2', 'q3'], 'g3': ['q1', 'q2', 'q3']}
for k, v in groups.items():
    m1 = ex_df[v].isin(correct_answers)
    m2 = ex_df[v].notna() & ~m1
    ex_df = ex_df.assign(**{f'c_answers_{k}': m1.sum(axis=1),
                            f'i_answers_{k}': m2.sum(axis=1)})
print (ex_df)
q1 q2 q3 c_answers_g1 i_answers_g1 c_answers_g2 i_answers_g2 \
0 a b d 2 0 1 1
1 NaN a b 1 0 2 0
2 e c NaN 1 1 1 0
c_answers_g3 i_answers_g3
0 2 1
1 2 0
2 1 1
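If you prefer the per-group-of-questions function described in the question, a minimal sketch wrapping the same masks (the function name score_group is purely illustrative):

def score_group(frame, columns, correct):
    """Return correct/incorrect counts per row for one group of question columns."""
    sub = frame[columns]
    correct_mask = sub.isin(correct)
    incorrect_mask = sub.notna() & ~correct_mask
    return correct_mask.sum(axis=1), incorrect_mask.sum(axis=1)

# usage: score one cluster of questions and attach the two columns
c, i = score_group(ex_df, ["q1", "q2", "q3"], correct_answers)
ex_df = ex_df.assign(c_answers=c, i_answers=i)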

Average score per attempt for entries with non fully overlapping attempts

I have a pandas dataframe that has a column that contains a list of attempt numbers and another column that contains the score achieved on those attempts. A simplified example is below:
scores = [[0,1,0], [0,0], [0,6,2]]
attempt_num = [[1,2,3], [2,4], [2,3,4]]
df = pd.DataFrame([attempt_num, scores]).T
df.columns = ['Attempt', 'Score']
Each row represents a different person, which for the purposes of this question, we can assume are unique. The data is incomplete, and so I have attempt number 1, 2 and 3 for the first person, 2 and 4 for the second and 2, 3 and 4 for the last. What I want to do is to get an average score per attempt. For example, attempt 1 only shows up once and so the average would be 0, the score achieved when it did show up. Attempt 2 shows up for all persons which gives an average of 0.33 ((1 + 0 + 0)/3) and so on. So the expected output would be:
Attempt_Number Average_Score
0 1 0.00
1 2 0.33
2 3 3.00
3 4 1.00
I could loop through every element of row of the dataframe and then through every element in the list in that row, append the score to an ordered list and calculate the average for every element in that list, but this would seem to be very inefficient. Is there a better way?
Use DataFrame.explode (exploding multiple columns at once needs pandas 1.3 or newer) with an aggregated mean:
df = (df.explode(['Attempt', 'Score'])
        .astype({'Score': int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt': 'Attempt_Number', 'Score': 'Average_Score'})
      )
print (df)
Attempt_Number Average_Score
0 1 0.000000
1 2 0.333333
2 3 3.000000
3 4 1.000000
For older pandas versions use:
df = (df.apply(pd.Series.explode)
        .astype({'Score': int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt': 'Attempt_Number', 'Score': 'Average_Score'})
      )

Comparing the value of two consecutive rows of a dataframe for each level of a factor variable - Python Pandas

I have a pandas Dataframe containing traders' positions over time, that I created like this:
history = pd.read_csv(r"history.csv")
history = pd.DataFrame(history, columns=['Symbol', 'Size', 'Entry Price',
                                         'Mark Price', 'PNL (ROE %)', 'Last Position Update'])
frames = [historylast, history]
history = pd.concat(frames)
positions = historylast['Symbol'].tolist()
historylast_symbol_set = set(positions)
where historylast is the most recently scraped dataframe containing current positions, and history is the local copy with previous positions. This is the outcome:
history = history.sort_values('Symbol')
print (history)
Symbol Size ... PNL (ROE %) Last Position Update
0 BNBUSDT 250.800 ... 7702.095588 2021-05-01 03:12:09
5 BNBUSDT 1000.800 ... 43351.359565 2021-04-29 03:51:41
0 BTCUSDT 54.422 ... 513277.155788 2021-04-25 21:03:13
0 BTCUSDT 54.422 ... 328896.563684 2021-04-25 21:03:13
1 DOGEUSDT 2600000.000 ... 46896.408000 2021-05-01 08:24:51
This dataframe has been created by putting together traders' positions over time.
What I would like to do is see whether the last available 'Size' for each coin has changed with respect to the previous one. For example, for BNBUSDT the last size is 250, reduced by 75% with respect to the previous size, which was 1000. For BTCUSDT the size has not changed since last time, while for DOGEUSDT there is no previous data to compare against, so it's still 100% of the bought position.
To achieve this I thought I should split the dataframe into different dataframes, one for each symbol, and compute and save the percentage change with a for loop, but I'm having difficulties and wonder whether there is a better way. Any help would be much appreciated.
Consider the following df as an example (it uses the column names Symbol and Size as well):
import pandas as pd
d = {'Symbol': ["A", "C", "A", "B", "A", "B", "A"], 'Size': [1, 1, 2, 3, 4, 5, 4]}
df = pd.DataFrame(data=d)
print(df)
>>>> Symbol Size
0 A 1
1 C 1
2 A 2
3 B 3
4 A 4
5 B 5
6 A 4
To retrieve the two most recent rows for each Symbol, do the following (head(2) takes the first two rows per group, which in the question's data are the latest ones because the newest scrape is concatenated on top; use tail(2) instead if your newest rows are at the bottom):
g = df.groupby('Symbol').head(2)
g = g.sort_values('Symbol').reset_index(drop=True)
print(g)
>>> Symbol Size
0 A 1
1 A 2
2 B 3
3 B 5
4 C 1
After that, to compute the change in Size within each group, create a new column holding that difference:
g['Difference'] = g.groupby('Symbol')['Size'].diff()
print(g)
>>> Symbol Size Difference
0 A 1 NaN
1 A 2 1.0
2 B 3 NaN
3 B 5 2.0
4 C 1 NaN
Note that the first element of each group is NaN, as there is no previous row to compare against.
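The question ultimately asks for a percentage change of the latest Size relative to the previous one per symbol. A minimal sketch of that, built on the same toy df (where the older row comes first within each symbol; reorder first if yours is newest-first), with the column name Pct_Change purely illustrative:

g = df.groupby('Symbol').head(2)                    # two rows per symbol, older first here
g = g.sort_values('Symbol').reset_index(drop=True)

# percent change of Size relative to the previous row within each symbol;
# the first row of each symbol has nothing to compare against, hence NaN
g['Pct_Change'] = g.groupby('Symbol')['Size'].pct_change() * 100
print(g)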

how to plot histogram of maximum values of a dataframe

I have a dataframe with 3 columns: df = ["a", "b", "value"]. (Actually this is a simplified example; the solution should be able to handle n variables, like "a", "b", "c", "d"...) In this case, the "value" column has been generated depending on the "a" and "b" values, doing something like:
for a in range(1, 10):
    for b in range(1, 10):
        generate_value(a, b)
The resulting data is similar to:
a b value
0 1 1 0.23
1 1 2 6.34
2 1 3 0.25
3 1 4 2.17
4 1 5 5.97
[...]
I want to know which combinations of "a" and "b" statistically give me the biggest "value". So I want to draw some kind of histogram that shows which values of "a" and "b" statistically generate a bigger "value". I tried something like:
fig = plot.figure()
ax=fig.add_subplot(111)
ax.hist(df["a"],bins=50, normed=True)
or:
plot.plot(df["a"].values, df["value"].values, "o")
But the results are not good. I think I should use some kind of histogram or Gaussian bell curve, but I'm not sure how to plot it.
So, how do I plot the values of "a" and "b" that statistically give the maximum "value"?
Note: answer 1 is perfect for two variables a and b, but the problem is that the correct answer needs to work for multiple variables: a, b, c, d...
Edit 1: Please note that although I'm asking about two variables, the solution can't be to bind "a" to the x axis and "b" to the y axis, as there may be more variables. So if we have "a", "b", "c", "d", "e", the solution should still be valid.
Edit 2: Trying to explain it better: let's take the following dataframe:
a b c d value
0 1 6 9 7 0.23
1 5 2 3 5 11.34
2 6 7 8 4 0.25
3 1 4 9 3 2.17
4 1 5 9 1 4.97
5 6 6 4 7 25.9
6 3 5 5 2 10.37
7 1 5 1 2 7.87
8 2 5 3 3 8.12
9 1 5 2 1 2.97
10 7 5 4 9 5.97
11 3 5 2 3 9.92
[...]
Row 5 is clearly the winner, with a value of 25.9, so the supposedly best values of a,b,c,d are 6,6,4,7. But we can see that statistically this is a strange result: it is the only one so high with those values of a,b,c,d, so it is very unlikely that we will get a high value in the future by choosing those values. Instead, it seems much safer to choose numbers that have generated "value" between 8 and 11. Although a gain of 8 to 11 is less than 25.9, the probability that those values of a,b,c,d (5,2,3,3) generate this higher "value" is bigger.
Edit 3: Although a,b,c,d are discrete, their combination/order will generate different results. I mean, there is a function that returns a value inside a small range, like value = func(a,b,c,d). That value will depend not only on the values of a,b,c,d, but also on some random factors. So, for instance, func(5,2,3,5) could return a value of 11.34, but it could also return a similar value, like 10.8 or 9.5 (a value in the range 8-11). Likewise, func(1,6,9,7) will return 0.23, or it could return 2.7, but it probably won't return 10.1, as that is very far from its range.
Following the example, I'm trying to find the numbers that will most probably generate something in the range of 8-11 (well, the maximum). The numbers I want to visualize will probably be some combination of 3, 5 and 2, but there probably won't be any 6, 7 or 4, as those usually generate smaller "value" results.
I don't think there are any statistics involved here. You can plot the value as a function of a and b.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A,B = np.meshgrid(np.arange(10),np.arange(10))
df = pd.DataFrame({"a" : A.flatten(), "b" : B.flatten(),
"value" : np.random.rand(100)})
ax = df.plot.scatter(x="a",y="b", c=df["value"])
plt.colorbar(ax.collections[0])
plt.show()
The darker the dots, the higher the value.
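Since the question's edits ask for something that still works with more than two variables, one possible extension (a sketch only, using randomly generated data; the value_bin label column is a made-up helper) is to bin value and draw a parallel-coordinates plot across all the explanatory columns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(200, 4)), columns=list("abcd"))
df["value"] = rng.random(200) * 10

# bin 'value' into quartiles and use the bin as a class label,
# then draw one line per row across the a/b/c/d axes
df["value_bin"] = pd.qcut(df["value"], 4,
                          labels=["low", "mid-low", "mid-high", "high"])
pd.plotting.parallel_coordinates(df[["a", "b", "c", "d", "value_bin"]],
                                 "value_bin", colormap="viridis")
plt.show()

Rows with a high value then stand out as one colour band, and the a/b/c/d values they pass through hint at which combinations tend to produce large values.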
This problem seems too complicated to solve with one built-in function.
I think it should be solved in this way:
exclude outliers from the data
select the n largest values
summarize the results with a bar plot or any other plot
Clean data from outliers
We might choose any appropriate method for outlier detection, e.g. 3*sigma, 1.5*IQR, etc. I used the 1.5*IQR rule in the example below (it needs scipy.stats):
from scipy import stats
cleaned_data = data[data['value'] < data['value'].quantile(0.75) + 1.5 * stats.iqr(data['value'])]
Select n largest values
Pandas provides the nlargest method, so you can use it to select the n largest values:
largest_values = cleaned_data.nlargest(5, 'value')
or you can use an interval of values:
largest_values = cleaned_data[cleaned_data['value'] > cleaned_data['value'].max() - 3]
Summarize results
Here we should count occurrences of values in each column and then plot this data.
# select the columns with the explanatory variables, e.g. ['a', 'b', 'c', 'd'] from the question
melted = pd.melt(largest_values[['a', 'b', 'c', 'd']])
table = pd.crosstab(melted['variable'], melted['value'])
table.plot.bar()
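For reference, a self-contained sketch stitching the steps above together on made-up data (the names data, cleaned_data and largest_values follow the snippets above; the data itself is randomly generated for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# made-up data with four explanatory variables and a value column
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.integers(1, 10, size=(200, 4)), columns=list("abcd"))
data["value"] = rng.random(200) * 10

# 1. drop outliers above the upper Tukey fence (Q3 + 1.5 * IQR)
fence = data["value"].quantile(0.75) + 1.5 * stats.iqr(data["value"])
cleaned_data = data[data["value"] < fence]

# 2. keep the n rows with the largest value
largest_values = cleaned_data.nlargest(10, "value")

# 3. count how often each explanatory value appears among the winners and plot
melted = pd.melt(largest_values[["a", "b", "c", "d"]])
table = pd.crosstab(melted["variable"], melted["value"])
table.plot.bar()
plt.show()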
