split pandas DataFrame into chunks of 4 rows - python

I have a dataset of 100 rows. I want to split it into chunks of 4 rows and then perform operations on each chunk, i.e., first perform an operation on the first four rows, then on the next four rows, and so on.
Note: the rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.

I will divide the df into chunks of 2 rows (a simple example) and collect them into a list, dfs.
Example
df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper series for grouping:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Divide into a list:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
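Applied to the original question (100 rows in chunks of 4), the same idea looks like the sketch below; the printed sum is just a stand-in for whatever per-chunk operation you need:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(100)})  # stand-in for the 100-row dataset

# Integer-divide the positional index by 4 so rows 0-3 get label 0,
# rows 4-7 get label 1, and so on; each group is one chunk of 4 rows.
for _, chunk in df.groupby(np.arange(len(df)) // 4):
    print(chunk['value'].sum())  # example per-chunk operation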

Related

Python Pandas - How to get group by counts by values from multiple columns with multiple values

My data includes a few variables holding data from multi-answer questions. These are stored as strings (comma-separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e., get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable:
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])
index a b
1 1,3 1
2 3 1,2
3 1,3,2 1
4 3,1 2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero counts, this would be just fine:
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I have gotten the counts for each of these variables individually by using split and value_counts:
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a b
1 1 3
3 1 1
2 1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all codes using the df_meta dataframe, but some of the actual variables have 300-400 codes, and it's very slow when I try to cross 2-3 of them. If it's possible to use groupby or another function, it should work much faster.
First, we create your dataframe to start with:
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes. Make temporary dataframes of all combinations and add them to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Grouping by the 'a' and 'b' columns with size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
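For larger data, a vectorized alternative (a sketch, assuming pandas 0.25+ for Series.explode) is to explode both columns and merge them on the original row index, which forms every per-row combination without an explicit loop:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])

# Explode each column to one value per row; the original row index is kept,
# so an index merge pairs every 'a' value with every 'b' value of the same row.
a = df['a'].str.split(',').explode().to_frame()
b = df['b'].str.split(',').explode().to_frame()
pairs = a.merge(b, left_index=True, right_index=True)

print(pairs.groupby(['a', 'b']).size().reset_index(name='count'))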

Merge And overwrite common columns in two DataFrames

I have a dataframe A with 80 columns, and I grouped A and summed 20 of its columns, e.g.:
New_df = A.groupby(['X','Y','Z'])['a','b','c', ...].sum().reset_index()
Then I want to overwrite the values in the columns of A with the values from the columns it has in common with New_df.
You can do:
cols1=set(A.columns.tolist())
cols2=set(New_df.columns.tolist())
common_cols = list(cols1.intersection(cols2))
A[common_cols]=New_df[common_cols]
to find the columns that the two df's have in common, then replace those in the first with the columns from the second.
For example, given an initial A:
x y
0 1 a
1 2 b
2 3 c
and New_df:
z y
0 4 d
1 5 e
2 6 f
And we wind up with the final A, with the y column taken from New_df:
x y
0 1 d
1 2 e
2 3 f
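A minimal runnable version of the example above (note that the assignment aligns on the index, so it overwrites rows whose index labels match between the two frames):
import pandas as pd

A = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
New_df = pd.DataFrame({'z': [4, 5, 6], 'y': ['d', 'e', 'f']})

# Intersect the two column sets, then overwrite those columns in A.
common_cols = list(set(A.columns) & set(New_df.columns))
A[common_cols] = New_df[common_cols]
print(A)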

Shuffle rows of a DataFrame until all consecutive values in a column are different?

I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function scramble meant to do this but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't get it to shuffle continuously until there are no two sequential rows with the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
   .apply(lambda x: x.sample(frac=1))
   .reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
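If you need a hard guarantee rather than "to some degree", a simple (if potentially slow) option is to reshuffle until the condition holds. A sketch, using a hypothetical helper scramble_until_distinct with a retry cap, since some inputs (e.g. one value in a majority of rows) have no valid ordering:
import pandas as pd

df = pd.DataFrame({'A': list('abcde'), 'B': [1, 1, 2, 3, 3]})

def scramble_until_distinct(x, max_tries=1000):
    """Reshuffle until no two consecutive rows share a value in B."""
    for _ in range(max_tries):
        x = x.sample(frac=1).reset_index(drop=True)
        # shift() aligns each B with the previous row's B; ne() flags differences
        if x['B'].ne(x['B'].shift()).all():
            return x
    raise ValueError("no valid ordering found within max_tries")

print(scramble_until_distinct(df))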

df.groupby() modification HELP needed

This is my table:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 2
Now, I want to group all rows by columns A and B. Column C should be summed, and for column E, I want to use the value from the row where C is at its maximum.
I did the first part, grouping by A and B and summing C, with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to specify that column E should take the value where C is max.
The end result should look like this:
A B C E
0 1 1 6 4
1 3 3 8 2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
C E
A B
1 1 6 4
3 3 8 2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
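For reference, a self-contained version of the example above; reset_index restores A and B as columns to match the desired output:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [1, 1, 3],
                   'C': [5, 1, 8], 'E': [4, 1, 2]})

# Sorting by C first means 'last' picks the E value from the row
# with the maximum C in each (A, B) group.
out = (df.sort_values('C')
         .groupby(['A', 'B'], sort=False)
         .agg({'C': 'sum', 'E': 'last'})
         .reset_index())
print(out)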

Pandas groupby data and do calculation

I have a dataframe that looks like the one below, and I have reordered it based on the value of column B.
a = df.sort_values(['B', 'A'], ascending=[True, False])
#This is my df
A,B
a,2
b,3
c,4
d,5
d,6
d,7
d,9
Then I'd like to calculate the difference between consecutive elements in column B when column A is the same. But if column A only contains a single data point, then the result should be zero.
So first I used groupby() to do so:
b = a['B'].groupby(df['A'])
Then I got stuck. I know I can use lambda x: abs(x[i] - x[i+1]) or even the apply() function to finish the calculation, but I still fail to get it done.
Can anyone give me a tip or suggestion?
# What I want to see in the result
A,B
a,0
b,0
c,0
d,0 # 5 minus 5
d,1 # 6 minus 5
d,1 # 7 minus 6
d,2 # 9 minus 7
In both the one-member and multi-member group cases, taking the diff will produce a NaN for the first value, which we can fillna with 0:
>>> df["B"] = df.groupby("A")["B"].diff().fillna(0)
>>> df
A B
0 a 0
1 b 0
2 c 0
3 d 0
4 d 1
5 d 1
6 d 2
This assumes there aren't already NaNs there that you want to preserve; we could still make that work if needed.
Alternatively, if A has been set as the index (as in the output below), you can do:
df.groupby(level="A").B.diff().fillna(0)
A
a 0
b 0
c 0
d 0
d 1
d 1
d 2
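For reference, a self-contained version of the first approach; note that diff() returns floats, so the column is cast back to int here:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'd', 'd', 'd'],
                   'B': [2, 3, 4, 5, 6, 7, 9]})

# Per-group difference: the first row of each group becomes NaN,
# which fillna(0) replaces; cast back to int since diff() yields floats.
df['B'] = df.groupby('A')['B'].diff().fillna(0).astype(int)
print(df)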
