I would like to split one DataFrame into N DataFrames, grouping together rows that share the same values in columns X and Z.
For example, this input:
df =
NAME X Y Z Other
0 a 1 1 1 1
1 b 1 1 2 2
2 c 1 2 1 3
3 d 1 2 2 4
4 e 1 1 1 5
5 f 2 1 2 6
6 g 2 2 1 7
7 h 2 2 2 8
8 i 2 1 1 9
9 j 2 1 2 0
Would have this output:
df_group_0 =
NAME X Y Z Other
0 a 1 1 1 1
2 c 1 2 1 3
4 e 1 1 1 5
df_group_1 =
NAME X Y Z Other
1 b 1 1 2 2
3 d 1 2 2 4
df_group_2 =
NAME X Y Z Other
6 g 2 2 1 7
8 i 2 1 1 9
df_group_3 =
NAME X Y Z Other
7 h 2 2 2 8
9 j 2 1 2 0
Is this possible?
groupby yields an iterable of (group_id, sub_frame) tuples, so if you iterate over the groups and take the second element of each tuple, you get a list of DataFrames, one per unique (X, Z) combination:
grouper = [g[1] for g in df.groupby(['X', 'Z'])]
grouper[0]
NAME X Y Z Other
0 a 1 1 1 1
2 c 1 2 1 3
4 e 1 1 1 5
grouper[1]
NAME X Y Z Other
1 b 1 1 2 2
3 d 1 2 2 4
grouper[2]
NAME X Y Z Other
6 g 2 2 1 7
8 i 2 1 1 9
grouper[3]
NAME X Y Z Other
5 f 2 1 2 6
7 h 2 2 2 8
9 j 2 1 2 0
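If you also want to know which (X, Z) pair produced each sub-frame, a dict comprehension keyed by the group label is a small variation (a sketch, rebuilding the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': list('abcdefghij'),
    'X': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Y': [1, 1, 2, 2, 1, 1, 2, 2, 1, 1],
    'Z': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    'Other': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
})

# keys are the (X, Z) tuples, values are the sub-frames
groups = {key: frame for key, frame in df.groupby(['X', 'Z'])}

groups[(1, 1)]  # the sub-frame holding rows a, c, e
```

This keeps the group label attached to each frame, which is handy if you later need to know where each piece came from.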
I have an Excel dataset with 100 rows and 100 columns, containing order frequencies in locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-Coördinaten | y-Coördinaten | value
The "value" column only contains positive integers. The x and y columns contain float data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So, basically, a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
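If the real grid is labeled by float coordinates rather than 0..n positions, you can set those labels as the index and columns before stacking; stack then carries them straight into the x and y columns. A sketch with made-up xs/ys values (in practice you would read them from your Excel file):

```python
import pandas as pd

l = [[1, 5, 3, 5], [4, 2, 5, 6], [2, 3, 1, 5]]
xs = [51.1, 51.2, 51.3]        # hypothetical x coordinates, one per row
ys = [4.1, 4.2, 4.3, 4.4]      # hypothetical y coordinates, one per column

df = pd.DataFrame(l, index=xs, columns=ys)
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
```

The first row of df2 is then x=51.1, y=4.1, value=1, so no location information is lost.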
You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
I have a Pandas dataframe like this :
id A B
0 2 2
1 1 1
2 3 3
3 7 7
And I want to duplicate the first row 3 times just below the selected row :
id A B
0 2 2
1 2 2
2 2 2
3 2 2
4 1 1
5 3 3
6 7 7
Is there an existing method in the Pandas library for this?
There is no built-in method for doing just this. However, you can build a list of repeat counts and use df.loc with df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
id A B
0 0 2 2
1 0 2 2
2 0 2 2
3 0 2 2
4 1 1 1
5 2 3 3
6 3 7 7
Use reindex and Index.repeat to create your dataframe:
>>> df.reindex(df.index.repeat([3] + [1] * (len(df) - 1)))
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Another way:
>>> df.loc[[df.index[0]]*3 + df.index[1:].tolist()]
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
A more generalized way, proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index]*repeat_time).sort_index()
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Let us try
n=3
row = 0
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, df.loc[[row]*(n-1)]]).sort_index()
df
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
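The answers above can be wrapped into a small helper that repeats any row n times in place; repeat_row is a hypothetical name, using the same index.repeat idea:

```python
import pandas as pd

def repeat_row(df, row_pos, n):
    """Return a copy of df with the row at position row_pos appearing n times."""
    repeats = [1] * len(df)       # every row once...
    repeats[row_pos] = n          # ...except the chosen row, n times
    return df.reindex(df.index.repeat(repeats)).reset_index(drop=True)

df = pd.DataFrame({'id': [0, 1, 2, 3], 'A': [2, 1, 3, 7], 'B': [2, 1, 3, 7]})
out = repeat_row(df, 0, 4)   # first row present 4 times in total
```

Note this assumes a unique index (true after a default RangeIndex), since reindex with duplicate source labels would be ambiguous.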
I have a df that looks like this:
time val
0 1
1 1
2 2
3 3
4 1
5 2
6 3
7 3
8 3
9 3
10 1
11 1
How do I create new columns that count how many times a condition occurs, without re-counting while the condition stays unchanged? In this case, I want a column for each unique value in val holding the cumulative count of occurrences at each row, where consecutive repeats of the same value do not increment the count.
Expected outcome below:
time val sum_1 sum_2 sum_3
0 1 1 0 0
1 1 1 0 0
2 2 1 1 0
3 3 1 1 1
4 1 2 1 1
5 2 2 2 1
6 3 2 2 2
7 3 2 2 2
8 3 2 2 2
9 3 2 2 2
10 1 3 2 2
11 1 3 2 2
EDIT
To be more specific with the condition:
I want to count the number of times a unique value appears in val. For example, using the code below, I could get this result:
df['sum_1'] = (df['val'] == 1).cumsum()
df['sum_2'] = (df['val'] == 2).cumsum()
df['sum_3'] = (df['val'] == 3).cumsum()
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 2 0 0
2 2 2 2 1 0
3 3 3 2 1 1
4 4 1 3 1 1
5 5 2 3 2 1
However, this code counts EVERY occurrence of the condition. I want to treat consecutive occurrences of the same value as a single group, counting only the number of such groupings. In the example above, 1 occurs 3 times in total, but only 2 times as a consecutive grouping.
You can detect the start of each run by comparing the column against its shifted self with Series.ne and Series.shift, then chain that mask with & (bitwise AND) against an equality test for each unique value of val:
uniq = df['val'].unique()
m = df['val'].ne(df['val'].shift())
for c in uniq:
    df[f'sum_{c}'] = (df['val'].eq(c) & m).cumsum()
print (df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
For better performance (I hope), here is a numpy alternative:
a = df['val'].to_numpy()
uniq = np.unique(a)
m = np.concatenate(([False], a[:-1])) != a
arr = np.cumsum((a[:, None] == uniq) & m[:, None], axis=0)
df = df.join(pd.DataFrame(arr, index=df.index, columns=uniq).add_prefix('sum_'))
print (df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
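Another vectorized, pandas-only variant (a sketch of the same idea): one-hot encode val with pd.get_dummies, zero out the rows that merely continue a run, and take the cumulative sum:

```python
import pandas as pd

df = pd.DataFrame({'time': range(12),
                   'val': [1, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 1]})

# True on the first row of each run of equal values
m = df['val'].ne(df['val'].shift())

# one indicator column per unique value; keep only run starts, then accumulate
sums = pd.get_dummies(df['val']).astype(int).mul(m, axis=0).cumsum().add_prefix('sum_')
df = df.join(sums)
```

This produces the same sum_1/sum_2/sum_3 columns as above without an explicit loop over the unique values.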
I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that counts up from 1 over the distinct b groups within each subgroup of a:
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can do groupby + transform with factorize:
df['C']=df.groupby('a').b.transform(lambda x : x.factorize()[0]+1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 2
9 3
10 1
Name: b, dtype: int64
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is to group by a, iterate through each group, and group by b within it. Then you can use ngroup:
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 3 2
9 3 2 1
10 3 1 0
You can use groupby.rank if you don't care about the order in the data:
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
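A quick sketch of the rank approach on the sample data; note that dense ranking numbers the distinct b values within each a group in sorted order, not in order of first appearance:

```python
import pandas as pd

# same data as df1 above
df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
                   'b': [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1]},
                  index=[4, 5, 6, 2, 3, 15, 12, 13, 15, 14, 8, 9, 10])

# dense rank: equal b values share a number, no gaps between ranks
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
```

So for the a=1 subgroup (b values 2, 3, 3, 1) this gives c = 2, 3, 3, 1 rather than the appearance-ordered 1, 2, 2, 3 that factorize produces.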
In the following DataFrame, the column B computes the sum of column A from index 0 to n.
ix A B
---------------
0 1 1
1 1 2
2 1 3
3 1 4
4 2 6
5 -1 5
6 -3 2
Alternatively, the column B sums 1 for each type == 'I' and -1 for each type == 'O'.
ix type B
----------------
0 I 1
1 I 2
2 O 1
3 I 2
4 O 1
5 O 0
6 I 1
How can I perform this type of calculation, where the n-th result of one column depends on the aggregated result of another column up to n?
You can use cumsum:
df['C'] = df.A.cumsum()
print (df)
ix A B C
0 0 1 1 1
1 1 1 2 2
2 2 1 3 3
3 3 1 4 4
4 4 2 6 6
5 5 -1 5 5
6 6 -3 2 2
For the second df, first map the types to numbers with a dict, then take the cumulative sum:
df['C'] = df.type.map({'I':1, 'O':-1}).cumsum()
print (df)
ix type B C
0 0 I 1 1
1 1 I 2 2
2 2 O 1 1
3 3 I 2 2
4 4 O 1 1
5 5 O 0 0
6 6 I 1 1
Or:
df['C'] = df.type.replace({'I':1, 'O':-1}).cumsum()
print (df)
ix type B C
0 0 I 1 1
1 1 I 2 2
2 2 O 1 1
3 3 I 2 2
4 4 O 1 1
5 5 O 0 0
6 6 I 1 1
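When there are only two types, the dict can also be replaced by numpy's where (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': list('IIOIOOI')})

# +1 for 'I', -1 for everything else, then running total
df['C'] = np.where(df['type'].eq('I'), 1, -1).cumsum()
```

This reproduces the B column from the question: 1, 2, 1, 2, 1, 0, 1.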