Replicating rows in a pandas data frame by a column value [duplicate] - python

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd

what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v':  [10, 13, 13, 8, 8, 8]
})
Is this possible?

You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
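
Putting the pieces together, a minimal end-to-end sketch using the question's data:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'], 'n': [1, 2, 3], 'v': [10, 13, 8]})
out = (df.loc[df.index.repeat(df['n'])]  # replicate each row n times
         .drop('n', axis=1)              # the count column is no longer needed
         .reset_index(drop=True))        # restore a clean 0..N-1 index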

You could use set_index and repeat:
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
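
The same idea extends to frames with several value columns: repeat the id-indexed labels and select with .loc (a small sketch, assuming the id values are unique so label-based selection is unambiguous):
tmp = df.set_index('id')
out = tmp.loc[tmp.index.repeat(tmp['n'])].drop('n', axis=1).reset_index()
# every column other than n is kept, one row per repetition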

It's something like uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
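(If you want to try this, the package is published on PyPI, so something like pip install datar should make the imports above available; the exact install extras may vary by version.)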

Not the best solution, but I want to share it: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the index.
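For completeness, the full chain would then be:
out = (df.reindex(df.index.repeat(df['n']))
         .drop('n', axis=1)
         .reset_index(drop=True))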

Related

Join tables and create combinations in python

In advance: sorry, the title is a bit fuzzy.
I have two tables. One contains unique names, for example 'A', 'B', 'C', and the other contains a time series of months, for example 10/2021, 11/2021, 12/2021. I want to join the tables so that I have every timestamp for each name. The final data should look like this:
Month    Name
10/2021  A
11/2021  A
12/2021  A
10/2021  B
11/2021  B
12/2021  B
10/2021  C
11/2021  C
12/2021  C
From cartesian product in pandas:
df1 = pd.DataFrame([1, 2, 3], columns=['A'])
df2 = pd.DataFrame(["a", "b", "c"], columns=['B'])
df = (df1.assign(key=1)
         .merge(df2.assign(key=1), on="key")
         .drop("key", axis=1))
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
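As a side note, pandas 1.2+ supports a cross merge directly, which avoids the dummy key column:
df = df1.merge(df2, how="cross")  # cartesian product of the rows of df1 and df2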
If you are only trying to get the cartesian product of the values, you can do it using itertools.product:
import pandas as pd
from itertools import product
df1 = pd.DataFrame(list('abcd'), columns=['letters'])
df2 = pd.DataFrame(list('1234'), columns=['numbers'])
df_combined = pd.DataFrame(product(df1['letters'], df2['numbers']), columns=['letters', 'numbers'])
Output:
letters numbers
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
12 d 1
13 d 2
14 d 3
15 d 4
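
Applied to the question's names and months, the same pattern gives the desired table (a sketch using the example values from the question; the column reordering at the end matches the requested Month/Name layout):
import pandas as pd
from itertools import product

names = pd.DataFrame({'Name': ['A', 'B', 'C']})
months = pd.DataFrame({'Month': ['10/2021', '11/2021', '12/2021']})
out = pd.DataFrame(product(names['Name'], months['Month']),
                   columns=['Name', 'Month'])[['Month', 'Name']]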

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. In the code below, I'm grouping values by column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100. Both conditions have to hold on the same row, though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
    'A': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B': [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C': ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended output:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition, then filter the original column C by Series.isin in boolean indexing with an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with GroupBy.any to test whether at least one value matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible with any plus not, which reduces the mask to a scalar per group; for a large DataFrame it should be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
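
The key detail in all of these is to combine the two conditions row-wise with & first and only then reduce per group; otherwise A < 1 and B > 100 could be satisfied by different rows of the same group. A minimal check of that logic (helper names are just for illustration):
mask = df['A'].lt(1) & df['B'].gt(100)   # both conditions on the same row
bad_groups = df.loc[mask, 'C'].unique()  # groups containing such a row -> ['e']
df1 = df[~df['C'].isin(bad_groups)]      # drop every row of those groups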

How to delete rows in a pandas dataframe based on the row names of another dataframe?

I want to shrink my data; the whole data shape is 30000x480. And I want to drop some rows based on the row names of another data frame.
Please help me solve it for this example:
df1
Row a b
A 1 2
B 3 4
C 5 6
D 7 8
E 9 10
F 11 12
G 13 14
df2
Row a b
C 5 6
D 7 8
F 11 12
G 13 14
So, I want to delete the rows in df1 that don't exist in df2. It's hard to delete them manually because the data is very big.
For a better understanding, let me put the same question in a different context using the same data:
Question: you want to delete the rows in df1 that don't exist in df2.
Put another way: you need the rows of df1 that are present in df2, i.e. the common rows of both df1 and df2. Try this:
>>> import pandas as pd
>>> df2 = pd.DataFrame({'Row': ['C', 'D', 'F','G'], 'a': [5, 7, 11, 13], 'b' : [6, 8, 12, 14]})
>>> df1 = pd.DataFrame({'Row' : ['A', 'B', 'C', 'D'], 'a': [1,3,5,7], 'b': [2,4,6, 8]})
>>> df1
Row a b
0 A 1 2
1 B 3 4
2 C 5 6
3 D 7 8
>>> df2
Row a b
0 C 5 6
1 D 7 8
2 F 11 12
3 G 13 14
>>> pd.merge(df1, df2, 'inner')
Row a b
0 C 5 6
1 D 7 8
>>>
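
If you would rather filter df1 directly than rebuild its rows through the merge, an isin check on the Row column does the same job (a small sketch, assuming Row uniquely identifies each row):
df1_filtered = df1[df1['Row'].isin(df2['Row'])]  # keep rows whose Row value appears in df2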

How can I extract a column from a dataframe and attach it as rows while keeping other columns intact

How can I extract a column from a pandas dataframe and attach it as rows while keeping the other columns the same?
This is my example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': np.arange(0, 5),
                   'sample_1': [5, 6, 7, 8, 9],
                   'sample_2': [10, 11, 12, 13, 14],
                   'group_id': ["A", "B", "C", "D", "E"]})
The output I'm looking for is:
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
                    'sample_1': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
                    'group_id': ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]})
I have tried slicing the dataframe and concatenating with pd.concat, but it gave NaN values. My original dataset is large.
You could do this using stack: set the index to the columns you don't want to modify, call stack, sort by the stacked values so that the sample_1 rows come first, then reset the index:
df.set_index(['ID','group_id']).stack().sort_values(0).reset_index([0,1]).reset_index(drop=True)
ID group_id 0
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
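Note that the stacked value column comes back labelled 0; if you want it called sample_1 to match the desired output, tack a rename onto the chain (assuming the chained result above is bound to out):
out = out.rename(columns={0: 'sample_1'})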
Using pd.wide_to_long:
res = pd.wide_to_long(df, stubnames='sample_', i='ID', j='group_id')
res.index = res.index.droplevel(1)
res = res.rename(columns={'sample_': 'sample_1'}).reset_index()
print(res)
ID group_id sample_1
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
The function you are looking for is called melt. Note that value_name must not collide with an existing column name, so melt to a temporary name and rename afterwards:
df2 = pd.melt(df, id_vars=['ID', 'group_id'], value_vars=['sample_1', 'sample_2'], value_name='sample')
df2 = df2.drop('variable', axis=1).rename(columns={'sample': 'sample_1'})

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to compute sums across rows and across columns.
The row direction is not a big deal; I got that result like this:
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write the code to get the result below (simply adding the values in columns A and B, and likewise columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I'd rather not do it like this (it looks too dull, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. Here I specified axis=1, which means the columns:
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
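If you want to avoid axis=1 (it is deprecated in recent pandas), the same grouping works on the transposed frame (a sketch reusing the d mapping above, with only the numeric columns selected first):
result = df[['A', 'B', 'C', 'D']].T.groupby(d).sum().T  # group the former columns, then flip back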
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
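
Worth noting: the row-wise apply above is easy to read but slow on large frames; plain column arithmetic is the vectorized equivalent:
df['AB'] = df['A'] + df['B']
df['CD'] = df['C'] + df['D']
df = df.drop(['A', 'B', 'C', 'D'], axis=1)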
