Merging rows in a dataframe based on reoccurring values

I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row and B and C are together in another row, then A, B and C should be together. Looking at the dataframe above, the outcome I want is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.

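This looks like a graph theory problem about connected components. You can use the networkx library (this assumes the two columns are named 'a' and 'b'):
import networkx as nx
import pandas as pd
g = nx.from_pandas_edgelist(df, 'a', 'b')
# each connected component is a set of values; sort it so the output order is deterministic
pd.concat([pd.Series([sorted(i)[0],
                      ' '.join(map(str, sorted(i)[1:]))],
                     index=['a', 'b'])
           for i in nx.connected_components(g)], axis=1).T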
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
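If pulling in networkx is not an option, the same "A with B, B with C, so A, B and C together" rule can be sketched with a small union-find (disjoint-set) structure in plain Python. This is a different technique from the one in the answer above, shown only as an illustration; `merge_pairs` and `find` are hypothetical names, and the pairs are copied from the question's dataframe.

```python
def merge_pairs(pairs):
    """Group values that are linked, directly or transitively, by any pair."""
    parent = {}

    def find(v):
        # Walk up to the root of v's group, halving the path as we go.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    # Sort inside and across groups so the result is deterministic.
    return sorted(sorted(g) for g in groups.values())


pairs = [(0, 1), (4, 5), (8, 9), (10, 11), (14, 15), (16, 17), (16, 18),
         (16, 19), (17, 18), (17, 19), (18, 19), (20, 21)]
merged = merge_pairs(pairs)
```

With a dataframe, the pairs could come from `df[['a', 'b']].itertuples(index=False)`, assuming the columns are named `a` and `b` as in the answer.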

Related

Pandas dataframe drop by column

I want to filter a dataframe based on values in a column. Here is how the df looks:
lead_snp Set_1 Set_2 Set_3 Set_4 Set_5 ... Set_4995 Set_4996 Set_4997 Set_4998 Set_4999 Set_5000
0 1:2444414 8 7 1 10 17 ... 16 6 10 12 8 12
1 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
2 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
3 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
4 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
When I run (lead_chrom_only_df.groupby("lead_snp").nunique().drop("lead_snp", axis=1)), I get the error below:
KeyError: "['lead_snp'] not found in axis"
Not sure if I'm missing something obvious, thanks in advance.

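Try passing as_index=False, which keeps lead_snp as a regular column instead of moving it into the index (the reason drop could not find it):
out = lead_chrom_only_df.groupby("lead_snp", as_index=False).nunique().drop("lead_snp", axis=1)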
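A small sketch of why the original call fails, using a made-up three-row frame (the values are illustrative, not the asker's data) and assuming a reasonably recent pandas. By default the grouping key becomes the index of the result, so there is no `lead_snp` column left to drop; `reset_index(drop=True)` is one alternative way to discard it.

```python
import pandas as pd

df = pd.DataFrame({
    "lead_snp": ["1:2444414", "1:1865298", "1:1865298"],
    "Set_1": [8, 2, 2],
    "Set_2": [7, 2, 3],
})

# Default as_index=True: the key lives in the index, not in the columns.
by_key = df.groupby("lead_snp").nunique()

# Alternative: throw the key away by resetting the index.
counts = by_key.reset_index(drop=True)

# as_index=False keeps the key as a column, so drop() can find it.
flat = df.groupby("lead_snp", as_index=False).nunique().drop("lead_snp", axis=1)
```

Both `counts` and `flat` hold the per-group unique counts without the key column.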

How to exclude some string patterns when using filter on pandas?

dataframe
df.columns=['ipo_date','l2y_gg_date','l1k_kk_date']
Goal
Return a dataframe with the columns whose names contain _date, except for ipo_date.
Try
df.filter(regex='_date&^ipo_date')

Try a negative lookbehind:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 21).reshape((5, 4)),
columns=['ipo_date', 'l2y_gg_date', 'l1k_kk_date', 'other'])
filtered = df.filter(regex=r'(?<!ipo)_date')
print(filtered)
Sample df:
ipo_date l2y_gg_date l1k_kk_date other
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
filtered:
l2y_gg_date l1k_kk_date
0 2 3
1 6 7
2 10 11
3 14 15
4 18 19
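To see why the lookbehind works, the pattern can be checked against the column names with the standard `re` module; `DataFrame.filter(regex=...)` keeps a label when `re.search` finds a match in it. The second pattern is an alternative, assumed here for illustration, that uses a negative lookahead anchored at the start of the name.

```python
import re

columns = ["ipo_date", "l2y_gg_date", "l1k_kk_date", "other"]

# (?<!ipo)_date matches "_date" only when it is not directly preceded by "ipo".
lookbehind = re.compile(r"(?<!ipo)_date")
kept_lookbehind = [c for c in columns if lookbehind.search(c)]

# ^(?!ipo_date$).*_date rejects the exact name "ipo_date" up front and
# then requires "_date" somewhere in the name.
lookahead = re.compile(r"^(?!ipo_date$).*_date")
kept_lookahead = [c for c in columns if lookahead.search(c)]
```

The two patterns differ on edge cases: the lookbehind also drops a name such as `xipo_date`, because it only looks at the three characters before `_date`, while the lookahead excludes nothing but the exact name `ipo_date`.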

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2 and 13 to replace with the randomly selected indexes 6 and 9; the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15

You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1,16), 'B': range(1,16), 'C': range(1,16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement:
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, pass the keyword replace=True when defining samp.

@James's answer is a smart pandas solution. However, since you noted that your dataset has millions of rows, you could also consider NumPy, because pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))  # Total rows in both sets
    idx = np.arange(n)  # positional indices, independent of the index dtype
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    # Assign through .iloc: writing to df.values may only change a copy
    df.iloc[to_repl] = df.iloc[repl_with].to_numpy()
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
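The steps above can be sketched on a plain NumPy array, which sidesteps the question of whether a dataframe hands back a view or a copy. This is a minimal illustration rather than a benchmarked solution; the array contents mirror the question's example, and the seed is an assumption added only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
data = np.arange(1, 16).repeat(3).reshape(15, 3)  # rows [1,1,1] .. [15,15,15]
original = data.copy()

n = data.shape[0]
k = int(np.ceil(n * 0.1))  # number of rows to overwrite (2 here)

# One draw without replacement, then split it into targets and sources.
picked = rng.choice(n, size=2 * k, replace=False)
to_repl, repl_with = np.split(picked, 2)

# Fancy indexing on the right-hand side copies the source rows first.
data[to_repl] = data[repl_with]
```

Because targets and sources come from a single draw without replacement, no row is both overwritten and used as a source. If each target should instead get an independently chosen source, as the question describes, the sources could come from a separate draw with `replace=True`.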

Use pandas to read in text file with row as column names

I'm working on a project to read in a text file of variable length which will be generated by a user. There are several comments at the beginning of the text file, one of which needs to be used as the column name. I know it is possible to do this with genfromtxt(), but I am required to use pandas. Here is the beginning of a sample text file:
#GeneratedFile
#This file will be generated by a user
#a b c d f g h i j k l m n p q r s t v w x y z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
I need #a, b, c,... to be the column names. I tried the following lines of code to read in the data and change it to an array, but it returned only rows and ignored the column names.
import pandas as pd
data = pd.read_table('example.txt',header=2)
d = pd.DataFrame.as_matrix(data)
Is there a way to do this without using genfromtxt()?

One way is to try the following:
df = pd.read_csv('example.txt', sep=r'\s+', engine='python', header=2)
# the first column name becomes '#a', so rename it
df.rename(columns={'#a':'a'}, inplace=True)
# alternatively, strip '#' from all the column names
#df.columns = [column_name.replace('#', '') for column_name in df.columns]
print(df)
Result:
a b c d f g h i j k ... p q r s t v w x y z
0 0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 21 22
1 1 2 3 4 5 6 7 8 9 10 ... 14 15 16 17 18 19 20 21 22 23
[2 rows x 23 columns]
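The same approach can be tried without a file on disk by feeding the text through `io.StringIO`. The sketch below uses a shortened, three-column version of the sample file (an assumption for brevity). It also uses `to_numpy()` to get the array, since `as_matrix` was removed in pandas 1.0.

```python
import io

import pandas as pd

text = """#GeneratedFile
#This file will be generated by a user
#a b c
0 1 2
1 2 3
"""

# header=2 takes the third line as the header and skips the two lines above it.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", header=2)

# Strip the leading '#' that is left on the first column name.
df.columns = [name.lstrip("#") for name in df.columns]

values = df.to_numpy()  # plain 2-D array of the data rows
```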

Multiindex on DataFrames and sum in Pandas

I am currently trying to make use of Pandas MultiIndex attribute. I am trying to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of MultiIndex.
print df_original =
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame-object, with A, B and C, and by_portfolio as indices. Looking like
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought after indices into list-objects, and from there create a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values after.
Perhaps some sort of groupby is better for this purpose? Thing is I will need to be able to add this data to another, similar, DataFrame, so I will need the resulting DataFrame to be able to be added to another one later on.
Thanks

You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
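For reference, the same reshaping written as one self-contained chain on a reduced version of the question's data (two portfolios and two value columns, chosen here only to keep the sketch short). The last line reorders the columns to the CHF, AUD order shown in the question.

```python
import pandas as pd

df_original = pd.DataFrame({
    "by_currency": ["AUD", "AUD", "CHF", "CHF"],
    "by_portfolio": ["a", "b", "a", "b"],
    "A": [1, 4, 13, 16],
    "B": [2, 5, 14, 17],
})

result = (
    df_original.set_index(["by_currency", "by_portfolio"])
    .stack()                 # A/B become the innermost index level
    .unstack("by_currency")  # currencies become the columns
    .swaplevel(0, 1)         # put A/B before the portfolio
    .sort_index()
)
result.columns.name = None
result = result[["CHF", "AUD"]]  # column order requested in the question
```

Since the result is an ordinary DataFrame with a MultiIndex, it can later be added to another frame of the same shape with `+` or `DataFrame.add`, which align on both index levels.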
