Looping through specific columns in two separate text files - python

I have two text files A and B, with 16 and 14 columns respectively.
The columns in these files are separated with spaces.
For each entry in column 9 of file A, I want to check if the entry is in column 8 of file B.
If it is, I would like to add this value to a new file (file C). However, I would like file C to retain the same format as file A.
In other words, each line of file C should keep the same 16 columns as file A (with the matched value optionally appended as a 17th).
I have been unable to figure out how to approach this problem, so I cannot include any progress of my own. Any help is appreciated.
Thank you in advance.

You can read both files into lists, extract B's 8th column into a list, and then iterate over file A, checking whether each line's 9th element appears in that list.
If there is a match, the matching value is appended to the end of the line from A; otherwise the line from A is written unchanged.
NOTE: if you do not need the line when there is no match, you can delete the else part.
Code
alines = [line.rstrip('\n') for line in open('aa.txt')]
blines = [line.rstrip('\n') for line in open('bb.txt')]

column8b = []
for line in blines:
    column8b.append(line.split(" ")[7])

with open('cc.txt', "w") as oFile:
    for line in alines:
        element = line.split(" ")[8]
        if element in column8b:
            oFile.write(line + " " + element + "\n")
        # Delete this else branch if you do not want to write A into C
        # when there is no match between A[9] and B[8]
        else:
            oFile.write(line + "\n")
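Side note: if file B is large, converting column8b to a set (column8b = set(column8b)) before the loop over alines makes the membership test constant-time instead of a linear scan over the list.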
Sample Data:
aa.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
bb.txt
1 2 3 4 5 6 7 16 9 10 11 12 13 14
1 2 3 4 5 6 7 36 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
cc.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16 36
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16

If you read the files in line by line, you can pull out the relevant information as you go. Note that each line must be split into fields first; indexing the line directly (line[7]) would give you its 8th character, not its 8th column.
your_file_A = open("FILEPATH.EXTENSION")
your_file_B = open("FILEPATH.EXTENSION")
your_file_C = open("FILEPATH.EXTENSION", 'w')

col8_of_B = []
for line in your_file_B:
    col8_of_B.append(line.split()[7])  # split()[7] is column 8

for line in your_file_A:
    if line.split()[8] in col8_of_B:  # split()[8] is column 9
        your_file_C.write(line)

your_file_A.close()
your_file_B.close()
your_file_C.close()
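As a design note, opening the files in with blocks (as the first answer does for its output file) would close them automatically; as written, the script relies on the files being closed when it exits.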

What about awk (since you have the bash tag)?
awk 'FNR==NR {b[$8]=$0;next} b[$9] {print $0,$9 }' b a > c
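Here FNR==NR is true only while awk reads the first file (b), so each of its lines is stored in the array b, keyed by its 8th field. For each line of the second file (a), the pattern b[$9] is true exactly when a's 9th field appeared in b's column 8, and print $0,$9 emits the line with that field appended. Note that, unlike the Python version above, this only outputs the matching lines.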

Related

Merging rows in a dataframe based on reoccurring values

I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row, and B and C are together in another row, then A, B and C belong together. What I want as an outcome, looking at the dataframe above, is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph theory problem dealing with connected components. You can use the networkx library:
import networkx as nx
import pandas as pd

g = nx.from_pandas_edgelist(df, 'a', 'b')
pd.concat([pd.Series([list(i)[0],
                      ' '.join(map(str, list(i)[1:]))],
                     index=['a', 'b'])
           for i in list(nx.connected_components(g))], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
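Note that from_pandas_edgelist as written assumes df's two columns are named 'a' and 'b'; substitute your actual column names. Also, connected_components yields sets, so the order of values within a merged group is not guaranteed.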

Pandas dataframe drop by column

I want to filter a dataframe based on values in a column. Here is how the df looks:
lead_snp Set_1 Set_2 Set_3 Set_4 Set_5 ... Set_4995 Set_4996 Set_4997 Set_4998 Set_4999 Set_5000
0 1:2444414 8 7 1 10 17 ... 16 6 10 12 8 12
1 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
2 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
3 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
4 1:1865298 2 2 11 21 6 ... 16 3 13 17 8 3
When I run (lead_chrom_only_df.groupby("lead_snp").nunique().drop("lead_snp", axis=1)), I get the error below:
KeyError: "['lead_snp'] not found in axis"
Not sure if I'm missing something obvious, thanks in advance.
Try passing as_index=False:
out = lead_chrom_only_df.groupby("lead_snp", as_index=False).nunique().drop("lead_snp", axis=1)
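The KeyError occurs because, with the default as_index=True, the group key lead_snp becomes the index of the result rather than a column, so drop("lead_snp", axis=1) cannot find it. An equivalent sketch that keeps the default and discards the key via the index instead:
out = lead_chrom_only_df.groupby("lead_snp").nunique().reset_index(drop=True)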

How can I transform every 7 rows into a single row in a csv file with Python?

csv file example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
new csv file which I want:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
Note: I want a tab between the numbers.
This can be done in pandas:
import pandas as pd

df = pd.read_csv('filename.csv', header=None)
# reshape assumes the number of values is a multiple of 7
df = pd.DataFrame(df.values.reshape(-1, 7))
# index=False keeps the row index out of the output file
df.to_csv("output.csv", sep="\t", header=False, index=False)
output df:
    0   1   2   3   4   5   6
0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14
2  15  16  17  18  19  20  21
file = open("yourFile.csv", "r")
newFile = open("newFile.csv", "w")

for line in file.read().strip().split("\n"):
    newFile.write(line)
    # Relies on the values being consecutive integers starting at 1,
    # so every multiple of 7 ends a row.
    if int(line) % 7 == 0:
        newFile.write("\n")
    else:
        newFile.write("\t")

file.close()
newFile.close()
I would have preferred that you show your work first before I post an answer.
Here's the answer anyway; it should give you some pointers for solving this yourself.
with open('abc.txt', 'r') as f1, open('xyz.txt', 'w') as f2:
    write_line = ''
    for i, line in enumerate(f1):
        line = line.strip()
        if i != 0 and i % 7 == 0:
            # 7 lines are buffered; flush them as one row
            write_line += '\n'
            f2.write(write_line)
            write_line = line + '\t'
        else:
            write_line += line + '\t'
    # Flush whatever is left over (the last, possibly partial, row)
    if write_line:
        f2.write(write_line)
Keep track of the row you read. For the first row (i == 0), just add the line plus a tab to a temp variable. When the line index is a nonzero multiple of 7 (the 8th, 15th, ... line, i % 7 == 0), you have already buffered 7 rows, so write them to the output file and restart the temp variable with the current line.
Once you are out of the loop, make sure to write whatever is still buffered. The file may end up having, say, 20 rows, and you still want the last 6 rows in the output file.
The output of this will be:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
If the input file had values from 1 thru 24, the output file would look like this:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a single random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2 and 13 to replace with randomly selected indexes 6 and 9; the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, pass replace=True when defining samp.
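A minimal sketch of that variant, reusing the df defined above:
samp = df.sample(frac=0.1, replace=True)  # the same source row may now be drawn more than once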
@James' answer is a smart Pandas solution. However, since you noted your dataset runs into millions of rows, you could also consider NumPy, given that Pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place. Note: this relies on df.values exposing the
    # underlying array; if .values returns a copy (e.g. for mixed dtypes),
    # the assignment will not stick.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))   # total rows in both sets
    idx = np.arange(n, dtype=np.intp)  # platform index dtype
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
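A minimal usage sketch, assuming the repl_rows definition and imports above, with a df like the question's example:
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
repl_rows(df, 0.1)  # ~10% of rows are now copies of other rows, modified in place
print(df)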

Formatting output as table

Example input:
[('b', 'c', 4),
('l', 'r', 5),
('i', 'a', 6),
('c', 't', 7),
('a', '$', 8),
('n', '$', 9)]
[0] contains the vertical heading, [1] contains the horizontal heading.
Example output:
  c r a t $ $
b 4
l   5
i     6
c       7
a         8
n           9
Note: given enough tuples the entire table could be filled :P
How do I format output as a table in Python using [preferably] one line of code?
Here's an answer for your revised question:
data = [
    ['A', 'a', '1'],
    ['B', 'b', '2'],
    ['C', 'c', '3'],
    ['D', 'd', '4']
]

# Desired output:
#
#   A B C D
# a 1
# b   2
# c     3
# d       4
# Check data consists of colname, rowname, value triples
assert all([3 == len(row) for row in data])

# Convert all data to strings
data = [[str(c) for c in r] for r in data]

# Check all data is one character wide
assert all([1 == len(s) for r in data for s in r])
#============================================================================
# Verbose version
#============================================================================
col_names, row_names, values = zip(*data)  # Transpose

header_line = '  ' + ' '.join(col_names)
row_lines = []
for idx, (row_name, value) in enumerate(zip(row_names, values)):
    # Use '  '*n to get 2n consecutive spaces.
    row_line = row_name + ' ' + '  '*idx + value
    row_lines.append(row_line)

print(header_line)
for r in row_lines:
    print(r)
Or, if that's too long for you, try this:
cs, rs, vs = zip(*data)
print('\n'.join(['  ' + ' '.join(cs)] + [r + ' ' + '  '*i + v for i, (r, v) in enumerate(zip(rs, vs))]))
Both have the following output:
  A B C D
a 1
b   2
c     3
d       4
Here's the kernel of what you want (no header row or header column):
>>> print('\n'.join([''.join([str(i+j+2).rjust(3)
...                           for i in range(10)]) for j in range(10)]))
2 3 4 5 6 7 8 9 10 11
3 4 5 6 7 8 9 10 11 12
4 5 6 7 8 9 10 11 12 13
5 6 7 8 9 10 11 12 13 14
6 7 8 9 10 11 12 13 14 15
7 8 9 10 11 12 13 14 15 16
8 9 10 11 12 13 14 15 16 17
9 10 11 12 13 14 15 16 17 18
10 11 12 13 14 15 16 17 18 19
11 12 13 14 15 16 17 18 19 20
It uses a nested list comprehension over i and j to generate the numbers i+j+2, then str.rjust() to pad all fields to three characters in length, and finally some str.join()s to put all the substrings together.
Assuming python 2.x, it's a bit ugly, but it's functional:
import operator
from functools import partial

x = range(1, 11)
y = range(0, 11)
multtable = [y] + [[i] + map(partial(operator.add, i), y[1:]) for i in x]
for i in multtable:
    for j in i:
        print str(j).rjust(3),
    print
0 1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10 11
2 3 4 5 6 7 8 9 10 11 12
3 4 5 6 7 8 9 10 11 12 13
4 5 6 7 8 9 10 11 12 13 14
5 6 7 8 9 10 11 12 13 14 15
6 7 8 9 10 11 12 13 14 15 16
7 8 9 10 11 12 13 14 15 16 17
8 9 10 11 12 13 14 15 16 17 18
9 10 11 12 13 14 15 16 17 18 19
10 11 12 13 14 15 16 17 18 19 20
Your problem is so darn specific that it's difficult to make a truly generic example.
The important part here, though, is the part that makes the table, rather than the actual printing:
[map(partial(operator.add, i), y[1:]) for i in x]
