How to get odd column headers in Python? - python

I have data in file as shown below:
odd_column.dat
X1 X2 X3 X4 X5 X6 X7
1 1 1 1 2 2 2
2 2 4 2 5 5 3
3 3 9 3 10 10 4
4 4 16 4 17 17 5
5 5 25 5 26 26 6
6 6 36 6 37 37 7
7 7 49 7 50 50 8
8 8 64 8 65 65 9
9 9 81 9 82 82 10
And I am trying to get the odd column headers with this code (which does not work):
Code
import numpy as np
with open('odd_column.dat', "r") as data:
while True:
line = data.readline()
if not line.startswith('#'):
break
data_header = [i for i in line.strip().split('\t') if i]
odd_column_header = data_header[n for n in (1, 3, 5, 7)]
I have given only 7 total columns as an example. I would like to generalize it for thousands of columns, so that I get the headers of only the odd columns. How can this be done in Python?

Just use Python slicing:
odd_column_header = data_header[0::2]
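For instance, with the sample file above (a quick sketch; it splits on whitespace rather than tabs, since the data shown looks space-separated):

with open('odd_column.dat') as data:
    header = data.readline().split()  # ['X1', 'X2', ..., 'X7']

odd_column_header = header[0::2]      # every second name, starting with X1
print(odd_column_header)              # ['X1', 'X3', 'X5', 'X7']

This works unchanged for thousands of columns, since the slice simply takes every second element.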

Related

Using numpy to compute the squared sum of distances between values

I am a newbie to Python and Stack Overflow, and I am trying to change my way of thinking about loops.
I have a series of values whose type is <class 'pandas.core.series.Series'>.
Goal: given a depth n, I would like to compute, for each value (except the first 2*n-2):
result(i) = sum[j=0 to n-1](distance(i-j) * value(i-j)) / sum[j=0 to n-1](distance(i-j))
with distance(i) = sum[k=1 to n-1]((value(i) - value(i-k))^2)
I want to avoid loops, so is there a better way to achieve my goal using numpy?
EDIT :
OK, it seems that I was not clear, so here is an example with n = 4:
Index  Value
    0      2
    1      4
    2      5
    3      3
    4      1
    5      8
    6      9
    7      4
    8      2
    9      1
   10      7
Then I compute the squared difference (value[i] - value[j])^2 with j = i-1 to i-3:
diff²    0    1    2    3    4    5    6    7    8    9   10
    0
    1
    2
    3    1    1    4
    4         9   16    4
    5              9   25   49
    6                   36   64    1
    7                         9   16   25
    8                              36   49    4
    9                                   64    9    1
   10                                        9   25   36
I think that getting this matrix, full or not, is the core of my problem.
I can now compute distance(i), which is the sum of a row, and distance(i)*value(i):
Index  distance  distance x Value
    0
    1
    2
    3         6                18
    4        29                29
    5        83               664
    6       101               909
    7        50               200
    8        89               178
    9        74                74
   10        70               490
And finally I can get the result:
Index  Value       Result
    0      2
    1      4
    2      5
    3      3
    4      1
    5      8
    6      9  7.397260274
    7      4  6.851711027
    8      2  6.040247678
    9      1  4.334394904
   10      7  3.328621908
For example:
result(10) = (distance(10)*value(10)+distance(9)*value(9)+distance(8)*value(8)+distance(7)*value(7))/(distance(10)+distance(9)+distance(8)+distance(7))
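Plugging in the numbers from the tables above: result(10) = (70*7 + 74*1 + 89*2 + 50*4) / (70 + 74 + 89 + 50) = 942 / 283 ≈ 3.3286, which matches the last row of the result table.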
I have a Java version of the algorithm if needed.
Thank you.
UPDATE:
I finally found how to get the full squared-differences matrix:
import numpy as np
import pandas as pd

n = 4
myseries = pd.Series([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7])
l = len(myseries)
vector = np.repeat(myseries, l)          # each value repeated l times
mat = vector.to_numpy().reshape((l, l))  # row i holds value[i] everywhere
diff = mat - np.transpose(mat)           # diff[i, j] = value[i] - value[j]
squared_diff = np.multiply(diff, diff)
print(squared_diff)
I still have to get the sum of the selected elements.
You could do it like this:
myseries = pd.Series(np.random.rand(100), dtype='float32')
sum_of_squared_distances = np.sum(np.square(np.diff(myseries.values[:n][1::2])))
where n is the depth, and the [1::2] part keeps only the odd-index values, since you only need the values corresponding to odd indices (except the first 2*n-2).

Applying Pandas iterrows logic across many groups in a dataframe

I am having trouble applying some logic across my entire dataset. I am able to apply the logic to a small "group", but not to all of the groups (note: the groups are made by primaryFilter and secondaryFilter). Do you all mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np
myInput = {
    'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
    'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
    'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
    'someValue': [3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan

for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']

df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test
I would now like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
No need to use iterrows here. You can group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of the values in column someValue and shift the resulting cumulative sum one position downwards to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
    primaryFilter  secondaryFilter  constantValuePerGroup  someValue  newColumn  delta
 0            100                1                     15          3          0     15
 1            100                1                     15          1          3     12
 2            100                1                     15          4          4     11
 3            100                1                     15          7          8      7
 4            100                2                     20          9          0     20
 5            100                2                     20          9          9     11
 6            100                2                     20          2         18      2
 7            100                3                     17          7          0     17
 8            100                3                     17          3          7     10
 9            100                3                     17          7         10      7
10            200                1                     10          6          0     10
11            200                1                     10          4          6      4
12            200                2                     30          7          0     30
13            200                2                     30         10          7     23
14            200                2                     30         10         17     13
15            200                2                     30          3         27      3
16            200                3                     22          4          0     22
17            200                3                     22          6          4     18
18            200                3                     22          7         10     12
19            200                3                     22          5         17      5

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2 and 13 to replace with randomly selected indexes 6 and 9; the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the sampled rows over a disjoint set of target rows, so no row is both source and target:
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace = True when defining samp
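For example (a one-line sketch):

samp = df.sample(frac=0.1, replace=True)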
@James' answer is a smart Pandas solution. However, given that you noted your dataset length is somewhere in the millions, you could also consider NumPy, since Pandas often comes with significant performance overhead.
def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` inplace.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))   # Total rows in both sets
    idx = np.arange(n, dtype=np.intp)  # platform's native index type
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
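A quick sanity check of this sketch on the example frame (note that writing through df.values as above only sticks when the frame has a single dtype, as it does here):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
repl_rows(df, 0.1)  # 2 * ceil(15 * 0.1) = 4 sampled positions: 2 targets, 2 sources
print(df)           # two rows now duplicate two other rows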

Looping through specific columns in two separate text files

I have two text files A and B, with 16 and 14 columns respectively.
The columns in these files are separated with spaces.
For each entry in column 9 of file A, I want to check if the entry is in column 8 of file B.
If it is, I would like to add this value to a new file (file C). However, I would like file C to retain the same format as file A.
In other words, this new file should contain 17 columns (the 16 columns of file A plus the appended match).
I have been unable to figure out how to approach this problem and cannot include my progress as a result. Any help is appreciated.
Thank you in advance.
You can read both files into lists, extract B's 8th column into a list, and then iterate over file A, checking whether its 9th element appears in that list of B's column 8.
If there is a match, I append the matched value at the end of the line from A; otherwise I just write the line from A as-is.
NOTE: if you do not need the line when there is no match, you can delete the else part.
Code
alines = [line.rstrip('\n') for line in open('aa.txt')]
blines = [line.rstrip('\n') for line in open('bb.txt')]

column8b = []
for line in blines:
    column8b.append(line.split(" ")[7])

with open('cc.txt', "w") as oFile:
    for line in alines:
        element = line.split(" ")[8]
        if element in column8b:
            oFile.write(line + " " + element + "\n")
        ## Delete this if you do not want to write A into C
        ## when there is no match between A[9] and B[8]
        else:
            oFile.write(line + "\n")
Sample Data:
aa.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
bb.txt
1 2 3 4 5 6 7 16 9 10 11 12 13 14
1 2 3 4 5 6 7 36 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
cc.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16 36
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
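One note on the code above: for large files, a set makes the membership test element in column8b constant-time instead of a linear scan over a list. A small optional tweak (sketch):

# Build column8b as a set for O(1) lookups; the rest of the code is unchanged,
# since the `in` operator works the same on a set.
column8b = {line.split(" ")[7] for line in blines}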
If you read in the file line by line, then you can pull out the relevant information you want.
your_file_A = open("FILEPATH.EXTENSION")
your_file_B = open("FILEPATH.EXTENSION")
your_file_C = open("FILEPATH.EXTENSION", 'w')

col8_of_B = []
for line in your_file_B:
    col8_of_B.append(line.split()[7])  # split()[7] is column 8

for line in your_file_A:
    if line.split()[8] in col8_of_B:   # split()[8] is column 9
        your_file_C.write(line)
What about awk (since you have the bash tag)?
awk 'FNR==NR {b[$8]=$0;next} b[$9] {print $0,$9 }' b a > c

Formatting output as table

Example input:
[('b', 'c', 4),
('l', 'r', 5),
('i', 'a', 6),
('c', 't', 7),
('a', '$', 8),
('n', '$', 9)]
[0] contains the vertical heading, [1] contains the horizontal heading.
Example output:
  c r a t $ $
b 4
l   5
i     6
c       7
a         8
n           9
Note: given enough tuples the entire table could be filled :P
How do I format output as a table in Python using [preferably] one line of code?
Here's an answer for your revised question:
data = [
    ['A','a','1'],
    ['B','b','2'],
    ['C','c','3'],
    ['D','d','4']
]

# Desired output:
#
#   A B C D
# a 1
# b   2
# c     3
# d       4

# Check data consists of colname, rowname, value triples
assert all([3 == len(row) for row in data])

# Convert all data to strings
data = [[str(c) for c in r] for r in data]

# Check all data is one character wide
assert all([1 == len(s) for r in data for s in r])
#============================================================================
# Verbose version
#============================================================================
col_names, row_names, values = zip(*data)  # Transpose
header_line = '  ' + ' '.join(col_names)
row_lines = []
for idx, (row_name, value) in enumerate(zip(row_names, values)):
    # Use '  '*n to get 2n consecutive spaces.
    row_line = row_name + ' ' + '  '*idx + value
    row_lines.append(row_line)

print(header_line)
for r in row_lines:
    print(r)
Or, if that's too long for you, try this:
cs, rs, vs = zip(*data)
print('\n'.join(['  ' + ' '.join(cs)] + [r + ' ' + '  '*i + v for i, (r, v) in enumerate(zip(rs, vs))]))
Both have the following output:
  A B C D
a 1
b   2
c     3
d       4
Here's the kernel of what you want (no header row or header column):
>>> print('\n'.join([ ''.join([str(i+j+2).rjust(3)
for i in range(10)]) for j in range(10) ]))
  2  3  4  5  6  7  8  9 10 11
  3  4  5  6  7  8  9 10 11 12
  4  5  6  7  8  9 10 11 12 13
  5  6  7  8  9 10 11 12 13 14
  6  7  8  9 10 11 12 13 14 15
  7  8  9 10 11 12 13 14 15 16
  8  9 10 11 12 13 14 15 16 17
  9 10 11 12 13 14 15 16 17 18
 10 11 12 13 14 15 16 17 18 19
 11 12 13 14 15 16 17 18 19 20
It uses a nested list comprehension over i and j to generate the numbers i+j+2, then str.rjust() to pad all fields to three characters in length, and finally some str.join()s to put all the substrings together.
Assuming python 2.x, it's a bit ugly, but it's functional:
import operator
from functools import partial

x = range(1, 11)
y = range(0, 11)
multtable = [y] + [[i] + map(partial(operator.add, i), y[1:]) for i in x]
for i in multtable:
    for j in i:
        print str(j).rjust(3),
    print
  0   1   2   3   4   5   6   7   8   9  10
  1   2   3   4   5   6   7   8   9  10  11
  2   3   4   5   6   7   8   9  10  11  12
  3   4   5   6   7   8   9  10  11  12  13
  4   5   6   7   8   9  10  11  12  13  14
  5   6   7   8   9  10  11  12  13  14  15
  6   7   8   9  10  11  12  13  14  15  16
  7   8   9  10  11  12  13  14  15  16  17
  8   9  10  11  12  13  14  15  16  17  18
  9  10  11  12  13  14  15  16  17  18  19
 10  11  12  13  14  15  16  17  18  19  20
Your problem is so darn specific that it's difficult to make a really generic example.
The important part here, though, is the part that makes the table, rather than the actual printing:
[map(partial(operator.add,i),y[1:]) for i in x]
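Since that snippet relies on Python 2's map returning a list, here is a Python 3 translation of the same idea (a sketch, not from the original answer):

import operator
from functools import partial

x = range(1, 11)
y = range(0, 11)

# list(...) is needed in Python 3, where map returns a lazy iterator
multtable = [list(y)] + [[i] + list(map(partial(operator.add, i), y[1:])) for i in x]
for row in multtable:
    print(''.join(str(j).rjust(4) for j in row))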
