With my Python code I'm looking for a cell with a specific table name, in this case 'Quality distribution'. In the Excel file there are two tables with this name and I only want to work with the first table.
My code is working correctly if there is only one cell with the specific table name, but now my code is finding the first cell with 'Quality distribution' and then goes looking for a second cell and starts the index at the second table. How can I adjust my code so that I work with the first table?
My Excel file consists of 12 tables in columns A and B, and every table has 67 to 350 rows. Each table's name is stated above the table.
An example (I have deleted some tables and rows, because the sheet has 2000 rows):
Summary
Creation date: Fri Aug 02 13:49:15 CEST 2019
Generated by: XXXX
Software: CLC Genomics Workbench 12.0
Based upon: 1 data set
XXXXXXX_S7_L001_R1_001 (paired): 5.102.482 sequences in pairs
Total sequences in data set 5.102.482 sequences
Total nucleotides in data set 558.462.117 nucleotides
Quality distribution
average PHRED score % sequences:
0 0
1 0
27 0.889841454
28 1.157475911
29 1.472773446
Per-base analysis
Coverage
base position % coverage:
0 100
1 100
2 100
147 37.30090572
148 36.1365508
149 33.95743483
150 24.3650639
151 0
Quality distribution
base position PHRED score: 5%ile PHRED score: 25%ile PHRED score: Median PHRED score: 75%ile PHRED score: 95%ile
0 0 0 0 0 0
1 18 32 32 33 34
2 18 32 33 33 34
3 18 32 33 34 34
146 15 37 38 39 39
147 15 37 38 39 39
148 15 37 38 39 39
149 15 37 38 39 39
150 15 36 38 39 39
151 13 33 37 38 39
#!/usr/bin/python3
import xlrd
kit = ('test_QC_150.xlsx')
wb = xlrd.open_workbook(kit)
sheet = wb.sheet_by_index(0)
def phred_score():
    for sheet in wb.sheets():
        for rowidx in range(sheet.nrows):
            row = sheet.row(rowidx)
            for colidx, cell in enumerate(row):
                # searching for the quality distribution
                if cell.value == "Quality distribution":
                    index_quality_distribution = rowidx
                    print('index_quality_distribution: ', index_quality_distribution)
    index = index_quality_distribution + 35
    index_end = index_quality_distribution + 67
    print(index)
    print(index_end)
def main():
    phred_score()
if __name__ == '__main__':
    main()
I think the answer is quite simple. Your code is not "wrong", you just haven't thought it through to the end:
Your for-loop runs through all the cells in the range you specified and you previously only had one cell that validated the if statement that follows:
for colidx, cell in enumerate(row):
    # searching for the quality distribution
    if cell.value == "Quality distribution":
        index_quality_distribution = rowidx
Now that there are two instances, the loop finds both, but because each match overwrites the "index_quality_distribution" variable, only the last one found is kept "in memory". Note that a plain break only leaves the innermost for-loop, so it cannot stop the outer loops on its own. What you can do is move the search into a function and return as soon as the index is found the first time:
def find_first_row(wb, name):
    for sheet in wb.sheets():
        for rowidx in range(sheet.nrows):
            row = sheet.row(rowidx)
            for colidx, cell in enumerate(row):
                # searching for the quality distribution
                if cell.value == name:
                    return rowidx  # leaves all loops at the first match
    return None  # failsafe in case no "Quality distribution" is found
index_quality_distribution = find_first_row(wb, "Quality distribution")
print('index_quality_distribution: ', index_quality_distribution)
That should do it.
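The "stop at the first match" idea can also be tried without an Excel file. The following is only a minimal sketch: the nested list `rows` is a hypothetical stand-in for the sheet, and `first_row_with` is an illustrative helper name, not part of xlrd. It uses a generator with `next()`, which stops searching as soon as one match is produced.

```python
# Hypothetical stand-in for the sheet: each inner list is one row of cell values.
rows = [
    ["Summary", ""],
    ["Quality distribution", ""],
    ["average PHRED score", "% sequences:"],
    [0, 0],
    ["Per-base analysis", ""],
    ["Quality distribution", ""],
    ["base position", "PHRED score: 5%ile"],
]

def first_row_with(rows, name):
    """Return the index of the first row containing `name`, or None if absent."""
    matches = (i for i, row in enumerate(rows) if name in row)
    # next() consumes only the first match; the default covers the no-match case.
    return next(matches, None)

first = first_row_with(rows, "Quality distribution")
print(first)
```

Because the generator is lazy, the second "Quality distribution" row is never examined once the first one has been returned.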
I want to train a binary classification ML model with some data that I have; something like this:
df
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
0 20 89 62 23 3 74
1 51 64 19 2 83 0
0 14 58 2 71 31 48
1 32 28 2 30 92 91
1 51 36 51 66 15 14
...
My target (y) depends on three characteristics from two groups. However, my data is imbalanced: counting the values of y shows that I have more zeros than ones, in a ratio of about 2.68. I correct this by looping over each row and randomly swapping the values from group 1 to group 2 and vice versa, like this:
for index, row in df.iterrows():
    choice = np.random.choice([0, 1])
    if row['y'] != choice:
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]
Doing this reduces the ratio to no more than 1.3, so I was wondering if there is a more direct approach using pandas methods.
Does anyone have an idea how to accomplish this?
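Setting aside whether swapping columns actually solves the class imbalance, I would swap the whole data set and then randomly choose, row by row, between the original and the swapped version:
# Step 1: swap the columns
# '[^(_g1)]$' is a character class: it keeps columns whose last character
# is none of ( _ g 1 ), i.e. y and the _g2 columns
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
                 df.filter(regex='_g1$')),
                axis=1)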
# Step 2: rename the columns
df1.columns = df.columns
# random choice
np.random.seed(1)
is_original = np.random.choice([True,False], size=len(df))
# concat to make new dataset
pd.concat((df[is_original],df1[~is_original]))
Output:
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
2 0 14 58 2 71 31 48
3 1 32 28 2 30 92 91
0 0 23 3 74 20 89 62
1 1 2 83 0 51 64 19
4 1 66 15 14 51 36 51
Notice that the rows with indexes 0, 1 and 4 have g1 swapped with g2.
I have a dataset with three inputs: x1, x2 and x3. I want to read just the x2 column, stepping through its data row by row.
I wrote some code, but it only shows letters.
Here is my code:
data = pd.read_csv('data6.csv')
row_num =0
x=[]
for col in data:
    if (row_num==1):
        x.append(col[0])
    row_num =+ 1
print(x)
result : x1,x2,x3
What I expected output is:
expected output x2 (read one by one row)
65
32
14
25
85
47
63
21
98
65
21
47
48
49
46
43
48
25
28
29
37
Subset of my csv file :
x1 x2 x3
6 65 78
5 32 59
5 14 547
6 25 69
7 85 57
8 47 51
9 63 26
3 21 38
2 98 24
7 65 96
1 21 85
5 47 94
9 48 15
4 49 27
3 46 96
6 43 32
5 48 10
8 25 75
5 28 20
2 29 30
7 37 96
Can anyone help me to solve this error?
If you want a list of the values in x2, use:
x = data['x2'].tolist()
I am not sure I even get what you're trying to do from your code.
What you're doing (after fixing the indentation to make it somewhat correct):
Iterate through all columns of your dataframe
Take the first character of the column name if row_num is equal to 1.
Based on this guess:
import pandas as pd
data = pd.read_csv("data6.csv")
row_num = 0
x = []
for col in data:
    if row_num == 1:
        x.append(col[0])
    row_num = +1
print(x)
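One detail in the loop above is worth spelling out: `row_num =+ 1` is parsed as `row_num = (+1)`, so it assigns 1 on every pass instead of incrementing. The short sketch below contrasts it with `+=`; the variable names are only illustrative.

```python
counter = 0
for _ in range(3):
    counter =+ 1   # parsed as counter = (+1): assigns 1 on every pass
assigned = counter

counter = 0
for _ in range(3):
    counter += 1   # augmented assignment: adds 1 on every pass
incremented = counter

print(assigned, incremented)
```

After three passes the first counter is still 1, while the second has reached 3.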
What you probably want to do:
import pandas as pd
data = pd.read_csv("data6.csv")
# Make a list containing the values in column 'x2'
x = list(data['x2'])
# Print all values at once:
print(x)
# Print one value per line:
for val in x:
    print(val)
Since you are already using pandas, you can select the column directly and convert it with list(). No for loop is needed:
import pandas as pd
data = pd.read_csv('data6.csv')
print(list(data['x2']))
Part of this assignment deals with a 1-dimensional list and a 2-dimensional list. The 2-D list has 10 rows, with 4 elements each; the 1-D list has 4 elements.
The assignment calls for copying the gamma list (see code) into the first row of the inStock list. Each row after the first then needs to be successively multiplied by 3. By successively I mean multiplying everything in the first row of inStock by three and storing those values in the second row, then taking the values stored in the second row, multiplying those by three and storing them in the third row of inStock, and so on.
I understand how to copy gamma, but I am having trouble figuring out how to compute each row from the previous one.
In particular, I am having difficulty writing a function that updates inStock successively.
This is what I have done. It copies gamma into the first row of inStock. But all the while loop does is take the values from the first row of inStock, multiply them by three, and store the result in every other row, rather than multiplying them successively.
row = 10
col = 4
gamma = [11, 13, 15, 17]
inStock = [[0] * col] * row
def copyGamma(listG, gamma):
    listG[0] = gamma.copy()
    x = 0
    while x < 9:
        x +=1
        listG[x] = [i * 3 for i in listG[0]]
    return listG
retList = copyGamma(inStock, gamma)
print(retList)
#this is the output of the above code
11 13 15 17 #this is inStock[0]
33 39 45 51 #this is inStock[1]
33 39 45 51 #this is inStock[2]
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
#This is the output i am looking for, format does not matter:
11 13 15 17 #This is inStock[0]
33 39 45 51 #This is inStock[1]
99 117 135 153 #This *should* be inStock[2]
297 351 405 459 #and so on
891 1053 1215 1377
2673 3159 3645 4131
8019 9477 10935 12393
24057 28431 32805 37179
72171 85293 98415 111537
216513 255879 295245 334611
You can use a list comprehension and the fact that each row's elements are effectively multiplied by a power of 3:
inStock = [[x * 3**i for x in gamma] for i in range(row)]
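If the assignment requires building each row from the one before it, an iterative version is also possible. This is a minimal sketch that reuses `row` and `gamma` from the question; `in_stock` and `closed_form` are illustrative names. It also checks the result against the comprehension above.

```python
row = 10
gamma = [11, 13, 15, 17]

# Start with a copy of gamma, then derive each new row from the previous one.
in_stock = [gamma.copy()]
for _ in range(row - 1):
    previous = in_stock[-1]
    in_stock.append([value * 3 for value in previous])

# The closed-form comprehension produces the same table.
closed_form = [[x * 3**i for x in gamma] for i in range(row)]

for line in in_stock:
    print(*line)
```

The key difference from the code in the question is that each new row is computed from `in_stock[-1]` (the previous row) rather than from row 0.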
I have a socket that receives 60 numbers from another computer in 6 columns and 10 rows. I ordered them with split and the output is completely right. For the first column, I want to take each number separately so that I can compute a moving average filter on them.
Code:
import socket
import numpy as np
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('192.168.0.1', 2015))
column1 = []
column2 = []
column3 = []
column4 = []
column5 = []
column6 = []
for message in range(10):
    message = sock.recv(1024)
    a1 = column1.append(message.split()[0])
    a2 = column2.append(message.split()[1])
    a3 = column3.append(message.split()[2])
    a4 = column4.append(message.split()[3])
    a5 = column5.append(message.split()[4])
    a6 = column6.append(message.split()[5])
    b1 = message.split()[0]
    b2 = message.split()[1]
    b3 = message.split()[2]
    b4 = message.split()[3]
    b5 = message.split()[4]
    b6 = message.split()[5]
    print b1
    print b2
    print b3
    print b4
    print b5
    print b6
If I only print b1, the output is 10 numbers, which I want to have separately for the next function (moving average filter). I need help separating them.
I tried a for loop over b1[i], but it gives me only the first digit of b1.
First, you want to use a list of columns:
columns = [[] for _ in range(6)]
Then you can split the message into a single list:
for message in range(10):
    message = sock.recv(1024)
    splits = message.split(None, 5) # split into six pieces at most
which you can then append, still inside the loop, to the list of lists you created before:
    for index, item in enumerate(splits):
        columns[index].append(item)
Now if you only wish to print the first of those appended numbers, do
print columns[0][0] # first item of first list
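The steps above can be tried without a live socket. The sketch below is written for Python 3 (the code above is Python 2) and assumes each datagram holds six whitespace-separated numbers; the `messages` list is a hypothetical stand-in for what `sock.recv(1024)` would return.

```python
# Hypothetical datagrams standing in for the results of sock.recv(1024).
messages = [
    b"1.0 2.0 3.0 4.0 5.0 6.0",
    b"1.5 2.5 3.5 4.5 5.5 6.5",
]

columns = [[] for _ in range(6)]
for message in messages:
    # In Python 3, recv() returns bytes, so decode before splitting.
    for index, item in enumerate(message.decode().split(None, 5)):
        # Convert to float so the values can be averaged later.
        columns[index].append(float(item))

first_column = columns[0]
print(first_column)
```

Each entry of `first_column` is then a separate number, ready to be passed one by one to a filter.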
The following should get you started. I have created some random data to stand in for the received numbers (six lines of ten values in this sample). The code splits the raw data into rows, splits each row into columns, and then transposes the result to get the data per column.
Each entry in the first column is then displayed with a moving average of the last 3 entries. A deque is used as an efficient mini queue holding the most recent entries for the moving average.
import collections
message = """89 39 59 88 46 1 87 21 2 34
59 40 68 74 29 29 26 30 93 38
84 60 44 98 41 29 8 60 61 83
36 44 56 8 50 94 99 1 30 52
5 27 53 85 67 69 38 67 69 26
92 17 4 13 74 89 30 49 44 20"""
rows = message.splitlines()
data = []
for row in rows:
    data.append(row.split())
columns = zip(*data)
total = 0
moving = collections.deque()
# Display the moving average for the first column
for entry in columns[0]:
    value = int(entry)
    moving.append(value)
    total += value
    if len(moving) > 3: # Length of moving average
        total -= moving.popleft()
    print "%3d %.1f" % (value, total/float(len(moving)))
For this data, it will display the following output:
89 89.0
59 74.0
84 77.3
36 59.7
5 41.7
92 44.3
Tested using Python 2.7
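For Python 3, a similar moving average can be sketched with `deque(maxlen=...)`, which discards the oldest entry automatically. This is a variant of the approach above rather than a direct port; `moving_average` is an illustrative helper name, and the input is the first column of the sample data.

```python
from collections import deque

def moving_average(values, window=3):
    """Return the running average over the last `window` values."""
    recent = deque(maxlen=window)  # the oldest entry drops out automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages

first_column = [89, 59, 84, 36, 5, 92]
result = [round(avg, 1) for avg in moving_average(first_column)]
print(result)  # [89.0, 74.0, 77.3, 59.7, 41.7, 44.3]
```

Recomputing `sum(recent)` on every step is fine for a short window; for long windows, keeping a running total as in the code above avoids the repeated summation.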
In index.csv, the fourth column has ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to an array of numbers in filename.csv.
The row number in filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers from filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2),len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if (row[3] == ch_row + 1):
            out = row.tolist() + data1[ch_row].tolist()
            print(out)
writer = csv.writer(open('dn.csv','w'), delimiter=',',quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)", the correct result is printed. However, when I enter "out" in the shell, only one row appears, like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0]
What I need is to store all the values in the "out" variable and write them to the dn.csv file.
This ought to do the trick for you:
Code:
from csv import reader, writer
data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")
for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses zero-based indexing, so you may have to adjust the above code (for example, by subtracting 1 from the index) to exactly fit your needs.
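To illustrate the index adjustment, here is a minimal sketch that treats the fourth column as 1-based, as in the question. The in-memory `io.StringIO` objects are hypothetical stand-ins for filename.csv and index.csv, and a comma-separated layout is assumed.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two files (comma-separated assumed).
filename_csv = io.StringIO("20,30,50\n70,60,45\n35,26,77\n93,37,68\n13,08,55\n")
index_csv = io.StringIO("0,0,0,1\n0,0,0,5\n0,0,0,3\n")

lookup = list(csv.reader(filename_csv))

merged = []
for row in csv.reader(index_csv):
    # The index column is 1-based, while Python lists are 0-based.
    position = int(row[3]) - 1
    merged.append(row + lookup[position])

# Write all merged rows at once; a real file opened with newline='' works the same way.
output = io.StringIO()
csv.writer(output).writerows(merged)
print(output.getvalue())
```

Collecting the rows in `merged` before writing keeps every result, instead of only the last one.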
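import csv
with open('dn.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt yields floats; indices must be integers
        out = [idx] + list(data1[idx - 1])  # the index column is 1-based
        writer.writerow(out)  # write inside the loop so every row is saved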