Getting rid of characters in CSV file to get mean of columns - python

I asked for help a while ago and thought I had what I was looking for, but unfortunately I ran into another problem. In my CSV file I have ?'s in place of missing data in some rows of the 13 columns. I have an idea of how to fix it but have yet to be successful in implementing it. My current idea is to use ord and chr to change the ? to 0, but I am not sure how to apply that to a list. This is the error I get:
File "C:\Users\David\Documents\Python\asdf.py", line 46, in <module>
iList_sum[i] += float(ill_data[i])
ValueError: could not convert string to float: '?'
Just so you know, I cannot use NumPy or pandas. I am also trying to refrain from using mapping, since I am trying to keep the code very simple.
import csv
#turn csv files into a list of lists
with open('train.csv','rU') as csvfile:
    reader = csv.reader(csvfile)
    csv_data = list(reader)
# Create two lists to handle the patients
# And two more lists to collect the 'sum' of the columns
# The one that needs to hold the sum 'must' have 0 so we
# can work with them more easily
iList = []
iList_sum = [0,0,0,0,0,0,0,0,0,0,0,0,0]
hList = []
hList_sum = [0,0,0,0,0,0,0,0,0,0,0,0,0]
# Only use one loop to make the process mega faster
for row in csv_data:
    # If row 13 is greater than 0, then place them as unhealthy
    if (row and int(row[13]) > 0):
        # This appends the whole 'line'/'row' for storing :)
        # That's what you want (instead of saving only one cell at a time)
        iList.append(row)
    # If it failed the initial condition (greater than 0), then row 13
    # is either less than or equal to 0. That's simply the logical outcome
    else:
        hList.append(row)
# Use these to verify the data and make sure we collected the right thing
# print iList
# [['67', '1', '4', '160', '286', '0', '2', '108', '1', '1.5', '2', '3', '3', '2'], ['67', '1', '4', '120', '229', '0', '2', '129', '1', '2.6', '2', '2', '7', '1']]
# print hList
# [['63', '1', '1', '145', '233', '1', '2', '150', '0', '2.3', '3', '0', '6', '0'], ['37', '1', '3', '130', '250', '0', '0', '187', '0', '3.5', '3', '0', '3', '0']]
# We can use list comprehension, but since this is a beginner task, let's go with basics:
# Loop through all the 'rows' of the ill patient
for ill_data in iList:
    # Loop through the data within each row, and sum them up
    for i in range(0,len(ill_data) - 1):
        iList_sum[i] += float(ill_data[i])
# Now repeat the process for healthy patient
# Loop through all the 'rows' of the healthy patient
for healthy_data in hList:
    # Loop through the data within each row, and sum them up
    for i in range(0,len(healthy_data) - 1):
        hList_sum[i] += float(healthy_data[i])
# Using list comprehension, I basically go through each number
# In ill list (sum of all columns), and divide it by the length of iList that
# I found from the csv file. So, if there are 22 ill patients, then len(iList) will
# be 22. You can see that the whole thing is wrapped in brackets, so it would show
# as a python list
ill_avg = [ ill / len(iList) for ill in iList_sum]
hlt_avg = [ hlt / len(hList) for hlt in hList_sum]
Here is a screenshot of the CSV file.

Simply check the value you get from the list:
# Loop through the data within each row, and sum them up
qmark_counter = 0
for i in range(0,len(ill_data) - 1):
    if ill_data[i] == '?':
        val = 0
        qmark_counter += 1
    else:
        val = ill_data[i]
    iList_sum[i] += float(val)
And so on for the other ones. There are many other improvements that could be done; for instance, I would put the snippet of code in a function so that it does not have to be repeated multiple times.
EDIT: added the counter for question marks. If you want to keep track of question marks separately for each list, you may want to use a dictionary.
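As a rough sketch of the "put it in a function" idea mentioned above (this is not the original poster's code): the helper below, with the made-up name column_means, sums each column in plain Python and counts the '?' cells per column in a dictionary. Note that it skips missing cells when averaging instead of counting them as 0, which is my assumption about what a mean over missing data should do. The default n_cols=13 and the sample rows come from the question, with one '?' inserted for illustration.

```python
def column_means(rows, n_cols=13, missing='?'):
    """Return (means, missing_counts) for the first n_cols columns of rows."""
    sums = [0.0] * n_cols
    counts = [0] * n_cols      # how many real values were seen per column
    missing_counts = {}        # column index -> number of missing cells
    for row in rows:
        for i in range(n_cols):
            if row[i] == missing:
                missing_counts[i] = missing_counts.get(i, 0) + 1
            else:
                sums[i] += float(row[i])
                counts[i] += 1
    # A column with no real values at all gets None instead of a mean
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return means, missing_counts


# Sample rows in the question's format; the '?' is inserted for illustration
ill_rows = [
    ['67', '1', '4', '160', '286', '0', '2', '108', '1', '1.5', '2', '3', '3', '2'],
    ['67', '1', '4', '120', '229', '0', '2', '129', '1', '2.6', '2', '?', '7', '1'],
]
ill_avg, ill_missing = column_means(ill_rows)
print(ill_avg)
print(ill_missing)  # {11: 1}
```

The same function can then be called once for iList and once for hList, so the loop is not repeated.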

Related

Parse a list into a list of lists in python

I am trying to figure out how to parse a list into a list of lists.
tileElements = browser.find_element(By.CLASS_NAME, 'tile-container')
tileHTML = (str(tileElements.get_attribute('innerHTML')))
tileNUMS = re.findall(r'\d+',tileHTML)
NumTiles = int(len(tileNUMS)/4)
#parse out list, each 4 list items are one tile
print(str(tileNUMS))
print(str(NumTiles))
TileList = [[i+j for i in range(len(tileNUMS))]for j in range (NumTiles)]
print(str(TileList))
The first part of this code works fine and gives me a list of Tile Numbers:
['2', '3', '1', '2', '2', '4', '4', '2']
However, what I need is a list of lists made out of this and that is where I am getting stuck.
The list of lists should be 4 elements long and look like this:
[['2', '3', '1', '2'] , ['2', '4', '4', '2']]
It should be able to do this for as many tiles as there are in the game (up to 19 I believe). It would be really nice if when the middle numbers are repeated that the two outside numbers are replaced with the latest value from the source list.
You can use a list comprehension to get slices from the list like so.
elements = ['2', '3', '1', '2', '2', '4', '4', '2']
size = 4
result = [elements[i:i+size] for i in range(0, len(elements), size)]
(By the way, there's no need to cast things into str to print them, and tileHTML is probably already a string, too.)
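For reference, here is the slicing idea wrapped in a small function, as a sketch only. The name chunk and the sample HTML string are made up for illustration (the real innerHTML from the page is not shown in the question); the regex uses a raw string so that \d is not treated as a string escape.

```python
import re


def chunk(items, size):
    """Split items into consecutive lists of length size (the last may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Made-up stand-in for the innerHTML of the tile container
tile_html = ('<div class="tile tile-2 tile-position-3-1">2</div>'
             '<div class="tile tile-2 tile-position-4-4">2</div>')

tile_nums = re.findall(r'\d+', tile_html)
tile_list = chunk(tile_nums, 4)
print(tile_list)  # [['2', '3', '1', '2'], ['2', '4', '4', '2']]
```

If the number of items is not a multiple of size, the final sublist is simply shorter, so it may be worth checking len(tile_nums) % 4 == 0 first.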

Pairing elements of list of lists and storing in tuple form

I have a file, say file1.txt, which has multiple rows and columns. I want to read it and store it as a list of lists. Then I want to pair the rows, with the rule that a row can never be paired with an identical row. The second-to-last column represents the class. Below is my file:
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
Here all the 6 rows are class 1. I am using below logic to do this pairing part.
from operator import itemgetter
rule_file_name = 'file1.txt'
rule_fp = open(rule_file_name)
list1 = []
for line in rule_fp.readlines():
    list1.append(line.replace("\n","").split(","))
list1=sorted(list1,key=itemgetter(-1),reverse=True)
length = len(list1)
middle_index = length // 2
first_half = list1[:middle_index]
second_half = list1[middle_index:]
result=[]
result=list(zip(first_half,second_half))
for a,b in result:
    if a==b:
        result.remove((a, b))
print(result)
print("-------------------")
It works absolutely fine when I have only one class. But if my file has multiple classes, I want the pairing to be done within the same class only. For example, if my file looks like the one below (say file2):
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
Then I want to make 3 pairs from class 1, 2 from class 2 and 1 from class 3. I am using this logic to build a dictionary whose keys are the classes.
d = {}
sorted_grouped = []
for row in list1:
    # Add name to dict if not exists
    if row[-2] not in d:
        d[row[-2]] = []
    # Add all non-Name attributes as a new list
    d[row[-2]].append(row)
#print(d.items())
for k,v in d.items():
    sorted_grouped.append(v)
#print(sorted_grouped)
gp_vals = {}
for i in sorted_grouped:
    gp_vals[i[0][-2]] = i
print(gp_vals)
How can I do this? Please help!
My desired output for file2 is:
[([43,44,45,46,1,0.92], [39,40,41,42,1,0.82]), ([43,44,45,46,1,0.92], [27,28,29,30,1,0.67]), ([31,32,33,34,1,0.84], [35,36,37,38,1,0.45])]
[([55,56,57,58,2,0.77], [59,60,61,62,2,0.39]), ([63,64,65,66,2,0.41], [51,52,53,54,2,0.28])]
[([90,91,92,93,3,0.97], [75,76,77,78,3,0.51])]
Edit1:
All the files will have an even number of rows, and every class will have an even number of rows as well.
For a particular class (say class 2), if there are n rows then there can be at most n/2 identical rows for that class in the dataset.
My primary intention was to get random pairing while making sure no self-pairing is allowed. For that I thought of taking the row with the highest fitness value (the last column) inside a class, picking any other row from that class at random, and making a pair as long as the two rows are not exactly the same. The same thing is repeated for every class separately.
First read in the data from the file. I'd use assert here to communicate your assumptions to people who read the code (including future you) and to confirm that the assumption actually holds for the file. If it does not, an AssertionError is raised.
rule_file_name = 'file2.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp.readlines():
        list1.append(line.replace("\n","").split(","))
assert len(list1) & 1 == 0 # confirm length is even
Then use a defaultdict to store the lists for each class.
from collections import defaultdict
classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)
Then use sample to draw pairs and confirm they aren't the same. Here I'm including a seed to make the results reproducible but you can take that out for randomness.
from random import sample, seed
seed(1) # remove this line when you want actual randomness
for key, _list in classes.items():
    assert len(_list) & 1 == 0 # each class must also be even, else there is an error in the data
    _list.sort(key=lambda x: x[5])
    pairs = []
    while _list:
        first = _list[-1]
        candidate = sample(_list, 1)[0]
        if first != candidate:
            print(f'first {first}, candidate{candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs
Note that this way of sampling (described in the question's edit) implicitly assumes that any duplicate rows are the ones with the highest fitness values. If that is not true, the while loop can run forever.
If you want to print them then iterate over the dictionary again:
for key, pairs in classes.items():
    print(key, pairs)
which for me gives:
1 [(['43', '44', '45', '46', '1', '0.92'], ['27', '28', '29', '30', '1', '0.67']), (['43', '44', '45', '46', '1', '0.92'], ['31', '32', '33', '34', '1', '0.84']), (['39', '40', '41', '42', '1', '0.82'], ['35', '36', '37', '38', '1', '0.45'])]
2 [(['55', '56', '57', '58', '2', '0.77'], ['51', '52', '53', '54', '2', '0.28']), (['63', '64', '65', '66', '2', '0.41'], ['59', '60', '61', '62', '2', '0.39'])]
3 [(['90', '91', '92', '93', '3', '0.97'], ['75', '76', '77', '78', '3', '0.51'])]
This uses the following values for file2.txt; the leading numbers are row numbers and not part of the actual file.
1 27,28,29,30,1,0.67
2 31,32,33,34,1,0.84
3 35,36,37,38,1,0.45
4 39,40,41,42,1,0.82
5 43,44,45,46,1,0.92
6 43,44,45,46,1,0.92
7 51,52,53,54,2,0.28
8 55,56,57,58,2,0.77
9 59,60,61,62,2,0.39
10 63,64,65,66,2,0.41
11 75,76,77,78,3,0.51
12 90,91,92,93,3,0.97
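The infinite-loop risk noted above can be sidestepped by only ever sampling from rows that differ from the current one. The following is a sketch under my own assumptions, not the answer's code: pair_within_classes and its parameters are made-up names, rows are sorted by the fitness column converted to float (the code above sorts the raw strings), and when no non-identical partner is left it raises ValueError instead of looping forever.

```python
import random
from collections import defaultdict


def pair_within_classes(rows, class_col=4, fitness_col=5, rng=None):
    """Pair rows inside each class, never pairing a row with an identical row."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_col]].append(row)

    result = {}
    for key, members in by_class.items():
        # Ascending by fitness, so pop() takes the highest remaining row
        remaining = sorted(members, key=lambda r: float(r[fitness_col]))
        pairs = []
        while remaining:
            first = remaining.pop()
            candidates = [r for r in remaining if r != first]
            if not candidates:
                raise ValueError(f'class {key}: no non-identical partner for {first}')
            partner = rng.choice(candidates)
            remaining.remove(partner)
            pairs.append((first, partner))
        result[key] = pairs
    return result


# The file2 data from the question, inlined so the sketch runs without a file
file2_text = """27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97"""

rows = [line.split(',') for line in file2_text.splitlines()]
paired = pair_within_classes(rows, rng=random.Random(1))  # seeded for repeatability
for key, pairs in paired.items():
    print(key, pairs)
```

This greedy scheme can still fail on data where the only rows left in a class are identical to each other; in that case it stops with an error rather than spinning.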

Use index of first and second repeated index in list

There are lots of similar posts out there, but I could not find something that directly matched, or resulted in a solution to, the issue I am dealing with.
I want to use the second instance of a repeated index contained in a list as the index of another list. When the function is executed, I want all numbers from the start of the list up to the first '*' to print after Code1, all numbers between the first '*' and the second '*' to print after Code2, and then all numbers following the second '*' until the end of the list to print after Code3. Example data for digit would be "['1', '2', '3', '4', '5', '*', '6', '*', '7', '8', '9', '10', '1']".
In other words, I want the code below to print, assuming those digits exist, User Code: 12345, Pass Code: 6, Pin Code: 789101, all on one line.
print_string += 'User Code: {} '.format(''.join(str(dig) for dig in digit[:digit.index('*')])) + \
'Pass Code: {} '.format(''.join(str(dig) for dig in digit[digit.index('*'):digit.index('*')])) + \
'Pin Code: {} '.format(''.join(str(dig) for dig in digit[digit.index('*'):]))
print(print_string)
Essentially, I would like to call the first asterisk as the right index for User Code, the first asterisk as the left index and the second asterisk as the right index for Pass Code, and the second asterisk as the left index for Pin Code.
I just cannot figure out how to make it look for sequential asterisks. If there is a simpler way to execute this, please let me know!
Given,
L = ['1', '2', '3', '4', '5', '*', '6', '*', '7', '8', '9', '10', '1']
Then
''.join(L)
will form a string
'12345*6*789101'
which you can split into the three parts
parts = ''.join(L).split('*')
and then pull out what you need
user_code = parts[0]
pass_code = parts[1]
pin = parts[2]
If what you actually have is the digits in list-like form inside a string,
"['1', '2', '3', '4', '5', '*', '6', '*', '7', '8', '9', '10', '1']"
it might be worth converting them to a real list first; then you can use the join/split method above.
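If you would rather keep the slicing approach from the question, list.index accepts an optional start position, which is one way to find the second asterisk. A minimal sketch, assuming the separators are plain '*' strings as in the question's own digit.index('*') calls:

```python
digit = ['1', '2', '3', '4', '5', '*', '6', '*', '7', '8', '9', '10', '1']

first = digit.index('*')               # position of the first asterisk
second = digit.index('*', first + 1)   # search again, starting just past the first

user_code = ''.join(digit[:first])
pass_code = ''.join(digit[first + 1:second])
pin_code = ''.join(digit[second + 1:])

print_string = 'User Code: {} Pass Code: {} Pin Code: {}'.format(
    user_code, pass_code, pin_code)
print(print_string)  # User Code: 12345 Pass Code: 6 Pin Code: 789101
```

Both index calls raise ValueError if an asterisk is missing, so input with fewer than two separators needs a try/except around them.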

How to remove the first occurrence of an integer in a list

this is my code:
positions = []
for i in lines[2]:
    if i not in positions:
        positions.append(i)
print (positions)
print (lines[1])
print (lines[2])
the output is:
['1', '2', '3', '4', '5']
['is', 'the', 'time', 'this', 'ends']
['1', '2', '3', '4', '1', '5']
I want the variable "positions" to end up as ['2','3','4','1','5'],
so instead of dropping the second duplicate found in "lines[2]" it should drop the first one.
You can reverse your list, create the positions and then reverse it back, as mentioned by @tobias_k in the comments:
lst = ['1', '2', '3', '4', '1', '5']
positions = []
for i in reversed(lst):
    if i not in positions:
        positions.append(i)
list(reversed(positions))
# ['2', '3', '4', '1', '5']
You'll need to first detect what values are duplicated before you can build positions. Use a collections.Counter() object to test if a value has been seen more than once:
from collections import Counter
counts = Counter(lines[2])
positions = []
for i in lines[2]:
    counts[i] -= 1
    if counts[i] == 0:
        # only add if this is the 'last' value
        positions.append(i)
This'll work for any number of repetitions of values; only the last value to appear is ever used.
You could also reverse the list, and track what you have already seen with a set, which is faster than testing against the list:
positions = []
seen = set()
for i in reversed(lines[2]):
    if i not in seen:
        # only add if this is the first time we see the value
        positions.append(i)
        seen.add(i)
positions = positions[::-1] # reverse the output list
Both approaches require two iterations; the first to create the counts mapping, the second to reverse the output list. Which is faster will depend on the size of lines[2] and the number of duplicates in it, and whether or not you are using Python 3 (where Counter performance was significantly improved).
You can use a dictionary to save the last position of each element and then build a new list from that information:
>>> data=['1', '2', '3', '4', '1', '5']
>>> temp={ e:i for i,e in enumerate(data) }
>>> sorted(temp, key=lambda x:temp[x])
['2', '3', '4', '1', '5']
>>>
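A shorter variant of the same reverse-and-deduplicate idea, offered as a sketch: dict.fromkeys keeps only the first time each key appears, so applying it to the reversed list keeps the last occurrence of each value. This relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on; the function name keep_last_occurrences is made up for illustration.

```python
def keep_last_occurrences(items):
    """Drop earlier duplicates, keeping each value at its last position."""
    # Walking the list backwards, the first sighting of a value is its last occurrence
    deduped_reversed = dict.fromkeys(reversed(items))
    # Restore the original left-to-right order
    return list(deduped_reversed)[::-1]


positions = keep_last_occurrences(['1', '2', '3', '4', '1', '5'])
print(positions)  # ['2', '3', '4', '1', '5']
```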

How do I clear this list at the end of every loop?

I am trying to find the maximum value for different subsets of a list.
def max_value(filename):
    CHR=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X']
    SNP = [ ]
    chr_max=[ ]
    for n in CHR:
        for r in reader:
            if r[1]==n:
                SNP.append(r[2]) #append values into empty list SNP
        SNP = [try_int(x) for x in SNP] #convert to integers
        max_val=max(SNP) #find the maximum value
        chr_max.append((n, max_val)) #append this maximum to a new list
        del SNP[:] #clear the list and loop for next item in CHR list
    return chr_max
I keep getting
ValueError: max() arg is an empty sequence
When I remove the del SNP[:] step I get output, but it returns the max value for n='1' every time (since that is the maximum value overall, it gets returned for all 20 loops if I do not clear the list).
How do I clear the SNP list at the end of each loop, so I can find the maximum value for different subsets of the list?
You need to reverse the reader and CHR loops so you only loop reader once:
SNPs = {}
for r in reader:
    for n in CHR:
        if r[1]==n:
            SNPs.setdefault(n, []).append(r[2]) #append values into empty list SNP
for n in CHR:
    SNP = SNPs[n]
    # I didn't change anything below here..
    SNP = [try_int(x) for x in SNP] #convert to integers
    max_val=max(SNP) #find the maximum value
    chr_max.append((n, max_val)) #append this maximum to a new list
Note you can also use
from collections import defaultdict
SNPs = defaultdict(list)
and change the append to:
SNPs[n].append(r[2])
If reader is a file object or csv.reader() object, you cannot loop over it multiple times and expect it to start from the beginning again.
A file object would need to be rewound to the start with reader.seek(0), for example.
As a consequence, the second time your code reaches the for r in reader: loop, the loop terminates immediately without executing any iterations, no new elements are added to SNP and it remains empty.
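The exhaustion is easy to reproduce in isolation. In this sketch an io.StringIO with made-up rows stands in for the real file (whose name and contents are not shown in the question):

```python
import csv
import io

# Made-up sample data standing in for the real file
handle = io.StringIO("rs1,1,100\nrs2,2,250\n")
reader = csv.reader(handle)

first_pass = list(reader)    # consumes every row
second_pass = list(reader)   # the reader is exhausted, so this is empty

handle.seek(0)               # rewind the underlying file object
reader = csv.reader(handle)  # and build a fresh reader on top of it
third_pass = list(reader)

print(len(first_pass), len(second_pass), len(third_pass))  # 2 0 2
```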
You could just sort the input from the reader iterable into a dictionary instead of looping over it repeatedly:
CHR=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X']
values = {c: [] for c in CHR}
for row in reader:
    if row[1] in values:
        values[row[1]].append(try_int(row[2]))
return [max(values[c]) for c in CHR if values[c]]
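Putting the single-pass idea into a complete function might look like the sketch below. Several things here are assumptions: try_int is not shown in the question, so a hypothetical stand-in is defined that returns None for non-numeric text; max_per_chromosome is a made-up name; and the function returns (chromosome, max) tuples like the question's chr_max rather than bare maxima.

```python
def try_int(text):
    """Hypothetical stand-in for the question's try_int helper (not shown there)."""
    try:
        return int(text)
    except ValueError:
        return None


def max_per_chromosome(rows, chromosomes):
    """Return [(chromosome, max value)] in one pass over rows."""
    best = {}
    for row in rows:
        value = try_int(row[2])
        if row[1] in chromosomes and value is not None:
            # Keep only the running maximum, so no list needs clearing
            if row[1] not in best or value > best[row[1]]:
                best[row[1]] = value
    # Chromosomes with no usable values are left out
    return [(c, best[c]) for c in chromosomes if c in best]


# Made-up rows: column 1 is the chromosome, column 2 the value
sample_rows = [
    ['rs1', '1', '100'],
    ['rs2', '1', '250'],
    ['rs3', 'X', '40'],
    ['rs4', '2', 'n/a'],
    ['rs5', '7', '5'],
]
chr_max = max_per_chromosome(sample_rows, ['1', '2', 'X'])
print(chr_max)  # [('1', 250), ('X', 40)]
```

Because only the running maximum is stored per chromosome, nothing has to be cleared between iterations and max() is never called on an empty sequence.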
