Pulling info by greater than statement in lists (Python) - python

I used the csv module to create lists from a data file. It looks something like this now:
['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
'0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0']
['unitig_5\t.\tregion\t2201\t2300\t0.00\t+\t.\tcov2=10.860',
'1.217;gaps=0', '0;cov=8', '11', '13;cQv=20', '20', '20;del=0;ins=0;sub=0']
I need to pull out the lists whose cov2= value (part of the first column above) is greater than some specified integer (say 140) and put them into a new file. In that case the two lists above wouldn't be accepted.
How would I set it up to check which lists meet this qualification and put those lists to a new file?

You can use regex :
>>> l=['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
... '0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0']
>>> import re
>>> float(re.search(r'cov2=([\d.]+)',l[0]).group(1))
3.0
The pattern r'cov2=([\d.]+)' will match any combination of digits (\d) and dots of length 1 or more. You can then convert the result to float and compare:
>>> var=float(re.search(r'cov2=([\d.]+)',l[0]).group(1))
>>> var>140
False
Also, since it's possible that the regex doesn't match, you can use a try-except to handle the exception:
try:
    var = float(re.search(r'cov2=([\d.]+)', l[0]).group(1))
    print(var > 140)
except AttributeError:
    # re.search returned None, so there is no cov2= value in this row
    print('the_error_message')

I would first split the first string on tabs ("\t"), which seem to separate the fields.
Then, if cov2 is always the last field, further parsing is easy: cut off "cov2=", convert the remainder to float, and compare.
If it is not necessarily the last field, a simple search for the field that starts with "cov2=" should be sufficient.
Of course, complexity can grow indefinitely if error checking or a more tolerant search is required.
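A minimal sketch of the tab-splitting approach described above, in Python 3. The helper names cov2_value and keep_rows are made up for illustration, and the second sample row (cov2=150.500) is invented so that something passes the filter. A malformed value such as cov2=abc would still raise ValueError here.

```python
def cov2_value(first_field):
    """Return the cov2 value from a tab-separated field string, or None."""
    for part in first_field.split("\t"):
        if part.startswith("cov2="):
            return float(part[len("cov2="):])
    return None


def keep_rows(rows, threshold=140):
    """Keep the rows whose cov2 value is strictly greater than threshold."""
    kept = []
    for row in rows:
        value = cov2_value(row[0])
        if value is not None and value > threshold:
            kept.append(row)
    return kept


rows = [
    ['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0'],
    # hypothetical row, not from the question's data
    ['unitig_5\t.\tregion\t501\t600\t0.00\t+\t.\tcov2=150.500', '0.000;gaps=0'],
]
print(len(keep_rows(rows)))
```

Searching for the field by prefix rather than by position means it keeps working if cov2 is not the last field.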

lst = [ ['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
'0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0'],
['unitig_5\t.\tregion\t2201\t2300\t0.00\t+\t.\tcov2=10.860',
'1.217;gaps=0', '0;cov=8', '11', '13;cQv=20', '20', '20;del=0;ins=0;sub=0'], ]
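import re
# keep only the lists whose cov2 value (at the end of the first element) exceeds 140
filtered_list = [l for l in lst if float(re.search(r'cov2=([\d.]+)$', l[0]).group(1)) > 140]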

You could extract the float value using rsplit if all the first elements contain the substring:
for row in list_of_rows:
    if float(row[0].rsplit("=", 1)[1]) > 140:
        pass  # write the row here
If you don't actually need every row, you should filter when you first read the file, writing as you go:
import csv

with open("input.csv") as f, open("output.csv", "w") as out:
    r = csv.reader(f)
    wr = csv.writer(out)
    for row in r:
        if float(row[0].rsplit("=", 1)[1]) > 140:
            wr.writerow(row)  # writerow, not writerows: row is a single record

Related

How to get two values in a list within a list?

I was trying to come up with a function that would read a .csv file, from which I could get, for example, students' test grades. Example below:
NOME,G1,G2
Paulo,5.0,7.2
Pedro,6,4.1
Ana,3.3,2.3
Thereza,5,6.5
Roberto,7,5.2
Matheus,6.3,6.1
I managed to split the lines on the commas, but I end up with something like a matrix:
[['NOME', 'G1', 'G2'], ['Paulo', '5.0', '7.2'], ['Pedro', '6', '4.1'], ['Ana', '3.3', '2.3'], ['Thereza', '5', '6.5'], ['Roberto', '7', '5.2'], ['Matheus', '6.3', '6.1']]
How do I go from one list to the other and manage to get the grades within them?
This is the code I got so far:
def leArquivo(arquivo):
    arq = open(arquivo, 'r')
    conteudo = arq.read()
    arq.close()
    return conteudo

def separaLinhas(conteudo):
    conteudo = conteudo.split('\n')
    conteudo1 = []
    for i in conteudo:
        conteudo1.append(i.split(','))
    return conteudo1
Where do I go from here?
A simple for will do it, i.e.:
notas = [['NOME', 'G1', 'G2'], ['Paulo', '5.0', '7.2'], ['Pedro', '6', '4.1'], ['Ana', '3.3', '2.3'], ['Thereza', '5', '6.5'], ['Roberto', '7', '5.2'], ['Matheus', '6.3', '6.1']]
for nota in notas[1:]:  ## [1:] skip the first item
    nome = nota[0]
    g1 = nota[1]
    g2 = nota[2]
    print("NOME:{} | G1: {} | G2: {}".format(nome, g1, g2))
PS: You may want to cast g1 and g2 to a float - float(nota[1])- if you need to perform math operations.
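To illustrate the float cast mentioned in the PS, here is a small sketch that computes each student's average from the same list of lists. The function name medias is just an assumption for the example, and only the first rows of the question's data are repeated.

```python
notas = [
    ['NOME', 'G1', 'G2'],
    ['Paulo', '5.0', '7.2'],
    ['Pedro', '6', '4.1'],
    ['Ana', '3.3', '2.3'],
]


def medias(notas):
    """Map each student name to the average of G1 and G2, skipping the header row."""
    return {nome: (float(g1) + float(g2)) / 2 for nome, g1, g2 in notas[1:]}


for nome, media in medias(notas).items():
    print("{}: {:.2f}".format(nome, media))
```

Unpacking each row as nome, g1, g2 assumes every row has exactly three columns; a blank trailing line in the file would break it.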
Since you're working with a csv file, you may want to look at the csv module in Python. That module has many convenient options and forms in which the data is read. Following is an example of csv.DictReader reading and usage,
import csv

# Read the data
with open('data.csv') as f:
    reader = csv.DictReader(f)
    data = [row for row in reader]

# Print it
for row in data:
    print(' '.join(['Nome:', row['NOME'], 'G1:', row['G1'], 'G2:', row['G2']]))

# Print only names and G2 grades as a table
print('- ' * 10)
print('NOME\t' + 'G2')
for row in data:
    print(row['NOME'] + '\t' + row['G2'])

# Average of G1 and G2 for each student
print('- ' * 10)
print('NOME\t' + 'Average')
for row in data:
    gpa = (float(row['G1']) + float(row['G2'])) / 2.0
    print(row['NOME'] + '\t' + str(gpa))
Here the data is read as a list of dictionaries - each element in the list is a dictionary representing a single row of your dataset. The dictionary keys are names of your headers (NOME, G1) and values are the corresponding values for that row.
That particular form can be useful in some situations. In the first part of the program the data is printed with keys and values, one row per line. Note that plain dictionaries were unordered before Python 3.7, so to guarantee a specific printing order we traverse the dictionary "manually", key by key. I used join simply to demonstrate an alternative to format (which is actually more powerful) or to just typing everything with spaces in between. The second usage example prints the names and the second grade as a table with proper headers. The third calculates the average and prints it as a table.
For me this approach proved very useful when dealing with datasets of several thousand entries that have many columns (headers) that I want to study separately (so I don't mind them not being in order). To get an ordered dictionary you can use OrderedDict or consider other available data structures. I also use Python 2.7, but since you tagged the question as 3.X, the links point to the 3.X documentation.

Python regular expression to split parameterized text file

I'm trying to split a file that repeatedly contains entries in 'string = float' format.
This is what the file looks like:
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23
...
I want to put them into a dictionary, like this:
a={name1:32, name2:4, name3:2, name4:5, name5:2e+23}
I'm new to regular expressions and am having a hard time figuring out what to do.
After some googling, I tried the following to remove the "+" character and whitespace:
p=re.compile('[^+\s]+')
splitted_list=p.findall(lineof_file)
But this gave me two problems:
1. When there is no space between the name and the "=" sign, it doesn't split.
2. For numbers like 2e+23, it splits at the + sign in the middle.
I managed to parse the file as I wanted after some modification of depperm's code.
But I'm facing another problem.
To better explain it, below is what my file can look like.
After the + sign, multiple parameter and value pairs can appear, each joined by an '=' sign.
The parameter name can contain letters and digits in any position. The value can also contain a +/- sign and scientific notation (E/e with +/-). Sometimes the value is a math expression, in which case it is single quoted.
+ abc2dfg3 = -2.3534E-03 dfe4c3= 2.000
+ abcdefg= '1.00232e-1*x' * bdfd=1e-3
I managed to parse the above using the below regex.
re.findall("(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')",eachline)
But now my problem is that sometimes, as in "* bdfd=1e-3", there is a comment. Anything after an * (asterisk) in my file should be treated as a comment, unless the * is inside a single-quoted string.
With the above regex, "bdfd=1e-3" is parsed as well, but I want it not to be parsed.
I have tried for hours but couldn't find a solution so far.
I would suggest just grabbing the name and the value instead of worrying about the spaces or unwanted characters. I'd use this regex: (name\d+)\s*=\s*([\de+]+) which captures the name and then the number, even if the number contains an e or a +, regardless of spaces around the = sign.
import re
p = re.compile(r'(name\d+)\s*=\s*([\de+]+)')
a = {}
with open("file.txt", "r") as ins:
    for line in ins:
        splitted_list = p.findall(line)
        # splitted_list looks like: [('name1', '32'), ('name2', '4')]
        for group in splitted_list:
            a[group[0]] = group[1]
print(a)
#{'name1': '32', 'name2': '4', 'name3': '2', 'name4': '5', 'name5': '2e+23'}
You don't need a regular expression to accomplish your goal. You can use built-in Python string methods.
your_dictionary = {}
# Read the file
with open('file.txt', 'r') as fin:
    lines = fin.readlines()
# Iterate over each line
for line in lines:
    # Note: this only handles one name = value pair per line
    key, sep, value = line.lstrip('+').partition('=')
    if sep:
        your_dictionary[key.strip()] = value.strip()
print(your_dictionary)
Hope it helps!
You can combine regex with string splitting:
Create the file:
t ="""
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23"""
fn = "t.txt"
with open(fn,"w") as f:
    f.write(t)
Split the file:
import re
d = {}
with open(fn,"r") as f:
    for line in f:                                    # process each line
        g = re.findall(r'(\w+ ?= ?[^ ]*)',line)       # find all name = something
        for hit in g:                                 # something != space
            hit = hit.strip()                         # remove spaces
            if hit:
                key, val = hit.split("=")             # split and strip and convert
                d[key.rstrip()] = float(val.strip())  # put into dict
print(d)
Output:
{'name4': 5.0, 'name5': 2e+23, 'name2': 4.0, 'name3': 2.0, 'name1': 32.0}
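None of the answers above covers the follow-up in the question, where anything after an * is a comment unless the * sits inside a single-quoted value. One possible way to handle it, sketched under the assumption that quotes are never nested or escaped: cut the comment off first with a small scanner, then apply a name = value regex to what is left. The names strip_comment, parse_line and PAIR are made up for this example, and the regex is a tightened variant of the one in the question, not the asker's exact pattern.

```python
import re

# name = value, where value is a single-quoted expression or a number
# with an optional sign and an optional exponent
PAIR = re.compile(r"(\w+)\s*=\s*('[^']*'|[+-]?[\d.]+(?:[Ee][+-]?\d+)?)")


def strip_comment(line):
    """Drop everything from the first '*' that is not inside single quotes."""
    in_quote = False
    for i, ch in enumerate(line):
        if ch == "'":
            in_quote = not in_quote
        elif ch == "*" and not in_quote:
            return line[:i]
    return line


def parse_line(line):
    """Return the name/value pairs of one line as a dict of strings."""
    return dict(PAIR.findall(strip_comment(line)))


print(parse_line("+ abc2dfg3 = -2.3534E-03 dfe4c3= 2.000"))
print(parse_line("+ abcdefg= '1.00232e-1*x' * bdfd=1e-3"))
```

The values are kept as strings (quoted expressions keep their quotes), so converting the plain numbers with float() is left to the caller.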

How to read files with one key but multiple values in Python

I am just beginning to work with Python and am a little confused. I understand the basic idea of a dictionary as (key, value). I am writing a program and want to read in a file, store it in a dictionary, and then run different functions by referencing the values. I am not sure if I should use a dictionary or lists. The basic layout of the file is:
Name followed by 12 different years for example :
A 12 12 01 11 0 0 2 3 4 9 12 9
I am not sure what the best way to read in this information would be. I was thinking that a dictionary may be helpful if I had Name followed by Years, but I am not sure if I can map 12 years to one key name. I am really confused on how to do this. I can read in the file line by line, but not within the dictionary.
def readInFile():
    fileDict ={"Name ": "Years"}
    with open("names.txt", "r") as f:
        _ = next(f)
        for line in f:
            if line[1] in fileDict:
                fileDict[line[0]].append(line[1])
            else:
                fileDict[line[0]] = [line[1]]
My thinking with this code was to append each year to the value.
Please let me know if you have any recommendations.
Thank you!
You can do it in one line :)
print({line[0]:line[1:].split() for line in open('file.txt','r') if line[0]!='\n'})
output:
{'A': ['12', '12', '01', '11', '0', '0', '2', '3', '4', '9', '12', '9']}
The dict comprehension above is the same as:
dict_1={}
for line in open('file.txt', 'r'):
    if line[0]!='\n':
        dict_1[line[0]]=line[1:].split()
print(dict_1)
You can map 12 years to one key name. You seem to think that you need to choose between a dictionary and a list ("I am not sure if I should use a dictionary or lists.") But those are not alternatives. Your 12 years can usefully be represented as a list. Your names can be dictionary keys. So you need (as PM 2Ring suggests) a dictionary where the key is a name and the value is a list of years.
def readInFile():
    fileDict = {}
    with open(r"names.txt", "r") as f:
        for line in f:
            name, years = line.split(" ",1)
            fileDict[name] = years.split()
    return fileDict
There are two calls to the string method split(). The first splits the name from the years at the first space. (You can get the name using line[0], but only if the name is one character long, and that is unlikely to be useful with real data.) The second call to split() picks the years apart and puts them in a list.
The result from the one-line sample file will be the same as running this:
fileDict = {'A': ['12', '12', '01', '11', '0', '0', '2', '3', '4', '9', '12', '9']}
As you can see, these years are strings not integers: you may want to convert them.
Rather than doing:
_ = next(f)
to throw away your record count, consider doing
for line in f:
    if line.strip().isdigit():
        continue
instead. If you are using file's built-in iteration (for line in f) then it's generally best not to call next() on f yourself.
It's also not clear to me why your code is doing this:
fileDict ={"Name ": "Years"}
This is a description of what you plan to put in the dictionary, but that is not how dictionaries work. They are not database tables with named columns. If you use a dictionary with key:name and value:list of years, that structure is implicit. The best you can do is describe it in a comment or a type annotation. Performing the assignment will result in this:
fileDict = {
'A': ['12', '12', '01', '11', '0', '0', '2', '3', '4', '9', '12', '9'],
'Name ': 'Years'
}
which mixes up description and data, and is probably not what you want, because your subsequent code is likely to expect a 12-list of years in the dictionary value, and if so it will choke on the string "Years".
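As a rough illustration of describing the structure in a type annotation instead of in the data, here is a sketch that also converts the years to integers and skips a leading record-count line. The function name read_in_file and its lines parameter (any iterable of text lines, such as an open file) are assumptions made for the example.

```python
from typing import Dict, Iterable, List


def read_in_file(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each name to its list of yearly values."""
    file_dict: Dict[str, List[int]] = {}
    for line in lines:
        stripped = line.strip()
        # skip blank lines and a bare record count such as "1"
        if not stripped or stripped.isdigit():
            continue
        name, years = stripped.split(" ", 1)
        file_dict[name] = [int(year) for year in years.split()]
    return file_dict


sample = ["1\n", "A 12 12 01 11 0 0 2 3 4 9 12 9\n"]
print(read_in_file(sample))
```

With a real file you would pass the open file object, for example read_in_file(f) inside a with open("names.txt") as f: block.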
Values in a dict can be anything, including a new dict, but in this case a list sounds good. Maybe something like this.
from io import StringIO # just to make it run without an actual file
the_file_content = 'A 12 12 01 11\nB 13 13 02'
fake_file = StringIO(the_file_content)
# this part stays in your real code:
#with open('names.txt', 'rt') as f:
#    lines = f.readlines()
lines = fake_file.readlines() # this goes away for you
lines = [l.strip().split(' ') for l in lines]
fileDict = {row[0]: row[1:] for row in lines}
# if you want the values to be actual numbers rather than strings
for k, v in fileDict.items():
    fileDict[k] = [int(i) for i in v]
In Python there are constructs that let you do most simple and complex things in one go, rather than looping with index-like constructs.

How do I turn a repeated list element with delimiters into a list?

I imported a CSV file that's basically a table with 5 headers and data sets with 5 elements.
With this code I turned that data into a list of individuals with 5 bits of information (list within a list):
import csv
readFile = open('Category.csv','r')
categoryList = []
for row in csv.reader(readFile):
    categoryList.append(row)
readFile.close()
Now I have a list of lists [[a,b,c,d,e],[a,b,c,d,e],[a,b,c,d,e]...]
However element 2 (categoryList[i][2]) or 'c' in each list within the overall list is a string separated by a delimiter (':') of variable length. How do I turn element 2 into a list itself? Basically making it look like this:
[[a,b,[1,2,3...],d,e][a,b,[1,2,3...],d,e][a,b,[1,2,3...],d,e]...]
I thought about looping through each list element and finding element 2, then use the .split(':') command to separate those values out.
The solution you suggested is feasible. You just don't need to do it after you read the file. You can do it while reading the input in the first place.
for row in csv.reader(readFile):
    row[2] = row[2].split(":") # Split element 2 of each row before appending
    categoryList.append(row)
Edit: I guess you know the purpose of the split function, so I will explain row[2].
You have data such as [[a,b,c,d,e],[a,b,c,d,e],[a,b,c,d,e]...], which means each row looks like [a,b,c,d,e], so every row[2] corresponds to c. This way you alter every c before you append, turning the rows into [[a,b,[1,2,3...],d,e],[a,b,[1,2,3...],d,e]...].
I'm not really clear about your structure, but if c is a string separated by : then try
list(c.split(':'))
Let me know if it solved your problem
You can use a list comprehension on each row and split items containing ':' into a new sublist:
for row in csv.reader(readFile):
    new_row = [i.split(':') if ':' in i else i for i in row]
    categoryList.append(new_row)
This works if you also have other items in the row that you need to split on ':'.
Otherwise, you can directly split on the index if you only have one item containing ':':
for row in csv.reader(readFile):
    row[2] = row[2].split(':')
    categoryList.append(row)
Assume that you have a row like this:
row = ["foo", "bar", "1:2:3:4:5", "baz"]
To convert item [2] into a sublist, you can use
row[2] = row[2].split(":") # elements can be assigned to, yawn.
Now the row is ['foo', 'bar', ['1', '2', '3', '4', '5'], 'baz']
To graft the split items to the "top level" of the row, you can use
row[2:3] = row[2].split(":") # slices can be assigned to, too, yay!
Now the row is ['foo', 'bar', '1', '2', '3', '4', '5', 'baz']
This of course skips any defensive checks of the row data (can it at all be split?) that a real robust application should have.
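For what the defensive checks mentioned above might look like, here is one possible sketch. The function name split_field and its defaults are assumptions; rows that are too short, or whose item is not a string (for example already split), are returned unchanged.

```python
def split_field(row, index=2, sep=":"):
    """Return a copy of row with row[index] split on sep, if that is possible."""
    if index >= len(row) or not isinstance(row[index], str):
        return list(row)  # nothing to split: row too short or item not a string
    new_row = list(row)
    new_row[index] = new_row[index].split(sep)
    return new_row


print(split_field(["foo", "bar", "1:2:3:4:5", "baz"]))
print(split_field(["too", "short"]))
```

Returning a copy instead of modifying the row in place is a design choice for the sketch; assigning to row[2] directly, as in the answers above, works just as well when the input is known to be clean.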

Python - Replace list of characters with another list

I have two lists:
wrong_chars = [
['أ','إ','ٱ','ٲ','ٳ','ٵ'],
['ٮ','ݕ','ݖ','ﭒ','ﭓ','ﭔ'],
['ڀ','ݐ','ݔ','ﭖ','ﭗ','ﭘ'],
['ٹ','ٺ','ٻ','ټ','ݓ','ﭞ'],
]
true_chars = [
['ا'],
['ب'],
['پ'],
['ت'],
]
For a given string I want to replace the entries in wrong_chars with those in true_chars. Is there a clean way to do that in python?
string module to the rescue!
There's a really handy function as a part of the string module called translate that does exactly what you're looking for, though you'll have to pass in your translation mapping as a dictionary.
The documentation is here
An example based on a tutorial from tutorialspoint is shown below (Python 2; in Python 3, use str.maketrans instead of importing maketrans from string):
>>> from string import maketrans
>>> trantab = maketrans("aeiou", "12345")
>>> "this is string example....wow!!!".translate(trantab)
'th3s 3s str3ng 2x1mpl2....w4w!!!'
It looks like you're using unicode here though, which works slightly differently. You can look at this question to get a sense, but here's an example that should work for you more specifically:
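translation_dict = {}
for i, char_list in enumerate(wrong_chars):
    for char in char_list:
        # true_chars[i] is a one-element list, so take its only item
        translation_dict[ord(char)] = true_chars[i][0]
example.translate(translation_dict)  # example is the string you want to fix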
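For reference, a self-contained Python 3 sketch of the same idea using str.maketrans, which accepts a dictionary whose keys are single characters. Only a few of the question's characters are included, written as escapes so that each key is guaranteed to be one code point; the function name normalize is made up for the example.

```python
wrong_chars = [
    ["\u0623", "\u0625", "\u0671"],  # alef variants
    ["\u066e"],                      # dotless beh
]
true_chars = [
    ["\u0627"],  # alef
    ["\u0628"],  # beh
]

# Build {wrong character: correct character} from the two parallel lists.
mapping = {}
for wrongs, (right,) in zip(wrong_chars, true_chars):
    for ch in wrongs:
        mapping[ch] = right

table = str.maketrans(mapping)


def normalize(text):
    """Replace every wrong character in text with its correct form."""
    return text.translate(table)


print(normalize("\u0623\u066e") == "\u0627\u0628")
```

Characters that are not in the table pass through unchanged, so the function is safe to run on mixed text.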
I merged your wrong and true characters into a list of dictionaries, each holding the wrong characters and what should replace them. Here you are:
link to a working sample http://ideone.com/mz7E0R
and the code itself
given_string = "ayznobcyn"
correction_list = [
    {"wrongs":['x','y','z'],"true":'x'},
    {"wrongs":['m','n','o'],"true":'m'},
    {"wrongs":['q','r','s','t'],"true":'q'}
]
processed_string = ""
true_char = ""
for s in given_string:
    for correction in correction_list:
        true_char=s
        if s in correction['wrongs']:
            true_char=correction['true']
            break
    processed_string+=true_char
print(given_string)
print(processed_string)
This code can be optimized further, and I did not deal with any Unicode problems that may arise, since I see you are using Farsi. You should take care of that.
#!/usr/bin/env python
from __future__ import unicode_literals
wrong_chars = [
['1', '2', '3'],
['4', '5', '6'],
['7'],
]
true_chars = 'abc'
table = {}
for keys, value in zip(wrong_chars, true_chars):
    table.update(dict.fromkeys(map(ord, keys), value))
print("123456789".translate(table))
Output
aaabbbc89
My idea is to make just one list of lists that contains the true characters too, like this:
new_chars = [["ا", "أ", "إ", "آ"], ["ب", "بِ", "بِ"]]
# put the true character first in each inner list and collect all the lists, then:
def correct(ch):
    for group in new_chars:
        if ch in group:
            return group[0]
    return ch  # not a known wrong character, keep it as is

print(correct("إ"))
