I have a file that looks like this
a:1
a:2
a:3
b:1
b:2
b:2
and I would like it to take the a and b portions of the file and use them as the first row, with the numbers below, like this.
a b
1 1
2 2
3 2
can this be done?
A CSV (Comma-Separated Values) file should have commas in it, so the output below uses commas instead of space separators.
I recommend writing your code in two parts: The first part should read the input; the second should write out the output.
If your input looks like this:
a:1
a:2
a:3
b:1
b:2
b:2
c:7
you can read in the input like this:
#!/usr/bin/env python3
# Usage: python3 script.py < input.txt > output.csv
import sys

# Loop through all the input lines and put the values in
# a list according to their category:
categoryList = {}  # key => category, value => list of values
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    category, value = line.split(':')
    if category not in categoryList:
        categoryList[category] = []
    categoryList[category].append(value)

# print(categoryList)  # debug line
# Debug line prints: {'a': ['1', '2', '3'], 'b': ['1', '2', '2'], 'c': ['7']}
This reads all your data into a categoryList dict: it contains the categories (the letters) as keys, and lists (of the numbers) as the values. Once you have all the data held in that dict, you can output it.
Outputting involves getting a list of categories (the letters, in your example case) so that they can be written out first as your header:
# Get the list of categories:
categories = sorted(categoryList.keys())
assert categories, 'No categories found!' # sanity check
From here, you can use Python's nice csv module to output the header and then the rest of the lines. When outputting the main data, we can use an outer loop to loop through the nth entries of each category, then we can use an inner loop to loop through every category:
import csv

csvWriter = csv.writer(sys.stdout)
# Output the categories as the CSV header:
csvWriter.writerow(categories)

# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    csvWriter.writerow(values)
    i += 1  # increment index for the next time through the loop
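As a side note, the padding that the while loop does by hand can also be delegated to itertools.zip_longest. Here is a minimal sketch of that variant, reusing the categories and categoryList built above:
import csv
import itertools
import sys

csvWriter = csv.writer(sys.stdout)
# Output the categories as the CSV header:
csvWriter.writerow(categories)

# One list of values per category, in header order:
columns = [categoryList[category] for category in categories]

# zip_longest stops after the longest column, padding the
# shorter columns with '' along the way:
for values in itertools.zip_longest(*columns, fillvalue=''):
    csvWriter.writerow(values)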
If you don't want to use Python's csv module, you will still need to figure out how to group the entries of each category together. And if all you have is simple output (where none of the entries contain quotes, newlines, or commas), you can get away with writing the output manually.
You could use something like this to output your values:
# Output the categories as the CSV header:
print(','.join(categories))

# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    print(','.join(values))
    i += 1  # increment index for the next time through the loop
This will print out:
a,b,c
1,1,7
2,2,
3,2,
It does this by looping through all the list entries (the outer loop), and then looping through all the categories (the inner loop), and then printing out the values joined together by commas.
If you don't want the commas in your output, then you're technically not looking for CSV (Comma-Separated Values) output. Still, in that case, it should be easy to modify the code to get what you want.
But if you have more complicated output (that is, values that contain quotes, commas, or newlines), you should strongly consider using the csv module to output your data. Otherwise, you'll spend lots of time trying to fix obscure bugs with odd input that the csv module already handles.
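For a small, self-contained taste of what the csv module handles for you, note how it quotes values that contain commas or quotation marks:
import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(['plain', 'has,comma', 'has "quotes"'])
# prints: plain,"has,comma","has ""quotes"""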
How can I remove some extra commas in a CSV file? Sometimes there are 3 or more extra commas, and I would like the marked part to become a single column.
The correct format is 11 columns; I just want to find the rows that don't match it and remove the commas.
84,855,648857,8787548,R,mark,one 55, power,0000081,3434,59190000,defen,six,
The first 5 and the last 5 columns are static; only the middle should become a single column, and sometimes there are more than 3 extra columns.
I have split the 300 GB file into smaller files so a Python script can loop over them, so there is a folder containing the files.
The result should look like this:
84,855,648857,8787548,R,mark one 55 power,0000081,3434,59190000,defen,six,
I suggest reading the CSV data into a list, merging the middle fields, and writing it back:
def merge(data):
    result = []
    result += data[:5]  # the first five columns stay as-is
    temporary = ""
    for item in data[5:-5]:  # everything in between gets joined
        temporary += item + " "
    result.append(temporary[:-1])  # drop the trailing space
    result += data[-5:]  # the last five columns stay as-is
    return result
This function takes a list, merges everything between the first five and the last five fields into a single field, and returns the result.
For example, calling
merge(["84","855","648857","8787548","R","mark","one 55","power","0000081","3434","59190000","defen","six"])
will merge indexes 5, 6, and 7, and return:
['84', '855', '648857', '8787548', 'R', 'mark one 55 power', '0000081', '3434', '59190000', 'defen', 'six']
You can then write the list back out to a CSV file.
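For example, here is a minimal sketch of that read-merge-write loop over the folder of split files. The folder and output file names are hypothetical, and rows are assumed to match the example call above, so the empty field produced by the trailing comma is stripped before merging and added back afterwards:
import csv
import glob

for path in glob.glob('split_files/*.csv'):  # hypothetical folder name
    with open(path, newline='') as src, \
         open(path + '.fixed', 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[-1] == '':  # trailing comma in the input line
                row = row[:-1]
            if len(row) > 11:  # too many columns: merge the middle
                row = merge(row)
            writer.writerow(row + [''])  # restore the trailing comma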
I have a CSV file named film.csv. Here is the header line, with a few lines to use as an example:
Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No;NicholasCage.png
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No;seanConnery.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No;NicholasCage.png
I am trying to filter, and need to display the movie titles, for these criteria: first name contains "Richard", Year < 1985, Awards == "Y".
I am able to filter for the award, but not the rest. Can you help?
file_name = "film.csv"
lines = (line for line in open(file_name,encoding='cp1252')) #generator to capture lines
lists = (s.rstrip().split(";") for s in lines) #generators to capture lists containing values from lines
#browse lists and index them per header values, then filter all movies that have been awarded
#using a new generator object
cols=next(lists) #obtains only the header
print(cols)
collections = (dict(zip(cols,data)) for data in lists)
filtered = (col["Title"] for col in collections if col["Awards"][0] == "Y")
for item in filtered:
print(item)
# input()
This works for the award, but I don't know how to add additional filters. Also, when I try to filter on col["Year"] < 1985 I get an error message because a string and an int can't be compared. How do I treat the year as a number?
I believe for the first name I can filter like this:
if col["Actor"].split(", ")[-1] == "Richard"
You know how to add one filter, and there is no such thing as an "additional" filter: just add your conditions to the current condition. Since you want all of the conditions to be True to select a record, you'd combine them with the boolean and operator. For example:
filtered = (
    col["Title"]
    for col in collections
    if col["Awards"][0] == "Y"
    and col["Actor"].split(", ")[-1] == "Richard"
    and int(col["Year"]) < 1985
)
Notice I added an int() around col["Year"] to convert it to an integer.
You've actually gone and reinvented csv.DictReader in the setup to this problem! Instead of
file_name = "film.csv"
lines = (line for line in open(file_name,encoding='cp1252')) #generator to capture lines
lists = (s.rstrip().split(";") for s in lines) #generators to capture lists containing values from lines
#browse lists and index them per header values, then filter all movies that have been awarded
#using a new generator object
cols=next(lists) #obtains only the header
print(cols)
collections = (dict(zip(cols,data)) for data in lists)
filtered = ...
You could have just done:
import csv

file_name = "film.csv"
with open(file_name, encoding='cp1252', newline='') as f:
    collections = csv.DictReader(f, delimiter=";")
    filtered = ...
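Putting DictReader together with the combined filter from above, a complete sketch looks like this (keeping the cp1252 encoding from the original code):
import csv

file_name = "film.csv"
with open(file_name, encoding='cp1252', newline='') as f:
    collections = csv.DictReader(f, delimiter=";")
    filtered = [
        col["Title"]
        for col in collections
        if col["Awards"][0] == "Y"
        and col["Actor"].split(", ")[-1] == "Richard"
        and int(col["Year"]) < 1985
    ]

for item in filtered:
    print(item)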
Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to rewrite the dat file so that all the lines are reordered from the largest m number to the smallest m number. Then, when I loop through PATH in database (shown in the code below), I am looping through in decreasing m.
Here is the relevant part of the code:
import os
import re
import numpy as np

base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()

counter = 0
mu_list = np.array([])
delta_list = np.array([])
offset = 0.00136
beta = 0

for PATH in database:
    if os.path.exists(str(PATH) + '/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.n.dat')
        n7_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0 + offset - (n1_array[0]*2. + 4.*n7_array[0])), 6)
        par = open(str(PATH) + "/params10", "r")
        for line in par:
            counter = counter + 1
            if re.match("mu", line):
                mioMU = re.findall('\d+', line.translate(None, ';'))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
        print mu, delta, PATH  # Python 2 print statement
        mu_list = np.append(mu_list, mu)
        delta_list = np.append(delta_list, delta)
        optimal_counter = 0

print delta_list, mu_list
I have checked the possible flagged duplicate, but I can't seem to get it to work for mine because my file doesn't contain separate strings and numbers; the 'number' I need to sort by is contained within the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m<somenumber> part.
Assuming that the number part of your line has the form of a float, you can use a regular expression to match that part and convert it from string to float.
After that, you can use this information to sort all the lines read from your file. I added an invalid line in order to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)

# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines have been read from the file and stored in a list (called l here). The regex group criterion captures the float part contained in **m12.50**, as you can see on regex101. So iterating through all the lines gives you a new list containing all matched groups as floats. If the regex does not match a given string, or casting the group to a float fails, crit is set to zero in order to place those invalid lines at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples based on each tuple's first element and collect the corresponding strings into a new list, output.
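As a side note, since the question asked for largest-to-smallest order, the same idea can be written more compactly with sorted() and a key function. A minimal sketch, assuming the real paths look like the ones in the question (without the ** markers):
import re

# Extract the float that follows 'm' (e.g. 12.40 from u14m12.40).
pattern = re.compile(r'm(?P<criterion>[0-9]+\.[0-9]+)')

def m_value(line):
    match = pattern.search(line)
    if not match:
        return 0.0  # invalid lines sort last in descending order
    return float(match.group('criterion'))

lines = ['Volumes/hard_disc/u14_cut/u14m12.40_all.beta/beta8',
         'Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8',
         'Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8']
print(sorted(lines, key=m_value, reverse=True))
# m12.50 first, then m12.40, then m11.40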
I have a data file with tons of data like:
{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}
I want to read in the data and save it in a list. I am having trouble getting exactly the right code to extract the data between the { }. I don't want the quotes or the ` after the numbers. Also, the data is not separated into lines, so how do I tell re.search where to begin looking for the next set of data?
At first glance, you can break this data into chunks by splitting it on the string },{:
chunks = data.split('},{')
chunks[0] = chunks[0][1:] # first chunk started with '{'
chunks[-1] = chunks[-1][:-1] # last chunk ended with '}'
Now you have chunks like
"Passenger Quarters",27.`,"Cardassian","not injured"
and you can apply a regular expression to them.
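For instance, a minimal sketch of such a regular expression applied to one chunk, pulling out quoted strings and bare numbers while dropping the quotes, the dot, and the backtick:
import re

chunk = '"Passenger Quarters",27.`,"Cardassian","not injured"'
fields = [m.group(1) or m.group(2)
          for m in re.finditer(r'"([^"]+)"|([0-9]+)', chunk)]
print(fields)  # ['Passenger Quarters', '27', 'Cardassian', 'not injured']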
You should do this in two passes. One to get the list of items and one to get the contents of each item:
import re
from pprint import pprint

data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# This splits up the data into items, where each item is the
# contents inside a pair of braces
item_pattern = re.compile("{([^}]+)}")

# This splits up each item into its parts: either matching a string
# inside quotation marks or a number followed by some garbage
contents_pattern = re.compile('(?:"([^"]+)"|([0-9]+)[^,]+),?')

rows = []
for item in item_pattern.findall(data):
    row = []
    for content in contents_pattern.findall(item):
        if content[1]:  # Number matched, treat it as one
            row.append(int(content[1]))
        else:  # Number not matched, use the string (even if empty)
            row.append(content[0])
    rows.append(row)

pprint(rows)
The following will produce a list of lists, where each list is an individual record.
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# Remove characters we don't want and split into individual fields
badchars = ['{', '}', '`', '.', '"']
newdata = data.translate({ord(c): None for c in badchars})
fields = newdata.split(',')

# Assemble groups of 4 fields into separate lists and append
# to the parent list. An obvious weakness here is if there are
# records that contain something other than 4 fields
records = []
myrecord = []
recordcount = 1
for field in fields:
    myrecord.append(field)
    recordcount = recordcount + 1
    if recordcount > 4:
        records.append(myrecord)
        myrecord = []
        recordcount = 1

for record in records:
    print(record)
Output:
['Passenger Quarters', '27', 'Cardassian', 'not injured']
['Passenger Quarters', '9', 'Cardassian', 'injured']
['Passenger Quarters', '32', 'Romulan', 'not injured']
['Bridge', 'Unknown', 'Romulan', 'not injured']
I have a text file that I am reading in Python. I'm trying to extract certain elements from the text file that follow keywords, to append them to empty lists. The file looks like this:
so I want to make two empty lists:
the 1st list will hold the sequence names
the 2nd list will be a list of lists in the format [Bacteria, Phylum, Class, Order, Family, Genus, Species]
Most of the organisms will be Uncultured bacterium. I am trying to add the Uncultured bacterium with the following IDs, which are separated by ;.
Is there any way to scan for a certain word and, when the word is found, take the word that comes after it (separated by a '\t')?
I need it to create a dictionary that translates the Sequence Name to the taxonomic data.
I know I will need an empty list to append the names to:
seq_names = [ ]
a second list to put the taxonomy lists into:
taxonomy = [ ]
and a 3rd list that will be reset after every iteration:
temp = [ ]
I'm sure it can be done in Biopython, but I'm working on my Python skills.
Yes, there is a way.
You can split the string you get from reading the file into a list using the built-in function split. From this you can find the index of the word you are looking for, and then use that index plus one to get the word after it. For example, take a text file called test.txt (the sample below is shown with spaces because hard tabs don't render well here, but assume the words are separated by tabs):
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will set desired to 'eight'.
Edit: To return every following word in the list:
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This uses a list comprehension: it enumerates the list of words and, whenever a word matches the one you are looking for, includes the word at the next index in the list.
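One small caveat with this comprehension: if the keyword happens to be the very last word, words[ind + 1] raises an IndexError. Enumerating everything except the final word avoids that:
words = "one\ttwo\tseven\teight\tseven".split('\t')
desired = [words[ind + 1]
           for ind, word in enumerate(words[:-1])
           if word == "seven"]
print(desired)  # ['eight'] -- the trailing 'seven' is safely skipped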
Edit 2: To split on both newlines and tabs you can use a regular expression:
import re
f = open('testtest.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
    'some_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species],
    'some_other_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the file, .split('\t') to break them into "columns", and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13], ...] to load each row into the dictionary.
So,
for row in inp_file.readlines():
    row = row.split('\t')
    my_data[row[0]] = [row[10], row[11], row[13], ...]
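If you still want the two lists from your question, they fall straight out of the populated dictionary:
seq_names = list(my_data.keys())   # the sequence names
taxonomy = list(my_data.values())  # the list of taxonomy lists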