Consider the following text file excerpt:
Distance,Velocity,Time
(m),(m/s),(s)
1,1,1
2,1,2
3,1,3
I want it to be transformed into this:
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
In other words, I want to concatenate the rows that contain text, and I want them concatenated column-wise.
I am initially manipulating a text file that's generated by a piece of software. I have successfully transformed it down to only numeric columns and their headers, in CSV format. But I have multiple header rows for each column, and I need all the information in each header row, because the column attributes differ from file to file. How can I do this in a smart way in Python?
edit: Thank you for your suggestions, they helped me a lot. I used Daweo's solution and added a dynamic row count, because the number of header rows may vary from 2 to 7 depending on the generated output. Here's the code snippet I ended up with.
import re

# Get column headers
a = 0
header_rows = 0
numlines = 0
with open(full, "r") as input_file:
    Lines = ""
    for line in input_file:
        l = line
        g = re.sub(' +', ' ', l)   # collapse repeated spaces
        y = re.sub('\t', ',', g)   # turn tabs into commas
        numlines += 1
        if len(l.encode('ANSI')) > 250:  # the 'ANSI' codec is Windows-only
            # finds header start row
            a += 1
        if a > 0:
            # finds header end row
            if "---" in line:
                header_rows = numlines - (numlines - a + 1)
                break
            else:
                # Lines is my headers string
                Lines = Lines + "%s" % (y) + ' '
# Create concatenated column headers
rows = [i.split(',') for i in Lines.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]  # transpose: one list of header fragments per column
newcols = [''.join(c) for c in cols]  # join each column's fragments into a single header
print(newcols)
I would do it the following way:
txt = " Distance,Velocity,Time \n (m),(m/s),(s) \n 1,1,1 \n 2,1,2 \n 3,1,3 \n "
rows = [i.split(',') for i in txt.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]
newcols = [[i[0]+i[1], *i[2:]] for i in cols]
newrows = [','.join(i) for i in zip(*newcols)]
newtxt = '\n'.join(newrows)
print(newtxt)
Output:
Distance (m),Velocity(m/s),Time (s)
1,1,1
2,1,2
3,1,3
Crucial here is the use of zip to transpose the data, so I can deal with columns rather than rows. [[i[0]+i[1],*i[2:]] for i in cols] is responsible for the actual concatenation, so if you had headers spanning 3 lines you could do [[i[0]+i[1]+i[2],*i[3:]] for i in cols] and so on.
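If the number of header rows varies (the asker's edit mentions 2 to 7), the same idea generalizes. A quick sketch, reusing cols from the snippet above, with n standing for however many header rows there are:

n = 2  # header-row count; 2 to 7 in the asker's files
newcols = [[''.join(i[:n]), *i[n:]] for i in cols]
newrows = [','.join(i) for i in zip(*newcols)]
print('\n'.join(newrows))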
I am not aware of anything that exists to do this, so instead you can just write a custom function. In the example below the function takes two strings and also a separator, which defaults to ,.
It will split each string into a list, then use a list comprehension with zip to pair up the lists, joining each pair.
Lastly it will join the consolidated headers again with the separator.
def concat_headers(header1, header2, separator=","):
    headers1 = header1.split(separator)
    headers2 = header2.split(separator)
    consolidated_headers = ["".join(values) for values in zip(headers1, headers2)]
    return separator.join(consolidated_headers)
data = """Distance,Velocity,Time\n(m),(m/s),(s)\n1,1,1\n2,1,2\n3,1,3\n"""
header1, header2, *lines = data.splitlines()
consolidated_headers = concat_headers(header1, header2)
print(consolidated_headers)
print("\n".join(lines))
OUTPUT
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
You don't really need a function to do it because it can be done like this using the csv module:
import csv
data_filename = 'position_data.csv'
new_filename = 'new_position_data.csv'
with open(data_filename, 'r', newline='') as inp, \
     open(new_filename, 'w', newline='') as outp:
    reader, writer = csv.reader(inp), csv.writer(outp)
    row1, row2 = next(reader), next(reader)
    new_header = [a+b for a, b in zip(row1, row2)]
    writer.writerow(new_header)
    # Copy the rest of the input file.
    for row in reader:
        writer.writerow(row)
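If the header spans a variable number of rows rather than exactly two, a sketch along the same lines (header_rows here is an assumed placeholder for whatever your file dictates) can pull them all off the reader with itertools.islice:

import csv
from itertools import islice

header_rows = 2  # set to however many header rows the file has

with open(data_filename, 'r', newline='') as inp, \
     open(new_filename, 'w', newline='') as outp:
    reader, writer = csv.reader(inp), csv.writer(outp)
    headers = list(islice(reader, header_rows))           # all header rows
    writer.writerow([''.join(parts) for parts in zip(*headers)])
    writer.writerows(reader)                              # copy the data rows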
Related
I have a csv file where all the columns are in one row, encased in quotation marks and separated by commas.
The rows in the csv are split by commas; two consecutive commas mean there is a missing value. I would like to separate the columns by these rules. Where a value is wrapped in quotation marks, a comma inside the quotes should not act as a separator, because that value is an address.
This is a sample of the data (it's a csv; I converted it into a dictionary to show you a sample):
{'Store code,"Biz","Add","Labels","TotalSe","DirectSe","DSe","TotalVe","SeVe","MaVe","Totalac","Webact","Dions","Ps"': {0: ',,,,"Numsearching","Numsearchingbusiness","Numcatprod","Numview","Numviewed","Numviewed2","Numaction","Numwebsite","Numreques","Numcall"',
1: 'Nora,"Ora","Sgo, Mp, 2000",,111,44,33,121,1232,53411,4,5,,3',
2: 'mc11,"21 old","tjis that place, somewher, Netherlands, 2434",,3245,325,52454,3432,243,4353,343,23,23,18'}}
I have tried this so far and am a bit stuck:
disc = pd.read_csv('/content/gdrive/My Drive/blank/blank.csv',delimiter='",')
Sample of csv: (shown as a screenshot in the original post)
I use normal string functions to remove the " at both ends of every line, and I convert doubled "" into a single ".
This way I get a CSV which I can load with read_csv().
import pandas as pd

f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')

for row in f1:
    row = row.strip()             # remove "\n"
    row = row[1:-1]               # remove " on both ends
    row = row.replace('""', '"')  # convert "" into "
    f2.write(row + '\n')

f2.close()
f1.close()

df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)
Another method: read it with the csv module, which strips the outer quoting, and save each row's single field as a normal string:
import csv
import pandas as pd

f1 = open('Sample - Sheet1.csv')
f2 = open('temp.csv', 'w')

reader = csv.reader(f1)
for row in reader:
    f2.write(row[0] + '\n')   # each parsed row holds the real CSV line as its only field

f2.close()
f1.close()

df = pd.read_csv('temp.csv')
print(len(df.columns))
print(df)
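As a side note, the temporary file could be skipped entirely by cleaning the lines in memory and handing them to read_csv through a StringIO buffer. A sketch, assuming the same input layout as above:

import io
import pandas as pd

buf = io.StringIO()
with open('Sample - Sheet1.csv') as f1:
    for row in f1:
        row = row.strip()             # remove "\n"
        row = row[1:-1]               # remove " on both ends
        row = row.replace('""', '"')  # convert "" into "
        buf.write(row + '\n')

buf.seek(0)                           # rewind so read_csv reads from the top
df = pd.read_csv(buf)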
Below is some python code that runs on a file similar to this (old_file.csv).
A,B,C,D
1,2,XX,3
11,22,XX,33
111,222,XX,333
How can I iterate through all lines in old_file.csv (without knowing the length of the file) and replace all values in column C, i.e. index 2, or cells[row][2] (based on cells[row][col]), while ignoring the header row? In new_file.csv, all values containing 'XX' should become 'YY', for example.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
cells[1][2] = 'YY'
cells[2][2] = 'YY'
cells[3][2] = 'YY'
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
Just a small change to @Soviut's answer; I think this will help you:
import csv

rows = csv.reader(open('old_file.csv'))
newRows = []

for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
    newRows.append(row)

# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(newRows)
You can very easily loop over the array of rows and replace values in the target cell.
# get rows from old CSV file (materialized into a list so the edits persist)
rows = list(csv.reader(open('old_file.csv')))

# iterate over each row and replace target cell
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'

# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(rows)
csv.reader gives you the rows as lists, so you could just materialize them and run the replacement on r[1:].
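A minimal sketch of what that could look like:

import csv

r = csv.reader(open('old_file.csv'))
cells = list(r)        # materialize the reader so the rows can be sliced
for row in cells[1:]:  # everything after the header
    row[2] = 'YY'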
len(cells) is the number of rows; iterating from 1 makes it skip the header line. Also, the variable lines should be cells.
import csv

r = csv.reader(open('old_file.csv'))
cells = [l for l in r]

for i in range(1, len(cells)):
    cells[i][2] = 'YY'

w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
read_handle = open('old_file.csv', 'r')
data = read_handle.read().split('\n')
read_handle.close()

new_data = []
new_data.append(data[0])          # keep the header line unchanged

for line in data[1:]:
    if not line:
        new_data.append(line)     # preserve blank lines as-is
        continue
    line = line.split(',')
    line[2] = 'YY'
    new_data.append(','.join(line))

write_handle = open('new_file.csv', 'w')
write_handle.writelines('\n'.join(new_data))
write_handle.close()
If I have multiple text files that I need to parse that look like the one shown (as an image in the original post), but can vary in terms of column names and the length of the hash-mark block at the top:
How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace = True, skiprows = 14), but it has all sorts of problems. My issues are...
All the text, asterisks, and pound signs at the top need to be ignored, but I can't just use skiprows because the amount of junk at the top varies in length from file to file.
The columns "stat (+/-)" and "syst (+/-)" are read as 4 columns because of the whitespace.
The single pound sign is included in the column names, and I don't want that. I can't just assign the column names manually, because they vary from text file to text file.
Any help is much obliged; I'm just not really sure where to go after reading the file with pandas.
Consider reading in the raw file and cleaning it line by line while writing to a new file using the csv module. Regex is used to identify the column header row, using the i column as the match criterion. The below assumes more than one space separates columns:
import os
import csv, re
import pandas as pd

rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"

with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match('^.*i', line):          # KEEP COLUMN HEADER ROW
                line = line.replace('\n', '')
                row = line.split("  ")          # split on double spaces
                writer.writerow(row)
            elif not line.startswith('#'):      # REMOVE HASHTAG LINES
                line = line.replace('\n', '')
                row = line.split("  ")
                writer.writerow(row)

df = pd.read_csv(tempfile)                                # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]    # REMOVE '#' IN COL NAMES
os.remove(tempfile)                                       # DELETE TEMP FILE
This is the approach I mentioned in the comment: it uses the file object to skip the custom dirty data you need to skip at the beginning. You land the file offset at the appropriate location in the file, where read_fwf simply does the job:
import pandas as pd

with open(rawfile, 'r') as data_file:
    # consume one '#' per comment line, remembering where the last one was
    # (assumes the file starts with at least one '#' line)
    while data_file.read(1) == '#':
        last_pound_pos = data_file.tell()
        data_file.readline()
    # rewind to just after the last '#', i.e. the start of the header text
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)
df
Out[88]:
i mult stat (+/-) syst (+/-) Q2 x x.1 Php
0 0 0.322541 0.018731 0.026681 1.250269 0.037525 0.148981 0.104192
1 1 0.667686 0.023593 0.033163 1.250269 0.037525 0.150414 0.211203
2 2 0.766044 0.022712 0.037836 1.250269 0.037525 0.149641 0.316589
3 3 0.668402 0.024219 0.031938 1.250269 0.037525 0.148027 0.415451
4 4 0.423496 0.020548 0.018001 1.250269 0.037525 0.154227 0.557743
5 5 0.237175 0.023561 0.007481 1.250269 0.037525 0.159904 0.750544
I'm trying to write some data into an Excel spreadsheet using CSV.
I'm writing a motif finder, reading the input from FASTA and writing the output to Excel.
But I'm having a hard time writing the data in the correct format.
My desired result in Excel is like below:
SeqName M1 Hits M2 Hits
Seq1 MN[A-Z] 3 V[A-Z]R[ML] 2
Seq2 MN[A-Z] 0 V[A-Z]R[ML] 5
Seq3 MN[A-Z] 1 V[A-Z]R[ML] 0
I have generated correct results; I just don't know how to put them into the correct format shown above.
This is the code that I have so far.
import re
from Bio import SeqIO
import csv
import collections

def SearchMotif(f1, motif, f2="motifs.xls"):
    with open(f1, 'r') as fin, open(f2, 'wb') as fout:
        # This makes SeqName static and everything else mutable; thus, when more
        # than one motif is searched, they can be correctly placed into Excel.
        writer = csv.writer(fout, delimiter='\t')
        motif_fieldnames = ['SeqName']
        writer_dict = csv.DictWriter(fout, delimiter='\t', fieldnames=motif_fieldnames)
        for i in range(0, len(motif), 1):
            motif_fieldnames.append('M%d' % (i+1))
            motif_fieldnames.append('Hits')
        writer_dict.writeheader()

        # Reading input fasta file for processing.
        fasta_name = []
        for seq_record in SeqIO.parse(f1, 'fasta'):
            sequence = repr(seq_record.seq)  # re module only takes strings
            fasta_name.append(seq_record.name)
            print sequence                   # **********
            for j in motif:
                motif_name = j
                print motif_name             # **********
                number_count = len(re.findall(j, sequence))
                print number_count           # **********
                writer.writerow([motif_name])
        for i in fasta_name:
            writer.writerow([i])  # [] makes it fit into one column instead of one character per column
The print statements that have the asterisks ********** generate this, where the number is the number of Hits and the different sequences are Seq1, Seq2, and so on:
Seq('QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQ...LTS', SingleLetterAlphabet())
PA[A-Z]
0
Y[A-Z]L[A-Z]
0
Seq('SFNVATLPAESSSTDLDTTVLLPDEPAEVSDLERIETEWTNMKILELPFAPQMK...VSS', SingleLetterAlphabet())
PA[A-Z]
2
Y[A-Z]L[A-Z]
0
Seq('PAESIYFKIEKTYNLT', SingleLetterAlphabet())
PA[A-Z]
1
Y[A-Z]L[A-Z]
1
You can write your data to a Pandas DataFrame, and then use the DataFrame's to_csv method to export it to a CSV. There is also a to_excel method. Pandas won't let you have multiple columns with the same name, like your "Hits" column. However, you can work around that by putting the column names you want in the first row and using the header=False option when you export.
"import pandas as pd", then replace your code starting at "fasta_name = []" with this:
column_names = ['SeqName']
for i, m in enumerate(motif):
    column_names += ['M'+str(i), 'Hits'+str(i)]
df = pd.DataFrame(columns=column_names)

for row, seq_record in enumerate(SeqIO.parse(f1, 'fasta')):
    sequence = str(seq_record.seq)        # the re module needs a plain string
    df.loc[row, 'SeqName'] = seq_record.name
    for i, j in enumerate(motif):
        df.loc[row, 'M'+str(i)] = j
        df.loc[row, 'Hits'+str(i)] = len(re.findall(j, sequence))

df.to_csv(f2, index=False)                # write the table to the output file
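To illustrate the header=False trick mentioned above, here is a hypothetical sketch (the file name motifs.csv and the sample values are made up): write the duplicate-label header yourself as a plain row, then export the frame without its own header:

import pandas as pd

df = pd.DataFrame([['Seq1', 'MN[A-Z]', 3, 'V[A-Z]R[ML]', 2]],
                  columns=['SeqName', 'M1', 'Hits1', 'M2', 'Hits2'])

with open('motifs.csv', 'w') as fout:
    fout.write('SeqName,M1,Hits,M2,Hits\n')     # the header row you actually want
    df.to_csv(fout, index=False, header=False)  # data rows only, no header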
I have a CSV file that I read like below:
with open ("ann.csv", "rb") as annotate:
for col in annotate:
ann = col.lower().split(",")
print ann[0]
My CSV file looks like below:
H1,H2,H3
da,ta,one
dat,a,two
My output looks like this:
da
dat
but I want comma-separated output, like da,dat. How can I do that?
First, in Python you have the csv module - use that.
Second, you're iterating through rows, so using col as a variable name is a bit confusing.
Third, just collect the items in a list and print that using .join():
import csv

with open("ann.csv", "rb") as csvfile:
    reader = csv.reader(csvfile)
    reader.next()  # Skip the header row
    collected = []
    for row in reader:
        collected.append(row[0])
    print ",".join(collected)
Try it like this:
with open("ann.csv", "rb") as annotate:
    output = []
    next(annotate)  # next() advances the file pointer to the next line
    for col in annotate:
        output.append(col.lower().split(",")[0])
    print ",".join(output)
Then try this:
result = ''
with open("ann.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().split(",")
        # add the first element of every line to one string, separated by commas
        result = result + ann[0] + ','
print result
Try this
>>> with open("ann.csv", "rb") as annotate:
...     for col in annotate:
...         ann = col.lower().split(",")
...         print ann[0]+',',
...
Instead of printing it on the spot, build up a string, and print it in the end.
s = ''
with open("ann.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().split(",")
        s += ann[0] + ','
s = s[:-1]  # Remove last comma
print(s)
I would also suggest changing the variable name col; it is looping over lines, not over columns.
Using numpy.loadtxt might be a bit easier:
In [23]: import numpy as np
...: fn = 'a.csv'
...: m = np.loadtxt(fn, dtype=str, delimiter=',')
...: print m
[['H1' 'H2' 'H3']
['da' 'ta' 'one']
['dat' 'a' 'two']]
In [24]: m[:,0][1:]
Out[24]:
array(['da', 'dat'],
dtype='|S3')
In [25]: print ','.join(m[:,0][1:])
da,dat
m[:,0] gets the first column of matrix m, and [1:] skips its first element, 'H1'.
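Equivalently, both steps can be done with a single slice:

print ','.join(m[1:,0])   # rows 1 onward, column 0 -> da,dat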