Python: Pandas, dealing with spaced column names

I have multiple text files that I need to parse that look like the sample below, but they can vary in their column names and in the length of the hashtag block at the top:
How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace = True, skiprows = 14), but it has all sorts of problems. My issues are...
All the text, asterisks, and pound signs at the top need to be ignored, but I can't just use skiprows because the amount of junk at the top varies in length from file to file.
The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace.
The one pound sign is included in the column names, and I don't want that. I can't just assign the column names manually because they vary from text file to text file.
Any help is much appreciated; I'm just not really sure where to go after I read the file using pandas.

Consider reading in the raw file and cleaning it line by line while writing to a new file with the csv module. A regex identifies the column header row by its leading i column. The code below assumes that two or more spaces separate columns, so the single spaces inside "stat (+/-)" and "syst (+/-)" survive the split:
import os
import csv
import re
import pandas as pd

rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"

with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match(r'^#\s*i\b', line):                  # KEEP COLUMN HEADER ROW
                row = re.split(r'\s{2,}', line.rstrip('\n'))
                writer.writerow(row)
            elif not line.startswith('#'):                   # SKIP THE OTHER HASHTAG LINES
                row = re.split(r'\s{2,}', line.strip())
                writer.writerow(row)

df = pd.read_csv(tempfile)                                   # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]      # REMOVE '#' FROM COL NAMES
os.remove(tempfile)                                          # DELETE TEMP FILE

This is the approach I mentioned in the comment: it uses a file object to skip the dirty data at the beginning, landing the file offset right where read_fwf can simply do the job:
with open(rawfile, 'r') as data_file:
    # Read one character per line: while it is '#', remember the offset just
    # past it and skip the rest of the line. When the loop exits, the last
    # recorded offset sits right after the '#' of the header row.
    last_pound_pos = 0
    while data_file.read(1) == '#':
        last_pound_pos = data_file.tell()
        data_file.readline()
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)
df
Out[88]:
i mult stat (+/-) syst (+/-) Q2 x x.1 Php
0 0 0.322541 0.018731 0.026681 1.250269 0.037525 0.148981 0.104192
1 1 0.667686 0.023593 0.033163 1.250269 0.037525 0.150414 0.211203
2 2 0.766044 0.022712 0.037836 1.250269 0.037525 0.149641 0.316589
3 3 0.668402 0.024219 0.031938 1.250269 0.037525 0.148027 0.415451
4 4 0.423496 0.020548 0.018001 1.250269 0.037525 0.154227 0.557743
5 5 0.237175 0.023561 0.007481 1.250269 0.037525 0.159904 0.750544

Related

Reading a txt file and saving individual columns as lists

I am trying to read a .txt file and save the data in each column as a list. Each column in the file contains a variable which I will later use to plot a graph. I have looked up the best method to do this, and most answers recommend opening the file, reading it, and then either splitting or saving the columns as lists. The data in the .txt file is as follows:
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try to append the columns to my variables:
import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
However, when I try to print the lists, time and rim, using print(time, rim), I get the following error message -
r = line[1]
IndexError: list index out of range
I am, however, able to print 'time' alone if I comment out the r = line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd

df = pd.read_csv('./rvt.txt', sep='\t', header=None,
                 names=['time', 'rim'])  # a list with your column names
Then you can use list(your_column) to work with your columns as lists
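Since the question says the delimiter can be either a space or a tab, a hedged variant of the same idea (the column names are assumed from the question) lets pandas split on any run of whitespace:

import pandas as pd

# delim_whitespace=True splits on any run of spaces or tabs, so both
# delimiters mentioned in the question are handled by one call.
df = pd.read_csv('./rvt.txt', delim_whitespace=True, header=None,
                 names=['time', 'rim'])

time = list(df['time'])  # each column as a plain Python list
rim = list(df['rim'])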
The problem is with the delimiter: the dataset contains multiple spaces ' '. When you use '\t' and print each line, you can see the delimiter is not splitting the line, e.g.:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result you can use " " (space) as the delimiter and filter out the empty values:

readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))
    t = line[0]
    r = line[1]
Here is the full code:

import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        line = list(filter(lambda x: len(x), line))  # drop the empty strings left by repeated spaces
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)

Reading CSV file from stdin in Python and modifying it

I need to read a CSV file from stdin and output only the rows whose values match those specified for given columns. My input is like this:
2
Kashiwa
Name,Campus,LabName
Shinichi MORISHITA,Kashiwa,Laboratory of Omics
Kenta Naai,Shirogane,Laboratory of Functional Analysis in Silico
Kiyoshi ASAI,Kashiwa,Laboratory of Genome Informatics
Yukihide Tomari,Yayoi,Laboratory of RNA Function
My output should be like this:
Name,Campus,LabName
Shinichi MORISHITA,Kashiwa,Laboratory of Omics
Kiyoshi ASAI,Kashiwa,Laboratory of Genome Informatics
I need to select the people whose value in column #2 == Kashiwa, and not output the first two lines of stdin to stdout.
So far I just tried to read from stdin into csv but I am getting each row as a list of strings (as expected from csv documentation). Can I change this?
#!/usr/bin/env python3
import sys
import csv

data = sys.stdin.readlines()
for line in csv.reader(data):
    print(line)
Output:
['2']
['Kashiwa']
['Name', 'Campus', 'LabName']
['Shinichi MORISHITA', 'Kashiwa', 'Laboratory of Omics']
['Kenta Naai', 'Shirogane', 'Laboratory of Functional Analysis in Silico']
['Kiyoshi ASAI', 'Kashiwa', 'Laboratory of Genome Informatics']
['Yukihide Tomari', 'Yayoi', 'Laboratory of RNA Function']
Can someone give me some advice on reading stdin into CSV and manipulating the data afterwards (outputting only the needed columns, swapping columns, etc.)?
#!/usr/bin/env python3
import sys
import csv

data = sys.stdin.readlines()              # read the whole input
column_to_be_matched = int(data.pop(0))   # the column number to match
word_to_be_matched = data.pop(0).strip()  # the word to match in that column (strip the newline)
col_headers = data.pop(0).strip()         # the header row
print(col_headers)                        # print the column names
for line in csv.reader(data):
    if line[column_to_be_matched - 1] == word_to_be_matched:  # when it matches
        print(",".join(line))             # print it
Use pandas to read and manage your data in a DataFrame:
import pandas as pd

# File location
infile = r'path/file'

# Load the file and skip the first two rows
df = pd.read_csv(infile, skiprows=2)

# Keep only the rows whose Campus column equals 'Kashiwa'
df = df[df['Campus'] == 'Kashiwa']

You can perform all kinds of edits; for example, sort your DataFrame simply by:

df.sort_values(by='your column')
Check the Pandas documentation for all the possibilities.
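Since the question reads from stdin, note that read_csv accepts any file-like object, so stdin can be passed in directly; a minimal sketch under that assumption:

import sys
import pandas as pd

# skiprows=2 drops the '2' and 'Kashiwa' preamble lines before the header
df = pd.read_csv(sys.stdin, skiprows=2)
df = df[df['Campus'] == 'Kashiwa']
df.to_csv(sys.stdout, index=False)  # write the filtered rows back to stdout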
This is one approach.
Ex:
import csv

with open(filename) as csv_file:
    reader = csv.reader(csv_file)
    next(reader)             # skip first line
    next(reader)             # skip second line
    print(next(reader))      # print the header
    for row in reader:
        if row[1] == 'Kashiwa':   # filter by 'Kashiwa'
            print(row)
Output:
['Name', 'Campus', 'LabName']
['Shinichi MORISHITA', 'Kashiwa', 'Laboratory of Omics']
['Kiyoshi ASAI', 'Kashiwa', 'Laboratory of Genome Informatics']
import csv, sys

lines = sys.stdin.readlines()            # read all of stdin (readline() would only read one line)
data = list(csv.reader(lines))
out = []
print(",".join(data[2]))                 # the header row is the third line of the input
for line in data[3:]:                    # adjust the index to match your input
    if line[1] == 'Kashiwa':             # note: the match is case-sensitive
        string = f"{line[0]},{line[1]},{line[2]}"
        sys.stdout.write(string + '\n')  # sys.stdout.write does the same job as print here
        out.append(line)                 # keep the matching rows if you need them later

Filter large csv files (10GB+) based on column value in Python

EDITED: Added complexity
I have a large csv file, and I want to filter out rows based on the column values. For example consider the following CSV file format:
Col1,Col2,Nation,State,Col4...
a1,b1,Germany,state1,d1...
a2,b2,Germany,state2,d2...
a3,b3,USA,AL,d3...
a3,b3,USA,AL,d4...
a3,b3,USA,AK,d5...
a3,b3,USA,AK,d6...
I want to filter all rows with Nation == 'USA', and then split those based on each of the 50 states. What's the most efficient way of doing this? I'm using Python. Thanks.
Also, is R better than Python for such tasks?
Use boolean indexing or DataFrame.query:
df1 = df[df['Nation'] == "Japan"]
Or:
df1 = df.query('Nation == "Japan"')
The second should be faster; see the performance of query.
If that is still not possible (not a lot of RAM), try using dask, as Jon Clements commented (thank you).
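If you want to stay in pandas without loading the whole 10GB at once, a hedged sketch using read_csv's chunksize (the file name is an assumption; Nation and State come from the question's sample):

import pandas as pd

# Stream the file in chunks, keep only the USA rows, then write one
# output file per state, as the question asks.
chunks = pd.read_csv('large.csv', chunksize=10**6)
filtered = pd.concat(chunk[chunk['Nation'] == 'USA'] for chunk in chunks)
for state, group in filtered.groupby('State'):
    group.to_csv('usa_{}.csv'.format(state), index=False)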
One way would be to filter the csv first and then load it, given the size of the data:

import csv

with open('yourfile.csv', 'r') as f_in, open('yourfile_edit.csv', 'w') as f_outfile:
    f_out = csv.writer(f_outfile, escapechar=' ', quoting=csv.QUOTE_NONE)
    for line in f_in:
        line = line.strip()
        if 'Japan' in line:
            f_out.writerow([line])
Now load the csv:

import pandas as pd

df = pd.read_csv('yourfile_edit.csv', sep=',', header=None)

You get:

   0   1   2      3   4
0  2  a3  b3  Japan  d3
You could open the file, index the position of the Nation header, then iterate over a reader().
import csv

temp = r'C:\path\to\file'
with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    # next(cr) consumes the header row; grab the index of 'Nation'
    i = next(cr).index('Nation')
    # list comprehension over the remaining rows
    filtered = [row for row in cr if row[i] == 'Japan']
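The per-state split the question asks about can be sketched by extending the same stdlib approach (the output file naming is an assumption):

import csv
from collections import defaultdict

temp = r'C:\path\to\file'
by_state = defaultdict(list)
with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    header = next(cr)
    nation, state = header.index('Nation'), header.index('State')
    for row in cr:                      # stream the rows; only matches are kept in memory
        if row[nation] == 'USA':
            by_state[row[state]].append(row)

for st, rows in by_state.items():       # one output file per state
    with open('usa_{}.csv'.format(st), 'w', newline='') as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(rows)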

Count repeated values in a specific column in a CSV file and return the value to another column (python2)

I am currently trying to count repeated values in a column of a CSV file and return the count to another CSV column in Python.
For example, my CSV file :
KeyID GeneralID
145258 KL456
145259 BG486
145260 HJ789
145261 KL456
What I want to achieve is to count how many data have the same GeneralID and insert it into a new CSV column. For example,
KeyID Total_GeneralID
145258 2
145259 1
145260 1
145261 2
I have tried to split each column using the split method but it didn't work so well.
My code:
case_id_list_data = []
with open(file_path_1, "rU") as g:
    for line in g:
        case_id_list_data.append(line.split('\t'))
        # print case_id_list_data[0][0]  # the result is dissatisfying
# I'm stuck here...
And if you are averse to pandas and want to stay with the standard library:
Code:
import csv
from collections import Counter

with open('file1', 'rU') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    lines = [line for line in reader]

# count the occurrences of each GeneralID, then append the count to each row
counts = Counter([l[1] for l in lines])
new_lines = [l + [str(counts[l[1]])] for l in lines]

with open('file2', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)
Results:
KeyID GeneralID Total_GeneralID
145258 KL456 2
145259 BG486 1
145260 HJ789 1
145261 KL456 2
You have to divide the task into three steps:
1. Read CSV file
2. Generate new column's value
3. Add value to the file back
import csv
import fileinput
import sys

# 1. Read CSV file
# Open the CSV and read the two columns from it.
with open("dev.csv") as filein:
    reader = csv.reader(filein, skipinitialspace=True)
    xs, ys = zip(*reader)

# 2. Generate the new column's value
# This loop counts each "GeneralID" element (skipping the header at index 0).
result = ["Total_GeneralID"]
for i in range(1, len(ys)):
    result.append(ys.count(ys[i]))

# 3. Add the value back to the file
# This loop rewrites the file in place with the new column appended.
for ind, line in enumerate(fileinput.input("dev.csv", inplace=True)):
    sys.stdout.write("{}, {}\n".format(line.rstrip(), result[ind]))
I haven't used a temp file or any high-level module like pandas.
import pandas as pd

# read your csv into a dataframe
df = pd.read_csv('file_path_1')

# count the values in the GeneralID column once, then look up the
# occurrence count for each row
counts = df.GeneralID.value_counts()
df['Total_GeneralID'] = df.GeneralID.apply(lambda x: counts[x])
df = df[['KeyID', 'Total_GeneralID']]
Out[442]:
KeyID Total_GeneralID
0 145258 2
1 145259 1
2 145260 1
3 145261 2
You can use the pandas library:
first, read_csv
get the counts of values in column GeneralID with value_counts, renamed to the output column name
join it back to the original DataFrame
import pandas as pd
df = pd.read_csv('file')
s = df['GeneralID'].value_counts().rename('Total_GeneralID')
df = df.join(s, on='GeneralID')
print(df)
KeyID GeneralID Total_GeneralID
0 145258 KL456 2
1 145259 BG486 1
2 145260 HJ789 1
3 145261 KL456 2
Use csv.reader instead of the split() method. It's easier. Thanks.
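A minimal sketch of that suggestion, reusing the question's own file handle (file_path_1 is the question's variable):

import csv

# csv.reader does the tab-splitting and drops the line terminators for you
with open(file_path_1, 'rU') as g:
    case_id_list_data = list(csv.reader(g, delimiter='\t'))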

Writing processed data into excel using CSV Python

I'm trying to write some data into an Excel spreadsheet using CSV.
I'm writing a motif finder, reading the input from FASTA and outputting to Excel.
But I'm having a hard time writing the data in the correct format.
My desired result in Excel is like below:
SeqName M1 Hits M2 Hits
Seq1 MN[A-Z] 3 V[A-Z]R[ML] 2
Seq2 MN[A-Z] 0 V[A-Z]R[ML] 5
Seq3 MN[A-Z] 1 V[A-Z]R[ML] 0
I have generated correct results, but I just don't know how to put them into the correct format like the above.
This is the code that I have so far.
import re
from Bio import SeqIO
import csv
import collections

def SearchMotif(f1, motif, f2="motifs.xls"):
    with open(f1, 'r') as fin, open(f2, 'wb') as fout:
        # This makes SeqName static and everything else mutable; thus, when more
        # than one motif is searched, they can be correctly placed into Excel.
        writer = csv.writer(fout, delimiter='\t')
        motif_fieldnames = ['SeqName']
        writer_dict = csv.DictWriter(fout, delimiter='\t', fieldnames=motif_fieldnames)
        for i in range(0, len(motif), 1):
            motif_fieldnames.append('M%d' % (i+1))
            motif_fieldnames.append('Hits')
        writer_dict.writeheader()

        # Reading input fasta file for processing.
        fasta_name = []
        for seq_record in SeqIO.parse(f1, 'fasta'):
            sequence = repr(seq_record.seq)  # re module only takes strings
            fasta_name.append(seq_record.name)
            print sequence  # **********
            for j in motif:
                motif_name = j
                print motif_name  # **********
                number_count = len(re.findall(j, sequence))
                print number_count  # **********
                writer.writerow([motif_name])
        for i in fasta_name:
            writer.writerow([i])  # [] makes it fit into one column instead of one character per column
The print statements marked with asterisks ********** generate this, where the number is the number of Hits and the different sequences are Seq1, Seq2, and so on:
Seq('QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQ...LTS', SingleLetterAlphabet())
PA[A-Z]
0
Y[A-Z]L[A-Z]
0
Seq('SFNVATLPAESSSTDLDTTVLLPDEPAEVSDLERIETEWTNMKILELPFAPQMK...VSS', SingleLetterAlphabet())
PA[A-Z]
2
Y[A-Z]L[A-Z]
0
Seq('PAESIYFKIEKTYNLT', SingleLetterAlphabet())
PA[A-Z]
1
Y[A-Z]L[A-Z]
1
You can write your data to a Pandas DataFrame, and then use the DataFrame's to_csv method to export it to a CSV. There is also a to_excel method. Pandas won't let you have multiple columns with the same name, like your "Hits" column. However, you can work around that by putting the column names you want in the first row and using the header=False option when you export.
"import pandas as pd", then replace your code starting at "fasta_name = []" with this:
column_names = ['SeqName']
for i, m in enumerate(motif):
    column_names += ['M'+str(i), 'Hits'+str(i)]
df = pd.DataFrame(columns=column_names)
for row, seq_record in enumerate(SeqIO.parse(f1, 'fasta')):
    df.loc[row, 'SeqName'] = seq_record.name
    sequence = str(seq_record.seq)  # the re module needs a plain string, not the repr()
    for i, j in enumerate(motif):
        df.loc[row, 'M'+str(i)] = j
        df.loc[row, 'Hits'+str(i)] = len(re.findall(j, sequence))
df.to_csv(f2, sep='\t', index=False)
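To reproduce the question's repeated "Hits" header, a sketch of the header=False workaround mentioned above (the output file name and the two-motif header line are assumptions):

# Write the desired header text yourself, then export the data
# without pandas' own header row.
with open('motifs.csv', 'w') as fout:
    fout.write('SeqName\tM1\tHits\tM2\tHits\n')
    df.to_csv(fout, sep='\t', index=False, header=False)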
