Reading in header information from csv file using Pandas - python

I have a data file that has 14 lines of header. In the header, there is the metadata for the latitude-longitude coordinates and time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78

You have to parse your metadata header yourself, but you can do it elegantly in one pass, even on the fly, so that you can extract data out of it, check the correctness of the file, etc.
First, open the file yourself:
f = open(filename)
Then, do the work to parse each metadata line and extract data out of it. For the sake of the explanation, I'm just skipping these rows:
for i in range(13):  # skip the first 13 lines that are useless for the columns definition
    f.readline()  # use the returned string for metadata extraction
Now you have the file pointer ready on the unique header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you start loading your DataFrame right away now:
df = pandas.read_csv(f, sep=",")
Note that I don't use the header argument, as from your description you have only that one last header line that is useful for your dataframe. You can build on this example and adjust the header parsing / rows to skip.
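For instance, here is a minimal sketch of that one-pass approach against the header layout shown in the question (the KEY = VALUE parsing and the two sample data rows are assumptions on my part; in real use you would pass open(filename) instead of the StringIO):

```python
import io
import pandas as pd

# sample file contents, copied from the question (plus two made-up data rows)
raw = """CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
3.0,2,-1.8501,2
4.0,2,-1.8495,2
"""

metadata = {}
with io.StringIO(raw) as f:     # in real use: open(filename)
    f.readline()                # skip the identifier line (CSD,...)
    for _ in range(11):         # the 11 "KEY = VALUE" metadata lines
        key, _, value = f.readline().partition("=")
        metadata[key.strip()] = value.strip()
    # The file pointer now sits on the column-name line, so read_csv
    # uses it as the header; skiprows=[1] drops the units line below it.
    df = pd.read_csv(f, sep=",", skiprows=[1])

print(metadata["LATITUDE"])  # -66.6027
print(list(df.columns))      # ['CTDPRS', 'CTDPRS_FLAG', 'CTDTMP', 'CTDTMP_FLAG']
```

The metadata values come back as strings, so convert with float() where needed (e.g. for LATITUDE and LONGITUDE).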

Although the following method does not use Pandas, I was able to extract the header information.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])

date = header_IO2016[7].split(" ")[2]
time = header_IO2016[8].split(" ")[2]
lat = float(header_IO2016[9].split(" ")[2])
lon = float(header_IO2016[10].split(" ")[4])

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows based on conditions in two of the columns ("Length" and "Protein Name"). How do I specify the condition for this purpose?
import csv  # import the csv module

# open the csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')

# read the csv file
csvreader = csv.reader(file)
header = next(csvreader)
print(header)

# extract rows from the csv file
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
I am a beginner in python bioinformatic data analysis and I haven't tried any extensive methods. I don't how to proceed from here. I have done the work opening and reading the csv file. I have also extracted the column headers. But I don't know how to proceed from here. Please help.
try this (note it operates on a pandas DataFrame, not on a csv.reader object):
df = df[df["columnName"].str.contains("string to delete") == False]
It will be better to read the csv with pandas since you have lots of rows; that is the smart decision to make. Also set the conditional variables you will use to perform the operation. If this does not help, I suggest you provide a sample of your csv file.
import pandas as pd

df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protein name"
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a csv file if you want:
df.to_csv('C:\\Users\\Admin\\Downloads\\new_csv.csv', index=False)

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. As a result, the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas-friendly csv, with a header row and columns of data. One option is to manually process each row, though there may be slicker options.
fname = 'test.dat'
f = open(fname, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
f.close()
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read it as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# create a new csv file and save the cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# create the pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to parse each row of your dataframe using json.loads() (or eval(), though json.loads is safer), something like this:
import json

for value in df["value"]:
    print(json.loads(value)["id"])
    # OR
    print(eval(value)["id"])
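If you'd rather stay in pandas, the same json.loads parsing can be applied to the whole column at once and expanded into proper columns. A sketch, assuming the two-row file from the question (here inlined via StringIO instead of a file path):

```python
import io
import json
import pandas as pd

# the csv file from the question; the doubled quotes are csv escaping,
# which read_csv undoes automatically
raw = '''value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
'''

df = pd.read_csv(io.StringIO(raw))  # single column named "value"

# parse each JSON string into a dict, then expand the dicts into columns
parsed = pd.json_normalize(df["value"].apply(json.loads).tolist())

print(parsed["id"].tolist())        # ['1234', '5678']
print(parsed["currency"].tolist())  # ['USD', 'EUR']
```

After this, parsed["id"] works the way the question wanted df["id"] to work.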

Read CSV with comma delimiter (sorting issue while importing csv)

I am trying to open a csv file by skipping first 5 rows. The data is not getting aligned in dataframe. See screenshot of file
PO = pd.DataFrame()
PO = pd.read_table('acct.csv', sep='\t', skiprows=5, skip_blank_lines=True)
PO
Try sorting it date-wise after import, as below.
First sort your data with a proper import, since it is stuck to the index values (see the data image again). When you have the proper separator / delimiter you can do the following:
do = pd.read_csv('check_test.csv', delimiter='\t', skiprows=range(1, 7), skip_blank_lines=True, encoding="utf8")
d01 = do.iloc[:, 1:7]
d02 = d01.sort_values(['Date', 'Reference', 'Debit'])
This sorts the values into the order you want.

extract specific data from a csv file with specified row and column names

The CSV module of python is pretty new for me and would like to get some help with a specific task.
I am looking to extract data (numeric values) from csv-file-1 based on its row and column names. Secondly, I would like to put this data into another csv file, in a new column, on the line corresponding to the row name's data from csv-file-1.
Here are examples of my two dataframes (csv format, sep = ","):
csv-file-1:
seq_label,id3,id4
id1,0.3,0.2
id2,0.4,0.7
csv-file-2:
seq_label,x1,...
id1,id3,...
id2,id4,...
For example, I would like to select values from csv-file-1, which correspond to the intersection of row names of "seq_label" and "x1" variables in csv-file-2.
Then, I would like to create a new csv-file (csv-file-3) which is the fusion of csv-file-1 and the extracted data from csv-file-1, in this way:
csv-file-3 ("x3" is the new variable or new column with extracted values):
seq_label,x1,...,x3
id1,id3,...,0.3
id2,id4,...,0.7
Could someone give me a hand on this?
Best regards
This is just an example with comments to explain the steps. Hope it'll help you.
import csv

with open("path to file", "r") as f:  # open the file in read mode
    r = csv.reader(f)  # create a csv reader
    content = list(r)  # get the content of the file as a list of rows

column = ["x3", 0.3, 0.7]  # prepare the last column: header, then one value per data row
for row, value in zip(content, column):  # append each value to its row
    row.append(value)

with open("path to file 2", "w", newline='') as f2:  ## Open file 2 in order to write into it
    w = csv.writer(f2)
    w.writerows(content)  ## write the new content
The csv lib will return you a list for each row.
What you want to do is:
- read the first csv and convert it into something you can use (depends on whether you want row- or column-based access)
- do the same for csv2
- for each line of csv1, search for a match in csv2 and add it to your internal data
- write this data to your output file
You might also want to look at
https://pandas.pydata.org/
since it seems like you could save a lot of time using pandas instead of the plain csv methods.
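As a rough sketch of that pandas route (the file contents follow the question's examples; the column names and the lookup logic are my reading of the intent):

```python
import io
import pandas as pd

# csv-file-1: values indexed by seq_label, one column per id
csv1 = """seq_label,id3,id4
id1,0.3,0.2
id2,0.4,0.7
"""
# csv-file-2: for each seq_label, x1 names the column to look up in csv-file-1
csv2 = """seq_label,x1
id1,id3
id2,id4
"""

df1 = pd.read_csv(io.StringIO(csv1), index_col="seq_label")
df2 = pd.read_csv(io.StringIO(csv2))

# pick the csv-file-1 value at (seq_label, x1) for each row of csv-file-2
df2["x3"] = [df1.at[row.seq_label, row.x1] for row in df2.itertuples()]
print(df2["x3"].tolist())  # [0.3, 0.7]

# write the fused result as csv-file-3
df2.to_csv("csv-file-3.csv", index=False)
```

The df1.at[row, col] lookup is exactly the "intersection of row name and column name" described in the question, so the list comprehension builds the new x3 column in one pass.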

How to split a log file into several csv files with python

I'm pretty new to python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several *.csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. If the keyword appears it should start copying the two desired columns into the new file till the keyword appears again. When finished there need to be as many csv files as there are keywords.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When finished, all I get are empty csv files. If I change the counter variable to == 1 it creates just one big file with the desired columns.
Thanks again!
import csv

QUERY = 'MYLOG'
with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter=' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
                    writer.writerow(content_A)
                    writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handler, not a single line.
the reader delimiter should not be a single space character as it looks like the actual delimiter in your logs is a variable number of multiple space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'

def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (e.g. a list of lists).
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)

current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if NEW_LOG_DELIMITER not in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop;
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set Flag where third column == 'MYLOG'
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file:
for i in range(1, df.flag.max() + 1):
    df.loc[df.flag == i, 'event'].to_csv('run{0}.csv'.format(i))
EDIT:
Looks like your format is different than I originally assumed. Changed to use pd.read_fwf. my localizer.log file was a copy and paste of your original data, hope this works for you. I assumed by the original post that it did not have headers. If it does have headers then remove header=None and df.columns = ['onset', 'type', 'event'].
