Using Pandas to read data and skip metadata - python

Background
I have data files which consist of two parts: data in CSV format, and Metadata. I can use the method given here 1 and here 2 to manually skip the Metadata portion by specifying the location/line number of the beginning of the Metadata.
Following is the sample of the data file:
Here, you can see that I can specify the line number (420) manually and use the following code to skip the Metadata:
with open('data.csv', 'r') as f:
    metadata_location = [i for i, x in enumerate(f.readlines()) if 'Metadata' in x]
with open('data.csv', 'r') as f:
    flat_data = pd.read_csv(f, index_col=False, skiprows=lambda x: x >= metadata_location[0])
with open('data.csv') as f:
    df = pd.read_csv(f, index_col=False)
    df = df[:420]
Question
How can I scan the file to find where the Metadata starts and then skip it while reading? (I need to process many such files, so I want to do this in code rather than by hand.)

IIUC, you can pass a callable to the skiprows argument. It is evaluated against each row index and should return True if the row should be skipped and False otherwise. Use:
df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= 420)
UPDATE: To find the metadata location:
import re
md_loc = 0
with open("data.csv") as f:
    for idx, line in enumerate(f):
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx
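The two steps above (locate the marker, then skip from there) can be combined into one reusable helper for many files. This is only a sketch under assumptions: the function names find_metadata_start and read_without_metadata are made up for illustration, and the marker is matched as a plain substring rather than with the regular expression above.

```python
import io


def find_metadata_start(handle, marker="Metadata"):
    """Return the 0-based index of the first line containing marker, or None."""
    for idx, line in enumerate(handle):
        if marker in line:
            return idx
    return None


def read_without_metadata(path, marker="Metadata"):
    """Read only the CSV part of the file at path into a DataFrame."""
    import pandas as pd  # imported here so the helper above works without pandas

    with open(path) as f:
        start = find_metadata_start(f, marker)
    if start is None:
        # No metadata block found: read the whole file
        return pd.read_csv(path, index_col=False)
    # Skip the marker line and everything after it
    return pd.read_csv(path, index_col=False, skiprows=lambda x: x >= start)


# Small in-memory check of the scanning step (hypothetical sample data)
sample = io.StringIO('a,b\n1,2\n3,4\n"Metadata:"\nsite,X\n')
print(find_metadata_start(sample))  # 3
```

For several files, calling read_without_metadata(path) in a loop over the file names should be enough, since the marker position is recomputed for each file.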

Your question is not clear.
If I understood you correctly, you are looking for a way to scan all the lines and run the above code on each?
EDIT 1:
for index, row in All_Patients_Chosen_Visit.iterrows():
    df = row[:420]
See the code above and check whether it works.

Related

How can I open csv files and read them and sort them based on the data inside it?

So I'm trying to find out how to open CSV files and sort all the rows in them.
An example of the data contained in a CSV file is:
2,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,668d39,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
3,622r49,arqek,doctor,phone2,9759544365415694736,in,53.001.135.54,weqlhrerreuert6f
I'm trying to write a function sortCSV(File) that opens the CSV file and sorts it based on the very first number in each row (1, 2, 3, ...),
so the output should be
1,668d39,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
3,622r49,arqek,doctor,phone2,9759544365415694736,in,53.001.135.54,weqlhrerreuert6f
Here is my code so far, which clearly doesn't work:
import csv
def CSV2List(csvFilename: str):
    f = open(csvFilename)
    q = list(f)
    return q.sort()
What changes should I make to my code to make sure my code works??
Using pandas, set the first column as the index and use sort_index to sort on it:
import pandas as pd
file_path = '/data.csv'
df = pd.read_csv(file_path,header=None,index_col=0)
df = df.sort_index()
print(df)
There are a number of ways you could handle this, but one of the easiest would be to install pandas (https://pandas.pydata.org/).
First off, you will most likely need titles for each column, which should be on the first row of your CSV file. Once you've added the column titles and installed pandas:
With pandas:
import pandas as pd
dataframe = pd.read_csv(filepath, index_col=0)
dataframe = dataframe.sort_index()
This sets the first column as the index column and then sorts on the index.
Another way I've handled CSVs with difficult formatting (e.g. exported from Excel) is to read the file as a regular file and then iterate over the rows to handle them on my own.
final_data = []
with open(filepath, "r") as f:
    for row in f:
        # Split the row (dropping the trailing newline first)
        row_data = row.rstrip("\n").split(",")
        # Add to final data array
        final_data.append(row_data)
# Sort on the first column; values are compared as strings here,
# use key=lambda row: int(row[0]) for numeric order if there is no header row
final_data.sort(key=lambda row: row[0])
# final_data is now a sorted list of the rows of your CSV
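Try csv.reader(f) and turn the reader into a list before sorting:
import csv
def CSV2List(csvFilename: str):
    with open(csvFilename, newline='') as f:
        q = list(csv.reader(f))
    # list.sort() sorts in place and returns None, so return the list itself
    q.sort(key=lambda x: int(x[0]))
    return q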
Using the csv module:
import csv
def csv_to_list(filename: str):
    # use a context manager here
    with open(filename) as fh:
        reader = csv.reader(fh)
        # convert the first item to an int for sorting
        rows = [[int(num), *row] for num, *row in reader]
        # sort the rows based on that value
        return sorted(rows, key=lambda row: row[0])
This is not the best way to deal with CSV files but:
def CSV2List(csvFilename: str):
    f = open(csvFilename,'r')
    l = []
    for line in f:
        l.append(line.split(','))
    for item in l:
        item[0] = int(item[0])
    return sorted(l)
print(CSV2List('data.csv'))
However, I would probably use pandas instead; it is a great module.
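Two details behind the answers above are easy to miss: list.sort() sorts in place and returns None (which is why the question's return q.sort() gives nothing back), and sorting the first field as text puts "10" before "2". The following sketch uses made-up rows to show both effects.

```python
# Hypothetical rows, already split into fields as csv.reader would return them
rows = [["10", "x"], ["2", "y"], ["1", "z"]]

# list.sort() works in place; its return value is always None
in_place = list(rows)
result = in_place.sort(key=lambda row: int(row[0]))
print(result)  # None

# sorted() returns a new list, so it is safe to return directly
as_text = sorted(rows, key=lambda row: row[0])          # string comparison
as_numbers = sorted(rows, key=lambda row: int(row[0]))  # numeric comparison
print([r[0] for r in as_text])     # ['1', '10', '2']
print([r[0] for r in as_numbers])  # ['1', '2', '10']
```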

Python read csv file columns into lists, ignoring headers

I have a file 'data.csv' that looks something like
ColA, ColB, ColC
1,2,3
4,5,6
7,8,9
I want to open and read the file columns into lists, with the 1st entry of that list omitted, e.g.
dataA = [1,4,7]
dataB = [2,5,8]
dataC = [3,6,9]
In reality there are more than 3 columns and the lists are very long, this is just an example of the format. I've tried:
csv_file = open('data.csv','rb')
csv_array = []
for row in csv.reader(csv_file, delimiter=','):
    csv_array.append(row)
Where I would then allocate each index of csv_array to a list, e.g.
dataA = [int(i) for i in csv_array[0]]
But I'm getting errors:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Also it feels like a very long winded way of just saving data to a few lists...
Thanks!
edit:
Here is how I solved it:
import pandas as pd
df = pd.read_csv('data.csv', names=['ColA', 'ColB', 'ColC'])
# [1:] drops the header row, which is read as data because names= is given
dataA = list(map(int, df.ColA.tolist()[1:]))
and repeat for the rest of the columns.
Just to spell this out for people trying to solve a similar problem, perhaps without Pandas, here's a simple refactoring with comments.
import csv
# Open the file in 'r' mode, not 'rb'
csv_file = open('data.csv','r')
dataA = []
dataB = []
dataC = []
# Read off and discard first line, to skip headers
csv_file.readline()
# Split columns while reading
for a, b, c in csv.reader(csv_file, delimiter=','):
    # Append each variable to a separate list
    dataA.append(a)
    dataB.append(b)
    dataC.append(c)
This does nothing to convert the individual fields to numbers (use append(int(a)) etc if you want that) but should hopefully be explicit and flexible enough to show you how to adapt this to new requirements.
Use pandas:
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
df = pd.read_csv(path)
rows = df.apply(lambda x: x.tolist(), axis=1)
To skip the header, create your reader on a separate line. Then to convert from a list of rows to a list of columns, use zip():
import csv
with open('data.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    data = list(zip(*[map(int, row) for row in csv_input]))
print(data)
Giving you:
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
So if needed:
dataA = data[0]
It seems like you have OSX line endings in your csv file. Try saving the csv file in "Windows Comma Separated (.csv)" format.
There are also easier ways to do what you're doing with the csv reader:
csv_array = []
with open('data.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    # remove headers
    next(reader)
    # loop over rows in the file, append them to your array. each row is already formatted as a list.
    for row in reader:
        csv_array.append(row)
Each entry of csv_array is one row, so csv_array[0] is the first data row; for the first column you can set dataA = [row[0] for row in csv_array]
First, if you read the csv file with csv.reader(csv_file, delimiter=','), you will still read the header.
csv_array[0] will be the header row -> ['ColA', ' ColB', ' ColC']
Also, if you're using a Mac, this issue is already covered here: CSV new-line character seen in unquoted field error
And I would recommend using pandas and numpy instead if you will do more analysis on the data. pandas reads the csv file into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Use csv.DictReader() to select specific columns:
import csv
dataA = []
dataB = []
with open('data.csv', 'r') as csv_file:
    # skipinitialspace strips the space after each comma in "ColA, ColB, ColC"
    csv_reader = csv.DictReader(csv_file, delimiter=',', skipinitialspace=True)
    for row in csv_reader:
        dataA.append(row['ColA'])
        dataB.append(row['ColB'])
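Since the question mentions that the real file has many more than three columns, the DictReader approach can be generalised so that no column name is hardcoded. This is a rough sketch: read_columns is a made-up helper name, the sample text mirrors the question's data.csv, and every value is assumed to be an integer.

```python
import csv
import io


def read_columns(handle):
    """Return {column name: list of int values} for an open CSV file."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    names = reader.fieldnames or []  # fieldnames is None for an empty file
    columns = {name: [] for name in names}
    for row in reader:
        for name in names:
            columns[name].append(int(row[name]))
    return columns


# In-memory stand-in for the question's data.csv
sample = io.StringIO("ColA, ColB, ColC\n1,2,3\n4,5,6\n7,8,9\n")
columns = read_columns(sample)
print(columns["ColA"])  # [1, 4, 7]
print(columns["ColC"])  # [3, 6, 9]
```

With a real file, the same function can be called as read_columns(open('data.csv', newline='')), ideally inside a with block.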

Combining two scripts into one code for csv file data verification

Hello everyone, I currently have two scripts that I would like to combine into one. The first script finds missing time stamps in a set of data, fills in a blank row with NaN values for each, and then saves to an output file. The second script compares different rows in a set of data and creates a new column with True/False values based on the test condition.
If I run each script as a function then call both with another function I would get two separate output files. How can I make this run with only 1 saved output file?
First Code
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
Second Code
with open('data5.csv', 'r') as f:
    rows = [row.split(',') for row in f]
    rows = [[cell.strip() for cell in row if cell] for row in rows]
def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])
header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))
with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')
Let's wrap each script in a function that takes the file names:
def first_code(file_in, file_out):
    df = pd.read_csv(file_in, ... )
    ...
    df.to_csv(file_out, ...)
def second_code(file_in, file_out):
    with open(file_in, 'r') as f:
        ...
    ....
    with open(file_out, 'w') as f:
        ...
Your solution can then be:
first_code('data5.csv', 'output.csv')
second_code('output.csv', 'output.csv')
Hope it helps.
Note that there is no problem with reading from and writing to the same file, as long as the file has been closed after reading so that there are no side effects. Using with does this implicitly, which is good practice.
In the second script, change the input file from data5.csv to output.csv, so that it reads what the first script wrote. If you keep them as separate files (file1.py and file2.py), make sure they are in the same directory. The combined code in a single file is as follows:
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
with open('output.csv', 'r') as f:
    rows = [row.split(',') for row in f]
    rows = [[cell.strip() for cell in row if cell] for row in rows]
def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])
header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))
with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using DictReader and this gives out the correct details:
with open('details.csv') as csvfile:
    i=["name","age","sex"]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But what I want is for the list of columns ("i" in the above case) to be parsed automatically from the input csv, rather than hardcoding it in a list.
with open('details.csv') as csvfile:
    rows=iter(csv.reader(csvfile)).next()
    header=rows[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives the error
KeyError: 'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    #get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    #you can eat the first line before creating DictReader.
    #if no "fieldnames" param is passed into
    #DictReader object upon creation, DictReader
    #will read the upper-most line as the headers
    f.readline()
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
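The same mechanism explains the KeyError in the question: the header line was already consumed by the plain csv.reader, so DictReader took the first data row as its field names. The sketch below reproduces this with an in-memory file; the sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical file content with the question's columns
SAMPLE = "id,name,age,sex\n1,Ann,30,F\n2,Bob,25,M\n"

# Reproduce the problem: the header is consumed before DictReader starts
handle = io.StringIO(SAMPLE)
next(csv.reader(handle))          # reads and discards "id,name,age,sex"
broken = csv.DictReader(handle)
print(broken.fieldnames)          # ['1', 'Ann', '30', 'F'], so row['name'] fails

# Fix: let DictReader read the header itself and ask it for the names
handle = io.StringIO(SAMPLE)
reader = csv.DictReader(handle)
header = reader.fieldnames[1:]    # drop "id", as in the question
names = [row["name"] for row in reader]
print(header)                     # ['name', 'age', 'sex']
print(names)                      # ['Ann', 'Bob']
```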
You can read the header using next(), which returns the next row of the reader's iterable object as a list. Then you can add the rest of the file's content to a list.
import csv
with open("C:/path/to/.filecsv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i has the column names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. Instead, use the built-in next() to get the first line of the csv immediately after creating the reader, and open the file in text mode rather than "rb":
import csv
with open("C:/path/to/.filecsv", "r", newline="") as f:
    reader = csv.reader(f)
    i = next(reader)
    print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline().rstrip('\n')  # read only the first line; returns a string
    header_list = header.split(',')  # returns a list
I am assuming your input file is in CSV format.
With pandas this takes more time if the file is big, because pandas loads the entire data set.
I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))
Thanks to Daniel Jimenez for his perfect solution for fetching the column names alone from my csv. I extended his solution to use DictReader, so we can iterate over the rows using the column names as indexes. Thanks, Jimenez.
import csv
with open('myfile.csv', newline='') as f:
    reader = csv.reader(f)
    i = next(reader)
    i = i[1:]
with open('myfile.csv', newline='') as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print(row[x])
Here is the code to print only the headers (column names) of the csv file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
I literally just wanted the first row of my data which are the headers I need and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
            t += 1
Using pandas is also an option.
But instead of loading the full file into memory, you can retrieve only its first chunk to get the field names, by using an iterator.
import pandas as pd
file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)

read and concatenate 3,000 files into a pandas data frame starting at a specific value, python

I have 3,000 .dat files that I am reading and concatenating into one pandas dataframe. They have the same format (4 columns, no header), except that some of them have a description at the beginning of the file while others don't. In order to concatenate those files, I need to get rid of those first rows before I concatenate them. The skiprows option of pandas.read_csv() doesn't apply here, because the number of rows to skip varies from one file to another (by the way, I use pandas.read_csv() and not pandas.read_table() because the fields are separated by a comma).
However, the first value after the rows I am trying to omit is identical for all 3,000 files. This value is "2004", which is the first data point of my dataset.
Is there an equivalent to skiprows where I could say something like "start reading the file at "2004" and skip everything before that" (for each of the 3,000 files)?
I am really out of luck at this point and would appreciate some help.
Thank you!
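You could just loop through the lines and skip everything before the first line that starts with 2004.
Something like ...
started = False
with open(filename) as f:  # filename: one of the .dat files
    for line in f:
        if not started and not line.startswith('2004'):
            continue
        started = True
        # whatever else you need here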
Probably not worth trying to be clever here; if you have a handy criterion, you might as well use it to figure out what skiprows is, i.e. something like
import pandas as pd
import csv
def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            if row[0] == "2004":
                return i
for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
Use a skip_to() function:
import pandas as pd
def skip_to(f, text):
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False
        if line.startswith(text):
            f.seek(last_pos)
            return True
with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)
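The "drop everything before the first 2004 row" idea from the answers above can also be written with itertools.dropwhile from the standard library. This is a sketch rather than a tested solution for the real files: rows_from_marker is a made-up name, the two in-memory files stand in for the .dat files, and blank lines in the description are assumed to be safe to drop.

```python
import csv
import io
from itertools import dropwhile


def rows_from_marker(handle, marker="2004"):
    """Return the parsed rows starting at the first row whose first field is marker."""
    reader = csv.reader(handle)
    # Drop blank rows and description rows until the marker row is reached
    return list(dropwhile(lambda row: not row or row[0] != marker, reader))


# Two hypothetical files: one with a leading description, one without
with_description = io.StringIO(
    "Station notes\nrecorded by hand\n\n2004,1,2,3\n2005,4,5,6\n"
)
without_description = io.StringIO("2004,7,8,9\n2005,1,1,1\n")

all_rows = []
for handle in (with_description, without_description):
    all_rows.extend(rows_from_marker(handle))
print(len(all_rows))  # 4
print(all_rows[0])    # ['2004', '1', '2', '3']
```

The collected rows are lists of strings; passing all_rows to pd.DataFrame and converting the column types afterwards would be one way to get back to a single dataframe.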
