Extracting data from a file read in with Pandas - python

I believe that I have successfully read in my files with a "for loop" as shown in the code below.
import pandas as pd
import numpy as np
import glob

filename = glob.glob('1511**.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows=33, sep='\s+')
    frames.append(f_nov15_hereford)
data_nov15_hereford = pd.concat(frames)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)
My problem now is that I want to take out some information from the files. Specifically, I want the 80 m wind speed from the files. When I read in just one file, instead of looping over multiple files, the code works like I need it to by simply doing this:
height = data_nov15_hereford['#']
wspd = data_nov15_hereford["z"]
hub = np.where(height==80)
print(hub)
hub_wspd = wspd[5:4582:32]
hub_wspd is the 80 m wind speed that I am interested in. And I get the index numbers 5:4582 by printing out hub. And then all I have to do is skip every 32 rows to continue to pull out the 80 m wind speed from the file. However, now that I have read in multiple files (that all look the same and have the same layout as this one file) I can't seem to pull out the 80 m wind speed the same way. Basically, I print out hub and get the indices 5:65418 and then I skip every 32 rows but when I print out the tail end of the hub_wspd it doesn't match the file so I must be doing something wrong. Any ideas why it isn't working with multiple files but worked with the single file? I can also attach a copy of the single data file if that would help. Thanks!
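One way to avoid relying on row positions altogether (a sketch, using the '#' and 'z' column names from the code above, with '#' assumed to hold the measurement height and 'z' the wind speed): after pd.concat the row index repeats for every file, so a fixed slice like [5:4582:32] no longer lines up, whereas a boolean mask selects the 80 m rows no matter how many files were concatenated.
# keep only the rows whose height column equals 80 m, then take the wind-speed column
hub_wspd = data_nov15_hereford.loc[data_nov15_hereford['#'] == 80, 'z']
print(hub_wspd.tail())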

Related

Code works on individual files, but data gets jumbled when looping through directory

I have one main dataframe that contains data for ~25 people with 10 trials per person. I also have individual files for each trial by participant. My goal is to have one file per participant that contains all 10 trials with data from both the main dataframe and the individual files.
I am matching the data in the main dataframe and the files in the directory by filename (the filename contains both the participant ID and the trial number- ex: 90-9.csv).
Example of the data:
# Main df:
ID trial file length
90 9 90-9.csv 56
90 10 90-10.csv 44
91 1 91-1.csv 62
91 2 91-2.csv 48
# Individual files in directory- these files contain diameter data:
90-9.csv
90-10.csv
91-1.csv
91-2.csv
# intended output:
ID trial file length diameter
.. .. .. .. ..
90 8 90-8.csv 62 3.15
90 9 90-9.csv 56 3.17
90 10 90-10.csv 44 3.14
I have tried the following for looping through the directory:
directory = os.chdir(r'filepath')
# create list of files
dir_list = os.listdir(directory)
for file in dir_list:
    df = pd.read_csv('mainDF.csv')
    # create filename column in main df
    df['ID'] = df['ID'].astype(str)
    df['trial'] = df['trial'].astype(str)
    df['file'] = df['ID']+'-'+df['trial']+'.csv'
    # this doesn't work
    for file in zoom['filename']:
        pupil = pd.read_csv(file)
    # this one doesn't organize the data properly
    if ([x in file for x in zoom['filename']]):
        pupil = pd.read_csv(file)
When I isolate one participant and one trial number, the data is organized the way I show in the intended output. When I loop through the directory, everything becomes out of order. I'm not sure what's going on.
os.listdir will list the files (and folders) in the directory given to it. However, if you want to then access one of those files, you need to give the full path, i.e. directory+"/"+file. In addition, you're reloading the mainDF.csv dataframe on each iteration. I don't think you actually want to do that; you probably just want to do it once before you start your for loop. You also seem to be trying to iterate over your files twice. I have no idea what your zoom variable even is, let alone what you hope to achieve by looping over it.
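If you do stay with os.listdir, a minimal sketch of building that full path with os.path.join (the r'filepath' placeholder is the one from the question):
import os
import pandas as pd

directory = r'filepath'
for file in os.listdir(directory):
    # os.path.join handles the separator for you instead of manual string concatenation
    pupil = pd.read_csv(os.path.join(directory, file))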
What I would advise is that you rewrite this code using glob instead of listdir. That should be safer, since it only selects the specific files you want and doesn't require any path manipulation on your part. So something like this:
import glob
import pandas as pd

df = pd.read_csv('mainDF.csv')
df['ID'] = df['ID'].astype(str)
df['trial'] = df['trial'].astype(str)
df['file'] = df['ID']+'-'+df['trial']+'.csv'
for path in glob.glob("path/to/files/*.csv"):
    pupil_df = pd.read_csv(path)
    # do what you want with pupil_df here
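To get to the intended output above (one diameter per row of the main dataframe), a hedged sketch of one way to finish that loop — the 'diameter' column name inside the individual files and taking the mean per trial are assumptions, so adjust them to your data:
import os

rows = []
for path in glob.glob("path/to/files/*.csv"):
    pupil_df = pd.read_csv(path)
    # match on the bare filename, e.g. "90-9.csv", which is what df['file'] contains
    rows.append({'file': os.path.basename(path),
                 'diameter': pupil_df['diameter'].mean()})

diameters = pd.DataFrame(rows)
merged = df.merge(diameters, on='file', how='left')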

Data from multiple sensors saved to txt file imported to pandas

Good day everyone.
I was hoping someone here could help me with a bit of a problem. I've run an experiment where data was gathered from 6 separate sensors simultaneously. The data was then exported to a single shared txt file. Now I need to import the data into python to analyze it.
I know I can do this by taking the lines and simply copy-pasting the output of each sensor into a separate document, and then importing those in a loop, but that is a lot of work and brings a high potential for human error.
But is there a way of using readline to read only specific lines and port them into a pandas DataFrame? There is a fixed header spacing, and a fixed line spacing between each sensor.
I tried:
import pandas as pd

f = open('OR0024622_auto3200.txt')
lines = f.readlines()
base = 83
sensorlines = 6400
Sensor = lines[base:sensorlines+base]
df_sens = pd.DataFrame(Sensor)
df_sens
but the output isn't very useful:
[snip of the output]
Here's the file I am importing:
link.
Any suggestions?
Looks like tab-separated data.
Use:
>>> df = pd.read_csv('OR0024622_auto3200.txt', delimiter=r'\t', skiprows=83, header=None, nrows=38955-84)
>>> df.tail()
0 1 2
38686 6397 3.1980000000e+003 9.28819e-009
38687 6398 3.1985000000e+003 9.41507e-009
38688 6399 3.1990000000e+003 1.11703e-008
38689 6400 3.1995000000e+003 9.64276e-009
38690 6401 3.2000000000e+003 8.92203e-009
>>> df.head()
0 1 2
0 1 0.0000000000e+000 6.62579e+000
1 2 5.0000000000e-001 3.31289e+000
2 3 1.0000000000e+000 2.62362e-011
3 4 1.5000000000e+000 1.51130e-011
4 5 2.0000000000e+000 8.35723e-012
abhilb's answer is to the point and correct, but there is a lot to be said regarding loading/reading files. A quick browser search will take you a long way (I encourage you to read up on this!), but I'll add a few details here:
If you want to load multiple files that match a pattern you can do so iteratively via glob:
import pandas as pd
from glob import glob as gg

filePattern = "/path/to/file/*.txt"
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
This will load each file one-by-one. What if you want to put all data into a single dataframe? Do this:
masterDF = pd.DataFrame()
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
    masterDF = pd.concat([masterDF, df], axis=0)
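Concatenating inside the loop works, but it copies the growing masterDF on every iteration. A sketch of the usual alternative, collecting the pieces in a list and concatenating once at the end (same assumed filePattern as above):
import pandas as pd
from glob import glob as gg

filePattern = "/path/to/file/*.txt"
frames = []
for fileName in gg(filePattern):
    # read each matching file and keep the resulting dataframe in a list
    frames.append(pd.read_csv(fileName, delimiter='\t'))

# one concat at the end avoids repeatedly copying the accumulated data
masterDF = pd.concat(frames, ignore_index=True)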
This works great for pandas, but what if you want to read into a numpy array?
import numpy as np
# using previous imports
base = 83
sensorlines = 6400
# create an empty array that has three columns
masterArray = np.full((0, 3), np.nan)
for fileName in gg(filePattern):
    # open the file (NOTE: this does not read the file, just puts it in a buffer)
    with open(fileName, "r") as tmp:
        # now read the file and split each line by the carriage return (could be "\r\n")
        # you now have a list of strings
        data = tmp.read().split("\n")
    # keep only the "data" portion of the file
    data = data[base:sensorlines + base]
    # convert list of strings to an array of floats
    # here, I use a "list comprehension" for speed and simplicity
    data = np.array([r.split("\t") for r in data]).astype(float)
    # stack your new data onto your master array
    masterArray = np.vstack([masterArray, data])
Opening a file via the "with open(fileName, "r")" syntax is handy because Python automatically closes the file when you are done. If you don't use "with" then you must manually close the file (e.g. tmp.close()).
These are just some starting points to get you on your way. Feel free to ask for clarification.

How to extract multiple columns from a space delimited .DAT file in python

I'm quite new to coding and don't have a proper education on the subject (most of my experience has been just stumbling through google searches) and I have a task that I would like assistance with.
I have 38 files which look something like this:
NGANo: 000a16d_1
Zeta: 0.050000
Ds5-95: 5.290000
Comments:
Period, SD, SV, SA
0.010000 0.000433 0.013167 170.812839
0.020000 0.001749 0.071471 172.720229
0.030000 0.004014 0.187542 176.055129
0.040000 0.007631 0.468785 189.322248
0.050000 0.012815 0.912067 203.359441
0.060000 0.019246 1.556853 210.602517
0.070000 0.025400 1.571091 206.360018
They're all .DAT files containing four columns of data (Period, SD, SV, SA) that are single-space delimited in each row; additionally, there are two spaces at the end of each line of data.
The only important data for me is the SA data, and I'd like to take the SA data and the title (this particular example being 000a16d_1) from each of these 38 files and put them all on the same sheet of an excel spreadsheet (one column after the next) with just the title followed by the SA data.
I've tried a few different things, but I'm stuck on how to separate the rows of data from one column into 4. I'm not too knowledgeable about whether I should use numpy or pandas. I know that everything up to the second-to-last line is correct, as print(table) does print the rows of data; I just don't understand how to separate the single column into multiple columns. Here is my current code; all assistance is appreciated.
import pandas as pd
import numpy as np
import os
import xlsxwriter
#
path = "C:/Users/amihi/Downloads/Plotter_Output"
dirs = os.listdir(path)
#
#
for file in dirs:
    table = pd.read_table(file, skiprows=4)
    SA = table.loc[:, "SA"]
    print(SA)
You could also do this without using pandas if you wanted. The code below deals only with the table section of the file, but won't deal with the info at the top.
finalColumns = []
for file in dirs:
    # start a fresh set of column lists for each file
    columns = []
    with open(file, "r") as f:
        for l in f:
            line = l.strip("\n")
            splitted = line.split()
            if len(splitted) > len(columns):
                for i in range(len(splitted)):
                    columns.append([])
            counter = 0
            for item in splitted:
                columns[counter].append(item)
                counter += 1
    # the fourth column holds the SA values
    finalColumns.append(columns[3])
When adding to your other file, simply loop through finalColumns; each item is a list that should become a new column in your file.
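For the full goal stated in the question (the title plus the SA data from each of the 38 files, side by side in one Excel sheet), here is a minimal pandas sketch. It assumes every file follows the layout shown above (title on the first line, data after five header lines); the output name SA_output.xlsx is just a placeholder, and to_excel needs an Excel writer such as the xlsxwriter already imported in the question.
import os
import pandas as pd

path = "C:/Users/amihi/Downloads/Plotter_Output"
collected = {}
for file in os.listdir(path):
    full_path = os.path.join(path, file)
    # first line looks like "NGANo: 000a16d_1"; use the part after the colon as the title
    with open(full_path) as f:
        title = f.readline().split(":", 1)[1].strip()
    # the table starts after 5 header lines and is space delimited
    table = pd.read_csv(full_path, skiprows=5, sep=r"\s+",
                        names=["Period", "SD", "SV", "SA"])
    collected[title] = table["SA"]

# one column per file, headed by its title
pd.DataFrame(collected).to_excel("SA_output.xlsx", index=False)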

Sequentially read huge CSV file in python

I have a 10gb CSV file that contains some information that I need to use.
As I have limited memory on my PC, I cannot read the whole file into memory in a single batch. Instead, I would like to iteratively read only some rows of this file.
Say that at the first iteration I want to read the first 100 rows, at the second rows 101 to 200, and so on.
Is there an efficient way to perform this task in Python?
Does Pandas provide something useful for this? Or are there better (in terms of memory and speed) methods?
Here is the short answer.
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
Here is the very long answer.
To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
import pandas as pd
from sqlalchemy import create_engine
Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.
print(pd.read_csv(file, nrows=5))
This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
Before we can actually work with the data, we need to do something with it so we can begin to filter it and work with subsets. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the csv file in chunks and then write those chunks to SQLite.
To do this, we'll first need to create the SQLite database using the following command.
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data into SQLite.
chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1), and then running a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk in the SQLite database (df.to_sql(…)).
This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas ‘sql’ tools to pull data from the database without worrying about memory constraints.
To access the data now, you can run commands like the following:
df = pd.read_sql_query('SELECT * FROM table', csv_database)
Of course, using ‘select *…’ will load all the data into memory, which is the problem we are trying to get away from, so you should add filters to your select statements to restrict the data. For example:
df = pd.read_sql_query('SELECT COL1, COL2 FROM table WHERE COL1 = SOMEVALUE', csv_database)
You can use pandas.read_csv() with the chunksize parameter:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunk_df contains a part of the whole CSV
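If you specifically want the "first 100 rows, then rows 101 to 200" pattern from the question without keeping an iterator around, a small sketch using skiprows with a range (the file name is a placeholder):
import pandas as pd

n = 100      # rows per batch
batch = 1    # 0-based batch number; batch 1 gives rows 101 to 200
df = pd.read_csv('yourfile.csv',
                 skiprows=range(1, batch * n + 1),  # skip earlier data rows, keep the header row
                 nrows=n)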
This code may help you with this task. It navigates through a large .csv file and does not consume a lot of memory, so you can run it on a standard laptop.
import pandas as pd
import os
The chunksize here sets the number of rows of the csv file to read per chunk:
chunksize2 = 2000
path = './'
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2
start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2*start_chunk)
headers = []
for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)
        # Access chunks within data
        # for chunk in data:
        # You can now export all outcomes in new csv files
        file_name = 'export_csv_' + str(start_chunk+i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        # actually write this chunk out to its own csv file
        df2.to_csv(save_path, index=False)
    except Exception:
        print('reach the end')
        break
Transferring a huge CSV into a database is a good method because we can then easily use SQL queries.
We also have to take two things into account.
FIRST POINT: SQL is not elastic either; it will not be able to stretch the memory.
For example, this file converted to a db file:
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this db file, the SQL query:
pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)
It can read at most about 0.6 million records, no more, with 16 GB of PC RAM (operation time: 15.8 seconds).
It is worth adding that reading directly from the csv file is a bit more efficient:
giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)
(operation time: 16.5 seconds)
SECOND POINT: To use the SQL data series converted from CSV effectively, we ought to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:
df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])
The full code for the 311 file mentioned above:
import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
#----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
## --------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    ## -----------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    ## --------------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1
print(time.time() - start_time)
Finally, I would like to add: converting a csv file directly from the Internet to a db seems like a bad idea to me. I propose downloading the file first and converting it locally.

pandas HDF select does not recognise column name

I'm trying to process a large (2 GB) csv file on a machine with only 4 GB of RAM (don't ask) to produce a different, formatted csv containing a subset of data that needs some processing. I'm reading the file and creating an HDFStore that I query later for the data that I require for output. Everything works except that I can't retrieve data from the store using Term; the error message comes back saying that PLOT is not a column name. Individual variables look fine and the store is what I expect; I just can't see where the error is. (NB: pandas v14 and numpy 1.9.0.) Very new to this, so apologies for the clunky code.
#wibble wobble -*- coding: utf-8 -*-
# short version
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    Location = r"CL_short.csv"
    store = pd.HDFStore('blarg.h5')
    maxlines = sum(1 for line in open(Location))
    print(maxlines)
    #set chunk small for test file
    chunky = 4
    plotty = pd.DataFrame(columns=['PLOT'])
    dfdum = pd.DataFrame(columns=['PLOT', 'mDate', 'D100'])
    #read file in chunks to avoid RAM blowing up
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)
    #retrieve plot numbers and select unique items
    plotty = store.select('wibble', "columns = ['PLOT']")
    plotty.drop_duplicates(inplace=True)
    #iterate through unique plots to retrieve data and put in dataframe for output
    for index, row in plotty.iterrows():
        dfdum = store.select('wibble', [Term('PLOT', '=', plotty.iloc[index]['PLOT'])])
        #process dfdum for output to new csv
    print("successful completion")

filesport()
Final listing for those who wish to fight through the tumbleweed to reach here and are similarly bemused by processing large .csv files and the various methods of trying to retrieve/process the data. The biggest problem was getting the syntax of the pytables Term right. Despite several examples indicating that it was possible to use 'A > 20' etc., this never worked for me. I set up a string condition containing the Term query and this worked (it is in the documentation TBF).
Also found it easier to query the HDF to retrieve the unique items directly from the store as a list, which could then be sorted and iterated through to retrieve the data plot by plot. Note that I wanted the final csv file to have the plot and then all the D100 data in date order, hence the pivot at the end.
Reading the csv file in chunks meant that each plot retrieved from the store had a header, and this got written to the final csv, which messed things up. I'm sure there's a more elegant way of only writing one header than the one I've shown here (see the short note after the listing).
It works, and takes about 2 hours to process the data and produce the final csv file (initial file 2 GB, 30+ million lines, data for 100,000+ unique plots; the machine has 4 GB of RAM but is running 32-bit, which means that only 2.5 GB of RAM was available).
Good luck if you have a similar problem, and I hope you find this useful
#wibble wobble -*- coding: utf-8 -*-
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    print(pd.__version__)
    print(np.__version__)
    Location = r"conliq_med.csv"
    store = pd.HDFStore('blarg.h5')
    maxlines = sum(1 for line in open(Location))
    print(maxlines)
    chunky = 100000
    #read file in chunks to avoid RAM blowing up, select only needed columns
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)
    #retrieve unique plots and sort
    plotty = store.select_column('wibble', 'PLOT').unique()
    plotty.sort()
    #set flag for writing file header
    i = 0
    #iterate through unique plots to retrieve data and put in dataframe for output
    for item in plotty:
        condition = 'PLOT =' + str(item)
        dfdum = store.select('wibble', [Term(condition)])
        dfdum["mDate"] = pd.to_datetime(dfdum["mDate"], dayfirst=True)
        dfdum.sort(columns=["PLOT", "mDate"], inplace=True)
        dfdum["mDate"] = dfdum["mDate"].map(lambda x: x.strftime("%Y - %m"))
        dfdum = dfdum.pivot("PLOT", "mDate", "D100")
        #only print one header to file
        if i == 0:
            dfdum.to_csv("CL_OP.csv", mode='a')
            i = 1
        else:
            dfdum.to_csv("CL_OP.csv", mode='a', header=False)
    print("successful completion")

filesport()
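On the "more elegant way of only writing one header" point above, the usual trick is to pass the flag straight to the header argument so a single to_csv call covers both cases. A tiny self-contained sketch (the demo filename and data are made up):
import pandas as pd

frames = [pd.DataFrame({"PLOT": [n], "D100": [n * 1.5]}) for n in range(3)]
for i, piece in enumerate(frames):
    # the header is written only for the first piece; later pieces are appended without one
    piece.to_csv("CL_OP_demo.csv", mode="a", header=(i == 0), index=False)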
