Reading multiple CSVs into different arrays - python

Update. Here is my code. I am importing 400 CSV files into one list. Each CSV file has 200 rows and 5 columns. My end goal is to sum the values in the 4th column of each row of each CSV file. The code below imports all the CSV files; however, I am struggling to isolate the 4th column of data for each CSV file from the one large list.
import numpy as np

data = list()
for i in range(1, 401):    # 400 files: particle_path_1 ... particle_path_400
    datafile = 'particle_path_%d' % i
    # delimiter=None splits on whitespace; use delimiter=',' if the files are truly comma-separated
    data.append(np.genfromtxt(datafile, delimiter=None, skip_header=2))
    print(datafile)
I want to read 100 csv files into 100 different arrays in python. For example:
array1 will have csv1
array2 will have csv2 etc etc.
What's the best way of doing this? I am appending to a list right now, but I end up with one big list that is proving difficult to split into smaller lists. My ultimate goal is to be able to perform different operations on each array (add or subtract numbers, etc.).

Could you provide more detail on what needs to be done? If you are simply trying to read the csv files line by line and turn each one into an array, then this should work:
I would create a 2-dimensional array for this, something like:
csv_array_container = []
for csv_file in csv_files:
    csv_lines = csv_file.readlines()
    csv_array_container.append(csv_lines)
# Now close your file handlers
Assuming that csv_files is a list of open file handles for the csv files. Something more appropriate would likely open each file inside the loop and close it after use, rather than opening all 100, gathering the data, and then closing all 100, given the limit on open file handles.
If you would like more detail on this, please give us more info on what exactly you are trying to do, with examples. Hope this helps.
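A minimal sketch of that open-in-the-loop approach, assuming the files are named csv_1 through csv_100 (the filename pattern here is just a placeholder):
csv_array_container = []
for i in range(1, 101):
    filename = 'csv_%d' % i            # hypothetical filename pattern
    with open(filename) as csv_file:   # the file is closed automatically when the block exits
        csv_array_container.append(csv_file.readlines())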

So you have a list of 100 arrays. What can you tell us about their shapes?
If they all have the same shape you could use
arr = np.stack(data)
I expect arr.shape will be (100,200,5)
fthcol = arr[:,:,3] # 4th column
If they aren't all the same, then a simple list comprehension will work
fthcol = [a[:,3] for a in data]
Again, depending on the shapes you could np.stack(fthcol) (choose your axis).
Don't be afraid to iterate over the elements of the data list. With 100 items the cost won't be prohibitive.
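To tie this back to the stated end goal (summing the values in the 4th column of every file), a short sketch assuming data is the list built by the import loop above:
import numpy as np

# per-file sums of the 4th column (index 3)
per_file_sums = [a[:, 3].sum() for a in data]

# grand total across all files
total = sum(per_file_sums)

# if every array really is 200 x 5, the stacked version is equivalent:
# arr = np.stack(data)        # shape (n_files, 200, 5)
# total = arr[:, :, 3].sum()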

Related

Writing a whole list in a CSV row-Python/IronPython

Right now I have several long lists: one called variable_names.
Let's say variable_names = [Velocity, Density, Pressure, ....] (length is 50+)
I want to write a row that reads every value in the list, leaves about 5 empty cells, then writes the next value, and keeps doing that until the list is done,
as shown in row 1 of the sample picture.
The thing is, I can't use xlrd due to compatibility issues with IronPython, and I need to dynamically write each row in the new csv: load data from the old csv, then append that data to the new csv. The old csv keeps changing once I append the data to the new csv, so I need to iterate over all the values in the lists every time I write a row, because appending columns to a csv is much more difficult.
What I basically want to do is:
with open('data.csv','a') as f:
    sWriter = csv.writer(f)
    sWriter.writerow([Value_list[i], Value_list[i+1], Value_list[i+2], ..., Value_list[end]])
But I can't seem to think of a way to do this with iteration.
Because the writerow method takes a list argument, you can first construct the list and then write it, so that everything in the list ends up in one row.
Like,
with open('data.csv','a') as f:
    sWriter = csv.writer(f)
    listOfColumns = []
    for i in range(start, end):                 # start/end are placeholder indices ('from' is a Python keyword)
        listOfColumns.append(Value_list[i])
    for i in range(0, 2):                       # or you may want some blank columns
        listOfColumns.append("")
    for i in range(anotherStart, anotherEnd):   # append more elements from Value_list
        listOfColumns.append(Value_list[i])
    # At this point listOfColumns will be
    # [Value_list[start], ..., Value_list[end-1], "", "", Value_list[anotherStart], ..., Value_list[anotherEnd-1]]
    sWriter.writerow(listOfColumns)
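If the goal from the question is one row containing every value in the list with about five blank cells between values, a small sketch along those lines (the 5 is just the spacing the question mentions, and Value_list is as defined in the question):
import csv

with open('data.csv', 'a') as f:
    sWriter = csv.writer(f)
    row = []
    for value in Value_list:
        row.append(value)
        row.extend([""] * 5)   # five empty cells after each value
    sWriter.writerow(row)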

Python - Trying to read a csv file and output values with specific criteria

I am going to start off by stating that I am very much new to working in Python. I have a very rudimentary knowledge of SQL, but this is my first go-round with Python. I have a csv file of customer-related data, and I need to output the records of customers who have spent more than $1000. I was also given this starter code:
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

print(data[0])
print(data[1]["Name"])
print(data[2]["Spent Past 30 Days"])
I am not looking for anyone to give me the answer, but maybe nudge me in the right direction. I know that it opens the file for reading, builds a list (data) of the rows, and prints values from the first few rows. I am stuck trying to figure out how to reference a column value without limiting it to a specific row number. Do I need to make another list for the columns? Do I need to create a loop to output each record that meets the > 1000 criterion? Any advice would be much appreciated.
To get a particular column you could use a for loop. I'm not sure exactly what you're wanting to do with it, but this might be a good place to start.
for i in range(0, len(data)):
    print(data[i]['Name'])
len(data) equals the number of rows, so this iterates through the entire column.
The sample code does not reveal the resulting data structure directly. It looks like a list of dicts, which is not what my examples below assume, so I'll guess at how data is organized. Assuming data is a list of lists, you can get at a column with a list comprehension:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_column = [row[1] for row in data]
print(spent_column) # prints: ['Spent Past 30 Days', 890, 1200]
But you will probably want to know who is a big spender so maybe you should return the names:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_names = [row[0] for row in data[1:] if int(row[1])>1000]
print(spent_names) # prints: ['Bert']
If the examples are unclear I suggest you read up on list comprehensions; they are awesome :)
You can do all of the above with regular for-loops as well.
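Since the starter code actually builds a list of dicts via DictReader, here is a sketch using that structure directly; the column names are taken from the print statements above, and float() is needed because DictReader returns every value as a string (this assumes the column holds plain numbers):
big_spenders = [row["Name"] for row in data
                if float(row["Spent Past 30 Days"]) > 1000]
print(big_spenders)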

Excluding certain rows while importing data with Numpy

I am generating data-sets from experiments. I end up with csv data-sets that are typically n x 4 (n rows, n > 1000, and 4 columns). However, due to an artifact of the data-collection process, the first couple of rows and the last couple of rows typically have only 2 or 3 columns. So a data-set looks like:
8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
....
....
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,
As you can see, the first two rows and the last three don't have the 4 columns that I expect. When I try importing this file using np.loadtxt(filename, delimiter=','), I get an error. Once I remove the rows that have fewer than 4 columns (the first 2 rows and the last 3 rows, in this case), the import works fine.
Two questions:
Why doesn't the usual import work? I am not sure what the exact error is. In other words, why is it a problem that not all rows have the same number of columns?
As a workaround, I know how to ignore the first two rows when importing the file with numpy (np.loadtxt(filename, skiprows=2)), but is there a simple way to also ignore a fixed number of rows at the bottom?
Note: This is NOT about finding unique rows in a numpy array. It's about importing csv data where rows do not all have the same number of columns.
Your question is similar (duplicate) to Using genfromtxt to import csv data with missing values in numpy
1) I'm not sure why this is the default behavior.
It could be to warn users that the CSV file might be corrupt.
It could be to keep the array rectangular (N x M) rather than supporting rows of different lengths.
2) Use numpy's genfromtxt. For this you'll need to know the correct number of columns in advance.
data = numpy.genfromtxt('data.csv', delimiter=',', usecols=[0,1,2,3], invalid_raise=False)
Hope this helps!
You can use genfromtxt, which allows skipping lines at the beginning and at the end:
np.genfromtxt('array.txt', delimiter=',', skip_header=2, skip_footer=3)
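If you don't know how many bad rows there are in advance, a sketch of the invalid_raise=False route: rows with too few fields are skipped with a warning, and rows that parse but contain an empty field come back with NaN, which you can then drop (this assumes clean rows never legitimately contain NaN):
import numpy as np

data = np.genfromtxt('array.txt', delimiter=',',
                     usecols=[0, 1, 2, 3], invalid_raise=False)
clean = data[~np.isnan(data).any(axis=1)]   # drop rows such as "6,-2,4096," that parsed with a NaN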

How to write a pandas DataFrame to a CSV by fixed size chunks

I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a dataframe in pandas and splits it into frames of 100,000 records each? Nothing would be different other than that the exported dataframes would be subsets of the parent dataframe.
I assume I could do a loop with something like this, but I suspect it would be remarkably inefficient:
first taking recordcount = len(df.index) to get the number of records, and then looping until I get there using something like
df1 = df[currentrecord:currentrecord + 100000]
and then exporting that to a CSV file.
There has to be an easier way.
You can try something like this:
def save_df(df, chunk_size=100000):
    df_size = len(df)
    for i, start in enumerate(range(0, df_size, chunk_size)):
        # .iloc slices by position, which is what we want for fixed-size chunks
        df.iloc[start:start + chunk_size].to_csv('df_name_{}.csv'.format(i))
You could add a column with a group, and then use the function groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
Where N is set to a value large enough, the minimum being:
N = int(np.ceil(len(df1) / 100000))
Then group by that column and apply function to_csv():
def save_group(df):
    df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")

df1.groupby('Dummy').apply(save_group)
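A somewhat simpler variant of the same grouping idea (my sketch, not part of the original answers): derive the group number arithmetically from the row position instead of building the long dummy list:
import numpy as np

for i, chunk in df.groupby(np.arange(len(df)) // 100000):
    chunk.to_csv('Dataframe_{}.csv'.format(i))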

Python in memory table data structures for analysis (dict, list, combo)

I'm trying to simulate some code that I have working with SQL, but using all Python instead.
With some help here
CSV to Python Dictionary with all column names?
I can now read my zipped csv file into a dict, but only one line of it, the last one. (How do I get a sample of lines, or the whole data file?)
I am hoping to have a memory-resident table that I can manipulate much like SQL when I'm done, e.g. clean the data by matching bad entries against another table of bad data and corrections, then sum by type, average by time period, and the like. The total data file is about 500,000 rows. I'm not fussed about getting it all into memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.
import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
# Then my result is
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>
I think I'm barking up a few wrong trees here...
One is that I only get one line of my roughly 500,000-line data file.
Two is that the dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like summing by group and date.
Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.
I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading everything (which I also don't get). I have read over the Python docs and several reference books, but it just isn't clicking yet.
It seems that most of the answers I can find suggest various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, as well as expanding my tool set. But I'm having a hard time finding relevant examples.
One answer that hints at what I'm getting at is:
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, like how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary.
The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate
functions like datetime.datetime.strptime.
via Yann Vernier
Surely there is some good reference for this general topic?
You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:
rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)
# rows[0]
{'keyA': 13, 'keyB': 'dataB' ... }
# rows[1]
{'keyA': 5, 'keyB': 'dataB' ... }
Then, to do aggregations and calculations:
sum(row['keyA'] for row in rows)
You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
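One concrete gotcha worth noting here (my addition, not part of the original answer): csv.DictReader returns every value as a string, so numeric fields need converting before any arithmetic. A small sketch, using the LIST_PRICE column from the dump above and skipping blank values:
total_list_price = sum(float(row['LIST_PRICE'])
                       for row in rows
                       if row['LIST_PRICE'])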
As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer csv data into a sqlite database.
import csv
import sqlite3
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")
# csv.DictReader uses the first line in the file as column headings by default
dr = csv.DictReader(open('data.csv'), delimiter=',')   # delimiter is an argument to DictReader, not open()
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
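Once the rows are in the in-memory table, the SQL-style aggregations mentioned in the question are a single query away; a brief sketch using the example columns from the snippet above:
c.execute("select sum(col2), avg(col2) from t")
print(c.fetchone())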
You say """I now can read my zipped-csv file into a dict Only one line though, the last one. (how do I get a sample of lines or the whole data file?)"""
Your code does this:
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).
To "get" the whole file, change pass to do_something_useful_with(row).
If you want to read the whole file into memory, simply do this:
rows = list(csv.DictReader(.....))
To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:
for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M:
        continue
    do_something_useful_with(row)
By the way, you don't need dialect='excel'; that's the default.
Numpy (numerical python) is the best tool for operating on and comparing arrays, and your table is basically a 2-d array.
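To make that last point concrete, a sketch of pulling one column of the listing data into a numpy array (the column name comes from the dump above; blank values are skipped as missing):
import numpy as np

# rows is the list of dicts built earlier
list_prices = np.array([float(r['LIST_PRICE']) for r in rows if r['LIST_PRICE']])
print(list_prices.mean())              # average list price
print((list_prices > 300000).sum())    # how many listings are over 300k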
