I'm attempting to load numerical data from CSV files so that I can loop through the calculated data of each stock (file) separately and determine whether the calculated value is greater than a specific number (731 in this case). However, the method I'm using seems to make Python repeat the list and wrap the numbers in quotation marks ('500', for example), turning them into strings. I think the final "if" statement can't handle this, and as a result it doesn't function appropriately. I'm not sure what's going on or what I need to do to get this code running properly.
import csv

stocks = ['JPM','PG','GOOG','KO']
for stock in stocks:
    Data = open("%sMin.csv" % (stock), 'r')
    stockdata = []
    for row in Data:
        stockdata.extend(map(float, row.strip().split(',')))
        stockdata.append(row.strip().split(',')[0])
    if any(x > 731 for x in stockdata):
        print "%s Minimum" % (stock)
Currently you're adding all columns of each row to a list, and then appending the first column of that row again on top of it. Are all columns significant, or just the first?
You're also loading all data from the file before the comparison but don't appear to be using it anywhere, so I guess you can shortcut earlier...
If I understand correctly, your code should be this (or amend it to compare only the first column).
Are you basically writing this?
import csv

STOCKS = ['JPM', 'PG', 'GOOG', 'KO']

for stock in STOCKS:
    with open('{}Min.csv'.format(stock)) as csvin:
        for row in csv.reader(csvin):
            if any(col > 731 for col in map(float, row)):
                print '{} minimum'.format(stock)
                break
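If only the first column matters, a variant that checks just that column could look something like this (a sketch, assuming every row's first column is numeric):

import csv

STOCKS = ['JPM', 'PG', 'GOOG', 'KO']

for stock in STOCKS:
    with open('{}Min.csv'.format(stock)) as csvin:
        for row in csv.reader(csvin):
            # only the first column of each row is compared
            if float(row[0]) > 731:
                print '{} minimum'.format(stock)
                break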
Related
It's my first time working with pandas, so I am trying to wrap my head around all of its functionality.
Essentially, I want to download my bank statements in CSV and search for a keyword (e.g. steam) and compute the money I spent.
I was able to use pandas to locate the lines that contain my keyword, but I do not know how to iterate through them and add the cost of each purchase to a running total.
As shown in the image I uploaded, I am able to find the lines containing my keyword in the dataframe; what I want to do is, for each line found, take the content of col1 and sum it all together.
Attempt At Code
# importing pandas module
import pandas as pd
keyword = input("Enter the keyword you wish to search in the statement: ")
# reading csv file from url
df = pd.read_csv('accountactivity.csv',header=None)
dff=df.loc[df[1].str.contains(keyword,case=False)]
value=df.values[68][2] #Fetches value of a specific cell in the CSV/dataframe created
print(dff)
print(value)
EDIT:
I was essentially almost able to complete the code I wanted using only the CSV reader, but I can't get that code to find substrings: it only works if I enter exactly the same string. If I enter netflix it doesn't work; I would need to write it exactly as it appears on the statement, like NETFLIX.COM _V. Here is another screenshot of that working code. I essentially want to mimic that, but with the ability to find substrings (a substring-matching sketch follows the working code below).
Working Code using CSV reader
import csv

data = []
with open("accountactivity.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)

keyword = input("Enter the keyword you wish to search in the statement: ")
col = [x[1] for x in data]
Sum = 0

if keyword in col:
    for x in range(0, len(data)):
        if keyword == data[x][1]:
            PartialSum = float(data[x][2])
            Sum = Sum + PartialSum
            print(data[x][1])
    print("The sum for expenses at ", keyword, " is of: ", Sum, "$", sep='')
else:
    print("Keyword returned no results.")
The format of the CSV is the following:
column 0: Date of transaction
column 1: Name of transaction
column 2: Money spent from account
column 3: Money received to account
The CSV file downloaded directly from my bank has no headers. So I refer to columns using col[0] etc...
Thanks for your help, I will continue meanwhile to look at how to potentially do this.
dff[dff.columns[col_index]].sum()
where col_index is the index of the column you want to sum together.
Thanks everyone for your help. I ended up understanding more about how dataframes work in pandas, and I used the command dff[dff.columns[col_index]].sum() (which was suggested to me by Jonny Kong) with the column of interest (in my case column 2, which contains my expenses). It computes the sum of my expenses for the searched keyword, which is what I need!
#Importing pandas module
import pandas as pd
#Keyword searched through bank statement
keyword = input("Enter the keyword you wish to search in the statement: ")
#Reading the bank statement CSV file
df = pd.read_csv('accountactivity.csv',header=None)
#Creating dataframe from bank statement with lines that match search keyword
dff=df.loc[df[1].str.contains(keyword,case=False)]
#Sum the column which contains total money spent on the keyword searched
Sum=dff[dff.columns[2]].sum()
#Prints the created dataframe
print("\n",dff,"\n")
#Prints the sum of expenses for the keyword searched
print("The sum for expenses at ",keyword," is of: ",Sum,"$",sep = '')
Working Code!
Again, thanks everyone for helping and supporting me through my first post on SO!
I'll start off by stating that I am very new to working in Python. I have a very rudimentary knowledge of SQL, but this is my first go-round with Python. I have a CSV file of customer-related data and I need to output the records of customers who have spent more than $1000. I was also given this starter code:
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

print(data[0])
print(data[1]["Name"])
print(data[2]["Spent Past 30 Days"])
I am not looking for anyone to give me the answer, but maybe nudge me in the right direction. I know that it has opened the file to read, created a list (data), and is ready to output values from the first few rows. I am stuck trying to figure out how to call out a column value without limiting it to a specific row number. Do I need to make another list for columns? Do I need to create a loop to output each record that meets the > 1000 criteria? Any advice would be much appreciated.
To get a particular column you could use a for loop. I'm not sure exactly what you're wanting to do with it, but this might be a good place to start.
for i in range(0, len(data)):
    print(data[i]['Name'])
len(data) should equal the number of rows, thus iterating through the entire column
The sample code does not give away the data structure. It looks like it may be a list of dicts, which does not make much sense to me, so I'll guess at how data is organized. Assuming data is a list of lists, you can get at a column with a list comprehension:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_column = [row[1] for row in data]
print(spent_column) # prints: ['Spent Past 30 Days', 890, 1200]
But you will probably want to know who is a big spender so maybe you should return the names:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_names = [row[0] for row in data[1:] if int(row[1])>1000]
print(spent_names) # prints: ['Bert']
If the examples are unclear I suggest you read up on list comprehensions; they are awesome :)
You can do all of the above with regular for-loops as well.
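For instance, the same header-skipping filter written with a plain for loop (same sample data as above):

data = [['Name', 'Spent Past 30 Days'], ['Ernie', 890], ['Bert', 1200]]

spent_names = []
for row in data[1:]:            # skip the header row
    if int(row[1]) > 1000:      # convert the spend column before comparing
        spent_names.append(row[0])

print(spent_names)  # prints: ['Bert']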
I am currently trying to figure out a way to get information stored across multiple datasets as .csv files.
Context
For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. In each dataset, there are 20,000+ rows with 80+ columns in each row. Each row represents an Animal, identified by an ID number, and each column holds various experimental data about that Animal. Assume each Animal ID number is unique within a dataset, but not across all datasets. For instance, ID #ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.
Problem
Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.
import csv

class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal objects

# Iterate through each dataset csv file
for dataset in all_datasets:
    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]
    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')
        # Iterate through each row in the csv file
        for row in reader:
            # Check if the list is not empty
            if animal_queries_copy:
                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']
                # Check whether the animal id number matches a query
                for animal in animal_queries_copy[:]:
                    if animal.animal_id == row_animal_id:
                        # If a match is found, store the info, remove the
                        # query from the list, and stop iterating through
                        # the queries
                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break
            # If the list is empty, all queries were found for the current
            # dataset, so stop iterating through rows in reader
            else:
                break
Discussion
Is there a more obvious approach for this? Assume that I want to use .csv files for now, and I will consider converting these .csv files to an easier-to-use format like SQL Tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).
The one thing that sticks out to me is that I have to create multiple copies of animal_queries: one for each dataset, and one for each row of a dataset (in the inner for loop). Since one row only contains one ID, I can exit the loop early once I find a match for an ID from animal_queries. In addition, since that ID has already been found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list; but I need to keep the original list of queries because I also need it to search the remaining datasets. However, I can't remove an element from a list while iterating over it in a for loop, so I need to create yet another copy. This doesn't seem optimal to me and I'm wondering if I'm approaching this from the wrong direction. Any help would be appreciated, thanks!
Well, you could greatly speed this up by using the pandas library for one thing. Ignoring the class definition for now, you could do the following:
import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user
#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]
#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]
#concatenate the data
final_data = pd.concat(retrieved_data)
#export to csv
final_data.to_csv('your_data')
This simplifies things a lot. The isin method slices each data frame where ANIMAL ID is found in the list animal_queries. Incidentally, pandas will also help you cope with SQL tables, so it is probably a good route for you to go down.
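If you would rather stay with the csv module, a sketch of an alternative (assuming animal_queries, all_datasets, and the Animal class from your question) is to key the queries by ID in a dict, which avoids copying the query list for every row:

import csv

# map each queried ID to its Animal object for constant-time lookup
animals_by_id = {animal.animal_id: animal for animal in animal_queries}

for dataset in all_datasets:
    # IDs still unmatched in the current dataset
    remaining_ids = set(animals_by_id)
    with open(dataset, 'r', newline='') as dataset_file:
        for row in csv.DictReader(dataset_file):
            row_id = row['ANIMAL ID']
            if row_id in remaining_ids:
                animals_by_id[row_id].store_info(row, dataset)
                remaining_ids.remove(row_id)
                if not remaining_ids:
                    break  # every query matched in this dataset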
I'm parsing a big CSV file using csv.DictReader.
quotes = open("file.csv", "rb")
csvReader = csv.DictReader(quotes)
Then for each row I'm converting the time value in the CSV in datetime using this :
for data in csvReader:
    year = int(data["Date"].split("-")[2])
    month = strptime(data["Date"].split("-")[1], '%b').tm_mon
    day = int(data["Date"].split("-")[0])
    hour = int(data["Time"].split(":")[0])
    minute = int(data["Time"].split(":")[1])
    bars = datetime.datetime(year, month, day, hour, minute)
Now I would like to perform actions only on the rows of the same day. Would it be possible to do it in the same for loop, or should I save the data out per day and then perform the actions? What would be an efficient way of handling the parsing?
As jogojapan has pointed out, it is important to know whether we can assume that the CSV file is sorted by date. If it is, then you could use itertools.groupby to simplify your code. For example, the for loop in this code iterates over the data one day at a time:
import csv
import datetime
import itertools

with open("file.csv", "rb") as quotes:
    csvReader = csv.DictReader(quotes)
    lmb = lambda d: datetime.datetime.strptime(d["Date"], "%d-%b-%Y").date()
    for k, g in itertools.groupby(csvReader, key=lmb):
        # do stuff per day
        counts = (int(data["Count"]) for data in g)
        print "On {0} the total count was {1}".format(k, sum(counts))
I created a test "file.csv" containing the following data:
Date,Time,Count
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
2-Apr-2012,11:38,23
2-Apr-2012,15:10,1
3-Apr-2012,17:47,123
3-Apr-2012,18:21,8
and when I ran the above code I got the following results:
On 2012-04-01 the total count was 10
On 2012-04-02 the total count was 29
On 2012-04-03 the total count was 131
But remember that this will only work if the data in "file.csv" is sorted by date.
If (for some reason) you can assume that the input rows are already sorted by date, you could put them into a local container one by one as long as the date of any new row is the same as the previous one:
same_date_rows = []
prev_date = None
for data in csvReader:
    # ... your existing code
    bars = datetime.datetime(year, month, day, hour, minute)
    # Compare dates only, so rows with different times still group together
    if bars.date() == prev_date:
        same_date_rows.append(data)
    else:
        # New date. We process all rows collected so far
        do_something(same_date_rows)
        # Then we start a new collection for the new date
        same_date_rows = [data]
    # Remember the date of the current row
    prev_date = bars.date()

# Finally, process the final group of rows
do_something(same_date_rows)
But if you cannot make that assumption, you will have to
Either: Put the rows in a long list, sort that by date, and then apply an algorithm like the above to the sorted list
Or: Put the rows in a dictionary, using the date as key, and a list of rows as value for each key. Then you can iterate through the keys of that dictionary to get access to all rows that share a date.
The second of these two approaches is a little more space-consuming, but it may allow you to do some of the date-specific processing in the main loop: whenever you receive a new row for an already-existing date, you could apply some of the date-specific processing right away, possibly avoiding the need to store all date-specific rows explicitly. Whether that is possible depends on what kind of processing you apply to the rows.
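A sketch of the first option (sort everything in memory, then group), reusing the same "Date" format and the do_something placeholder from above:

import csv
import datetime
import itertools

def day_of(data):
    # '1-Apr-2012' -> datetime.date(2012, 4, 1)
    return datetime.datetime.strptime(data["Date"], "%d-%b-%Y").date()

with open("file.csv", "rb") as quotes:
    rows = sorted(csv.DictReader(quotes), key=day_of)

for day, group in itertools.groupby(rows, key=day_of):
    do_something(list(group))  # process one day's rows at a time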
If you are not going for space efficiency, an elegant solution would be to create a dictionary where the key is your day and the value is a list object in which all the information for each day is stored. Later you can do whatever operations you want on a per-day basis.
For example
d = {}  # Initialize empty dictionary
for data in csvReader:
    Day = int(data["Date"].split("-")[0])
    try:
        d[Day].append('Some_Val')
    except KeyError:
        d[Day] = ['Some_Val']
This will either modify or create a new list object for each day, which is later easily accessible either by iterating over the dictionary or simply by referring to the day as a key.
For example:
d[Some_Day]
will give you simply a list object with all the information you have stored. Given the constant-time (on average) lookup of a dictionary, it should be quite efficient in terms of time.
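For what it's worth, a collections.defaultdict makes the same grouping pattern a little shorter; this sketch keys on the full date string from the question's csvReader rather than just the day number:

import collections

d = collections.defaultdict(list)   # missing keys start as empty lists
for data in csvReader:
    d[data["Date"]].append(data)    # group each full row under its date string

# d["2-Apr-2012"] is now a list of every row from that day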
I'm trying to simulate some code that I have working with SQL, but using all Python instead.
With some help here
CSV to Python Dictionary with all column names?
I can now read my zipped CSV file into a dict, but only one line of it: the last one. (How do I get a sample of lines, or the whole data file?)
I am hoping to have a memory-resident table that I can manipulate much like SQL when I'm done, e.g. clean data by matching bad data against another table of bad data and correct entries, then sum by type, average by time period, and the like. The total data file is about 500,000 rows. I'm not fussed about getting it all in memory, but I want to solve the general case as best I can, so I know what can be done without resorting to SQL.
import csv, sys, zipfile

sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
# Then my result is
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>
I think I'm barking up a few wrong trees here...
One is that I only have one line of my roughly 500,000-line data file.
Two is that a dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like summing by group and date.
Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.
I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading everything (which I also don't get). I have read over the Python docs and several reference books, but it just is not clicking yet.
It seems that most of the answers I can find suggest using various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python, as in some cases I think it will be easier and faster, as well as expanding my tool set. But I'm having a hard time finding relevant examples.
One answer that hints at what I'm getting at is:
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching requires data that is certainly not present in a CSV, like how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary.
The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
via Yann Vernier
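For example, guessing the format of STATUS_DATE from the sample row above, a single value could be parsed like this:

import datetime

# STATUS_DATE looks like '3/3/2011 11:54:56 PM' in the sample row above
status = datetime.datetime.strptime('3/3/2011 11:54:56 PM', '%m/%d/%Y %I:%M:%S %p')
# status is now datetime.datetime(2011, 3, 3, 23, 54, 56)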
Surely there is some good reference for this general topic?
You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:
rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)

# rows[0]
# {'keyA': 13, 'keyB': 'dataB' ... }
# rows[1]
# {'keyA': 5, 'keyB': 'dataB' ... }
Then, to do aggregations and calculations:
sum(float(row['keyA']) for row in rows)  # DictReader values are strings, so convert before summing
You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
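For example, a sketch of converting a couple of columns up front, using the LIST_PRICE and DOM names from your sample output (empty strings become None here):

rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    # convert once, up front, instead of on every later calculation
    row['LIST_PRICE'] = float(row['LIST_PRICE']) if row['LIST_PRICE'] else None
    row['DOM'] = int(row['DOM']) if row['DOM'] else None
    rows.append(row)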
As a commenter mentioned, using an in-memory database could be really beneficial to you. another question asks exactly how to transfer csv data into a sqlite database.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")

# csv.DictReader uses the first line in the file as column headings by default
dr = csv.DictReader(open('data.csv'), delimiter=',')
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
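Once the rows are loaded, ordinary SQL runs against the in-memory table. For example (a sketch against the toy col1/col2 schema above; the column names are placeholders, not your real ones):

conn.commit()

# aggregate with plain SQL, e.g. the sum of col2 grouped by col1
c.execute("select col1, sum(col2) from t group by col1;")
for col1, total in c.fetchall():
    print col1, total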
You say """I now can read my zipped-csv file into a dict Only one line though, the last one. (how do I get a sample of lines or the whole data file?)"""
Your code does this:
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).
To "get" the whole file, change pass to do_something_useful_with(row).
If you want to read the whole file into memory, simply do this:
rows = list(csv.DictReader(.....))
To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:
for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M:
        continue
    do_something_useful_with(row)
By the way, you don't need dialect='excel'; that's the default.
NumPy (numerical Python) is the best tool for operating on and comparing arrays, and your table is basically a 2D array.
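As a rough sketch (assuming numpy is installed, the tab-delimited .txt has been extracted from the zip, and the columns coerce cleanly), genfromtxt can load the file into a structured array whose columns are addressable by name:

import numpy as np

# names=True takes field names from the header row; dtype=None guesses each column's type
table = np.genfromtxt('AllListing1RES.txt', delimiter='\t', names=True, dtype=None)

# columns are then addressable by name, e.g. every list price at once:
prices = table['LIST_PRICE']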