How to sort a CSV file using python - python

I am trying to make a leaderboard for my Python program.
I have already sorted out writing the different scores to the leaderboard.
However, I am having trouble finding a way to sort this data
(highest score at the top and lowest at the bottom).
Also, I am sorry, but I do not have any code that is even vaguely functional; everything I have tried has been incorrect.
I also only have limited access to modules, as it is for a school project, which makes it even harder for me (I have csv, random and time).
Thank you so much.
I would really appreciate any help I can receive.

You can read in the file with pandas, sort it by a column, and overwrite the old csv with the new values. The code would look similar to this:
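import pandas as pd
path = your_file_path  # placeholder: set this to the path of your CSV file
df = pd.read_csv(path)
df = df.sort_values(by=["column_name"], ascending=False)
# index=False stops pandas from writing the row index as an extra column
df.to_csv(path, index=False)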

This problem can be done in 3 parts using standard Python:
1. Read all of the data (assuming it has a header row). A csv.reader() is used to parse your file and read in each row as a list of values. Calling list() on it reads all rows as a list of rows.
2. Sort the data.
3. Write all of the data (adding back the header first), this time using a csv.writer() to take your list of rows and write them to the file in the correct format.
This can be done using Python's csv library, which you say you can use. You also need to tell the sort() function how to sort your rows. This example assumes the scores are in the second column. The csv library reads each row as a list of values (indexed from 0), so the score in this example is column 1.
The key parameter gives sort() a function to call for each row that it is sorting. The function receives a row and returns the parts of the row to sort on, so you don't have to sort on the first column. lambda is just shorthand for writing a single-line function; here it takes a parameter x and returns the elements from the row to sort on. A Python tuple is used to return two elements, the score and the name. The score string x[1] is first converted into an integer, and negating it with - makes the highest score sort to the top. x[0] then uses the name column to break ties where two scores are the same:
import csv
with open('scores.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    data = list(csv_input)
data.sort(key=lambda x: (-int(x[1]), x[0]))
with open('scores_sorted.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    csv_output.writerows(data)
So for a sample CSV file containing:
name,score
fred,5
wilma,10
barney,8
betty,4
dino,10
You would get a sorted output CSV looking like:
name,score
dino,10
wilma,10
barney,8
fred,5
betty,4
Note, dino and wilma both have the same score, but dino is alphabetically first.
This assumes you are using Python 3.x
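To see why the score has to be converted with int() before sorting, the key function can be tried on a few in-memory rows. This is only a small sketch: the rows list is a hypothetical stand-in for what csv.reader would return, and sort_key is an illustrative name for the same function the lambda expresses.

```python
# Hypothetical in-memory rows, as csv.reader would return them (all strings).
rows = [["fred", "5"], ["wilma", "10"], ["barney", "8"], ["betty", "4"], ["dino", "10"]]

def sort_key(row):
    # Negated integer score first (highest on top), then name to break ties.
    return (-int(row[1]), row[0])

# Sorting on the raw strings compares them character by character,
# so "8" ends up ahead of "10".
wrong = sorted(rows, key=lambda row: row[1], reverse=True)
right = sorted(rows, key=sort_key)

print([r[0] for r in wrong])
print([r[0] for r in right])
```

Only the second ordering matches the sorted output shown above.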

Related

Finding string contained in CSV file and computing a sum

It's my first time working with pandas, so I am trying to wrap my head around all of its functionality.
Essentially, I want to download my bank statements as CSV, search for a keyword (e.g. steam), and compute the money I spent.
I was able to use pandas to locate the lines that contain my keyword, but I do not know how to iterate through them and add the cost of each purchase to a running total.
If you look at the image I uploaded, I am able to find the lines containing my keyword in the dataframe, but for each line found I want to take the content of col1 and sum it all together.
Attempt At Code
# importing pandas module
import pandas as pd
keyword = input("Enter the keyword you wish to search in the statement: ")
# reading csv file from url
df = pd.read_csv('accountactivity.csv',header=None)
dff=df.loc[df[1].str.contains(keyword,case=False)]
value=df.values[68][2] #Fetches value of a specific cell in the CSV/dataframe created
print(dff)
print(value)
EDIT:
I essentially was almost able to complete the code I wanted, using only the CSV reader, but I can't get that code to find substrings. It only works if I enter the exact same string, meaning if I enter netflix it doesn't work, I would need to write it exactly as it appears on the statement like NETFLIX.COM _V. Here is another screenshot of that working code. I essentially want to mimic that with the capabilities of just finding substrings.
Working Code using CSV reader
import csv
data=[]
with open("accountactivity.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
keyword = input("Enter the keyword you wish to search in the statement: ")
col = [x[1] for x in data]
Sum = 0
if keyword in col:
    for x in range(0, len(data)):
        if keyword == data[x][1]:
            PartialSum=float(data[x][2])
            Sum=Sum+PartialSum
            print(data[x][1])
    print("The sum for expenses at ",keyword," is of: ",Sum,"$",sep = '')
else:
    print("Keyword returned no results.")
The format of the CSV is the following: CSV Format
column 0 Date of transaction
column 1 Name of transaction
column 2 Money spent from account
column 3 Money received to account
The CSV file downloaded directly from my bank has no headers. So I refer to columns using col[0] etc...
Thanks for your help, I will continue meanwhile to look at how to potentially do this.
dff[dff.columns[col_index]].sum()
where col_index is the index of the column you want to sum together.
Thanks everyone for your help. I ended up understanding more about how dataframes work in pandas, and I used the command dff[dff.columns[col_index]].sum() (which was suggested to me by Jonny Kong) with the column of interest (in my case column 2, which contains my expenses). It computes the sum of my expenses for the searched keyword, which is what I need!
#Importing pandas module
import pandas as pd
#Keyword searched through bank statement
keyword = input("Enter the keyword you wish to search in the statement: ")
#Reading the bank statement CSV file
df = pd.read_csv('accountactivity.csv',header=None)
#Creating dataframe from bank statement with lines that match search keyword
dff=df.loc[df[1].str.contains(keyword,case=False)]
#Sum the column which contains total money spent on the keyword searched
Sum=dff[dff.columns[2]].sum()
#Prints the created dataframe
print("\n",dff,"\n")
#Prints the sum of expenses for the keyword searched
print("The sum for expenses at ",keyword," is of: ",Sum,"$",sep = '')
Working Code!
Again, thanks everyone for helping and supporting me through my first post on SO!
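For the csv-reader version from the edit, substring matching needs only the in operator on each name instead of an exact comparison. The following is a minimal sketch, not the poster's code: SAMPLE is made-up data in the column layout described in the question, total_spent is a hypothetical helper name, and treating an empty "money spent" cell as 0 is an assumption.

```python
import csv
import io

# Hypothetical statement rows in the layout described in the question:
# date, name, money spent, money received (no header row).
SAMPLE = """2020-01-02,NETFLIX.COM _V,15.99,
2020-01-03,STEAM GAMES,20.00,
2020-01-05,Netflix.com _V,15.99,
2020-01-07,PAYROLL,,1500.00
"""

def total_spent(csvfile, keyword):
    """Sum column 2 for rows whose column 1 contains keyword, ignoring case."""
    needle = keyword.lower()
    total = 0.0
    matches = []
    for row in csv.reader(csvfile):
        # `in` on strings is a substring test; lower() makes it case-insensitive.
        if needle in row[1].lower():
            matches.append(row[1])
            total += float(row[2] or 0)  # assumption: an empty cell counts as 0
    return total, matches

total, matches = total_spent(io.StringIO(SAMPLE), "netflix")
print(matches)
print(round(total, 2))
```

With a real statement, the io.StringIO(SAMPLE) argument would be replaced by the file object from open("accountactivity.csv").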

Writing a whole list in a CSV row-Python/IronPython

Right now I have several long lists, one of which is called variable_names.
Let's say variable_names = [Velocity, Density, Pressure, ....] (length is 50+).
I want to write a row that takes each element of the list in turn, leaves about 5 empty cells, then writes the next value, and keeps doing this until the list is done.
As shown in row1 Sample picture
The thing is I can't use xlrd due to compatibility issues with IronPython, and I need to dynamically write each row in the new csv: load data from the old csv, then append that data to the new csv. The old csv keeps changing once I append the data to the new csv, so I need to iterate over all values in the lists every time I write a row, because it is much more difficult to append columns in a csv.
What I basically want to do is:
with open('data.csv','a') as f:
    sWriter=csv.writer(f)
    sWriter.writerow([Value_list[i],Value_list[i+1],Value_list[i+2].....Value_list[end]])
But I can't seem to think of a way to do this with iteration.
Because the writerow method takes a list argument, you can first construct the list and then write it, so everything in the list ends up in one row.
Like this:
with open('data.csv','a') as f:
    sWriter=csv.writer(f)
    listOfColumns = []
    # start, stop, anotherStart and anotherStop are placeholder indices into Value_list
    # (from cannot be used as a variable name because it is a reserved word in Python)
    for i in range(start, stop): # append elements from Value_list
        listOfColumns.append(Value_list[i])
    for i in range(0, 2): # Or you may want some blank columns
        listOfColumns.append("")
    for i in range(anotherStart, anotherStop): # append elements from Value_list
        listOfColumns.append(Value_list[i])
    # At this point, listOfColumns will be [Value_list[start], ..., Value_list[stop - 1], "", "", Value_list[anotherStart], ..., Value_list[anotherStop - 1]]
    sWriter.writerow(listOfColumns)
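The layout asked for in the question (each value followed by about 5 empty cells) can be built with the same idea. This is a rough sketch under a few assumptions: spaced_row is a hypothetical helper, variable_names here is a short stand-in for the real list, and the io.StringIO buffer stands in for the opened file. The example is written for Python 3; the helper itself uses nothing version-specific, but it has not been checked against IronPython.

```python
import csv
import io

def spaced_row(values, gap=5):
    """Return the values with `gap` empty cells between consecutive items."""
    row = []
    for index, value in enumerate(values):
        if index:                   # no padding before the first value
            row.extend([""] * gap)
        row.append(value)
    return row

# Hypothetical stand-in for the real variable_names list.
variable_names = ["Velocity", "Density", "Pressure"]

buffer = io.StringIO()              # stands in for open('data.csv', 'a')
csv.writer(buffer).writerow(spaced_row(variable_names))
print(buffer.getvalue())
```

Since the whole row is built as one list first, a single writerow call is enough, however long the list is.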

Python - Trying to read a csv file and output values with specific criteria

I am going to start off with stating I am very much new at working in Python. I have a very rudimentary knowledge of SQL but this is my first go 'round with Python. I have a csv file of customer related data and I need to output the records of customers who have spent more than $1000. I was also given this starter code:
import csv
import re
data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)
print(data[0])
print(data[1]["Name"])
print(data[2]["Spent Past 30 Days"])
I am not looking for anyone to give me the answer, but maybe nudge me in the right direction. I know that it has opened the file to read and created a list (data) and is ready to output the values of the first and second row. I am stuck trying to figure out how to call out the column value without limiting it to a specific row number. Do I need to make another list for columns? Do I need to create a loop to output each record that meets the > 1000 criteria? Any advice would be much appreciated.
To get a particular column you could use a for loop. I'm not sure exactly what you want to do with it, but this might be a good place to start.
for i in range(0,len(data)):
    print(data[i]['Name'])
len(data) equals the number of rows, so the loop iterates through the entire column.
The sample code does not make the data structure explicit. It looks like it may be a list of dicts (csv.DictReader yields one dict per row). In case your data is instead organized as a list of lists, you can get at a column with a list comprehension:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_column = [row[1] for row in data]
print(spent_column) # prints: ['Spent Past 30 Days', 890, 1200]
But you will probably want to know who is a big spender so maybe you should return the names:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_names = [row[0] for row in data[1:] if int(row[1])>1000]
print(spent_names) # prints: ['Bert']
If the examples are unclear I suggest you read up on list comprehensions; they are awesome :)
You can do all of the above with regular for-loops as well.
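For the list of dicts that the starter code actually builds, the same filter can be written against the column name. This is a minimal sketch: SAMPLE is made-up data standing in for customerData.csv, big_spenders is a hypothetical helper name, and it assumes the amounts are plain numbers without a currency symbol or thousands separators.

```python
import csv
import io

# Hypothetical sample standing in for customerData.csv; only the two column
# names used in the starter code are taken from the question.
SAMPLE = """Name,Spent Past 30 Days
Ernie,890
Bert,1200
Grover,1000.50
"""

def big_spenders(csvfile, threshold=1000):
    """Return the rows (as dicts) whose spend exceeds threshold."""
    result = []
    for row in csv.DictReader(csvfile):
        # DictReader yields strings, so convert before comparing.
        if float(row["Spent Past 30 Days"]) > threshold:
            result.append(row)
    return result

for customer in big_spenders(io.StringIO(SAMPLE)):
    print(customer["Name"], customer["Spent Past 30 Days"])
```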

Python repeating CSV file

I'm attempting to load numerical data from CSV files in order to loop through the calculated data of each stock (file) separately and determine whether any calculated value is greater than a specific number (731 in this case). However, the method I am using seems to make Python repeat the list as well as add quotation marks around the numbers (e.g. '500'), turning them into strings. I think the final "if" statement can't handle this, and as a result it doesn't seem to function properly. I'm not sure what's going on or what I need to do to get this code running properly.
import csv
stocks = ['JPM','PG','GOOG','KO']
for stock in stocks:
    Data = open("%sMin.csv" % (stock), 'r')
    stockdata = []
    for row in Data:
        stockdata.extend(map(float, row.strip().split(',')))
        stockdata.append(row.strip().split(',')[0])
    if any(x > 731 for x in stockdata):
        print "%s Minimum" % (stock)
Currently you're adding all columns of each row to a list, then adding to the end of that, the first column of the row again? So are all columns significant, or just the first?
You're also loading all data from the file before the comparison but don't appear to be using it anywhere, so I guess you can shortcut earlier...
If I understand correctly, your code should be this (or amend to only compare first column).
Are you basically writing this?
import csv
STOCKS = ['JPM', 'PG', 'GOOG', 'KO']
for stock in STOCKS:
    with open('{}Min.csv'.format(stock)) as csvin:
        for row in csv.reader(csvin):
            # map(float, row) converts every column, so no strings reach the comparison
            if any(col > 731 for col in map(float, row)):
                print('{} minimum'.format(stock))
                break
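If the files might contain cells that are not numbers (a header line or a date column, which the question does not say), float() would raise on them. A possible variant, sketched here for Python 3, skips such cells; exceeds is a hypothetical helper and the samples dict is made-up data standing in for the real files.

```python
import csv
import io

def exceeds(csvfile, threshold):
    """Return True as soon as any numeric cell is greater than threshold."""
    for row in csv.reader(csvfile):
        for cell in row:
            try:
                value = float(cell)   # csv always yields strings
            except ValueError:
                continue              # skip headers, dates, empty cells
            if value > threshold:
                return True
    return False

# Hypothetical contents standing in for the <stock>Min.csv files.
samples = {"JPM": "500,620\n731,700\n", "GOOG": "date,min\n2011-03-03,745.5\n"}
for stock, text in samples.items():
    if exceeds(io.StringIO(text), 731):
        print("{} minimum".format(stock))
```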

Python in memory table data structures for analysis (dict, list, combo)

I'm trying to reproduce some code that I have working with SQL, but using only Python.
With some help from here:
CSV to Python Dictionary with all column names?
I can now read my zipped CSV file into a dict, but only one line, the last one. (How do I get a sample of lines or the whole data file?)
I am hoping to end up with a memory-resident table that I can manipulate much like SQL, e.g. clean data by matching bad data against another table of bad data and correct entries, then sum by type, average by time period and the like. The total data file is about 500,000 rows. I'm not fussed about getting it all in memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.
import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
# Then my result is
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>
I think I'm barking up a few wrong trees here.
One is that I only have 1 line of my roughly 500,000-line data file.
Two is that the dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like sum by group and date.
Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.
I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading everything (which I also don't know how to do). I have read over the Python docs and several reference books, but it just is not clicking yet.
Most of the answers I can find suggest using various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python, as in some cases I think it will be easier and faster, as well as expanding my tool set. But I'm having a hard time finding relevant examples.
One answer that hints at what I'm getting at is:
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching requires data that is certainly not present in a CSV, like how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary.
The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate
functions like datetime.datetime.strptime.
via Yann Vernier
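The column-oriented idea quoted above can be sketched in Python 3 as follows. SAMPLE is a made-up table (the column names 'one' and 'two' follow the quoted snippet), and the int() conversion is added on the assumption that column 'one' holds whole numbers, since the csv module returns every value as a string.

```python
import csv
import io

# Hypothetical table; the column names 'one' and 'two' follow the quoted snippet.
SAMPLE = "one,two\n1,a\n3,b\n5,c\n"

allrows = list(csv.reader(io.StringIO(SAMPLE)))
# zip(*allrows) transposes rows into columns; the first item of each
# column is its header, the rest are the values.
columns = {col[0]: list(col[1:]) for col in zip(*allrows)}

# Values arrive as strings, so parse column 'one' before comparing.
matching = [i for i, value in enumerate(columns["one"]) if int(value) > 2]
print([columns["two"][i] for i in matching])
```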
Surely there is some good reference for this general topic?
You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:
rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)
# rows[0]
{'keyA': '13', 'keyB': 'dataB' ... }
# rows[1]
{'keyA': '5', 'keyB': 'dataB' ... }
Then, to do aggregations and calculations (the csv module returns every value as a string, so convert first):
sum(int(row['keyA']) for row in rows)
You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
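A grouped sum or average, similar to SQL's GROUP BY, can be accumulated in a single pass over the rows with dictionaries. This is only a sketch: the column names come from the question's output, but the sample values and the choice of grouping columns are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated sample; the column names are taken from the
# question's output, the values are made up for illustration.
SAMPLE = (
    "COUNTY\tPROP_TYPE\tLIST_PRICE\n"
    "JEFF\tSFR\t324900\n"
    "JEFF\tSFR\t100000\n"
    "SHEL\tSFR\t250000\n"
)

totals = defaultdict(int)   # group key -> running sum
counts = defaultdict(int)   # group key -> number of rows

for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    key = (row["COUNTY"], row["PROP_TYPE"])   # like GROUP BY COUNTY, PROP_TYPE
    totals[key] += int(row["LIST_PRICE"])
    counts[key] += 1

for key in sorted(totals):
    print(key, totals[key], totals[key] / counts[key])
```

Because the reader is consumed row by row, this does not require holding the whole file in memory; only one entry per group is kept.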
As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer CSV data into a SQLite database.
import csv
import sqlite3
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")
# csv.DictReader uses the first line in the file as column headings by default
# delimiter is an argument of DictReader, not of open()
dr = csv.DictReader(open('data.csv'), delimiter=',')
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
conn.commit()
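Once the rows are in the in-memory table, grouping and aggregation become ordinary SQL. The following sketch uses made-up data in the same col1/col2 layout as the table above and reads it from a string rather than from data.csv so that it is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical data; col1/col2 match the table definition above.
SAMPLE = "col1,col2\na,1.5\nb,2.0\na,3.5\n"

conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 text, col2 float)")
rows = [(r["col1"], r["col2"]) for r in csv.DictReader(io.StringIO(SAMPLE))]
conn.executemany("insert into t (col1, col2) values (?, ?)", rows)

# Sum and average per group, as one would in any SQL database.
query = "select col1, sum(col2), avg(col2) from t group by col1 order by col1"
result = conn.execute(query).fetchall()
print(result)
conn.close()
```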
You say """I now can read my zipped-csv file into a dict Only one line though, the last one. (how do I get a sample of lines or the whole data file?)"""
Your code does this:
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).
To "get" the whole file, change pass to do_something_useful_with(row).
If you want to read the whole file into memory, simply do this:
rows = list(csv.DictReader(.....))
To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:
for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M: continue
    do_something_useful_with(row)
By the way, you don't need dialect='excel'; that's the default.
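To test with only the first 10 or 100 rows, itertools.islice can stop the reader early without loading the rest of the file. This is a small sketch on generated data: SAMPLE and its ID and PRICE columns are hypothetical stand-ins for the real listing file.

```python
import csv
import io
from itertools import islice

# Hypothetical tab-separated data standing in for the real listing file.
SAMPLE = "ID\tPRICE\n" + "".join("{}\t{}\n".format(i, i * 100) for i in range(1000))

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
first_ten = list(islice(reader, 10))              # stops after 10 rows

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
every_100th = list(islice(reader, 0, None, 100))  # rows 0, 100, 200, ...

print(len(first_ten), [row["ID"] for row in every_100th])
```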
Numpy (numerical python) is the best tool for operating on, comparing etc arrays, and your table is basically a 2d array.
