python csv: getting subset

here is a snapshot of my csv:
alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3
I would like to get the subset of this data where the value in the first column (harry, alex) occurs at least 4 times. So I want the resulting data set to be:
alex 123f 1
alex sef 3
alex gsdf 4
alex wf35 6

Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:
import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))
then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:
import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1
and finally do the selection loop...:
for row in data:
    if counter[row[0]] >= 4:
        print row
Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.
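For readers on Python 2.7 or later (including Python 3), the "better way" alluded to above is presumably collections.Counter. The following is a minimal sketch, assuming the rows are already in memory as lists; the helper name rows_with_frequent_key is made up for illustration.

```python
import collections

def rows_with_frequent_key(rows, min_count=4):
    """Return the rows whose first field occurs at least min_count times."""
    # First pass: count how often each first-column value appears.
    counts = collections.Counter(row[0] for row in rows)
    # Second pass: keep qualifying rows, preserving their original order.
    return [row for row in rows if counts[row[0]] >= min_count]

# The sample data from the question, already split into fields.
sample = [
    ["alex", "123f", "1"],
    ["harry", "fwef", "2"],
    ["alex", "sef", "3"],
    ["alex", "gsdf", "4"],
    ["alex", "wf35", "6"],
    ["harry", "sdfsdf", "3"],
]
for row in rows_with_frequent_key(sample):
    print(" ".join(row))
```

Two passes are still needed, for the reason given above: a row's fate is not known until every row has been counted.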

If Python is not a must:
$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){if((j,i) in c)print c[j,i]}}}}' file
And yes, 70MB file is fine.

Related

Python Sorting and Organising

I'm trying to sort data from a file and not quite getting what I need. I have a text file with race details (name and placement, i.e. 1, 2, 3). I would like to be able to organize the data by highest placement first and also alphabetically by name. I can do this if I split the lines, but then the name and score will not match up.
Any help and suggestions would be very welcome; I've hit that proverbial wall.
My apologies (first-time user of this site, and Python noob, steep learning curve). Thank you for your suggestions, I really do appreciate the help.
comp=[]
results = open('d:\\test.txt', 'r')
for line in results:
    line=line.split()
    # (name,score)= line.split()
    comp.append(line)
sorted(comp)
results.close()
print (comp)
Test file was in this format:
Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10

I completely agree with everyone who has down-voted this question for being badly posed. However, I'm in a good mood so I'll try and at least steer you in the right direction:
Let's assume your text file looks like this:
Name,Placement
D,1
D,2
C,1
C,3
B,1
B,3
A,1
A,4
I suggest importing the data and sorting it using Pandas http://pandas.pydata.org/
import pandas as pd
# Read in the data
# Replace <FULL_PATH_OF_FILE> with something like C:/Data/RaceDetails.csv
# The first row is automatically used for column names
data=pd.read_csv("<FULL_PATH_OF_FILE>")
# Sort the data (sort_values replaces the older DataFrame.sort)
sorted_data=data.sort_values(['Placement','Name'])
# Create a re-indexed data frame if you so desire
sorted_data_new_index=sorted_data.reset_index(drop=True)
This gives me:
Name Placement
A 1
B 1
C 1
D 1
D 2
B 3
C 3
A 4
I'll leave you to figure out the rest..

As @Jack said, I am very limited in how I can help if you don't post code or the txt file. However, I've run into a similar problem before, so I know the basics (again, I will need code/files before I can give an exact type-this-stuff answer!)
You can either develop an algorithm yourself, or use the built-in sorted feature
Put the names and scores in a list (or dictionary) such as:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
and then call sorted(name_scores) and it will sort by names: [['Ashley', 100], ['Bob', 50], ['Matt', 95]]
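To sort by placement first and then by name, as the question asks, sorted can be given a key function. The following is a minimal sketch, assuming each line holds a name and an integer placement separated by whitespace, and that the lowest placement number should come first; the helper name parse_results is made up for illustration.

```python
def parse_results(lines):
    """Turn 'Name placement' lines into (name, placement) tuples."""
    results = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, placement = parts
        # Convert to int so that 10 sorts after 7, not before 2.
        results.append((name, int(placement)))
    return results

lines = ["Jones 2", "Ranfel 7", "Peterson 5", "Smith 1", "Helm 10"]
results = parse_results(lines)
# Sort by placement (numerically), breaking ties alphabetically by name.
ranked = sorted(results, key=lambda r: (r[1], r[0]))
for name, placement in ranked:
    print(name, placement)
```

Because each name and placement travel together in one tuple, they cannot get out of step during sorting. Passing reverse=True to sorted would flip the order if the largest number should come first.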

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.

Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop rows with a repeated timestamp, keeping the first one observed (use keep='last' to keep the last instead)
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35

Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print "\nKey: " + key
            i = 1
            for value in rows[key]:
                print "\nValue {index} : {value}".format(index = i, value = value)
                i += 1
    def readCsv(self, fileName):
        with open(fileName, 'rU') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = { key.strip(): key for (key) in reader.fieldnames }
            # Group the rows by timestamp; groupby() only groups consecutive rows,
            # so this assumes the file is already ordered by Contest_Date_EST
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v) for v in g]
            return groupedRows
    def normalizeRow(self, row):
        # Look the column up by name; the order of dict values is not guaranteed
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',',''))
        # and so on
        return row
if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
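For comparison, the same two steps (numeric conversion and keeping one row per timestamp) can be sketched with only the standard library on Python 3. This assumes the column names from the question; the helper names to_float and unique_by_timestamp are illustrative, and the sketch keeps the first row seen for each timestamp.

```python
import csv
import io

def to_float(text):
    """Convert a string such as '3,000.00' to a float."""
    return float(text.replace(",", ""))

def unique_by_timestamp(rows, key="Contest_Date_EST"):
    """Keep only the first row seen for each distinct timestamp."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

# Two sample rows in the question's format, with most columns omitted.
sample = io.StringIO(
    '"Sport","Contest_Date_EST","Prize_Pool"\n'
    '"NBA","2015-03-01 13:00:00","3,000.00"\n'
    '"NBA","2015-03-01 13:00:00","1,500.00"\n'
)
rows = list(csv.DictReader(sample))
for row in rows:
    row["Prize_Pool"] = to_float(row["Prize_Pool"])
unique_rows = unique_by_timestamp(rows)
print(unique_rows)
```

Unlike itertools.groupby, the seen-set approach does not require the rows to be sorted by timestamp first.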

Input file, modify column, output file

I have data in a text file and I would like to be able to modify the file by columns and output the file again. I normally write in C (basic ability) but chose Python for its obvious string benefits. I haven't ever used Python before so I'm a tad stuck. I have been reading up on similar problems but they only show how to change whole lines. To be honest I have no clue what to do.
Say I have the file
1 2 3
4 5 6
7 8 9
and I want to be able to change column two with some function say multiply it by 2 so I get
1 4 3
4 10 6
7 16 9
Ideally I would be able to easily change the program so I apply any function to any column.
For anyone who is interested it is for modifying lab data for plotting. eg take the log of the first column.

Python is an excellent general-purpose language; however, if you are on a Unix-based system you might take a look at awk. The awk language is designed for this kind of text-based transformation. Its power is easily seen in your question, as the solution is only a few characters: awk '{$2=$2*2;print}'.
$ cat file
1 2 3
4 5 6
7 8 9
$ awk '{$2=$2*2;print}' file
1 4 3
4 10 6
7 16 9
# Multiply the third column by 10
$ awk '{$3=$3*10;print}' file
1 2 30
4 5 60
7 8 90
In awk each column is referenced by $i, where i is the field number. So we just set the value of the second field to be the value of the second field multiplied by two and print the line. This can be written even more concisely as awk '{$2=$2*2}1' file, but it is best to be clear at the beginning.

Here is a very simple Python solution:
for line in open("myfile.txt"):
    col = line.strip().split(' ')
    print col[0],int(col[1])*2,col[2]
There are plenty of improvements that could made but I'll leave that as an exercise for you.
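One such improvement is letting any function be applied to any column, which is what the question asks for. The following is a minimal Python 3 sketch, assuming whitespace-separated numeric columns; apply_to_column is a made-up name, not part of any library.

```python
import math

def apply_to_column(lines, col, func):
    """Apply func to the zero-based column col of each whitespace-separated line."""
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = func(float(fields[col]))
        # %g keeps up to 6 significant digits and drops a trailing '.0'.
        fields[col] = "%g" % value
        out.append(" ".join(fields))
    return out

lines = ["1 2 3", "4 5 6", "7 8 9"]
# Double the second column (index 1).
doubled = apply_to_column(lines, 1, lambda x: x * 2)
print("\n".join(doubled))
# Taking the log of the first column works the same way.
logged = apply_to_column(lines, 0, math.log)
print("\n".join(logged))
```

To process a real file, the lines could come from open("myfile.txt") and the returned strings could be written out with a trailing newline each.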

I would use pandas or just numpy. Read your file with:
import pandas as pd
data = pd.read_csv('file.txt', header=None, sep=r'\s+')
then work with the data in a spreadsheet-like style, e.g.:
data[1] *= 2
finally write it back to a file with:
data.to_csv('output.txt', sep=' ', header=False, index=False)

As @sudo_O said, there are much more efficient tools than Python for this task. However, here is a possible solution:
from itertools import imap, repeat
import csv
fun = pow
with open('m.in', 'r') as input_file:
    with open('m.out', 'wb') as out_file:
        inpt = csv.reader(input_file, delimiter=' ')
        out = csv.writer(out_file, delimiter=' ')
        for row in inpt:
            row = [int(e) for e in row]  # conversion
            opt = repeat(2, len(row))  # square power for every value
            # write ( function(data, argument) )
            out.writerow([str(elem) for elem in imap(fun, row, opt)])
Here it multiplies every number by itself, but you can configure it to square only the second column by changing opt: opt = [1 + (col == 1) for col in range(len(row))] (exponent 2 for column 1, 1 otherwise).

Extracting Groups

Using Python 3.2, I was hoping to solve the issue below. My data consist of hundreds of rows (each signifying a project) and 21 columns. The first column is a unique project ID and the other 20 columns hold the group of people, or the person, that led the project. person_1 is always filled, and if there is a name in person_3 that means 3 people are working together. If there is a name in person_18 that means 18 people are working together.
I have an excel spreadsheet that is setup the following way:
unique ID   person_1   person_2   person_3   person_4   ...   person_20
12          Tom        Sally      Mike
16          Joe        Mike
5           Joe        Sally
1           Sally      Mike       Tom
6           Sally      Tom        Mike
2           Jared      Joe        Mike       John       ...   Carl
I want to do a few things:
1) Make a column that will give me a unique 'Group Name' which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.
2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike. Meaning, I would like another column that makes the group name in alphabetical order (no matter the actual permutation), still separated by '/'.
3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different than Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).
Thanks for the help and suggestions

The previous answer gave a clear statement of method, but perhaps you are stuck on either the string processing or the csv processing. Both are demonstrated in the following code. The relevant tools are the built-in sorted function and the string method join. '/'.join tells join to use / as the separator between joined items. The + operator between lists in the tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error testing to the file opens, etc. The data file used to test this code is shown after the code.
import csv
# newline='' is what the csv module expects for files opened in Python 3
fi = open('xgroup.csv', newline='')
fo = open('xgroup3.csv', 'w', newline='')
w = csv.writer(fo)
r = csv.reader(fi)
li = 0
print("Opened reader and writer")
for row in r:
    gname = '/'.join(row[1:])                    # names in the order listed
    sname = '/'.join(sorted(row[1:]))            # names in alphabetical order
    tname = '/'.join([row[1]]+sorted(row[2:]))   # person_1 first, rest sorted
    w.writerow([row[0], gname, sname, tname]+row[1:])
    li += 1
fi.close()
fo.close()
print("Closed reader and writer after", li, "lines")
File xgroup.csv is shown next.
unique-ID,person_1,person,_2,person_3,person_4,...,person_20
12,Tom,Sally,Mike
16,Joe,Mike
5,Joe,Sally
1,Sally,Mike,Tom
6,Sally,Tom,Mike
2,Jared,Joe,Mike,John,...,Carl
Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.
unique-ID,person_1/person/_2/person_3/person_4/.../person_20,.../_2/person/person_1/person_20/person_3/person_4,person_1/.../_2/person/person_20/person_3/person_4,person_1,person,_2,person_3,person_4,...,person_20
12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
Note, given a data line like
5,Joe,Sally,,,,,
instead of
5,Joe,Sally
the program as above produces
5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
instead of
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
If that's a problem, filter out empty entries. For example, if
row=['5', 'Joe', 'Sally', '', '', '', '', ''], then
'/'.join(row[1:]) produces
'Joe/Sally/////', while
'/'.join(filter(lambda x: x, row[1:])) and
'/'.join(x for x in row[1:] if x) and
'/'.join(filter(len, row[1:])) produce
'Joe/Sally' .
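Putting the filtering and the three joins together, a minimal Python 3 sketch might look like the following. The function name group_names is hypothetical, and each row is assumed to start with the project ID followed by at least one name.

```python
def group_names(row):
    """Return (as_listed, alphabetical, leader_first) group names for one row."""
    # Drop the ID and any empty cells left by trailing commas.
    people = [name.strip() for name in row[1:] if name.strip()]
    as_listed = "/".join(people)
    alphabetical = "/".join(sorted(people))
    # Keep person_1 in front and alphabetize everyone else.
    leader_first = "/".join(people[:1] + sorted(people[1:]))
    return as_listed, alphabetical, leader_first

print(group_names(["1", "Sally", "Mike", "Tom"]))
print(group_names(["5", "Joe", "Sally", "", "", "", "", ""]))
```

Rows 1 and 6 of the sample data get the same alphabetical and leader-first names, which is the grouping behaviour asked for in points 2 and 3 of the question.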

You could do the following:
1) Export your file to a .csv file from Excel
2) Open that input file using python's csv module, using csv.reader
3) Open another file (output) to write to it using csv.writer
4) Iterate over each row in your reader, do your treatment, and write that using your writer
5) Import the output file in Excel

Turning project data into a relationship matrix

My data set is a list of people either working together or alone.
I have a row for each project and columns with the names of all the people who worked on that project. If column 2 is the first empty column in a row, it was a solo job; if column 4 is the first empty column in a row, then 3 people were working together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this a square N x N matrix would be created, with every actor labeling a column and a row; cells (A,B) and (B,A) would hold how many times that pair worked together, and this would be done for every pair.
I know of a 'pretty' quick way to do it in Excel but I want it automated, hopefully in Stata or Python, just in case projects are added or removed I can just 1-click the re-run and not have to re-do it every time.
An example of the data, in a comma delimited fashion:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
F,D
B
F
F,X,C
C,F,D
Hope that helps!
Brice.

Maybe something like this would get you started?
import csv
import collections
import itertools
grid = collections.Counter()
with open("connect.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1
actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor,] + [grid[actor, other] for other in actors]
        writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are: (1) csv, which makes reading and writing csv files much simpler; (2) collections, which provides an object called a Counter, a dictionary that automatically generates default values so you don't have to (here the default count is 0), much like a defaultdict(int), which you could use if your Python doesn't have Counter; and (3) itertools, which has a combinations function to get all the pairs.
The code produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
You could use itertools.product to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.

If I were to keep this project around for a while, I'd implement a database and then create the matrix you're talking about from a query against that database.
You have a Project table (let's say) with one record per project, an Actor table with one row per person, and a Participant table with a record per project for each actor that was in that project. (Each record would have an ID, a ProjectID, and an ActorID.)
From your example, you'd have 14 Project records, 7 Actor records (A through F, and X), and 31 Participant records.
Now, with this set up, each cell is a query against this database.
To reconstruct the matrix, first you'd add/update/remove the appropriate records in your database, and then rerun the query.
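A minimal sketch of that design, using Python's built-in sqlite3 module and an in-memory database. The table and column names follow the description above, but the exact schema and the self-join query are assumptions made for illustration, and only three of the example projects are loaded.

```python
import sqlite3

projects = [["A"], ["A", "B"], ["A", "B", "C"]]  # a few rows from the example

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Project (ID INTEGER PRIMARY KEY);
    CREATE TABLE Actor (ID INTEGER PRIMARY KEY, Name TEXT UNIQUE);
    CREATE TABLE Participant (
        ID INTEGER PRIMARY KEY,
        ProjectID INTEGER REFERENCES Project(ID),
        ActorID INTEGER REFERENCES Actor(ID)
    );
""")
for project_id, names in enumerate(projects, start=1):
    conn.execute("INSERT INTO Project (ID) VALUES (?)", (project_id,))
    for name in names:
        # Add the actor only if not already present, then look up the ID.
        conn.execute("INSERT OR IGNORE INTO Actor (Name) VALUES (?)", (name,))
        actor_id = conn.execute(
            "SELECT ID FROM Actor WHERE Name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO Participant (ProjectID, ActorID) VALUES (?, ?)",
            (project_id, actor_id),
        )

# Self-join Participant on ProjectID to count how often each pair co-occurs.
# The ActorID comparison ensures each unordered pair is counted once.
pair_counts = conn.execute("""
    SELECT a1.Name, a2.Name, COUNT(*)
    FROM Participant p1
    JOIN Participant p2
      ON p1.ProjectID = p2.ProjectID AND p1.ActorID < p2.ActorID
    JOIN Actor a1 ON a1.ID = p1.ActorID
    JOIN Actor a2 ON a2.ID = p2.ActorID
    GROUP BY a1.Name, a2.Name
    ORDER BY a1.Name, a2.Name
""").fetchall()
print(pair_counts)
```

Each returned row fills both cell (A,B) and cell (B,A) of the matrix. After projects are added or removed, rerunning the query gives the updated counts.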

I guess that you don't have thousands of people working together in these projects. This implementation is pretty simple.
from pprint import pprint
# counts how many times each pair worked together
pairs = {}
with open('projects.csv') as fp:
    # each element of `project` is a person
    for project in (p.strip().split(',') for p in fp):
        project.sort()
        # someone is alone here
        if len(project) == 1:
            continue
        # iterate over each pair
        for i in range(len(project)):
            for j in range(i+1, len(project)):
                pair = (project[i], project[j])
                # increase `pairs` counter
                pairs[pair] = pairs.get(pair, 0) + 1
pprint(pairs)
It outputs:
{('A', 'B'): 1,
('B', 'C'): 2,
('B', 'D'): 1,
('B', 'E'): 1,
('B', 'F'): 2,
('C', 'E'): 1,
('C', 'F'): 1,
('D', 'F'): 1}

I suggest using Python Pandas for this. It enables a slick solution for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.
I assume that the example data you listed is saved into a file called projects_data.csv (it doesn't need to actually be a .csv file though). I also assume no blank lines between each observations, but this is all just file organization details.
Here's my code for this:
# File I/O part
import itertools, pandas, numpy as np
with open("projects_data.csv") as tmp:
    lines = tmp.readlines()
lines = [line.split('\n')[0].split(',') for line in lines]
# Unique letters
s = set(list(itertools.chain(*lines)))
# Actual work.
df = pandas.DataFrame(
    np.zeros((len(s),len(s))),
    columns=sorted(list(s)),
    index=sorted(list(s))
)
for line in lines:
    if len(line) == 1:
        df.loc[line[0],line[0]] += 1 # Single-person projects
    elif len(line) > 1:
        # Get all pairs in multi-person project.
        tmp_pairs = list(itertools.combinations(line, 2))
        # Append pair reversals to update (i,j) and (j,i) for each pair.
        tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
        for pair in tmp_pairs:
            df.loc[pair[0], pair[1]] +=1
            # Uncomment below if you don't want the list
            # comprehension method for getting the reversals.
            #df.loc[pair[1], pair[0]] +=1
# Final product
print(df.to_string())
A B C D E F X
A 1 2 1 1 0 0 1
B 2 1 3 1 2 1 0
C 1 3 0 1 2 2 1
D 1 1 1 0 0 3 1
E 0 2 2 0 0 0 0
F 0 1 2 3 0 1 1
X 1 0 1 1 0 1 0
Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:
>>> df.sum()
A 6
B 10
C 10
D 7
E 4
F 8
X 4
