convert crosstab to columns without using pandas in python

How do I convert the crosstab data from the input file mentioned below into columns based on the input list without using pandas?
Input list
[A,B,C]
Input data file
Labels A, B, C are only for representation; the original file only has the numeric values.
We can ignore the columns XX & YY based on the length of the input list.
A B C XX YY
A 0 2 3 4 8
B 4 0 6 4 8
C 7 8 0 5 8
Output (Output needs to have labels)
A A 0
A B 2
A C 3
B A 4
B B 0
B C 6
C A 7
C B 8
C C 0
The labels need to be present in the output file even though they are not present in the input file, hence I have shown them in the representation above.
NB: In reality the labels are sorted city names without duplicates in ascending order, not single letters like A or B.
Unfortunately this would have been easier if I could install pandas on the server & use unstack(), but installations aren't allowed on this old server right now.
This is on python 3.5

Considering you tagged the post csv, I'm assuming the actual input data is a .csv file, without a header, as you indicated.
So example data would look like:
0,2,3,4,8
4,0,6,4,8
7,8,0,5,8
If the labels are provided as a list matching the order of the columns and rows (i.e. ['A', 'B', 'C']), this would turn the example output into:
'A','A',0
'A','B',2
'A','C',3
'B','A',4
etc.
Note that this implies the number of rows and columns in the file cannot exceed the number of labels provided.
You indicate that the columns labelled 'XX' and 'YY' are to be ignored, but not how that's supposed to be communicated; since you do mention the length of the input list determines it, I assume this means 'everything after column n can be ignored'.
This is a simple implementation:
from csv import reader

def unstack_csv(fn, columns, labels):
    with open(fn) as f:
        cr = reader(f)
        row = 0
        for line in cr:
            col = 0
            # only the first `columns` values count; anything after is ignored
            for x in line[:columns]:
                yield labels[row], labels[col], x
                col += 1
            row += 1

print(list(unstack_csv('unstack.csv', 3, ['A', 'B', 'C'])))
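On the example file above, this prints (note the csv reader yields the values as strings):
[('A', 'A', '0'), ('A', 'B', '2'), ('A', 'C', '3'),
 ('B', 'A', '4'), ('B', 'B', '0'), ('B', 'C', '6'),
 ('C', 'A', '7'), ('C', 'B', '8'), ('C', 'C', '0')]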
or if you like it short and sweet:
from csv import reader

with open('unstack.csv') as f:
    content = reader(f)
    labels = ['A', 'B', 'C']
    print([(labels[row], labels[col], x)
           for row, data in enumerate(content)
           for col, x in enumerate(data) if col < 3])
(I'm also assuming using numpy is out, for the same reason as pandas, but that stuff like csv is in, since it's part of the standard library.)
If you don't want to provide the labels explicitly, but just want them generated, you could do something like:
def label(n):
    r = n // 26
    c = chr(65 + (n % 26))
    if r > 0:
        return label(r - 1) + c
    else:
        return c
And then of course just remove the labels from the examples and replace with calls to label(col) and label(row).
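For instance, a minimal sketch of the generator variant with auto-generated labels (the function name unstack_csv_autolabel is mine; same unstack.csv assumption as above):

from csv import reader

def unstack_csv_autolabel(fn, columns):
    # like unstack_csv, but generates spreadsheet-style labels A..Z, AA, AB, ...
    with open(fn) as f:
        for row, line in enumerate(reader(f)):
            for col, x in enumerate(line[:columns]):
                yield label(row), label(col), x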


Blank column appearing in .csv output, how can I remove it?

*Updated to add more lines of input file
I have a .csv file with header and subsequent data as follows (shown only first few rows here):
gene_name VarXCRep.1 VarX1Rep.1 VarX2Rep.1 VarXCRep.2 VarX3Rep.2 VarX1Rep.2 VarX2Rep.2 VarXCRep.3 VarX3Rep.3 VarX1Rep.3 VarX2Rep.3
1 Soltu.DM.01G000010 360.7000522 395.2279977 323.2595994 361.5910696 327.7380499 386.8290979 336.3997167 333.0843759 317.4954424 377.756613 396.666783
2 Soltu.DM.01G000020 91.12422371 69.30538348 77.36127164 135.060696 61.85252412 110.6099 68.21624475 108.7053612 55.31681029 56.52040232 36.14709293
3 Soltu.DM.01G000030 439.1681337 183.5656103 232.0838149 579.546161 220.9018719 179.6646995 179.2348391 291.2746216 222.4196747 266.8621527 208.321404
4 Soltu.DM.01G000040 268.3102142 185.4387288 192.0217278 301.5640936 130.9345641 237.108515 203.9799475 236.921941 92.19468382 198.1791322 38.04957151
5 Soltu.DM.01G000050 341.7158389 479.5183289 504.229717 322.2876925 528.5579334 390.4957244 470.1570594 342.8399852 554.3205365 424.9761896 634.4766049
6 Soltu.DM.01G000060 468.2772607 839.1570756 759.7982036 514.516937 886.0173261 572.6048416 579.8380803 549.1014398 1011.836655 598.8300854 1077.754113
7 Soltu.DM.01G000070 2.531228436 0 5.525805117 1.429213714 8.032795341 1.83331326 5.350293706 0 4.609734191 0 7.609914302
8 Soltu.DM.01G000090 84.79615262 54.3204357 75.97982036 98.61574626 102.0165008 83.11020113 84.26712586 108.7053612 98.53306833 80.13019064 93.2214502
9 Soltu.DM.01G000100 67.07755356 73.05162042 12.43306151 118.6247383 6.426236273 77.61026135 36.11448251 97.55609336 8.643251608 67.25212429 15.2198286
10 Soltu.DM.01G000110 1.265614218 0 1.381451279 2.143820571 0 1.22220884 4.012720279 0 2.304867095 0.715448131 0.951239288
11 Soltu.DM.01G000120 821.3836276 451.4215518 846.8296342 820.3686718 737.4106123 497.4389979 835.9833915 798.5663071 752.5391067 704.7164087 532.6940011
12 Soltu.DM.01G000130 2.531228436 3.746236945 5.525805117 2.143820571 0.803279534 0.61110442 2.00636014 1.393658477 1.728650322 2.146344392 10.46363217
13 Soltu.DM.01G000140 93.65545214 127.3720561 102.2273947 105.7618148 104.4263394 108.7765868 115.7001014 98.94975183 108.9049703 110.8944603 126.5148253
14 Soltu.DM.01G000150 112.6396654 84.29033126 91.17578444 86.46742969 154.2296705 99.61002047 111.0185944 115.6736536 111.7860541 115.187149 163.6131575
15 Soltu.DM.01G000160 644.197637 573.1742525 222.413656 760.3416958 178.3280566 761.4361074 594.551388 1053.605808 222.4196747 585.2365709 303.4453328
16 Soltu.DM.01G000170 751.7748456 841.0301941 910.3763931 773.9192261 835.4107154 820.7132361 1148.975573 804.140941 849.3435247 710.4399938 946.4830913
17 Soltu.DM.01G000190 6.328071091 1.873118472 5.525805117 6.431461713 8.836074875 5.49993978 8.694227272 11.14926781 4.609734191 7.869929438 0.951239288
18 Soltu.DM.01G000200 88.59299527 73.05162042 66.30966141 74.31911313 63.45908319 78.83247019 74.23532517 86.40682554 59.35032771 59.38219485 44.70824652
19 Soltu.DM.01G000210 108.8428228 112.3871083 85.64997932 111.4786697 73.0984376 123.4430928 113.6937412 143.5468231 67.41736254 77.26839812 86.56277518
20 Soltu.DM.01G000220 5.062456873 86.16344973 93.938687 20.72359885 507.6726655 30.555221 24.74510839 6.968292383 551.4394526 54.37405793 920.7996305
This is how the file appears in Bash shell
gene_name,VarXCRep.1,VarX1Rep.1,VarX2Rep.1,VarXCRep.2,VarX3Rep.2,VarX1Rep.2,VarX2Rep.2,VarXCRep.3,VarX3Rep.3,VarX1Rep.3,VarX2Rep.3
Soltu.DM.01G000010,360.7000522,395.2279977,323.2595994,361.5910696,327.7380499,386.8290979,336.3997167,333.0843759,317.4954424,377.756613,396.666783
Soltu.DM.01G000020,91.12422371,69.30538348,77.36127164,135.060696,61.85252412,110.6099,68.21624475,108.7053612,55.31681029,56.52040232,36.14709293
Soltu.DM.01G000030,439.1681337,183.5656103,232.0838149,579.546161,220.9018719,179.6646995,179.2348391,291.2746216,222.4196747,266.8621527,208.321404
Soltu.DM.01G000040,268.3102142,185.4387288,192.0217278,301.5640936,130.9345641,237.108515,203.9799475,236.921941,92.19468382,198.1791322,38.04957151
Soltu.DM.01G000050,341.7158389,479.5183289,504.229717,322.2876925,528.5579334,390.4957244,470.1570594,342.8399852,554.3205365,424.9761896,634.4766049
Soltu.DM.01G000060,468.2772607,839.1570756,759.7982036,514.516937,886.0173261,572.6048416,579.8380803,549.1014398,1011.836655,598.8300854,1077.754113
Soltu.DM.01G000070,2.531228436,0,5.525805117,1.429213714,8.032795341,1.83331326,5.350293706,0,4.609734191,0,7.609914302
Soltu.DM.01G000090,84.79615262,54.3204357,75.97982036,98.61574626,102.0165008,83.11020113,84.26712586,108.7053612,98.53306833,80.13019064,93.2214502
Soltu.DM.01G000100,67.07755356,73.05162042,12.43306151,118.6247383,6.426236273,77.61026135,36.11448251,97.55609336,8.643251608,67.25212429,15.2198286
I was asked to remove various types of columns and their associated data, which I have done successfully in the following code. I was then asked to arrange the data such that the headers show the control (VarXC) repeats 1, 2 and 3 and the experiment 1 (VarX1) repeats in columns next to each other, which has also been done in the following code:
empty_list = []
for ln in open("FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list += col
    empty_list += '\n'
file_out = open("Xtest_2Var.csv", "w")
file_out.write(','.join(empty_list))
file_out.close()
When I try to compile all this information, the output shows up like this:
[screenshot: This is the final output]
I am not sure how I am getting that space on the left side. Can someone help me remove it so that all the rows shift by one cell to the left?
You should change the code a little bit to make it work as you expect. The problem with your code is that you are constructing a single list to which you add EOL \n as elements. Therefore, when you write this list to a file
file_out.write(','.join(empty_list))
there will be a comma after each line break. I construct a list of lists and add \n right after join to avoid your problem:
empty_list = []
for ln in open("files/FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list.append(col)

file_out = open("files/Xtest_2Var.csv", "w")
for item in empty_list:
    file_out.write(','.join(item) + '\n')
file_out.close()
But it's better to use the csv library, which is designed for reading and writing csv files.
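For example, a minimal sketch of the same column shuffle using the csv module (assuming the same FinalXVartest.csv layout as above):

import csv

with open("FinalXVartest.csv", newline="") as f_in, \
     open("Xtest_2Var.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for col in reader:
        # drop the same five columns as the original code
        del col[3]
        del col[4]
        del col[5]
        del col[6]
        del col[7]
        # move columns 2, 3 and 4 to the end, one at a time
        col.append(col.pop(2))
        col.append(col.pop(3))
        col.append(col.pop(4))
        writer.writerow(col)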
Using pandas:
import pandas as pd
import re

df = pd.read_csv('FinalXVartest.csv', index_col='gene_name')
# sort the columns by (repeat number, variant letter) so repeats sit side by side
parsed = sorted([(re.match(r'VarX(.)Rep\.(\d)', k).groups()[::-1], k) for k in df.columns])
# keep only the control (C) and experiment 1 columns
cols = [k for (i, j), k in parsed if j in {'1', 'C'}]
df[cols].to_csv('Xtest_2Var.csv')
>>> df[cols]
VarX1Rep.1 VarXCRep.1 VarX1Rep.2 VarXCRep.2 VarX1Rep.3 VarXCRep.3
gene_name
Soltu.DM.01G000010 395.227998 360.700052 386.829098 361.591070 377.756613 333.084376
Soltu.DM.01G000020 69.305383 91.124224 110.609900 135.060696 56.520402 108.705361
Soltu.DM.01G000030 183.565610 439.168134 179.664700 579.546161 266.862153 291.274622
Soltu.DM.01G000040 185.438729 268.310214 237.108515 301.564094 198.179132 236.921941
Soltu.DM.01G000050 479.518329 341.715839 390.495724 322.287692 424.976190 342.839985
Soltu.DM.01G000060 839.157076 468.277261 572.604842 514.516937 598.830085 549.101440
Soltu.DM.01G000070 0.000000 2.531228 1.833313 1.429214 0.000000 0.000000
Soltu.DM.01G000090 54.320436 84.796153 83.110201 98.615746 80.130191 108.705361
Soltu.DM.01G000100 73.051620 67.077554 77.610261 118.624738 67.252124 97.556093

Save each Excel-spreadsheet-row with header in separate .txt-file (saved as a parameter-sample to be read by simulation programs)

I'm a building energy simulation modeller with an Excel question, to enable automated large-scale simulations using parameter samples (samples generated using Monte Carlo). Now I have the following question about saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Anybody got an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question about Excel. Solutions in Python are also welcome if that's easier for you people.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try making this as simple as possible. I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install the xlrd package, in addition to pandas)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive one would be to loop over all items. Use string formatting, which is best explained over here, and put the strings together the way you need them:
outputs = {}
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open. I'll just name the files by the index of the row, but this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique like the date and time, or the name of the file you read, or increment the file name further across multiple runs of the script.
All together we get:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    file = open('test_{:03}.txt'.format(file_count), "w")
    file.write(s)
    file.close()
    file_count += 1
Note that it's probably not the most elegant way and that there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive approach that you can tweak yourself easily.
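For completeness, a sketch of a more compact variant of the same loop (assuming the df from above):

# one file per row, built with a generator expression instead of nested loops
for i, (_, row) in enumerate(df.iterrows()):
    with open('test_{:03}.txt'.format(i), 'w') as f:
        f.write(''.join('{}={};\n'.format(col, val) for col, val in row.items()))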
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long

    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x

    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file then this python script will do what you want.
with open('data.csv') as file:
    data_list = [l.rstrip('\n').split(',') for l in file]

counter = 1
for x in range(1, len(data_list)):
    output_file_name = str(counter) + '.txt'
    with open(output_file_name, 'w') as file:
        # pair each header cell with the value in this row
        for x in range(len(data_list[counter])):
            output_string = data_list[0][x] + '=' + data_list[counter][x] + ';\n'
            file.write(output_string)
    counter += 1

Store and find higher value index in a numpy array

I used numpy.loadtxt to load a file that has this structure:
99 0 1 2 3 ... n
46 0.137673 0.147241 0.130374 0.155461 ... 0.192291
32 0.242157 0.186015 0.153261 0.152680 ... 0.154239
77 0.163889 0.176748 0.184754 0.126667 ... 0.191237
12 0.139989 0.417530 0.148208 0.188872 ... 0.141071
64 0.172326 0.172623 0.196263 0.152864 ... 0.168985
50 0.145201 0.156627 0.214384 0.123387 ... 0.187624
92 0.127143 0.133587 0.133994 0.198704 ... 0.161480
Now I need the first column (except the first line) to store the index of the highest value in its line.
At the end, save this array in a file with the same number format as the original.
Thanks.
Can you use numpy.argmax, something like this:
import numpy as np
# This is a simple example. In your case, A is loaded with np.loadtxt
A = np.array([[1, 2.0, 3.0], [3, 1.0, 2.0], [2.0, 4.0, 3.0]])
B = A.copy()
# Copy the max indices of rows of A into first column of B
B[:,0] = np.argmax(A[:,1:], 1)
# Save the results using np.savetxt with fmt, dynamically generating the
# format string based on the number of columns in B (setting the first
# column to integer and the rest to float)
np.savetxt('/path/to/output.txt', B, fmt='%d' + ' %f' * (B.shape[1]-1))
Note that np.savetxt allows for formatting.
This example code doesn't address the fact that you want to skip the first row, and you might need to adjust the result of np.argmax by 1, depending on whether the index should count the index column (0) or only the remaining columns.
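For what it's worth, a sketch that does handle the first row, assuming the first line is a header of column ids (0..n) that should pass through unchanged:

import numpy as np

with open('data.txt') as f:
    header = f.readline().rstrip('\n')   # keep the header line as-is

A = np.loadtxt('data.txt', skiprows=1)   # load only the data rows
B = A.copy()
B[:, 0] = np.argmax(A[:, 1:], 1)         # 0-based, matching the header ids

with open('out.txt', 'w') as f:
    f.write(header + '\n')
    np.savetxt(f, B, fmt='%d' + ' %f' * (B.shape[1] - 1))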
Your data look like a DataFrame with columns and an index: the data types are not homogeneous. It is more convenient to do this with pandas, which manages this layout natively:
import pandas as pd

a = pd.DataFrame.from_csv('data.txt', sep=' *')
u = a.set_index(a.values.argmax(axis=1)).to_string()
with open('out.txt', 'w') as f:
    f.write(u)
then out.txt is
0 1 2 3 4
4 0.137673 0.147241 0.130374 0.155461 0.192291
0 0.242157 0.186015 0.153261 0.152680 0.154239
4 0.163889 0.176748 0.184754 0.126667 0.191237
1 0.139989 0.417530 0.148208 0.188872 0.141071
2 0.172326 0.172623 0.196263 0.152864 0.168985
2 0.145201 0.156627 0.214384 0.123387 0.187624
3 0.127143 0.133587 0.133994 0.198704 0.161480
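Note that DataFrame.from_csv has since been removed from pandas; on current versions an equivalent read would be something like:

import pandas as pd

# whitespace-separated, first row as header, first column as index
a = pd.read_csv('data.txt', sep=r'\s+', index_col=0)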

Formatting list data by table

I am trying to analyse some data, but my data contains letters which require standardising. What I would like to be able to do is, for every datatable in the data (this csv data contains 3 datatables), replace the letter T, or any other letter for that matter, with the next highest integer for that table. The first table contains no errors, the second table contains one T, and the third contains two anomalous letters (a Q and a T).
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T
I am expecting this to be a relatively easy thing to code; however, whilst I know how to replace all T's with a number within a particular column or row, I do not know how to replace each T with a different number depending on the datatable it is in. Essentially I am looking to produce the following from the above:
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,7
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,6
DatatableC,5
DatatableC,6
Here nothing happened in DatatableA. In DatatableB the only T was replaced with the next highest integer, in this case 7. In DatatableC there were two anomalous data points, which were both replaced with the next highest integer, 6.
If anyone can point me in the right direction or provide a snippet of something, It would be greatly appreciated. As always constructive comments are also appreciated.
Edit in reply to elyase
I attempted to run the code:
import pandas as pd

df = pd.read_csv('test.csv', sep=',', header=None, names=['datatable', 'col'])

def replace_letter(group):
    letters = group.isin(['T', 'Q'])                 # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group

df['col'] = df.groupby('datatable').transform(replace_letter)
print df
and I received the traceback:
Traceback (most recent call last):
  File "C:/test.py", line 11, in <module>
    df['col'] = df.groupby('datatable').transform(replace_letter)
  File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1981, in transform
    res = path(group)
  File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2006, in <lambda>
    slow_path = lambda group: group.apply(lambda x: func(x, *args, **kwargs), axis=self.axis)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4416, in apply
    return self._apply_standard(f, axis)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
    raise e
ValueError: ("invalid literal for int() with base 10: 'col'", u'occurred at index col')
Is there something I have used incorrectly? I could use AEA's answer, but I have been meaning to use pandas more, as the library seems so useful for data manipulation.
Pandas is ideal for this kind of task:
Read your csv:
>>> import pandas as pd
>>> df = pd.read_csv('data.csv', sep=',', header=None, names=['datatable', 'col'])
>>> df.head()
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
Group, select and replace max:
def replace_letter(group):
    letters = group.isin(['T', 'Q'])                 # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group
>>> df['col'] = df.groupby('datatable').transform(replace_letter)
>>> df
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
5 DatatableB 1
6 DatatableB 6
7 DatatableB 7
8 DatatableB 3
9 DatatableB 4
10 DatatableB 5
11 DatatableB 2
12 DatatableC 3
13 DatatableC 4
14 DatatableC 2
15 DatatableC 1
16 DatatableC 6
17 DatatableC 5
18 DatatableC 6
Write to csv:
df.to_csv('result.csv', index=None, header=None)
I suppose I have to answer the question asked by my own alter-ego. Seriously, does StackExchange not sanitize usernames?
Here's a solution; I'm not guaranteeing that it's efficient or elegant, but the logic is simple. First you iterate your dataset, check for anything that's not an integer string, and record the largest integer per table. Then you iterate again and replace the non-integer strings.
I am using StringIO as a replacement for a file just for convenience sake.
import csv
import string
from StringIO import StringIO

raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T"""

fp = StringIO()
fp.write(raw)
fp.seek(0)
reader = csv.reader(fp)

# First pass: record the largest integer seen per datatable.
data = []
mapping = {}
for row in reader:
    if row[0] not in mapping:
        mapping[row[0]] = float("-inf")
    if row[1] in string.digits:
        x = int(row[1])
        if x > mapping[row[0]]:
            mapping[row[0]] = x
    data.append(row)

# Second pass: replace each non-digit entry with the next integer for its table.
for i, row in enumerate(data):
    if row[1] not in string.digits:
        mapping[row[0]] += 1
        row[1] = str(mapping[row[0]])

fp.close()
fp = StringIO()
writer = csv.writer(fp)
writer.writerows(data)
print fp.getvalue()
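On Python 3, StringIO lives in io and the buffer can be seeded directly; a minimal sketch:

from io import StringIO

fp = StringIO(raw)   # no separate write()/seek() needed
# and use print(fp.getvalue()) instead of the Python 2 print statement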

Turning project data into a relationship matrix

My data set is a list of people either working together or alone.
I have a row for each project and columns with the names of all the people who worked on that project. If column 2 is the first empty column in a row, it was a solo job; if column 4 is the first empty column in a row, then there were 3 people working together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this, a square N x N matrix would be created, with every actor labelling a column and a row; cells (A,B) and (B,A) would hold how many times that pair worked together, and this would be done for every pair.
I know of a 'pretty' quick way to do it in Excel, but I want it automated, hopefully in Stata or Python, so that if projects are added or removed I can just re-run it with one click and not have to redo it every time.
An example of the data, in a comma delimited fashion:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
F,D
B
F
F,X,C
C,F,D
Hope that helps!
Brice.
Maybe something like this would get you started?
import csv
import collections
import itertools

grid = collections.Counter()
with open("connect.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor] + [grid[actor, other] for other in actors]
        writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are (1) csv, which makes reading and writing csv files much simpler; (2) collections, which provides an object called a Counter: a dictionary that automatically generates default values so you don't have to (here the default count is 0); if your Python doesn't have Counter, a defaultdict(int) works the same way here; and (3) itertools, which has a combinations function to get all the pairs.
which produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
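As an aside, a tiny sketch of the defaultdict fallback mentioned above:

from collections import defaultdict

grid = defaultdict(int)   # missing keys default to 0, just like Counter here
grid[('A', 'B')] += 1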
You could use itertools.product to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.
If I were to keep this project around for a while, I'd implement a database and then create the matrix you're talking about from a query against that database.
You have a Project table (let's say) with one record per project, an Actor table with one row per person, and a Participant table with a record per project for each actor that was in that project. (Each record would have an ID, a ProjectID, and an ActorID.)
From your example, you'd have 14 Project records, 7 Actor records (A through F, and X), and 31 Participant records.
Now, with this set up, each cell is a query against this database.
To reconstruct the matrix, first you'd add/update/remove the appropriate records in your database, and then rerun the query.
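A minimal sketch of that setup with the standard sqlite3 module (table and column names as described above; loading the records is elided):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Project (ID INTEGER PRIMARY KEY);
    CREATE TABLE Actor (ID INTEGER PRIMARY KEY, Name TEXT UNIQUE);
    CREATE TABLE Participant (
        ID INTEGER PRIMARY KEY,
        ProjectID INTEGER REFERENCES Project(ID),
        ActorID INTEGER REFERENCES Actor(ID));
""")
# ... insert the Project, Actor and Participant records here ...
# Each (A, B) cell of the matrix is then a count over shared projects:
cell_query = """
    SELECT COUNT(*)
    FROM Participant p1
    JOIN Participant p2
      ON p1.ProjectID = p2.ProjectID AND p1.ActorID != p2.ActorID
    WHERE p1.ActorID = ? AND p2.ActorID = ?
"""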
I guess that you don't have thousands of people working together in these projects. This implementation is pretty simple.
fp = open('projects.csv')

# counts how many times each pair worked together
pairs = {}

# each element of `project` is a person
for project in (p[:-1].split(',') for p in fp):
    project.sort()
    # someone is alone here
    if len(project) == 1:
        continue
    # iterate over each pair
    for i in range(len(project)):
        for j in range(i + 1, len(project)):
            pair = (project[i], project[j])
            # increase `pairs` counter
            pairs[pair] = pairs.get(pair, 0) + 1

from pprint import pprint
pprint(pairs)
It outputs:
{('A', 'B'): 1,
('B', 'C'): 2,
('B', 'D'): 1,
('B', 'E'): 1,
('B', 'F'): 2,
('C', 'E'): 1,
('C', 'F'): 1,
('D', 'F'): 1}
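If the square matrix itself is needed, a hypothetical follow-up (not part of the original answer) that writes pairs out as an N x N grid, relying on the keys being sorted tuples as built above:

import csv

actors = sorted({name for pair in pairs for name in pair})
with open('matrix.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow([''] + actors)
    for a in actors:
        writer.writerow([a] + [pairs.get(tuple(sorted((a, b))), 0) for b in actors])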
I suggest using Python pandas for this. It enables a slick solution for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.
I assume that the example data you listed is saved into a file called projects_data.csv (it doesn't actually need to be a .csv file, though). I also assume no blank lines between observations, but this is all just file-organization detail.
Here's my code for this:
# File I/O part
import itertools, pandas, numpy as np

with open("projects_data.csv") as tmp:
    lines = tmp.readlines()
lines = [line.split('\n')[0].split(',') for line in lines]

# Unique letters
s = set(list(itertools.chain(*lines)))

# Actual work.
df = pandas.DataFrame(
    np.zeros((len(s), len(s))),
    columns=sorted(list(s)),
    index=sorted(list(s))
)

for line in lines:
    if len(line) == 1:
        df.ix[line[0], line[0]] += 1  # Single-person projects
    elif len(line) > 1:
        # Get all pairs in multi-person project.
        tmp_pairs = list(itertools.combinations(line, 2))
        # Append pair reversals to update (i,j) and (j,i) for each pair.
        tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
        for pair in tmp_pairs:
            df.ix[pair[0], pair[1]] += 1
            # Uncomment below if you don't want the list
            # comprehension method for getting the reversals.
            #df.ix[pair[1], pair[0]] += 1

# Final product
print df.to_string()
A B C D E F X
A 1 2 1 1 0 0 1
B 2 1 3 1 2 1 0
C 1 3 0 1 2 2 1
D 1 1 1 0 0 3 1
E 0 2 2 0 0 0 0
F 0 1 2 3 0 1 1
X 1 0 1 1 0 1 0
Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:
>>> df.sum()
A 6
B 10
C 10
D 7
E 4
F 8
X 4
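And, as mentioned at the start of this answer, the raw matrix is one attribute away if you need it for graph-theoretical work; a sketch:

import numpy as np

M = df.values                       # co-work counts as a plain NumPy array
eigenvalues = np.linalg.eigvals(M)  # e.g. for spectral analysis of the clusters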
