Python txt matrix from multiple files

How can I convert line wise frequency distributions from multiple TXT files into a single matrix? Each of the files has exactly the same structure in that all words/terms/phrases are in the same order and contained in every file. Unique for each file is the filename, an issue date and the respective frequency of the words/terms/phrases given by a number after ":", see the following:
What my input files look like:
FilenameA Date:31.12.20XX
('financial' 'statement'):15
('corporate-taxes'):3
('assets'):8
('available-for-sale' 'property'):2
('auditors'):23
I have multiple files which have the exact same order of words/phrases and only differ in the frequency (the number after ":").
Now I want to create a single file containing a matrix which keeps all words/phrases as column headers and attaches the file characteristics (filename, date and frequencies) as rows:
Desired Output:
Filename Date ('financial' 'statement') ('corporate-taxes') ... ('auditors')
A 2008 15 3 23
B 2010 9 6 11
C 2013 1 8 4
...
Really appreciate any help, would be great to have a loop which reads all files from a directory and outputs the above matrix.

The following code should help you:
import os

# Compute matrix
titles = ['Filename', 'Date']
matrix = [titles]
for directory, __, files in os.walk('files'):  # replace 'files' with your directory
    for filename in files:
        with open(os.path.join(directory, filename)) as f:
            # first line looks like: "FilenameA Date:31.12.20XX"
            name, date = f.readline().strip().split()
            row = [name[8:], date.split('.')[-1]]
            for line in f:
                header, value = line.strip().split(':')
                if len(matrix) == 1:  # still on the first file: collect headers
                    titles.append(header)
                row.append(value)
        matrix.append(row)

# Work out column widths
column_widths = [0] * len(titles)
for row in matrix:
    for column, data in enumerate(row):
        column_widths[column] = max(column_widths[column], len(data))
formats = ['{:%s%ss}' % ('^' if c > 1 else '<', w) for c, w in enumerate(column_widths)]

# Print matrix
for row in matrix:
    for column, data in enumerate(row):
        print(formats[column].format(data), end=' ')
    print()
Sample output:
Filename Date ('financial' 'statement') ('corporate-taxes') ('assets') ('available-for-sale' 'property') ('auditors')
A 2012 15 3 8 2 23
B 2010 9 6 8 2 11
C 2010 1 8 8 2 4
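If pandas is available, the same matrix can be built more compactly by parsing each file into a dict and letting pandas line the columns up. A sketch under the assumption that every file follows the header format shown above; `matrix.csv` is an assumed output name:

```python
import os
import pandas as pd

def parse_file(path):
    """Parse one frequency file into a dict of column -> value."""
    with open(path) as f:
        # first line looks like: "FilenameA Date:31.12.20XX"
        name, date = f.readline().strip().split()
        row = {'Filename': name[8:], 'Date': date.split('.')[-1]}
        for line in f:
            header, value = line.strip().rsplit(':', 1)
            row[header] = int(value)
    return row

rows = [parse_file(os.path.join(d, fn))
        for d, _, files in os.walk('files') for fn in files]
df = pd.DataFrame(rows)
df.to_csv('matrix.csv', index=False)
```

Because every file has the phrases in the same order, each parsed dict has identical keys and the resulting DataFrame needs no further alignment.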

Related

Write out file names and its contents from a folder to a dataframe

I want to write the filename and its content to a dataframe. The data basically contains the figure and its captions info from articles. Eg) 2_1.jpg is a figure and 2_1.txt is the corresponding caption captured in Page 2 of the article. The folder structure looks like below,
Documents
-Article1
-2_1.jpg
-2_1.txt
-3_1.jpg
-3_1.txt
-3_2.jpg
-3_2.txt
-Article2
-2_1.jpg
-2_1.txt
-2_2.jpg
-2_2.txt
-3_2.jpg
-3_2.txt
I used a dictionary to store the data as key (filename) - value (contents) pairs, but not all the file names and contents are stored because some of the filenames are repeated (2_1.txt is present in both Article1 and Article2, though the contents differ). I used the code below to create the dataframe, but couldn't retain all filenames since duplicate keys are not allowed in a dict.
import os
import pandas as pd

# Create Dictionary for File Name and Text
file_name_and_text = {}
for path, dirs, files in os.walk('C:/Users/Project/Documents/'):
    for file in files:
        if file.endswith('.txt'):
            fullname = os.path.join(path, file)
            with open(fullname, "r") as target_file:
                file_name_and_text[file] = target_file.read()

df = (pd.DataFrame.from_dict(file_name_and_text, orient='index')
        .reset_index().rename(index=str, columns={'index': 'image_path', 0: 'text'}))
df['image_path'] = df['image_path'].str.replace('.txt', '.jpg', regex=False)
df.head()
Output:
image_path text
0 2_1.jpg ['Figure 1.Embedded trials of the...']
1 2_2.jpg [Figure 2. A) Helical wheel projections of zp1â...]
2 3_1.jpg ['Fig. 1. Positions of MHC side chain Trp167 a...]
3 3_2.jpg [Figure 2. A) CD spectra of zp3 in 50 % TFE wit...]
How can I retain all image file names and their corresponding text content? E.g. I want to retain 2_1.jpg under Article1 as well as 2_1.jpg under Article2, each with their respective contents.
Expected Output
image_path text
0 2_1.jpg ['Figure 1.Embedded trials of the...']
1 3_1.jpg ['Fig. 1. Positions of MHC side chain Trp167 a...]
2 3_2.jpg [Figure 2. A) CD spectra of zp3 in 50 % TFE wit...]
3 2_1.jpg [Figure 1. zp3 may lead to the pore formation....]
4 2_2.jpg [Figure 2. A) Helical wheel projections of zp1â...]
5 3_2.jpg [Figure 2. Close-up views of A) STEM...]
If using the full path (or a path prefix) as the key is acceptable, you can change the line:
file_name_and_text[file] = target_file.read()
to
file_name_and_text[fullname] = target_file.read()
or
file_name_and_text[os.path.join(os.path.basename(path), file)] = target_file.read()
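Alternatively, since the collisions come from using a dict at all, you can collect the pairs in a list of records; duplicate filenames are then preserved automatically. A sketch, reusing the directory path from the question:

```python
import os
import pandas as pd

# Collect (filename, text) pairs in a list so repeated names are kept.
records = []
for path, dirs, files in os.walk('C:/Users/Project/Documents/'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(path, file), "r") as target_file:
                records.append({'image_path': file, 'text': target_file.read()})

# Specifying the columns keeps the frame well-formed even if no files match.
df = pd.DataFrame(records, columns=['image_path', 'text'])
df['image_path'] = df['image_path'].str.replace('.txt', '.jpg', regex=False)
```

Note `regex=False`: with `regex=True`, the `.` in `'.txt'` is a wildcard that would also match names like `a_txt`.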

Blank column appearing in .csv output, how can I remove it?

*Updated to add more lines of input file
I have a .csv file with header and subsequent data as follows (shown only first few rows here):
gene_name VarXCRep.1 VarX1Rep.1 VarX2Rep.1 VarXCRep.2 VarX3Rep.2 VarX1Rep.2 VarX2Rep.2 VarXCRep.3 VarX3Rep.3 VarX1Rep.3 VarX2Rep.3
1 Soltu.DM.01G000010 360.7000522 395.2279977 323.2595994 361.5910696 327.7380499 386.8290979 336.3997167 333.0843759 317.4954424 377.756613 396.666783
2 Soltu.DM.01G000020 91.12422371 69.30538348 77.36127164 135.060696 61.85252412 110.6099 68.21624475 108.7053612 55.31681029 56.52040232 36.14709293
3 Soltu.DM.01G000030 439.1681337 183.5656103 232.0838149 579.546161 220.9018719 179.6646995 179.2348391 291.2746216 222.4196747 266.8621527 208.321404
4 Soltu.DM.01G000040 268.3102142 185.4387288 192.0217278 301.5640936 130.9345641 237.108515 203.9799475 236.921941 92.19468382 198.1791322 38.04957151
5 Soltu.DM.01G000050 341.7158389 479.5183289 504.229717 322.2876925 528.5579334 390.4957244 470.1570594 342.8399852 554.3205365 424.9761896 634.4766049
6 Soltu.DM.01G000060 468.2772607 839.1570756 759.7982036 514.516937 886.0173261 572.6048416 579.8380803 549.1014398 1011.836655 598.8300854 1077.754113
7 Soltu.DM.01G000070 2.531228436 0 5.525805117 1.429213714 8.032795341 1.83331326 5.350293706 0 4.609734191 0 7.609914302
8 Soltu.DM.01G000090 84.79615262 54.3204357 75.97982036 98.61574626 102.0165008 83.11020113 84.26712586 108.7053612 98.53306833 80.13019064 93.2214502
9 Soltu.DM.01G000100 67.07755356 73.05162042 12.43306151 118.6247383 6.426236273 77.61026135 36.11448251 97.55609336 8.643251608 67.25212429 15.2198286
10 Soltu.DM.01G000110 1.265614218 0 1.381451279 2.143820571 0 1.22220884 4.012720279 0 2.304867095 0.715448131 0.951239288
11 Soltu.DM.01G000120 821.3836276 451.4215518 846.8296342 820.3686718 737.4106123 497.4389979 835.9833915 798.5663071 752.5391067 704.7164087 532.6940011
12 Soltu.DM.01G000130 2.531228436 3.746236945 5.525805117 2.143820571 0.803279534 0.61110442 2.00636014 1.393658477 1.728650322 2.146344392 10.46363217
13 Soltu.DM.01G000140 93.65545214 127.3720561 102.2273947 105.7618148 104.4263394 108.7765868 115.7001014 98.94975183 108.9049703 110.8944603 126.5148253
14 Soltu.DM.01G000150 112.6396654 84.29033126 91.17578444 86.46742969 154.2296705 99.61002047 111.0185944 115.6736536 111.7860541 115.187149 163.6131575
15 Soltu.DM.01G000160 644.197637 573.1742525 222.413656 760.3416958 178.3280566 761.4361074 594.551388 1053.605808 222.4196747 585.2365709 303.4453328
16 Soltu.DM.01G000170 751.7748456 841.0301941 910.3763931 773.9192261 835.4107154 820.7132361 1148.975573 804.140941 849.3435247 710.4399938 946.4830913
17 Soltu.DM.01G000190 6.328071091 1.873118472 5.525805117 6.431461713 8.836074875 5.49993978 8.694227272 11.14926781 4.609734191 7.869929438 0.951239288
18 Soltu.DM.01G000200 88.59299527 73.05162042 66.30966141 74.31911313 63.45908319 78.83247019 74.23532517 86.40682554 59.35032771 59.38219485 44.70824652
19 Soltu.DM.01G000210 108.8428228 112.3871083 85.64997932 111.4786697 73.0984376 123.4430928 113.6937412 143.5468231 67.41736254 77.26839812 86.56277518
20 Soltu.DM.01G000220 5.062456873 86.16344973 93.938687 20.72359885 507.6726655 30.555221 24.74510839 6.968292383 551.4394526 54.37405793 920.7996305
This is how the file appears in Bash shell
gene_name,VarXCRep.1,VarX1Rep.1,VarX2Rep.1,VarXCRep.2,VarX3Rep.2,VarX1Rep.2,VarX2Rep.2,VarXCRep.3,VarX3Rep.3,VarX1Rep.3,VarX2Rep.3
Soltu.DM.01G000010,360.7000522,395.2279977,323.2595994,361.5910696,327.7380499,386.8290979,336.3997167,333.0843759,317.4954424,377.756613,396.666783
Soltu.DM.01G000020,91.12422371,69.30538348,77.36127164,135.060696,61.85252412,110.6099,68.21624475,108.7053612,55.31681029,56.52040232,36.14709293
Soltu.DM.01G000030,439.1681337,183.5656103,232.0838149,579.546161,220.9018719,179.6646995,179.2348391,291.2746216,222.4196747,266.8621527,208.321404
Soltu.DM.01G000040,268.3102142,185.4387288,192.0217278,301.5640936,130.9345641,237.108515,203.9799475,236.921941,92.19468382,198.1791322,38.04957151
Soltu.DM.01G000050,341.7158389,479.5183289,504.229717,322.2876925,528.5579334,390.4957244,470.1570594,342.8399852,554.3205365,424.9761896,634.4766049
Soltu.DM.01G000060,468.2772607,839.1570756,759.7982036,514.516937,886.0173261,572.6048416,579.8380803,549.1014398,1011.836655,598.8300854,1077.754113
Soltu.DM.01G000070,2.531228436,0,5.525805117,1.429213714,8.032795341,1.83331326,5.350293706,0,4.609734191,0,7.609914302
Soltu.DM.01G000090,84.79615262,54.3204357,75.97982036,98.61574626,102.0165008,83.11020113,84.26712586,108.7053612,98.53306833,80.13019064,93.2214502
Soltu.DM.01G000100,67.07755356,73.05162042,12.43306151,118.6247383,6.426236273,77.61026135,36.11448251,97.55609336,8.643251608,67.25212429,15.2198286
I was asked to remove various columns and their associated data, which I have done in the code below. I was then asked to rearrange the data so that the control (VarXC) repeats 1, 2 and 3 and the experiment 1 (VarX1) repeats sit in adjacent columns, which is also done below:
empty_list = []
for ln in open("FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list += col
    empty_list += '\n'
file_out = open("Xtest_2Var.csv", "w")
file_out.write(','.join(empty_list))
file_out.close()
When I try to compile all this information, the output shows up like this:
This is the final output
I am not sure how I am getting that space on the left side. Can someone help me remove so that all the rows shift by one cell to the left?
You should change the code a little bit to make it work as you expect. The problem with your code is that you are constructing a single list to which you add EOL \n as elements. Therefore, when you write this list to a file
file_out.write(','.join(empty_list))
there will be a comma after each line break. I construct a list of lists and add \n right after join to avoid your problem:
empty_list = []
for ln in open("files/FinalXVartest.csv").readlines():
    col = ln.rstrip('\n').split(",")  # drop the trailing newline before splitting
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list.append(col)

file_out = open("files/Xtest_2Var.csv", "w")
for item in empty_list:
    file_out.write(','.join(item) + '\n')
file_out.close()
But it's better to use the csv library, which is made for reading and writing csv files.
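A sketch of the same rearrangement using the csv module, which handles splitting and line endings for you (same column layout assumed as in the question):

```python
import csv

def rearrange(src, dst):
    """Drop columns as in the question and move three columns to the end."""
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for col in csv.reader(f_in):
            del col[3]
            del col[4]
            del col[5]
            del col[6]
            del col[7]
            col.append(col.pop(2))
            col.append(col.pop(3))
            col.append(col.pop(4))
            writer.writerow(col)
```

Usage: `rearrange("FinalXVartest.csv", "Xtest_2Var.csv")`. Because `csv.writer` adds the line terminator itself, no stray commas or blank columns can appear.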
Using pandas:
import pandas as pd
import re

df = pd.read_csv('FinalXVartest.csv', index_col='gene_name')
parsed = sorted([(re.match(r'VarX(.)Rep\.(\d)', k).groups()[::-1], k) for k in df.columns])
cols = [k for (i, j), k in parsed if j in {'1', 'C'}]
df[cols].to_csv('Xtest_2Var.csv')
>>> df[cols]
VarX1Rep.1 VarXCRep.1 VarX1Rep.2 VarXCRep.2 VarX1Rep.3 VarXCRep.3
gene_name
Soltu.DM.01G000010 395.227998 360.700052 386.829098 361.591070 377.756613 333.084376
Soltu.DM.01G000020 69.305383 91.124224 110.609900 135.060696 56.520402 108.705361
Soltu.DM.01G000030 183.565610 439.168134 179.664700 579.546161 266.862153 291.274622
Soltu.DM.01G000040 185.438729 268.310214 237.108515 301.564094 198.179132 236.921941
Soltu.DM.01G000050 479.518329 341.715839 390.495724 322.287692 424.976190 342.839985
Soltu.DM.01G000060 839.157076 468.277261 572.604842 514.516937 598.830085 549.101440
Soltu.DM.01G000070 0.000000 2.531228 1.833313 1.429214 0.000000 0.000000
Soltu.DM.01G000090 54.320436 84.796153 83.110201 98.615746 80.130191 108.705361
Soltu.DM.01G000100 73.051620 67.077554 77.610261 118.624738 67.252124 97.556093

Read Delimited File That Wraps Lines

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built in python file reading functionality could get really ugly real fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
    # read the three header lines
    l1 = file.readline()
    if l1 == '':  # EOF (or another end condition)
        break
    l2 = file.readline()
    l3 = file.readline()
    # extract the following information from l1, l2, l3:
    nrows = ...  # the number of rows in the next section
    ncols = ...  # the number of columns in the next section
    # collect items until the whole section has been read,
    # no matter how the lines wrap
    items = []
    while len(items) < nrows * ncols:
        items.extend(file.readline().split())
    # break the flat list into rows of ncols items each
    rows = [items[i:i + ncols] for i in range(0, len(items), ncols)]
    datasections.append(rows)
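For a concrete (hypothetical) header layout, say the first of the three header lines carries the counts as "rows=N cols=M", the sketch can be filled in as below; adjust the header parsing to whatever your files actually contain:

```python
import re

def read_sections(path):
    """Read definition/data sections where data rows may wrap across lines.

    Assumes (hypothetically) that the first header line contains
    'rows=N cols=M'; adapt the regex to your real format.
    """
    sections = []
    with open(path) as f:
        while True:
            l1 = f.readline()
            if l1 == '':          # EOF
                break
            l2 = f.readline()     # remaining header lines, unused here
            l3 = f.readline()
            m = re.search(r'rows=(\d+)\s+cols=(\d+)', l1)
            nrows, ncols = int(m.group(1)), int(m.group(2))
            # gather values until the section is complete, ignoring wrapping
            items = []
            while len(items) < nrows * ncols:
                items.extend(f.readline().split())
            rows = [[float(x) for x in items[i:i + ncols]]
                    for i in range(0, len(items), ncols)]
            sections.append(rows)
    return sections
```

Like the MATLAB textscan approach, this keeps the file handle open and reads one section at a time, so memory use stays proportional to a single section.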

enumeration of elements for lists within lists

I have a collection of files (kind of like CSV, but no commas) with data arranged like the following:
RMS ResNum Scores Rank
30 1 44 5
12 1 99 2
2 1 60 1
1.5 1 63 3
12 2 91 4
2 2 77 3
I'm trying to write a script that enumerates for me and gives an integer as the output. I want it to count how many times we get a value of RMS below 3 AND a score above 51. Only if both these criteria are met should it add 1 to our count.
HOWEVER, the tricky part is that for any given "ResNum" it cannot add 1 multiple times. In other words, I want to sub-group the data by ResNum then decide 1 or 0 on whether or not those two criteria are met within that group.
So right now it would give an output of 3, whereas I want it to display 2, since ResNum 1 is currently counted twice (two of its rows meet the criteria).
import glob

file_list = glob.glob("*")
file_list = sorted(file_list)
for input_file in file_list:
    masterlist = []
    opened_file = open(input_file, 'r')
    count = 0
    for line in opened_file:
        data = line.split()
        templist = []
        templist.append(float(data[0]))  # RMS
        templist.append(int(data[1]))    # ResNum
        templist.append(float(data[2]))  # Scores
        templist.append(float(data[3]))  # Rank
        masterlist.append(templist)
Then here comes the part that needs modification (I think):
    for placement in masterlist:
        if placement[0] < 3 and placement[2] > 51.0:
            count += 1
    print(input_file)
    print(count)
    count = 0
Choose your data structures carefully to make your life easier.
import glob

file_list = sorted(glob.glob("*"))
grouper = {}
for input_file in file_list:
    with open(input_file) as f:
        grouper[input_file] = set()
        for line in f:
            rms, resnum, scores, rank = line.split()
            if float(rms) < 3 and float(scores) > 51:
                grouper[input_file].add(int(resnum))

for input_file, group in grouper.items():
    print(input_file)
    print(len(group))
This creates a dictionary of sets. The key of this dictionary is the file-name. The values are sets of the ResNums, added only when your condition holds. Since sets don't have repeated elements, the size of your set (len) will give you the right count of the number of times your condition was met, per ResNum, per file.
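If pandas is an option, the same per-ResNum count can be expressed as a filter plus a distinct count. A sketch assuming the whitespace-separated layout with the header row shown in the question:

```python
import pandas as pd

def count_resnums(path):
    """Count distinct ResNum values that have at least one row
    with RMS < 3 and Scores > 51."""
    df = pd.read_csv(path, sep=r'\s+')
    hits = df[(df['RMS'] < 3) & (df['Scores'] > 51)]
    return hits['ResNum'].nunique()
```

`nunique()` plays the same role as the set above: each qualifying ResNum is counted once, however many of its rows pass the filter.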

Translating a gridded csv file with numpy

I need to get some meteorological data into a MySQL database.
File inputFile.csv is a comma-delimited list of values. There are 241 lines and 481 values per line.
Each line maps to a certain latitude, and each value's position within the line maps to a certain longitude.
There are two additional files with the same structure, lat.csv and lon.csv. These files contain the coordinates that the values in inputFile.csv map to.
So to find the latitude and longitude for a value in inputFile.csv, we need to refer to the values at the same line/position (or row/column) within lat.csv and lon.csv
I want to translate inputFile.csv using lat.csv and lon.csv such that my output file contains a list of values (from inputFile.csv),latitudes, and longitudes.
Here is a small visual example:
inputFile.csv
3,5,1,4,5
1,4,1,2,5
5,7,3,8,0
lat.csv
22,31,51,21,52
55,21,24,66,12
11,23,12,55,55
lon.csv
12,35,12,52,11
35,11,25,33,42
62,53,45,25,54
output:
val lat lon
3 22 12
5 31 35
1 51 12
4 21 52
5 52 11
1 55 35
4 21 11
1 24 25
2 66 33
etc
What is the best way to do this in python/numpy?
I suppose that since you know the total size of the array that you want, you can preallocate it:
import numpy as np

a = np.empty((241*481, 3))
Now you can add the data:
for i, fname in enumerate(('inputFile.csv', 'lat.csv', 'lon.csv')):
    # np.loadtxt reads the whole multi-line grid; ravel() flattens it row by row
    a[:, i] = np.loadtxt(fname, delimiter=',').ravel()
If you don't know the number of elements up front, you can build a list of flattened arrays instead:
alist = []
for fname in ('inputFile.csv', 'lat.csv', 'lon.csv'):
    alist.append(np.loadtxt(fname, delimiter=',').ravel())
a = np.array(alist).T
Only with numpy functions:
import numpy as np

inputFile = np.genfromtxt('inputFile.csv', delimiter=',').ravel()
lat = np.genfromtxt('lat.csv', delimiter=',').ravel()
lon = np.genfromtxt('lon.csv', delimiter=',').ravel()
# transpose so that each row is one (val, lat, lon) triple
output = np.vstack((inputFile, lat, lon)).T
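Putting it together on the small example above, with np.savetxt to write the val/lat/lon table (a sketch; `translate` and the output path are assumed names, the input file names follow the question):

```python
import numpy as np

def translate(input_csv, lat_csv, lon_csv, out_path):
    """Flatten the three grids and write them side by side
    as 'val lat lon' rows."""
    val = np.genfromtxt(input_csv, delimiter=',').ravel()
    lat = np.genfromtxt(lat_csv, delimiter=',').ravel()
    lon = np.genfromtxt(lon_csv, delimiter=',').ravel()
    table = np.column_stack((val, lat, lon))
    # comments='' keeps the header line from being prefixed with '#'
    np.savetxt(out_path, table, fmt='%g', header='val lat lon', comments='')
```

Since all three files share the same row/column layout, flattening each one row by row keeps the values aligned with their coordinates.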
