Python: Read and write a file with a complex and repeating format - python

To begin with, sorry for my poor English.
I have a file with a repeating format, such as:
326 Iteration: 0 #Bonds: 10
1 6 7 14 54 70 77 0 0 0 0 0 1 0.693 0.632 0.847 0.750 0.644 0.000 0.000 0.000 0.000 0.000 3.566 0.000 0.028
2 6 3 6 15 55 0 0 0 0 0 0 1 0.925 0.920 0.909 0.892 0.000 0.000 0.000 0.000 0.000 0.000 3.645 0.000 -0.040
3 6 2 8 10 52 0 0 0 0 0 0 1 0.925 0.910 0.920 0.898 0.000 0.000 0.000 0.000 0.000 0.000 3.653 0.000 0.000
...
324 8 323 0 0 0 0 0 0 0 0 0 100 0.871 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.871 3.000 -0.493
325 2 326 0 0 0 0 0 0 0 0 0 101 0.930 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.930 0.000 0.334
326 8 325 0 0 0 0 0 0 0 0 0 101 0.930 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.930 3.000 -0.611
637.916060425841 306.094529423257 1250.10511927236
6.782126993565285E-006
326 (repeating from here) Iteration: 100 #Bonds: 10
1 6 7 14 54 64 70 77 0 0 0 0 1 0.885 0.580 0.819 0.335 0.784 0.709 0.000 0.000 0.000 0.000 4.111 0.000 0.025
2 6 3 6 15 55 0 0 0 0 0 0 1 0.812 0.992 0.869 0.966 0.000 0.000 0.000 0.000 0.000 0.000 3.639 0.000 -0.034
3 6 2 8 10 52 0 0 0 0 0 0 1 0.812 0.966 0.989 0.926 0.000 0.000 0.000 0.000 0.000 0.000 3.692 0.000 0.004
As you can see, the first line is the header, lines 2~327 are the data I want to analyze, and lines 328 and 329 hold some numbers which I don't want to use. The next "frame" starts at line 330 with exactly the same format, and this "frame" repeats more than 200,000 times.
I want to use columns 1~13 of lines 2~327 of each frame, and also the first number of the header.
For each frame I want to analyze columns 3~12 of lines 2~327, printing the number of zeros and the number of non-zeros in that block, and also print columns 1, 2 and 13. So the expected output file becomes:
326
1
1 6 5 5 1
2 6 4 6 1
...
325 2 1 9 101
326 8 1 9 101
326 (Next frame starts from here)
2
1 6 5 5 1
2 6 4 6 1
...
326
3
1 6 5 5 1
2 6 4 6 1
...
First line: the first number of the input header.
Second line: the frame number.
Lines 3~328: column 1 of the input, column 2 of the input, the number of non-zeros in columns 3~12, the number of zeros in columns 3~12, and column 13 of the input.
After line 328: the next frame starts and the same format repeats.
So the result file has 2 header lines plus 326 lines of analyzed data, 328 lines per frame, and the same format repeats for every frame. Writing the result in fixed-width fields (5 characters each) is preferred so the file can be reused for other purposes.
My current approach is to create 13 arrays for the 13 columns and fill them with nested for loops over the frames and over the lines of each frame, but I have no idea how to deal with the output.
Below is my trial code (unfinished, it only reads the input), and it has a lot of problems. linecache reads the whole line, not just the first number of each header line. Every frame has 326 + 3 = 329 lines, but my code does not seem to handle the frames correctly. I welcome any help with analyzing this data. Thank you very much in advance.
# Read the file
filename = raw_input("Enter the file name \n")
file = open(filename, 'r')

# Read the number of atoms from the header
import linecache
nnn = linecache.getline(filename, 1)
natoms = int(nnn)
singleframe = natoms + 3

# get number of frames
nlines = 0
for i1 in file:
    nlines = nlines + 1
file.close()
nframes = nlines / singleframe
print 'no of lines are: ', nlines
print 'no of frames are: ', nframes
print 'no of atoms are:', natoms

# Create 1d string array
nrange = range(nlines)
data_lines = [None]*(nlines)

# Store whole input file into string array
file = open(filename, 'r')
i1 = 0
for i1 in nrange:
    data_lines[i1] = file.readline()
file.close()

# Create 1d arrays to store atomic data
at_index = [None]*natoms
at_type = [None]*natoms
n1 = [None]*natoms
n2 = [None]*natoms
n3 = [None]*natoms
n4 = [None]*natoms
n5 = [None]*natoms
n6 = [None]*natoms
n7 = [None]*natoms
n8 = [None]*natoms
n9 = [None]*natoms
n10 = [None]*natoms
molnr = [None]*natoms
nrange1 = range(natoms)
nframe = range(nframes)

file = open('output_force', 'w')
print data_lines[9]
for j1 in nframe:
    start = j1*(natoms + 3) + 3
    for i1 in nrange1:
        line = data_lines[i1+start].split()  # Split each line based on spaces
        at_index[i1] = int(line[0])
        at_type[i1] = int(line[1])
        n1[i1] = int(line[2])
        n2[i1] = int(line[3])
        n3[i1] = int(line[4])
        n4[i1] = int(line[5])
        n5[i1] = int(line[6])
        n6[i1] = int(line[7])
        n7[i1] = int(line[8])
        n8[i1] = int(line[9])
        n9[i1] = int(line[10])
        n10[i1] = int(line[11])
        molnr[i1] = int(line[12])

When you are working with csv-like files, you should look into the csv module. I wrote some code that should do the trick.
This code assumes "good data". If your data set may contain errors (such as fewer than 13 columns, or fewer than 326 data rows), some alterations will be needed.
(changed to comply with Python 2.6.6)
import csv

with open('mydata.csv') as in_file:
    with open('outfile.csv', 'wb') as out_file:
        csv_reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
        csv_writer = csv.writer(out_file, delimiter='\t')
        # Iterate over all rows in the file
        for i, header in enumerate(csv_reader):
            # Get the header data
            num = header[0]
            csv_writer.writerow([num])
            # Write frame number, starting with 1 (hence the +1 part)
            csv_writer.writerow([i+1])
            # Iterate over all data rows
            for _ in xrange(326):
                # Call next(csv_reader) to get the next row
                # Put inside a try ... except to avoid a StopIteration exception
                # if the end of file is found before reaching 326 lines
                try:
                    row = next(csv_reader)
                except StopIteration:
                    break
                # Use a list comprehension to count the zeros
                zeros = sum([1 for x in row[2:12] if x.strip() == '0'])
                not_zeros = 10 - zeros
                # Write the data to the output file
                out = [row[0].strip(), row[1].strip(), not_zeros, zeros, row[12].strip()]
                csv_writer.writerow(out)
            # If the inner loop finished without hitting a break
            else:
                # Skip the last two lines of the frame
                next(csv_reader)
                next(csv_reader)
For the first three lines, this yields:
326
1
1 6 5 5 1
2 6 4 6 1
3 6 4 6 1
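Since the question asked for fixed-width output (5 characters per field), here is a hedged variant of the write step; it assumes out_file is opened as a plain text file ('w') and written to directly instead of through csv.writer:
# Sketch: write one analyzed data row as five right-aligned, 5-character fields
out_file.write('{0:>5}{1:>5}{2:>5}{3:>5}{4:>5}\n'.format(
    row[0].strip(), row[1].strip(), not_zeros, zeros, row[12].strip()))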

Related

Using pandas dataframes, how to read through a column to find "True" statement and then proceed to create a new dataframe

Below I have 4 columns in my dataframe. I am interested in going through the entire "Greater_than_50" column. Upon reaching a "True" flag, I then want to take the associated "Discharge" and "Resistance" values to make a new dataframe which contains only those values found to be "True".
time Discharge Resistance Greater_than_50
-------------------------------------------------------------
0 0.000 NaN NaN
1 0.005 76.373 True
2 0.010 -48.174 False
3 0.016 -37.012 False
4 0.021 -27.808 False
5 0.026 -24.674 False
6 0.031 -20.464 False
7 0.037 100.114 True
... ... ... ...
I would like the new dataframe to look something like this:
Discharge Resistance
------------------------------
0.005 76.373
0.037 100.114
... ...
df['Greater_than_50'] = [val.strip() for val in df['Greater_than_50'].astype(str)]
# columns to keep
col_mask = ['Discharge', 'Resistance']
df_new = df.loc[df['Greater_than_50'] == 'True'][col_mask]
This is how I tested it:
'''
time Discharge Resistance Greater_than_50
0 0.000 NaN NaN
1 0.005 76.373 True
2 0.010 -48.174 False
3 0.016 -37.012 False
4 0.021 -27.808 False
5 0.026 -24.674 False
6 0.031 -20.464 False
7 0.037 100.114 True
'''
import pandas as pd
df = pd.read_clipboard()
print(df)
Original df:
time Discharge Resistance Greater_than_50
0 0 0.000 NaN NaN
1 1 0.005 76.373 True
2 2 0.010 -48.174 False
3 3 0.016 -37.012 False
4 4 0.021 -27.808 False
5 5 0.026 -24.674 False
6 6 0.031 -20.464 False
7 7 0.037 100.114 True
df['Greater_than_50'] = [val.strip() for val in df['Greater_than_50'].astype(str)]
# columns to keep
col_mask = ['Discharge', 'Resistance']
df_new = df.loc[df['Greater_than_50'] == 'True'][col_mask]
print(df_new)
Output:
Discharge Resistance
1 0.005 76.373
7 0.037 100.114
Just replace whatever columns you want to keep in the 'col_mask'.
Assuming the positions of columns "Discharge" and "Resistance" are 1 and 2, df2 is what you need
df1 = df[df.Greater_than_50 == True]
df2 = df1.iloc[:, 1:3]
You can do a one liner like so
df2 = df[df.Greater_than_50 == True].iloc[:, 1:3]
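If the column positions are not guaranteed, a label-based selection avoids relying on iloc positions; this is a sketch, assuming Greater_than_50 holds actual booleans rather than strings:
df2 = df.loc[df['Greater_than_50'] == True, ['Discharge', 'Resistance']]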

read until end of file after a matching string

I am trying to read lines after a match from a file:
with open(jij, "a") as jout:
    with open(jfile, "r") as jinp:
        for line in jinp:
            if line.strip().startswith("IQ"):
                # for _ in line:
                # for lines in jinp:
                for lines in range(2500):
                    # lines = jinp.readline()
                    rows = jinp.readline().split()
                    print("{0:<3s}{1:<3s}{2:<3s}{3:<3s}{4:>3s}{5:>3s}{6:>3s}{7:>15s}{8:>7s}".
                          format(rows[3], rows[2], rows[0], rows[1], rows[4], rows[5], rows[6], rows[11], rows[10]))
A very short jfile is shown below (I generally have around 1000 lines, but it may be even bigger):
Isotropic exchange couplings Jij
number of sites NQ = 2
number of types NT = 2
site occupation:
1 1 1 1.000
2 1 2 1.000
IQ IT JQ JT N1 N2 N3 DRX DRY DRZ DR J_ij [mRy] J_ij [meV]
1 1 2 2 -1 -1 -1 -0.500 -0.500 -0.681 0.982 0.159317355 2.167623834
1 1 2 2 0 -1 -1 0.500 -0.500 -0.681 0.982 0.159317355 2.167623834
1 1 2 2 -1 0 -1 -0.500 0.500 -0.681 0.982 0.159317355 2.167623834
1 1 2 2 0 0 -1 0.500 0.500 -0.681 0.982 0.159317355 2.167623834
1 1 2 2 -1 -1 0 -0.500 -0.500 0.681 0.982 0.159317355 2.167623834
1 1 2 2 0 -1 0 0.500 -0.500 0.681 0.982 0.159317355 2.167623834
1 1 2 2 -1 0 0 -0.500 0.500 0.681 0.982 0.159317355 2.167623834
1 1 2 2 0 0 0 0.500 0.500 0.681 0.982 0.159317355 2.167623834
1 1 1 1 0 -1 0 0.000 -1.000 0.000 1.000 1.457569899 19.831256008
1 1 1 1 -1 0 0 -1.000 0.000 0.000 1.000 1.453728096 19.778985590
I am trying to print a few elements as a list after it finds "IQ".
My preferred way would be for _ in line, but that only takes the first 100 lines; for lines in jinp skips one line and reads the next. It only works as intended when I put it in range, but I don't want to hard-code a fixed line number.
What is going wrong with for _ in line?
https://da.gd/CtKZ is the complete file.
https://da.gd/7V8F result with for lines in range(2500)
https://da.gd/6cx3 result with for _ in line
https://da.gd/v9ts result with for lines in jinp
The expected result is the one from range(2500), but I don't want to hard-code the line numbers.
Your problem is that you reuse the same file descriptor:
rows = jinp.readline().split()  # This advances the file pointer to the next line
All your variants combine this line with another iteration:
# for _ in line: goes over the characters of the matched line (about 100 of them)
# for lines in jinp: goes over the open file, so you read two lines per iteration
You could use this instead; it is shorter and more readable.
flag = False
with open(jij, "a") as jout:
    with open(jfile, "r") as jinp:
        for line in jinp:
            if flag:
                rows = line.split()
                jout.write("{0:<3s}{1:<3s}{2:<3s}{3:<3s}{4:>3s}{5:>3s}{6:>3s}{7:>15s}{8:>7s}\n".
                           format(rows[3], rows[2], rows[0], rows[1], rows[4], rows[5], rows[6], rows[11],
                                  rows[10]))
            else:
                flag = line.strip().startswith("IQ")
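An equivalent sketch using itertools.dropwhile, in case a more declarative form is preferred; it assumes the same file names and column layout as above:
from itertools import dropwhile

with open(jfile, "r") as jinp, open(jij, "a") as jout:
    # Drop everything up to the line that starts with "IQ", then skip that line itself
    body = dropwhile(lambda l: not l.strip().startswith("IQ"), jinp)
    next(body, None)
    for line in body:
        rows = line.split()
        jout.write("{0:<3s}{1:<3s}{2:<3s}{3:<3s}{4:>3s}{5:>3s}{6:>3s}{7:>15s}{8:>7s}\n".format(
            rows[3], rows[2], rows[0], rows[1], rows[4], rows[5], rows[6], rows[11], rows[10]))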

Performance: Python pandas DataFrame.to_csv append becomes gradually slower

Initial Question:
I'm looping through a couple of thousand pickle files, each containing a Python pandas DataFrame, which vary in the number of rows (between approx. 600 and 1300) but not in the number of columns (636 to be exact). Then I transform them (exactly the same transformations for each) and append them to a csv file using the DataFrame.to_csv() method.
The to_csv code excerpt:
if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
What bothers me is that it starts off pretty fast but performance decreases exponentially; I kept a processing time log:
start: 2015-03-24 03:26:36.958058
2015-03-24 03:26:36.958058
count = 0
time: 0:00:00
2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697
2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128
2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459
2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692
2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618
2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877
2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646
2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114
2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602
2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577
2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802
2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343
2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335
2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590
2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412
2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339
2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204
2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146
2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948
2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383
2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974
2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205
2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652
2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415
2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070
2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234
2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679
2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979
2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212
2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924
2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472
2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556
2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356
count = 3324
end of cycle: 2015-03-26 03:59:54.161842
end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784
Update #1:
I did as you suggested, @Alexander, but it certainly has to do with the to_csv() method:
start: 2015-03-26 05:18:25.948410
2015-03-26 05:18:25.948410
count = 0
time: 0:00:00
2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631
2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541
2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016
2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237
2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361
2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384
2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633
2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624
2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006
2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800
2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756
2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996
2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653
2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465
2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006
2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323
2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602
2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465
2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692
2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336
2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537
2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320
2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145
2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342
2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319
2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824
2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567
2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148
2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286
2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294
2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584
2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725
2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513
count = 3324
end of cycle: 2015-03-26 06:03:47.321182
end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772
And as requested, the source code:
import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")

for picklefile in pickleFiles:
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])
            if max != 0:
                dftemp[string] = dftemp[string]/max
    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')
Update #2:
As suggested, I executed the command %prun %run "./DataSetNormalization.py" for the first couple of hundred pickle files and the result is as follows:
136373640 function calls (136342619 primitive calls) in 1018.769 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
220 667.069 3.032 667.069 3.032 {method 'close' of '_io.TextIOWrapper' objects}
1540 42.046 0.027 46.341 0.030 {pandas.lib.write_csv_rows}
219 34.886 0.159 34.886 0.159 {built-in method collect}
3520 16.782 0.005 16.782 0.005 {pandas.algos.take_2d_axis1_object_object}
78323 9.948 0.000 9.948 0.000 {built-in method empty}
25336892 9.645 0.000 12.635 0.000 {built-in method isinstance}
1433941 9.344 0.000 9.363 0.000 generic.py:1845(__setattr__)
221051/220831 7.387 0.000 119.767 0.001 indexing.py:194(_setitem_with_indexer)
723540 7.312 0.000 7.312 0.000 {method 'reduce' of 'numpy.ufunc' objects}
273414 7.137 0.000 20.642 0.000 internals.py:2656(set)
604245 6.846 0.000 6.850 0.000 {method 'copy' of 'numpy.ndarray' objects}
1760 6.566 0.004 6.566 0.004 {pandas.lib.isnullobj}
276274 5.315 0.000 5.315 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1719244 5.264 0.000 5.266 0.000 {built-in method array}
1102450 5.070 0.000 29.543 0.000 internals.py:1804(make_block)
1045687 5.056 0.000 10.209 0.000 index.py:709(__getitem__)
1 4.718 4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
602485 4.575 0.000 15.087 0.000 internals.py:2586(iget)
441662 4.562 0.000 33.386 0.000 internals.py:2129(apply)
272754 4.550 0.000 4.550 0.000 internals.py:1291(set)
220883 4.073 0.000 4.073 0.000 {built-in method charmap_encode}
4781222 3.805 0.000 4.349 0.000 {built-in method getattr}
52143 3.673 0.000 3.673 0.000 {built-in method truediv}
1920486 3.671 0.000 3.672 0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
1096730 3.513 0.000 8.370 0.000 internals.py:3035(__init__)
875899 3.508 0.000 14.458 0.000 series.py:134(__init__)
334357 3.420 0.000 3.439 0.000 {pandas.lib.infer_dtype}
2581268 3.419 0.000 4.774 0.000 {pandas.lib.values_from_object}
1102450 3.036 0.000 6.110 0.000 internals.py:59(__init__)
824856 2.888 0.000 45.749 0.000 generic.py:1047(_get_item_cache)
2424185 2.657 0.000 3.870 0.000 numeric.py:1910(isscalar)
273414 2.505 0.000 9.332 0.000 frame.py:2113(_sanitize_column)
1646198 2.491 0.000 2.880 0.000 index.py:698(__contains__)
879639 2.461 0.000 2.461 0.000 generic.py:87(__init__)
552988 2.385 0.000 4.451 0.000 internals.py:3565(_get_blkno_placements)
824856 2.349 0.000 51.282 0.000 frame.py:1655(__getitem__)
220831 2.224 0.000 21.670 0.000 internals.py:460(setitem)
326437 2.183 0.000 11.352 0.000 common.py:1862(_possibly_infer_to_datetimelike)
602485 2.167 0.000 16.974 0.000 frame.py:1982(_box_item_values)
602485 2.087 0.000 23.202 0.000 internals.py:2558(get)
770739 2.036 0.000 6.471 0.000 internals.py:1238(__init__)
276494 1.966 0.000 1.966 0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076 1.935 0.000 1.972 0.000 {built-in method len}
220831 1.924 0.000 76.647 0.000 indexing.py:372(setter)
220 1.893 0.009 1.995 0.009 {built-in method load}
1920486 1.855 0.000 8.198 0.000 index.py:1173(get_loc)
112860 1.828 0.000 9.607 0.000 common.py:202(_isnull_ndarraylike)
602485 1.707 0.000 8.903 0.000 series.py:238(from_array)
875899 1.688 0.000 2.493 0.000 series.py:263(_set_axis)
3300 1.661 0.001 1.661 0.001 {method 'tolist' of 'numpy.ndarray' objects}
1102670 1.609 0.000 2.024 0.000 internals.py:108(mgr_locs)
4211850 1.593 0.000 1.593 0.000 {built-in method issubclass}
1335546 1.501 0.000 2.253 0.000 generic.py:297(_get_axis_name)
273414 1.411 0.000 37.866 0.000 frame.py:1994(__setitem__)
441662 1.356 0.000 7.884 0.000 indexing.py:982(_convert_to_indexer)
220831 1.349 0.000 131.331 0.001 indexing.py:95(__setitem__)
273414 1.329 0.000 23.170 0.000 generic.py:1138(_set_item)
326437 1.276 0.000 6.203 0.000 fromnumeric.py:2259(prod)
274734 1.271 0.000 2.113 0.000 shape_base.py:60(atleast_2d)
273414 1.242 0.000 34.396 0.000 frame.py:2072(_set_item)
602485 1.183 0.000 1.979 0.000 generic.py:1061(_set_as_cached)
934422 1.175 0.000 1.894 0.000 {method 'view' of 'numpy.ndarray'objects}
1540 1.144 0.001 58.217 0.038 format.py:1409(_save_chunk)
220831 1.144 0.000 9.198 0.000 indexing.py:139(_convert_tuple)
441662 1.137 0.000 3.036 0.000 indexing.py:154(_convert_scalar_indexer)
220831 1.087 0.000 1.281 0.000 arrayprint.py:343(array2string)
1332026 1.056 0.000 3.997 0.000 generic.py:310(_get_axis)
602485 1.046 0.000 9.949 0.000 frame.py:1989(_box_col_values)
220 1.029 0.005 1.644 0.007 internals.py:2429(_interleave)
824856 1.025 0.000 46.777 0.000 frame.py:1680(_getitem_column)
1491578 1.022 0.000 2.990 0.000 common.py:58(_check)
782616 1.010 0.000 3.513 0.000 numeric.py:394(asarray)
290354 0.988 0.000 1.386 0.000 internals.py:1950(shape)
220831 0.958 0.000 15.392 0.000 generic.py:2101(copy)
273414 0.940 0.000 1.796 0.000 indexing.py:1520(_convert_to_index_sliceable)
220831 0.920 0.000 1.558 0.000 common.py:1110(_possibly_downcast_to_dtype)
220611 0.914 0.000 0.914 0.000 {pandas.lib.is_bool_array}
498646 0.906 0.000 0.906 0.000 {method 'clear' of 'dict' objects}
715345 0.848 0.000 13.083 0.000 common.py:132(_isnull_new)
452882 0.824 0.000 1.653 0.000 index.py:256(__array_finalize__)
602485 0.801 0.000 0.801 0.000 internals.py:208(iget)
52583 0.748 0.000 2.038 0.000 common.py:1223(_fill_zeros)
606005 0.736 0.000 6.755 0.000 internals.py:95(make_block_same_class)
708971 0.732 0.000 2.156 0.000 internals.py:3165(values)
1760378 0.724 0.000 0.724 0.000 internals.py:2025(_get_items)
109560 0.720 0.000 6.140 0.000 nanops.py:152(_get_values)
220831 0.718 0.000 11.017 0.000 internals.py:2395(copy)
924669 0.712 0.000 1.298 0.000 common.py:2248(_get_dtype_type)
1515796 0.698 0.000 0.868 0.000 {built-in method hasattr}
220831 0.670 0.000 4.299 0.000 internals.py:435(copy)
875899 0.661 0.000 0.661 0.000 series.py:285(_set_subtyp)
220831 0.648 0.000 0.649 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
452882 0.640 0.000 0.640 0.000 index.py:218(_reset_identity)
715345 0.634 0.000 1.886 0.000 {pandas.lib.isscalar}
1980 0.626 0.000 1.172 0.001 internals.py:3497(_merge_blocks)
220831 0.620 0.000 2.635 0.000 common.py:1933(_is_bool_indexer)
272754 0.608 0.000 0.899 0.000 internals.py:1338(should_store)
220831 0.599 0.000 3.463 0.000 series.py:482(__getitem__)
498645 0.591 0.000 1.497 0.000 generic.py:1122(_clear_item_cache)
1119390 0.584 0.000 1.171 0.000 index.py:3936(_ensure_index)
220831 0.573 0.000 1.883 0.000 index.py:222(view)
814797 0.555 0.000 0.905 0.000 internals.py:3086(_values)
52583 0.543 0.000 15.545 0.000 ops.py:469(wrapper)
220831 0.536 0.000 3.760 0.000 internals.py:371(_try_cast_result)
228971 0.533 0.000 0.622 0.000 generic.py:1829(__getattr__)
769651 0.528 0.000 0.528 0.000 {built-in method min}
224351 0.509 0.000 2.030 0.000 generic.py:1099(_maybe_update_cacher)
...
I will rerun it for confirmation, but it looks like it certainly has something to do with pandas' to_csv() method, because most of the run time is spent on I/O and the csv writer. Why is it having this effect? Any suggestions?
Update #3:
Well, I did a full %prun test and indeed almost 90% of the time is spent on {method 'close' of '_io.TextIOWrapper' objects}. So I guess here's the problem... What do you guys think?
My questions here are:
What causes the decrease in performance here?
Does pandas.DataFrame.to_csv() in append mode load the whole file each time it writes to it?
Is there a way to enhance the process?
In this kind of situation you should profile your code (to see which function calls are taking the most time); that way you can check empirically that it really is slow in to_csv rather than elsewhere...
From looking at your code: firstly, there is a lot of copying here and a lot of looping (not enough vectorization)... every time you see looping, look for a way to remove it. Secondly, when you use things like zfill, I wonder if you want a fixed-width output format rather than to_csv?
Some sanity testing: are some files significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows? Have you checked this, e.g. using wc -l?
IMO it is unlikely to be garbage collection (as was suggested in the other answer).
Here are a few improvements to your code, which should improve the runtime.
Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations, and use apply rather than iterating (for zfill).
columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m whose .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # 'ConcatIndex' is dropped earlier; if you need it, subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
As a point of style I would probably choose to wrap each of these parts into functions; this will also mean more things can be gc'd, if that really was the issue...
Another option which would be faster is to use PyTables (an HDFStore) if you didn't need the resulting output to be csv (but I expect you do)...
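For illustration only, a sketch of how appending to an HDFStore could look; it assumes a hypothetical transform() helper wrapping the per-file manipulation and normalization steps from the question, and the store key 'normalized' is made up:
import pandas as pd

with pd.HDFStore('finalNormalizedDataFrame2.h5') as store:  # requires PyTables
    for picklefile in pickleFiles:
        dftemp = transform(path + picklefile)   # hypothetical helper, not from the question
        store.append('normalized', dftemp, index=False)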
The best thing to do by far is to profile your code, e.g. with %prun in IPython, e.g. see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see whether it definitely is to_csv, and specifically where (which line of your code and which lines of pandas code).
Ah ha, I'd missed that you are appending all these to a single csv file. And in your prun it shows most of the time is spent in close, so let's keep the file open:
# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')
...
for picklefile in ...
    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...
f.close()
Each time the file is opened, before it can append it needs to seek to the end before writing; it could be that this is what's expensive (I don't see why it should be that bad, but keeping the file open removes the need to do this).
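The same idea can be written with a context manager, writing the header only for the first frame; this is only a sketch, where transform() is a hypothetical helper standing in for the manipulation and normalization steps of the original loop:
with open(finalnormCSVFile, 'w') as f:
    for count, picklefile in enumerate(pickleFiles):
        dftemp = transform(path + picklefile)    # hypothetical helper, not from the question
        dftemp.to_csv(f, header=(count == 0))    # write the header once, then just append rows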
My guess would be that it comes from the very_deep_copy you are doing. Did you check the memory usage over time? It is possible that the memory is not freed correctly.
If that is the problem, you could do one of the following:
1) Avoid the copying altogether (better performance-wise).
2) Force a garbage collection using gc.collect() once in a while.
See "Python garbage collection" for a probably related issue, and this article for an introduction about garbage collection in python.
Edit:
A solution to remove the copy would be to:
1) store the normalizing constant for each column before normalizing.
2) drop the columns you do not need after the normalization.
# Get the normalizing constant for each column.
max = {}
for string in head:
    if string not in exclude:
        if 'Real ' in string:
            max[string] = df[string.strip('Real ')].max()
        elif 'child' in string:
            max[string] = df[string.strip('child')+'desc'].max()
        else:
            max[string] = df[string].max()

# Actual normalization, each column is divided by
# its constant if possible.
for key, value in max.items():
    if value != 0:
        df[key] /= value

# Drop the excluded columns
df.drop(exclude, axis=1, inplace=True)

Modify 2D Array (Python)

Hey guys so I have a 2D array that looks like this:
12381.000 63242.000 0.000 0.000 0.000 8.000 9.200 0.000 0.000
12401.000 8884.000 0.000 0.000 96.000 128.000 114.400 61.600 0.000
12606.000 74204.000 56.000 64.000 72.000 21.600 18.000 0.000 0.000
12606.000 105492.000 0.000 0.000 0.000 0.000 0.000 0.000 45.600
12606.000 112151.000 2.400 4.000 0.000 0.000 0.000 0.000 0.000
12606.000 121896.000 0.000 0.000 0.000 0.000 0.000 60.800 0.000
(A couple of columns are cut off due to formatting.)
So it indicates the employee ID and department ID, followed by the 12 months worked by each employee and the hours they worked in each month. My 2D array is essentially a list of lists where each row is a list of its own. I am trying to convert each nonzero value to a one and keep all the zeros. There are 857 rows and 14 columns. My code is as follows:
def convBin(A):
    """Nonzero values are converted into 1s and zero values are kept constant.
    """
    for i in range(len(A)):
        for j in range(len(A[i])):
            if A[i][j] > 0:
                A[i][j] == 1
            else:
                A[i][j] == 0
    return A
Can someone tell me what I am doing wrong?
You are doing equality evaluation, not assignment, inside your loop:
A[i][j] == 1
should be
A[i][j] = 1
# ^ note only one equals sign
Also, there is no need to return A; A is being modified in-place, so it is conventional to implicitly return None by removing the explicit return ... line.
You should bear in mind that:
You don't actually want to do anything in the else case; and
Iterating over range(len(...)) is not Pythonic - use e.g. enumerate.
Your function could therefore be simplified to:
def convBin(A):
    """Convert non-zero values in 2-D array A into 1s."""
    for row in A:
        for j, val in enumerate(row):
            if val:
                row[j] = 1
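If the rows were loaded into a NumPy array instead of a list of lists, the same conversion could be vectorized; this is a sketch, not part of the answer above:
import numpy as np

def conv_bin_np(A):
    """Return a copy of A with every non-zero entry replaced by 1."""
    arr = np.asarray(A)
    return (arr != 0).astype(int)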

Need for speed: Slow nested groupbys and applys in Pandas

I am performing a complex transformation on a DataFrame. I thought it would be quick for Pandas, but the only way I've managed to do it is with some nested groupbys and applys, using lambda functions, and it is slow. It seems like the sort of thing where there should be built-in, faster methods. At n_rows=1000 it's 2 seconds, but I'll be doing 10^7 rows, so this is far too slow. It's difficult to explain what we're doing, so here's the code and profile, then I'll explain:
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
q = d.groupby(grps).apply(h) #Slow
824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
221770 0.105 0.000 0.105 0.000 {isinstance}
7329 0.104 0.000 0.217 0.000 index.py:86(__new__)
8309 0.089 0.000 0.423 0.000 series.py:430(__new__)
5375 0.081 0.000 0.081 0.000 {method 'reduce' of 'numpy.ufunc' objects}
34225 0.068 0.000 0.133 0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779 0.067 0.000 0.067 0.000 {numpy.core.multiarray.array}
5349 0.065 0.000 0.567 0.000 series.py:709(_get_values)
985/1 0.063 0.000 1.847 1.847 groupby.py:608(apply)
5349 0.056 0.000 0.198 0.000 _methods.py:42(_mean)
5358 0.050 0.000 0.232 0.000 index.py:332(__getitem__)
8309 0.049 0.000 0.228 0.000 series.py:3299(_sanitize_array)
9296 0.047 0.000 0.116 0.000 index.py:1341(__new__)
984 0.039 0.000 0.092 0.000 algorithms.py:105(factorize)
Group the DataFrame rows by the groupings. For each grouping, for each row, group by those values that are the same (i.e. all have the value 3 versus all have value 4). For each index in a value grouping, look up the corresponding index in dgs, and average. Then average for the row groupings.
::exhale::
Any suggestions on how to rearrange this for speed would be appreciated.
You can do the apply and groupby with one multilevel groupby; here is the code:
import pandas as pd
import numpy as np
from numpy import array, arange
from numpy.random import randint, seed

seed(42)
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
print d.groupby(grps).apply(h) #Slow

### my code starts from here ###
def group_process(df2):
    s = df2.stack()
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    return pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean().mean(level=1)

print d.groupby(grps).apply(group_process)
output:
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
It's about 70x faster, but I don't know if it can work with 10**7 rows.
