Text file parsing as fast as possible - Python

I have a very large file with lines like follows:
....
0.040027 a b c d e 12 34 56 78 90 12 34 56
0.050027 f g h i l 12 34 56 78 90 12 34 56
0.060027 a b c d e 12 34 56 78 90 12 34 56
0.070027 f g h i l 12 34 56 78 90 12 34 56
0.080027 a b c d e 12 34 56 78 90 12 34 56
0.090027 f g h i l 12 34 56 78 90 12 34 56
....
I need to build the dictionary shown below the code, in the fastest way possible.
I'm using the following code:
import time

ascFile = open('C:\\example.txt', 'r', encoding='UTF-8')
tag1 = ' a b c d e '
tag2 = ' f g h i l '
tags = [tag1, tag2]
temp = {'k1': [], 'k2': []}
key_tag = {'k1': tag1, 'k2': tag2}

t1 = time.time()
for line in ascFile:
    for path, tag in key_tag.items():
        if tag in line:
            columns = line.strip().split(tag, 1)
            temp[path].append([columns[0], columns[-1].replace(' ', '')])
t2 = time.time()
print(t2 - t1)
This takes about 6 seconds to parse a 360 MB file, and I'd like to improve that time. The result looks like this:
temp = {'k1': [['0.040027', '1234567890123456'], ['0.060027', '1234567890123456'], ['0.080027', '1234567890123456']],
        'k2': [['0.050027', '1234567890123456'], ['0.070027', '1234567890123456'], ['0.090027', '1234567890123456']]}

I assume you have a fixed number of words in the file that serve as your keys. Use split() to break each line into words, then take a slice of the resulting list to compute your key directly:
import collections

# Raw strings don't need \\ for backslashes:
FILESPEC = r'C:\example.txt'

lines_by_key = collections.defaultdict(list)
with open(FILESPEC, 'r', encoding='UTF-8') as f:
    for line in f:
        cols = line.split()
        key = ' '.join(cols[1:6])
        pair = (cols[0], ''.join(cols[6:]))  # a tuple, not a list; could be changed
        lines_by_key[key].append(pair)
print(lines_by_key)

I used partition instead of split so that the 'in' test and splitting can be done in a single pass.
for line in ascFile:
    for path, tag in key_tag.items():
        val0, tag_found, val1 = line.partition(tag)
        if tag_found:
            temp[path].append([val0, val1.replace(' ', '')])
            break
Is this any better with your 360MB file?
You might also do a simple test where all you do is loop through the file a line at a time:
for line in ascFile:
    pass
This will tell you what your best possible time will be.
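To put both numbers side by side, here is a minimal timing sketch (the file path and tags are placeholders taken from the question) that measures the bare read loop against the partition-based parser:
import time

KEY_TAG = {'k1': ' a b c d e ', 'k2': ' f g h i l '}

def baseline(path):
    # Best-case time: just read the file line by line.
    with open(path, 'r', encoding='UTF-8') as f:
        for line in f:
            pass

def parse(path):
    # The partition-based parser from the answer above.
    temp = {key: [] for key in KEY_TAG}
    with open(path, 'r', encoding='UTF-8') as f:
        for line in f:
            for key, tag in KEY_TAG.items():
                val0, tag_found, val1 = line.partition(tag)
                if tag_found:
                    temp[key].append([val0, val1.replace(' ', '')])
                    break
    return temp

for func in (baseline, parse):
    t0 = time.time()
    func('example.txt')
    print(func.__name__, time.time() - t0)
The difference between the two timings tells you how much of the total cost is the parsing itself rather than the I/O.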


Strip the last character from a string if it is a letter in a Python dataframe

This can probably be done with regular expressions, which I am not very strong at.
My dataframe is like this:
import pandas as pd
import regex as re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
df = pd.DataFrame(data)
df
postcode total
0 DG14 44
1 EC3M 54
2 BN45 56
3 M2 78
4 WC2A 87
5 W1C 35
6 PE35 36
I want to get these strings in my column with the last letter stripped like so:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36
Probably something using re.sub('', '\D')?
Thank you.
You could use str.replace here (note that pandas 2.0 and later require regex=True to be passed explicitly for a regular-expression pattern):
df["postcode"] = df["postcode"].str.replace(r'[A-Za-z]$', '', regex=True)
One of the approaches:
import pandas as pd
import re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
data['postcode'] = [re.sub(r'[a-zA-Z]$', '', item) for item in data['postcode']]
df = pd.DataFrame(data)
print(df)
Output:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36

Reading a txt file with numbers and summing them in Python

I have a txt file with the following text in it:
2
4 8 15 16 23 42
1 3 5
6
66 77
77
888
888 77
34
23 234 234
1
32
3
23 23 23
365
22 12
I need a way to read the file and sum all the numbers.
I have this code for now but am not sure what to do next. Thanks in advance.
lstComplete = []
fichNbr = open("nombres.txt", "r")
lstComplete = fichNbr
somme = 0
for i in lstComplete:
    i = i.split()
Turn them into a list and sum them:
with open('nombres.txt', 'r') as f:
    num_list = f.read().split()
print(sum(int(n) for n in num_list))
Returns 3227
Open the file, use the read() method to get the content, convert the strings to int, and use sum() to get the result:
>>> sum(map(int,open('nombres.txt').read().split()))
3227
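If you would rather finish the question's original line-by-line approach, here is a minimal sketch:
# Completes the question's loop: split each line and add its numbers to somme.
somme = 0
with open("nombres.txt", "r") as fichNbr:
    for line in fichNbr:
        somme += sum(int(n) for n in line.split())
print(somme)  # 3227 for the data above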

Add columns from a file to a specific position in another file

I have 3 files. They all have IDs in their headers. What I want to do is find the intersecting IDs (file 1 is the template) and then copy the columns with the matching ID behind the ID in the template file.
Here is an example:
File 1 is the template:
name 123 124 125 128 131 145 156
rdt4 35 12 23 21 36 34 37
gtf2 24 18 18 29 26 12 40
hzt7 40 23 26 25 13 21 28
File 2:
name 123 124 125 126 127 128 131 132 133 145 156
rdt4 F F F T T F T T T F T
gtf2 F F F T T F T T T F T
hzt7 F F F T T F T T T F T
File 3:
name 123_a 123_b 123_c 124_a 124_b 124_c 125_a 125_b 125_c 126_a 126_b 126_c 127_a 127_b 127_c 128_a 128_b 128_c and so on
rdt4 0,087 0,265 0,632 0,220 0,851 0,271 0,436 0,148 0,080 0,899 0,636 0,467 0,508 0,460 0,393 0,689 0,427 0,798
gtf2 0,770 0,971 0,231 0,969 0,494 0,181 0,989 0,155 0,351 0,131 0,204 0,553 0,581 0,138 0,982 0,287 0,702 0,522
hzt7 0,185 0,535 0,093 0,807 0,487 0,786 0,886 0,905 0,966 0,283 0,490 0,190 0,688 0,714 0,577 0,643 0,476 0,738
The final file should look like this:
name 123 123 123_b 124 124 124_b 125 125 125_b 128 128 128_b 131 131 131_b 145 145 145_b 156 156 156_b
rdt4 35 F 0,265 12 F 0,851 23 F 0,148 21 F 0,427 36 T 34 F 37 T
gtf2 24 F 0,971 18 F 0,494 18 F 0,155 29 F 0,702 26 T 12 F 40 T
hzt7 40 F 0,535 23 F 0,487 26 F 0,905 25 F 0,476 13 T 21 F 28 T
Note: I skipped typing in everything for file 3 because it has the same IDs as file 2, but in file 3 every ID has 3 columns and I need only one of them (column b in the example).
What I tried so far:
I first started by doing everything with only file 1 and file 2.
I copied the IDs to a new list and then found the positions of these IDs in file 2 to extract the data. But this seems very tricky (at least for me). The appending works so far, but the problem is that every list stored in the list final is the same. It would be nice if you could help me with this.
This is my code so far:
try:
    Expr_Matrix_1 = "file1.txt"
    #Expr_Matrix_1 = raw_input('Name file with expression data: ')
    Expr_Matrix_2 = open(Expr_Matrix_1)
    Expr_Matrix_3 = open(Expr_Matrix_1)
except:
    print 'This is a wrong filename!'
    exit()
try:
    Probe_Detect_1 = "file2.txt"
    #Probe_Detect_1 = raw_input('Name of file with probe detection: ')
    Probe_Detect_2 = open(Probe_Detect_1)
    Probe_Detect_3 = open(Probe_Detect_1)
except:
    print 'This is a wrong filename!'
    exit()

find_list = list()
for b, line2 in enumerate(Expr_Matrix_2):
    line2 = line2.rstrip()
    line2 = line2.split("\t")
    if b == 0:
        for item in line2:
            find_list.append(item)
find_list = find_list[7:]

find_list2 = list()
for i, line in enumerate(Probe_Detect_2):
    line = line.rstrip()
    line = line.split("\t")
    if i == 0:
        for item in find_list:
            find_list2.append(line.index(item))
#print find_list2

index1 = 8
final = list()
for b, line2 in enumerate(Expr_Matrix_3):
    line2 = line2.rstrip()
    line2 = line2.split("\t")
    for c, line in enumerate(Probe_Detect_3):
        line = line.rstrip()
        line = line.split("\t")
        if line2[b] == line[c]:
            for item in find_list2:
                if len(line2) < 1551:
                    line2.insert(index1, line[item])
                    index1 = index1 + 2
    final.append(line2)
print final[1]
The first ID column in file 1 is column 7; that's why I used 7 for slicing.
The 1551 is the number of rows it should be copied to, but I think this is a completely wrong approach. However, I wanted to show you my attempt!
Another note: all files start with the name column, but between this column and the first ID column there are some columns which shouldn't be considered. Because file 1 is the template, those columns should also appear in the final file.
What is the solution?
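One possible direction, as a hedged sketch: build a lookup table per file keyed by row name and column ID, then emit file 1's columns interleaved with the matching entries from files 2 and 3. The filenames, the tab separator, and the '_b' suffix are assumptions taken from the example, and for simplicity it treats every column after name as an ID (the real files have extra leading columns, hence the [7:] slice above):
import csv

def load(path):
    """Return (header, {row_name: {column_id: value}}) for a tab-separated file."""
    with open(path) as f:
        rows = list(csv.reader(f, delimiter='\t'))
    header = rows[0]
    return header, {r[0]: dict(zip(header[1:], r[1:])) for r in rows[1:]}

# Hypothetical filenames from the question.
h1, t1 = load('file1.txt')
_, t2 = load('file2.txt')
_, t3 = load('file3.txt')

ids = h1[1:]  # file 1 is the template, so its IDs drive the output
out_header = ['name']
for i in ids:
    out_header += [i, i, i + '_b']

with open('final.txt', 'w', newline='') as out_file:
    w = csv.writer(out_file, delimiter='\t')
    w.writerow(out_header)
    for name, values in t1.items():
        row = [name]
        for i in ids:
            # .get() leaves a blank when an ID is missing from file 2 or 3.
            row += [values[i], t2.get(name, {}).get(i, ''), t3.get(name, {}).get(i + '_b', '')]
        w.writerow(row)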

Python parsing data from a website using regular expressions

I'm trying to parse some data from this website:
http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written some code:
import urllib.request
import re
name = re.compile('<td>(.+?)')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
First question: player_id returns the whole URL, "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that?
(My attempt is described in #player_id_num.)
Second question: I am not able to get stat_c when the span class is empty, as in class="".
Is there a way to get these resolved? I am not very familiar with RE (regular expressions); I looked up tutorials online, but it's still unclear what I am doing wrong.
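For the first question, a small sketch: putting a capturing group around only the digits extracts the ID from the href (pattern assumed from the URL shown in the question):
# Capture only the digits after 'playerid='; the sample href is from the question.
import re

player_id_num = re.compile(r'playerid=(\d+)')
print(player_id_num.findall('player.asp?playerid=4209661'))  # ['4209661']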
Very simple using the pandas library.
Code:
import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print(dfs[3])
# dfs[3].to_csv("stats.csv")  # Send to a CSV file.
print(dfs[3].head())
Result:
0 1 2 3 4 5 6 7 8 9 10 \
0 Pos Name Age T PO FI CO SY HR RA GL
1 P George Pacheco 38 R 4858 7484 8090 7888 6777 4353 6979
2 P David Montoya 34 R 3944 5976 6673 8699 6267 6685 5459
3 P Robert Cole 34 R 5769 7189 7285 5863 6267 5868 5462
4 P Juanold McDonald 32 R 69100 5772 4953 4866 5976 67100 5362
11 12 13 14 15 16
0 AR EN RL Fatigue Salary NaN
1 3747 6171 -3 100% --- $3,672,000
2 5257 5975 -4 96% 2% $2,736,000
3 4953 5061 -4 96% 3% $2,401,000
4 5982 5263 -4 100% --- $1,890,000
You can apply whatever cleaning methods you want from here onwards. Code is rudimentary so it's up to you to improve it.
More Code:
import itertools
import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3]  # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()

# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1, :]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create two header rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change instead.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# Interleave the two equal-length lists
# (http://stackoverflow.com/a/3678930/2548721).
comb = list(itertools.chain.from_iterable(zip(orig, clone)))

# Construct the new header.
new_header = header[0:4] + comb + header[13:]

# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header]  # Drop the other columns.
print(df.head())
More result:
Pos Name Age T POr POp FIr FIp COr COp ... RAp GLr \
0 P George Pacheco 38 R 48 58 74 84 80 90 ... 53 69
1 P David Montoya 34 R 39 44 59 76 66 73 ... 85 54
2 P Robert Cole 34 R 57 69 71 89 72 85 ... 68 54
3 P Juanold McDonald 32 R 69 100 57 72 49 53 ... 100 53
4 P Trevor White 37 R 61 66 62 64 67 67 ... 38 48
GLp ARr ARp ENr ENp RL Fatigue Salary
0 79 37 47 61 71 -3 100% $3,672,000
1 59 52 57 59 75 -4 96% $2,736,000
2 62 49 53 50 61 -4 96% $2,401,000
3 62 59 82 52 63 -4 100% $1,890,000
4 50 70 100 62 69 -4 100% $1,887,000
Obviously, what I did instead was separate the Real values from the Potential values. A few tricks were used, but it gets the job done, at least for the first table of players. The next few tables require a degree of manipulation.

Python: How to write values to a csv file from another csv file

In the index.csv file, the fourth column has ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to a row of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections

data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2), len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if row[3] == ch_row + 1:
            out = row.tolist() + data1[ch_row].tolist()
            print(out)
writer = csv.writer(open('dn.csv', 'w'), delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed rows from filename.csv into index.csv, storing the numbers in the 5th, 6th, and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)", it comes out a correct answer. However, when I input "out" in the shell, there are only one row appears like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0]
What I need is to store all the values in the "out" variables and write them to the dn.csv file.
This ought to do the trick for you:
Code:
from csv import reader, writer

data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")
for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses 0-based indexes, so you may have to fiddle with the above code to fit your needs exactly.
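For example, if the fourth column holds 1-based indexes as in the question's data, the lookup in the loop above needs a one-line adjustment:
# Subtract one to map the file's 1-based index to Python's 0-based list.
out.writerow(row + data[int(row[3]) - 1])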
with open('dn.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt yields floats; an int is needed for indexing
        out = [idx] + [x for x in data1[idx - 1]]
        writer.writerow(out)
