Most efficient way to pull information from a table - python

As a new-to-Python person, the only way I can think of to grab information from a table (separated only by whitespace) is to call an item by its position, using the row and column it's located in. I've been using numpy a lot, like so:
import numpy as np

information_table = np.array([])  # The table I'm pulling data from
info_i_need_from_table = information_table[i][j]  # Where i/j is the location of whatever info I need
Is this the optimal way of grabbing information from a table? I'm new to Python, so as far as I know this is the only way to do it, but I'd be willing to bet I'm wrong. Say my information_table is fairly large: thousands of rows, hundreds of columns. Would you use the same 'tool' to pull information from, say, a much smaller table?
As an example of one of the tables I'm working with:
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.26E-05 1.70E+14 81 2.61E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792598059
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.60E-05 1.70E+14 81 3.12E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792620505
Those are just the first two rows. Another table I may be working with would just have numerous rows and columns of floating point numbers (at about 5 sig-figs).
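For what it's worth, a minimal sketch of loading a whitespace-separated table like this with pandas, assuming a hypothetical file name models.txt and no header row; it lets you index by position or pull whole columns at once rather than building the array by hand:
import pandas as pd

# Hypothetical file name; any run of whitespace separates the columns
df = pd.read_csv("models.txt", sep=r"\s+", header=None)

# Positional access, equivalent to information_table[i][j]
value = df.iloc[0, 4]        # row 0, column 4 ("Al2O3" in the sample rows above)

# Whole-column access, which avoids pulling cell by cell
some_column = df[6]          # the column holding values like 1.26E-05
print(value, some_column.head())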

Related

Duplicate function returning non-duplicated results on a BLAST hittable

New to Python (3 weeks!) and unsurprisingly having difficulties. I'm working on a BLAST hittable and am trying to identify sequences coming from the same hit by using duplicated() on the accession number only. I do not want to discard these results, but rather save them to a new file so I can take a look to see if anything interesting is popping up.
A snippet of the table (this is purely an example; the real table includes 11 columns, but it seems excessive to print them all here):
Query   Accession       % identity   mismatches   gaps   s start   s end
Q112    ABCDEFG111222   90.99        9            3      1000      2000
Q112    HIJKLMN222111   80           14           98     128       900
Q112    OPQRSTUV33111   76           2            23     12        900
I'm importing the file to make it a data frame using pandas, then using reset_index to replace the query number with an index.
I have then done the following:
To find out if I have any duplicates in the Accession column:
print(file.Accession.duplicated().sum())
To put those results into a new data frame (n is the same):
filedupes = file.loc[file.Accession.duplicated()]
Finally, to write it to a csv for me to look through:
filedupes.to_csv('Filedupes.csv', sep='\t', encoding='utf-8')
This does half work, as my CSV does contain duplicated entries based on the Accession number only, but it also contains some unique Accession number entries. These entries only seem to have the first 2 letters identical to other entries, but then the rest is unique.
i.e. I have XM_JK1234343 and XM_983HSJAN and XM_83QMZBDH1 included despite no other entry with those accessions being present (I have checked using find/replace). The other 11 columns are also unique to these strange entries.
I am stumped: is it my code? Have I not specified enough and allowed the above examples to be lumped in with other legitimate duplicates? I have tried to find whether someone else has asked a similar question, but no luck. Thank you kindly in advance for any insight, and apologies in advance if this is a silly mistake!
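For reference, a minimal sketch (with hypothetical accession values) of how duplicated() flags rows: the default keep='first' marks only the second and later occurrences, while keep=False marks every member of a duplicated group:
import pandas as pd

# Toy frame with one accession repeated and one genuinely unique (hypothetical values)
file = pd.DataFrame({"Query": ["Q112", "Q112", "Q113"],
                     "Accession": ["XM_AAA111", "XM_AAA111", "XM_BBB222"]})

print(file.Accession.duplicated())             # False, True, False (default keep='first')

filedupes = file.loc[file.Accession.duplicated(keep=False)]
print(filedupes)                               # both XM_AAA111 rows; XM_BBB222 is excluded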

Manipulating API data in Python before appending to a csv file

I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows of data in all (the data as it appears following a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so as to allow me to only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python 'print' command (the first asterisk is row 1; the fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>
I interpret your problem as not being able to append the newly fetched data to the CSV file. I'm not sure whether you're working with a module that helps with CSV files, so for now I'll assume you aren't.
If you just open the file with "a" as the second argument to open, like f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row and later append a new one, but I don't think that's the problem.
Hope I could help you, and please tell me whether I understood your problem correctly :)
By the way, I would recommend looking into a CSV module for this, or something like sqlite3.
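A minimal sketch of that append step, assuming the raw response text is in a hypothetical string variable api_response_text and the output file is output.csv; it drops the 5 header rows and the <EOF> line before appending:
lines = api_response_text.splitlines()

# Keep only the data rows: skip the 4 header rows plus the column-heading row,
# and drop the trailing <EOF> marker and any blank lines
data_rows = [ln for ln in lines[5:] if ln.strip() and ln.strip() != "<EOF>"]

with open("output.csv", "a") as f:   # "a" appends rather than overwriting
    f.write("\n".join(data_rows) + "\n")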
The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and last 5 characters of the API response before appending to the csv file. The first 248 characters of the string will always contain the same redundant info, and the last 5 characters will always be the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
The solution, for this particular case, turned out to be simpler than I had expected. Thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the API response, and for exploring the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here's a sample of what the index looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data in index1 can fit in memory, the best approach is to do a single scan of this file and store all the data in a dictionary, like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can be stored as a formatted string instead, if you prefer.
After this, you can do a single scan over read1, and when an ID is encountered you can do a simple dictionary lookup to retrieve the needed data.
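A minimal sketch of that two-pass approach, with hypothetical file names, assuming the ID lines in read1 match the index1 ID lines character for character:
# Pass 1: single scan of index1, mapping each ID line to the 3 lines that follow it
index = {}
with open("index1.txt") as f:
    lines = f.read().splitlines()
for i in range(0, len(lines) - 3, 4):
    index[lines[i]] = lines[i + 1:i + 4]

# Pass 2: single scan of read1, writing out only the index1 records whose ID it contains
with open("read1.txt") as f_in, open("index2.txt", "w") as f_out:
    for lineno, line in enumerate(f_in):
        if lineno % 4 == 0:                     # IDs sit on every 4th line
            record = index.get(line.rstrip("\n"))
            if record is not None:
                f_out.write("\n".join([line.rstrip("\n")] + record) + "\n")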

Efficient way to intersect multiple large files containing geodata

Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 files of ~300-400 MB each, based on 3 of 139 attributes:
Now a bit more background. The files range from ~300-400 MB, consisting of 139 columns and typically in the range of 400,000-600,000 rows. I have three particular fields I want to join on: a unique ID, and latitude/longitude (with a bit of tolerance if possible). The goal is to determine which of these records existed across certain ranges of files. Going worst case, that will mean performing a 22-file intersection.
So far, the following has failed
I tried using MySQL to perform the join. This was back when I was only looking at 7 years. Attempting the join on 7 years (using INNER JOIN about 7 times... e.g. t1 INNER JOIN t2 ON condition INNER JOIN t3 ON condition ... etc), I let it run for about 48 hours before the timeout ended it. Was it likely to actually still be running, or does that seem overly long? Despite all the suggestions I found to enable better multithreading and more RAM usage, I couldn't seem to get the cpu usage above 25%. If this is a good approach to pursue, any tips would be greatly appreciated.
I tried using ArcMap. I converted the CSVs to tables and imported them into a file geodatabase. I ran the intersection tool on two files, which took about 4 days, and the number of records returned was more than twice the number of input features combined: each file had about 600,000 records, and the intersection returned 2,000,000 results. In other cases, not all records were recognized by ArcMap; ArcMap says there are 5,000 records when in reality there are 400,000+.
I tried combining in Python. Firstly, I can immediately tell RAM is going to be an issue: each file takes up roughly 2 GB of RAM in Python when fully loaded. I do this with:
import csv

f1 = [row for row in csv.reader(open('file1.csv', 'rU'))]
f2 = [row for row in csv.reader(open('file2.csv', 'rU'))]
joinOut = csv.writer(open('Intersect.csv', 'wb'))
uniqueIDs = set(row[uniqueIDIndex] for row in f1) | set(row[uniqueIDIndex] for row in f2)
for uniqueID in uniqueIDs:
    f1rows = [row for row in f1 if row[uniqueIDIndex] == uniqueID]
    f2rows = [row for row in f2 if row[uniqueIDIndex] == uniqueID]
    if len(f1rows) == 0 or len(f2rows) == 0:
        pass  # Not an intersect
    else:
        # Strings, split at the decimal; if the integer part and the first 3 places
        # after the decimal are equal, they are spatially close enough
        f1lat = f1rows[0][latIndex].split('.')
        f1long = f1rows[0][longIndex].split('.')
        f2lat = f2rows[0][latIndex].split('.')
        f2long = f2rows[0][longIndex].split('.')
        if f1lat[0]+f1lat[1][:3] == f2lat[0]+f2lat[1][:3] and f1long[0]+f1long[1][:3] == f2long[0]+f2long[1][:3]:
            joinOut.writerows([f1rows[0], f2rows[0]])
Obviously, this approach requires that the files being intersected are available in memory, and I only have 16 GB of RAM while 22 files would need ~44 GB. I could change it so that, as each uniqueID is iterated, it opens and parses each file for the rows with that uniqueID. This has the benefit of reducing the memory footprint to almost nothing, but with hundreds of thousands of unique IDs it could take an unreasonable amount of time to execute.
So, here I am, asking for suggestions on how I can best handle this data. I have an i7-3770k at 4.4 GHz, 16 GB of RAM, and a Vertex 4 SSD rated at 560 MB/s read speed. Is this machine even capable of handling this amount of data?
Another avenue I've thought about exploring is an Amazon EC2 cluster and Hadoop. Would that be a better idea to investigate?
Suggestion: pre-process all the files to extract the 3 attributes you're interested in first. You can always keep track of the file/row number as well, so you can reference all the original attributes later if you want.
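A minimal sketch of that pre-processing pass, with hypothetical column indices, file names, and glob pattern; each wide file is reduced to a slim (ID, lat, long, source file, row number) listing that is far cheaper to intersect later:
import csv
import glob

# Hypothetical positions of the three attributes of interest among the 139 columns
uniqueIDIndex, latIndex, longIndex = 0, 5, 6

with open("slim.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in glob.glob("year_*.csv"):        # hypothetical input file pattern
        with open(path, newline="") as f:
            for rownum, row in enumerate(csv.reader(f)):
                writer.writerow([row[uniqueIDIndex], row[latIndex], row[longIndex],
                                 path, rownum])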

Huge text file to small excel files

I have a huge text file (4 GB), where each "line" is of the syntax:
[number] [number]_[number] [Text].
For example
123 12_14 Text 1
1234 13_456 Text 2
33 12_12 Text 3
24 678_10 Text 4
My purpose is to have this data saved as an Excel file, where each "line" in the text file is a row in the Excel file. Following the example above:
[A1] 123
[B1] 12_14
[C1] Text 1
[A2] 1234
[B2] 13_456
[C2] Text 2
[A3] 33
[B3] 12_12
[C3] Text 3
[A4] 24
[B4] 678_10
[C4] Text 4
My plan is to iterate over the text "lines", as advised here, separate the "lines", and save them to cells in an Excel file.
Because of the size of the text, I thought to create many small Excel files which, all together, will be equivalent to the text file.
Then I need to analyze the small Excel files, mainly to find terms that were mentioned in the [Text] cells and count the number of appearances, related to the [number] cells (representing a post and the ID of a post).
Finally, I need to sum all this data in an Excel file.
I'm considering the best way to create and analyze the Excel files.
As mentioned here, the main libraries are xlrd and csv.
"I'm pretty sure I don't have options other than small Excel files, but what would another approach be?"
Your huge text file is a type of database, although an inconvenient one. A bunch of small Excel files is another, even less convenient, representation of the same database. I assume you are looking to make a bunch of small files because Excel has an upper limit on how many rows it can contain (65,000 or 1,000,000 depending on the version of Excel). However, as has been noted, Excel files are truly horrible database stores.
Since you are already using Python, use module sqlite3, it's already built in and it's a real database, and it can handle more than a million rows. And it's fast.
But I wanted to get an idea of how fast it would be with data on the scale you propose, so I created a 30M-row database of roughly the same complexity as your dataset. The schema is simple:
create table words
(id integer primary key autoincrement not null,
first text, second text, third text);
and populated it with random trigrams drawn from /usr/dict/words (I have a module for generating test data like this), which makes entries that look like:
sqlite> select * from words limit 5;
1|salvation|prorates|bird
2|fore|embellishment|empathized
3|scalier|trinity|graze
4|exes|archways|interrelationships
5|voguish|collating|partying
but a simple query for a row I knew was near the end took longer than I'd hoped:
select * from words where first == "dole" and second == "licked";
29599672|dole|licked|hates
took about 7 seconds on a pretty average 3-year-old desktop, so I added a couple of indexes:
create index first on words (first);
create index second on words (second);
which did double the size of the database file from 1.1 GB to 2.3 GB, but brought the simple query time down to a rather reasonable 0.006 seconds. I don't think you'll do as well with Excel.
So parse your data however you must, but then put it in a real database.
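A minimal sketch of that load step using Python's built-in sqlite3 module, assuming a hypothetical input file huge.txt in the [number] [number]_[number] [Text] format and the words schema shown above:
import sqlite3

conn = sqlite3.connect("words.db")                 # hypothetical database file
conn.execute("""create table if not exists words
                (id integer primary key autoincrement not null,
                 first text, second text, third text)""")

with open("huge.txt") as f:
    # The third field may itself contain spaces, so split at most twice
    rows = (line.rstrip("\n").split(" ", 2) for line in f)
    conn.executemany("insert into words (first, second, third) values (?, ?, ?)", rows)

conn.execute("create index if not exists first_idx on words (first)")
conn.commit()

for row in conn.execute("select * from words where first = ?", ("123",)):
    print(row)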
What is the issue with just looping through the file line by line? If you have your heart set on Excel, I would recommend openpyxl.
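A minimal sketch of that line-by-line approach with openpyxl, assuming hypothetical file names and an input small enough to stay under Excel's row limit:
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

with open("huge.txt") as f:
    for line in f:
        # [number] [number]_[number] [Text] -- the text may contain spaces
        number, pair, text = line.rstrip("\n").split(" ", 2)
        ws.append([number, pair, text])    # one text line becomes one worksheet row

wb.save("output.xlsx")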
