Huge text file to small excel files - python

I have a huge text file (4 GB), where each "line" is of the syntax:
[number] [number]_[number] [Text].
For example
123 12_14 Text 1
1234 13_456 Text 2
33 12_12 Text 3
24 678_10 Text 4
My goal is to save this data as an Excel file, where each "line" in the text file is a row in the Excel file. Following the example above:
[A1] 123
[B1] 12_14
[C1] Text 1
[A2] 1234
[B2] 13_456
[C2] Text 2
[A3] 33
[B3] 12_12
[C3] Text 3
[A4] 24
[B4] 678_10
[C4] Text 4
My plan is to iterate over the text "lines", as advised here, split each "line", and save the parts to cells in an Excel file.
Because of the size of the text, I thought of creating many small Excel files which, all together, would be equal to the text file.
Then I need to analyze the small Excel files, mainly to find terms that were mentioned in the [Text] cells and count the number of appearances, related to the [number] cells (representing a post and the ID of a post).
Finally, I need to summarize all this data in an Excel file.
I'm considering the best way to create and analyze the excel files.
As mentioned here, the main libraries are xlrd and csv.

"I'm pretty sure I don't have other options than small excel files, but what will be the another approach?"
Your huge text file is a type of database, although an inconvenient one. A bunch of small Excel files are another, even less convenient representation of the same database. I assume you are looking to make a bunch of small files because Excel has an upper limit on how many rows it can contain (65,536 or 1,048,576 depending on the version of Excel). However, as has been noted, Excel files are truly horrible database stores.
Since you are already using Python, use the sqlite3 module: it's already built in, it's a real database, it can handle more than a million rows, and it's fast.
But I wanted to get an idea of how fast it is with data on the scale you propose, so I created a 30M-row database of roughly the same complexity as your dataset. The schema is simple:
create table words
(id integer primary key autoincrement not null,
first text, second text, third text);
and populated it with random trigrams drawn from /usr/dict/words (I have a module for generating test data like this), which makes entries that look like:
sqlite> select * from words limit 5;
1|salvation|prorates|bird
2|fore|embellishment|empathized
3|scalier|trinity|graze
4|exes|archways|interrelationships
5|voguish|collating|partying
but a simple query for a row I knew was near the end took longer than I'd hoped:
select * from words where first == "dole" and second == "licked";
29599672|dole|licked|hates
took about 7 seconds on a pretty average 3-year-old desktop, so I added a couple of indexes:
create index first on words (first);
create index second on words (second);
which roughly doubled the size of the database file (from 1.1 GB to 2.3 GB) but brought the simple query time down to a rather reasonable 0.006 seconds. I don't think you'll do as well with Excel.
So parse your data however you must, but then put it in a real database.
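For the parsing step, here's a minimal sketch of loading the question's three-field lines into sqlite3. The file, database, and table names are hypothetical, and the free-text field is assumed to be everything after the second whitespace-separated token:
import sqlite3

conn = sqlite3.connect("posts.db")  # hypothetical database file
conn.execute("create table if not exists posts (id integer, pair text, body text)")

def rows(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split into at most three fields so the free text keeps its spaces.
            parts = line.rstrip("\n").split(maxsplit=2)
            if len(parts) == 3:
                yield parts

# executemany consumes the generator lazily, so the 4 GB file is streamed,
# never held in memory all at once.
conn.executemany("insert into posts values (?, ?, ?)", rows("huge_file.txt"))
conn.commit()
conn.close()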

What is the issue with just looping through the file line by line? If you have your heart set on Excel, I would recommend openpyxl.
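If you do go the Excel route, a rough sketch with openpyxl's write-only (streaming) mode might look like this; the input and output file names are made up, and each workbook is capped at a million rows to stay under Excel's per-sheet limit:
from openpyxl import Workbook

MAX_ROWS = 1_000_000             # keep each sheet under Excel's row limit
wb = Workbook(write_only=True)   # streaming mode: rows aren't kept in memory
ws = wb.create_sheet()
written, part = 0, 1

with open("huge_file.txt", encoding="utf-8") as f:   # hypothetical input file
    for line in f:
        parts = line.rstrip("\n").split(maxsplit=2)  # [number, number_number, text]
        if len(parts) != 3:
            continue
        ws.append(parts)
        written += 1
        if written >= MAX_ROWS:                      # roll over to a new workbook
            wb.save(f"output_part{part}.xlsx")
            wb = Workbook(write_only=True)
            ws = wb.create_sheet()
            written, part = 0, part + 1

wb.save(f"output_part{part}.xlsx")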

Related

Manipulating API data in Python before appending to a csv file

I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows of data in all (the data as it appears following a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so as to allow me to only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python 'print' command (the first asterisk is row 1; the fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>
I interpret your problem as not being able to append the newly fetched data to the CSV file. I am not sure whether you are working with some module that helps with CSV files, but I'll just assume you aren't for now.
If you just open the file with "a" as the second argument to open, like f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row and later append a new one, but I don't think that's the problem.
Hope I could help you, and please tell me whether I understood correctly what your problem is here :)
Btw, I would recommend looking into a CSV module for this or something like sqlite3.
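For what it's worth, a minimal line-based sketch of that idea, assuming the layout described in the question (4 header rows, 1 column-heading row, 11 data rows, and a trailing <EOF> row) and a hypothetical api_response object and output file name:
# Keep only the data rows: skip the 4 header rows and the column-heading row,
# and drop the trailing <EOF> marker and any blank lines.
lines = api_response.text.splitlines()
data_rows = [ln for ln in lines[5:] if ln and ln != "<EOF>"]

with open("output.csv", "a", newline="") as f:
    f.write("\n".join(data_rows) + "\n")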
The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and the last 5 characters of the API response from what is appended to the csv file. The first 248 characters of the string will always contain the same redundant info, and the last 5 characters will always be the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
The solution, for this particular case, turned out to be simpler than I had expected, and thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the api response, and for exploring the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) Unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can someone point me in the right direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM, so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here's a sample of what the index looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 can fit in memory, the best approach is to do a single scan of this file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can be stored as formatted strings, as you prefer.
After this, you can do a single scan of read1, and when an ID is encountered you can do a simple lookup in the dictionary to retrieve the needed data.
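A minimal sketch of that approach, assuming both files follow the 4-line record layout shown in the question (ID line, sequence, '+', quality) and that the IDs match exactly between the two files:
from itertools import islice

# Single scan of index1: build {ID: [the next three lines]}.
index = {}
with open("index1") as f:
    while True:
        record = [line.rstrip("\n") for line in islice(f, 4)]
        if len(record) < 4:
            break
        index[record[0]] = record[1:]

# Single scan of read1: every 4th line (the first of each record) is an ID.
with open("read1") as f, open("index2", "w") as out:
    for i, line in enumerate(f):
        if i % 4 == 0:
            key = line.rstrip("\n")
            if key in index:
                out.write(key + "\n" + "\n".join(index[key]) + "\n")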

How to stop truncating strings when I use the group by function

I have a table with columns: Location, Basic quals, Preferred quals, and Responsibilities.
The last three columns have string entries that I tokenized. I want to group the columns by Location. When I do this, my strings truncate, e.g. "we want an individual who knows python and java." turns into "we want an individual..."
How do I prevent this from happening?
grouped_location=pd.DataFrame(df1['Pref'].groupby(df1['Location']))
grouped_location.columns = ['Loaction','Pref']
grouped_location=grouped_location.set_index('Loaction')
grouped_location.iat[0,0]
I expect to get
17 [Experience, in, design, verification,, includ (full entry)]
but what I get is:
17 [Experience, in, design, verification,, includ...
Try saving the dataframe out to csv; it's probably only the display configuration that's truncating it.
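A quick way to check, assuming it is indeed just pandas' display width limit (the option value below applies to recent pandas versions):
import pandas as pd

# None removes the per-cell width limit in printed output (use -1 on old pandas).
pd.set_option("display.max_colwidth", None)
print(grouped_location.iat[0, 0])

# Writing it out also shows the full strings, since files aren't truncated.
grouped_location.to_csv("grouped_location.csv")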

Python: Removing duplicates from a huge csv file (memory issues)

I have a csv file that is very big, containing a load of different people. Some of these people come up twice. Something like this:
Name,Colour,Date
John,Red,2017
Dave,Blue,2017
Tom,Blue,2017
Amy,Green,2017
John,Red,2016
Dave,Green,2016
Tom,Blue,2016
John,Green,2015
Dave,Green,2015
Tom,Blue,2015
Rebecca,Blue,2015
I want a csv file that contains only the most recent colour for each person. For example, for John, Dave, Tom and Amy I am only interested in the row for 2017. For Rebecca I will need the value from 2015.
The csv file is huge, containing over 10 million records (all people have a unique ID so repeated names don't matter). I've tried something along the lines of the following:
Open csv file
Read line 1.
If person is not in "seen" list, add to csv file 2
Add person to "Seen" list.
Read line 2...
The problem is the "seen" list gets massive and I run out of memory. The other issue is sometimes the dates are not in order so an old entry gets into the "seen" list and then the new entry won't overwrite it. This would be easy to solve if I could sort the data by descending date, but I'm struggling to sort it with the size of the file.
Any suggestions?
If the whole csv file can be stored in a list like:
csv_as_list = [
(unique_id, color, year),
…
]
then you can sort this list by:
import operator
# first sort by year descending
csv_as_list.sort(key=operator.itemgetter(2), reverse=True)
# then, since the Python sort is stable, by unique_id
csv_as_list.sort(key=operator.itemgetter(0))
and then you can:
from __future__ import print_function
import operator, itertools
for unique_id, group in itertools.groupby(csv_as_list, operator.itemgetter(0)):
    latest_color = next(group)[1]
    print(unique_id, latest_color)
(I just used print here, but you get the gist.)
If the csv file cannot be loaded in-memory as a list, you'll have to go through an intermediate step that uses disk (e.g. SQLite).
Open your csv file for reading.
Read it line by line, appending the user to final_list if their ID is not already found there. If it is found, compare the year of your current data with your final_list data. If the current data has a more recent entry, just change the date of your user in final_list, along with the color associated with it.
Only then, when your final_list is done, will you write a new csv file.
If you want this task to be faster, you want to...
Optimize your loops.
Use standard python functions and/or libraries coded in C.
If this is still not optimized enough... learn C. Reading a csv file in C, parsing it with a separator, and iterating through an array is not hard, even in C.
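A minimal sketch of that single pass, assuming the Name,Colour,Date layout from the question and swapping the list for a dict keyed by ID so the lookups stay fast (the file names are made up):
import csv

latest = {}   # ID -> (date, colour), keeping only the most recent entry seen

with open("people.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for person, colour, date in reader:
        # String comparison works here because the dates are 4-digit years;
        # parse them properly for anything more complex.
        if person not in latest or date > latest[person][0]:
            latest[person] = (date, colour)

with open("deduplicated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for person, (date, colour) in latest.items():
        writer.writerow([person, colour, date])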
I see two obvious ways to solve this that don't involve keeping huge amounts of data in memory:
Use a database instead of CSV files
Reorganise your CSV files to facilitate sorting.
Using a database is fairly straightforward. I expect you could even use the SQLite that comes with Python. This would be my preferred option, I think. To get the best performance, create an index on (person, date).
The second involves letting the first column of your CSV file be the person ID and the second column be the date. Then you could sort the CSV file from the commandline, i.e. sort myfile.csv. This will group all entries for a particular person together, and provided your date is in a proper format (e.g. YYYY-MM-DD), the entry of interest will be the last one. The Unix sort command is not known for its speed, but it's very robust.
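As a sketch of that second option: once the file has been sorted (e.g. sort myfile.csv > sorted.csv) with the ID first and the date second, a single streaming pass can keep the last row of each group. The file names here are hypothetical:
import csv

# After sorting, all rows for an ID are adjacent and in ascending date order,
# so the last row of each group is the most recent one.
with open("sorted.csv", newline="") as src, open("latest.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    previous = None
    for row in reader:
        if previous is not None and row[0] != previous[0]:
            writer.writerow(previous)    # the ID changed: emit the last row seen
        previous = row
    if previous is not None:
        writer.writerow(previous)        # don't forget the final group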

Most efficient way to pull information from a table

As a new-to-Python person, the only way that I can think of to grab information from a table (separated only by whitespace) is to call an item by its position, using the row and column it's located in. I've been using numpy a lot, like so:
information_table = np.array([])  # the table I'm pulling data from
info_i_need_from_table = information_table[i][j]  # where i/j is the location of whatever info I need
Is this the optimal way of grabbing information from a table? I'm new to Python, so as far as I know this is the only way to do it, but I'd be willing to bet I'm wrong. Say my information_table is fairly large: thousands of rows, hundreds of columns. Would you utilize the same 'tool' to pull information from, say, a much smaller table?
As an example of one of the tables I'm working with:
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.26E-05 1.70E+14 81 2.61E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792598059
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.60E-05 1.70E+14 81 3.12E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792620505
Those are just the first two rows. Another table I may be working with would just have numerous rows and columns of floating point numbers (at about 5 sig-figs).
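As an aside, a whitespace-separated table like that can be loaded in one call and then indexed positionally in the same way. A small sketch using pandas (the file name is made up), which copes with the mixed text and number columns:
import pandas as pd

# sep=r"\s+" splits on any run of whitespace; header=None because the file has no header row.
table = pd.read_csv("models_table.txt", sep=r"\s+", header=None)

value = table.iat[1, 6]   # row 1, column 6 -> 1.60E-05 in the sample above
print(value)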
