Viewing a portion of a very large CSV file? - python

I have a ~1.0 GB CSV file, and when I try to load it into Excel just to view it, Excel crashes. I don't know the schema of the file, so it's difficult for me to load it into R or Python. The file contains restaurant reviews and has commas in it.
How can I open just a portion of the file (say, the first 100 rows, or 1 MB's worth) in Windows Notepad or Excel?

In my version of Excel, the open dialogs do not seem to offer a "read only this many lines" option, only a "start at line" option (used to skip headers, I guess).
So if you have no head binary at hand on your platform but do have Python, a simplistic working solution for your case would be (hard-coded to 100 lines, i.e. rows):
#! /usr/bin/env python
from __future__ import print_function
import sys

LINE_COUNT = 100


def main():
    """Copy the first LINE_COUNT lines of the input file to the output file."""
    if len(sys.argv) != 3:
        sys.exit("Usage: InFile OutHead100File")
    in_name, out_name = sys.argv[1:3]
    print("Simple head(100)[%s] -> %s ..." % (in_name, out_name))
    with open(in_name, 'rt') as f_in, open(out_name, 'wt') as f_out:
        for _ in range(LINE_COUNT):
            f_out.write(f_in.readline())


if __name__ == '__main__':
    main()
One would call the above code like this (assuming it is stored in a script file so_x_head_100.py, and that the first 100 rows of a file huge.csv should be copied to a file 100.csv):
$ python2 ./so_x_head_100.py huge.csv 100.csv
Simple head(100)[huge.csv] -> 100.csv ...
And now 100.csv contains the first 100 lines of huge.csv.

If you want to do somewhat more selective fishing for particular rows, then the python csv module will allow you to read the csv file row by row into Python data structures. Consult the documentation.
This may be useful if just grabbing the first hundred lines reveals nothing about many of the columns because they are blank in all those rows. So you could easily write a program in Python to read as many rows as it takes to find and write out a few rows with non-blank data in particular columns. Likewise if you want to analyze a subset of the data matching particular criteria, you can read all the rows in and write only the interesting ones out for further analysis.
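For example, here is a hedged sketch of that kind of selective sampling with the csv module; the column name review_text and the 100-row cutoff are illustrative assumptions, not from the original post:

import csv

wanted = []
with open("huge.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)            # handles commas inside quoted review text
    fieldnames = reader.fieldnames
    for row in reader:
        # keep only rows where the (hypothetical) review_text column is non-blank
        if (row.get("review_text") or "").strip():
            wanted.append(row)
        if len(wanted) == 100:            # stop once we have enough sample rows
            break

with open("sample.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(wanted)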
An alternative to csv is pandas. It has a bigger learning curve, but it is probably the right tool for analyzing big data (1 GB is not very big these days).
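If pandas is available, a minimal sketch of the preview idea, assuming the file has a header row; nrows stops parsing after the first 100 data rows, so the 1 GB file is never fully loaded:

import pandas as pd

preview = pd.read_csv("huge.csv", nrows=100)   # parse only the first 100 data rows
print(preview.dtypes)                          # quick look at the inferred schema
preview.to_csv("preview.csv", index=False)     # a small file Excel can open safely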

Related

How to write a large .txt file to a csv for a Big Query dump?

I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a CSV in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't have 86 million rows' worth of memory on my device, and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python), but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I am totally unfamiliar with the toolkit, and it doesn't seem to have a CSV writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=',')
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems to be the maximum number of characters allowed in a single field, so I'm quite a bit over.
I've gotten a csv to generate from smaller files (using only 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above.
Update: I have raised the field size limit to sys.maxsize and am still receiving this same error.
I have no way to verify whether this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I weren't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump in the large pipe delimited .txt file to Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a bit of a red herring going on here:
Is there a way I can write a larger sized csv?
Yes. The reader and writer iterator style should be able to handle a file of any length; they step through it incrementally, and at no stage do they attempt to read the whole file into memory. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe it as being tab delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it's expecting, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters: in the data sample you posted, there are some commas embedded in the text, and this might result in a corrupted file when it comes to reading it in again on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters, etc.
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that a useful process? Maybe forget the whole CSV nature of the file, read lines iteratively, and write them out without modifying them at all; you could use readline and write for this, probably speeding things up in the process. Again, because this is iterative, you won't have to worry about loading the whole file into RAM, and can just stream from one source to your target. Beware of how long this might take, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
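Two hedged sketches of the ideas above, reusing the file names from the question. The first raises the csv module's per-field limit (presumably what the "sys.maxsize" update refers to); the second is the plain streaming copy that never parses fields at all:

import csv
import sys

# Raise the per-field limit; some platforms reject sys.maxsize outright,
# so back off until a value is accepted.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 10

# Plain streaming copy: iterate lazily, line by line, so RAM use stays flat.
with open("txt_test.txt", "r") as src, open("csv_test.csv", "w") as dst:
    for line in src:
        dst.write(line)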

Using pysftp to split text file in SFTP directory

I'm trying to split a text file of size 100 MB (having unique rows) into 10 files of equal size using Python's pysftp, but I'm unable to find a proper approach for this.
Please let me know how I can read/split files from the SFTP directory and place all the resulting files back into the SFTP directory itself.
with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir+source_filename) as file:
        for line in file:
            <....................Unable to decide logic------------------>
The logic you probably need is as follows:
As you are in a read-only environment, you will need to download the whole file into memory.
Use Python's io.StringIO() to handle the data in memory as if it is a file.
As you are talking about rows, I assume you mean the file is in CSV format? You can make use of Python's csv library to parse the file.
First, do a quick scan of the file using a csv.reader() and use it to count the number of rows in the file. That count can then be used to split the file into an equal number of rows per part, rather than splitting at fixed byte counts.
Once you know the number of rows, reopen the data (as a file again) and just read the header row in. This can then be added to the first row of each split file you create.
Now read n rows in (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This can then be used to upload using pysftp to a new file on the server, all without requiring access to an actual filing system.
The result will be that each file will also have a valid header row.
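A hedged sketch of the approach described above; the connection variables reuse the names from the question, while the part count, encoding, and output file naming are assumptions of mine:

import csv
import io
import pysftp

N_PARTS = 10

with pysftp.Connection(host=sftphostname, username=sftpusername,
                       port=sftpport, private_key=sftpkeypath) as sftp:
    # Pull the whole remote file into memory (read-only environment).
    with sftp.open(source_filedir + source_filename) as remote:
        text = remote.read().decode("utf-8")

    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    per_part = -(-len(data) // N_PARTS)          # ceiling division

    for i in range(N_PARTS):
        chunk = data[i * per_part:(i + 1) * per_part]
        if not chunk:
            break
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)                  # every part keeps a valid header row
        writer.writerows(chunk)
        out_name = "%spart_%02d_%s" % (source_filedir, i + 1, source_filename)
        # putfo streams a file-like object straight to the remote path.
        sftp.putfo(io.BytesIO(buf.getvalue().encode("utf-8")), out_name)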
I don't think FTP / SFTP allow for something more clever than simply downloading the file. Meaning, you'd have to get the whole file, split it locally, then put the new files back.
For text file splitting logic I believe that this thread may be of use: Split large files using python
There is a library called filesplit that you can use to split files.
It has functionality similar to the Linux commands split or csplit.
For your case,
split text file of size 100 MB into 10 files of equal size
you can use the method bysize:
import os
from filesplit.split import Split

infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir)  # construct the splitter

file_size = os.path.getsize(infile)
desired_parts = 10
bytes_per_split = -(-file_size // desired_parts)  # ceiling division, so there is no tiny leftover extra part
split.bysize(bytes_per_split)
For a line-partitioned split use bylinecount:
from filesplit.split import Split
split = Split(infile, outdir)
split.bylinecount(1_000_000) # for a million lines each file
See also:
Split Command in Linux: 9 Useful Examples
How do I check file size in Python?
Bonus
Since Python 3.6 you can use underscores in numeric literals (see PEP 515): million = 1_000_000, which improves readability.

How can I create a template .csv file that users can fill in before running my script?

I am trying to make a script that requires a user to input at least 12 different values in order to function. I thought this was somewhat impractical, so I decided to make a function that would generate a dict from a .csv file designed with two columns: variables and their respective values. The user could use a provided .csv file as a template, fill it in with all their necessary values, save it as their own .csv file, and then run it with the script.
Although this sounds simple in theory, I have found that it does not work quite so well in practice. Because some of the input values will be text with a lot of periods in them ("..."), they are sometimes converted into the Unicode character for a horizontal ellipsis (\xe2\x80\xa6). Also, a UTF-8 byte-order mark (which can be referenced in code as codecs.BOM_UTF8) appears at the beginning of the first column and row and must be removed. In other cases, the delimiter of the .csv file was changed so that tabs were recognized as separating cells, or the contents of each row were collapsed from two cells into one.
I have no experience with the different forms of encoding or what any of them entail, but from what I have tested, it seems that opening the .csv template file in Excel, or using different settings when opening the .csv file, causes such problems. It's also possible that copying and pasting the values from other places brings hidden characters with them. I have been trying to fix the problems, but new problems keep springing up, and I feel like it's possible that my current approach is just wrong.
Can anybody recommend a different, more efficient approach for allowing a user to enter multiple inputs in one go? Or should I stick to my original approach and figure out how to keep the .csv formatting as rigorous as possible?
You can always use the csv module to abstract away most of the CSV oddities (although you will have to enforce the basic format):
import csv
import sys


def main(argv):
    if len(argv) < 2:
        print("Please provide path to your CSV template as the first argument.")
        return 1
    with open(argv[1], "r") as f:
        reader = csv.DictReader(f)
        your_vars = next(reader)
        print(your_vars)  # prints a dictionary of all CSV vars
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
NOTE: This requires the first row to hold the variables, while the second holds their values.
So all users have to do is call the script with python your_script.py their_file.csv, and in most cases it will print out a dict with the values... However, Excel is notoriously bad at handling Unicode CSVs, and if your users use it as their primary spreadsheet app they're likely to encounter issues. Some of that can be rectified by installing the unicodecsv module and using it as a drop-in replacement (import unicodecsv as csv), but if your users start going wild with the format, eventually it will break.
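If Excel-introduced BOMs and shifted delimiters are the main pain points, here is a hedged sketch that tolerates both when reading the two-column template from the question; the sniffing fallback and the helper name are my assumptions, not part of the original answer:

import csv

def read_template(path):
    # utf-8-sig strips a leading UTF-8 BOM if Excel added one
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        sample = f.read(4096)
        f.seek(0)
        try:
            # guess whether the file was saved comma-, tab-, or semicolon-separated
            dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
        except csv.Error:
            dialect = csv.excel          # fall back to plain comma-separated
        rows = list(csv.reader(f, dialect))
    # two-column layout from the question: variable name, value
    return {row[0]: row[1] for row in rows if len(row) >= 2}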
If you're looking for suggestions on formats, one of the most user-friendly formats you can use is YAML and there are several parsers available for Python - they largely work the same for the simple stuff like this but I'd recommend using the ruamel.yaml module as it's actively maintained.
Then you can create a YAML template like:
---
var1: value1
var2: value2
var3: value3
etc: add as many as you want
And your users can fill in the values in a simple text editor, then to replicate the above CSV behavior all you need is:
import yaml
import sys


def main(argv):
    if len(argv) < 2:
        print("Please provide path to your YAML template as the first argument.")
        return 1
    with open(argv[1], "r") as f:
        your_vars = yaml.safe_load(f)  # safe_load avoids constructing arbitrary objects
        print(your_vars)  # prints a dictionary of all YAML vars
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
The bonus is that YAML is a plain-text format, so your users don't need fancy editors and therefore have less chance to screw things up. Of course, while YAML is permissive, it still requires a modicum of well-formedness, so be sure to include the usual checks (does the file exist, can it be opened, can it be parsed, etc.).
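A minimal sketch of those checks, assuming PyYAML-style exceptions (ruamel.yaml raises a similarly named YAMLError); the function name is illustrative:

import os
import sys
import yaml

def load_config(path):
    if not os.path.isfile(path):
        sys.exit("Config file %r does not exist." % path)
    try:
        with open(path, "r") as f:
            data = yaml.safe_load(f)
    except OSError as exc:
        sys.exit("Could not open %r: %s" % (path, exc))
    except yaml.YAMLError as exc:
        sys.exit("Could not parse %r: %s" % (path, exc))
    if not isinstance(data, dict):
        sys.exit("Expected a mapping of variable names to values.")
    return data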

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename = "/data examples/" + "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, with the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)

with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.

Python fast way to read several rows of csv text?

I wish to do the following as fast as possible with Python:
read rows i to j of a csv file
create the concatenation of all the strings in csv[row=(loop i to j)][column=3]
My first code was a loop (i to j) of the following:
with open('Train.csv', 'rt') as f:
    row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
    tags = row[3].decode('utf8')
    return tags
but my code above re-parses the file from the beginning for every row it extracts, and is slow.
How can I read all rows in one call and concatenate fast?
Edit for additional information:
the csv file size is 7 GB; I have only 4 GB of RAM, on Windows XP; but I don't need to read all the columns (only 1% of the 7 GB would be good, I think).
Since I know which data you are interested in, I can speak from experience:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
You can of course per row select anything you want, and store it as you like.
By using an iterator variable, you can decide which rows to collect:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want though
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
The good thing about this is also that it uses very little memory, since it reads the file line by line.
As mentioned, if you want the later rows, it still has to parse the data to get there (this is inevitable since there are newlines in the text, so you can't skip ahead to a certain row). Personally, I just roughly used Linux's split to break the file into chunks, and then edited them, making sure they start at an ID (and end with a tag).
Then I used:
train = pandas.io.parsers.read_csv(file, quotechar="\"")
To quickly read in the split files.
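As an alternative to splitting on disk, here is a hedged sketch of the same idea using pandas' chunked reader; the chunk size and the assumption that the tags live in the fourth column of a file with a header row are mine:

import pandas as pd

# Stream the CSV in fixed-size chunks so only one chunk is in memory at a time;
# usecols limits parsing to the single column we care about (position 3 = tags).
chunks = pd.read_csv("Train.csv", usecols=[3], chunksize=100_000)
tags = " ".join(chunk.iloc[:, 0].astype(str).str.cat(sep=" ") for chunk in chunks)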
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just
tags = " ".join(x.split("\t")[3]
for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.
If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea.
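A rough sketch of such an index, assuming one record per line (no embedded newlines); it records the byte offset of every 1000th line so a later read can seek() straight to it, and the lookup of line 5000 is purely illustrative:

index = {}                       # line number -> byte offset of that line's start
with open("Train.csv", "rb") as f:
    line_no = 0
    while True:
        offset = f.tell()        # position before reading the next line
        line = f.readline()
        if not line:
            break
        if line_no % 1000 == 0:
            index[line_no] = offset
        line_no += 1

# Later: jump near line 5000 without re-reading everything before it.
with open("Train.csv", "rb") as f:
    f.seek(index[5000])
    print(f.readline().decode("utf-8", errors="replace"))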
Your question does not contain enough information, probably because you don't see some existing complexity: most CSV files contain one record per line. In that case it's simple to skip the rows you're not interested in. But CSV records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records in order to skip lines. It's up to you to decide which optimization is acceptable in your use case.
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you just guessed which part is too slow?
If you want to do fast transformations of CSV data that fits in memory, I would propose using/learning Pandas. So it would probably be a good idea to split your code into two steps:
Reduce file to the required data.
Transform the remaining data.
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp' and then parsing the output with Python would be simple and quick.
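A hedged sketch of that mix, assuming a Unix-like system with sed on PATH; the row range and the Train.csv name are illustrative, and like any line-based slicing it only works when no record spans multiple lines:

import csv
import io
import subprocess

i, j = 1000, 2000                       # 1-based line range to extract
with open("Train.csv", "rb") as src:
    result = subprocess.run(["sed", "-n", "%d,%dp" % (i, j)],
                            stdin=src, capture_output=True, check=True)

# Parse only the extracted slice with the csv module.
rows = csv.reader(io.StringIO(result.stdout.decode("utf-8")))
tags = " ".join(row[3] for row in rows if len(row) > 3)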
