I have a 175 GB csv that I am trying to pull into MySQL.
The table is set up and formatted.
The problem is, the csv uses unorthodox delimiters and line separators (both are 3-character strings: #%# and #^#).
After a lot of trial and error I was able to get the process to start in HeidiSQL, but it would freeze up and never actually populate any data.
I would ideally like to use Python, but its csv parser only accepts 1-character line separators, making this tricky.
Does anyone have any tips on getting this to work?
MySQL's LOAD DATA statement will process a csv file with multi-character delimiters:
https://dev.mysql.com/doc/refman/5.7/en/load-data.html
I'd expect something like this:
LOAD DATA LOCAL INFILE '/dir/my_wonky.csv'
INTO TABLE my_table
FIELDS TERMINATED BY '#%#'
LINES TERMINATED BY '#^#'
( col1
, col2
, col3
)
I'd use a very small subset of the .csv file and do the load into a test table, just to get it working, make any necessary adjustments, and verify the results.
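A quick Python sketch for pulling such a subset out of the big file (the path, head size, and record count are placeholders):
RECORD_DELIM = '#^#'
N_RECORDS = 100

with open('/dir/my_wonky.csv', 'r', newline='') as src:
    head = src.read(10 * 1024 * 1024)  # 10 MB should contain well over N_RECORDS records

records = head.split(RECORD_DELIM)[:N_RECORDS]  # keep only the first complete records

with open('/dir/my_wonky_sample.csv', 'w', newline='') as dst:
    dst.write(RECORD_DELIM.join(records) + RECORD_DELIM)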
I would also want to break the load up into more manageable chunks, to avoid blowing out rollback space in the ibdata1 file. I would use something like pt-fifo-split (part of the Percona Toolkit) to break the file up into a series of separate loads, but unfortunately pt-fifo-split doesn't provide a way to specify the line delimiter character(s). To make use of it, we'd have to pre-process the file: replace any existing newline characters, and replace the line delimiter #^# with newline characters.
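A rough sketch of that pre-processing, streamed so the 175 GB file never has to fit in memory (the paths and chunk size are placeholders, and replacing embedded newlines with spaces is an assumption about the data):
LINE_DELIM = '#^#'
CHUNK_SIZE = 64 * 1024 * 1024  # read 64 MB at a time

with open('/dir/my_wonky.csv', 'r', newline='') as src, \
     open('/dir/my_wonky_lf.csv', 'w', newline='') as dst:
    tail = ''  # carry-over in case a #^# straddles two chunks
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            dst.write(tail)
            break
        chunk = chunk.replace('\n', ' ')      # neutralize any newlines embedded in the data
        buf = tail + chunk
        buf = buf.replace(LINE_DELIM, '\n')   # turn #^# into real line endings
        tail = buf[-(len(LINE_DELIM) - 1):]   # keep the last 2 chars; they may start a split delimiter
        dst.write(buf[:-(len(LINE_DELIM) - 1)])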
(If I had to load the whole file in a single shot, I'd load it into a MyISAM staging table rather than an InnoDB table. And I'd have a separate process copy rows, in reasonably sized chunks, from the MyISAM staging table into the InnoDB table.)
Related
I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a csv in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't have enough memory for 86 million rows on my device, and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I'm totally unfamiliar with the toolkit, and it doesn't seem to have a writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w", newline="") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=',')
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems to be the maximum row count in a single column, so I'm quite a bit over.
I've gotten a csv to generate from smaller inputs (using only 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above.
Update: I have raised the field size limit with sys.maxsize and am still receiving this same error.
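That is, roughly the usual workaround:
import csv
import sys

csv.field_size_limit(sys.maxsize)  # raise the per-field limit; the default is 131072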
I have no way to verify whether this works due to the sheer size of the dataset, but it seems like it should work. Reading it with Vaex would work if I weren't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump the large tab-delimited .txt file into Big Query in chunks, as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
Here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a red herring going on here:
Is there a way I can write a larger sized csv?
Yes: the reader and writer iterator style should be able to handle a file of any length. They step through it incrementally, and at no stage do they attempt to read the whole file into memory. Something else is going wrong in your example.
Is there a way to dump the large tab-delimited .txt file into Big Query in chunks, as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe the file as being tab-delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it expects, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters: in the data sample you posted, there are commas embedded in the text, which might result in a corrupted file when it comes to reading it back in on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters, etc.
Try replacing the | with \t and see if that helps.
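In the snippet above, that would be roughly:
in_reader = csv.reader(in_text, delimiter="\t", skipinitialspace=True)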
If you're only changing the delimiter from one thing to another, is that a useful step at all? Maybe forget the whole CSV nature of the file: read lines iteratively and write them out without modifying them (plain line iteration and write() calls will do), probably speeding things up in the process. Again, because this is iterative, you won't have to worry about loading the whole file into RAM; you just stream from one source to your target. Beware of how long it might take, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
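A minimal sketch of that streaming copy, reusing the paths from your snippet:
with open(txt_path, 'r') as src, open(csv_path, 'w') as dst:
    for line in src:       # lazy iteration: only one line is in memory at a time
        dst.write(line)    # copy the line unchanged (or tweak it here if you must)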
I'm trying to split a 100 MB text file (having unique rows) into 10 files of equal size using Python and pysftp, but I'm unable to find a proper approach for this.
Please let me know how I can read/split the file from the SFTP directory and place all of the resulting files back into that same directory.
import pysftp

with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir + source_filename) as file:
        for line in file:
            # <....................Unable to decide logic------------------>
The logic you probably need is as follows:
As you are in a read only environment, you will need to download the whole file into memory.
Use Python's io.StringIO() to handle the data in memory as if it is a file.
As you are talking about rows, I assume you mean the file is in CSV format? You can make use of Python's csv library to parse the file.
First do a quick scan of the file using a csv.reader(); use this to count the number of rows in the file. This can then be used to determine how to split the file into equal numbers of rows, rather than just splitting it at set byte counts.
Once you know the number of rows, reopen the data (as a file again) and just read the header row in. This can then be added to the first row of each split file you create.
Now read n rows in (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This can then be uploaded using pysftp to a new file on the server, all without requiring access to an actual file system.
The result will be that each file will also have a valid header row.
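A compressed sketch of those steps (connection details and paths come from the question; the output file names are made up):
import csv
import io

import pysftp

NUM_PARTS = 10

with pysftp.Connection(host=sftphostname, username=sftpusername,
                       port=sftpport, private_key=sftpkeypath) as sftp:
    # 1. Pull the whole file into memory (the environment is read-only).
    with sftp.open(source_filedir + source_filename) as remote_file:
        data = remote_file.read().decode('utf-8')

    # 2. Parse it from memory and separate the header from the data rows.
    rows = list(csv.reader(io.StringIO(data)))
    header, body = rows[0], rows[1:]
    rows_per_part = -(-len(body) // NUM_PARTS)  # ceiling division

    # 3. Write each chunk (header + n rows) to an in-memory buffer and upload it.
    for part in range(NUM_PARTS):
        chunk = body[part * rows_per_part:(part + 1) * rows_per_part]
        if not chunk:
            break
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(chunk)
        payload = io.BytesIO(buffer.getvalue().encode('utf-8'))
        sftp.putfo(payload, source_filedir + 'part_{:02d}.csv'.format(part + 1))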
I don't think FTP/SFTP allows for anything more clever than simply downloading the file. Meaning, you'd have to get the whole file, split it locally, then put the new files back.
For text file splitting logic I believe that this thread may be of use: Split large files using python
There is a library, filesplit, that you can use to split files.
It has functionality similar to the Linux commands split or csplit.
For your case
split text file of size 100 MB into 10 files of equal size
you can use method bysize:
import math
import os
from filesplit.split import Split

infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir)  # construct the splitter

file_size = os.path.getsize(infile)
desired_parts = 10
bytes_per_split = math.ceil(file_size / desired_parts)  # round up to a whole number of bytes so at most 10 parts are produced
split.bysize(bytes_per_split)
For a line-partitioned split use bylinecount:
from filesplit.split import Split
split = Split(infile, outdir)
split.bylinecount(1_000_000) # for a million lines each file
See also:
Split Command in Linux: 9 Useful Examples
How do I check file size in Python?
Bonus
Since Python 3.6 you can use underscores in numeric literals (see PEP 515), e.g. million = 1_000_000, to improve readability.
I want to analyse a temp file (it has a .txt extension) in real time. The temp file has the format:
6000 -64.367700E+0 19.035500E-3
8000 -64.367700E+0 18.989700E-3
However, after importing and printing it, it is not a matrix as I hoped, but actually has the format:
'6000\t-64.367700E+0\t19.035500E-3\n8000\t-64.367700E+0\t18.989700E-3'
I tried importing it line by line, but since it's in string format I couldn't get xreadlines() or readlines() to work. I can split the string and then separate the data into an appropriate list for analysis, but are there any suggestions for dealing only with new data? As the file gets larger, reprocessing all the data regularly will slow the code down, and I can't work out how to replicate an xreadlines() loop.
Thanks for any help
Have you tried to use this?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
You could specify the separator, which is \t.
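A minimal sketch, assuming the file is tab-separated with no header row (the file name and column names are made up):
import pandas as pd

df = pd.read_csv('temp_file.txt', sep='\t', header=None,
                 names=['step', 'value_1', 'value_2'])  # placeholder column names
print(df)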
I am running Python 3.4 on win 7 64 bits.
Basically I have a CSV file which looks like this:
#Begin
#Column1Title;Column2Title;Column3Title;Column4Title
value1;value2;value3;value4
#End
#Begin
#Column1Title;Column2Title;Column3Title;Column4Title;Column5Title;Column6Title
value1;value2;value3;value4;value5;value6
value1;value2;value3;value4;value5;value6
#End
....
A single CSV file contains several tables (with different numbers of columns) delimited by #Begin and #End tags. Each table has a header (column titles) and has nothing to do with the other tables in the file; the file has almost 14,000 lines.
I'd like to determine just the positions of the #Begin and #End tags in order to efficiently extract the data between them; I'd like to avoid reading the file line by line unless someone advises me otherwise.
I tried Pandas (installed version 0.15.2), but so far I have not been able to produce anything close to what I want with it.
Since the file is long, and the next step will be to parse multiple files like this at the same time, I am looking for the most efficient approach in terms of execution time.
Computational efficiency is likely less important here than storage access speed. Therefore the best performance speedup will come from reading the file only once (not finding each #Begin first and then iterating again to process the data).
Looping line by line is not in fact terribly inefficient, so an effective approach could be to check each line for a #Begin tag, then enter a data-processing loop for each table that exits when it finds an #End tag.
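A minimal single-pass sketch of that approach (the file path is a placeholder; the tag spelling follows the sample in the question):
tables = []       # each entry holds a header (list of column titles) and its data rows
current = None

with open('multi_table.csv', 'r') as f:
    for raw in f:
        line = raw.strip()
        if line == '#Begin':
            current = {'header': None, 'rows': []}
        elif line == '#End':
            tables.append(current)
            current = None
        elif current is not None:
            if current['header'] is None:
                current['header'] = line.lstrip('#').split(';')  # column-title line
            else:
                current['rows'].append(line.split(';'))

for t in tables:
    print(t['header'], len(t['rows']), 'rows')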
I am working on an assignment where we were provided a bunch of csv files to work on and extract information from. I have successfully completed that part. As a bonus question, we have one SQLite file with a .db extension. I wanted to know if any module exists to convert such files to .csv, or to read them directly.
In case such a method doesn't exist, I'll probably load the file into a database and use the Python sqlite3 module to extract the data I need.
You can use the sqlite3 command-line tool to dump table data to CSV.
To export an SQLite table (or part of a table) as CSV, simply set the "mode" to "csv" and then run a query to extract the desired rows of the table.
sqlite> .header on
sqlite> .mode csv
sqlite> .once c:/work/dataout.csv
sqlite> SELECT * FROM tab1;
In the example above, the ".header on" line causes column labels to be
printed as the first row of output. This means that the first row of
the resulting CSV file will contain column labels. If column labels
are not desired, set ".header off" instead. (The ".header off" setting
is the default and can be omitted if the headers have not been
previously turned on.)
The line ".once FILENAME" causes all query output to go into the named
file instead of being printed on the console. In the example above,
that line causes the CSV content to be written into a file named
"C:/work/dataout.csv".
http://www.sqlite.org/cli.html
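If you would rather stay in Python (the question mentions the sqlite3 module), the standard library can do the same export; the database path, table name, and output file here are placeholders:
import csv
import sqlite3

conn = sqlite3.connect('bonus.db')
cursor = conn.cursor()
cursor.execute('SELECT * FROM tab1')

with open('dataout.csv', 'w', newline='') as out_csv:
    writer = csv.writer(out_csv)
    writer.writerow([col[0] for col in cursor.description])  # column labels as the header row
    writer.writerows(cursor)                                 # stream the remaining rows

conn.close()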