I'm trying to split a text file of about 100 MB (containing unique rows) into 10 files of equal size using Python and pysftp, but I'm unable to find the proper approach for this.
Please let me know how I can read and split the file from the SFTP directory and put all the resulting files back into the same SFTP directory.
with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir + source_filename) as file:
        for line in file:
            # <.................... Unable to decide logic ------------------>
The logic you probably need is as follows:
As you are in a read only environment, you will need to download the whole file into memory.
Use Python's io.StringIO() to handle the data in memory as if it is a file.
As you are talking about rows, I assume you mean the file is in CSV format? You can make use of Python's csv library to parse the file.
First, do a quick scan of the file using csv.reader() to count the number of rows. That count can then be used to split the file into an equal number of rows per part, rather than just splitting the file at set byte counts.
Once you know the number of rows, reopen the data (as a file again) and just read the header row in. This can then be added to the first row of each split file you create.
Now read n rows in (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This can then be uploaded with pysftp to a new file on the server, all without requiring access to an actual file system.
The result will be that each file will also have a valid header row.
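Putting those steps together, a minimal sketch might look like the following (the connection details and source_filedir / source_filename are the variables from your question; NUM_SPLITS, the output naming scheme and the use of putfo() are assumptions):
import csv
import io
import pysftp

NUM_SPLITS = 10  # number of output files (assumption based on the question)

with pysftp.Connection(host=sftphostname, username=sftpusername,
                       port=sftpport, private_key=sftpkeypath) as sftp:
    # Download the whole file into memory (read-only environment).
    with sftp.open(source_filedir + source_filename) as remote_file:
        data = remote_file.read().decode('utf-8')

    # First pass: count the data rows (excluding the header).
    total_rows = sum(1 for _ in csv.reader(io.StringIO(data))) - 1
    rows_per_split = -(-total_rows // NUM_SPLITS)  # ceiling division

    # Second pass: read the header once, then write n rows per split file.
    reader = csv.reader(io.StringIO(data))
    header = next(reader)
    for part in range(NUM_SPLITS):
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        for _ in range(rows_per_split):
            try:
                writer.writerow(next(reader))
            except StopIteration:
                break
        # Upload the in-memory split as a new file in the same directory.
        payload = io.BytesIO(out.getvalue().encode('utf-8'))
        sftp.putfo(payload, '{}part{}_{}'.format(source_filedir, part + 1, source_filename))
Wrapping the text in a BytesIO before the upload avoids any ambiguity about the transfer expecting bytes rather than text.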
I don't think FTP / SFTP allow for something more clever than simply downloading the file. Meaning, you'd have to get the whole file, split it locally, then put the new files back.
For text file splitting logic I believe that this thread may be of use: Split large files using python
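In that spirit, a rough sketch of the local splitting step might look like this (the paths and lines_per_file are placeholders; the download and re-upload via pysftp are omitted):
lines_per_file = 1_000_000
part, out = 0, None
with open('downloaded.txt') as src:
    for i, line in enumerate(src):
        if i % lines_per_file == 0:  # start a new part file every N lines
            if out:
                out.close()
            part += 1
            out = open('part_{:02d}.txt'.format(part), 'w')
        out.write(line)
if out:
    out.close()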
There is a library called filesplit that you can use to split files.
It offers functionality similar to the Linux commands split and csplit.
For your case
split text file of size 100 MB into 10 files of equal size
you can use method bysize:
import math
import os
from filesplit.split import Split

infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir)  # construct the splitter

file_size = os.path.getsize(infile)
desired_parts = 10
bytes_per_split = math.ceil(file_size / desired_parts)  # round up so the remainder does not spill into an 11th file
split.bysize(bytes_per_split)
For a line-partitioned split use bylinecount:
from filesplit.split import Split
split = Split(infile, outdir)
split.bylinecount(1_000_000)  # at most one million lines per output file
See also:
Split Command in Linux: 9 Useful Examples
How do I check file size in Python?
Bonus
Since Python 3.6 you can use underscores in numeric literals to improve readability (see PEP 515), e.g. million = 1_000_000.
Related
I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a CSV in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't have 86 million rows of memory on my device and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I'm totally unfamiliar with the toolkit and it doesn't seem to have a CSV writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w", newline="") as out_csv:  # newline="" avoids blank lines on Windows
        out_writer = csv.writer(out_csv, delimiter=',')
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems this is the maximum number of characters allowed in a single field, so I'm quite a bit off.
I've gotten a csv to generate from smaller files (using only 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above.
Update: I have raised the field size limit to sys.maxsize and am still receiving this same error.
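For reference, the limit was raised along the lines of the usual csv.field_size_limit idiom (a sketch, not the exact code used):
import csv
import sys

# Raise the csv module's per-field limit from the default of 131072 characters.
# On some platforms sys.maxsize overflows the underlying C long, in which case
# a large fixed value (e.g. 2**31 - 1) can be used instead.
csv.field_size_limit(sys.maxsize)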
I have no way to verify if this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I wasn't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump in the large pipe delimited .txt file to Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
Here is a short sample of the data:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a red herring going on here:
Is there a way I can write a larger sized csv?
Yes, the reader and writer iterator style should be able to handle any length of file; they step through incrementally, and at no stage do they attempt to read the whole file into memory. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe it as being tab delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it's expecting, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters: in the data sample you posted there are commas embedded in the text, which might result in a corrupted file when it comes to reading it back in on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters, etc.
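As a small standalone illustration (not from the original code; the field values are modelled on the sample above), the csv module only protects fields containing the output delimiter if the dialect quotes them:
import csv
import io

row = ['C00632562', 'NAME, NAME', '* EARMARKED CONTRIBUTION: SEE BELOW']

buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)
print(buf.getvalue())
# C00632562,"NAME, NAME",* EARMARKED CONTRIBUTION: SEE BELOW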
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that a useful process? Maybe forget the whole CSV nature of the file: read lines iteratively and write them back out without modifying them, using readline() and write(). Because that is iterative, you won't have to worry about loading the whole file into RAM; you just stream from one source to your target, probably speeding things up in the process. Beware how long it might take, though, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
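A rough sketch of that streaming idea (txt_path and csv_path are the placeholder paths from your code):
with open(txt_path, 'r') as in_text, open(csv_path, 'w') as out_csv:
    for line in in_text:
        # Swap the delimiter here only if you really need to, e.g.
        # line = line.replace('|', '\t'), keeping embedded delimiters in mind.
        out_csv.write(line)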
I hope you can help out a new learner of Python. I could not find my problem in other questions, but if so: apologies. What I basically want to do is this:
Read a large number of text files and search each for a number of string terms.
If any of the search terms is matched, store the corresponding file name in a list called "filelist", so that I can tell the good files from the bad files.
Export "filelist" to Excel or CSV.
Here is the code that I have so far:
# text files all contain only simple text, e.g. "6 Apples"
import os
import re

import pandas as pd

filelist = []
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()
        if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
            filelist.append(file)

listoffiles = pd.DataFrame(filelist)
writer = pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter')
listoffiles.to_excel(writer, sheet_name='welcome', index=False)
writer.save()
print(filelist)
Questions:
Surely there is a more elegant or time-efficient way? I need to do this for a large number of files :D
Related to the former, is there a way to solve the reading-in of files using pandas? Or is it less time-efficient? For me as a STATA user, having a dataframe feels a bit more like home....
I added the "Latin1" option, as some characters in the raw data create encoding conflicts. Is there a way to find out which characters are causing the problem? Can I get rid of this easily, e.g. by cutting off the first line beforehand (skiprows maybe)?
Just a couple of things to speed up the script:
1.) compile your regex beforehand, not every time in the loop (also use | to combine multiple strings into one regex!)
2.) read files line by line, not all at once!
3.) Use any() to terminate search when you get first positive
For example:
import re
import os

filelist = []
r = re.compile(r'APPLES|ORANGE|BANANA')  # you can add flags=re.I for case-insensitive search
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):  # read files line by line, not all content at once
            filelist.append(file)  # add to list

# convert list to pandas, etc...
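And to finish it off the same way as in your question (pandas plus to_excel; the 'filename' column label is just an assumption for readability):
import pandas as pd

listoffiles = pd.DataFrame(filelist, columns=['filename'])
listoffiles.to_excel('ListofFiles.xlsx', sheet_name='welcome', index=False)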
I have a file (in GBs) and want to read out only (let's say) 500MB of it. Is there a way I can do this?
PS: I thought of reading in the first few lines of the dataset, seeing how much memory they use, and then working out the number of lines accordingly. I'm looking for a way that avoids this approach.
You can use a generator here to read lines from a file in a memory-efficient way; you can refer to this: Lazy Method for Reading Big File in Python?
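A minimal sketch of that lazy approach (the chunk size, file name and process() are placeholders):
def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    # Yield pieces of the file lazily instead of loading it all at once.
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data

with open('your file name') as f:
    for piece in read_in_chunks(f):
        process(piece)  # process() stands in for whatever you do with the data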
or
you can use f.read(size) to read a fixed amount of data from the file; size is a number of characters (or bytes in binary mode), not lines. For example, to read roughly the first 500 MB:
fname = 'your file name'
with open(fname, 'rb') as f:  # binary mode, so size is counted in bytes
    content = f.read(500 * 1024 * 1024)  # first 500 MB
print(len(content))  # confirm how much was actually read
or
by using pandas nrows (number of rows)
import pandas as pd

myfile = pd.read_csv('your file name', nrows=1000)  # read only the first 1000 rows
I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50,000 rows in the CSV file) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created. The output file does not have newline characters in it. The CSV file will be written with csv.DictWriter() and (where data is the JSON input) has the form
rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])
I then loop through the range of the rows and columns to get the csv dictionary entries
csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json = 157) and the files that I want to handle are even larger.
An attempt to get the size of each line shows
myfile = open('file.json', 'r')
line = myfile.readline()
print(len(line))
shows that this reads the entire file as a full string. Thus, one small file will show a length of 67744, while a larger file will show 163815116.
An attempt to read the data directly from
data=json.load(infile)
gives the error that other questions have discussed for the large files
An attempt to use the
def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
    ...
    yield results
as shown in another answer, works with a 72 KB file (10 rows, 22 columns) but seems to either lock up or take an interminable amount of time for an intermediate-sized file of 157 MB (from du -m bigfile.json).
Note that a debug print shows that each chunk is 2048 in size as specified by the default input argument. It appears that it is trying to go through the entire 163815116 (shown from the len above) in 2048 chunks. If I change the chunk size to 32768, simple math shows that it would take 5,000 cycles through the loop to process the file.
A change to a chunk size of 524288 exits the loop approximately every 11 chunks but should still take approximately 312 chunks to process the entire file
If I can get it to stop at the end of each row item, I would be able to process that row and send it to the csv file based on the form shown below.
vi on the small file shows that it is of the form
{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }
I cannot use ijson as I must set this up for systems that I cannot import additional software for.
I wound up using a chunk size of 8388608 (0x800000 hex) in order to process the files. I then processed the lines that had been read in as part of the loop, keeping count of rows processed and rows discarded. At each process function, I added the number to the totals so that I could keep track of total records processed.
This appears to be the way that it needs to go.
Next time a question like this is asked, please emphasize that a large chunk size must be specified and not the 2048 as shown in the original answer.
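For reference, the generator I used looks roughly like this (a sketch of the buffered raw_decode pattern from the referenced answer, written as a method to match the calling loop below, with the large buffer size as the default):
from json import JSONDecoder

def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=8388608):
    # Read the file in large chunks and yield every complete JSON value
    # that raw_decode can pull out of the accumulated buffer.
    buffer = ''
    for chunk in iter(lambda: fileobj.read(buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data for a complete JSON value yet; read more.
                break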
The loop goes
first = True
for data in self.json_parse(inf):
    records = len(data['MainKey'])
    columns = len(data['MainKey'][0]['Fields'])
    if first:
        # Initialize output as DictWriter
        ofile, outf, fields = self.init_csv(csvname, data, records, columns)
        first = False
    reccount, errcount = self.parse_records(outf, data, fields, records)
Within the parsing routine
for rec in range(records):
    currec = data['MainKey'][rec]
    # If each column count can be different
    columns = len(currec['Fields'])
    retval, valrec = self.build_csv_row(currec, columns, fields)
To parse the columns use
for col in range(columns):
    dataname = currec['Fields'][col]['name']
    dataval = currec['Fields'][col]['Values']['value']
Thus the references now work and the processing is handled correctly. The large chunk apparently allows the processing to be fast enough to handle the data while being small enough not to overload the system.
I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (i.e. the first line in the text file). Also, is there a way to write a line in a specific location (i.e. change the first line in the file, write a couple of other lines and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
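For example, a tiny sketch of the fixed-length case (LINESIZE and the file name are made up for illustration):
LINESIZE = 32     # every line, including its trailing newline, is 32 bytes
line_number = 3   # zero-based index of the line to fetch

with open('records.dat', 'rb') as f:
    f.seek(LINESIZE * line_number)   # jump straight to that record
    record = f.read(LINESIZE)

print(record.decode().rstrip('\n'))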
Unless your text file is really big, you can always store each line in a list:
with open('textfile', 'r') as f:
    lines = [L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile', 'w') as f:
    f.write('\n'.join(lines))
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
#First, I get the length of the first line -- if you already know it, skip this block
f=open('test.dat','r')
l=f.readline()
linelen=len(l)-1
f.close()
#apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f=open('test.dat','r+')
f.seek(0)
f.write('a'*linelen+'\n') #'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
with open("gamedata.dat") as f:
s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.