pandas - read file only to a certain limit - python

I have a file (in GBs) and want to read out only (let's say) 500MB of it. Is there a way I can do this?
PS: I thought of reading in the first few lines of the dataset, checking how much memory they use, and then working out the number of lines accordingly. I'm looking for a way that avoids this approach.

You can use a generator here to read lines from a file in a memory-efficient way; see Lazy Method for Reading Big File in Python?
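A minimal sketch of such a generator, assuming you just want raw chunks up to a fixed byte budget (the 1 MB chunk size and the 500 MB limit are placeholders you would adjust):

def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    """Yield the file one chunk_size-byte block at a time until EOF."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk

limit = 500 * 1024 * 1024  # stop after roughly 500 MB
read_so_far = 0
with open('your file name', 'rb') as f:
    for chunk in read_in_chunks(f):
        read_so_far += len(chunk)
        # ... process chunk here ...
        if read_so_far >= limit:
            break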
or
you can use f.read(size) to read a fixed amount of data from the start of the file; note that read() takes a number of characters (or bytes in binary mode), not a number of lines:
fname = 'your file name'
with open(fname) as f:
    size = 100  # number of characters to read, not lines
    content = f.read(size)
print(content)
If you want the first 100 lines instead, use list(itertools.islice(f, 100)) from the itertools module inside the with block.
or
by using pandas and its nrows parameter (number of rows to read):
import pandas as pd
myfile = pd.read_csv('your file name', nrows=1000)
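If you would rather stop at an approximate memory budget (the 500 MB from the question) than at a fixed row count, here is a sketch using read_csv with chunksize; the 100,000-row chunk size is an arbitrary choice:

import pandas as pd

limit_bytes = 500 * 1024 * 1024  # stop once roughly 500 MB is held in memory
chunks = []
used = 0
for chunk in pd.read_csv('your file name', chunksize=100_000):
    used += chunk.memory_usage(deep=True).sum()
    chunks.append(chunk)
    if used >= limit_bytes:
        break
df = pd.concat(chunks, ignore_index=True)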

Related

comparing large csv files

Which method is more efficient for comparing two large (8GB & 5GB) csv files? The output should contain every id that is not in file1.
The data is a single column with GUIDs.
Method 1:
df = pd.read_csv(file)
df1 = pd.read_csv(file1)
df = df.merge(df1, on=['id'], how="outer", indicator=True).query('_merge=="left_only"')
df['id'].to_csv(output_path, index=False)
Method 2:
with open(file1, 'r') as t1:
    ids_in_file1 = set(t1)

with open(file, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in ids_in_file1:
            outFile.write(line)
What do you mean by efficiency? Two major differences are as follows:
The first method, which pandas uses, needs to have all the data in memory. So you will need enough available memory to hold the data from the two csv files (note: 5 + 8 GB may not be enough either, depending on the types of data in the csv files).
The second method takes advantage of Python's generators and reads the file line by line, loading into memory one line at a time.
So if you have enough memory available to load all the data, it will certainly be faster to do the operations on the data in memory.
If you don't have enough memory available, the second method works but is definitely slower. A good compromise might be to read the file by chunks, loading into memory an amount of data that your hardware can handle.
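A sketch of that compromise, reusing the variable names from the question and assuming the GUID column is named id as in Method 1: load the smaller file's ids into a set once (this part still has to fit in memory), then stream the larger file in chunks:

import pandas as pd

ids_in_file1 = set(pd.read_csv(file1)['id'])

first_chunk = True
for chunk in pd.read_csv(file, chunksize=1_000_000):
    missing = chunk[~chunk['id'].isin(ids_in_file1)]
    missing['id'].to_csv(output_path, index=False,
                         mode='w' if first_chunk else 'a',
                         header=first_chunk)
    first_chunk = False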
Extras
To estimate the memory used by your dataframe you can read this nice post:
How to estimate how much memory a Pandas' DataFrame will need?
Here you can find answers explaining how to read a file by chunks, with or without pandas:
How do I read a large csv file with pandas?
Lazy Method for Reading Big File in Python?
If this is something you'll have to run multiple times, you can just wrap each method with start = time.time() at the beginning and execution_time = time.time() - start at the end to compare speed. To compare memory, you can check out the memory_profiler package.
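For example, a minimal timing wrapper around whichever method you are testing:

import time

start = time.time()
# ... run Method 1 or Method 2 here ...
execution_time = time.time() - start
print(f"took {execution_time:.1f} s")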

Pandas read_excel get only last row

I have an Excel file that is generated daily and can have 50k+ rows. Is there a way to read only the last row (which is the sum of the columns)?
Right now I am just reading the entire sheet and keeping only the last row, but it takes a huge amount of runtime.
my code:
df=pd.read_excel(filepath,header=1,usecols="O:AC")
df=df.tail(1)
Pandas is quite slow, especially with large in-memory data. You could think about a lazy loading method; for example, check out dask.
Otherwise, you can open the file directly and read the last line:
with open(filepath, "r") as file:
    last_line = file.readlines()[-1]
I don't think there is a way to decrease the runtime of reading an Excel file.
When you read an Excel file (or a single sheet of it), all of the data is loaded before anything else happens; even if you use pd.read_excel with skiprows, the unwanted rows are only dropped after everything has been loaded, so it can't decrease runtime.
If you really want to decrease the read time, you should save the file in another format, such as .csv or .txt.
Also, you generally can't read Microsoft Excel files as text files using methods like readlines or read. You should convert the files to another format first (a good choice is .csv, which can be read with the csv module) or use dedicated Python modules like pyexcel or openpyxl to read .xlsx files directly.
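A sketch of reading just the last row with openpyxl in read-only mode, reusing filepath from the question (the active-sheet assumption is mine; note that read-only mode still streams through the earlier rows, it just avoids building the whole sheet in memory at once):

from openpyxl import load_workbook

wb = load_workbook(filepath, read_only=True)
ws = wb.active  # or wb['SheetName'] if you know the sheet name

last_row = None
for row in ws.iter_rows(values_only=True):  # streams one row at a time
    last_row = row  # keep only the most recent row seen
wb.close()
print(last_row)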

Using pysftp to split text file in SFTP directory

I'm trying to split a text file of size 100 MB (having unique rows) into 10 files of equal size using Python and pysftp, but I'm unable to find a proper approach for this.
Please let me know how I can read/split the files from the SFTP directory and place all the resulting files back into the SFTP directory itself.
with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir + source_filename) as file:
        for line in file:
            <....................Unable to decide logic------------------>
The logic you probably need is as follows:
As you are in a read-only environment, you will need to download the whole file into memory.
Use Python's io.StringIO() to handle the data in memory as if it were a file.
As you are talking about rows, I assume the file is in CSV format? You can make use of Python's csv library to parse it.
First do a quick scan of the file using a csv.reader() and count the number of rows. This can then be used to split the file into an equal number of rows, rather than splitting it at set byte counts.
Once you know the number of rows, reopen the data (as a file again) and read just the header row. This can then be added as the first row of each split file you create.
Now read n rows in (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This can then be uploaded with pysftp to a new file on the server, all without requiring access to an actual file system.
The result will be that each split file also has a valid header row; a rough sketch of these steps follows below.
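A rough sketch of those steps, reusing the connection variables from the question (the part count, the UTF-8 encoding and the assumption that the first line is a header are mine; treat it as a starting point rather than a tested implementation):

import csv
import io

import pysftp

n_parts = 10

with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport,
                       private_key=sftpkeypath) as sftp:
    # 1. download the whole file into memory
    with sftp.open(source_filedir + source_filename) as remote_file:
        data = remote_file.read().decode('utf-8')

    # 2. parse it and separate the header from the data rows
    rows = list(csv.reader(io.StringIO(data)))
    header, body = rows[0], rows[1:]
    rows_per_part = -(-len(body) // n_parts)  # ceiling division

    # 3. write each slice back as its own file, each with the header row
    for i in range(n_parts):
        part = body[i * rows_per_part:(i + 1) * rows_per_part]
        if not part:
            break
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(part)
        payload = io.BytesIO(buf.getvalue().encode('utf-8'))
        sftp.putfo(payload, f"{source_filedir}{source_filename}.part{i}")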
I don't think FTP / SFTP allow for something more clever than simply downloading the file. Meaning, you'd have to get the whole file, split it locally, then put the new files back.
For text file splitting logic I believe that this thread may be of use: Split large files using python
There is a library called filesplit that you can use to split files.
It has similar functionality to the Linux commands split and csplit.
For your case,
split text file of size 100 MB into 10 files of equal size
you can use method bysize:
import os
from filesplit.split import Split
infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir) # construct the splitter
file_size = os.path.getsize(infile)
desired_parts = 10
bytes_per_split = -(-file_size // desired_parts)  # ceiling division, so the split yields exactly 10 parts
split.bysize(bytes_per_split)
For a line-partitioned split use bylinecount:
from filesplit.split import Split
split = Split(infile, outdir)
split.bylinecount(1_000_000) # for a million lines each file
See also:
Split Command in Linux: 9 Useful Examples
How do I check file size in Python?
Bonus
Since Python 3.6 you can use underscores in numeric literals (see PEP 515), e.g. million = 1_000_000, to improve readability.

Huge txt file with one column (text to columns in python)

I'm struggling with a task that could save plenty of time. I'm new to Python so please don't kill me :)
I've got a huge txt file with millions of records. I used to split them in MS Access with the "|" delimiter, filter the data down to about 400K records and then copy them to Excel.
So basically file looks like:
What I would like to have:
I'm using Spyder, so it would be great to see the data in the Variable Explorer so I can easily check it and (after additional filters) export it to Excel.
I use LibreOffice so I'm not 100% sure about Excel, but if you change the .txt extension to .csv and open the file with Excel, it should let you change the delimiter from a comma to '|' and import it directly. That works with LibreOffice Calc, anyway.
You have to split the file into lines, then split each line on the '|' character and map the data to a list of dicts.
with open('filename') as f:
    data = [{'id': line.split('|')[0], 'fname': line.split('|')[1]}
            for line in f.readlines()]
You will have to fill in the rest of the fields.
Doing this with pandas will be much easier
Note: I am assuming that each entry is on a new line.
import pandas as pd
data = pd.read_csv("data.txt", delimiter='|')
# Do something here or let it be if you want to just convert text file to excel file
data.to_excel("data.xlsx")
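If you also want to filter before exporting, as in the Access workflow described above, you can do it on the DataFrame before calling to_excel (the column name "status" and the condition below are made-up placeholders):

import pandas as pd

data = pd.read_csv("data.txt", delimiter='|')

# hypothetical filter: keep only rows whose "status" column equals "ACTIVE"
filtered = data[data["status"] == "ACTIVE"]

# the filtered DataFrame also shows up in Spyder's Variable Explorer
filtered.to_excel("data.xlsx", index=False)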

Python fast way to read several rows of csv text?

I wish to do the following as fast as possible with Python:
read rows i to j of a csv file
create the concatenation of all the strings in csv[row=(loop i to j)][column=3]
My first code was a loop (i to j) of the following:
with open('Train.csv', 'rt') as f:
    row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
    tags = (row[3].decode('utf8'))
    return tags
but my code above re-reads the csv from the start for each row and is slow.
How can I read all rows in one call and concatenate fast?
Edit for additional information:
the csv file size is 7GB; I have only 4GB of RAM, on windows XP; but I don't need to read all columns (only 1% of the 7GB would be good I think).
Since I know which data you are interested in, I can speak from experience:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
You can of course per row select anything you want, and store it as you like.
By using an iterator variable, you can decide which rows to collect:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want, though
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
Another good thing about this is that it uses very little memory, as it reads the file line by line.
As mentioned, if you want the later rows, it still has to parse the data to get there (this is inevitable since there are newlines in the text, so you can't skip straight to a certain row). Personally, I just roughly used Linux's split to break the file into chunks, and then edited them to make sure they start at an ID (and end with a tag).
Then I used:
train = pandas.io.parsers.read_csv(file, quotechar="\"")
To quickly read in the split files.
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows then probably just
tags = " ".join(x.split("\t")[3]
for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.
If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea.
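A minimal sketch of such an index, built once and then reused (this only works if records never span lines, a caveat discussed just below):

def build_line_index(path, every=1000):
    """Map 0-based line numbers (multiples of `every`) to byte offsets."""
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f):
            if lineno % every == 0:
                index[lineno] = offset
            offset += len(line)
    return index

def read_line(path, lineno, index, every=1000):
    """Seek to the nearest indexed line, then skip forward to `lineno`."""
    nearest = (lineno // every) * every
    with open(path, 'rb') as f:
        f.seek(index[nearest])
        for _ in range(lineno - nearest):
            f.readline()
        return f.readline()  # the requested line, as bytes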
Your question does not contain enough information, probably because you don't see some existing complexity: most CSV files contain one record per line. In that case it's simple to skip the rows you're not interested in. But in CSV, records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records in order to skip lines. It's up to you to decide which optimization is acceptable in your use case.
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you only guessed which part is slow?
If you want to do fast transformations of CSV data that fits into memory, I would propose to use/learn Pandas. So it would probably be a good idea to split your code into two steps (a sketch of the first step follows after the list):
Reduce the file to the required data.
Transform the remaining data.
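For the reduction step, a sketch using pandas (i, j and the assumption that column index 3 holds the tags come from the question; this assumes a header line and no multi-line records):

import pandas as pd

i, j = 1000, 2000  # placeholder row range

reduced = pd.read_csv('Train.csv',
                      header=0,
                      skiprows=range(1, i + 1),  # keep the header, skip data rows before i
                      nrows=j - i + 1,
                      usecols=[3])

tags = " ".join(reduced.iloc[:, 0].astype(str))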
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp' and then parsing the output with Python would be simple and quick.
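A sketch of driving sed from Python (i and j are placeholders, and as above this assumes one record per line):

import csv
import subprocess

i, j = 1000, 2000  # placeholder 1-based line range

# sed -n 'i,jp' prints only lines i..j of the file
result = subprocess.run(['sed', '-n', f'{i},{j}p', 'Train.csv'],
                        capture_output=True, text=True, check=True)

rows = csv.reader(result.stdout.splitlines())
tags = " ".join(row[3] for row in rows)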
