Python-pandas "read_csv" is not reading the whole .TXT file - python

First of all, I have found several questions with the same title/topic here, and I have tried the solutions that were suggested, but none of them has worked for me.
Here is the issue:
I want to extract a sample of workers from a huge .txt file (> 50 GB).
I am using an HPC cluster for this purpose.
Every row in the data represents a worker with many attributes (column variables). The idea is to extract a subsample of workers based on the first two letters of the ID variable:
import pandas as pd

df = pd.read_csv('path-to-my-txt-file', encoding='ISO-8859-1', sep='\t',
                 low_memory=False, error_bad_lines=False, dtype=str)
df = df.rename(columns={'Worker ID': 'worker_id'})

# extract subsample based on first 2 letters in worker id
new_df = df[df.worker_id.str.startswith('DK', na=False)]
new_df.to_csv('DK_worker.csv', index=False)
The problem is that the resulting .CSV file has only 10-15% of the number of rows that should be there (I have another source of information on the approximate number of rows I should expect).
I think the data may have some encoding issues. I have tried encodings like 'utf-8' and 'latin_1', but nothing changed.
Do you see anything in this code that may cause the problem? Have I missed some argument?
I am not a Python expert :)
Many thanks in advance.

You can't load a 50 GB file into your computer's RAM; there simply isn't room to hold that much data at once, and I doubt the csv module can handle files of that size either. What you need to do is read the file in small pieces and then process each piece:
def process_data(piece):
    # process the chunk ...
    pass

def read_in_chunks(file_object, chunk_size=1024):
    # yield the file in fixed-size pieces until EOF
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('path-to-my-txt-file.csv') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
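For the original task (filtering rows whose worker ID starts with 'DK'), a chunked pandas read avoids holding the whole file in memory. A minimal sketch, assuming the same tab-separated file and a column literally named 'Worker ID' (adjust the path, encoding and column name to your data):

import pandas as pd

chunks = pd.read_csv('path-to-my-txt-file', encoding='ISO-8859-1', sep='\t',
                     dtype=str, chunksize=500_000)

first = True
for chunk in chunks:
    chunk = chunk.rename(columns={'Worker ID': 'worker_id'})
    subset = chunk[chunk.worker_id.str.startswith('DK', na=False)]
    # append each filtered chunk to the output file instead of keeping it in RAM
    subset.to_csv('DK_worker.csv', mode='w' if first else 'a',
                  header=first, index=False)
    first = False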

Related

Unsuccessful import of .txt files containing 11 info rows before the real data starts

I am trying to import many text files (probably around 100), as you can see in the picture. Basically, I tried to use read_csv to import them, but it did not work. The files have a somewhat complex structure and need to be split up properly. The real data (31 columns, with time in the first column) that I am going to use starts at the 12th row. However, I'd like to keep the first 11 rows as well so that, for example, I can match the measurement labels with each column later. Lastly, I will need to write a for loop to import all 100 txt files and read the 31 data columns plus the first 11 info rows from each.
DATA VIEW (screenshot of the data)
I tried read_csv with many different arguments, including skiprows, but it did not work out. Then I also tried the following code, which did not quite give me what I wanted.
one of the things I've tried is
with open('zzzAD1.TXT', 'r') as the_file:
    all_data = [line.split() for line in the_file.readlines()]

height_line = all_data[:11]
data = all_data[11:]
So, could anyone help me please?
If you're trying to get this into pandas, the only problem is that you need to convert the strings to floats, and you'll have to figure out what column headings to use, but you're basically on the right track here.
import pandas

with open('zzzAD1.TXT', 'r') as the_file:
    all_data = [line.split() for line in the_file.readlines()]

height_line = all_data[:11]                      # the 11 info rows
data = all_data[11:]
data = [[float(x) for x in row] for row in data]
df = pandas.DataFrame(data)
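To extend this to the ~100 files mentioned in the question, a loop over a glob pattern can collect one DataFrame per file. A minimal sketch, assuming the files share the same layout; the pattern 'zzz*.TXT' is hypothetical:

import glob
import pandas

frames = []
headers = {}
for path in glob.glob('zzz*.TXT'):            # hypothetical pattern; adjust to your files
    with open(path, 'r') as the_file:
        all_data = [line.split() for line in the_file.readlines()]
    headers[path] = all_data[:11]              # keep the 11 info rows per file
    rows = [[float(x) for x in row] for row in all_data[11:]]
    df = pandas.DataFrame(rows)
    df['source_file'] = path                   # remember which file each row came from
    frames.append(df)

combined = pandas.concat(frames, ignore_index=True)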

How to read a few lines in a large CSV file with pandas?

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.
I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file
In pandas, I am trying to use skiprows to select only the rows that I need.
# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)
rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)
I would expect this code to return a small dataframe pretty fast. However, what it does is start consuming memory over several minutes until the system becomes unresponsive. I'm guessing that it is reading the whole dataframe first and only dropping the rows in rows2skip afterwards.
Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?
Try this
train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
If you only want to read the first n rows:
train = pd.read_csv(..., nrows=n)
If you only want to read 100 rows starting after the first n data rows (note that a bare skiprows=n would also skip the header line):
train = pd.read_csv(..., skiprows=range(1, n + 1), nrows=100)
chunksize should help limit the memory usage. Alternatively, if you only need a few lines, a possible way is to first read the required lines outside of pandas and then feed read_csv only that subset. The code could be:

import io

# lines_to_keep: 0-based line numbers to load (include 0 if you want the header)
lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))
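As another option (a minimal sketch, assuming a pandas version recent enough for skiprows to accept a callable), you can let read_csv decide line by line which rows to skip, without building a huge generator of line numbers:

import pandas as pd

# hypothetical 0-based line numbers to keep; 0 is the header row
rows2keep = {0, 17, 4021, 999983}

signal = pd.read_csv(
    'train.csv',
    # the callable receives each line number and returns True when that line should be skipped
    skiprows=lambda i: i not in rows2keep,
)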

Opening a 20GB file for analysis with pandas

I am new to data science and data analytics, and I hope my question is not too naive. I am currently trying to open a file with pandas and Python for machine learning purposes, and it would be ideal for me to have it all in a DataFrame. The file is 18 GB and my RAM is 32 GB, but I keep getting memory errors.
From your experience, is it possible?
If not, do you know of a better way to go about this? (A Hive table? Increasing my RAM to 64 GB? Creating a database and accessing it from Python?)
Every input will be welcome!
Thanks in advance.
You should try to read and process one predefined chunk of data at a time
by using chunksize, as explained here:

for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=512):
    # process your chunk here
    ...
Can you work with the data in chunks? If so, you can use the iterator interface of pandas to go through the file.

df_iterator = pd.read_csv('test.csv', index_col=0, iterator=True, chunksize=5)

for df in df_iterator:
    print(df)
    # do something meaningful
    print('finished iteration on {} rows'.format(df.shape[0]))
    print()
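If the end goal really is a single in-memory DataFrame, another option (a sketch under the assumption that only a couple of columns are actually needed; the column names below are hypothetical) is to shrink each chunk before concatenating, for example by selecting columns and downcasting dtypes:

import pandas as pd

pieces = []
for chunk in pd.read_csv('test.csv', usecols=['col_a', 'col_b'], chunksize=1000000):
    # downcast numeric columns so the concatenated result is far smaller than the raw file
    chunk['col_a'] = pd.to_numeric(chunk['col_a'], downcast='float')
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)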

Efficient way to parse tweets from JSON-formatted files

I'm parsing tweet data which is in JSON format and compressed with gzip.
Here's my code:
### Preprocessing
## Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

## Variables:
# tweets: DataFrame for merging. empty
tweets = pd.DataFrame()
idx = 0
# The parser reads the input data and returns it in pd.DataFrame format

### Directory reading:
## Reading the whole directory
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        # file tracking, memory checker:
        print(file, tweets.memory_usage())
        # ext represents the extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporary fix: wrap int values (id, retweet count) in lists.
                            # print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            # idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        # date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporary fix: wrap int values (id, retweet count) in lists.
                            # print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

## STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()
My code can be split into 3 parts: reading, processing (selecting columns), and storing.
What I am interested in is parsing the files faster.
So here are my questions:
It's too slow. How could it be made much faster? By reading with the pandas JSON reader?
Well, I guess that would be much faster than plain json.loads...
But my raw tweet data has nested (multi-level) values,
so pandas read_json didn't work.
And overall, I'm not sure I implemented my code well.
Are there any problems, or a better way? I'm quite new to programming,
so please teach me how to do this better.
P.S. The computer turned off while the code was running. Why did this happen?
A memory problem?
Thanks for reading this.
P.P.S. Here is a sample line:
20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}
It's just one line; I have more than 200 GB of these, compressed as gzip files. I guess the number at the very beginning refers to the date. I'm not sure this is clear to you.
First of all, my congratulations. You get better as a software engineer when you face real-world challenges like this one.
Now, let's talk about your solution.
Every piece of software works in 3 phases:
Input data.
Process data.
Output data (response).
Input data
1.1. Boring stuff
The information should preferably be in one format. To achieve that we write parsers, APIs, wrappers, adapters. The idea behind all of them is to transform the data into one common format. This helps avoid issues when working with different data sources: if one of them breaks, you fix only that one adapter, and all the others, along with your parser, keep working.
1.2. Your case
You have data arriving with the same schema but in different file formats (gzipped and plain text). You can either convert everything to one format before reading it, or extract the transformation logic into a separate function or module and call it in both places.
example:
with gzip.open(os.path.join(root, file), "rt") as tweet_file:
    process_data(tweet_file)

with open(os.path.join(root, file), "r") as tweet_file:
    process_data(tweet_file)

def process_data(tweet_file):
    for line in tweet_file:
        # do your stuff
        ...
2. Process data
2.1. Boring stuff
Most likely this is the bottleneck. Here your goal is to transform data from the given format into the desired format and perform actions on it if required. This is where you hit all the exceptions, all the performance issues, all the business logic. This is where software engineering craft comes in handy: you create an architecture and decide how many bugs to put in it.
2.2. Your case
The simplest way to deal with an issue is knowing how to find it. If it is performance, add timestamps to track it; with experience it gets easier to spot the problems. In this case, pd.concat most likely causes the performance hit: with each call it copies all the data into a new object, so you have two copies in memory when you need only one. Try to avoid concat; gather all the data into a list and then build the DataFrame once, as in the sketch below.
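A minimal sketch of that list-then-DataFrame pattern (the records below are just illustrative; in the real code each dict would come from json.loads on a line):

import pandas as pd

# hypothetical parsed records standing in for the real per-line dicts
parsed = [
    {"id": 63499713, "text": "example tweet", "hashtags": "Honestly", "date": 20110331},
    {"id": 12345678, "text": "another tweet", "hashtags": "Python", "date": 20110331},
]

rows = []
for record in parsed:
    rows.append(record)          # appending to a list is cheap; no copying of earlier rows

tweets = pd.DataFrame(rows)      # build the DataFrame once at the end
print(tweets)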
For instance, I would not put all the data into a DataFrame right away; you can gather it into a CSV file and then build a DataFrame from that, since pandas deals with CSV files really well. Here is an example:
import csv
import json

source_file = '11April1.txt'
result_file = 'output.csv'

with open(source_file) as source:
    with open(result_file, 'w', newline='') as result:
        writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
        writer.writeheader()
        # get the index together with each line
        for index, line in enumerate(source):
            # split only on the first '|' so the date and the JSON payload come apart cleanly
            date, data = line.split('|', 1)
            tweet = json.loads(data)
            if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
                continue
            item = {"id": tweet["user"]["id"],
                    "text": tweet["text"],
                    "hashtags": tweet["entities"]["hashtags"][0]["text"],
                    "date": int(date[:8]),
                    "idx": index}
            # either write it to the csv or save it into a list
            # tweets.append(item)
            writer.writerow(item)

print("done")
3. Output data
3.1. Boring stuff
After your data is processed and in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs, etc. You decide what kind of output you need; that is why you created the software in the first place: to get what you want out of a format you did not want to wade through yourself.
3.2. Your case
You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop; in that case your software's output becomes someone else's software input, sexy right? :D
Jokes aside, gather all the processed data from the CSV or from lists and put it into the HDF5 file in chunks; this is important, as you cannot load everything into RAM. RAM is called temporary memory for a reason: it is fast and very limited, so use it wisely. In my opinion, that is why your PC shut down. (Or there may have been memory corruption due to the nature of some C libraries, which happens from time to time.) A sketch of the chunked HDF5 write follows below.
Overall, try to experiment and come back to Stack Overflow if anything comes up.
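A minimal sketch of that last step, assuming the intermediate output.csv produced above is used as input; the chunk size, the HDF5 key name and the min_itemsize string widths are illustrative guesses:

import pandas as pd

store = pd.HDFStore('mydata.h5')
for chunk in pd.read_csv('output.csv', chunksize=100000):
    # append each chunk to the same on-disk table so only one chunk lives in RAM at a time
    store.append('tweets_11April1', chunk, min_itemsize={'text': 280, 'hashtags': 100})
store.close()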

Pandas MemoryError when reading large CSV followed by `.iloc` slicing columns

I've been trying to process a 1.4 GB CSV file with pandas, but I keep having memory problems. I have tried different things in an attempt to make pandas read_csv work, to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it is to process the same amount of data.
(Simple per-chunk overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk. That seems to be the only way this can be explained.)
Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one, but I still get a MemoryError.
(I monitored Python's memory usage while running the code below and found that each time Python finishes processing a file and moves on to the next, the memory usage goes up. It seemed quite obvious that pandas didn't release the memory for the previous file even though it had finished processing it.)
The code may not make sense, but that's because I removed the part that writes into an SQL database, to simplify it and isolate the problem.
import csv
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
You may try to parse only those columns that you need (as @BrenBarn said in the comments):
import os
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())
PS: this will include only 4 out of at least 17 columns in your resulting data frame.
Thanks for the reply.
After some debugging, I located the problem. The iloc subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here
I have run into the same issue with CSV files. First split the CSV into chunks by fixing the chunksize: use the chunksize or iterator parameter to return the data in chunks.
Syntax:
csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
Then concatenate the chunks (only valid with the C parser), as in the sketch below.
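A minimal sketch of that chunked read followed by concatenation (the file path and delimiter are placeholders):

import pandas as pd

chunks = pd.read_csv('Crimes.csv', sep=',', chunksize=10000)   # placeholder path and delimiter
df = pd.concat(chunks, ignore_index=True)                      # stitch the chunks back together
print(df.shape)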
