I am trying to import a large number of text files (probably around 100), as you can see in the picture below. I tried to use read_csv to import them, but it did not work. The files are in a somewhat complex form, and I need to separate them properly. The real data that I am going to use (31 columns, with time in the first column) starts at the 12th row. However, I'd like to keep the first 11 rows as well, so that, for example, I can match the measurement labels with each column later. Lastly, I will need to write a for loop to import all 100 txt files and read the 31 columns and the first 11 info rows from each.
[Screenshot of the data view]
I tried read_csv with a lot of options, including skiprows, but it did not work out. One of the other things I've tried is the following code, which also did not give me exactly what I wanted:
with open('zzzAD1.TXT', 'r') as the_file:
    all_data = [line.split() for line in the_file.readlines()]

height_line = all_data[:11]
data = all_data[11:]
So, could anyone help me please?
If you're trying to get this into pandas, the only problem is that you need to convert the strings to floats, and you'll have to figure out what column headings to use, but you're basically on the right track here.
import pandas

with open('zzzAD1.TXT', 'r') as the_file:
    all_data = [line.split() for line in the_file.readlines()]

height_line = all_data[:11]                       # the 11 info rows
data = all_data[11:]                              # the numeric part
data = [[float(x) for x in row] for row in data]  # convert strings to floats
df = pandas.DataFrame(data)
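Since you mentioned you'll eventually need a for loop over roughly 100 of these files, here is a minimal sketch of how the same parsing could be wrapped in a loop. The directory path and the '*.TXT' pattern are assumptions (I only know the zzzAD1.TXT name from your post), so adjust them to your setup:

import glob
import os
import pandas

all_frames = {}
# hypothetical folder and pattern; change them to wherever your ~100 files live
for filename in glob.glob(os.path.join('path/to/your/files', '*.TXT')):
    with open(filename, 'r') as the_file:
        all_data = [line.split() for line in the_file.readlines()]

    header_lines = all_data[:11]   # keep the 11 info rows to match labels later
    rows = [[float(x) for x in row] for row in all_data[11:]]
    all_frames[filename] = (header_lines, pandas.DataFrame(rows))

Each dictionary entry keeps the raw header rows next to the numeric DataFrame, so you can later pull the measurement labels out of them to name the 31 columns.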
First of all, I have found several questions with the same title/topic here, and I have tried the solutions that were suggested, but none of them has worked for me.
Here is the issue:
I want to extract a sample of workers from a huge .txt file (> 50 GB)
I am using HPC cluster for this purpose.
Every row in the data represents a worker, with many variables (columns). The idea is to extract a subsample of workers based on the first two letters of the ID variable:
df = pd.read_csv('path-to-my-txt-file', encoding= 'ISO-8859-1', sep = '\t', low_memory=False, error_bad_lines=False, dtype=str)
df = df.rename(columns = {'Worker ID' : 'worker_id'})
# extract subsample based on the first 2 letters in the worker id
new_df = df[df.worker_id.str.startswith('DK', na=False)]
new_df.to_csv('DK_worker.csv', index = False)
The problem is that the resulting .CSV file has only 10-15 % of the number of rows that should be there (I have another source of information on the approximate number of rows that I should expect).
I think the data has some encoding issues. I have tried other encodings like 'utf-8' and 'latin_1', but nothing changed.
Do you see anything wrong in this code that may cause this problem? Have I missed some argument?
I am not a Python expert :)
Many thanks in advance.
You can't load a 50 GB file into your computer's RAM; it would not be possible to store that much data. And I doubt the csv module can handle files of that size. What you need to do is open the file in small pieces and process each piece.
def process_data(piece):
    # process the chunk ...
    pass

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('path-to-my-txt-file.csv') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
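If you would rather stay in pandas, read_csv can do the chunking for you via the chunksize argument. This is only a sketch: the column name, separator and encoding are copied from your question, and the chunk size is an arbitrary assumption.

import pandas as pd

reader = pd.read_csv('path-to-my-txt-file', encoding='ISO-8859-1', sep='\t',
                     dtype=str, chunksize=1000000)
first_chunk = True
for chunk in reader:
    chunk = chunk.rename(columns={'Worker ID': 'worker_id'})
    subset = chunk[chunk.worker_id.str.startswith('DK', na=False)]
    # append each filtered chunk to the output instead of holding everything in RAM
    subset.to_csv('DK_worker.csv', mode='w' if first_chunk else 'a',
                  header=first_chunk, index=False)
    first_chunk = False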
I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.
I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file
In pandas, I am trying to use skiprows to select only the rows that I need.
# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)
rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)
I would expect this code to return a small dataframe pretty fast. However, what it does is start consuming memory over several minutes until the system becomes unresponsive. I'm guessing that it reads the whole dataframe first and only gets rid of the rows in rows2skip afterwards.
Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?
Try this:
train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
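To actually build the small dataframe from that iterator, one possible pattern (a sketch, assuming rows2keep holds the 0-based row numbers from your question) is to filter each chunk and concatenate the survivors:

import pandas as pd

pieces = []
for chunk in pd.read_csv('train.csv', iterator=True, chunksize=150000):
    # with the default index, row numbering continues across chunks,
    # so the index can be matched against the wanted row numbers directly
    pieces.append(chunk[chunk.index.isin(rows2keep)])
signal = pd.concat(pieces)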
If you only want to read the first n rows:
train = pd.read_csv(..., nrows=n)
If you only want to read 100 rows starting at row n:
train = pd.read_csv(..., skiprows=n, nrows=100)
chunksize should help in limiting the memory usage. (The skiprows generator in the question is expensive because it enumerates roughly 600M row numbers to skip, and the whole file still has to be parsed, so nothing about it is cheap.) Alternatively, if you only need a small number of lines, a possible way is to first read the required lines outside of pandas and then feed read_csv with only that subset. The code could be:
import io
import pandas as pd

lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))
I'm parsing tweet data that is in JSON format and compressed with gzip.
Here's my code:
###Preprocessing
##Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

##Variables:
#tweets: DataFrame for merging. empty
tweets = pd.DataFrame()
idx = 0

#Parser provides parsing the input data and return as pd.DataFrame format

###Directory reading:
##Reading whole directory from
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        #file tracking, #Memory Checker:
        print(file, tweets.memory_usage())
        # ext represent the extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            #idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweets_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        #date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

##STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()
My code can be divided into three parts: reading, processing to select columns, and storing.
What I'm interested in is parsing the files faster.
So here are my questions:
It's too slow. How could I make it much faster? By reading with the pandas JSON reader?
I guess that would be much faster than plain json.loads...
But my raw tweet data has nested (multi-index) values, so pandas read_json didn't work.
Overall, I'm not sure I implemented my code well.
Are there any problems, or is there a better way? I'm fairly new to programming, so please teach me how to do it better.
P.S. The computer just turned off while the code was running. Why does this happen? A memory problem?
Thanks for reading this.
P.P.S. Here is what one raw line of the data looks like:
20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}
That's just one line; I have more than 200 GB of files compressed with gzip. I guess the number at the very beginning refers to the date. I'm not sure whether that's clear.
First of all, my congratulations. You get better as a software engineer when you face real world challenges like this one.
Now, talking about your solution.
Every piece of software works in three phases:
Input data.
Process data.
Output data. (response)
Input data
1.1. boring stuff
The information should preferably be in one format. To achieve that we write parsers, APIs, wrappers, adapters. The idea behind all of that is to transform the data into the same format. This helps avoid issues when working with different data sources: if one of them breaks, you fix only that one adapter, and everything else, including your parser, still works.
1.2. your case
You have data coming in the same schema but in different file formats. You can either convert everything to one format before reading it, or extract the code that transforms the data into a separate function or module and reuse/call it in both places.
example:
def process_data(tweet_file):
    for line in tweet_file:
        ...  # do your stuff

with gzip.open(os.path.join(root, file), "rt") as tweet_file:
    process_data(tweet_file)

with open(os.path.join(root, file), "r") as tweet_file:
    process_data(tweet_file)
2. Process data
2.1 boring stuff
Most likely this is the bottleneck. Here your goal is to transform data from the given format into the desired format and perform some actions if required. This is where you get all the exceptions, all the performance issues, all the business logic. This is where the software-engineering craft comes in handy: you create an architecture and you decide how many bugs to put in it.
2.2 your case
The simplest way to deal with an issue is to know how to find it. If it is performance, put in timestamps to track it. With experience it gets easier to spot the issues. In this case, pd.concat most likely causes the performance hit: with each call it copies all the data into a new instance, so you have two objects in memory when you only need one. Try to avoid concat in a loop; gather all the data into a list and then build the DataFrame once, as in the sketch below.
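A minimal sketch of that list-accumulation approach, reusing the keys from your code (it assumes tweet_file is already open and json and pandas are already imported, as in your script):

records = []                        # plain Python list, cheap to append to
for line in tweet_file:
    date, _, data = line.partition('|')
    tweet = json.loads(data)
    if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
        records.append({"id": tweet["user"]["id"],
                        "text": tweet["text"],
                        "hashtags": tweet["entities"]["hashtags"][0]["text"],
                        "date": int(date[:8])})

tweets = pd.DataFrame(records)      # one DataFrame construction at the end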
For instance, I would not put all the data into the DataFrame at the start; you can gather it into a csv file and then build a DataFrame from it, since pandas deals with csv files really well. Here is an example:
import csv
import json

source_file = '11April1.txt'
result_file = 'output.csv'

with open(source_file) as source, open(result_file, 'w', newline='') as result:
    writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
    writer.writeheader()
    # get the index together with each line
    for index, line in enumerate(source):
        # a handy way to get both parts in one call; split only on the first '|'
        date, data = line.split('|', 1)
        tweet = json.loads(data)
        if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
            continue
        item = {"id": tweet["user"]["id"],
                "text": tweet["text"],
                "hashtags": tweet["entities"]["hashtags"][0]["text"],
                "date": int(date[:8]),   # plain int so the csv cell is a number, not a list
                "idx": index}
        # either write it to the csv or save it into an array
        # tweets.append(item)
        writer.writerow(item)

print("done")
3. Output data.
3.1. boring stuff
After your data is processed and in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs, etc. You decide what kind of output you need; that's why you created the software in the first place: to get what you want out of a format you did not want to wade through by yourself.
3.2 your case
You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop; in that case your software's output becomes someone else's software input. Sexy, right? :D
Jokes aside, gather all the processed data from the csv or the arrays and put it into HDF5 in chunks. This is important because you cannot load everything into RAM: RAM is called temporary memory for a reason, it is fast and very limited, so use it wisely. That is, in my opinion, the reason your PC turned off. Or there may be memory corruption due to the nature of some C libraries, which happens from time to time. A chunked write could look like the sketch below.
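A minimal sketch of that chunked write, assuming the output.csv produced above and an arbitrary chunk size; min_itemsize is there only because appending string columns to an HDF5 table needs a fixed maximum width:

import pandas as pd

store = pd.HDFStore('D:/Twitter_project/mydata.h5')
for chunk in pd.read_csv('output.csv', chunksize=100000):
    # 'table' format is appendable; min_itemsize reserves room for longer strings
    store.append('11April1', chunk, format='table',
                 min_itemsize={'text': 300, 'hashtags': 140})
store.close()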
Overall, try to experiment and come back to Stack Overflow if anything comes up.
I am trying to manipulate some data with Python, but I'm having quite a bit of difficulty (given that I'm still a rookie). I have taken some code from other questions/sites, but I still can't quite get what I want.
Basically what I need is to take a set of data files and select the data from 1 particular row of each of those files, then put it into a new file so I can plot it.
So, to get the data into Python in the first place I'm trying to use:
import glob
import os
import numpy

data = []
path = 'C:/path/to/file'
for files in glob.glob(os.path.join(path, '*.*')):
    data.append(list(numpy.loadtxt(files, skiprows=34)))  # first 34 rows aren't used
This has worked great for me once before, but for some reason it won't work now. Any possible reasons why that might be the case?
Anyway, carrying on, this should give me a 2D list containing all the data.
Next I want to select a certain row from each data set, and can do so using:
x = list(xrange(30)) #since there are 30 files
Then:
rowdata = list(data[i][some particular row] for i in x)
Which gives me a list containing the value for that particular row from each imported file. This part seems to work quite nicely.
Lastly, I want to write this to a file. I have been trying:
f = open('path/to/file', 'w')
for item in rowdata:
    f.write(item)
f.close()
But I keep getting an error. Is there another approach I should take here?
You are already using numpy to load the text, so you can use it to manipulate the data as well.
import glob
import os
import numpy as np

path = 'C:/path/to/file'
mydata = np.array([np.loadtxt(f) for f in glob.glob(os.path.join(path, '*.*'))])
This will load all your data into one 3d array:
mydata.ndim
#3
where the first dimension (axis) runs over the files, the second over rows, the third over columns:
mydata.shape
#(number of files, number of rows in each file, number of columns in each file)
So, you can access the first file by
mydata[0,...] # equivalent to: mydata[0,:,:]
or specific parts of all files:
mydata[0,34,:]  # the 35th row of the first file
mydata[:,34,:]  # the 35th row in all files
mydata[:,34,1]  # the second value in the 35th row of all files
To write to file:
Say you want to write a new file with just the 35th row from all files:
np.savetxt(os.path.join(path, 'outfile.txt'), mydata[:,34,:])
If you just have to read from a file and write to a file, you can use open().
For a better solution, you can use linecache, as sketched below.
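A minimal sketch of the linecache idea, assuming you know the 1-based line numbers you want; the file names and numbers below are placeholders only:

import linecache

wanted = [35, 70, 105]    # hypothetical 1-based line numbers
rows = [linecache.getline('C:/path/to/file/data1.txt', n) for n in wanted]

with open('C:/path/to/file/rowdata.txt', 'w') as f:
    f.writelines(rows)    # each getline() result already ends with a newline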