Efficient way to parse tweets from JSON-formatted files - python

I'm parsing tweet data that is in JSON format and compressed with gzip.
Here's my code:
### Preprocessing
## Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

## Variables:
# tweets: empty DataFrame used for merging
tweets = pd.DataFrame()
idx = 0
# The parser reads the input data and returns it as a pd.DataFrame.

### Directory reading:
## Reading the whole directory
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        # file tracking, memory checker:
        print(file, tweets.memory_usage())
        # ext is the file extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporary fix: wrap int values (id, retweet_count) in lists.
                            # print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            # idx is used as the DataFrame index
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        # date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be a sequence, like a Series.
                            # Temporary fix: wrap int values (id, retweet_count) in lists.
                            # print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

## STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()
My code can be divided into 3 parts: reading, processing (selecting columns), and storing.
What I'm interested in is making the parsing faster.
So here are my questions:
It's too slow. How could it be made much faster? By reading with the pandas JSON reader?
I guess that would be much faster than plain json.loads...
But my raw tweet data has multi-index values, so pandas read_json didn't work.
And overall, I'm not sure I implemented my code well.
Are there any problems, or is there a better way? I'm fairly new to programming,
so please teach me how to do this better.
p.s. The computer just turned off while the code was running. Why does this happen?
A memory problem?
Thanks for reading.
p.p.s. Here's a sample line of the raw data:
20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}
That's just one line. I have more than 200 GB of these files, compressed with gzip. I believe the number at the very start is the date. I hope that's clear.

First of all, my congratulations. You get better as a software engineer when you face real world challenges like this one.
Now, talking about your solution.
Every piece of software works in 3 phases:
1. Input data.
2. Process data.
3. Output data (response).
1. Input data
1.1. Boring stuff
The information should preferably be in one format. To achieve that we write parsers, APIs, wrappers, adapters. The idea behind all of them is to transform data into the same format. This helps avoid issues when working with different data sources: if one of them breaks, you fix only that one adapter and that's it; everything else, including your parser, still works.
1.2. Your case
Your data comes in the same scheme but in different file formats. You can either convert everything to one format (e.g. plain json or txt) before reading it, or extract the data-transforming code into a separate function or module and call it from both branches.
example:
def process_data(tweet_file):
    for line in tweet_file:
        # do your stuff
        ...

# .gz branch
with gzip.open(os.path.join(root, file), "rt") as tweet_file:
    process_data(tweet_file)

# plain-text branch
with open(os.path.join(root, file), "r") as tweet_file:
    process_data(tweet_file)
2. Process data
2.1 Boring stuff
Most likely this is the bottleneck. Here your goal is to transform data from the given format into the desired format and do some actions if required. Here you get all the exceptions, all the performance issues, all the business logic. This is where SE craft comes in handy: you create an architecture and you decide how many bugs to put in it.
2.2 Your case
The simplest way to deal with an issue is to know how to find it. If it's a performance issue, put in timestamps to track it down. With experience it gets easier to spot these things. In this case, pd.concat most likely causes the performance hit: with each call it copies all the data into a new instance, so you hold 2 objects in memory when you need only 1. Avoid calling concat in a loop; gather all the data into a list and then put it into the DataFrame once, as in the sketch below.
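A minimal sketch of that list-then-DataFrame pattern, assuming the same fields as in your code and that tweet_file is the already-opened file from your loop:

import json
import pandas as pd

rows = []  # plain Python list; appending to it is cheap

# tweet_file is assumed to be the open file handle from the loop in the question
for line in tweet_file:
    try:
        date, _, data = line.partition('|')
        tweet = json.loads(data)
        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
            rows.append({"id": tweet["user"]["id"],
                         "text": tweet["text"],
                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                         "date": int(date[:8])})
    except (KeyError, IndexError, TypeError, ValueError):
        # malformed line, missing place, or tweet without hashtags: skip it
        continue

# build the DataFrame once instead of copying the whole thing on every concat
tweets = pd.DataFrame(rows)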
For instance, I would not put all the data into a DataFrame from the start; you can gather it and put it into a csv file and then build a DataFrame from that, since pandas deals with csv files really well. Here is an example:
import json
import csv

source_file = '11April1.txt'
result_file = 'output.csv'

with open(source_file) as source, open(result_file, 'w', newline='') as result:
    writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
    writer.writeheader()
    # get the index together with the line
    for index, line in enumerate(source):
        # a handy way to get both pieces in one call
        date, data = line.split('|', 1)
        tweet = json.loads(data)
        if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
            continue
        item = {"id": tweet["user"]["id"],
                "text": tweet["text"],
                "hashtags": tweet["entities"]["hashtags"][0]["text"],
                "date": int(date[:8]),
                "idx": index}
        # either write it to the csv or collect it into a list
        # tweets.append(item)
        writer.writerow(item)

print("done")
3. Output data
3.1. Boring stuff
After your data is processed and in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs, etc. You decide what kind of output you need; that's why you created the software: to get what you want from a format you did not want to go through by yourself.
3.2 Your case
You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop; in that case your software's output becomes someone else's software input, sexy right? :D
Jokes aside, gather all the processed data from the csv files or arrays and put it into HDF5 in chunks. This is important, as you cannot load everything into RAM; RAM is called temporary memory for a reason, it is fast and very limited, so use it wisely. That is, in my opinion, why your PC turned off. Or there may be memory corruption due to the nature of some C libraries, which happens from time to time.
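A rough sketch of that chunked HDF5 write, assuming the intermediate rows were collected into output.csv as in the example above (the key name, chunk size and string widths here are arbitrary choices, not fixed requirements):

import pandas as pd

store = pd.HDFStore('D:/Twitter_project/mydata.h5')

# read the intermediate csv back in manageable chunks and append each one
# to the same table, so the full dataset never has to sit in RAM at once
for chunk in pd.read_csv('output.csv', chunksize=100_000):
    store.append('11April1', chunk, format='table',
                 # string columns need a fixed width so longer values in later
                 # chunks still fit; these sizes are guesses
                 min_itemsize={'text': 400, 'hashtags': 140})

store.close()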
Overall, try to experiment, and get back to Stack Overflow if anything comes up.

Related

How can I chunk through a CSV using Arrow?

What I am trying to do
I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the file while reading the CSV in a similar way to how Pandas read_csv with chunksize works.
For example this is how the chunking code would work in pandas:
chunks = pandas.read_csv(data, chunksize=100, iterator=True)
# Iterate through chunks
for chunk in chunks:
    do_stuff(chunk)
I want to port similar functionality to Arrow.
What I have tried to do
I noticed that Arrow has ReadOptions which include a block_size parameter, and I thought maybe I could use it like:
# Reading in-memory csv file
arrow_table = arrow_csv.read_csv(
    input_file=input_buffer,
    read_options=arrow_csv.ReadOptions(
        use_threads=True,
        block_size=4096
    )
)
# Iterate through batches
for batch in arrow_table.to_batches():
    do_stuff(batch)
As this (block_size) does not seem to return an iterator, I am under the impression that this will still make Arrow read the entire table in memory and thus recreate my problem.
Lastly, I am aware that I can first read the csv using Pandas and chunk through it then convert to Arrow tables. But I am trying to avoid using Pandas and only use Arrow.
I am happy to provide additional information if needed
The function you are looking for is pyarrow.csv.open_csv which returns a pyarrow.csv.CSVStreamingReader. The size of the batches will be controlled by the block_size option you noticed. For a complete example:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv

in_path = '/home/pace/dev/benchmarks-proj/benchmarks/data/nyctaxi_2010-01.csv.gz'
out_path = '/home/pace/dev/benchmarks-proj/benchmarks/data/temp/iterative.parquet'

convert_options = pyarrow.csv.ConvertOptions()
convert_options.column_types = {
    'rate_code': pa.utf8(),
    'store_and_fwd_flag': pa.utf8()
}

writer = None
with pyarrow.csv.open_csv(in_path, convert_options=convert_options) as reader:
    for next_chunk in reader:
        if next_chunk is None:
            break
        if writer is None:
            writer = pq.ParquetWriter(out_path, next_chunk.schema)
        next_table = pa.Table.from_batches([next_chunk])
        writer.write_table(next_table)
writer.close()
This example also highlights one of the challenges the streaming CSV reader introduces. It needs to return batches with consistent data types. However, when parsing CSV you typically need to infer the data type. In my example data the first few MB of the file have integral values for the rate_code column. Somewhere in the middle of the batch there is a non-integer value (* in this case) for that column. To work around this issue you can specify the types for columns up front as I am doing here.

Python-pandas "read_csv" is not reading the whole .TXT file

First of all, I have found several questions with the same title/topic here, and I have tried the solutions that were suggested, but none of them has worked for me.
Here is the issue:
I want to extract a sample of workers from a huge .txt file (> 50 GB).
I am using an HPC cluster for this purpose.
Every row in the data represents a worker with many pieces of info (column variables). The idea is to extract a subsample of workers based on the first two letters of the ID variable:
df = pd.read_csv('path-to-my-txt-file', encoding= 'ISO-8859-1', sep = '\t', low_memory=False, error_bad_lines=False, dtype=str)
df = df.rename(columns = {'Worker ID' : 'worker_id'})
# extract a subsample based on the first 2 letters of the worker id
new_df = df[df.worker_id.str.startswith('DK', na=False)]
new_df.to_csv('DK_worker.csv', index = False)
The problem is that the resulting .CSV file has only 10-15 % of the number of rows that should be there (I have another source of information on the approximate number of rows that I should expect).
I think the data has some encoding issues. I have tried something like 'utf-8', 'latin_1' .. nothing has changed.
Do you see anything wrong in this code that may cause this problem? Have I missed some argument?
I am not a Python expert :)
Many thanks in advance.
You can't load a 50 GB file into your computer's RAM; it would not be possible to store that much data. And I doubt the csv module can handle files of that size. What you need to do is open the file in small pieces, then process each piece.
def process_data(piece):
    # process the chunk ...
    ...

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('path-to-my-txt-file.csv') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
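If you would rather stay in pandas than read raw byte pieces, read_csv can also do the chunking for you via chunksize. Here is a hedged sketch that keeps the DK filter from the question (the separator, encoding and column name are assumed from your snippet):

import pandas as pd

reader = pd.read_csv('path-to-my-txt-file', encoding='ISO-8859-1', sep='\t',
                     dtype=str, chunksize=500_000)

header_written = False
for chunk in reader:
    chunk = chunk.rename(columns={'Worker ID': 'worker_id'})
    subset = chunk[chunk.worker_id.str.startswith('DK', na=False)]
    # append each filtered chunk to the output so the 50 GB file never sits in memory at once
    subset.to_csv('DK_worker.csv', mode='a', header=not header_written, index=False)
    header_written = True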

Processing data using pandas in a memory efficient manner using Python

I have to read multiple csv files and group them by "event_name". I also might have some duplicates, so I need to drop them. paths contains all the paths of the csv files, and my code is as follows:
data = []
for path in paths:
    csv_file = pd.read_csv(path)
    data.append(csv_file)

events = pd.concat(data)
events = events.drop_duplicates()
event_names = events.groupby('event_name')

ev2 = []
for name, group in event_names:
    a, b = group.shape
    ev2.append([name, a])
This code tells me how many unique event_name values there are and how many entries there are per event_name. It works wonderfully, except that the csv files are too large and I am having memory problems. Is there a way to do the same thing using less memory?
I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names, I don't need the DataFrame events any longer. However, I am still having those memory issues. My question more specifically is: can I read the csv files in a more memory-efficient way? or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all csv files at once, instead of doing chunk by chunk.
Just keep a hash value of each row to reduce the data size.
csv_file = pd.read_csv(path)
# compute the hash (gives a uint64 value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)
# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])
If you cannot risk a hash collision (which would be astronomically unlikely), just use another hash key and check whether the final counting results are identical. The way to change the hash key is as follows.
# compute hash using a different hash key
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow')
Reference: pandas official docs page
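A hedged sketch of the full flow built on that idea, with paths and the event_name column taken from the question; note it passes index=False so the hash depends only on the row values and identical rows always collide:

import pandas as pd

data = []
for path in paths:  # paths is the list of csv files from the question
    csv_file = pd.read_csv(path)
    # hash the row values (not the index) so identical rows get identical hashes
    csv_file["hash"] = pd.util.hash_pandas_object(csv_file, index=False)
    # keep only the two columns needed for deduplication and counting
    data.append(csv_file[["event_name", "hash"]])

events = pd.concat(data).drop_duplicates(subset="hash")
# number of entries per unique event_name
counts = events.groupby("event_name").size()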

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it causes either file corruption or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly... unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that this 30-second interval pull is only for quick testing; I don't plan to send that many requests for months on end. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os

    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)

        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()

        table = pd.DataFrame(columns=['Date and Time', 'Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)

        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country', 'Total Cases', 'New Cases (+)', 'Total Deaths', 'New Deaths (+)', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop']
        table['Date & Time'] = extractTime

        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i, 0] + '.xlsx'
        #     table.iloc[i:i+1, :].to_excel(fileName)

        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]

        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the output after some number of iterations (it's usually successful for around 30-50), a file has either kept only 2 rows and lost all the others, or it keeps appending while deleting a single entry from the row above, while the row two above loses 2 entries, and so on (essentially forming a triangle of sorts).
Above those rows there would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt but would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use the pandas read_html method, as it returns a clean and structured table.
Try the code below; you can modify it to run on your schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
   Unnamed: 0  Unnamed: 0.1
0           7             7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
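For what it's worth, Unnamed: 0 / Unnamed: 0.1 columns usually appear when a frame is written together with its index on one pass and then read back without index_col on the next, so each run adds another index column. A minimal sketch of a consistent round trip (the file name and values here are hypothetical):

import pandas as pd

df = pd.DataFrame({"Country": ["USA"], "Total Cases": [7]})

# write without the index so no unnamed column is created in the file...
df.to_csv("USA.csv", index=False)

# ...and read it back with no index_col, so the columns line up on every run
round_trip = pd.read_csv("USA.csv")
print(round_trip)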

Writing Plain Text and DataFrames to CSV files

I am working with a file format that requires 2-3 header or indicator lines as well as tabular-formatted data. At the moment, I am using the code below to accomplish the task.
sevTmp and sceTmp are two tab-delimited text files where I've temporarily written my data frames. To be clear, these data frames were written to the files using each of their respective to_csv() methods.
Code Examples
severitiesELD.to_csv(sevTmp, sep='\t',
                     columns=severitiesELD.columns, index=False, header=severitiesCols)
scenariosELD.to_csv(sceTmp, sep='\t',
                    columns=scenariosELD.columns, index=False, header=scenariosCols)
In the block below, I read them into memory and insert a couple of other lines (actual text and spacing) per the constructs of the file format. sevString / sceString represent two of the required lines aside from the data frame data.
Those lines:
sevString = '#Severities\n'
sceString = '#Scenarios\n'
I'd love to avoid writing out and reading back in the data frames, and instead write them (for the first time) in one shot along with my indicator strings. My solution below is working quite well, but I want to speed things up if possible. I would love suggestions on the best way to do so while I continue to research alternatives in parallel.
with open(sevTmp, 'r') as sevF:
    with open(sceTmp, 'r') as sceF:
        with open(finalFile, 'w') as finalF:
            # Write the severities line
            finalF.write(sevString)
            # Write out the actual ELD data, and loss cause names
            finalF.write(sevF.read())
            # Add a space
            finalF.write('\n')
            # Add in the scenarios string
            finalF.write(sceString)
            # Add in data frame again, this time to test writing out second time
            finalF.write(sceF.read())
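Since to_csv accepts an open file handle as well as a path, one way to skip the temporary files entirely would be to write the indicator lines and both frames straight into the final file. A hedged sketch reusing the names from your snippets (all of them are assumed to be defined as in the question):

# severitiesELD, scenariosELD, severitiesCols, scenariosCols,
# sevString, sceString and finalFile are assumed to exist as in the question
with open(finalFile, 'w', newline='') as finalF:
    # severities block: indicator line, then the frame written straight to the handle
    finalF.write(sevString)
    severitiesELD.to_csv(finalF, sep='\t', index=False, header=severitiesCols)

    # blank separator line, then the scenarios block
    finalF.write('\n')
    finalF.write(sceString)
    scenariosELD.to_csv(finalF, sep='\t', index=False, header=scenariosCols)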
