Need to write scraped data into csv file (threading) - python

Here is my code:
from download1 import download
import threading, lxml.html

def getInfo(initial, ending):
    for Number in range(initial, ending):
        Fields = ['country', 'area', 'population', 'iso', 'capital', 'continent', 'tld', 'currency_code',
                  'currency_name', 'phone',
                  'postal_code_format', 'postal_code_regex', 'languages', 'neighbours']
        url = 'http://example.webscraping.com/places/default/view/%d' % Number
        html = download(url)
        tree = lxml.html.fromstring(html)
        results = []
        for field in Fields:
            x = tree.cssselect('table > tr#places_%s__row >td.w2p_fw' % field)[0].text_content()
            results.append(x)  # should I start writing here?

downloadthreads = []
for i in range(1, 252, 63):  # create 4 threads
    downloadThread = threading.Thread(target=getInfo, args=(i, i + 62))
    downloadthreads.append(downloadThread)
    downloadThread.start()

for threadobj in downloadthreads:
    threadobj.join()  # wait for each thread to finish

print "Done"
So results will hold the values of Fields for each page. I need to write the data to a CSV file with Fields as the top row (written only once) and the values from results as the following rows.
I am not sure I can open the file inside the function, because the threads would open it multiple times simultaneously.
Note: I know threading isn't desirable when crawling, but I am just testing.

I think you should consider using some kind of queue or a thread pool. Thread pools are really useful if you want to create many threads overall but only run a few (say, 4) at a time.
An example of the Queue technique can be found here.
Alternatively, you can label the output files with their thread IDs, for example "results_1.txt", "results_2.txt" and so on, and merge them after all the threads have finished.
You can also use the basic synchronization primitives such as Lock, Monitor, and so on, although I am not their biggest fan. An example of locking can be found here.
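Since each thread only ever appends whole rows, the simplest fix is to write the Fields header once before any thread starts and guard the actual writing with a lock. A minimal sketch of that lock-based approach (the file name places.csv, the csv_lock object and the write_row helper are illustrative names, not from the question):
import csv
import threading

csv_lock = threading.Lock()  # shared by all download threads
Fields = ['country', 'area', 'population', 'iso', 'capital', 'continent', 'tld', 'currency_code',
          'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours']

# Write the header exactly once, before any thread starts.
with open('places.csv', 'wb') as f:  # 'wb' because the question's code is Python 2
    csv.writer(f).writerow(Fields)

def write_row(results):
    """Append one scraped row; the lock keeps threads from interleaving writes."""
    with csv_lock:
        with open('places.csv', 'ab') as f:
            csv.writer(f).writerow(results)
Inside getInfo you would call write_row(results) right after the inner for loop finishes, i.e. at the spot marked "should I start writing here?". A Queue consumed by a single writer thread works just as well and avoids reopening the file for every row.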

Related

Asynchronous processing in spark pipeline

I have a local Linux server with 4 cores. I am running a PySpark job on it locally which basically reads two tables from a database and saves the data into 2 DataFrames. I then use these 2 DataFrames for some processing and save the resulting processed DataFrames into Elasticsearch. Below is the code:
import time
from pyspark.sql import SparkSession, SQLContext

def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no).option('es.resource', index_name).option('es.mapping', es_mappings).save()

def coreFun():
    start_time = int(time.time())
    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    merged_df = merged_python_file.merged_function(df_master, df_child, sqlContext)
    logic1_df = logic1_python_file.logic1_function(df_master, sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master, sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master, sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master, sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master, sqlContext)

    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)

    end_time = int(time.time())
    print(end_time - start_time)
    sc.stop()

if __name__ == "__main__":
    coreFun()
The different processing logics are written in separate Python files, e.g. logic1 in logic1_python_file and so on. I send df_master to the separate functions, they return the resulting processed DataFrames back to the driver, and I then save those into Elasticsearch.
It works fine, but the problem is that everything happens sequentially: first merged_df gets processed and, while it is being processed, the others simply wait even though they do not depend on the output of the merged function; then logic1 gets processed while the others wait, and so on. This is not an ideal design, given that the output of one logic does not depend on another.
I am sure asynchronous processing can help me here, but I am not sure how to implement it in my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I do not have the complete picture.
Please let me know how I can use asynchronous processing here. Any other input that could help improve the performance of the job is welcome.
If, during the processing of a single step such as merged_python_file.merged_function, only one CPU core is heavily utilized and the others are nearly idle, multiprocessing can speed things up. It can be achieved with Python's multiprocessing module. For more details, see the answers to How to do parallel programming in Python?
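As a generic illustration of that module (not tied to your Spark DataFrames, which cannot be shipped between processes; the cpu_heavy_step function and its inputs are hypothetical):
from multiprocessing import Pool

def cpu_heavy_step(chunk):
    # Placeholder for one independent, CPU-bound piece of work.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(10_000_000), range(10_000_000), range(10_000_000)]
    with Pool(processes=3) as pool:
        results = pool.map(cpu_heavy_step, chunks)  # each chunk runs in its own process
    print(results)
If the real bottleneck is that the six independent save_to_es calls run one after another, another option worth checking is submitting them from separate threads of the same driver, since Spark can schedule concurrently submitted jobs, although with the master set to local[1] there is only one core to share anyway.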

Is it possible to operate on multiple txt files at once in python?

I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method based on looping is proving too slow, since numpy.loadtxt only loads one file at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation, so 38 * 1080 files in total. Timing each numpy.loadtxt call with the time module gives me around 1.7 seconds per file, so the total time to run over all folders is over 21 hours, which seems far too long.
So this has me wondering whether there is a way to open multiple txt files at once and average their first columns in parallel, while still storing the averages in the order of the txt files, since the order is important.
Since I am a beginner, I am not sure this is even the fastest way. Thanks in advance.
import numpy as np
import glob
import os

i = 0
while i < 39:
    source_directory = "something/" + str(i)  # go to the specific numbered folder
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of 1080 txt files
    velocity_array = np.zeros((30, 36, 3))
    for j, data_file in enumerate(hw_array):  # j is the probe index within the folder
        x = 35 - int((j - 0.0001) / 30)  # position of the probe where velocities are measured
        y = (30 - int(round(j % 30))) % 30
        velocity_column = np.loadtxt(data_file, usecols=(0,))  # step that takes most time
        average_array = np.mean(velocity_column, axis=0)
        velocity_array[y, x, 0] = average_array
        velocity_array[y, x, 1] = y * (2 / 29)
        velocity_array[y, x, 2] = x * 0.5
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)  # save velocity array for use in later analysis
    i += 1
Python has pretty slow I/O, and most of your time is spent talking to the operating system and paying the other costs associated with opening files.
Threading in Python is strange and only provides an improvement in certain situations, and here is why it is good for your case. The way threading works in Python is that a thread only does work while it holds the GIL (the global interpreter lock; it is worth reading about). While a thread is waiting for something like I/O, it hands the GIL to another thread. This lets one thread operate on a file that is already loaded (averaging its first column) while another thread waits for its file to be opened.
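A minimal sketch of that threaded version, assuming a hypothetical first_column_average helper built around the numpy.loadtxt call from the question:
from concurrent.futures import ThreadPoolExecutor
import glob
import numpy as np

def first_column_average(path):
    # The underlying file reads release the GIL, so other threads keep
    # working while this one waits on the disk.
    return np.mean(np.loadtxt(path, usecols=(0,)))

files = sorted(glob.glob("something/0/data_*.txt"))
with ThreadPoolExecutor(max_workers=8) as pool:
    averages = list(pool.map(first_column_average, files))  # results come back in file order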
It's also completely possible to write a function that loads all the files from one directory, spawn off one process per directory with multiprocessing, and get it done in close to 1/39th of the time it was taking. Or don't parallelize by directory but by file: queue up the work and have the workers read from that work queue.
Some pseudocode:
import multiprocessing
import os

pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))
s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
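A possible body for that function, reusing the loadtxt call from the question (the exact signature and return value are illustrative):
import numpy as np

def counter_function_for_file(filename):
    """Load one data_*.txt file and return the mean of its first column."""
    first_column = np.loadtxt(filename, usecols=(0,))
    return np.mean(first_column)
Each worker returns a single float, so very little data travels back through the pool: only the file path goes in and the average comes out.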

BigQuery Storage. Python. Reading multiple streams in parallel issue (multiprocessing)

I am trying to use the BALANCED ShardingStrategy to get more than one stream, and Python's multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, the same number of rows and the same data are returned. If I understand correctly, no data is assigned to any stream before it starts reading and is finalized, so two parallel streams end up reading the same data, and part of the data is never read at all.
With the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation it is possible to read multiple streams in parallel with the BALANCED strategy. However, I cannot figure out how to read in parallel so that different data is assigned to each stream.
I have the following toy code:
import pandas as pd
from google.cloud import bigquery_storage_v1beta1
import os
import google.auth
from multiprocessing import Pool
import multiprocessing

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)

session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]), timeout=100000).to_arrow(session).to_pandas()
    return reader

if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows, [i for i in range(0, 2)])
    print(output)
I need assistance getting multiple streams read in parallel.
Presumably there is a way to assign data to a stream before reading starts. Any code examples, explanations, or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affects how data is allocated to the streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with the read_rows function below, I saw different data for the first row of both streams, so I was unable to reproduce your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]), timeout=100000)
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux Compute Engine instance.
I do worry, however, that the output you are asking for in the map call is going to be quite large.
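One way around that, sketched below by reusing the read_rows call from your snippet, is to have each worker write its stream to disk and return only the file path instead of shipping a whole DataFrame back through the Pool. This assumes the module-level bq_storage_client and session from your code (inherited by the forked workers on Linux) and pyarrow installed; the output file names are illustrative:
def read_stream_to_parquet(stream_position, session=session):
    # Pull one stream into a DataFrame, exactly as in the question's read_rows.
    frame = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000).to_arrow(session).to_pandas()
    path = "stream_{}.parquet".format(stream_position)  # illustrative output name
    frame.to_parquet(path)
    return path

if __name__ == '__main__':
    with Pool(2) as p:
        paths = p.map(read_stream_to_parquet, range(len(session.streams)))
    print(paths)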

Nested For Loops With Calculations Vs. Linear Process

I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract the paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
1. continue as is (compute the scores inside the URL for-loop itself), or
2. extract all of the text from the URLs first, and then iterate separately over the list / column of text?
Or does it not make any difference?
I am currently running the calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs, so it is taking an insane amount of time too!
# DFs are stored on a website
# I extract links to each .csv file and store them as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct text cleaning & score computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download all the data first and then apply the analysis to it. You would just be rearranging the order of a set of tasks that take effectively the same total time.
The only difference might be if the text corpora are rather large, in which case read/write time to disk starts to play a part, so it could be a little faster to run the analytics all in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter) or import multiprocessing if you're using a plain Python script. This is because of the way Python passes information between processes; don't worry though, the APIs for multiprocess and multiprocessing are identical.
A basic and easy way to speed up your analysis is to put your for loop in a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and run the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(*args):
    # Your full function, including fetching the data, goes here; it accepts an array of links.
    return something

df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll want to keep your resource monitor open to check that you aren't maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis, by roughly a factor of num_cpus * 2!
Extracting all of the texts and then processing them, versus extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might, however, be interested in using threads or asynchronous requests to fetch all of the data in parallel.
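A minimal sketch of the threaded-fetch idea, assuming a hypothetical fetch_article helper built around the requests/BeautifulSoup calls from the question (df is the DataFrame from the question's loop):
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch_article(url):
    """Download one article and return its joined paragraph text."""
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    return " ".join(p.get_text() for p in soup.findAll('p'))

urls = list(df['article_url'])
with ThreadPoolExecutor(max_workers=16) as pool:
    texts = list(pool.map(fetch_article, urls))  # order matches the order of urls
# The cleaning and sentiment scoring can then run over `texts` in a plain loop (or a process pool).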

Multicore processing of multiple files and writing to a shared output file

So, I have some 30 files, each 1 GB in size, and I am reading them sequentially on a Mac with 16 GB RAM and 4 CPU cores. The processing of each one is independent of the others, and it is taking almost 2 hours to complete. Each file holds the data for one day (a 24-hour time series), so there are 30 days of data. After processing, I append the output to a file day-wise (i.e. Day 1, Day 2, and so on).
Can I solve this problem using multiprocessing? Does it have any side effects (like thrashing)? It would also be great if someone could guide me on the pattern. I have read about multiprocessing, Pools, and imap, but it is still not clear to me how to write to the file sequentially (i.e. day-wise).
My approach (either one of the below):
Use imap to get the ordered output I am looking for.
OR
Write individual output files for each input file and then merge them by sorting.
Is there a better pattern to solve this problem? Do I need to use a queue here? Confused!
A basic demo of approach two:
from concurrent.futures import ProcessPoolExecutor

inputs = ['a.input.txt', ]
outputs = ['a.output.txt', ]

def process(input, output):
    """Process one input file and write its output file."""
    pass

def merge(files):
    """Merge all output files into the final day-wise file."""
    pass

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(process, inputs[i], outputs[i]) for i in range(len(inputs))]
        for future in futures:
            future.result()  # wait for every worker (and surface any exception) before merging
    merge(outputs)
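For approach one, a sketch of the imap pattern, which hands results back in submission order so the day-wise appending stays sequential (process_day, the input file names and combined_output.txt are illustrative):
from multiprocessing import Pool

def process_day(input_path):
    """Placeholder: process one day's 1 GB file and return its output."""
    return "processed " + input_path

if __name__ == '__main__':
    inputs = ['day01.txt', 'day02.txt', 'day03.txt']  # illustrative input files
    with Pool(processes=4) as pool, open('combined_output.txt', 'w') as out:
        # imap yields results in the order the inputs were submitted,
        # so Day 1 is written before Day 2 even if Day 2 finishes first.
        for day_output in pool.imap(process_day, inputs):
            out.write(day_output + "\n")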
