Request status update Twitter stream data - python

I retrieved Twitter data via the streaming API in Python, but I am also interested in how the public metrics evolve over time. As a result, I would like to request the metrics on a daily basis.
Unfortunately, the status update API can only handle 100 IDs per request. I have a list of all IDs; how can I automatically split the string of IDs so that all of them are requested, always in batches of 100?
Thanks a lot in advance!

Keep the IDs as a list instead of a single string.
Then you can use range(0, len(...), SIZE) with a slice [n:n+SIZE], like this:
# example data
all_ids = list(range(500))
SIZE = 100
#SIZE = 10 # test on smaller size
for n in range(0, len(all_ids), SIZE):
    print(all_ids[n:n+SIZE])
You can also use yield to write a small generator function for this:
def split(data, size):
    for n in range(0, len(data), size):
        yield data[n:n+size]
# example data
all_ids = list(range(500))
SIZE = 100
#SIZE = 10 # test on smaller size
for part in split(all_ids, SIZE):
    print(part)
Alternatively, you can take the first part with [:SIZE] and then keep only the rest with [SIZE:], but this uses up the list, so do it on a copy:
# example data
all_ids = list(range(500))
SIZE = 100
#SIZE = 10 # test on smaller size
all_ids_copy = all_ids.copy()
while all_ids_copy:
    print(all_ids_copy[:SIZE])
    all_ids_copy = all_ids_copy[SIZE:]
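You can also use an external module for this, for example toolz. Note that partition() drops a final incomplete batch, so partition_all() is the safer choice when the number of IDs is not a multiple of SIZE.
from toolz import partition_all
# example data
all_ids = list(range(500))
SIZE = 100
#SIZE = 10 # test on smaller size
for part in partition_all(SIZE, all_ids):
    print(part)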
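If you would rather not install an extra package, a similar helper can be sketched with itertools.islice from the standard library. The name batched below is just an illustrative choice and the IDs are made up; unlike slicing, this also works for iterables that have no len().

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists with at most `size` items from any iterable (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# made-up example data: 250 IDs, so the last batch is shorter
all_ids = list(range(250))
parts = list(batched(all_ids, 100))
print([len(part) for part in parts])
```

The last batch is kept even when it is incomplete, which is what you want when every ID has to be requested.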
If you have a list of strings, you can convert each part back to a single string using join():
print( ",".join(part) )
For a list of integers, you need to convert the integers to strings first:
print( ",".join(str(x) for x in part) )
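Putting the pieces together, here is a minimal sketch that goes from one comma-separated string of IDs to request-sized strings. The ID values are invented, and the actual API call is left out because it depends on the client library you use.

```python
def split(data, size):
    """Yield consecutive slices of `data` with at most `size` items each."""
    for n in range(0, len(data), size):
        yield data[n:n + size]


# made-up example: 250 numeric IDs stored as one comma-separated string
ids_string = ",".join(str(1000 + n) for n in range(250))

all_ids = ids_string.split(",")  # back to a list of ID strings
batches = [",".join(part) for part in split(all_ids, 100)]

for batch in batches:
    # each `batch` is one string holding at most 100 IDs, ready for one request
    print(batch.count(",") + 1, "ids in this batch")
```

Each element of batches can then be passed as the ID parameter of one request.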

Related

Repeated values received from a For Loop on Python

I am running a for loop in Python and it produces the same value multiple times. I have tried everything but I can't find where my mistake is.
I am trying to divide text into chunks of length 100 with the following code.
clean_file_body_string is my text.
For context, my file has close to 500k characters.
I'm noticing the repeated values in the output of print(meta) and also in my file.
import math
from tqdm.auto import tqdm # this is our progress bar
batch_size = math.ceil(len(clean_file_body_string)/100)
for i in tqdm(range(0, len(clean_file_body_string), 100)):
    # set end position of each batch to take only what is needed
    i_end = min(i+batch_size, len(clean_file_body_string))
    # get batch of lines and IDs
    # the next line takes the text and puts it into chunks
    lines_batch = [clean_file_body_string[i:i+100] for i in range(0, len(clean_file_body_string), 100)]
    ids_batch = [str(n) for n in range(i, i_end)]
    meta = [{'text': lines_batch} for i in range(0, len(text_chunks), 100)]
    print(meta)
I have been trying different methods, but this code seems the simplest and is the only one I have almost managed to make work.
Keep in mind that I'm still learning Python.
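One likely cause of the repeats: the list comprehension that builds lines_batch re-chunks the whole string on every pass of the outer loop, so each iteration produces (and prints) the same list. A minimal sketch of building the chunks only once is below; the sample text is a stand-in, and the id and meta layout is only a guess at what is needed.

```python
CHUNK_SIZE = 100

# stand-in for the real text: 254 characters
clean_file_body_string = "abcdefghij" * 25 + "tail"

# slice the text once, outside of any other loop
text_chunks = [
    clean_file_body_string[i:i + CHUNK_SIZE]
    for i in range(0, len(clean_file_body_string), CHUNK_SIZE)
]

# one id and one metadata dict per chunk
ids = [str(n) for n in range(len(text_chunks))]
meta = [{"text": chunk} for chunk in text_chunks]

for chunk_id, entry in zip(ids, meta):
    print(chunk_id, len(entry["text"]))
```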

code is faster on single cpu but very slow on multiple processes why?

I have some code that sorts values from a sparse matrix and zips them together with other data. I applied some optimizations, and the version below is 20x faster than the original.
This code takes 8s on single CPU core:
# cosine_sim is a sparse csr matrix
# names is a numpy array of length 400k
cosine_sim_labeled = []
for i in range(0, cosine_sim.shape[0]):
    row = cosine_sim.getrow(i).toarray()[0]
    non_zero_sim_indexes = np.nonzero(row)
    non_zero_sim_values = row[non_zero_sim_indexes]
    non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
    non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
    zipped = zip(non_zero_names_values, non_zero_sim_values)
    cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
But if I run the same code on multiple cores (to make it even faster), it takes 300 seconds:
# split is an array of arrays of numbers like [[1,2,3], [4,5,6]]; it is meant to generate batches of row indexes, one batch per parallel process
split = np.array_split(range(0, cosine_sim.shape[0]), cosine_sim.shape[0] / batch)
def sort_rows(split):
    cosine_sim_labeled = []
    for i in split:
        row = cosine_sim.getrow(i).toarray()[0]
        non_zero_sim_indexes = np.nonzero(row)
        non_zero_sim_values = row[non_zero_sim_indexes]
        non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
        non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
        zipped = zip(non_zero_names_values, non_zero_sim_values)
        cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
    return cosine_sim_labeled
# this ensures parallel CPU execution
rows = Parallel(n_jobs=CPU_use, verbose=40)(delayed(sort_rows)(x) for x in split)
cosine_sim_labeled = np.vstack(rows).tolist()
You do realize that your new parallel function sort_rows does not even use the split argument? All it does is distribute all the data to all processes, which takes time. Then each process does the exact same calculation, only to return the whole data back to the main process, which again takes time.

Python - loop through N records at a time and then start again

I'm trying to write a script that calls the Google Translation API to translate each line of an Excel file that has 1000 lines.
I'm using pandas to load the file and read the values from a specific column, then I append them to a list and use the Google API to translate:
import os
from google.cloud import translate_v2 as translate
import pandas as pd
from datetime import datetime
# Variable for GCP service account credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'path to credentials json'
# Path to the file
filepath = r'../file.xlsx'
# Instantiate the Google Translation API Client
translate_client = translate.Client()
# Read all the information from the Excel file within 'test' sheet name
df = pd.read_excel(filepath, sheet_name='test')
# Define an empty list
elements = []
# Loop the data frame and append the list
for i in df.index:
    elements.append(df['EN'][i])
# Loop the list and translate each line
for item in elements:
    output = translate_client.translate(
        elements,
        target_language='fr'
    )
    result = [
        element['translatedText'] for element in output
    ]
    print("The values corresponding to key : " + str(result))
After appending, the list contains 1000 elements. The problem with the Google Translation API is that if you send too many text segments (as they call them) in one request, it returns the error below:
400 POST https://translation.googleapis.com/language/translate/v2: Too many text segments
I've investigated it and found that sending 100 lines at a time (in my case) would be a solution. Now I am a bit stuck.
How should I write the loop so that it takes 100 lines at a time, translates those 100 lines, does something with the result, and then proceeds with the next 100, and so on until it reaches the end?
Assuming you are able to pass a list into a single translate call, perhaps you could do something like this:
# Define a helper to step through the list in chunks
def chunker(seq, size):
    return (seq[pos : pos + size] for pos in range(0, len(seq), size))
# Then iterate and handle them accordingly
output = []
for chunk in chunker(elements, 100):
    temp = translate_client.translate(
        chunk,
        target_language='fr'
    )
    output.extend(temp)
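To try the chunking logic without calling the real service, here is a small self-contained sketch. fake_translate is a hypothetical stand-in for translate_client.translate; its 100-segment limit and its return shape (a list of dicts with a 'translatedText' key, as used in the question) are assumptions made for illustration only.

```python
def chunker(seq, size):
    """Return a generator over consecutive slices of `seq` of length <= size."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


def fake_translate(texts, target_language):
    """Hypothetical stand-in for the API call; rejects oversized requests."""
    if len(texts) > 100:  # assumed limit, taken from the question
        raise ValueError("Too many text segments")
    return [
        {"input": text, "translatedText": "[" + target_language + "] " + text}
        for text in texts
    ]


# made-up input: 1000 lines, like the Excel column in the question
elements = ["line " + str(n) for n in range(1000)]

results = []
for chunk in chunker(elements, 100):
    output = fake_translate(chunk, target_language="fr")
    # do something with each batch of 100 here, then keep the translated text
    results.extend(item["translatedText"] for item in output)

print(len(results), results[0], results[-1])
```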

issue in executing scikit-learn linear regression model

I have a dataset the sample structure of which looks like this:
SV,Arizona,618,264,63,923
SV,Arizona,367,268,94,138
SV,Arizona,421,268,121,178
SV,Arizona,467,268,171,250
SV,Arizona,298,270,62,924
SV,Arizona,251,272,93,138
SV,Arizona,215,276,120,178
SV,Arizona,222,279,169,250
SV,Arizona,246,279,64,94
SV,Arizona,181,281,97,141
SV,Arizona,197,286,125.01,182
SV,Arizona,178,288,175.94,256
SV,California,492,208,63,923
SV,California,333,210,94,138
SV,California,361,213,121,178
SV,California,435,217,171,250
SV,California,222,215,62,92
SV,California,177,218,93,138
SV,California,177,222,120,178
SV,California,156,228,169,250
SV,California,239,225,64,94
SV,California,139,229,97,141
SV,California,198,234,125,182
The records are in the order company_id,state,profit,feature1,feature2,feature3.
I wrote the code below, which breaks the whole dataset into chunks of 12 records (for each company and for each state in that company there are 12 records) and passes each chunk to the process_chunk() function. Inside process_chunk() the records in the chunk are split into a test set and a training set, with records number 10 and 11 going into the test set and the rest going into the training set. I also store the company_id and state of the test set records in global lists for later display of the predicted values, and I append the predicted values to a global list final_prediction.
The issue I am facing is that the company_list, state_list and test_set lists have the same size (about 200 records), but final_prediction has half that size (100 records). If the test_set list has size 200, shouldn't final_prediction also have size 200? My current code is:
from sklearn import linear_model
import numpy as np
import csv
final_prediction = []
company_list = []
state_list = []
def process_chunk(chuk):
    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    np.set_printoptions(suppress=True)
    prediction_list = []
    # to divide into training & test, I am putting line 10th and 11th in test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9:
            test_set_feature_list.append(np.array(line[3:4],dtype = float))
            test_set_label_list.append(np.array(line[2],dtype = float))
            company_list.append(line[0])
            state_list.append(line[1])
        elif count == 10:
            test_set_feature_list.append(np.array(line[3:4],dtype = float))
            test_set_label_list.append(np.array(line[2],dtype = float))
            company_list.append(line[0])
            state_list.append(line[1])
        else:
            training_set_feature_list.append(np.array(line[3:4],dtype = float))
            training_set_label_list.append(np.array(line[2],dtype = float))
        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)
    prediction_list.append(regr.predict(test_set_feature_list))
    np.set_printoptions(formatter={'float_kind':'{:f}'.format})
    for items in prediction_list:
        final_prediction.append(items)
# Load and parse the data
file_read = open('data.csv', 'r')
reader = csv.reader(file_read)
chunk, chunksize = [], 12
for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)
# process the remainder
#process_chunk(chunk)
print(len(company_list))
print(len(test_set_feature_list))
print(len(final_prediction))
Why does this difference in size occur, and what mistake am I making in my code that I can fix (maybe something I am doing naively that could be done in a better way)?
Here:
prediction_list.append(regr.predict(test_set_feature_list))
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
for items in prediction_list:
    final_prediction.append(items)
prediction_list will be a list of arrays (since predict returns an array).
So you are appending whole arrays to final_prediction, which is probably what messes up your count: len(final_prediction) will probably be equal to the number of chunks, because each chunk contributes one array that holds its two predictions.
At this point, the lengths only match if prediction_list has the same length as test_set_feature_list.
You probably want to use extend like this:
final_prediction.extend(regr.predict(test_set_feature_list))
This is also easier to read.
Then the length of final_prediction should be fine, and it should be a single flat list rather than a list of arrays.
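The difference between the two calls can be seen with plain lists standing in for the arrays that predict returns. The numbers below are made up, with two predictions per chunk as in the question.

```python
# made-up predictions: three chunks, two test records each
chunk_predictions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

appended = []
extended = []
for preds in chunk_predictions:
    appended.append(preds)  # adds the whole container as ONE element
    extended.extend(preds)  # adds each prediction as its own element

print(len(appended))  # one entry per chunk
print(len(extended))  # one entry per test record
```

With append the result has half as many entries as there are test records, which matches the 100 versus 200 mismatch described in the question.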

List integration as argument (beginner)

I am writing a script in Python, but I am a beginner (started yesterday).
Basically, I create chunks that I fill with ~10 pictures each, align the photos, build the model, and build the texture. Now I have my chunks and I want to align them...
From the manual:
PhotoScan.alignChunks(chunks, reference, method='points', accuracy='high', preselection=False)
Aligns specified set of chunks.
Parameters
chunks (list) – List of chunks to be aligned.
reference (Chunk) – Chunk to be used as a reference.
method (string) – Alignment method in ['points', 'markers'].
accuracy (string) – Alignment accuracy in ['high', 'medium', 'low'].
preselection (boolean) – Enables image pair preselection.
Returns Success of operation.
Return type boolean
I tried to align the chunks, but the script throws an error at line 26:
TypeError: expected a list of chunks as an argument
Do you have any idea how I can make it work?
This is my current code:
import PhotoScan
doc = PhotoScan.app.document
main_doc = PhotoScan.app.document
chunk = PhotoScan.Chunk()
proj = PhotoScan.GeoProjection()
proj.init("EPSG::32641")
gc = chunk.ground_control
gc.projection = proj
working_path = "x:\\New_agisoft\\ok\\Optical\\"
for i in range (1,3):
    new_chunk = PhotoScan.Chunk()
    new_chunk.label = str(i)
    loop = i*10
    loo = (i-1)*10
    doc.chunks.add(new_chunk)
    for j in range (loo,loop):
        file_path = working_path + str(j) + ".jpg"
        new_chunk.photos.add(file_path)
    gc = new_chunk.ground_control
    gc.loadExif()
    gc.apply()
    main_doc.active = len(main_doc.chunks) - 1
    doc.activeChunk.alignPhotos(accuracy="low", preselection="ground control")
    doc.activeChunk.buildModel(quality="lowest", object="height field", geometry="smooth", faces=50000)
    doc.activeChunk.buildTexture(mapping="generic", blending="average", width=2048, height=2048)
PhotoScan.alignChunks(,1,method="points",accuracy='low', preselection=True)
PhotoScan.alignChunks(,1,method="points",accuracy='low', preselection=True)
                      ^
Before the ',' you need the chunks!
Note: I have never used this module.
You're calling PhotoScan.alignChunks with an empty first argument, while the documentation states that it expects a list of chunks.
You could initialize an empty list before your loop:
chunks = []
And add completed chunks to the list from inside the loop:
# ...
chunks.append(new_chunk)
Then call the function:
PhotoScan.alignChunks(chunks, chunks[0], ...)
