Splitting multiple videos into shorter clips based on frame count - python

I can cut the video based on seconds; for example, I can cut the video from second 0 to second 10 and from second 10 to second 20. But I need to cut the video from frame 0 to frame 250 and from frame 250 to frame 500, because of an error due to the counting of seconds. Does anyone have any idea about this?
Here is the code I use to cut based on seconds:
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip

required_video_file = "H8.avi"

# times.txt holds one "start-end" pair (in seconds) per line
with open("Z:/temp/Letícia/Videos/teste/times.txt") as f:
    times = f.readlines()
times = [x.strip() for x in times]

for time in times:
    starttime = int(time.split("-")[0])
    endtime = int(time.split("-")[1])
    ffmpeg_extract_subclip(required_video_file, starttime, endtime,
                           targetname=str(times.index(time) + 1) + ".avi")
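One possible approach (a sketch, assuming the lines in times.txt hold frame ranges such as 0-250 and that moviepy's VideoFileClip is available) is to read the video's own fps and convert the frame numbers to seconds before calling ffmpeg_extract_subclip, so the cut points land on exact frames:

from moviepy.editor import VideoFileClip
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip

required_video_file = "H8.avi"
fps = VideoFileClip(required_video_file).fps  # frames per second of the source

# times.txt now holds "startframe-endframe" pairs, one per line
with open("Z:/temp/Letícia/Videos/teste/times.txt") as f:
    frame_ranges = [line.strip() for line in f]

for i, frame_range in enumerate(frame_ranges, start=1):
    start_frame, end_frame = (int(x) for x in frame_range.split("-"))
    ffmpeg_extract_subclip(required_video_file,
                           start_frame / fps, end_frame / fps,
                           targetname=str(i) + ".avi")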

Related

How to make a 1-hour gif using Python?

I want to make a 1-hour-long gif which will consist of 60 frames.
I have already built a gif-making function using PIL:
import glob
import os
from PIL import Image

sortedFiles = sorted(glob.glob('*.png'), key=os.path.getmtime)
sortedFilesBackwards = sorted(glob.glob('*.png'), key=os.path.getmtime, reverse=True)
full = [] + sortedFiles[:-1] + sortedFilesBackwards[:-1]
frames = [Image.open(image) for image in full]
frame_one = frames[0]
frame_one.save(f"{units}{fileName}.gif", format="GIF", append_images=frames,
               save_all=True, duration=12000, loop=0)
However, when I set duration = 360000 milliseconds (aiming for a 1-hour gif), I receive the following error:
struct.error: ushort format requires 0 <= number <= (32767 *2 +1)
I work on macOS.
P.S.: I think it has something to do with the maximum value allowed within a struct?
The duration is how long each frame is displayed for, in milliseconds, so you need to set it to 1,000 for each frame to be displayed for a second.
Or set it to 30,000 for each frame to be displayed for 30 seconds and double up your frames.
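A minimal sketch of that second suggestion, assuming the same sorted PNG frames as in the question (the output name one_hour.gif is just a placeholder): show every frame twice for 30 seconds each, so 120 frames x 30 s gives roughly one hour without any single per-frame delay exceeding the format's limit.

import glob
import os
from PIL import Image

frames = [Image.open(p) for p in sorted(glob.glob('*.png'), key=os.path.getmtime)]
doubled = [f for f in frames for _ in range(2)]   # show every frame twice
doubled[0].save("one_hour.gif", format="GIF", append_images=doubled[1:],
                save_all=True, duration=30000, loop=0)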
Set the FPS value when saving your GIF.
For example:
ani.save('animation.gif', fps=3)
The value you set will lengthen or shorten your gif

How do I stay within a for loop if certain conditions are met, before continuing to the next iteration?

I have three groups of data. I want to gather all of the data from the 3 groups. However, I can only get 100 records per request. Each group has more than 100 records. I can only tell how much data is in each group after I get the first batch of data for the group, which makes it seem like I can't use a while loop. Here's what I have.
import json
import requests

def getOTCdata(weekStartDate):
    # set start and record limit
    start = 0
    records = 100000
    groups = ["G1", "G2", "G3"]
    # create for loop to get data with filters
    for group in groups:
        params = {
            "compareFilters": [
                {"compareType": "equal", "fieldName": "group", "fieldValue": group}
            ]
        }
        url = 'myURL'
        data = requests.post(url, data=json.dumps(params))
        # code to download data - removed so it's not bogged down
        # check if the record total is more or less than the data remaining
        recordTotal = int(data.headers['Records'])
        if start + records < recordTotal:
            start += 100
            # I WANT TO CONTINUE IN GROUP 1
        else:
            pass  # MOVE TO GROUP 2
Let's assume G1 has 150 records. I want it to run in G1 one more time. Since I will have gathered all the data in the second turn, I'll then want to move to G2. The problem is that I don't know the record total until I make the request and download the data, so I can't use a while loop right after my for loop.
Use a while loop
recordTotal = int(data.headers['Records'])
while start + records < recordTotal:
    start += 100
It will not exit until the condition start + records < recordTotal stops being true.
I also changed the increment to start += 100 so that the start variable goes up by 100 each time.
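Since the total is only known after the first response, another way to structure it (a sketch; the offset and limit fields are assumptions about the API, while the Records header and myURL come from the question) is to nest a while loop inside the for loop and break out once a group is exhausted:

import json
import requests

def getOTCdata(weekStartDate):
    url = 'myURL'  # placeholder from the question
    groups = ["G1", "G2", "G3"]
    page_size = 100
    for group in groups:
        start = 0
        while True:
            params = {
                "compareFilters": [
                    {"compareType": "equal", "fieldName": "group", "fieldValue": group}
                ],
                "offset": start,     # assumed pagination field
                "limit": page_size,  # assumed pagination field
            }
            data = requests.post(url, data=json.dumps(params))
            # ... download/store this batch here ...
            recordTotal = int(data.headers['Records'])
            start += page_size
            if start >= recordTotal:
                break  # got everything for this group; move on to the next one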

pausing for loop of a data stream for further manipulation

I have the following code, which results in a streaming data frame with over 5000 rows every minute. As this data frame is built inside a for loop, I am unable to manipulate the data within it. So I need to know how to get the data frame out of the for loop, say every 5 minutes, and then restart collecting the information in the data frame again.
'''
import pandas as pd

df = pd.DataFrame(data=None)

def on_ticks(ws, ticks):
    global df
    for sc in ticks:
        token = sc['instrument_token']
        name = trd_portfolio[token]['name']  # trd_portfolio maps token -> instrument details
        ltp = sc['last_price']
        df1 = pd.DataFrame([name, ltp]).T
        df1.columns = ['name', 'ltp']
        df = df.append(df1, ignore_index=True)
    print(df)
'''
Resultant output is
name ltp
0 GLAXO 1352.2
1 GSPL 195.75
2 ABAN 18
3 ADANIPOWER 36.2
4 CGPOWER 6
... ... ...
1470 COLPAL 1317
1471 ITC 196.2
1472 JUBLFOOD 1698.5
1473 HCLTECH 550.6
1474 INDIGO 964.8
[1475 rows x 2 columns]
The further manipulations required on the data frame are:
'''
df['change'] = df.groupby('name')['ltp'].pct_change() * 100
g = df.groupby('name')['change']
counts = g.agg(
    pos_count=lambda s: s.gt(0).sum(),
    neg_count=lambda s: s.lt(0).sum(),
    net_count=lambda s: s.gt(0).sum() - s.lt(0).sum()).astype(int)
print(counts)
'''
However, I am unable to freeze the for loop for a certain time so that other processes can happen. I did try the sleep method, but it sleeps for the given time and then goes straight back to the for loop.
I need guidance on how to freeze the for loop for a certain time so that the other code can be executed before going back to the for loop to continue collecting the data.
There is no way to pause the loop itself, but you can pass the data to some other function that performs the other operations after every n iterations. The pseudocode would be something like:
def other_operation(data):
    # perform the other operations here
    ...

for loop in range(10000):
    data = ...  # collecting data
    if loop % 100 == 0:
        other_operation(data)
This will perform the other manipulations after every 100 loop iterations.
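If the aim is simply to run the aggregation every 5 minutes without stopping the stream, one option (a sketch, assuming on_ticks keeps being called by the streaming client and that trd_portfolio comes from the question) is to remember when the data was last processed and, inside the callback, process and reset the data frame once 5 minutes have elapsed:

import time
import pandas as pd

df = pd.DataFrame(columns=['name', 'ltp'])
last_processed = time.time()

def process(frame):
    # the manipulation from the question
    frame['change'] = frame.groupby('name')['ltp'].pct_change() * 100
    g = frame.groupby('name')['change']
    counts = g.agg(
        pos_count=lambda s: s.gt(0).sum(),
        neg_count=lambda s: s.lt(0).sum(),
        net_count=lambda s: s.gt(0).sum() - s.lt(0).sum()).astype(int)
    print(counts)

def on_ticks(ws, ticks):
    global df, last_processed
    rows = [{'name': trd_portfolio[sc['instrument_token']]['name'],  # trd_portfolio is from the question
             'ltp': sc['last_price']} for sc in ticks]
    df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
    if time.time() - last_processed >= 300:  # 5 minutes
        process(df)
        df = pd.DataFrame(columns=['name', 'ltp'])  # start collecting afresh
        last_processed = time.time()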

How to update time for each data received in Python

I am a beginner in Python and I am writing a script to save the time for data received from an algorithm.
In the script I have an algorithm which accepts a few parameters and returns the id of the data it detected. Below are a few of its outputs:
S.No Data Id Time
1 0 2018-11-16 15:00:00
2 0, 1 2018-11-16 15:00:02
3 0, 1 2018-11-16 15:00:03
4 0, 1, 2 2018-11-16 15:00:05
5 0, 1, 2 2018-11-16 15:00:06
6 0, 2 2018-11-16 15:00:08
From the above output, we can see that on the 1st attempt it detected data with id 0. On the 2nd attempt it detected the data with id 1, so the total data ids detected are 0, 1. On the 4th attempt, it detected id 2. This keeps going on, as it is running in while True. From the above we can say that the time period for id 0 is 8 sec, for id 1 it is 4 sec, and for id 2 it is 3 sec. I need to calculate these time periods. For this I have written the code below:
import time

run_once = True
data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id
data = amc_data_retention()  # data contains the data id

for dataID in data.items():
    if run_once:
        run_once = False
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0

for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
The first for loop runs once and saves the time for the dataID in the dict. In the next for loop, that time is subtracted from the current time and the total seconds are saved in data_dict_sec. In the first iteration the total seconds will be 0, but from the next iteration it will start saving the correct seconds. This works fine only if there is 1 data id. As soon as a 2nd data id comes, it does not record the time for it.
Can anyone please suggest a good way of writing this? The main objective is to keep saving the time period values for each data id. Please help. Thanks.
The only time this adds data_ID keys to data_dict is in the first run, but it should add each new data_ID it sees. I don't see that the first for loop is needed, since it only adds data_ID keys on the first run. It may do what you need if you move the dictionary key initialization into the second for loop, where it checks whether the data_ID is in data_dict: if it isn't, initialize it there.
Perhaps this would do what you need:
data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id
data = amc_data_retention()  # data contains the data id

for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0

What is the best practice to turn a Python function into one running in Apache Spark?

I have a Python program to deal with big data on one computer (16 CPU cores). Because the data is getting bigger and bigger, I need it to run on 5 computers. I am new to Spark and still feel confused after reading some docs. I would appreciate it if anyone could tell me the best way to set up a small cluster.
Here are some details:
The program counts the trade volume at every price for each stock (one day at a time) from a pandas dataframe of tick transactions.
There are more than 3000 stocks and 1 billion transactions in one day. The size of a data file (dataframe) is between 1 and 2 GB.
Getting the result for 300 days currently takes 3 days on one computer; I hope to add 4 more computers to shorten the time.
Here is the sample code in Python:
import os
import multiprocessing as mp

import pandas as pd
import sharedmem


def ticks_to_priceline(day=None):
    # file name for the tick dataframe file, one day per file
    fn = get_tick_dataframe_filename_byday(day)
    with pd.HDFStore(fn, 'r') as tick_store:
        tick_dataframe = tick_store.select("tick")

    all_stock_symbols = tick_dataframe.symbol.drop_duplicates()
    sblist = []
    # cut into small chunks
    chunk = 300
    for xx in range(len(all_stock_symbols) // chunk + 1):
        sblist.append(all_stock_symbols[xx * chunk:(xx + 1) * chunk])

    # run with all cpus
    with sharedmem.MapReduce(np=mp.cpu_count()) as pool:
        def work(chunk_list):
            result = {}
            for symbol in chunk_list:
                data = tick_dataframe[tick_dataframe.symbol == symbol]
                if not data.empty and len(data) > 99:
                    df1 = data.loc[:, [u'timestamp', u'price', u'volume']]
                    df1['vol_diff'] = df1.volume.diff().fillna(0)
                    df2 = df1.loc[:, ['price', 'vol_diff']]
                    df2.price = df2.price.apply(int)
                    rs = df2.groupby('price').sum()
                    rs = rs.sort_index(ascending=0).reset_index()
                    result[symbol] = rs
            return result

        rslist = pool.map(work, sblist)
    return rslist
I have already set up a Spark cluster in standalone mode for testing. My main problem is how to rewrite the code above.
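For what it's worth, here is a minimal PySpark sketch of the same aggregation, not a definitive rewrite: it assumes the daily tick data is first exported to a format Spark can read (Parquet here, with placeholder paths), and it replaces the sharedmem/multiprocessing pooling with DataFrame operations so the cluster distributes the work per symbol.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("priceline").getOrCreate()

# assumed: the daily tick data has been exported from HDF5 to Parquet
ticks = spark.read.parquet("ticks_one_day.parquet")

w = Window.partitionBy("symbol").orderBy("timestamp")
priceline = (ticks
    .withColumn("vol_diff",
                F.coalesce(F.col("volume") - F.lag("volume").over(w), F.lit(0)))
    .withColumn("price", F.col("price").cast("int"))
    .groupBy("symbol", "price")
    .agg(F.sum("vol_diff").alias("vol_diff"))
    .orderBy("symbol", F.desc("price")))

# placeholder output path
priceline.write.mode("overwrite").parquet("priceline_one_day.parquet")

The original's len(data) > 99 filter per symbol is omitted here; it could be reproduced with a per-symbol count if needed.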
