How to align similar values in two arrays in Python

I am trying to align two videos using their UTC timestamps.
For example:
video 1 timestamps = 1234.4321, 1234.4731, 1234.5432, 1234.5638, ...
video 2 timestamps = 1234.4843, 1234.5001, 1234.5632, 1234.5992, ...
I would like to align them so that the closest timestamps within a 0.0150 s window are paired, and no value in one array is paired with more than one value in the other array.
Example output:
video 1 timestamps = 1234.4321, 1234.4731, _________, 1234.5432, 1234.5638, _________, ...
video 2 timestamps = _________, 1234.4843, 1234.5001, _________, 1234.5632, 1234.5992, ...
Can someone help?
EDIT
There was a little confusion with the timestamps. The issue isn't that they simply need to be shifted once every two values. Hopefully this updated example will clear it up. Both examples are correct. A single solution should be able to solve both.
Example 2:
timestamp3 = 1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438
timestamp4 = 1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634
output:
timestamp3 = 1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438, _______________, ...
timestamp4 = _______________, 1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634, ...

Something like this:
timestamp3 = [1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438]
timestamp4 = [1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634]
len3 = len(timestamp3)
len4 = len(timestamp4)
ins = '_____________'
diff = 0.015
ii = jj = 0
while True:
    # timestamp3 is too early to match: put a gap in timestamp4
    if timestamp3[ii] < timestamp4[jj] - diff:
        timestamp4.insert(jj, ins)
        len4 += 1
    # timestamp4 is too early to match: put a gap in timestamp3
    elif timestamp4[jj] < timestamp3[ii] - diff:
        timestamp3.insert(ii, ins)
        len3 += 1
    ii += 1
    jj += 1
    # one list is exhausted: pad the shorter list and stop
    if ii == len3 or jj == len4:
        if len3 > len4:
            timestamp4.extend([ins]*(len3-len4))
        elif len4 > len3:
            timestamp3.extend([ins]*(len4-len3))
        break
print(timestamp3)
print(timestamp4)
Gives:
[1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438, '_____________']
['_____________', 1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634]
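A more general variant, offered only as a sketch: walk both sorted lists with two indices, pair the current values when they fall within the window, and otherwise emit the smaller value next to a gap marker. The name `align_timestamps`, the `gap` parameter and the greedy pairing rule are assumptions made for illustration; a greedy pass does not guarantee the globally closest pairing.

```python
def align_timestamps(a, b, window=0.015, gap=None):
    """Greedily align two sorted timestamp lists (hypothetical helper).

    Values within `window` of each other are paired; any other value
    is emitted alone, with `gap` in the opposite list.
    """
    out_a, out_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= window:
            # Close enough: pair the two values.
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is too early to match anything left in b.
            out_a.append(a[i])
            out_b.append(gap)
            i += 1
        else:
            # b[j] is too early to match anything left in a.
            out_a.append(gap)
            out_b.append(b[j])
            j += 1
    # Whatever remains in either list has no partner.
    for value in a[i:]:
        out_a.append(value)
        out_b.append(gap)
    for value in b[j:]:
        out_a.append(gap)
        out_b.append(value)
    return out_a, out_b


video1 = [1234.4321, 1234.4731, 1234.5432, 1234.5638]
video2 = [1234.4843, 1234.5001, 1234.5632, 1234.5992]
aligned1, aligned2 = align_timestamps(video1, video2)
print(aligned1)
print(aligned2)
```

Because the function builds new lists instead of inserting into the inputs, the originals stay untouched and the gap marker can be any value, for example `None` or a string of underscores.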

I think this is what you mean:
timestamps1 = [1234.4321, 1234.4731, 1234.5432, 1234.5638]
timestamps2 = [1234.4843, 1234.5001, 1234.5632, 1234.5992]
index = len(timestamps1)
while index > 0:
    timestamps1.insert(index,'_______')
    index -= 2
    timestamps2.insert(index,'_______')
print(timestamps1)
print(timestamps2)
Output:
[1234.4321, 1234.4731, '_______', 1234.5432, 1234.5638, '_______']
['_______', 1234.4843, 1234.5001, '_______', 1234.5632, 1234.5992]

Related

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but I hope an example makes it clearer.
I would like to create an output column (called "outputTics") and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, no timestamp falls exactly 0.21 seconds after another, so I put the 1 in the outputTics column two rows later. For example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If another 1 appears in the "inputTics" column within those 0.21 seconds, no 1 goes in the output column. An example is the 1 at index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe:
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
                  "inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
                  "outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta to avoid floating-point rounding issues if you want.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic as follows:
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    index = df[df == row].dropna().index
    # check for a 1 in input within 0.11 seconds
    # (pd.Timedelta assumes Timestamp holds datetime-like values;
    # for plain float seconds, add 0.11 directly instead)
    t = row['Timestamp'] + pd.Timedelta(seconds=0.11)
    indices = df[df.Timestamp <= t].index
    c = 0
    for i in indices:
        if df.loc[i, 'inputTics'] == 0:
            c = c + 1
        else:
            c = 0
            break
    if c > 0:
        df.loc[indices[-1] + 1, 'outputTics'] = 1
    return 0
Then call the function using df.apply:
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to full dataframe's timestamps
    # and return the position of the first timestamp at or after it
    oi = np.argmax((A.index - ii)>=0)
    output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values (in the outputTics column only)
A.loc[A['outputTics']<0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
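Another way to express the rule, as a rough sketch only: for each input tic, use `np.searchsorted` to find the first row whose timestamp is at or after the tic time plus the delay, and set the output there unless another input tic occurs in between. The function name `delayed_tics` and the exact suppression rule are assumptions; the 0.11 s delay is the value used in the answers above. On the sample dataframe from the question, this rule reproduces the expected outputTics column.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})


def delayed_tics(timestamps, input_tics, delay=0.11):
    """Hypothetical helper: mark the first row at least `delay` after each tic."""
    ts = np.asarray(timestamps, dtype=float)   # must be sorted ascending
    tics = np.asarray(input_tics, dtype=int)
    out = np.zeros(len(ts), dtype=int)
    for i in np.flatnonzero(tics == 1):
        # Position of the first timestamp >= tic time + delay.
        target = np.searchsorted(ts, ts[i] + delay, side="left")
        if target >= len(ts):
            continue  # the delayed time falls past the end of the data
        # Suppress the output if another input tic occurs before or at the target row.
        if not tics[i + 1:target + 1].any():
            out[target] = 1
    return out


A["outputTics"] = delayed_tics(A["Timestamp"], A["inputTics"])
print(A)
```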

extracting a wavelength of a sample wave produced with discrete data

In the following piece of code I've extracted a window of data from an audio sample (a 1000 Hz signal) and tried to obtain one wavelength of the signal.
https://paste.pound-python.org/show/HRVqQNy3w9Sr73q4oY8g/
sample = data[100:200]
x = 0
i = 1
num_occur = 0
while num_occur < 2:
    if sample[i] == sample[0]:
        x = i
        i += 1
        num_occur += 1
    else:
        i += 1
wavelen = sample[:x]
But with less success...
the image of the sample : (https://pasteboard.co/HFFXGxW.png)
I do understand what the problem is: even though matplotlib plots the wave as continuous (because of the high sampling frequency), the wave is made up of discrete data, so there may or may not be a data value satisfying:
sample[i] == sample[0]
I'll greatly appreciate any help and advice on how to get around this problem.
Someone enlightened me on how to get to the answer. It's a simple but logical approach.
I just needed to find the points where the signal crosses zero; the samples between the 1st and 3rd consecutive crossings make up one wavelength.
Here's the code I wrote:
sample = data[100:200]
i = 1
num_occur = 0
cut_zero = []
x = 0
while num_occur < 3:
    # sample[x] is non-negative: look for the first negative sample
    if sample[x] == abs(sample[x]):
        if sample[i] == abs(sample[i]):
            i += 1
        else:
            cut_zero.append(i)
            num_occur += 1
            i += 1
            x = i
    # sample[x] is negative: look for the first non-negative sample
    elif sample[x] != abs(sample[x]):
        if sample[i] == abs(sample[i]):
            cut_zero.append(i)
            i += 1
            num_occur += 1
            x = i
        else:
            i += 1
print(cut_zero)
a = cut_zero[0]
b = cut_zero[2]
wavelen = sample[a:b]
Maybe I could do it more efficiently :), if so let me know.
here's the image of the wavelength https://pasteboard.co/HFLtOmr.png
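The same zero-crossing idea can be written more compactly with NumPy, as a sketch under stated assumptions: `np.signbit` marks the negative samples, and a change between neighbouring entries marks a crossing. The function name `one_wavelength`, the 44100 Hz sampling rate and the synthetic sine wave are made up for the demonstration, since the original audio data is not available here.

```python
import numpy as np


def one_wavelength(sample):
    """Hypothetical helper: slice from the 1st to the 3rd zero crossing."""
    sample = np.asarray(sample, dtype=float)
    negative = np.signbit(sample)
    # Index of the first sample after each sign change.
    crossings = np.flatnonzero(negative[1:] != negative[:-1]) + 1
    if len(crossings) < 3:
        raise ValueError("need at least three zero crossings")
    return sample[crossings[0]:crossings[2]]


# Synthetic stand-in for data[100:200]: a 1000 Hz sine sampled at an assumed 44100 Hz.
rate = 44100
n = np.arange(100)
sample = np.sin(2 * np.pi * 1000 * n / rate + 0.5)

wavelen = one_wavelength(sample)
print(len(wavelen))  # close to rate / 1000 samples per period
```

Two consecutive crossings span half a period, which is why the slice runs from the first crossing to the third.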

Comparing values in Python data frame efficiently

I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have CSV file for every Crypto with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
1. How many quotes will end with a profit:
   - for Buy positions: if any of the following Sells > current Buy.
   - for Sell positions: if any of the following Buys < current Sell.
2. How long it would take a theoretical position to become profitable.
3. What the profit potential is.
I'm using the following code:
import datetime as dt
import pandas as pd

# converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)

def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))

# variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")

# comparing to max
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
        for index1, row1 in df_slice.iterrows():
            if row["Buy"] < row1["Sell"]:
                buy_time += ole(row1["Date"]) - ole(row["Date"])
                profit_buy_counter += 1
                break
    else:
        no_profit_buy_counter += 1

# comparing to sell
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
        for index2, row2 in df_slice.iterrows():
            if row["Sell"] > row2["Buy"]:
                sell_time += ole(row2["Date"]) - ole(row["Date"])
                profit_sell_counter += 1
                break
    else:
        no_profit_sell_counter += 1

num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There is no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There is no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger file (276K lines) it takes a long time (more than 10 hours).
What can I do to improve it?
Is there a "Pythonic" way to compare each value in a data frame to all following values?
Note: the dates in the CSV are in OLE format, so I need to convert them to datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference, as your a is offset one off the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note that this check still takes as long as your original function. The one-row offset can be handled with shift, but I didn't want to run your function long enough to work out which direction to shift.
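As a sketch of how the offset might be handled, assuming the goal is to compare each row only with the rows strictly after it: compute the reversed cumulative max/min as above, then `shift(-1)` so that row i sees the extreme of rows i+1 onward. The variable names are illustrative, and the data is the sample table from the question. This covers only the profit counts and the profit potential, not the time-to-profit calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [43051.23918, 43051.23919, 43051.23922, 43051.23924, 43051.23925,
             43051.23926, 43051.23927, 43051.23928, 43051.23929],
    "Sell": [1925.16, 1925.12, 1925.12, 1926.16, 1926.12,
             1926.12, 1950.96, 1190.90, 1926.12],
    "Buy": [1929.83, 1929.79, 1929.79, 1930.83, 1930.79,
            1930.79, 1987.56, 1911.56, 1930.79],
})

# Max Sell / min Buy over the rows strictly after each row.
# The last row has no following rows, so it becomes NaN.
future_max_sell = df["Sell"][::-1].cummax()[::-1].shift(-1)
future_min_buy = df["Buy"][::-1].cummin()[::-1].shift(-1)

buy_profit = future_max_sell - df["Buy"]    # best case for a Buy position
sell_profit = df["Sell"] - future_min_buy   # best case for a Sell position

# NaN > 0 is False, so the last row is never counted as profitable.
profitable_buys = int((buy_profit > 0).sum())
profitable_sells = int((sell_profit > 0).sum())
avg_max_profit_buy = buy_profit[buy_profit > 0].mean()
avg_max_profit_sell = sell_profit[sell_profit > 0].mean()

print(profitable_buys, profitable_sells)
```

Each step is a single pass over the column, so the run time grows linearly with the number of rows instead of quadratically.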

Python: Why am I getting a "ZeroDivisionError: division by zero" in function?

My goal is to count the total amount of tweets in a file that fall under certain time zones.
I have the following function (I have noted the trouble area near the end of the function with comments):
def readTweets(inFile, wordsName):
    words = []
    lat = 0
    long = 0
    keyword = keywords(wordsName)
    sents = keywordSentiment(wordsName)
    value = 0
    eastern = 0
    central = 0
    mountain = 0
    pacific = 0
    a = 0
    b = 0
    c = 0
    d = 0
    easternTweets = 0
    centralTweets = 0
    mountainTweets = 0
    pacificTweets = 0
    for line in inFile:
        entry = line.split()
        for n in range(0, len(entry) - 1):
            entry[n] = entry[n].strip("[],!?#./-=+_#")
            if n > 4: # n>4 because words begin on 5th index of list
                entry[n] = entry[n].lower()
                words.append(entry[n])
        lat = float(entry[0])
        long = float(entry[1])
        timezone = getTimeZone(lat, long)
        if timezone == "eastern":
            easternTweets += 1
        if timezone == "central":
            centralTweets += 1
        if timezone == "mountain":
            mountainTweets += 1
        if timezone == "pacific":
            pacificTweets += 1
        for i in range(0, len(words)):
            for k in range(0, len(keyword)):
                if words[i] == keyword[k]:
                    value = int(sents[k])
                    if timezone == "eastern":
                        eastern += value
                        a += 1
                    if timezone == "central":
                        central += value
                        b += 1
                    if timezone == "mountain":
                        mountain += value
                        c += 1
                    if timezone == "pacific":
                        pacific += value
                        d += 1
    # the values of a,b,c,d are 0
    easternTotal = eastern/a # getting error
    centralTotal = central/b # for
    mountainTotal = mountain/c # these
    pacificTotal = pacific/d # values
    print("Total tweets per time zone:")
    print("Eastern: %d" % easternTweets)
    print("Central: %d" % centralTweets)
    print("Mountain: %d" % mountainTweets)
    print("Pacific: %d" % pacificTweets)
I am getting a ZeroDivisionError: division by zero error for easternTotal and the other total values that use a, b, c, and d for division.
If I print the values of a, b, c, or d it shows 0. My question is why are their values 0? Does the value of a, b, c, and d not change in the if statements?
The only way this can happen is if the code that increments a, b, c and d is never reached.
There are a few possible reasons:
- inFile is empty, so the outer for loop never enters its body
- len(words) is 0, so that for loop never enters its body
- len(keyword) is 0, so that for loop never enters its body
- the value of timezone is something other than the values you test for
words is initially [], so its length stays 0 if the loop that appends to it never runs.
From here it's impossible for us to tell which of these is happening, but a few print statements should make it easy for you to find out.
You are dividing eastern by 0. You can avoid it with:
easternTotal = eastern/a if a > 0 else eastern
This happens because a, b, c and d all start at 0.
If readTweets(inFile, wordsName) does not get any data, eastern/a becomes eastern/0.
So first make sure readTweets() actually received data.
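One way to keep the guard in a single place, sketched here with made-up numbers: a small helper that returns `None` when nothing was counted, applied to per-zone totals held in dictionaries. The helper name `average_sentiment` and the sample totals and counts are assumptions for illustration, not part of the original function.

```python
def average_sentiment(total, count):
    """Hypothetical helper: total / count, or None when count is 0."""
    return total / count if count else None


# Made-up totals and match counts per time zone.
totals = {"eastern": 12, "central": 0, "mountain": 5, "pacific": 9}
counts = {"eastern": 4, "central": 0, "mountain": 2, "pacific": 3}

averages = {zone: average_sentiment(totals[zone], counts[zone]) for zone in totals}

for zone, avg in averages.items():
    if avg is None:
        print("%s: no keyword matches" % zone)
    else:
        print("%s: %.2f" % (zone, avg))
```

Returning `None` instead of a number makes "no matches" distinguishable from an average of 0, which may help when tracking down why the counters never change.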

Python code not working as intended

I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7 day moving average for data. Something wasn't going right so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    if j == (i+6):
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array which made all the moving_average values wrong. But if I remove the first for loop with the variable i and manually set i = 0 or any number in the range of the data set and run the exact same code from the inner for loop, sum_7 comes out as a length 7 numpy array. Originally, I just did sum += temp[j] but the same problem occurred, the total sum ended up as just the single value.
I've been staring at this trying to fix it for 3 hours and I'm clueless what's wrong. Originally I wrote the function in R so all I had to do was convert to python language and I don't know why sum_7 is coming up as a single value when there are two for loops. I tried to manually add an index variable to act as i to use it in the range(i, i+7) but got some weird error instead. I also don't know why that is.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
You can use the rolling() and mean() functions from pandas.
Link to the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This also gives you some NaN values, but that is expected with a rolling mean: the first 6 rows don't have 7 data points behind them.
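Note that with the default settings a 7-row window containing any missing value also yields NaN. If the goal is the behaviour of the loop in the question (average the non-missing values in each full window, NaN only when all 7 are missing), a possible sketch is to pass `min_periods=1` and drop the first 6 partial windows. The sample series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values.
temp = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan])

# min_periods=1: a window needs only one non-missing value to produce a mean.
# iloc[6:] keeps only the windows that span a full 7 rows.
moving_average = temp.rolling(window=7, min_periods=1).mean().iloc[6:]
print(moving_average)
```

The result keeps the index of the last row in each window, so the first value here is labelled 6 and covers rows 0 through 6.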
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    # The following condition should be indented one more level
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    if j == (i+6):
        # this ^ condition does not do what you meant
        # you should use a flag instead
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but this is not readable. Here's the relevant documentation.
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
    ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
    average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
    moving_average = np.append(moving_average, average)
This could be refactored further:
def average(ngram):
    # use the window that was passed in, not the global temp
    valid = [t for t in ngram if not pd.isnull(t)]
    if not valid:
        return np.nan
    return sum(valid) / len(valid)

def ngrams(seq, n):
    # len(seq) - n + 1 windows, matching range(len(temp)-6) above
    for i in range(len(seq) - n + 1):
        yield seq[i:i+n]

moving_average = [average(k) for k in ngrams(temp, 7)]
