Find specific Row of Data from Pandas Dataframe in While Loop - python

I am trying to take a csv, and read it as a Pandas Dataframe.
This Dataframe contains 4 columns of numbers.
I want to pick a specific row of data from the Dataframe.
In a while loop, I want to select a random row from the Dataframe and compare it to the row that I picked.
I want it to continue to run through the while loop until that random row is 100% equal to the row I picked prior.
Then I want the while loop to break, and I want it to have counted how many tries it took to find the match.
Here's what I have so far:
This is an example of the Dataframe:
   A  B   C   D
1  2  7  12  14
2  4  5  11  23
3  4  6  14  20
4  4  7  13  50
5  9  6  14  35
Here is an example of my efforts:
import time
import pandas as pd

then = time.time()
count = 0
df = pd.read_csv('Get_Numbers.csv')
df.columns = ['A', 'B', 'C', 'D']

while True:
    df_elements = df.sample(n=1)
    random_row = df_elements
    print(random_row)
    find_this_row = df['A','B','C','D' == '4','7','13,'50']
    print(find_this_row)
    if find_this_row != random_row:
        count += 1
    else:
        break

print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")
The above code gives an obvious error... but I have tried so many different versions now of finding the find_this_row numbers that I just don't know what to do anymore, so I left this attempt in.
What I would like to avoid is using the specific index for the row I am trying to find; I would rather use just the values to find it.
I am using df_elements = df.sample(n=1) to select a row at random. This was to avoid using random.choice as I was not sure if that would work or which way is more time/memory efficient, but I'm open to advice on that as well.
In my mind it seems simple, randomly select a row of data, if it doesn't match the row of data that I want, keep randomly selecting rows of data until it does match. But I can't seem to execute it.
Any help is EXTREMELY Appreciated!

You can use values, which returns an np.ndarray of shape (1, 2); use values[0] to get just the 1D array.
Then compare the arrays with any().
import time
import pandas as pd

then = time.time()
df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': [8, 9, 10]})
find_this_row = [2, 9]
print("Looking for: {}".format(find_this_row))
count = 0

while True:
    random_row = df.sample(n=1).values[0]
    print(random_row)
    if any(find_this_row != random_row):
        count += 1
    else:
        break

print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")

How about using values?
values will return you an array of the row's values, and then you can compare two arrays easily.
array1 == array2 returns an array of True and False values as it compares the corresponding positions of the two arrays. You can then check whether all of the values returned are True.
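A minimal sketch of that idea (the frame and target values here are hypothetical, following the question's shape):

import pandas as pd

df = pd.DataFrame({'A': [2, 4], 'B': [7, 5], 'C': [12, 11], 'D': [14, 23]})
find_this_row = [4, 5, 11, 23]

# Elementwise comparison yields an array of booleans; .all() collapses it to one.
is_match = (df.sample(n=1).values[0] == find_this_row).all()
print(is_match)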

Here's a method that tests one row at a time. We check if the values of the chosen row are equal to the values of the sampled DataFrame. We require that they all match.
row = df.sample(1)
counter = 0
not_a_match = True

while not_a_match:
    not_a_match = ~(df.sample(n=1).values == row.values).all()
    counter += 1

print(f'It took {counter} tries and the numbers were\n{row}')
#It took 9 tries and the numbers were
#   A  B   C   D
#4  4  7  13  50
If you want to get a little bit faster, select one row and then sample the DataFrame with replacement many times. You can then check for the first time the sampled row equals your chosen row, giving you how many 'tries' it would have taken in a while loop, but in much less time. The loop protects against the unlikely case that no match is found, given that it's sampling with replacement.
row = df.sample(1)
n = 0
none_match = True
k = 10  # Increase to check more matches at once.

while none_match:
    matches = (df.sample(n=len(df)*k, replace=True).values == row.values).all(1)
    none_match = ~matches.any()    # Determine if none still match
    n += k*len(df)*none_match      # Only increment if none match

n = n + matches.argmax() + 1
print(f'It took {n} tries and the numbers were\n{row}')
#It took 3 tries and the numbers were
#   A  B   C   D
#4  4  7  13  50
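A rough expectation check (an observation added here, not part of the original answer): each uniform sample matches the target row with probability p = 1/len(df), so the number of tries in the one-row-at-a-time loop is geometric with mean

E[tries] = 1/p = len(df) = 5

for the five-row example frame, which is consistent with the try counts shown above.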

A couple of hints first. This line does not work for me:
find_this_row = df['A','B','C','D' == '4','7','13,'50']
For 2 reasons:
a missing closing quote after '13
df is a DataFrame, so selecting with keys like the following is not supported:
df['A','B','C','D' ...
Either use a list of keys to return a DataFrame:
df[['A','B','C','D']]
or a single key to return a Series:
df['A']
Since you need the whole row with multiple columns, do this:
df2.iloc[4].values
array(['4', '7', '13', '50'], dtype=object)
Do the same with your sample row:
df2.sample(n=1).values
Comparison between rows needs to be done across all elements/columns:
df2.sample(n=1).values == df2.iloc[4].values
array([[ True, False, False, False]])
adding .all(), like the following:
(df2.sample(n=1).values == df2.iloc[4].values).all()
which returns
True/False
All together:
import time
import pandas as pd
then = time.time()
count = 0
while True:
random_row = df2.sample(n=1).values
find_this_row = df2.iloc[4].values
if (random_row == find_this_row).all() == False:
count += 1
else:
break
print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")

Related

is there a way to insert space between the characters of a string based on length using pandas?

I've been trying to write a function that inserts space based on the length of the strings in a column.
I have a dataframe with two columns, one with postcodes and the other with the length of those postcodes, shown below:
  new_pstl_cd  length
1      SS55HA       6
2     BD108EG       7
3      LS15HU       6
4       W19PX       5
I want to insert a space so that the column becomes
  new_pstl_cd
1     SS5 5HA
2    BD10 8EG
3     LS1 5HU
4      W1 9PX
I have tried the below code without success:
def insert_space(charachter):
    if postcode_test['length'] == 6:
        return (postcode_test['new_pstl_cd'].str[0:3] + charachter + postcode_test['new_pstl_cd'].str[3:])
    if postcode_test['length'] == 5:
        return (postcode_test['new_pstl_cd'].str[0:2] + charachter + postcode_test['new_pstl_cd'].str[3:])
    else:
        return (postcode_test['new_pstl_cd'].str[0:4] + charachter + postcode_test['new_pstl_cd'].str[4:])
How would I write a function using the lengths to do this? Please note that in this case it will always be the last 3 characters being separated.
You don't even need length:
df['new_pstl_cd'] = df.new_pstl_cd.str[:-3] + ' ' + df.new_pstl_cd.str[-3:]
Output:
  new_pstl_cd  length
1     SS5 5HA       6
2    BD10 8EG       7
3     LS1 5HU       6
4      W1 9PX       5
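For completeness, an equivalent regex-based variant (my addition, not part of the original answer) that also splits off the final three characters:

df['new_pstl_cd'] = df['new_pstl_cd'].str.replace(r'^(.*)(\w{3})$', r'\1 \2', regex=True)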
The problem
This if/elif code you tried reflects a common misunderstanding about how Pandas works.
if only ever accepts a single Boolean value, i.e. True or False. postcode_test['length'] == 6 is a Pandas Series object, i.e. a collection of many Boolean values. It doesn't even make sense to use it as an input to if, as the error message you surely saw explains.
You need some way to apply an operation to some rows but not to others, and you can't use if for that. Pandas actually offers several ways to achieve this.
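To see the error concretely, here is a minimal illustration (added for clarity; the message is what recent pandas versions raise):

import pandas as pd

lengths = pd.Series([6, 7, 6, 5])
try:
    if lengths == 6:  # a Series of booleans, not a single True/False
        pass
except ValueError as err:
    print(err)  # The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().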
Possible solutions
As always, DO NOT COPY AND PASTE CODE THAT YOU DO NOT UNDERSTAND. Refer to the library documentation for help. This code is untested; it might contain bugs and there is no warranty attached.
Boolean subsetting
The simplest method is probably with boolean subsetting, which is more or less what you were trying to do originally.
postcode_len_6 = postcode_test['length'] == 6
postcode_len_5 = postcode_test['length'] == 5
postcode_len_6 and postcode_len_5 are Series objects with Boolean elements. Their elements correspond to elements of postcode_test['length'] Series, whose elements in turn correspond to rows of the postcode_test Dataframe.
You can use these Series objects as per the indexing and selecting data guide:
postcode_test.loc[postcode_len_6, 'new_pstl_cd'] = \
postcode_test.loc[postcode_len_6, 'new_pstl_cd'].str[:3] + ' '+ \
postcode_test.loc[postcode_len_6, 'new_pstl_cd'].str[3:]
postcode_test.loc[postcode_len_5, 'new_pstl_cd'] = \
postcode_test.loc[postcode_len_5, 'new_pstl_cd'].str[:2] + ' '+ \
postcode_test.loc[postcode_len_5, 'new_pstl_cd'].str[2:]
Using the "mask" method
This one can be unintuitive sometimes, so make sure you read the user's guide and the API documentation.
You begin, as before, by first finding the rows where the lengths are 5 or 6:
postcode_len_6 = postcode_test['length'] == 6
postcode_len_5 = postcode_test['length'] == 5
But instead of the big incantation with .loc, you use .mask:
postcode_test['new_pstl_cd'] = postcode_test['new_pstl_cd']\
.mask(postcode_len_6, lambda s: s.str[:3]+' '+s.str[3:])
postcode_test['new_pstl_cd'] = postcode_test['new_pstl_cd']\
.mask(postcode_len_5, lambda s: s.str[:2]+' '+s.str[2:])
Mapping a plain Python function
Another solution is to write a "scalar-valued" Python function that simply operates on strings.
def make_new_postcode(p):
    if len(p) == 5:
        p = p[:2] + ' ' + p[2:]
    elif len(p) == 6:
        p = p[:3] + ' ' + p[3:]
    elif len(p) == 7:  # e.g. BD108EG in the sample data
        p = p[:4] + ' ' + p[4:]
    return p
postcode_test['new_pstl_cd'] = postcode_test['new_pstl_cd'].map(make_new_postcode)
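A quick sanity check of the scalar function on the postcodes from the question:

print(make_new_postcode('W19PX'))    # W1 9PX
print(make_new_postcode('SS55HA'))   # SS5 5HA
print(make_new_postcode('BD108EG'))  # BD10 8EG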
Applying a plain Python function
You can also apply a function row-wise if you really want to re-use the existing "length" column:
def make_new_postcode(row):
    l = row['length']
    p = row['new_pstl_cd']
    if l == 5:
        p = p[:2] + ' ' + p[2:]
    elif l == 6:
        p = p[:3] + ' ' + p[3:]
    elif l == 7:  # e.g. BD108EG in the sample data
        p = p[:4] + ' ' + p[4:]
    return p

postcode_test['new_pstl_cd'] = postcode_test[['length', 'new_pstl_cd']].apply(make_new_postcode, axis=1, result_type='reduce')

How to make this for loop faster?

I know that Python loops themselves are relatively slow compared to other languages, but that things become much faster when the correct functions are used.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and here is the code I wrote:
import numpy as np
import pandas as pd

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])

weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []

for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])

deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time; how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as what I get with your code:
print(deltaAc)
                  timestamp  summation
5  2016-01-01T00:02:18.000Z -41.799986
6  2016-01-01T00:02:36.000Z  51.418728
7  2016-01-01T00:02:54.000Z  -3.111184
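As a side note (added here for reference), Series.diff(i) is equivalent to s - s.shift(i), which is why stacking the diffs for i in range(5) reproduces the inner loop's x[i] - x[i-c] terms:

import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0])
assert s.diff(2).equals(s - s.shift(2))  # diff(i) == s - s.shift(i)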
First optimization: weights[c]/sumWeights can be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you could extract your columns as 1D numpy arrays, it would be great for you. It might look something like:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics)-5,))
delta_ac = []

for c in range(5):
    sum += tmp[c]*(c0_column[5:] - c0_column[5-c:len(acoustics)-c])

for i in range(len(acoustics)-5):
    delta_ac.append([time_column[5+i], sum[i]])
Dataframes have a great method rolling for constructing and applying windowing transformations, so you don't need loops at all:

# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
# The rolling window arrives oldest-first, so reverse the weights to line lag c
# up with weights[c] as in the original loop; raw=True passes a plain ndarray
# to the lambda, which also speeds up the apply.
df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(
    lambda x: ((x[-1] - x) * weights[::-1]).sum(), raw=True)
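A quick cross-check of the rolling version against the question's double loop (a verification sketch added here, on a small synthetic frame; the numbers are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({'c0': np.arange(10.0) ** 2})
w = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
w /= w.sum()

rolled = df['c0'].rolling(5).apply(lambda x: ((x[-1] - x) * w[::-1]).sum(), raw=True)

# Reference value from the original double loop, for row i = 7.
i = 7
ref = sum(w[c] * (df['c0'][i] - df['c0'][i - c]) for c in range(5))
assert np.isclose(rolled[i], ref)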

Python Pandas: Find a pattern in a DataFrame

I have the following DataFrame (1.2 million rows):
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
Now I am trying to find sequences: each "beginn" should match the first following "end" where the distance based on column B is at least 40.
For the provided DataFrame, the expected sequences were shown in an image that has not been preserved here.
Your help is highly appreciated.
I will assume that as your output you want a list of sequences with the starting and ending value. The second sequence that you identify in your picture has a distance lower than 40, so I also assumed that was an error.
import pandas as pd
from collections import namedtuple

df_test_2 = pd.DataFrame({"A": ["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],
                          "B": [1,10,50,60,70,80,90,100,110,111,112]})

sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])
beginn_flag = False
beginn_value = 0

for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']
    if not beginn_flag and state == 'beginn':
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False

print(sequence_list)
This code outputs the following:
[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]
Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
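Since the real frame has about 1.2 million rows, iterrows can become a bottleneck; the same scan written with itertuples is usually considerably faster (an optional variant added here, reusing the Sequence namedtuple from above):

sequence_list = []
beginn_flag = False
beginn_value = 0

for row in df_test_2.itertuples(index=False):
    if not beginn_flag and row.A == 'beginn':
        beginn_flag = True
        beginn_value = row.B
    elif beginn_flag and row.A == 'end' and row.B >= beginn_value + 40:
        sequence_list.append(Sequence(beginn_value, row.B))
        beginn_flag = False

print(sequence_list)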

Comparing values in Python data frame efficiently

I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have a CSV file for every Crypto with the following fields:
Date         Sell     Buy
43051.23918  1925.16  1929.83
43051.23919  1925.12  1929.79
43051.23922  1925.12  1929.79
43051.23924  1926.16  1930.83
43051.23925  1926.12  1930.79
43051.23926  1926.12  1930.79
43051.23927  1950.96  1987.56
43051.23928  1190.90  1911.56
43051.23929  1926.12  1930.79
I would like to check:
How many quotes will end with profit:
- for Buy positions: if one of the following Sells > current Buy.
- for Sell positions: if one of the following Buys < current Sell.
How much time it would take for a theoretical position to become profitable.
What the profit potential can be.
I'm using the following code:
import datetime as dt
import pandas as pd

# converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)

def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))

# variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)   # zero timedelta
sell_time = ole(43031.57567) - ole(43031.57567)  # zero timedelta
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")
# comparing to max
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
    for index1, row1 in df_slice.iterrows():
        if row["Buy"] < row1["Sell"]:
            buy_time += ole(row1["Date"]) - ole(row["Date"])
            profit_buy_counter += 1
            break
    else:
        no_profit_buy_counter += 1

# comparing to sell
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
    for index2, row2 in df_slice.iterrows():
        if row["Sell"] > row2["Buy"]:
            sell_time += ole(row2["Date"]) - ole(row["Date"])
            profit_sell_counter += 1
            break
    else:
        no_profit_sell_counter += 1
num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows

if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There are no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter

if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There are no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger amount (276K) it takes a long time (more than 10 hrs).
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
Note: the dates in the CSV are in OLE format, so I need to convert them to datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare against. pandas has cummax and cummin, but they run in the wrong direction for this, so we reverse the frame, accumulate, and reverse back:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
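With those columns in place, the question's "how many quotes end with profit" counts fall out directly (a hedged sketch; note that Max Sell and Min Buy as computed above include the current row, matching the one-row offset discussed next):

profitable_buys = (df['Buy Profit'] > 0).sum()    # Buy quotes that would end with profit
profitable_sells = (df['Sell Profit'] > 0).sum()  # Sell quotes that would end with profit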
After comparing your function and mine, there is a slight difference, as your a is offset one from the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note this will still take the very long time required by your function. The offset can be fixed with shift, but I don't want to run your function for long enough to figure out which way to shift it.

find if a number is divisible by the input numbers

Given two numbers a and b, we have to find the nth number which is divisible by a or b.
The format looks like below:
Input:
First line consists of an integer T, denoting the number of test cases.
Second line contains three integers a, b and N.
Output:
For each test case, print the Nth number on a new line.
Constraints:
1 ≤ T ≤ 10^5
1 ≤ a, b ≤ 10^4
1 ≤ N ≤ 10^9
Sample Input
1
2 3 10
Sample Output
15
Explanation
The numbers which are divisible by 2 or 3 are: 2, 3, 4, 6, 8, 9, 10, 12, 14, 15, and the 10th number is 15.
My code
test_case = input()
if int(test_case) <= 100000 and int(test_case) >= 1:
    for p in range(int(test_case)):
        count = 1
        j = 1
        inp = list(map(int, input().strip().split()))
        if 1 <= inp[0] <= 10000 and 1 <= inp[1] <= 10000 and 1 <= inp[2] <= 1000000000:
            while True:
                if count <= inp[2]:
                    k = j
                    if j % inp[0] == 0 or j % inp[1] == 0:
                        count = count + 1
                        j = j + 1
                    else:
                        j = j + 1
                else:
                    break
            print(k)
        else:
            break
Problem Statement:
For the single test case input 2000 3000 100000 it takes more than one second to complete. I want to get the results in less than 1 second. Is there a time-efficient approach to this problem, maybe using some data structure or algorithm?
For any two numbers a and b, the multiples repeat with period k = a*b. There are only so many multiples of a or b up to and including k. This set can be created like so:
s = {a*1, b*1, ..., a*(b-1), b*(a-1), a*b}
Say we take the values a=2, b=3; then s = {2, 3, 4, 6}. These are the possible values for each range of c (the 1-based index of the number we want):
[1 - 4]  => (2, 3, 4, 6)
[5 - 8]  => 6 + (2, 3, 4, 6)
[9 - 12] => 6*2 + (2, 3, 4, 6)
...
Notice that the values repeat in a predictable pattern. To get the row, divide by the length of the set s (call it n); the column within the set is the remainder mod n. Subtract 1 from c to account for the 1-based indexing used in the problem:
row = floor((c-1) / n)
column = (c-1) % n
result = (a*b)*row + s[column]
Python implementation:
a = 2000
b = 3000
c = 100000
# sorted() so the de-duplicated multiples can be indexed positionally
s = sorted(set([a*i for i in range(1, b+1)] + [b*i for i in range(1, a+1)]))
print((((c-1) // len(s)) * (a*b)) + s[(c-1) % len(s)])
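A brute-force cross-check of the formula on the sample input (a verification sketch added here):

def nth_divisible_bruteforce(a, b, n):
    # Walk the integers, counting those divisible by a or b.
    count, k = 0, 0
    while count < n:
        k += 1
        if k % a == 0 or k % b == 0:
            count += 1
    return k

assert nth_divisible_bruteforce(2, 3, 10) == 15  # matches the sample output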
I'm not certain I grasp exactly what you're trying to accomplish. But if I get it right, isn't the answer simply b*(N/2)? Since you are listing the multiples of both numbers, the Nth will always be the second number you list times N/2.
In your initial example that would be 3*10/2 = 15.
In the code example, it would be 3000*100000/2 = 150,000,000.
Update:
Code to compute the desired values using sets and lists to speed up the calculation process. I'm still wondering what the recurrence for the odd indexes could be, if anyone happens to stumble upon it...
a = 2000
b = 3000
c = 100000

# generate c multiples of each so nums is guaranteed to hold at least c entries
a_list = [a*x for x in range(1, c + 1)]
b_list = [b*x for x in range(1, c + 1)]
nums = set(a_list)
nums.update(b_list)
nums = sorted(nums)
print(nums[c-1])
This code runs in 0.14s on my laptop, which is significantly below the requested threshold. Nonetheless, these timings will depend on the machine the code is run on.
