How can I read each row up to the first occurrence of NaN?


I want to read each row of an Excel file and process it independently.
Here is how the data looks in the Excel file:
12 32 45 67 89 54 23 56 78 98
34 76 34 89 34 3
76 34 54 12 43 78 56
76 56 45 23 43 45 67 76 67 8
87 9 9 0 89 90 6 89
23 90 90 32 23 34 56 9 56 87
23 56 34 3 5 8 7 6 98
32 23 34 6 65 78 67 87 89 87
12 23 34 32 43 67 45
343 76 56 7 8 9 4
But when I read it with pandas, the remaining columns are filled with NaN.
After reading with pandas, the data looks like this:
0 12 32 45 67 89 54 23.0 56.0 78.0 98.0
1 34 76 34 89 34 3 NaN NaN NaN NaN
2 76 34 54 12 43 78 56.0 NaN NaN NaN
3 76 56 45 23 43 45 67.0 76.0 67.0 8.0
4 87 9 9 0 89 90 6.0 89.0 NaN NaN
5 23 90 90 32 23 34 56.0 9.0 56.0 87.0
6 23 56 34 3 5 8 7.0 6.0 98.0 NaN
7 32 23 34 6 65 78 67.0 87.0 89.0 87.0
8 12 23 34 32 43 67 45.0 NaN NaN NaN
9 343 76 56 7 8 9 4.0 5.0 8.0 68.0
Here you can see that the remaining columns of each row are filled with NaN, which I don't want.
I also don't want to replace the NaN values with some other value or drop the rows that contain NaN.
How can I read the columns of each row up to the first occurrence of NaN?
For example, the second row in pandas is 34 76 34 89 34 3 NaN NaN NaN NaN,
so my desired output is that it reads only 34 76 34 89 34 3.
My preference is pandas, but if that is not possible, is there another way of doing it, such as with other libraries?
Any resource or reference would be helpful.
Thanks

When calling pd.read_excel, try setting keep_default_na=False. This prevents the default NaN values from being applied while reading.
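If the NaN padding is already in the DataFrame, another option is to trim each row after loading. The following is a minimal sketch under stated assumptions: the sample values are typed in by hand from the question rather than read from the Excel file, and `row_until_first_nan` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hand-typed sample standing in for the frame returned by pd.read_excel (assumption).
df = pd.DataFrame([
    [12, 32, 45, 67, 89, 54, 23, 56, 78, 98],
    [34, 76, 34, 89, 34, 3, np.nan, np.nan, np.nan, np.nan],
    [76, 34, 54, 12, 43, 78, 56, np.nan, np.nan, np.nan],
])

def row_until_first_nan(row):
    """Return the values of one row up to (not including) the first NaN."""
    values = row.to_numpy()
    mask = pd.isna(values)
    # argmax gives the position of the first True; fall back to the full length.
    end = mask.argmax() if mask.any() else len(values)
    return values[:end].tolist()

# One independent Python list per row, each of its own length.
rows = [row_until_first_nan(row) for _, row in df.iterrows()]
print(rows[1])
```

Because the columns contain NaN, pandas stores them as floats, so the trimmed values come back as floats (34.0 rather than 34).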

Related

How to add all the data of the dataframe by identifying the common column in dataframe?

I've a dataframe DF1:
YEAR JAN_EARN FEB_EARN MAR_EARN APR_EARN MAY_EARN JUN_EARN JUL_EARN AUG_EARN SEP_EARN OCT_EARN NOV_EARN DEC_EARN
0 2017 20 21 22.0 23 24.0 25.0 26.0 27.0 28 29.0 30 31
1 2018 30 31 32.0 33 34.0 35.0 36.0 37.0 38 39.0 40 41
2 2019 40 41 42.0 43 NaN 45.0 NaN NaN 48 49.0 50 51
3 2017 50 51 52.0 53 54.0 55.0 56.0 57.0 58 59.0 60 61
4 2017 60 61 62.0 63 64.0 NaN 66.0 NaN 68 NaN 70 71
5 2021 70 71 72.0 73 74.0 75.0 76.0 77.0 78 79.0 80 81
6 2018 80 81 NaN 83 NaN 85.0 NaN 87.0 88 89.0 90 91
I want to group the rows that share the same value in the "YEAR" column and sum the data in the other columns.
I tried this:
DF2['New'] = DF1.groupby(DF1.groupby('YEAR')).sum()
The Expected Output is like:
DF2;
YEAR JAN_EARN FEB_EARN ......
0 2017 130 133 ......
1 2018 110 112 ......
2 2019 40 41 ......
3 2021 70 71 ......
Thank You For Your Time :)

You were halfway there; just fix a few small details as follows:
Don't assign a groupby object to a newly defined column. Replace your line "DF2['New'] = ..." with:
DF2 = DF1.groupby('YEAR', as_index=False).sum()
If you wish to see all the columns relative to each year, create a list of the years in your df, then apply a mask for each element in that list. You will obtain one dataframe per year, which you can then concatenate with axis=0.
Another way of doing so would be to sort DF1 by year in chronological order and then slice. If we have misunderstood your question, please elaborate so we can help.
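As a quick check of the groupby approach, here is a small sketch that uses only the first three earnings columns from the question's DF1 (the remaining columns are left out for brevity, which is an assumption of this example).

```python
import numpy as np
import pandas as pd

# Subset of the question's DF1: YEAR plus three of the monthly columns.
DF1 = pd.DataFrame({
    "YEAR": [2017, 2018, 2019, 2017, 2017, 2021, 2018],
    "JAN_EARN": [20, 30, 40, 50, 60, 70, 80],
    "FEB_EARN": [21, 31, 41, 51, 61, 71, 81],
    "MAR_EARN": [22.0, 32.0, 42.0, 52.0, 62.0, 72.0, np.nan],
})

# as_index=False keeps YEAR as a regular column; sum() skips NaN by default.
DF2 = DF1.groupby("YEAR", as_index=False).sum()
print(DF2)
```

The JAN_EARN and FEB_EARN totals for 2017 and 2018 come out as 130/133 and 110/112, matching the expected output in the question.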

Merge dataframes and merge also columns into a single column

I have a dataframe df1
index A B C D E
0 0 92 84
1 1 98 49
2 2 49 68
3 3 0 58
4 4 91 95
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
and also this data frame df2
index C D E F
0 0 27 95 51 45
1 1 99 33 92 67
2 2 68 37 29 65
3 3 99 25 48 40
4 4 33 74 55 66
5 13 65 76 19 62
I wish to get to the following outcome when merging df1 and df2
index A B C D E F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
However, I keep getting this when using pd.merge():
df_total=df1.merge(df2,how="outer",on="index",suffixes=(None,"_"))
df_total.replace(to_replace=np.nan,value=" ", inplace=True)
df_total
index A B C D E C_ D_ E_ F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
Is there a way to get the desired outcome using pd.merge or a similar function?
Thanks

You can use .combine_first():
# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set indices and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
index A B C D E F
0 0 92.0 84.0 27.0 95.0 51.0 45.0
1 1 98.0 49.0 99.0 33.0 92.0 67.0
2 2 49.0 68.0 68.0 37.0 29.0 65.0
3 3 0.0 58.0 99.0 25.0 48.0 40.0
4 4 91.0 95.0 33.0 74.0 55.0 66.0
5 5 47.0 56.0 52.0 25.0 58.0
6 6 86.0 71.0 34.0 39.0 40.0
7 7 80.0 78.0 0.0 86.0 12.0
8 8 0.0 8.0 30.0 88.0 42.0
9 9 69.0 83.0 7.0 65.0 60.0
10 10 93.0 39.0 10.0 90.0 45.0
11 13 65.0 76.0 19.0 62.0
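To see what combine_first does on a smaller scale, here is a reduced sketch. The two frames below are cut-down stand-ins for df1 and df2 (only columns A, C and F, and three rows each), so the exact values are assumptions chosen to mirror the question.

```python
import numpy as np
import pandas as pd

# Reduced stand-ins for the question's frames; missing cells are already NaN.
df1 = pd.DataFrame({"index": [0, 1, 5], "A": [92, 98, 47], "C": [np.nan, np.nan, 52]})
df2 = pd.DataFrame({"index": [0, 1, 13], "C": [27, 99, 65], "F": [45, 67, 62]})

# Values from df1 win; gaps in df1 are filled from df2, and rows or columns
# that exist only in df2 are added to the result.
combined = (
    df1.set_index("index")
       .combine_first(df2.set_index("index"))
       .reset_index()
)
print(combined)
```

Rows are aligned on the shared "index" key, so row 13 (present only in df2) is appended and column C is filled in for rows 0 and 1.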

Why is my second thread not executing at all?

I have a GUI that takes values. When the button on the GUI is pressed, it receives a data block from my server, puts it into a queue, and uses threads to call two functions, recvData() and calculate_threshold(). The first function keeps receiving data blocks from the server and puts them in the queue, whereas the second function continuously removes data blocks from the queue and performs some calculations on them before writing the results to a file. It continues for as long as the queue is not empty.
Below is my client code:
import socket
import turtle
#import time
import queue
import threading
from tkinter import *
class GUI:
entries = []
def __init__(self, master):
self.master = master
master.title("Collision Detection")
self.buff_data = queue.Queue()
self.t1 = threading.Thread(target = self.recvData)
self.t2 = threading.Thread(target=self.calculate_threshold)
self.entries = []
self.host = '127.0.0.1'
self.port = 5000
self.s = socket.socket()
self.s.connect((self.host, self.port))
self.create_GUI()
def create_GUI(self):
self.input_label = Label(root, text = "Input all the gratings set straight wavelength values in nm")
self.input_label.grid(row = 0)
self.core_string = "Core "
self.label_col_inc = 0
self.entry_col_inc = 1
self.core_range = range(1, 5)
for y in self.core_range:
self.core_text = self.core_string + str(y) + '_' + '25'
self.core_label = Label(root, text = self.core_text)
self.entry = Entry(root)
self.core_label.grid(row=1, column=self.label_col_inc, sticky=E)
self.entry.grid(row=1, column=self.entry_col_inc)
self.entries.append(self.entry)
self.label_col_inc += 2
self.entry_col_inc += 2
self.threshold_label = Label(root, text = "Threshold in nm")
self.entry_threshold = Entry(root)
self.threshold_label.grid(row = 2, sticky = E)
self.entry_threshold.grid(row = 2, column = 1)
self.light_label = Label(root, text = 'Status')
self.light_label.grid(row = 3, column = 3)
self.canvas = Canvas(root, width = 150, height = 50)
self.canvas.grid(row = 4, column = 3)
# Green light
self.green_light = turtle.RawTurtle(self.canvas)
self.green_light.shape('circle')
self.green_light.color('grey')
self.green_light.penup()
self.green_light.goto(0,0)
# Red light
self.red_light = turtle.RawTurtle(self.canvas)
self.red_light.shape('circle')
self.red_light.color('grey')
self.red_light.penup()
self.red_light.goto(40,0)
self.data_button = Button(root, text = "Get data above threshold", command = self.getData)
self.data_button.grid(row = 5, column = 0)
len_message = self.s.recv(4)
print('len_message', len_message)
bytes_length = int(len_message.decode('utf-8')) # for the self-made server
recvd_data = self.s.recv(bytes_length)
print('data', recvd_data)
self.buff_data.put(recvd_data)
#print('buffer', self.buff_data)
self.t1.start()
self.t2.start()
def recvData(self):
len_message = self.s.recv(4)
print('len_message', len_message)
while len_message:
bytes_length = int(len_message.decode('utf-8')) # for the self-made server
recvd_data = self.s.recv(bytes_length)
print('data', recvd_data)
self.buff_data.put(recvd_data)
len_message = self.s.recv(4)
print('len_message', len_message)
else:
print('out of loop')
self.s.close()
def calculate_threshold(self):
while not self.buff_data.empty:
rmv_data = self.buff_data.get()
stringdata = rmv_data.decode('utf-8')
rep_str = stringdata.replace(",", ".")
splitstr = rep_str.split()
# received wavelength values
inc = 34
wav_threshold = []
for y in self.entries:
straight_wav = float(y.get())
wav = float(splitstr[inc])
wav_diff = wav - straight_wav
if wav_diff < 0:
wav_diff = wav_diff * (-1)
wav_threshold.append(wav_diff)
inc += 56
threshold = float(self.entry_threshold.get())
# writing into the file
data = []
inc1 = 0
col1 = 2
col2 = 6
data.insert(0, (str(splitstr[0])))
data.insert(1, (str(splitstr[1])))
for x in wav_threshold:
if (x > threshold):
self.red_light.color('red')
self.green_light.color('grey')
data.insert(col1, (str(splitstr[34 + inc1])))
data.insert(col2,(str(x)))
else:
self.red_light.color('grey')
self.green_light.color('green')
data.insert(col1,'-')
data.insert(col2,'-')
inc1 += 56
col1 += 1
col2 += 1
self.write_file(data)
# function to write into the file
def write_file(self,data):
with open("Output.txt", "a") as text_file:
text_file.write('\t'.join(data[0:]))
text_file.write('\n')
if __name__ == '__main__':
root = Tk()
gui = GUI(root)
root.mainloop()
My server code is:
import socket
import threading
import os
def Main():
host = '127.0.0.1'
port = 5000
s = socket.socket()
s.bind((host,port))
s.listen(5)
print("Server started")
while True:
c,addr = s.accept()
print("Client connected ip:<" + str(addr) + ">")
c.sendall('1685 2020/03/02 14:42:05 318301 4 1 25 0 0 0 0 1513,094 1516,156 1519,154 1521,969 1525,029 1527,813 1530,921 1533,869 1536,740 1539,943 1542,921 1545,879 1548,843 1551,849 1554,760 1557,943 1560,782 1563,931 1566,786 1569,751 1572,690 1575,535 1578,638 1581,755 1584,759 41 39 33 39 48 44 49 55 61 58 64 55 68 74 68 59 57 74 61 68 58 64 54 47 46 2 25 0 0 0 0 1512,963 1515,935 1518,857 1521,849 1524,655 1527,577 1530,332 1533,233 1536,204 1539,488 1542,571 1545,725 1549,200 1552,430 1555,332 1558,484 1561,201 1564,285 1567,001 1569,870 1572,758 1575,491 1578,512 1581,547 1584,405 48 43 37 42 57 54 59 62 67 58 71 59 77 82 82 64 71 88 77 79 72 73 63 49 50 3 25 0 0 0 0 1513,394 1516,517 1519,536 1522,082 1525,428 1527,963 1531,288 1534,102 1536,659 1539,757 1542,707 1545,627 1548,389 1551,459 1554,406 1557,986 1560,667 1564,103 1567,036 1570,144 1573,189 1575,888 1579,185 1582,323 1585,338 35 36 32 37 57 58 61 64 75 73 70 62 61 62 59 51 52 64 58 62 70 70 64 54 55 4 25 0 0 0 0 1512,658 1515,752 1518,797 1521,707 1524,744 1527,627 1530,871 1534,002 1537,086 1540,320 1543,217 1546,010 1548,660 1551,385 1554,253 1557,074 1560,193 1563,116 1566,043 1568,963 1571,855 1574,957 1577,954 1581,128 1584,273 43 42 39 40 56 50 56 62 65 54 59 62 75 79 73 63 67 77 73 75 68 62 54 51 51 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN'.encode())
c.sendall('1685 2020/03/03 14:42:05 318302 4 1 25 0 0 0 0 1513,094 1516,156 1519,154 1521,969 1525,029 1527,813 1530,921 1533,869 1536,740 1539,943 1542,921 1545,879 1548,843 1551,849 1554,760 1557,943 1560,782 1563,931 1566,786 1569,751 1572,690 1575,535 1578,638 1581,755 1584,759 41 39 33 39 48 44 49 55 61 58 64 55 68 74 68 59 57 74 61 68 58 64 54 47 46 2 25 0 0 0 0 1512,963 1515,935 1518,857 1521,849 1524,655 1527,577 1530,332 1533,233 1536,204 1539,488 1542,571 1545,725 1549,200 1552,430 1555,332 1558,484 1561,201 1564,285 1567,001 1569,870 1572,758 1575,491 1578,512 1581,547 1584,405 48 43 37 42 57 54 59 62 67 58 71 59 77 82 82 64 71 88 77 79 72 73 63 49 50 3 25 0 0 0 0 1513,394 1516,517 1519,536 1522,082 1525,428 1527,963 1531,288 1534,102 1536,659 1539,757 1542,707 1545,627 1548,389 1551,459 1554,406 1557,986 1560,667 1564,103 1567,036 1570,144 1573,189 1575,888 1579,185 1582,323 1585,338 35 36 32 37 57 58 61 64 75 73 70 62 61 62 59 51 52 64 58 62 70 70 64 54 55 4 25 0 0 0 0 1512,658 1515,752 1518,797 1521,707 1524,744 1527,627 1530,871 1534,002 1537,086 1540,320 1543,217 1546,010 1548,660 1551,385 1554,253 1557,074 1560,193 1563,116 1566,043 1568,963 1571,855 1574,957 1577,954 1581,128 1584,273 43 42 39 40 56 50 56 62 65 54 59 62 75 79 73 63 67 77 73 75 68 62 54 51 51 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN'.encode())
c.sendall('1685 2020/03/04 14:42:05 318303 4 1 25 0 0 0 0 1513,094 1516,156 1519,154 1521,969 1525,029 1527,813 1530,921 1533,869 1536,740 1539,943 1542,921 1545,879 1548,843 1551,849 1554,760 1557,943 1560,782 1563,931 1566,786 1569,751 1572,690 1575,535 1578,638 1581,755 1584,759 41 39 33 39 48 44 49 55 61 58 64 55 68 74 68 59 57 74 61 68 58 64 54 47 46 2 25 0 0 0 0 1512,963 1515,935 1518,857 1521,849 1524,655 1527,577 1530,332 1533,233 1536,204 1539,488 1542,571 1545,725 1549,200 1552,430 1555,332 1558,484 1561,201 1564,285 1567,001 1569,870 1572,758 1575,491 1578,512 1581,547 1584,405 48 43 37 42 57 54 59 62 67 58 71 59 77 82 82 64 71 88 77 79 72 73 63 49 50 3 25 0 0 0 0 1513,394 1516,517 1519,536 1522,082 1525,428 1527,963 1531,288 1534,102 1536,659 1539,757 1542,707 1545,627 1548,389 1551,459 1554,406 1557,986 1560,667 1564,103 1567,036 1570,144 1573,189 1575,888 1579,185 1582,323 1585,338 35 36 32 37 57 58 61 64 75 73 70 62 61 62 59 51 52 64 58 62 70 70 64 54 55 4 25 0 0 0 0 1512,658 1515,752 1518,797 1521,707 1524,744 1527,627 1530,871 1534,002 1537,086 1540,320 1543,217 1546,010 1548,660 1551,385 1554,253 1557,074 1560,193 1563,116 1566,043 1568,963 1571,855 1574,957 1577,954 1581,128 1584,273 43 42 39 40 56 50 56 62 65 54 59 62 75 79 73 63 67 77 73 75 68 62 54 51 51 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN'.encode())
c.sendall('1685 2020/03/05 14:42:05 318411 4 1 25 0 0 0 0 1513,094 1516,156 1519,154 1521,969 1525,029 1527,813 1530,921 1533,869 1536,740 1539,943 1542,921 1545,879 1548,843 1551,849 1554,760 1557,943 1560,782 1563,931 1566,786 1569,751 1572,690 1575,535 1578,638 1581,755 1584,759 41 39 33 39 48 44 49 55 61 58 64 55 68 74 68 59 57 74 61 68 58 64 54 47 46 2 25 0 0 0 0 1512,963 1515,935 1518,857 1521,849 1524,655 1527,577 1530,332 1533,233 1536,204 1539,488 1542,571 1545,725 1549,200 1552,430 1555,332 1558,484 1561,201 1564,285 1567,001 1569,870 1572,758 1575,491 1578,512 1581,547 1584,405 48 43 37 42 57 54 59 62 67 58 71 59 77 82 82 64 71 88 77 79 72 73 63 49 50 3 25 0 0 0 0 1513,394 1516,517 1519,536 1522,082 1525,428 1527,963 1531,288 1534,102 1536,659 1539,757 1542,707 1545,627 1548,389 1551,459 1554,406 1557,986 1560,667 1564,103 1567,036 1570,144 1573,189 1575,888 1579,185 1582,323 1585,338 35 36 32 37 57 58 61 64 75 73 70 62 61 62 59 51 52 64 58 62 70 70 64 54 55 4 25 0 0 0 0 1512,658 1515,752 1518,797 1521,707 1524,744 1527,627 1530,871 1534,002 1537,086 1540,320 1543,217 1546,010 1548,660 1551,385 1554,253 1557,074 1560,193 1563,116 1566,043 1568,963 1571,855 1574,957 1577,954 1581,128 1584,273 43 42 39 40 56 50 56 62 65 54 59 62 75 79 73 63 67 77 73 75 68 62 54 51 51 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN'.encode())
c.close()
if __name__ == '__main__':
Main()
But my second thread is not running. Can someone point out where I am going wrong? Please help!

Tkinter is single-threaded, so Thread is of little to no use here.
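One more thing worth checking in the client code: queue.Queue.empty is a method, so "not self.buff_data.empty" (without parentheses) is always False and the body of the while loop never runs. Even when called, empty() can return True before the producer has put anything in the queue. Below is a minimal sketch of a consumer that blocks on get() and stops on a sentinel instead. The names producer and consumer and the None sentinel are assumptions for illustration, not part of the original code, and the socket and GUI parts are left out.

```python
import queue
import threading

def producer(q, blocks):
    """Stand-in for recvData: push each received block, then a sentinel."""
    for block in blocks:
        q.put(block)
    q.put(None)  # sentinel: tells the consumer that no more data will arrive

def consumer(q, results):
    """Stand-in for calculate_threshold: block on get() until the sentinel."""
    while True:
        block = q.get()  # waits for data instead of polling empty()
        if block is None:
            break
        results.append(block.decode("utf-8").replace(",", "."))

q = queue.Queue()
results = []

# A bound method is always truthy, so "not q.empty" is always False.
print(not q.empty)    # False, regardless of the queue's contents
print(not q.empty())  # False only when the queue has items

t1 = threading.Thread(target=producer, args=(q, [b"1513,094", b"1516,156"]))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
```

With this pattern the consumer thread cannot exit early just because it started before the first block arrived.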

Appending multiple rows in df2 to df1 based on datetime

I have 2 data frames, df1 and df2, both have the same format.
For example, df1 looks like this:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
df2 may look like this; the only difference is the start date:
Date A B C D E
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
I want to append all the rows from df2 to df1, in this case, all the rows from 2018-03-06 so that df1 becomes:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
Note: df2 may skip 2018-03-06, so all rows from 2018-03-07 will be copied and appended if that's the case.
My dtype for df['Date'] is datetime64. I got an error when I tried to index the last_date of df1 to find the next_date to copy from df2.
>>>> last_date = df1['Date'].tail(1)
>>>> next_date = datetime.datetime(last_date) + datetime.timedelta(days=1)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
Alternatively, how would you copy all the rows in df2 (starting from the date after the last date of df1) and append them to df1? Thanks.

Option 1
Use combine_first on the Date column:
i = df1.set_index('Date')
j = df2[df2.Date.gt(df1.Date.max())].set_index('Date')
i.combine_first(j).reset_index()
Date A B C D E
0 2018-03-01 1.0 40.0 30.0 30.0 70.0
1 2018-03-02 3.0 60.0 70.0 50.0 55.0
2 2018-03-03 4.0 60.0 70.0 45.0 80.0
3 2018-03-04 5.0 80.0 90.0 30.0 47.0
4 2018-03-05 3.0 40.0 40.0 37.0 20.0
5 2018-03-06 7.0 55.0 26.0 46.0 42.0
6 2018-03-07 2.0 73.0 46.0 33.0 25.0
Option 2
concat + groupby
pd.concat([i, j]).groupby('Date').first().reset_index()
Date A B C D E
0 2018-03-01 1 40 30 30 70
1 2018-03-02 3 60 70 50 55
2 2018-03-03 4 60 70 45 80
3 2018-03-04 5 80 90 30 47
4 2018-03-05 3 40 40 37 20
5 2018-03-06 7 55 26 46 42
6 2018-03-07 2 73 46 33 25
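The TypeError in the question comes from passing a pandas object to datetime.datetime(); df1['Date'].tail(1) is a one-element Series, not a date. A hedged alternative sketch is to take the last date as a scalar with max() and filter df2 on it directly, which also covers the case where df2 skips a day. The two small frames below are shortened stand-ins for df1 and df2 with a single value column.

```python
import pandas as pd

# Shortened stand-ins for the question's frames (only column A is kept).
df1 = pd.DataFrame({"Date": pd.to_datetime(["2018-03-04", "2018-03-05"]), "A": [5, 3]})
df2 = pd.DataFrame({"Date": pd.to_datetime(["2018-03-05", "2018-03-07"]), "A": [3, 2]})

# max() returns a scalar Timestamp, so no conversion to datetime is needed.
last_date = df1["Date"].max()

# Keep only the rows of df2 that are strictly later than df1's last date.
new_rows = df2[df2["Date"] > last_date]

result = pd.concat([df1, new_rows], ignore_index=True)
print(result)
```

Unlike combine_first, concat keeps the integer dtype of the value columns, because no NaN alignment takes place.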

Deleting DataFrame row in Pandas based on column value

I have the following DataFrame:
daysago line_race rating rw wrating
line_date
2007-03-31 62 11 56 1.000000 56.000000
2007-03-10 83 11 67 1.000000 67.000000
2007-02-10 111 9 66 1.000000 66.000000
2007-01-13 139 10 83 0.880678 73.096278
2006-12-23 160 10 88 0.793033 69.786942
2006-11-09 204 9 52 0.636655 33.106077
2006-10-22 222 8 66 0.581946 38.408408
2006-09-29 245 9 70 0.518825 36.317752
2006-09-16 258 11 68 0.486226 33.063381
2006-08-30 275 8 72 0.446667 32.160051
2006-02-11 475 5 65 0.164591 10.698423
2006-01-13 504 0 70 0.142409 9.968634
2006-01-02 515 0 64 0.134800 8.627219
2005-12-06 542 0 70 0.117803 8.246238
2005-11-29 549 0 70 0.113758 7.963072
2005-11-22 556 0 -1 0.109852 -0.109852
2005-11-01 577 0 -1 0.098919 -0.098919
2005-10-20 589 0 -1 0.093168 -0.093168
2005-09-27 612 0 -1 0.083063 -0.083063
2005-09-07 632 0 -1 0.075171 -0.075171
2005-06-12 719 0 69 0.048690 3.359623
2005-05-29 733 0 -1 0.045404 -0.045404
2005-05-02 760 0 -1 0.039679 -0.039679
2005-04-02 790 0 -1 0.034160 -0.034160
2005-03-13 810 0 -1 0.030915 -0.030915
2004-11-09 934 0 -1 0.016647 -0.016647
I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?

If I'm understanding correctly, it should be as simple as:
df = df[df.line_race != 0]

For future passers-by, note that the pattern df = df[df.line_race != 0] doesn't do anything when you try to use it to filter out None/missing values.
Does work:
df = df[df.line_race != 0]
Doesn't do anything:
df = df[df.line_race != None]
Does work:
df = df[df.line_race.notnull()]
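The difference between the three filters can be checked on a tiny frame. This is a sketch with made-up values; only the column name line_race comes from the question.

```python
import numpy as np
import pandas as pd

# Made-up sample: one zero and one missing value.
df = pd.DataFrame({"line_race": [11.0, 0.0, np.nan, 9.0]})

kept_ne_zero = df[df.line_race != 0]       # drops the 0 row, keeps the NaN row
kept_ne_none = df[df.line_race != None]    # comparison with None is True everywhere
kept_notnull = df[df.line_race.notnull()]  # drops only the NaN row

print(len(kept_ne_zero), len(kept_ne_none), len(kept_notnull))
```

The comparison against None keeps every row because a NaN cell never compares equal to None, which is why notnull() (or its alias notna()) is needed for missing values.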

Just to add another solution, particularly useful if you are using the new pandas accessors: the other solutions replace the original DataFrame object and lose the accessors.
df.drop(df.loc[df['line_race']==0].index, inplace=True)

If you want to delete rows based on multiple values of the column, you could use:
df[(df.line_race != 0) & (df.line_race != 10)]
This drops all rows with values 0 and 10 for line_race.

In case of multiple values and str dtype
I used the following to filter out given values in a col:
def filter_rows_by_values(df, col, values):
    return df[~df[col].isin(values)]
Example:
In a DataFrame I want to remove rows which have values "b" and "c" in column "str"
df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
str other
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 c 7
filter_rows_by_values(df, "str", ["b","c"])
str other
0 a 1
1 a 2
2 a 3
3 a 4

Though the previous answers are similar to what I am going to do, using the index attribute does not require another indexing method, .loc. It can be done in a similar but more concise manner as:
df.drop(df.index[df['line_race'] == 0], inplace = True)

The best way to do this is with boolean masking:
In [56]: df
Out[56]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
11 2006-01-13 504 0 70 0.142 9.969
12 2006-01-02 515 0 64 0.135 8.627
13 2005-12-06 542 0 70 0.118 8.246
14 2005-11-29 549 0 70 0.114 7.963
15 2005-11-22 556 0 -1 0.110 -0.110
16 2005-11-01 577 0 -1 0.099 -0.099
17 2005-10-20 589 0 -1 0.093 -0.093
18 2005-09-27 612 0 -1 0.083 -0.083
19 2005-09-07 632 0 -1 0.075 -0.075
20 2005-06-12 719 0 69 0.049 3.360
21 2005-05-29 733 0 -1 0.045 -0.045
22 2005-05-02 760 0 -1 0.040 -0.040
23 2005-04-02 790 0 -1 0.034 -0.034
24 2005-03-13 810 0 -1 0.031 -0.031
25 2004-11-09 934 0 -1 0.017 -0.017
In [57]: df[df.line_race != 0]
Out[57]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').

The given answer is correct; nonetheless, as someone above said, you can use df.query('line_race != 0'), which, depending on your problem, can be much faster. Highly recommended.

Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code in the other answers, but it is still an alternative way of doing the same thing.
df = df.drop(df[df['line_race']==0].index)

One efficient and idiomatic pandas way is to use the eq() method:
df[~df.line_race.eq(0)]

I compiled and ran my code, and it works. You can try it yourself.
import pandas as pd
data = pd.read_excel('file.xlsx')
If you have any special character or space in the column name, you can write it in quotes, as in the given code:
data = data[data['expire/t'].notnull()]
print(data)
If the column name is a single string without any spaces or special
characters, you can access it directly:
data = data[data.expire != 0]
print(data)

Adding one more way to do this.
df = df.query("line_race!=0")

There are various ways to achieve that. Below are several options that one can use, depending on the specifics of one's use case.
Assume that OP's dataframe is stored in the variable df.
Option 1
For OP's case, considering that the only column with values equal to 0 is line_race, the following will do the work
df_new = df[df != 0].dropna()
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.
Option 2
tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:
For example, using the position in the dataframe
df_new = df[df[df.columns[2]] != 0]
Or by explicitly indicating the column as follows
df_new = df[df['line_race'] != 0]
One can also follow the same logic but using a custom lambda function, such as
df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 3
Using pandas.Series.map and a custom lambda function
df_new = df[df['line_race'].map(lambda x: x != 0)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 4
Using pandas.DataFrame.drop as follows
df_new = df.drop(df[df['line_race'] == 0].index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 5
Using pandas.DataFrame.query as follows
df_new = df.query('line_race != 0')
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 6
Using pandas.DataFrame.drop and pandas.DataFrame.query as follows
df_new = df.drop(df.query('line_race == 0').index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 7
If one doesn't have strong opinions on the output, one can use a vectorized approach with numpy.select
df_new = np.select([df != 0], [df], default=np.nan)
[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
['2007-03-10' 83 11.0 67 1.0 67.0]
['2007-02-10' 111 9.0 66 1.0 66.0]
['2007-01-13' 139 10.0 83 0.880678 73.096278]
['2006-12-23' 160 10.0 88 0.793033 69.786942]
['2006-11-09' 204 9.0 52 0.636655 33.106077]
['2006-10-22' 222 8.0 66 0.581946 38.408408]
['2006-09-29' 245 9.0 70 0.518825 36.317752]
['2006-09-16' 258 11.0 68 0.486226 33.063381]
['2006-08-30' 275 8.0 72 0.446667 32.160051]
['2006-02-11' 475 5.0 65 0.164591 10.698423]]
This can also be converted to a dataframe with
df_new = pd.DataFrame(df_new, columns=df.columns)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.0 56.0
1 2007-03-10 83 11.0 67 1.0 67.0
2 2007-02-10 111 9.0 66 1.0 66.0
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
With regards to the most efficient solution, that would depend on how one wants to measure efficiency. Assuming that one wants to measure the time of execution, one way that one can go about doing it is with time.perf_counter().
If one measures the time of execution for all the options above, one gets the following
method time
0 Option 1 0.00000110000837594271
1 Option 2.1 0.00000139995245262980
2 Option 2.2 0.00000369996996596456
3 Option 2.3 0.00000160001218318939
4 Option 3 0.00000110000837594271
5 Option 4 0.00000120000913739204
6 Option 5 0.00000140001066029072
7 Option 6 0.00000159995397552848
8 Option 7 0.00000150001142174006
However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
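For reference, a timing comparison along these lines could be set up as follows. This is only a sketch: the synthetic frame, the helper name time_call, and the choice of taking the best of several repeats are assumptions, and the numbers it prints will differ from the table above.

```python
import time
import pandas as pd

# Synthetic frame with many zeros in line_race (assumption, not OP's data).
df = pd.DataFrame({"line_race": [11, 0, 9, 0, 10] * 1000})

def time_call(func, repeats=20):
    """Return the best wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

timings = {
    "boolean mask": time_call(lambda: df[df["line_race"] != 0]),
    "query": time_call(lambda: df.query("line_race != 0")),
    "drop": time_call(lambda: df.drop(df[df["line_race"] == 0].index)),
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {seconds:.6f} s")
```

Taking the minimum over repeats reduces noise from other processes; for more careful measurements the standard library's timeit module is another option.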
Notes:
There are various suggestions on using inplace=True. I would suggest reading this: https://stackoverflow.com/a/59242208/7109869
There are also some people with strong opinions on .apply(). I would suggest reading this: When should I (not) want to use pandas apply() in my code?
If one has missing values, one might also want to consider pandas.DataFrame.dropna. Using Option 2, it would be something like
df = df[df['line_race'] != 0].dropna()
There are additional ways to measure the time of execution, so I would recommend this thread: How do I get time of a Python program's execution?

Just adding another way, which applies the filter across all columns of the DataFrame:
for column in df.columns:
    df = df[df[column] != 0]
Example:
def z_score(data, count):
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count

Just in case you need to delete the row but the value can be in different columns.
In my case I was using percentages, so I wanted to delete the rows that have a value of 1 in any column, since that means 100%.
for x in df:
    df.drop(df.loc[df[x]==1].index, inplace=True)
This is not optimal if your df has too many columns.

So many options have been provided (or maybe I didn't pay enough attention, sorry if that's the case), but no one mentioned this:
we can use the ~ notation in pandas, which gives us the inverse of the condition. Note that the condition must be wrapped in parentheses, because ~ binds more tightly than ==:
df = df[~(df["line_race"] == 0)]
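Operator precedence matters when combining ~ with a comparison: ~ is applied before ==, so without parentheses the integers in the column are bitwise-inverted first and the result is then compared with 0. A small sketch with made-up values shows the difference:

```python
import pandas as pd

# Made-up values; -1 is included because ~(-1) == 0.
df = pd.DataFrame({"line_race": [11, 0, 9, -1]})

# Parsed as (~df["line_race"]) == 0: keeps only rows where line_race is -1.
without_parens = df[~df["line_race"] == 0]

# Inverts the boolean mask: keeps rows where line_race is not 0.
with_parens = df[~(df["line_race"] == 0)]

print(without_parens["line_race"].tolist())
print(with_parens["line_race"].tolist())
```

Only the parenthesised form removes the rows where line_race equals 0.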

It doesn't make much difference for a simple example like this, but for complicated logic, I prefer to use drop() when deleting rows because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3).
Here's a scalable syntax that is easy to understand and can handle complicated logic:
df = df.drop(df.query("`line_race` == 0").index)
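The compound condition mentioned above, A=1 AND (B=2 OR C=3), can be written in the same style. The frame below is made up for illustration; only the column names A, B and C come from the text.

```python
import pandas as pd

# Made-up data: rows 0 and 1 satisfy A == 1 and (B == 2 or C == 3).
df = pd.DataFrame({
    "A": [1, 1, 2, 1],
    "B": [2, 0, 2, 5],
    "C": [0, 3, 3, 7],
})

# Select the rows to delete with the positive condition, then drop them by index.
to_drop = df.query("A == 1 and (B == 2 or C == 3)").index
result = df.drop(to_drop)
print(result)
```

The condition reads the same way as the requirement, with no need to negate it by hand.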

You can try using this:
df.drop(df[df.line_race == 0].index, inplace=True)
