I'm trying to add each value from one column ('smoking') to the corresponding value in another column ('sex') and put the result in a new column called 'something'. The dataset is a DataFrame called 'data', and the 'smoking' and 'sex' columns are both int64.
The column 'smoking' contains 1 or 0: 1 means the person smokes and 0 means the person doesn't. The column 'sex' also contains 0 and 1: 0 for female and 1 for male.
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data
The problem is that the column 'something' contains only the number 2.0: even in a row where 'smoking' is 0 and 'sex' is 1, the sum in 'something' is 2.0.
I am not understanding the error.
I'm using python 3.9.2
The dataset is in this link of kaggle: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
I see @Vishnudev just posted the solution in a comment, but allow me to explain what is going wrong:
The issue here is that the result ends up stored as float: assigning the new column one row at a time leaves the not-yet-assigned rows as NaN, which forces the whole column to a float dtype. There are two solutions:
With the loop, casting the result column back to int afterwards:
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data['something'] = data['something'].astype(int)
data
Without the loop (as @Vishnudev suggested):
data['something'] = data['smoking'] + data['sex']
data
You need not iterate over the rows at all for this; you can just use:
data['something'] = data['smoking'] + data['sex']
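To see why the looped version produces floats, here is a minimal sketch with made-up data (not the Kaggle dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data.
data = pd.DataFrame({'smoking': [1, 0, 1], 'sex': [0, 1, 1]})

# Building the new column one row at a time: after the first
# assignment the remaining rows are NaN, so pandas makes the
# whole column float64.
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
print(data['something'].dtype)  # float64

# The vectorized addition creates the column in one step and
# keeps the int64 dtype.
data['something'] = data['smoking'] + data['sex']
print(data['something'].dtype)  # int64
```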
Here is what I'm trying to achieve:
Result, Process/logical flow
Here is a sample of the first dataset:
list of symbols
Here is a sample of the second dataset that I am using as a reference to group first dataset symbols: reference for grouping
And here is my code:
stockN = pd.DataFrame(numstocks)
ticker = pd.DataFrame(ticks)
sorts = pd.DataFrame(columns=['Symbols'])
for x in range(len(stockN)):
    if int(stockN[0][x]) < 10:
        sorts.loc[x] = str(ticker[0][:x])
    if int(stockN[0][x]) > 10:
        sorts.loc[x] = str(ticker[0][x:x+10])
And my output is:
0 Series([], Name: 0, dtype: object)
1 0 GRA\nName: 0, dtype: object
2 2 RL\n3 UNVR\n4 EPC\n5 OI\n6 LEA\nName: 0, dtype: object
3 0 GRA\n1 WRK\n2 RL\nName: 0, dtype: object
4 0 GRA\n1 WRK\n2 RL\n3 UNVR\nName: 0, dtype: object
So clearly, passing things through str() is already creating problems, but if I don't, the values are filled with NaN.
So why am I accessing more than just the contents I indicated?
My next issue is the slicing: as you can tell, the logic there is a disaster, but since I can't access the stockN number, I can't add it to my x variable. I assumed that I could build my data frame row by row, filling each row with the ticker symbols from ticker[x:x+y], where y = stockN (the quantity). That value would then be used for the next iteration, and so forth.
Edit: Forgot to mention that the ticker symbols per row are a max of 10, so if my stockN number is 27 for example, I only want the next 10, not 27. That's why the if/else and the x+10 slice.
Please let me know if you can help me. If you have a better way of going about this that would be very much appreciated too.
I have figured this out! Hopefully this helps anyone else looking for a similar solution, even though this is a relatively basic question.
First of all, while my nStocks values from the previous post displayed as numbers, they were actually strings. So I fixed that by accessing the data frame at [0], converting the values back to a list, and then turning them into ints with a list comprehension.
sCount = stockN[0].values.tolist()
sCount = [int(i) for i in sCount]
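As an aside, pandas can do both steps in one vectorized call with pd.to_numeric; a sketch with hypothetical values standing in for stockN:

```python
import pandas as pd

# Hypothetical stand-in for stockN's first column of numeric strings.
stockN = pd.DataFrame([['3'], ['27'], ['10']])

# pd.to_numeric converts the string column in one step;
# .tolist() then yields plain Python ints.
sCount = pd.to_numeric(stockN[0]).tolist()
print(sCount)  # [3, 27, 10]
```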
After that everything was much easier, and instead of creating a messy/complex slicing method I just dropped the rows as I accessed them (duh).
symbol = []
for x in range(len(sCount)):
    if sCount[x] <= 10:
        symbol.append(ticker[0][:sCount[x]])
        ticker.drop(ticker.index[:sCount[x]], inplace=True)
    else:
        symbol.append(ticker[0][:10])
        ticker.drop(ticker.index[:10], inplace=True)
This is from an exercise in DataQuest. Dataset can be found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file
I have a function that computes, for each Jeopardy row, the fraction of words in the answer that also appear in the question.
def AnsFromQ(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(AnsFromQ, axis=1)
My question pertains to lines 2 and 3, which use row['clean_answer']. How does Python know that I want to refer to a cell (the intersection of a row and a column) without using something like jeopardy.loc[row, ['clean_answer']]? I could have used any variable name for row.
This code works; I just don't know why. If I use the version with loc, Jupyter gives me a warning saying I need to use reindex().
When you use jeopardy.apply(AnsFromQ, axis=1), you are applying the AnsFromQ function to each row of the DataFrame. The row variable gives you a slice of the DataFrame representing the current row. In order to use jeopardy.loc[row, ['clean_answer']], row would need to be a value from jeopardy's Index.
In other words, row is a pd.Series that gives you a view of each row. Indexing into it with row['clean_answer'] gives you the cell at that column label in the current row.
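To make that concrete, here is a minimal sketch (with made-up rows, not the real Jeopardy data) showing that apply(axis=1) hands the function one pd.Series per row, indexed by the column names:

```python
import pandas as pd

# Hypothetical stand-in for the jeopardy DataFrame.
jeopardy = pd.DataFrame({
    'clean_question': ['who painted the mona lisa', 'capital of france'],
    'clean_answer':   ['the mona lisa', 'paris'],
})

def AnsFromQ(row):
    # row is a pd.Series whose index labels are the column names,
    # so row['clean_answer'] is ordinary label-based Series indexing.
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    if 'the' in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(word in split_question for word in split_answer)
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(AnsFromQ, axis=1)
print(jeopardy['answer_in_question'].tolist())  # [1.0, 0.0]
```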
I am outputting items from a dataframe to a csv. The rows, however, are too long. I need the csv to contain line breaks (\n) every X items (columns) so that individual rows in the output aren't too long. Is there a way to do this?
A,B,C,D,E,F,G,H,I,J,K
Becomes in the file (X=3) -
A,B,C
D,E,F
G,H,I
J,K
EDIT:
I have a 95% solution (assuming you have only 1 column):
size = 50
indexes = np.arange(0, len(data), size)  # have to use numpy since range is now an immutable type in python 3
indexes = np.append(indexes, [len(data)])  # add the uneven final index
i = 0
while i < len(indexes) - 1:
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i+1]]).T
    holder.to_csv(filename, index=False, header=False)
    i += 1
The only weirdness is that, despite not throwing any errors, the final iteration of the while loop (the one with the uneven final index) does not write to the file, even though the information is in holder perfectly. Since no errors are thrown, I cannot figure out why the final information is not being written.
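One thing worth checking in a loop like that: DataFrame.to_csv opens the file in write mode by default, so each call replaces what the previous call wrote. Opening in append mode (mode='a') keeps every chunk. A sketch with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical one-column data standing in for the real Series.
data = pd.Series(list("ABCDEFGH"))
size = 3
filename = "out.csv"

# Truncate the file once, then append each chunk with mode='a'
# so earlier chunks are not overwritten by later to_csv calls.
open(filename, "w").close()
for start in range(0, len(data), size):
    chunk = data.iloc[start:start + size]
    pd.DataFrame(chunk).T.to_csv(filename, mode="a", index=False, header=False)

print(open(filename).read())
# A,B,C
# D,E,F
# G,H
```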
Assuming you have a number of values that is a multiple of 3 (note how I added L):
s = pd.Series(["A","B","C","D","E","F","G","H","I","J","K","L"])
df = pd.DataFrame(s.values.reshape((-1, 3)))
You can then write df to CSV.
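A quick end-to-end sketch of that approach, reshaping through .values (a pandas Series itself has no reshape method in current versions):

```python
import pandas as pd

# Twelve values reshape evenly into four rows of three.
s = pd.Series(["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"])
df = pd.DataFrame(s.values.reshape((-1, 3)))

# Writing without index or header gives the desired layout.
csv_text = df.to_csv(index=False, header=False)
print(csv_text)
# A,B,C
# D,E,F
# G,H,I
# J,K,L
```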
I have an array containing data for three different indicators (X-Z) in five different categories (A-E).
Now I want to check every row of the dataset for a 0. If a row contains a 0, I want to delete all rows of that indicator type.
In my minimum example it should find the zero in one of the Y rows and consequently delete all Y rows.
AA =(['0','A','B','C','D','E'],
['X','2','3','3','3','4'],
['Y','3','4','9','7','3'],
['Z','3','4','6','3','4'],
['X','2','3','3','3','4'],
['Y','3','4','8','7','0'],
['Z','3','4','6','3','4'],
['X','2','5','3','3','4'],
['Y','3','4','0','7','3'],
['Z','3','4','6','3','4'])
My code is the following:
import numpy as np
sequence = 3  # number of columns per sequence X,Y,Z
AA = np.array(AA)
for i in range(1, AA.shape[0]):
    for j in range(1, AA.shape[1]):
        if j == 0.0:
            for k in range(np.min((j-1)/sequence, 1), AA.shape[0], sequence):
                np.delete(AA, k, 0)
and should give me:
AA =(['0','A','B','C','D','E'],
['X','2','3','3','3','4'],
['Z','3','4','6','3','4'],
['X','2','3','3','3','4'],
['Z','3','4','6','3','4'],
['X','2','5','3','3','4'],
['Z','3','4','6','3','4'])
But somehow my code does not delete anything. So I guess I have a problem with the delete function, but I can't figure out what exactly the problem is.
EDIT:
In my real data the indicators (X-Z) don't all have exactly the same name, but rather names like 'asdf - X' or 'qwer - Y - asdf'. The label part after the first '-' separator is always identical.
So I cannot use a set() on the full names directly; I either need to extract the label or select the rows to delete by their distance from the row where the 0 was detected.
I would do it in two passes. It is a lot cleaner, and it might even be faster under some circumstances. Here's an implementation without numpy; feel free to convert it to use array().
AA =(['0','A','B','C','D','E'],
['X','2','3','3','3','4'],
['Y','3','4','9','7','3'],
['Z','3','4','6','3','4'],
['X','2','3','3','3','4'],
['Y','3','4','8','7','0'],
['Z','3','4','6','3','4'],
['X','2','5','3','3','4'],
['Y','3','4','0','7','3'],
['Z','3','4','6','3','4'])
todrop = set(row[0] for row in AA[1:] if '0' in row)
filtered = list(row for row in AA[1:] if row[0] not in todrop)
Since row[0] does not contain the exact indicator label, write a simple function that will extract the label and use that instead of the entire row[0]. Details depend on what your data actually looks like.
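For example, if the identical part is the second '-'-separated field (as in 'asdf - X' and 'qwer - Y - asdf'), a minimal extractor might look like this; the function name and exact format are assumptions, so adapt them to your real data:

```python
def label(row_id):
    # Hypothetical extractor: the indicator is the second
    # '-'-separated field when a '-' is present, otherwise
    # the ID itself (as in the plain 'X'/'Y'/'Z' sample).
    parts = row_id.split('-')
    return (parts[1] if len(parts) > 1 else parts[0]).strip()

print(label('asdf - X'))         # X
print(label('qwer - Y - asdf'))  # Y
print(label('X'))                # X

# Plugged into the two-pass filter:
# todrop = set(label(row[0]) for row in AA[1:] if '0' in row)
# filtered = [row for row in AA[1:] if label(row[0]) not in todrop]
```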
Option 2: In case you really want to do it by counting the rows (which I don't recommend): Save the row numbers modulo 3, instead of the row ID. It's about the same amount of work:
relabeled = list((n % 3, row) for n, row in enumerate(AA[1:]))
todrop = set(n for n, row in relabeled if '0' in row) # Will save {1} for Y
filtered = list(row for n, row in relabeled if n not in todrop)
You are trying to delete from an array while looping over it, and that will not work here: np.delete does not modify the array in place, it returns a new array, and your code discards that return value.
Instead of deleting from the current matrix, build another one containing only the values you want to keep, and then assign it back.
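One possible way to build that new matrix for the sample data above: collect the labels whose rows contain a '0' in a first pass, then keep only the clean rows (plus the header) in a second pass.

```python
import numpy as np

# Sample array from the question.
AA = np.array([['0', 'A', 'B', 'C', 'D', 'E'],
               ['X', '2', '3', '3', '3', '4'],
               ['Y', '3', '4', '9', '7', '3'],
               ['Z', '3', '4', '6', '3', '4'],
               ['X', '2', '3', '3', '3', '4'],
               ['Y', '3', '4', '8', '7', '0'],
               ['Z', '3', '4', '6', '3', '4'],
               ['X', '2', '5', '3', '3', '4'],
               ['Y', '3', '4', '0', '7', '3'],
               ['Z', '3', '4', '6', '3', '4']])

# Pass 1: labels of rows that contain a '0' anywhere in their data.
bad = {row[0] for row in AA[1:] if '0' in row[1:]}

# Pass 2: rebuild the array with the header and the clean rows only.
result = np.array([AA[0]] + [row for row in AA[1:] if row[0] not in bad])
print(result[:, 0])  # ['0' 'X' 'Z' 'X' 'Z' 'X' 'Z']
```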