I have two dataframes: df1 and df2. I am iterating through df1 using iterrows, and for a particular field in each row, I look in df2 for the row that matches that field and try to pull out the corresponding value from that row in df2 as a SCALAR. Every way I try to do this I end up with another DataFrame or Series and I can't use the value as a scalar. Here is my latest attempt:
for index, row in df1.iterrows():
    a = row[0]
    b = df2.loc[df2['name'] == a, 'weight']
    c = row[1] - b  # this is where the error happens
    df1.set_value(index, 'wtdif', c)
I get an error because 'b' in this case is not a scalar. If I print it out, here is an example of what it looks like. The '24' here is the index of the row it was found in within df2. The other confusing part is that I can't index 'b' in any way even though it is a Series (i.e. b[0] raises an error, as does b['weight'], etc.).
24    141.5
Name: weight, dtype: float64
You're getting an error because the only index in b is 24. You could use that, or (more easily) index by location using:
b.iloc[0]
This is a common gotcha for new Pandas users. Indices are preserved when pulling data out of a Series or DataFrame. They do not, in general, run from 0 -> N-1 where N is the length of the Series or the number of rows in the DataFrame.
This will help a bit http://pandas.pydata.org/pandas-docs/stable/indexing.html although I admit it was confusing for me at first as well.
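For illustration, here is a minimal sketch (made-up names, weights, and index labels, not your actual data) showing how the index carries over from df2 and how positional versus label access differ:

import pandas as pd

# index labels 23/24 are made up to mimic the situation in the question
df2 = pd.DataFrame({'name': ['alex', 'bob'], 'weight': [140.0, 141.5]}, index=[23, 24])

b = df2.loc[df2['name'] == 'bob', 'weight']
print(b)          # the Series keeps df2's index, here 24
print(b.iloc[0])  # 141.5 -- positional access, works whatever the index labels are
print(b.loc[24])  # 141.5 -- label access, needs the preserved label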
Welp, I am still getting "IndexError: single positional indexer is out-of-bounds" when I make that change to my code.
Your suggestion makes a lot of sense though and does work, thanks for posting that. I wrote a quick test script to verify the fix, and it did in fact work so thumbs up for that. I'll post that code here in case anyone else is ever curious.
I'm missing something here, I'll just have to keep working on what is wrong and what my next question should be...
import pandas as pd
import numpy as np

def foo(df1, df2):
    df1['D'] = 0
    for index, row in df1.iterrows():
        name = row[2]  # 'name' ends up at position 2 (the third column) because older pandas sorted the dict keys alphabetically: 'A', 'B', 'name', 'weight'
        temp = df2.loc[(df2['name'] == name), 'weight']
        x = row[3] + temp.iloc[0]
        df1.set_value(index, 'D', x)
    print(df1)

df1 = pd.DataFrame({'name' : ['alex','bob', 'chris'], 'weight' : [140,150,160], 'A' : ['1','2','3'], 'B' : ['4','5','6']})
df2 = pd.DataFrame({'name' : ['alex','bob', 'chris'], 'weight' : [180,190,200], 'C' : ['1','2','3'], 'D' : ['4','5','6']})
print(df1)
print(df2)
foo(df1, df2)
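As a side note, the same sort of wtdif calculation can usually be done without iterrows at all by merging on name. A hedged sketch using the column names from the question (not necessarily the real data):

import pandas as pd

df1 = pd.DataFrame({'name': ['alex', 'bob', 'chris'], 'weight': [140, 150, 160]})
df2 = pd.DataFrame({'name': ['alex', 'bob', 'chris'], 'weight': [180, 190, 200]})

# Align df2's weight to df1 by 'name', then subtract in one vectorised step
merged = df1.merge(df2[['name', 'weight']], on='name', suffixes=('', '_df2'))
merged['wtdif'] = merged['weight'] - merged['weight_df2']
print(merged)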
Related
I have this Excel formula in A2: =IF(B1=1;CONCAT(D1:Z1);"Null")
All cells are string or integer but some are empty. I filled the empty ones with "null".
I've tried to translate it with pandas, and so far I have written this:
import pandas as pd

df = pd.read_table('C:/*path*/001.txt', sep=';', header=0, dtype=str)

rowcount = 0
for row in df:
    rowcount += 1
n = rowcount
m = len(df)

df['A'] = ""
for i in range(1, n):
    if df[i-1]["B"] == 1:
        for k in range(2, m):
            if df[i][k] != "Null":
                df[i]['A'] += df[i][k]
I can't find anything close enough to my problem in existing questions. Can anyone help?
I'm not sure this is exactly what you're expecting. If you just need to fill the empty cells in the dataframe with the string 'null', you can use this:
df.fillna('null', inplace=True)
If you provide the expected output along with your input file, it will help the contributors answer.
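A minimal sketch of what that looks like (illustrative column names, not your real file):

import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [1, np.nan, 1], 'D': ['x', np.nan, 'y']})
df.fillna('null', inplace=True)   # every missing cell becomes the string 'null'
print(df)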
Test dataframe:
df = pd.DataFrame({
    "b": [1, 0, 1],
    "c": ["dummy", "dummy", "dummy"],
    "d": ["red", "green", "blue"],
    "e": ["-a", "-b", "-c"]
})
First step: add a new column and fill with NULL.
df["concatenated"] = "NULL"
Second step: filter by where column b is 1, and then set the value of the new column to the concatenation of columns d to e (D to Z in your sheet).
sub_columns = ["d", "e"]  # the columns to concatenate
df.loc[df["b"] == 1, "concatenated"] = df[sub_columns].sum(axis=1)
df
Output:
   b      c      d   e concatenated
0  1  dummy    red  -a        red-a
1  0  dummy  green  -b         NULL
2  1  dummy   blue  -c       blue-c
EDIT: I notice there is a one-row offset in your Excel formula (A2 looks at B1 and D1:Z1). Not sure if this is deliberate, but experiment with df.shift(-1) if so.
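For what it's worth, a tiny illustration of shift on the test dataframe above; the sign you pick depends on which direction the one-row offset should go:

# shift(1) moves values down one row, shift(-1) moves them up one row;
# compare both against the spreadsheet to match B1 driving A2.
print(df["b"].shift(1))
print(df["b"].shift(-1))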
There's a lot to unpack here.
Firstly, len(df) already gives you the row count directly, so the counting loop is unnecessary. Note also that for row in df: iterates over column labels, not rows, so rowcount is actually the number of columns.
Secondly, please never use chained indexing in pandas unless you absolutely have to. There are a number of reasons not to, one of them being that it's easy to make a mistake; assignment through it can also silently fail. Here we have a default range index, so we can use df.loc[i-1, 'B'] in place of df[i - 1]["B"].
Thirdly, the dtype is str, so compare with == '1' rather than == 1.
If I understand your problem correctly, the following code should help you:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
            'B': ['1', '2', '0', '1'],
            'C': ['first', 'second', 'third', 'fourth'],
            'D': ['-1', '-2', '-3', '-4']
        })
In [3]: RELEVANT_COLUMNS = ['C', 'D'] # You can also extract them in any automatic way you please
In [4]: df["A"] = ""
In [5]: df.loc[df['B'] == '1', 'A'] = df.loc[df['B'] == '1', RELEVANT_COLUMNS].sum(axis=1)
In [6]: df
Out[6]:
   B       C   D         A
0  1   first  -1   first-1
1  2  second  -2
2  0   third  -3
3  1  fourth  -4  fourth-4
We note which columns to concatenate (In [3]); we do not want to make the mistake of adding a column later on and accidentally including it. Here, if 'A' slipped in it wouldn't hurt, because it's full of empty strings, but it's more manageable to save the columns we concatenate explicitly.
We then add the column with empty strings (In [4]) (if we skip this step, we'll get NaNs instead of empty strings for the records where B does not equal 1).
In [5] uses pandas' boolean indexing (through the Series-to-scalar equality operator) to limit our scope to the rows where column B equals '1', and then we pull out the columns to concatenate and do just that, using an axis-reducing sum operation.
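Regarding the remark in In [3] about extracting the columns automatically, one hedged option is to derive the list from the frame itself, excluding the flag column and the result column:

# Everything except the flag column 'B' and the result column 'A'
RELEVANT_COLUMNS = [col for col in df.columns if col not in ('A', 'B')]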
Say I have a dataframe as below (a representation of a much larger dataset) which has a code column along with another column (the actual dataset has many more).
import pandas as pd
df = pd.DataFrame({'code': [123456, 123758, 12334356, 4954968, 774853],
                   'col2': [1, 2, 3, 4, 5]})
Question: How can I store in a separate dataframe, and remove from the original dataframe, the entries (all columns associated with each entry) whose first 3 characters are not 123?
Attempted: I have tried to select all rows whose code starts with 123 and then use the not operator ~ to select everything which doesn't start with it. I stored this in a new dataframe, since I want it saved, and then tried dropping those rows from the original dataframe by their index since they're not wanted.
# Converting column to a string
df['code'] = df['code'].astype(str)
# Saving entries which DONT start with 123 in a separate dataframe
df2 = df[~df['code'].str[0:3] == '123']
# Dropping those bad entries (the ones NOT starting with 123) from the original dataframe
df.drop(df2.index, inplace=True)
However when I do this I come across the following error:
TypeError: bad operand type for unary ~: 'str'
Any alternate solutions along with corrections to my own would be much appreciated.
Desired output: this should generalise for additional entries too. Notice that 4954968 & 774853 are gone since they don't start with 123:
df_final = pd.DataFrame({'code': [123456, 123758, 12334356], 'col2': [1, 2, 3]})
The problem in your solution is operator precedence, so parentheses are necessary:
df2 = df[~(df['code'].str[0:3] == '123')]
print (df2)
      code  col2
3  4954968     4
4   774853     5
Better is to change the logic and select only the matching values:
df = df[(df['code'].str[0:3] == '123')]
print (df)
You can use startswith to identify the rows that you want. No need for a double negative.
import pandas as pd
df = df.loc[df['code'].str.startswith('123'), :]
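Putting the two halves of the question together (store the non-matching rows separately and keep the matching ones), a hedged sketch assuming 'code' has been converted to string first:

import pandas as pd

df = pd.DataFrame({'code': [123456, 123758, 12334356, 4954968, 774853],
                   'col2': [1, 2, 3, 4, 5]})
df['code'] = df['code'].astype(str)

mask = df['code'].str.startswith('123')
df_removed = df.loc[~mask]   # entries that do NOT start with 123, saved separately
df_final = df.loc[mask]      # entries that start with 123
print(df_final)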
I know this has been asked before but I cannot find an answer that works for me. I have a dataframe df that contains a column age, but the values are not all integers; some are strings like '35-59'. I want to drop those entries. I have tried these two solutions as suggested by kite, but they both give me AttributeError: 'Series' object has no attribute 'isnumeric'
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally, is there a simple way to edit the value of an entry if it matches a certain condition? For example, instead of deleting rows that have age as a range of values, I could replace the range with a random value within it.
Try with:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
If you check the documentation, isnumeric is a method of Series.str, not of Series itself; that's why you get that error.
You will also need the == False because the column has mixed types: .str.isnumeric() can return missing values for non-string entries, and comparing with False gives you a purely boolean mask of the rows to drop.
I'm posting this in case it also helps with your last question. You can use pandas.DataFrame.at with pandas.DataFrame.itertuples to iterate over the rows of the dataframe and replace values:
for row in df.itertuples():
    # iterate over every row and change the value of that column
    if row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
    # row.age is a scalar here, so use str.isnumeric directly rather than the .str accessor
    if not str(row.age).isnumeric() or row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
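On the follow-up about editing values instead of dropping them, here is a hedged sketch that replaces a range such as '35-59' with a random value inside that range (it assumes the age column holds strings, as in the question):

import random
import pandas as pd

df = pd.DataFrame({'age': ['25', '35-59', '41', '60-80']})

for row in df.itertuples():
    age = str(row.age)
    if not age.isnumeric() and '-' in age:           # e.g. '35-59'
        low, high = (int(x) for x in age.split('-'))
        df.at[row.Index, 'age'] = str(random.randint(low, high))

print(df)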
I have a data frame called v whose columns are ['self', 'id', 'desc', 'name', 'arch', 'rel']. When I rename it as follows, it won't let me drop columns, giving a "column not found in axis" error.
case1:
for i in range(0, len(v.columns)):
    # I'm trying to add a 'v_' prefix to all column names
    v.columns.values[i] = 'v_' + v.columns.values[i]

v.drop('v_self', 1)
#leads to error
KeyError: "['v_self'] not found in axis"
But if I do it as follows then it works fine
case2:
v.columns = ['v_self','v_id','v_desc','v_name','v_arch','v_rel']
v.drop('v_self',1)
# no error
In both cases, if I inspect the columns afterwards, I get the same result:
v.columns
# both cases give
Index(['v_self', 'v_id', 'v_desc', 'v_name', 'v_arch', 'v_rel'], dtype='object')
I can't understand why case 1 gives an error. Please help, thanks.
That's because .values returns the underlying values array, and you're not supposed to modify it directly. Assigning directly to .columns is supported, though.
Try something like this:
import pandas
df = pandas.DataFrame(
    [
        {key: 0 for key in ["self", "id", "desc", "name", "arch", "rel"]}
        for _ in range(100)
    ]
)
# Add a v_ to every column
df.columns = [f"v_{column}" for column in df.columns]
# Drop one column
df = df.drop(columns=["v_self"])
To your "case 1":
You meet a bug (#38547) in pandas — “Direct renaming of 1 column seems to be accepted, but only old name is working”.
It means that after that "renaming", you may delete the first column
not by using
v.drop('v_self',1)
but using the old name
v.drop('self',1)`.
Of course, the better option is not using such a buggy renaming in the
current versions of pandas.
To rename columns by adding a prefix to every label, there is a dedicated dataframe method for it, .add_prefix():
v = v.add_prefix("v_")
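A quick check with the question's column names (values left empty, just to show that the drop works afterwards):

import pandas as pd

v = pd.DataFrame(columns=['self', 'id', 'desc', 'name', 'arch', 'rel'])
v = v.add_prefix('v_')
v = v.drop(columns=['v_self'])   # no "not found in axis" error this time
print(v.columns)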
I am coming from an R background and used to being able to retrieve the value from a dataframe by using syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe by the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning a value in the pandas dataframe, I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code requires a lot more concentration because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand you correctly, as #sammywemmy mentioned, you can use .loc and .iloc to get or change the value in any row and column.
If the order of your dataframe rows may change, define an explicit index so you can still get every row (data point) by its index even after the order has changed.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
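To make that concrete, a small sketch (illustrative data, not from the question) showing that .at uses the same row-then-column order for both reading and writing:

import pandas as pd

pandas_dataframe = pd.DataFrame({'SomeColumnName': [10, 20, 30]})

# Read a scalar: row label first, then column name
value = pandas_dataframe.at[1, 'SomeColumnName']     # 20

# Write a scalar with the exact same row-then-column order
pandas_dataframe.at[1, 'SomeColumnName'] = 99
print(pandas_dataframe)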