how to create a new column based on a condition? - python

Thank you in advance for answering. I want to combine the address and postcode if the condition is 0, and combine the address, postcode and 1 if the condition is 1. I tried the following, but I don't get the desired result. I would like to ask if the data type is the problem. Any advice is appreciated.
import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(data['condition']='0', data['address']+'-'+data['postcode'], data['address']+'-'+data['postcode']+'-'+data['condition'])
output
SyntaxError: keyword can't be an expression
expected output

As pointed out by #quang-hoang in the comments, the syntax is incorrect: you need == for checking equality, not =. Also, remove the single quotes from 0 if the column datatype is int.
Please note that the np.where function as written will output data['address']+'-'+data['postcode']+'-'+data['condition'] whenever data['condition'] is not equal to 0. It's essentially an if-then-else clause.
import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(
    data['condition'] == 0,
    data['address'] + '-' + data['postcode'],
    data['address'] + '-' + data['postcode'] + '-' + data['condition']
)
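One caveat worth adding (an assumption on my part, since the CSV isn't shown): if 'postcode' or 'condition' are read in as numbers rather than strings, concatenating them with strings raises a TypeError, so cast them first:
# cast numeric columns to str before concatenating (hedged: only needed if they are numeric)
data['full address'] = np.where(
    data['condition'] == 0,
    data['address'] + '-' + data['postcode'].astype(str),
    data['address'] + '-' + data['postcode'].astype(str) + '-' + data['condition'].astype(str)
)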


type conversion in pandas on assignment of DataFrame series

I am noticing something a little strange in pandas (1.4.3). Is this the expected behaviour, the result of an optimization, or a bug? Basically I'd like to guarantee the type does not change unexpectedly; I'd at least like to see an error raised, so any tips are welcome.
If you assign all values of a series in a DataFrame this way, the dtype is altered
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df1.iloc[:, df1.columns.get_loc("a")] = 0
>>> df1["a"].dtype
dtype('int64')
and if you index the rows in a different way pandas does not convert the dtype
>>> df2 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0
>>> df2["a"].dtype
dtype('int32')
Not really an answer but some thoughts that might help you in your quest. My guess as to what is happening is this. In your multiple choice question above I am picking option A - optimization.
I think when 'pandas' sees df1.iloc[:, df1.columns.get_loc("a")] = 0 it is thinking full column(s) replacement of all rows. No slicing - even though df1.iloc[: ... ] is involved. [:] gets translated into all-rows-not-a-slice mode. When it sees = 0 it sees that (via broadcast) as full column(s) of int64. And since it is full replacement then the new column has the same dtype as the source.
But when it sees df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0 it goes into index-slice mode. Even though it is a full-column index slice it doesn't know that and makes an early decision to go into index-slice mode. Index-slice mode then operates on the assumption that only part of the column is going to be updated - not a replacement. Then in update mode the column is assumed to be partially updated and retains its existing dtype.
I got the above hypothesis from looking around at this: https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/base.py
If I didn't have a day job I might have the time to actually find the smoking gun in those 6242 lines of code.
If you look at this code (I rewrote your code a little differently to see what is happening in the middle):
import pandas as pd
import numpy as np

dfx = pd.DataFrame({"x": np.array([4, 5, 6], dtype="int32")})
P = dfx.iloc[:, dfx.columns.get_loc("x")] = 0
P1 = dfx.iloc[:, dfx.columns.get_loc("x")]
# here the dtype is automatically changed to int64 (while keeping the value 0),
# since a bare Python 0 is stored as int64 by default
print(P1)
print(P)
print(dfx["x"].dtype)

dfy = pd.DataFrame({"y": np.array([4, 5, 6], dtype="int32")})
Q = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")] = 0
print(Q)
Q1 = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")]
print(Q1)
print(dfy["y"].dtype)
print(len(dfx.index))
print(len(dfy.index))
I don't know why this is happening, but adding square brackets seems to solve the issue:
df1.iloc[:, [df1.columns.get_loc("a")]] = 0
Another solution seems to be:
df1.iloc[range(len(df1.index)), df1.columns.get_loc("a")] = 0
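If the goal is simply to guarantee the dtype never changes silently, one defensive option (my suggestion, not from the answers above) is to cast back and assert after the assignment:
df1.iloc[:, df1.columns.get_loc("a")] = 0
df1["a"] = df1["a"].astype("int32")
# fails loudly if something upcast the column behind your back
assert df1["a"].dtype == "int32"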

Replace unknown values (with different median values)

I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median value of the height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe containing that median value, which differs depending on the "peak_id" (the value calculated with the code below). 2) Have the code check whether the value in highpoint_metres is null; if it is, put the value of the new column there instead.
code :
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
One way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get the median is through groupby and then left join it with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if pd.isna(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
Let me know if this solves your issue

Iteration, calculation via pandas

I am new to Python and I would like to ask something.
My code reads a csv file. I want to use one column. Depending on the value of that column, I want to apply an equation that calculates several values. I am using for and if statements.
my code
import pandas as pd
import matplotlib as mpl
import numpy as np
dfArxika = pd.read_csv('AIALL.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10], header=None, index_col=False)
print(dfArxika.columns)
A=dfArxika[9]
for i in A:
    if (A(i)>=4.8 and A(i)<66):
        IA=(2.2*log10(A(i)/66)+5.5)
    elif A(i)>=66:
        IA=3.66*log10(A(i)/66)+5.5
    else:
        IA=2.2*log10(A(i)/66)+5.5
but the command window shows me the error:
TypeError: 'Series' object is not callable
Could you help me?
As #rdas mentioned in the comments, you are using parentheses () instead of brackets [] for indexing the values of your column.
I am not sure what IA is in your example, but this might work:
for i in range(len(dfArxika)):
    if dfArxika.loc[i, 9] >= 4.8 and dfArxika.loc[i, 9] < 66:
        IA = 2.2*np.log10(dfArxika.loc[i, 9]/66) + 5.5
    elif dfArxika.loc[i, 9] >= 66:
        IA = 3.66*np.log10(dfArxika.loc[i, 9]/66) + 5.5
    else:
        IA = 2.2*np.log10(dfArxika.loc[i, 9]/66) + 5.5
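For what it's worth, since the if and else branches in the question compute the same formula, the whole loop can also be replaced by a single vectorised np.where call; this is just a sketch assuming column 9 holds the values, as in the question:
A = dfArxika[9]
# 3.66*log10(A/66)+5.5 where A >= 66, otherwise 2.2*log10(A/66)+5.5
dfArxika['IA'] = np.where(A >= 66,
                          3.66 * np.log10(A / 66) + 5.5,
                          2.2 * np.log10(A / 66) + 5.5)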

Function not being applied properly on a pandas dataframe

I am very new to the whole pandas and numpy world. I have experience with python, but not on this side. I was trying to work with a data set and I found an issue that I am not able to explain. It would be great if someone with experience could help me understand what is going wrong.
I have a CSV file with three fields: "Age", "Working Class" and "income". The headers were missing, so I loaded the CSV in the following manner:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv", index_col=False, header=None, names=["age", "workclass", "income"])
Now the data in the last column is in this format: "<=50K" or ">50K". I wanted to transform the data into either "0" or "1" based on those values, so 0 for "<=50K" and 1 for ">50K". To accomplish that I wrote this code:
def test_func(x):
    if x == "<=50K":
        return "0"
    else:
        return "1"

df['income'] = df['income'].apply(test_func)
This makes all the values become "1"! I did some printing inside test_func, and it looks like x has the right value and is of type "str". I cannot understand how, in this case, the "else" branch always gets executed and never the "if" branch. What am I doing wrong?
It could be a very silly mistake that I am overlooking. I am not sure, and any help will be great.
Thanks in advance.
Option 1
astype
df['income'] = df['income'].ne("<=50K").astype(int)
Option 2
np.where
df['income'] = np.where(df.income == "<=50K", 0, 1)
I would just do:
mask = df['income'] == '<=50K'
df.loc[mask, 'income'] = 0
df.loc[~mask, 'income'] = 1
Alex's solution is a classic, but there is a built-in if/else function in numpy called np.where. I'm not super familiar with it, but it would look something like...
df['income'] = np.where(df['income'] == '<=50K', 0, 1)
Referenced np.where Stackoverflow Question
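One more hedged note, not confirmed by the question: if every value still comes out the same even with the approaches above, a common culprit is leading or trailing whitespace in the CSV values (e.g. " <=50K"), which makes the equality check fail. Stripping it first is a small fix:
# strip possible whitespace before comparing; gives 0 for "<=50K", 1 for ">50K"
df['income'] = df['income'].str.strip().ne("<=50K").astype(int)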

Python Dataframe

I am a Java programmer and I am learning Python for data science and analysis purposes.
I wish to clean the data in a DataFrame, but I am confused by the pandas logic and syntax.
What I wish to achieve is something like the following Java code:
for (String name : names) {
    if (name == "test") {
        name = "myValue";
    }
}
How can I do it with Python and a pandas DataFrame?
I tried the following, but it does not work:
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset V02.csv')
array = df['Order Number'].unique()
# On average, how many items does one order have?
for value in array:
    count = 0
    if df['Order Number'] == value:
        ......
I get an error at df['Order Number'] == value.
How can I identify the specific values and edit them?
In short, I want to:
- Check all the entries of the 'Order Number' column
- Execute an action (for example: replace the value, or count the value) each time the record is equal to a given value (for example, the order code)
Just use the vectorised form for replacement:
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
This compares the entire column against a specific value; where the comparison is True, just those rows are replaced with the new value.
For the second part, an if statement doesn't understand boolean arrays; it expects a scalar result. If you're just doing a unique value/frequency count then just do:
df['Order Number'].value_counts()
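Building on that, the question's comment ("on average, how many items does one order have?") can be answered directly from the frequency counts, assuming each row of the file is one item:
# rows per order, averaged over all orders
items_per_order = df['Order Number'].value_counts()
print(items_per_order.mean())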
The code goes this way
import pandas as pd
df = pd.read_csv("Dataset V02.csv")
array = df['Order Number'].unique()
for value in array:
    count = 0
    if value in df['Order Number'].values:
        .......
You need to use in against the column's values (.values) to check presence; in on the Series itself checks the index instead. Did I understand your problem correctly? If I did not, please comment and I will try to understand further.
