I am very new to the whole pandas and numpy world. I have experience with Python, but not on this side. I was working with a data set and ran into an issue that I am not able to explain. It would be great if someone with experience could help me understand what is going wrong.
I have a CSV file with three fields: "Age", "Working Class" and "income". The headers were missing, so I loaded the CSV in the following manner -
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv", index_col=False, header=None, names=["age", "workclass", "income"])
Now the data in the last column is in this format - "<=50K" or ">50K". I wanted to transform the data into either "0" or "1" based on those values, so 0 for "<=50K" and 1 for ">50K". To accomplish that I wrote this code:
def test_func(x):
    if x == "<=50K":
        return "0"
    else:
        return "1"
df['income'] = df['income'].apply(test_func)
This makes all the values become "1"! I did some printing inside test_func and it looks like x has the right value and its type is str. I cannot understand how, in this case, the "else" branch always gets executed and never the "if" branch. What am I doing wrong?
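For reference, the printing I did inside the function was along these lines (a sketch; the repr call is just to expose any hidden characters, should they exist):
def test_func(x):
    print(repr(x), type(x))  # repr() would reveal stray whitespace like " <=50K"
    if x == "<=50K":
        return "0"
    else:
        return "1"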
It may be a very silly mistake that I am overlooking. I am not sure, and any help will be great.
Thanks in advance.
Option 1
astype
df['income'] = df['income'].ne("<=50K").astype(int)
Option 2
np.where
df['income'] = np.where(df.income == "<=50K", 0, 1)
I would just do:
mask = df['income'] == '<=50K'
df.loc[mask, 'income'] = 0
df.loc[~mask, 'income'] = 1
(Compute the mask first; otherwise the second assignment would also hit the rows just set to 0, since 0 != '<=50K'.)
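For completeness, a mapping-based variant should work too (a sketch using the same two labels; any value not in the dict would become NaN):
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})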
Alex's solution is a classic, but there is a built-in if/else-style function in numpy called np.where. I'm not super familiar with it, but it would look something like...
df['income'] = np.where(df['income'] == '<=50K', 0, 1)
Referenced np.where Stackoverflow Question
I am noticing something a little strange in pandas (1.4.3). Is this the expected behaviour, the result of an optimization, or a bug? Basically, I'd like to guarantee that the dtype does not change unexpectedly; at the very least I'd like to see an error raised, so any tips are welcome.
If you assign all values of a Series in a DataFrame this way, the dtype is altered:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df1.iloc[:, df1.columns.get_loc("a")] = 0
>>> df1["a"].dtype
dtype('int64')
and if you index the rows in a different way, pandas does not convert the dtype:
>>> df2 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0
>>> df2["a"].dtype
dtype('int32')
Not really an answer, but some thoughts that might help you in your quest. My guess as to what is happening is this: in your multiple-choice question above, I am picking option A - optimization.
I think when pandas sees df1.iloc[:, df1.columns.get_loc("a")] = 0 it treats it as a full-column replacement of all rows - no slicing, even though df1.iloc[:, ...] is involved. [:] gets translated into all-rows-not-a-slice mode. When it sees = 0 it sees that (via broadcast) as a full column of int64. And since it is a full replacement, the new column takes the dtype of the assigned value.
But when it sees df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0 it goes into index-slice mode. Even though it is a full-column index slice, it doesn't know that and makes an early decision to go into index-slice mode. Index-slice mode then operates on the assumption that only part of the column is going to be updated - not replaced. In update mode the column is assumed to be partially updated and retains its existing dtype.
I got the above hypothesis from looking around at this: https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/base.py
If I didn't have a day job I might have the time to actually find the smoking gun in those 6242 lines of code.
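Not a fix for the underlying behaviour, but since you said you'd at least like to see an error raised, one option (a sketch built on your own example) is to snapshot the dtypes and assert they survive the assignment:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
before = df1.dtypes.copy()
df1.iloc[:, df1.columns.get_loc("a")] = 0
# This raises AssertionError, because the dtype silently became int64.
assert df1.dtypes.equals(before), f"dtype changed: {before['a']} -> {df1['a'].dtype}"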
If you look at this code (I wrote your code a little differently to see what is happening in the middle):
import pandas as pd
import numpy as np
dfx = pd.DataFrame({"x": np.array([4, 5, 6], dtype="int32")})
P = dfx.iloc[:, dfx.columns.get_loc("x")] = 0
P1 = dfx.iloc[:, dfx.columns.get_loc("x")]
print(P1)  # the values are all 0 and the dtype has silently become int64, the pandas default integer dtype
print(P)
print(dfx["x"].dtype)  # int64
dfy = pd.DataFrame({"y": np.array([4, 5, 6], dtype="int32")})
Q = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")] = 0
print(Q)
Q1 = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")]
print(Q1)  # the values are all 0 but the original dtype is kept
print(dfy["y"].dtype)  # int32
print(len(dfx.index))
print(len(dfy.index))
Don't know why this is happening, but adding square brackets seems to solve the issue:
df1.iloc[:, [df1.columns.get_loc("a")]] = 0
Another solution seems to be:
df1.iloc[range(len(df1.index)), df1.columns.get_loc("a")] = 0
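A quick check (a sketch) that both workarounds keep the original dtype, per the answers above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df1.iloc[:, [df1.columns.get_loc("a")]] = 0  # column position wrapped in a list
print(df1["a"].dtype)  # int32

df2 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df2.iloc[range(len(df2.index)), df2.columns.get_loc("a")] = 0  # explicit row positions
print(df2["a"].dtype)  # int32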
I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median height per "peak_id" to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated by the code below); 2) have the code check whether the value in highpoint_metres is null and, if it is, put the value of the new column there instead. I hope this is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of Python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
df[col_name] = df[col_name].replace({0: df[col_name].median()})
(Using the pandas .median() here, since it skips NaN values, unlike np.median.)
You can also use the apply function:
median_value = df[col_name].median()  # compute once, not per row
df[col_name] = df[col_name].apply(lambda x: median_value if x == 0 else x)
Let me know if this helps.
So, adding a little bit more info based on Marie's question: one way to get the median is through groupby and then joining it back onto the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
df['highpoint_metres'] = df.apply(lambda row: row['Median'] if pd.isna(row['highpoint_metres']) else row['highpoint_metres'], axis=1)
(pd.isna is needed here; comparing against np.nan with == is always False.)
Let me know if this solves your issue
First of all, thank you for answering. I want to combine the address and postcode if the condition is 0, and combine the address, postcode, and the condition value if the condition is 1. I tried the following, but the desired result does not come out. I would like to ask whether the data type is the problem. Please give me advice.
import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(data['condition']='0', data['address']+'-'+data['postcode'], data['address']+'-'+data['postcode']+'-'+data['condition'])
output
SyntaxError: keyword can't be an expression
expected output
As pointed out by @quang-hoang in the comments, the syntax is incorrect: you need == for checking equality, not =. Also, remove the single quotes from 0 if the column datatype is int.
Please note that the np.where function as written will also output data['address']+'-'+data['postcode']+'-'+data['condition'] whenever data['condition'] is not equal to 0. It's essentially an if-then-else clause.
import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(
    data['condition'] == 0,
    data['address'] + '-' + data['postcode'],
    data['address'] + '-' + data['postcode'] + '-' + data['condition'].astype(str)  # cast to str so the concatenation works when the column is int
)
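A quick run with made-up data (hypothetical values, just to illustrate the two branches):
data = pd.DataFrame({
    'address': ['12 Main St', '34 Side Rd'],
    'postcode': ['01234', '56789'],
    'condition': [0, 1],
})
data['full address'] = np.where(
    data['condition'] == 0,
    data['address'] + '-' + data['postcode'],
    data['address'] + '-' + data['postcode'] + '-' + data['condition'].astype(str),
)
print(data['full address'].tolist())
# ['12 Main St-01234', '34 Side Rd-56789-1']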
I am new to Python and I would like to ask something.
My code reads a csv file and I want to use one of its columns. I want to apply an equation that calculates several values, depending on the value of that column. I am using for and if statements.
my code
import pandas as pd
import matplotlib as mpl
import numpy as np
dfArxika = pd.read_csv('AIALL.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10], header=None, index_col=False)
print(dfArxika.columns)
A=dfArxika[9]
for i in A:
    if (A(i) >= 4.8 and A(i) < 66):
        IA = (2.2*log10(A(i)/66)+5.5)
    elif A(i) >= 66:
        IA = 3.66*log10(A(i)/66)+5.5
    else:
        IA = 2.2*log10(A(i)/66)+5.5
but the command window shows me the error:
TypeError: 'Series' object is not callable
Could you help me?
As @rdas mentioned in the comments, you are using parentheses () instead of brackets [] for indexing the values of your column.
I am not sure what IA is in your example, but this might work:
for i in range(len(dfArxika)):
    if A.loc[i] >= 4.8 and A.loc[i] < 66:
        IA = 2.2*np.log10(A.loc[i]/66) + 5.5
    elif A.loc[i] >= 66:
        IA = 3.66*np.log10(A.loc[i]/66) + 5.5
    else:
        IA = 2.2*np.log10(A.loc[i]/66) + 5.5
(A is a Series, so .loc takes a single label; np.log10 is used because a bare log10 is not defined anywhere.)
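Since the first and last branches share the same formula, a vectorized alternative (a sketch, assuming A holds numeric values) avoids the Python loop entirely and gives you the whole IA array at once:
IA = np.where(
    A >= 66,
    3.66 * np.log10(A / 66) + 5.5,  # values of 66 and above
    2.2 * np.log10(A / 66) + 5.5,   # everything else, matching the if and else branches
)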
Sorry if I haven't explained things very well. I'm a complete novice, so please feel free to critique.
I've searched everywhere but I haven't found anything close to subtracting a percent. When it's done on its own (x - .10 = y) it works wonderfully. The only problem is that I'm trying to make 'x' stand for the numerical value in the first column of sample_.csv.
import csv
import numpy as np
import pandas as pd
readdata = csv.reader(open("sample_.csv"))
x = input(sample_.csv[0])
y = input(x * .10)
print(x + y)
The column looks something like this:
"20,a,"
"25,b,"
"35,c,"
"45,d,"
I think you should only need pandas for this task. I'm guessing you want to apply this operation to one column:
import pandas as pd
df = pd.read_csv('sample_.csv')  # assuming the csv has a header row
df['new_col'] = df['20,a'] * 1.1  # faster than adding a percentage separately: x + 0.1*x = 1.1*x
df.to_csv('new_sample.csv', index=False)  # the default is to also write the index, which I personally don't like
BTW: input is a built-in function in Python that asks the user for input. I'm guessing you don't want this behavior, but I could be wrong.
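Given the sample rows in the question, the file may not actually have a header; in that case this sketch might be closer (assuming the numbers are in the first column):
import pandas as pd

df = pd.read_csv('sample_.csv', header=None)  # no header row, so columns are numbered
df['new_col'] = df[0] * 1.1  # first column plus 10% of itself
df.to_csv('new_sample.csv', index=False)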
import pandas as pd
df = pd.read_csv("sample_.csv")
df['newcolumn'] = df['column'].apply(lambda x: x * .10)
Please try this.