Iteration, calculation via pandas - python

I am new to Python and I would like to ask something.
My code reads a CSV file. I want to use one column and apply an equation that computes a value depending on the value in that column. I am using for and if statements.
My code:
import pandas as pd
import matplotlib as mpl
import numpy as np
dfArxika = pd.read_csv('AIALL.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10], header=None, index_col=False)
print(dfArxika.columns)
A=dfArxika[9]
for i in A:
    if (A(i)>=4.8 and A(i)<66):
        IA=(2.2*log10(A(i)/66)+5.5)
    elif A(i)>=66:
        IA=3.66*log10(A(i)/66)+5.5
    else:
        IA=2.2*log10(A(i)/66)+5.5
But the command window shows me this error:
TypeError: 'Series' object is not callable
Could you help me?

As @rdas mentioned in the comments, you are using parentheses () instead of brackets [] for indexing the values of your column.
I am not sure what IA is in your example, but this might work:
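for i in range(len(dfArxika)):
    value = A.loc[i]  # A is a Series, so it takes a single row label
    if value >= 4.8 and value < 66:
        IA = 2.2 * np.log10(value / 66) + 5.5
    elif value >= 66:
        IA = 3.66 * np.log10(value / 66) + 5.5
    else:
        # same formula as the first branch, as in the question
        IA = 2.2 * np.log10(value / 66) + 5.5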
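The same piecewise formula can also be applied to the whole column at once instead of row by row. This is only a minimal sketch: it assumes the values are positive, uses a small made-up Series in place of column 9 of AIALL.csv, and the helper name compute_ia is hypothetical.

```python
import numpy as np
import pandas as pd

def compute_ia(values: pd.Series) -> pd.Series:
    """Apply the piecewise formula from the question to every value at once."""
    ratio = np.log10(values / 66)
    # Slope 3.66 at or above 66, slope 2.2 below it (the question uses 2.2
    # both for 4.8 <= value < 66 and for value < 4.8).
    slope = np.where(values >= 66, 3.66, 2.2)
    return pd.Series(slope * ratio + 5.5, index=values.index)

# Made-up sample values standing in for dfArxika[9]
A = pd.Series([6.6, 66.0, 660.0])
IA = compute_ia(A)
print(IA.tolist())
```

Unlike the loop, which overwrites IA on every pass and keeps only the last result, this returns one result per row.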

Related

How to create a new column based on a condition?

Thank you for answering first. I want to combine the address and postcode if the condition is 0, and combine the address and postcode and 1 if the condition is 1. I tried the following, but the desired result does not come out. I would like to ask if the data type is a problem. Please give me a lot of advice.
import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(data['condition']='0', data['address']+'-'+data['postcode'], data['address']+'-'+data['postcode']+'-'+data['condition'])
output
SyntaxError: keyword can't be an expression
expected output
As pointed out by @quang-hoang in the comments, the syntax is incorrect: you need == to check equality, because a single = is read as a keyword argument. Also, remove the single quotes from 0 if the column datatype is int.
Please note that the np.where function as written will also output data['address']+'-'+data['postcode']+'-'+data['condition'] whenever data['condition'] is not equal to 0. It's essentially an if-then-else clause.
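import pandas as pd
import numpy as np
data = pd.read_csv('./address.csv')
data['full address'] = np.where(
    data['condition'] == 0,
    data['address'] + '-' + data['postcode'],
    # astype(str) is needed when condition is an int column,
    # because a string and an int cannot be concatenated
    data['address'] + '-' + data['postcode'] + '-' + data['condition'].astype(str)
)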
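To see the if-then-else behaviour on concrete values, here is a small self-contained sketch. The rows are made up to stand in for address.csv, and it assumes that condition is an integer column and postcode is a string column.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for address.csv
data = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "postcode": ["01234", "56789"],
    "condition": [0, 1],
})

base = data["address"] + "-" + data["postcode"]
# Convert the integer flag to text before concatenating it
with_flag = base + "-" + data["condition"].astype(str)
data["full address"] = np.where(data["condition"] == 0, base, with_flag)
print(data["full address"].tolist())
```

Both candidate columns are computed for every row first; np.where then picks one of them per row.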

Create many distribution plots using For loop with seaborn

I'm trying to create many distribution plots at once for a few different fields. I have written a simple for loop, but I always make the same mistake and Python doesn't understand what "i" is.
This is the code I have written:
for i in data.columns:
    sns.distplot(data[i])
KeyError: 'i'
I have also tried to put 'i' instead of i, but I get error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
I believe my mistake is something basic that I don't know about loops, so understanding it will help me a lot in the future.
My end goal is to get many distribution plots (with skewness and kurtosis values) at once without writing each one of them.
To run only over numeric columns use:
numeric_data = data.select_dtypes(include='number')  # public alternative to the private _get_numeric_data()
for i in numeric_data.columns:
    sns.distplot(numeric_data[i])
As mentioned in the comments, you cannot make a distplot from a string column. If you want to ignore string columns, you can check for each column as you are iterating through them as such:
for i in data.columns:
    if data[i].dtype == np.float64 or data[i].dtype == np.int64:
        sns.distplot(data[i])
    else:
        pass  # your code to handle strings
I ran a simple test based on what you needed and it works fine on my machine. Here is the code:
import seaborn as sns
import matplotlib.pyplot as plt
a = [1,2,3,4]
c = [1,4,6,7,4,6,7,4,3,5,543,543,54,46,656,76,43,56]
d = [43,3,3,56,5,76,686,876,8768,78,77,98,79,8798,987,978,98]
sns.distplot(a)
e = [a,c,d]
for i, col in enumerate(e):
    plt.figure(i)
    sns.distplot(col)
plt.show()
In your case, it would be like this:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
for index, i in enumerate(data.columns):
    if data[i].dtype == np.float64 or data[i].dtype == np.int64:
        plt.figure(index)  # one figure per numeric column
        sns.distplot(data[i])
    else:
        pass  # your code to handle strings
plt.show()
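The question also asks for skewness and kurtosis values. These can be computed for all numeric columns in one go with pandas. This is a sketch on a made-up frame, and the helper name describe_shape is hypothetical. Note that pandas reports excess kurtosis, so a normal distribution scores 0.

```python
import pandas as pd

def describe_shape(data: pd.DataFrame) -> pd.DataFrame:
    """Return skewness and kurtosis for every numeric column of data."""
    numeric = data.select_dtypes(include="number")
    return pd.DataFrame({"skewness": numeric.skew(), "kurtosis": numeric.kurt()})

# Made-up frame with one text column that should be skipped
data = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [1.0, 1.0, 1.0, 1.0, 10.0],
    "name": ["a", "b", "c", "d", "e"],
})
summary = describe_shape(data)
print(summary)
```

Inside the plotting loop, the values for a column could then be looked up with summary.loc[i] and placed in the plot title.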

How to subtract a percentage from a CSV file and then output it into another file? I'd preferably like a formula like x*.10=y

Sorry if I haven't explained things very well. I'm a complete novice, so please feel free to critique.
I've searched everywhere but I haven't found anything close to subtracting a percentage. When it's done on its own (x-.10=y) it works wonderfully. The only problem is that I'm trying to make 'x' stand for sample_.csv[0], which from my understanding is the numerical value from the first column.
import csv
import numpy as np
import pandas as pd
readdata = csv.reader(open("sample_.csv"))
x = input(sample_.csv[0])
y = input(x * .10)
print(x + y)
The column looks something like this:
"20,a,"
"25,b,"
"35,c,"
"45,d,"
I think you should only need pandas for this task. I'm guessing you want to apply this operation on one column:
import pandas as pd
df = pd.read_csv('sample_.csv', header=None)  # the sample shows no header row, so columns are numbered 0, 1, ...
df['new_col'] = df[0] * 1.1  # x + 0.1*x = 1.1*x adds 10%; use 0.9 instead to subtract 10%
df.to_csv('new_sample.csv', index=False)  # Default behavior is to write the index, which I personally don't like.
BTW: input is a built-in function in Python that asks the user for input. I'm guessing you don't want this behavior, but I could be wrong.
import pandas as pd
df = pd.read_csv("sample_.csv")
df['newcolumn'] = df['column'].apply(lambda x : x * .10)
Please try this.
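For the "subtract a percentage" reading of the question, here is a minimal end-to-end sketch. It assumes a file without a header row and uses in-memory text in place of sample_.csv and the output file. The column names value, label and reduced are made up for illustration.

```python
from io import StringIO
import pandas as pd

# Made-up text mimicking sample_.csv (no header row, trailing comma dropped)
raw = StringIO("20,a\n25,b\n35,c\n45,d\n")
df = pd.read_csv(raw, header=None, names=["value", "label"])

percent = 10  # hypothetical percentage to subtract
df["reduced"] = df["value"] * (1 - percent / 100)  # x - 0.10*x = 0.90*x

out = StringIO()  # stands in for an output file path
df.to_csv(out, index=False)
print(out.getvalue())
```

Replacing the two StringIO objects with file paths gives the read-then-write flow the question describes.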

Importing numbers as string into a dataframe from text

I'm trying to import a text file into Python as a dataframe.
My text file essentially consists of 2 columns, both of which are numbers.
The problem is: I want one of the columns to be imported as a string (since many of the 'numbers' start with a zero, e.g. 0123, and I will need this column to merge the df with another later on)
My code looks like this:
mydata = pd.read_csv("text_file.txt", sep = "\t", dtype = {"header_col2": str})
However, I still lose the zeros in the output, so a 4-digit number is turned into a 3-digit number.
I'm assuming there is something wrong with my import code but I could not find any solution yet.
I'm new to python/pandas, so any help/suggestions would be much appreciated!
It is hard to see why your original code is not working. The following reproduction keeps the leading zeros:
from io import StringIO
import pandas as pd
# this mimics your data
mock_txt = StringIO("""header_col2\theader_col3
0123\t5
0333\t10
""")
# same reading as you suggested
df = pd.read_csv(mock_txt, sep = "\t", dtype = {"header_col2": str})
# are they really strings?
assert isinstance(df.header_col2[0], str)
assert isinstance(df.header_col2[1], str)
P.S. As always on SO, it is really helpful to include some of the data and a minimal working example with code in the original post.
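If it is unclear whether the key passed to dtype matches the actual header, reading every column as text with dtype=str sidesteps the question. The sketch below uses made-up data and also shows one way to repair values that have already lost their zeros, assuming the codes have a fixed width of 4.

```python
from io import StringIO
import pandas as pd

# Made-up tab-separated text with codes that start with a zero
raw = "code\tcount\n0123\t5\n0333\t10\n"

as_text = pd.read_csv(StringIO(raw), sep="\t", dtype=str)  # every column as str
as_numbers = pd.read_csv(StringIO(raw), sep="\t")          # default type inference

# Repair codes that already lost their zeros, assuming a fixed width of 4
restored = as_numbers["code"].astype(str).str.zfill(4)
print(as_text["code"].tolist(), restored.tolist())
```

Printing mydata.columns is a quick way to check the exact header names the file produces.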

Function not being applied properly on a pandas dataframe

I am very new to the whole pandas and numpy world. I have experience with Python but not on this side. I was trying to work with a data set and I found an issue that I am not able to explain. It would be great if someone with experience could help me understand what is going wrong.
I have a CSV file with three fields. "Age", "Working Class" and "income". The headers were missing so I loaded the CSV in the following manner -
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv", index_col=False, header=None, names=["age", "workclass", "income"])
Now the data in the last column is in this format: "<=50K" or ">50K". I wanted to transform the data into either "0" or "1" based on the values above, so 0 for "<=50K" and 1 for ">50K". To accomplish that I wrote this code:
def test_func(x):
    if x == "<=50K":
        return "0"
    else:
        return "1"
df['income'] = df['income'].apply(test_func)
This turns every value in the column into "1"! I did some printing inside test_func and it looks like x has the right value and the type of x is "str". I cannot understand why the "else" part is always executed and never the "if" part. What am I doing wrong?
It can be a very silly mistake that I am overlooking. I am not sure and any help will be great
Thanks in advance.
Option 1
astype
df['income'] = df['income'].ne("<=50K").astype(int)  # 0 for "<=50K", 1 otherwise
Option 2
np.where
df['income'] = np.where(df.income == "<=50K", 0, 1)  # 0 for "<=50K", 1 otherwise
I would just do:
mask = df['income'] == '<=50K'  # compute the mask once, before the column is overwritten
df.loc[mask, 'income'] = 0
df.loc[~mask, 'income'] = 1
Alex's solution is a classic, but there is a built-in if/else function in numpy called np.where. I'm not super familiar with it, but it would look something like...
df['income'] = np.where(df['income'] == '<=50K', 0, 1)
Referenced np.where Stackoverflow Question
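None of the answers above explains why the comparison in test_func never matches. One possible cause, which is only an assumption here, is hidden whitespace around the values, for example a space after each comma in the CSV. The sketch below uses made-up rows to show how to expose and remove it before mapping.

```python
import pandas as pd

# Made-up rows; the leading spaces are an assumption about the raw file
df = pd.DataFrame({"income": [" <=50K", " >50K", " <=50K"]})

print(df["income"].map(repr).tolist())  # repr exposes hidden whitespace

cleaned = df["income"].str.strip()
df["income"] = cleaned.map({"<=50K": 0, ">50K": 1})
print(df["income"].tolist())
```

If the spaces do come from the file, passing skipinitialspace=True to pd.read_csv is another way to drop them at load time.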
