Functions: how to use columns as parameters? - Python

I would need to create a new column with data extracted from another column.
Name     Surname   Age
Nivea    Jones     45
Kelly    Pams      68
Matthew  Currigan  24
...
I would like to create a new column with only the first letter from the name and surname, i.e.
Name     Surname   Age  Short FN
Nivea    Jones     45   NJ
Kelly    Pams      68   KP
Matthew  Currigan  24   MC
...
I did as follows:
df['Short FN'] = df['Name'].str.get(0) + df['Surname'].str.get(0)
and it works well. However, I need to build a function that takes two columns (in this case, name and surname) as parameters:
def sh(x, y):
    df['Short FN'] = df[x].str.get(0) + df[y].str.get(0)
    return
and it does not work, probably because I should keep in mind that I am using columns from a dataframe as parameters. Also, I do not know if and what I should return.
Could you please explain how to create a function that takes columns as parameters, and how to use it? (It is not clear to me whether I need to iterate through the rows with a for loop.)

You can do this:
def sh(x, y):
    return x[0] + y[0]

df['Short'] = df.apply(lambda x: sh(x['Name'], x['Surname']), axis=1)
print(df)
      Name   Surname  Age Short
0    Nivea     Jones   45    NJ
1    Kelly      Pams   68    KP
2  Matthew  Currigan   24    MC
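As a self-contained sketch (with the question's sample data rebuilt inline, since the original frame isn't shown), the apply-based approach runs like this:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Name": ["Nivea", "Kelly", "Matthew"],
    "Surname": ["Jones", "Pams", "Currigan"],
    "Age": [45, 68, 24],
})

def sh(x, y):
    # x and y are the cell values of one row, not column names.
    return x[0] + y[0]

# apply with axis=1 passes each row as a Series to the lambda.
df["Short"] = df.apply(lambda row: sh(row["Name"], row["Surname"]), axis=1)
print(df["Short"].tolist())  # ['NJ', 'KP', 'MC']
```

Note that apply calls sh once per row, so it is slower than a vectorized string operation on large frames.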

There are several ways to do that. The simplest way, assuming df is global (as it seems to be in your case), is:
def short_name(col1, col2):
    return df[col1].str[0] + df[col2].str[0]

Calling short_name("Name", "Surname") produces:
0    NJ
1    KP
2    MC
dtype: object
You can now use it in whatever way you want. For example:
df["sn"] = short_name("Name", "Surname")
print(df)
# produces:
      Name   Surname  Age  sn
0    Nivea     Jones   45  NJ
1    Kelly      Pams   68  KP
2  Matthew  Currigan   24  MC
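If you would rather not rely on a global df, a variant that takes the frame itself as a parameter works too (a sketch with the question's data rebuilt inline; the function name is my own):

```python
import pandas as pd

def short_name(frame, col1, col2):
    # Vectorized: .str[0] takes the first character of every value at once.
    return frame[col1].str[0] + frame[col2].str[0]

df = pd.DataFrame({
    "Name": ["Nivea", "Kelly", "Matthew"],
    "Surname": ["Jones", "Pams", "Currigan"],
    "Age": [45, 68, 24],
})
df["Short FN"] = short_name(df, "Name", "Surname")
print(df["Short FN"].tolist())  # ['NJ', 'KP', 'MC']
```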

Doing vlookup-like things on Python with multiple lookup values

Many of us know that the syntax for a Vlookup function on Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something on Python with a lookup table (in dataframe form) that looks something like this:
Name Date of Birth ID#
Jack 1/1/2003 0
Ryan 1/8/2003 1
Bob 12/2/2002 2
Jack 3/9/2003 3
...and so on. Note how the two Jacks are assigned different ID numbers because they were born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name Date of Birth Test 1 Test 2
Jack 1/1/2003 89 91
Ryan 1/8/2003 92 88
Jack 3/9/2003 93 79
Bob 12/2/2002 80 84
...
How do I make it so that the result looks like this?
ID# Name Date of Birth Test 1 Test 2
0 Jack 1/1/2003 89 91
1 Ryan 1/8/2003 92 88
3 Jack 3/9/2003 93 79
2 Bob 12/2/2002 80 84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?
Turns out that I can just do
pd.merge([lookup value], [lookup table], on=['Name', 'Date of Birth'])
which produces
Name Date of Birth Test 1 Test 2 ID#
Jack 1/1/2003 89 91 0
Ryan 1/8/2003 92 88 1
Jack 3/9/2003 93 79 3
Bob 12/2/2002 80 84 2
...
Then all that is needed is to move the last column to the front.
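A runnable sketch of that merge, with the two frames rebuilt from the question (per the lookup table, Ryan gets ID 1 and the second Jack gets ID 3):

```python
import pandas as pd

# Lookup table and gradebook reconstructed from the question.
lookup = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Bob", "Jack"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "12/2/2002", "3/9/2003"],
    "ID#": [0, 1, 2, 3],
})
grades = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Jack", "Bob"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "3/9/2003", "12/2/2002"],
    "Test 1": [89, 92, 93, 80],
})

# Merge on both key columns, then move ID# to the front.
merged = pd.merge(grades, lookup, on=["Name", "Date of Birth"])
merged = merged[["ID#"] + [c for c in merged.columns if c != "ID#"]]
print(merged["ID#"].tolist())  # [0, 1, 3, 2]
```

Merging on a list of columns is exactly the multi-column "lookup value" the question asks about.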

How to add a new row in CSV using Python pandas

Hello, this is my CSV data:
Age Name
0 22 George
1 33 lucas
2 22 Nick
3 12 Leo
4 32 Adriano
5 53 Bram
6 11 David
7 32 Andrei
8 22 Sergio
I want to use an if/else statement; for example, if George is an adult, create a new row and insert a +. I mean:
Age Name Adul
22 George +
What is the best way?
This is the code I am using to read the data from the CSV:
import pandas as pd
produtos = pd.read_csv('User.csv', nrows=9)
print(produtos)
for i, produto in produtos.iterrows():
    print(i, produto['Age'], produto['Name'])
IIUC, you want to create a new column (not a row) called "Adul". You can do this with numpy.where:
import numpy as np
produtos["Adul"] = np.where(produtos["Age"].ge(18), "+", np.nan)
Edit:
To only do this for a specific name, you could use:
name = input("Name")
if name in produtos["Name"].tolist():
    # .any() avoids the "truth value of a Series is ambiguous" error
    if (produtos.loc[produtos["Name"] == name, "Age"] >= 18).any():
        produtos.loc[produtos["Name"] == name, "Adul"] = "+"
You can do this:
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", np.nan)
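A self-contained sketch of the np.where approach, with a few rows built inline in place of User.csv:

```python
import numpy as np
import pandas as pd

# Inline stand-in for the CSV data from the question.
produtos = pd.DataFrame({"Age": [22, 12, 53], "Name": ["George", "Leo", "Bram"]})

# Rows meeting the condition get "+"; the rest get the else-value.
# (Note: np.where may stringify NaN here due to dtype promotion; use a
# pandas mask with df.loc if real NaN is required in non-matching rows.)
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", np.nan)
print((produtos["Adul"] == "+").tolist())  # [True, False, True]
```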

I want to separate a data frame based on marks and then download it

This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to download a different Excel file for each subject in one loop, so basically I should get 4 Excel files, one per subject.
I tried this but it didn't work:
n = 0
for subjects in df.stream:
    df.to_excel("sub" + str(n) + ".xlsx")
    n += 1
I think groupby is helpful here, and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('Stream')):
    group_df.to_excel('sub{}.xlsx'.format(i))
    # Alternatively, to name the file based on the stream:
    # group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
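A runnable sketch of that loop. It writes CSVs into a temporary directory instead of Excel files (to avoid the optional openpyxl dependency), with data mirroring the question's frame:

```python
import os
import tempfile
import pandas as pd

# Data reconstructed from the question.
df = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Stream": ["Math", "Commerce", "Arts", "Biology"] * 2,
    "Percentage": [88, 92, 95, 70, 88, 92, 95, 70],
})

outdir = tempfile.mkdtemp()
# One file per group, named after the stream.
for group, group_df in df.groupby("Stream"):
    group_df.to_csv(os.path.join(outdir, "sub_{}.csv".format(group)), index=False)
print(sorted(os.listdir(outdir)))
# ['sub_Arts.csv', 'sub_Biology.csv', 'sub_Commerce.csv', 'sub_Math.csv']
```

Swapping to_csv back to to_excel (with openpyxl installed) reproduces the answer exactly.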

Read structured file in python

I have a file with data similar to this:
[START]
Name = Peter
Sex = Male
Age = 34
Income[2020] = 40000
Income[2019] = 38500
[END]
[START]
Name = Maria
Sex = Female
Age = 28
Income[2020] = 43000
Income[2019] = 42500
Income[2018] = 40000
[END]
[START]
Name = Jane
Sex = Female
Age = 41
Income[2020] = 60500
Income[2019] = 57500
Income[2018] = 54000
[END]
I want to read this data into a pandas dataframe so that at the end it is similar to this
Name Sex Age Income[2020] Income[2019] Income[2018]
Peter Male 34 40000 38500 NaN
Maria Female 28 43000 42500 40000
Jane Female 41 60500 57500 54000
So far, I wasn't able to figure out if this is a standard data file format (it has some similarities to JSON but is still very different).
Is there an elegant and fast way to read this data to a dataframe?
Elegant, I do not know, but easy, yes. Python is very good at parsing simple formatted text.
Here, [START] starts a new record, [END] ends it, and inside a record, you have key = value lines. You can easily build a custom parser to generate a list of records to feed into a pandas DataFrame:
import pandas as pd

inblock = False
fieldnames = []
data = []
for line in open(filename):
    if inblock:
        if line.strip() == '[END]':
            inblock = False
        elif '=' in line:
            k, v = (i.strip() for i in line.split('=', 1))
            record[k] = v
            if k not in fieldnames:
                fieldnames.append(k)
    else:
        if line.strip() == '[START]':
            inblock = True
            record = {}
            data.append(record)

df = pd.DataFrame(data, columns=fieldnames)
df is as expected:
Name Sex Age Income[2020] Income[2019] Income[2018]
0 Peter Male 34 40000 38500 NaN
1 Maria Female 28 43000 42500 40000
2 Jane Female 41 60500 57500 54000
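The same parser works on any iterable of lines; wrapping it in a function (the name is my own) makes it testable with io.StringIO instead of a real file:

```python
import io
import pandas as pd

def parse_records(lines):
    inblock = False
    fieldnames, data = [], []
    for line in lines:
        line = line.strip()
        if inblock:
            if line == '[END]':
                inblock = False
            elif '=' in line:
                # Split on the first '=' only, then trim whitespace.
                k, v = (i.strip() for i in line.split('=', 1))
                record[k] = v
                if k not in fieldnames:
                    fieldnames.append(k)
        elif line == '[START]':
            inblock = True
            record = {}
            data.append(record)
    return pd.DataFrame(data, columns=fieldnames)

sample = """[START]
Name = Peter
Age = 34
[END]
[START]
Name = Maria
Age = 28
[END]
"""
df = parse_records(io.StringIO(sample))
print(df["Name"].tolist())  # ['Peter', 'Maria']
```

Note that all values come out as strings ('34', not 34); convert numeric columns afterwards with pd.to_numeric if needed.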

Modify dataframe row - pandas Python

I hope this has not been posted yet; I have not found anything that helped me. So I have this data frame df:
Id Numero Voie CodePostal Commune \
1 940010005V-59 59 Rue d'Ablon 94480 Ablon-sur-Seine
2 940010005V-61 61 Rue d'Ablon 94480 Ablon-sur-Seine
3 940010005V-65 65 Rue d'Ablon 94480 Ablon-sur-Seine
Source Latitude Longitude \
1 C+O 48.721350 2.414291
2 C+O 48.722434 2.413538
3 OSM 48.721141 2.415030
Adresse AdresseGPS LatitudeGPS \
1 59 Rue d'Ablon, Ablon-sur-Seine, France 0.0
2 61 Rue d'Ablon, Ablon-sur-Seine, France 0.0
3 65 Rue d'Ablon, Ablon-sur-Seine, France 0.0
LongitudeGPS
1 0.0
2 0.0
3 0.0
I imported it from a csv and added the last three columns using
df = df.assign(AdresseGPS="",LatitudeGPS = 0.,LongitudeGPS = 0.)
What I want to do is modify these last three columns using a function:
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut
However when I do
df.ix[1,] = funcRow(df.ix[1,])
I get the following error : IndexError: tuple index out of range
I printed both
df.ix[1,] & funcRow(df.ix[1,])
I get the following:
print df.ix[1,]
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS
LatitudeGPS 0
LongitudeGPS 0
Name: 1, dtype: object
print funcRow(df.ix[1,])
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS t
LatitudeGPS 1
LongitudeGPS 0
Name: 1, dtype: object
I am quite new to using data frames with Python, so I provided lots of details; I am not sure if everything is relevant. I have tried this using other functions such as loc or iloc instead of ix, but I still get the same error.
Any advice would be very welcome :)
I think the "safest" way to solve this is with .loc[] instead of .ix[].
Try this:
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut

df.loc[1, :] = funcRow(df.loc[1, :])
(In case you're not used to .loc[]: the first argument is the row selection, the second argument is the column selection, and giving ":" means you choose all).
When I run the code above I get a warning message, but it does return the updated dataframe if I print df.
(Bonus: This blog post is an excellent reference when learning loc, iloc and ix: http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
According to the Documentation,
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
I think you want to access the last three columns of the whole dataframe.
If so, you can try:
df.ix[:] = funcRow(df.ix[:])  # for all rows
or
df.ix[start:end] = funcRow(df.ix[start:end])  # for specific rows
or, if you want to access only a particular row, you can use:
df.ix[n] = funcRow(df.ix[n])
I hope this helps you solve your problem.
This should work:
df.ix[1] = funcRow(df.ix[1,])
I probably need to take a look at the source code to see why the following doesn't work:
df.ix[1,] = funcRow(df.ix[1,])
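For what it's worth, .ix was deprecated in pandas 0.20 and removed in 1.0, so on current pandas only the .loc form still runs. A minimal sketch, using a one-row frame as my own stand-in for the question's data:

```python
import pandas as pd

# Hypothetical one-row frame mirroring a few of the question's columns.
df = pd.DataFrame({
    "Id": ["940010005V-59"],
    "AdresseGPS": [""],
    "LatitudeGPS": [0.0],
})

def funcRow(dataIn):
    dataOut = dataIn.copy()  # copy so the original row Series is untouched
    dataOut["AdresseGPS"] = "t"
    dataOut["LatitudeGPS"] = 1
    return dataOut

# .loc aligns the returned Series back onto the row by column label.
df.loc[0, :] = funcRow(df.loc[0, :])
print(df.loc[0, "AdresseGPS"])  # t
```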
