how to get just a string from a data frame - python
I am trying to define a function with two arguments: df (a dataframe) and an integer (the employee ID). The function should return the full name of the employee.
If the given ID does not belong to any employee, I want to return the string "UNKNOWN". If no middle name is given, return only "LAST, FIRST". If only the middle initial is given, then return the full name in the format "LAST, FIRST M.", with the middle initial followed by a '.'.
def getFullName(df, int1):
    df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
    newdf = df[(df['EmployeeID'] == int1)]
    print("'" + newdf['LastName'].item() + "," + " " + newdf['FirstName'].item() + " " + newdf['MiddleName'].item() + "." + "'")

getFullName('df', 110)
I wrote this code but ran into two problems:
1) If I don't put quotation marks around df, I get an error message, but I want to pass a data frame as the argument, not a string.
2) This code can't handle someone without a middle name.
I am sorry, but I used pd.read_excel to read an Excel file that you cannot access. I know it will be hard for you to test the code without the Excel file; if someone lets me know how to create a random data frame with the same column names, I will go ahead and change it. Thank you.
I created some fake data for this:
EmployeeID FirstName LastName MiddleName
0 0 a a a
1 1 b b b
2 2 c c c
3 3 d d d
4 4 e e e
5 5 f f f
6 6 g g g
7 7 h h h
8 8 i i i
9 9 j j None
EmployeeID 9 has no middle name, but everyone else does. The way I would do it is to break up the logic into two parts. The first, for when you cannot find the EmployeeID. The second manages the printing of the employee's name. That second part should also have two sets of logic, one to control if the employee has a middle name, and the other for if they don't. You could likely combine a lot of this into single line statements, but you will likely sacrifice clarity.
I also removed the pd.read_excel call from the function. If you want to pass the dataframe in to the function, then the dataframe should be created outside of it.
def getFullName(df, int1):
    newdf = df[(df['EmployeeID'] == int1)]
    # if the dataframe is empty, then we can't find the given ID
    # otherwise, go ahead and print out the employee's info
    if newdf.empty:
        print("UNKNOWN")
        return "UNKNOWN"
    else:
        # all strings will start with the LastName and FirstName
        # we will then add the MiddleName if it's present
        # and then we can end the string with the final '
        s = "'" + newdf['LastName'].item() + ", " + newdf['FirstName'].item()
        if newdf['MiddleName'].item():
            s = s + " " + newdf['MiddleName'].item() + "."
        s = s + "'"
        print(s)
        return s
I have the function returning values in case you want to manipulate the string further. But that was just me.
If you run getFullName(df, 1) you should get 'b, b b.'. And for getFullName(df, 9) you should get 'j, j'.
So in full, it would be:
df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
getFullName(df, 1) #outputs 'b, b b.'
getFullName(df, 9) #outputs 'j, j'
getFullName(df, 10) #outputs UNKNOWN
Fake data:
import pandas as pd

d = {'EmployeeID' : [0,1,2,3,4,5,6,7,8,9],
     'FirstName' : ['a','b','c','d','e','f','g','h','i','j'],
     'LastName' : ['a','b','c','d','e','f','g','h','i','j'],
     'MiddleName' : ['a','b','c','d','e','f','g','h','i',None]}
df = pd.DataFrame(d)
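One caveat with the function above: the fake data uses None for the missing middle name, but a blank cell loaded with pd.read_excel usually arrives as NaN, which is truthy, so the middle-name check would pass and the string concatenation would then fail. The sketch below is one possible variant, not the original answer's code. It uses pd.notna for the check, returns the string without printing, and adds the period only for a one-letter initial, which is my reading of the question. The name get_full_name, the sample frame and the "Marie" value are made up for illustration.

```python
import pandas as pd


def get_full_name(df, employee_id):
    """Return 'LAST, FIRST [MIDDLE]' for the given ID, or 'UNKNOWN'."""
    match = df[df["EmployeeID"] == employee_id]
    if match.empty:
        return "UNKNOWN"

    row = match.iloc[0]
    name = f"{row['LastName']}, {row['FirstName']}"

    middle = row["MiddleName"]
    # pd.notna is False for both None and NaN, so it covers blank Excel cells.
    if pd.notna(middle) and str(middle).strip():
        middle = str(middle).strip()
        # Assumption: only a single-letter initial gets a trailing period.
        name += f" {middle}." if len(middle) == 1 else f" {middle}"
    return name


# Hypothetical sample data mixing None, NaN and a full middle name.
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "FirstName": ["b", "j", "k", "m"],
    "LastName": ["b", "j", "k", "m"],
    "MiddleName": ["b", None, float("nan"), "Marie"],
})

print(get_full_name(employees, 1))
print(get_full_name(employees, 2))
print(get_full_name(employees, 99))
```

Returning the value and letting the caller decide whether to print it keeps the function usable in further string handling.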
Related
How to not set value to slice of copy [duplicate]
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?

Code:

    import pandas as pd
    from datetime import timedelta

    # set csv file as constant
    TRADER_READER = pd.read_csv('TastyTrades.csv')
    TRADER_READER['Strategy'] = ''

    def iron_condor():
        TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
        a = 0
        b = 1
        c = 2
        d = 3
        for row in TRADER_READER.index:
            start_time = TRADER_READER['Date'][a]
            end_time = start_time + timedelta(seconds=5)
            e = TRADER_READER.iloc[a]
            f = TRADER_READER.iloc[b]
            g = TRADER_READER.iloc[c]
            h = TRADER_READER.iloc[d]
            if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                    if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                        e.loc[e['Strategy']] = 'Iron Condor'
                        f.loc[f['Strategy']] = 'Iron Condor'
                        g.loc[g['Strategy']] = 'Iron Condor'
                        h.loc[h['Strategy']] = 'Iron Condor'
                        print(e, f, g, h)
            if (d + 1) > int(TRADER_READER.index[-1]):
                break
            else:
                a += 1
                b += 1
                c += 1
                d += 1

    iron_condor()

Warning:

    SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      self._setitem_with_indexer(indexer, value)

Hopefully this satisfies the data needed to replicate:

    ,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
    36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
    37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
    38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
    39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
    40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL

Expected result:

    ,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
    36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
    37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
    38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
    39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
    40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start with some improvements to the initial part of your code:

- The leftmost column of your input file is apparently the index column, so it should be read as the index. As a consequence, rows are accessed in a slightly different way (details later).
- The Date column can be converted to datetime64 as early as reading time.

So the initial part of your code can be:

    TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
    TRADER_READER['Strategy'] = ''

Then I decided to organize the loop differently:

- indStart is the integer position within the index column.
- As you process your file in overlapping groups of 4 consecutive rows, a more natural way to organize the loop is to stop at the 4th row from the end. So the loop runs over range(TRADER_READER.index.size - 3).
- The index labels of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
- The check of a particular row can be performed with a nested function.
- To avoid your warning, values in the Strategy column should be set using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
- The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.

So the code can be something like this:

    def iron_condor():
        def rowCheck(row):
            return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb

        for indStart in range(TRADER_READER.index.size - 3):
            a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
            e = TRADER_READER.loc[a]
            undSymb = e['Underlying Symbol']
            start_time = e.Date
            end_time = start_time + pd.Timedelta(seconds=5)
            if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
                TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
                print('New values:')
                print(TRADER_READER.loc[a:d])

There is no need to increment a, b, c and d, and no break is needed either.
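To try the sliding-window idea without the CSV file, here is a small self-contained sketch. It is a simplified variant rather than the code above: it works on a frame passed as an argument and checks the three following rows with vectorized comparisons. The function name label_groups, its window parameter and the reduced two-column frame are assumptions made for the example; only the dates, symbols and index labels come from the sample data in the question.

```python
import pandas as pd

# Reduced stand-in for the question's data: only the columns the check needs.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-12-31 16:01:44"] * 4 + ["2020-01-03 16:08:13"]
        ),
        "Underlying Symbol": ["QQQ"] * 5,
    },
    index=[36, 37, 38, 39, 40],
)
df["Strategy"] = ""


def label_groups(frame, window=pd.Timedelta(seconds=5)):
    """Mark each run of 4 consecutive rows that share a symbol and a time window."""
    for start in range(len(frame.index) - 3):
        labels = frame.index[start:start + 4]
        first = frame.loc[labels[0]]
        rest = frame.loc[labels[1:]]
        in_window = rest["Date"].between(first["Date"], first["Date"] + window).all()
        same_symbol = (rest["Underlying Symbol"] == first["Underlying Symbol"]).all()
        if in_window and same_symbol:
            # Assign through .loc on the frame itself, so no copy is involved.
            frame.loc[labels[0]:labels[-1], "Strategy"] = "Iron Condor"
    return frame


label_groups(df)
print(df)
```

Label slices with .loc include both endpoints, which is why the slice from the first to the last label covers all four rows.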
Edit

If for some reason you have to do other updates on the rows in question, you can change my code accordingly. But I don't understand "this csv file will make a new column" in your comment. For now, anything you do is performed on the DataFrame in memory. Only after that can you save the DataFrame back to the original file. But note that even your code changes the type of the Date column, so I assume you do it once and after that the type of this column is just datetime64. So you should probably change the type of the Date column as a separate operation and then (possibly many times) update this DataFrame and save the updated content back to the source file.

Edit following the comment as of 21:22:46Z

re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if a match has been found, otherwise None. So you cannot compare this result with the string searched for. If you wanted to get the matched string, you should run e.g.:

    mtch = re.search('.*TO_OPEN$', row['Action'])
    textFound = None
    if mtch:
        textFound = mtch.group(0)

But you actually don't need to do that. It is enough to check whether a match has been found, so the condition can be:

    found = bool(re.search('.*TO_OPEN$', row['Action']))

(note that bool(None) is False, while a re.Match object is always truthy). Yet another (probably simpler and quicker) solution is to run just:

    row.Action.endswith('TO_OPEN')

without invoking any regex function.
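As a quick check that the two approaches agree, the sketch below runs both tests on three Action values taken from the sample data. The list name actions and the result names are made up for the example.

```python
import re

# Action values from the question's sample rows.
actions = ["BUY_TO_OPEN", "SELL_TO_OPEN", "BUY_TO_CLOSE"]

# re.search returns a Match object or None; bool() turns that into True/False.
by_regex = [bool(re.search(r"TO_OPEN$", action)) for action in actions]

# Plain string method, no regex involved.
by_suffix = [action.endswith("TO_OPEN") for action in actions]

print(by_regex)
print(by_suffix)
```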
Here is a quite elaborate post that not only answers your question but also explains in detail why things behave this way: Deal with SettingWithCopyWarning. In short, if you want to set values in the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val.
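A minimal sketch of the df.loc[condition, column] = new_val pattern, assuming a tiny frame built for illustration. The "Opening" label and the variable name opens are hypothetical; the Action values come from the question's sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Action": ["BUY_TO_OPEN", "SELL_TO_OPEN", "BUY_TO_CLOSE"],
        "Strategy": ["", "", ""],
    }
)

# Boolean mask selecting the rows to update.
opens = df["Action"].str.endswith("TO_OPEN")

# One .loc call with both the row mask and the column name writes into df itself.
df.loc[opens, "Strategy"] = "Opening"

print(df)
```

The point is that selection and assignment happen in a single indexing operation on the original frame, rather than assigning into a row or slice that was pulled out earlier.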
how to update/replace list items inside a nested loop after matching the condition?
An image is attached to show what my dataframe df2 looks like. I have written code that checks a condition and, if it matches, updates the items of my list, but it is not working at all: it returns the same list as before, not updated. Is this code wrong? Please suggest.

    emp1=[]
    for j in range(8,df.shape[0],10):
        for i in range(2,len(df.columns)):
            b=df.iloc[j][3]  #values are appended from dataframe to list and values are like['3 : 3','4 : 4',.....]
    ess=[]
    for i in range(df2.shape[0]):
        a=df2.iloc[i][2]
        ess.append(a)  #values taken from file which are(3,4,5,6,7,8,....etc i.e unique id number)
    nm=[]
    for i in range(df2.shape[0]):
        b=df2.iloc[i][3]
        nm.append(b)  #this list contains name of the employees
    ap= [i.split(' : ', 1)[0] for i in emp1]  #split it with ' : ' and stored in two another list(if 3 : 3 then it will store left 3)
    bp= [i.split(' : ', 1)[1] for i in emp1]  #if 3 : 3 the it will store right 3
    cp=' : '  #the purpose is to replace right 3 with the name i.e 3 : nameabc and then again join to the list
    for i in range(len(emp1)):
        for j in range(len(ess)):
            #print(i,j)
            if ap[i]==ess[j]:
                bp[i]=nm[j]
    for i in range(df.shape[0]):
        ap[i]=ap[i]+cp  # adding ' : ' after left integer
    emp = [i + j for i, j in zip(ap, bp)]  # joining both the values

Expected output: if emp1 contains 3 : 3, then after processing it should show 3 : nameabc.
Maybe I missed something, but I don't see you assigning any values to emp1. It is empty, so the list comprehensions for "ap" and "bp" loop over an empty emp1. That may be what is causing the problem.
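Once emp1 actually holds values, the replacement itself can be sketched with a dictionary lookup instead of the nested loops. This is only an illustration under assumptions: the sample contents of emp1 and the id_to_name mapping (which would be built from the ID and name columns of df2) are made up, apart from "3 : 3" and "nameabc" taken from the question.

```python
# Hypothetical sample values; in the question these come from df and df2.
emp1 = ["3 : 3", "4 : 4", "9 : 9"]
id_to_name = {"3": "nameabc", "4": "namexyz"}

emp = []
for item in emp1:
    left, right = item.split(" : ", 1)
    # Replace the right-hand part with the name when the ID is known,
    # otherwise keep the original value.
    emp.append(f"{left} : {id_to_name.get(left, right)}")

print(emp)
```

Note that the split produces strings, so the dictionary keys must be strings too; if the IDs in df2 are integers, they would need converting with str() before comparison.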
How to calculate mode over two columns in a python dataframe?
There are two columns in my csv: FirstName and LastName. I need to find the most common full name. E.g.:

    FirstName  LastName
    A          X
    A          P
    A          Y
    A          Z
    B          X
    B          Z
    C          X
    C          W
    C          W

I have tried using the mode function:

    df["FirstName"].mode()[0]
    df["LastName"].mode()[0]

But it won't work over two columns. The modes of the individual columns are:

    FirstName : A - occurs 4 times
    LastName : X - occurs 3 times

But the output should be "C W", as this is the full name that occurs most often.
You can do:

    (df['FirstName'] + df['LastName']).mode()[0]
    # Output : 'CW'

If you really need a space between the first and last names, you can concatenate ' ' like this:

    (df['FirstName'] + ' ' + df['LastName']).mode()[0]
    # Output : 'C W'
You can combine the columns and find the mode:

    df.apply(tuple, 1).mode()[0]
    # ('C', 'W')
You can concatenate those into a single string with:

    full_names = df.FirstName + df.LastName
    full_names.mode()[0]
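The answers above can be tried on the sample table from the question as follows. The value_counts variant at the end is an extra option that is not in the answers; it assumes pandas 1.1 or later, where DataFrame.value_counts accepts a list of columns. Variable names are made up for the example.

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame(
    {
        "FirstName": list("AAAABBCCC"),
        "LastName": list("XPYZXZXWW"),
    }
)

# Concatenate the two columns and take the mode of the combined strings.
full_names = df["FirstName"] + " " + df["LastName"]
most_common = full_names.mode()[0]

# Alternative: count (FirstName, LastName) pairs and take the most frequent one.
first, last = df.value_counts(["FirstName", "LastName"]).idxmax()

print(most_common)
print(first, last)
```

Counting pairs avoids building a combined string, which matters if names can contain the separator character.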
python - Replace first five characters in a column with asterisks
I have a column called SSN in a CSV file with values like this:

    289-31-9165

I need to loop through the values in this column and replace the first five digits so each value looks like this:

    ***-**-9165

Here's the code I have so far:

    emp_file = "Resources/employee_data1.csv"
    emp_pd = pd.read_csv(emp_file)
    new_ssn = emp_pd["SSN"].str.replace([:5], "*")
    emp_pd["SSN"] = new_ssn

How do I loop through the values and replace just the first five digits with asterisks while keeping the hyphens as is?
Similar to Mr. Me's answer, this instead drops the first 6 characters of each value and puts the masked prefix in their place:

    emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.

Example dataframe (borrowed from @AkshayNevrekar):

    >>> df
               ssn
    0  111-22-3333
    1  121-22-1123
    2  345-87-3425

Result:

    >>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
               ssn
    0  ***-**-3333
    1  ***-**-1123
    2  ***-**-3425

OR

    >>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
    0    ***-**-3333
    1    ***-**-1123
    2    ***-**-3425
    Name: ssn, dtype: object

OR:

    df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
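Put your asterisks in front, then grab the last 4 digits of each value with the .str accessor (plain [-4:] on the column would select the last four rows instead):

    new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]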
You can use regex:

    import re
    import pandas as pd

    df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})

    def func(x):
        return re.sub(r'\d{3}-\d{2}','***-**', x)

    df['ssn'] = df['ssn'].apply(func)
    print(df)

Output:

               ssn
    0  ***-**-3333
    1  ***-**-1123
    2  ***-**-3425
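For comparison, here is a small runnable sketch that applies two of the approaches above to the same column, assuming every value has the NNN-NN-NNNN shape. The two-row frame and the result names are made up for the example.

```python
import pandas as pd

emp_pd = pd.DataFrame({"SSN": ["289-31-9165", "111-22-3333"]})

# Keep the last four characters of each value and prepend the mask.
masked_by_slice = "***-**-" + emp_pd["SSN"].str[-4:]

# Replace the leading 'NNN-NN' with the mask using an anchored regex.
masked_by_regex = emp_pd["SSN"].str.replace(r"^\d{3}-\d{2}", "***-**", regex=True)

print(masked_by_slice.tolist())
print(masked_by_regex.tolist())
```

The two differ on malformed input: the slice version masks whatever precedes the last four characters, while the regex version leaves values that do not match the pattern unchanged.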
Python 3 openpyxl finding all of string in column
I am just starting to learn Python and am looking for some direction on a script I am working on to text out daily pick-ups for my drivers. The vendor name is entered into a spreadsheet along with a purchase order # and notes. What I would like to do is cycle through column "A", find all instances of a vendor name, grab the corresponding B and C cell values and save all the info to a text file. I can get it to work if I name the search string explicitly, but not if it's a variable. Here is what I have so far:

    TestList=[]
    TestDict= {}
    LineNumber = 0
    for i in range(1, maxrow + 1):
        VendorName = sheet.cell(row = i, column = 1)
        if VendorName.value == "CERTIFIED LETTERING":#here is where im lost
            #print (VendorName.coordinate)
            VendLoc = str(VendorName.coordinate)
            TestList.append(VendLoc)
            TestDict[VendorName.value]=[TestList]
    test = (TestDict["CERTIFIED LETTERING"][0])
    ListLength = (len(test))
    ListPo = []
    List_Notes = []
    number = 0
    for i in range (0, ListLength):
        PO = (str('B'+ test[number][1]))
        Note = (str('C'+ test[number][1]))
        ListPo.append(PO)
        List_Notes.append(Note)
        number = number + 1
    number = 0
    TestVend =(str(VendorName.value))
    sonnetFile = open('testsaveforpickups.txt', 'w')
    sonnetFile.write("Pick up at:" + '\n')
    sonnetFile.write(str(VendorName.value)+'\n')
    for i in range (0, ListLength):
        sonnetFile.write ("PO# "+ str(sheet[ListPo[number]].value)+'\n' +"NOTES: " + str(sheet[List_Notes[number]].value)+'\n')
        number = number + 1
    sonnetFile.close()

The results are as follows:

    Pick up at:
    CERTIFIED LETTERING
    PO# 1111111-00
    NOTES: aaa
    PO# 333333-00
    NOTES: ccc
    PO# 555555-00
    NOTES: eee

I've tried everything I could think of to change the current string of "CERTIFIED LETTERING" to a variable name, including creating a list of all vendors in column A and using that as a dictionary to go off of. Any help or ideas to point me in the right direction would be appreciated. And I apologise for any formatting errors. I'm new to posting here.
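One possible direction, sketched here without openpyxl so it runs on its own: group the rows by vendor in a dictionary first, then look up whichever vendor name is held in a variable. The list rows stands in for the worksheet contents; with openpyxl, I believe similar tuples can be obtained from sheet.iter_rows(min_col=1, max_col=3, values_only=True), but treat that as an assumption to verify against your version. The function names, "OTHER VENDOR" and its PO and notes values are made up for illustration.

```python
from collections import defaultdict

# Stand-in for the worksheet rows: (vendor, PO number, notes).
rows = [
    ("CERTIFIED LETTERING", "1111111-00", "aaa"),
    ("OTHER VENDOR", "222222-00", "bbb"),  # hypothetical second vendor
    ("CERTIFIED LETTERING", "333333-00", "ccc"),
]


def group_by_vendor(rows):
    """Collect (PO, notes) pairs under each vendor name."""
    pickups = defaultdict(list)
    for vendor, po, notes in rows:
        if vendor:  # skip rows with an empty vendor cell
            pickups[vendor].append((po, notes))
    return pickups


def format_pickups(pickups, vendor):
    """Build the text for one vendor in the layout shown in the question."""
    lines = ["Pick up at:", vendor]
    for po, notes in pickups.get(vendor, []):
        lines.append(f"PO# {po}")
        lines.append(f"NOTES: {notes}")
    return "\n".join(lines) + "\n"


pickups = group_by_vendor(rows)
vendor_name = "CERTIFIED LETTERING"  # any variable works as the lookup key
report = format_pickups(pickups, vendor_name)
print(report)
```

Because the dictionary is keyed by vendor, looping over pickups would produce one report per vendor, and the text can be written to a file with a single write call.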