Creating a new column with concatenated values from another column - python

I am trying to create a new column in this data frame. The data set has multiple records for each PERSON because each record is a different account. The new column values should be a combination of the values for each PERSON in the TYPE column. For example, if John Doe has four accounts, the value next to his name in the new column should be a concatenation of the values in TYPE. An example of the final data frame is below. Thanks in advance.
[image: example of the final data frame]

You can do this in two lines (first code, then explanation):
Code:
in: name_types = df.pivot_table(index='Name', values='AccountType', aggfunc=set)
out:
AccountType
Name
Jane Doe {D}
John Doe {L, W, D}
Larry Wild {L, D}
Patti Shortcake {L, W}
in: df['ClientType'] = df['Name'].apply(lambda x: name_types.loc[x]['AccountType'])
Explanation:
The pivot table collects all the AccountTypes for each individual name and removes duplicates by using set as the aggregate function.
The apply function then iterates through each 'Name' in the main data frame, looks up the set of AccountTypes associated with that name in name_types, and adds it to the new column ClientType in the main dataframe.
And you're done!
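(As a side note, a minimal equivalent of that lookup, assuming name_types is indexed by Name as above, is Series.map, which avoids the apply:)
in: df['ClientType'] = df['Name'].map(name_types['AccountType'])  # map each Name to its set of account types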
Addendum:
If you need the column to be a string instead of a set, use:
in: def to_string(the_set):
        string = ''
        for item in the_set:
            string += item
        return string
in: df['ClientType'] = df['ClientType'].apply(to_string)
in: df.head()
out:
Name AccountType ClientType
0 Jane Doe D D
1 John Doe D LDW
2 John Doe D LDW
3 John Doe L LDW
4 John Doe D LDW
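For reference, a more compact route (just a sketch, assuming the same Name and AccountType column names as above) builds the string column in one go with groupby/transform; sorted() is used so the letter order is deterministic, since sets are unordered:
in: df['ClientType'] = df.groupby('Name')['AccountType'].transform(lambda s: ''.join(sorted(set(s))))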

Pandas remove every entry with a specific value

I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the value "Unspecified" or "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to an empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions asked like this, it's just that none of them apply to my situation; most of them are only useful for a few columns, and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
For example, df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post regarding the same thing dating back to 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
The answers from @ThatCSFresher and @Bence will help you remove rows based on a single column, which is great!
However, I think your query has multiple conditions that need to be checked across several columns at once, so an apply with a lambda can do the job; try the following code:
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, the OP actually wants to delete any row that contains an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete rows based specifically on the Address column, then you can just do:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, if any column may contain Undetermined or Unspecified, do the same as the first solution in my post, just replacing the empty string with Undetermined or Unspecified:
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both conditions to get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
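If the two values can show up in any of your 22 columns rather than only in Address, the second mask can be built over the whole frame instead, so no column names are hardcoded (same idea as above, just a sketch):
m2 = df.isin(['Undetermined', 'Unspecified']).any(axis=1)
# m2 is True if a row has one of the values in the list in any column
out = df[~m1 & ~m2]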
Using a lambda function
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*

How to check if a column value is valid by regex?

How do I check whether a column's values are valid against a regex in Pandas?
I have tried this way:
r = df['ADDRESS'].str.findall('^((?:0[1-9]|1[0-2]))[.\/\\\\-]*([0-9]{2})[.\/\\\\]([1|2][0-9]{3})$')
if not r:
    df['ADDRESS'] = "Bird"
It does not work for me.
My goal is to move values from columns to specific columns based on their content.
Example:
ID NAME EMAIL
1 #GMAIL.COM Inigma
# Mary 4
As a result, the values should be arranged under the correct column names:
ID NAME EMAIL
1 Inigma #GMAIL.COM
2 Mary #
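As a rough sketch of the validity check (assuming the goal, as in the attempt above, is to replace ADDRESS values that do not match the date pattern with "Bird"), str.match gives a boolean mask per row that can drive the replacement:
pattern = r'^(0[1-9]|1[0-2])[./\\-]*([0-9]{2})[./\\]([12][0-9]{3})$'
valid = df['ADDRESS'].str.match(pattern, na=False)  # True where the value matches the pattern
df.loc[~valid, 'ADDRESS'] = 'Bird'                  # replace non-matching values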

How do I extract values from different columns after a groupby in pandas?

I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd
inputcsv = "path/to/input.csv"    # path to the input csv file
outputcsv = "path/to/output.csv"  # path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)
#First I do a groupby on GroupID, concatenate the values in the Person column, and finally remove the duplicate person values from the output before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC
df.Person=df.Person.str.split(';')#1st split the string to list
df['Person']=df.groupby(['GroupID']).Person.transform(lambda x : ';'.join(set(sum(x,[]))))# then we do transform , this will add each group rowwise same result , link https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
df=df.loc[df.Parent.eq('Yes')] # then using Parent to filter
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
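To match the desired output exactly, the Parent column can then be dropped before writing (a sketch; outputcsv is the path variable from the question):
df.drop('Parent', axis=1).to_csv(outputcsv, index=False)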

How to use pandas python3 to get just Middle Initial from Middle name column of CSV and write to new CSV

I need help. I have a CSV file that contains names (First, Middle, Last)
I would like to know a way to use pandas to convert Middle Name to just a Middle initial, and save First Name, Middle Init, Last Name to a new csv.
Source CSV
First Name,Middle Name,Last Name
Richard,Dale,Leaphart
Jimmy,Waylon,Autry
Willie,Hank,Paisley
Richard,Jason,Timmons
Larry,Josiah,Williams
What I need the new CSV to look like:
First Name,Middle Name,Last Name
Richard,D,Leaphart
Jimmy,W,Autry
Willie,H,Paisley
Richard,J,Timmons
Larry,J,Williams
Here is the Python3 code using pandas that I have so far that is reading and writing to a new CSV file. I just need some help modifying that one column of each row, saving just the first character.
'''
Read CSV file with First Name, Middle Name, Last Name
Write CSV file with First Name, Middle Initial, Last Name
Print before and after in the terminal to show work was done
'''
import pandas
from pathlib import Path, PureWindowsPath
winCsvReadPath = PureWindowsPath("D:\\TestDir\\csv\\test\\original-
NameList.csv")
originalCsv = Path(winCsvReadPath)
winCsvWritePath= PureWindowsPath("D:\\TestDir\\csv\\test\\modded-
NameList2.csv")
moddedCsv = Path(winCsvWritePath)
df = pandas.read_csv(originalCsv, index_col='First Name')
df.to_csv(moddedCsv)
df2 = pandas.read_csv(moddedCsv, index_col='First Name')
print(df)
print(df2)
Thanks in advance.
You can use the str accessor, which allows you to slice strings like you would in normal Python:
df['Middle Name'] = df['Middle Name'].str[0]
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams
Or, just to show another approach, with str.extract:
Your csv file processing with pandas:
>>> df = pd.read_csv("sample.csv", sep=",")
>>> df
First Name Middle Name Last Name
0 Richard Dale Leaphart
1 Jimmy Waylon Autry
2 Willie Hank Paisley
3 Richard Jason Timmons
4 Larry Josiah Williams
Second, extract the Middle Name initial from the DataFrame,
assuming all the names start with an upper-case letter.
>>> df['Middle Name'] = df['Middle Name'].str.extract('([A-Z]\w{0})')
# df['Middle Name'] = df['Middle Name'].str.extract('([A-Z]\w{0})', expand=True)
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams
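End to end, the whole task from the question is only a few lines (a sketch; the directory prefixes are left off the file names from the question):
import pandas as pd

df = pd.read_csv("original-NameList.csv")       # source CSV with full middle names
df['Middle Name'] = df['Middle Name'].str[0]    # keep only the initial
df.to_csv("modded-NameList2.csv", index=False)  # write the new CSV without the index column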

Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning more and more about Python and using it in my day-to-day operations analyst position. One of the tasks I have is sorting through approximately 60k unique record identifiers and searching through another dataframe that has approximately 120k records of interactions, the employee who authored the interaction, and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifier (known as Case File Identifier), Note Text, Created On.
My setup currently sorts through and matches my data at approximately 2,500 rows per minute, so a run takes roughly 25-30 minutes. What I am curious about is whether there are any steps I performed that are:
Redundant and inefficient, slowing my process down overall
A poor use of syntax to work around my lack of knowledge.
Below is my code:
nok_data = pd.read_csv("raw nok data.csv") #Data set from warehouse
main_data = pd.read_csv("exampledata.csv") #Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist()) #Commented out, used to grab header titles if needed.
data_length = len(main_data) #used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"]) #convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
list_data = []
nok = nok_data["Case File Identifier"] == row
matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True] #takes created on date only if nok shows row was true
if len(matching_dates) > 0:
try:
min_dates = matching_dates.min(axis=0)
earliest_nok[row] = [min_dates[0], min_dates[1]]
except ValueError:
error_count += 1
earliest_nok[row] = None
row_count += 1
print("{} out of {} records").format(row_count, data_length)
with open('finaloutput.csv','wb') as csv_file:
writer = csv.writer(csv_file)
for key, value in earliest_nok.items():
writer.writerow([key, value])
Looking for any advice or expertise from those who have been writing code like this much longer than I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there, not including any sample data.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By: Case File Identifier: Note Text: Authored on
John Doe 2017-023594 Random Text 4/1/2017 13:24:35
John Doe 2017-023594 Random Text 4/1/2017 13:11:20
Jane Doe 2017-023590 Random Text 4/3/2017 09:32:00
Jane Doe 2017-023590 Random Text 4/3/2017 07:43:23
Jane Doe 2017-023590 Random Text 4/3/2017 7:41:00
John Doe 2017-023592 Random Text 4/5/2017 23:32:35
John Doe 2017-023592 Random Text 4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    .groupby('Case File Identifier:')['Authored on', 'Authored By:'].first()
)
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').iteritems()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.
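For instance, a sketch of that last step, replacing the csv.writer loop with a single call:
df.reset_index().to_csv('finaloutput.csv', index=False)  # writes Case File Identifier, Authored on, Authored By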
