Add new column to existing dataframe from substring of existing column

I have a df that looks similar to this:
|email|first_name|last_name|id|group_email|
|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|
I parse out the group_code, the substring after the 3rd hyphen and before the #. I now want to add this substring back into the dataframe for each entry, so the df will look like this:
|email|first_name|last_name|id|group_email|group_code|
|-|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|rate|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|factor|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|drive|
How can I go about doing this?

Let's try
# the repeated group keeps only its last (third) repetition, e.g. '-rate'
df['group_code'] = (df['group_email'].str.extract(r'(-[^#]*){3}')[0]
                    .str.lstrip('-'))
print(df)
email first_name last_name id group_email group_code
0 drew#mail.com drew barry 5 san-red-gate-rate#mail.com rate
1 nate#mail.com nate lewis 3 san-blue-gate-factor#mail.com factor
2 chris#mail.com chris ryan 4 san-red-wheels-drive#mail.com drive
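
As a variation on the answer above, the code can also be captured directly, without the `lstrip` step. This is only a sketch: it assumes the wanted code is always the last hyphen-separated piece before the `#`, which matches the sample rows but may not hold for every address. The small frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Sample rows copied from the question (only the column that matters here).
df = pd.DataFrame({
    "group_email": [
        "san-red-gate-rate#mail.com",
        "san-blue-gate-factor#mail.com",
        "san-red-wheels-drive#mail.com",
    ]
})

# Capture the run of characters that sits between the last hyphen and the '#'.
# expand=False returns a Series instead of a one-column DataFrame.
df["group_code"] = df["group_email"].str.extract(r"-([^-#]+)#", expand=False)

print(df)
```

Rows that do not match the pattern get NaN in `group_code`, which makes unexpected formats easy to spot.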

Related

Is there a way to create a new column for values that are in the original column either using Excel or Python?

I have the below Pandas Series with values that look like this:
Info
# ID 3.22.33.2
Location: Texas
Address: 1321 madeupstreet
Name: mike b
Address: 6.3.1
There are almost 1000 rows with this data so the problem I am having is:
Question
Can I run code in Python or Excel to extract the values from these rows so that the ID numbers go in one column, the Location in another column, and so on?
So it would look something like this:
|IDs|Location|Address|Name|
|-|-|-|-|
|3.22.33.2|Texas|1321 Madeupstreet|mike b|
Some items won't have a Name. In that case, could it just leave the cell blank, or write "No name found"?
I tried creating separate lists (this data came from a text file), but that method is not working for me, and I do not have any code to share at the moment, so I copy-pasted all the values into an Excel sheet.
Note: I do not care about the second Address line, so if it is easier to ignore it, that is fine.

For this specific series, a simple approach you can try is:
df['Index'] = df['Info'].replace("# ","",regex=True).str.split().str[0]
df['Values'] = [' '.join(x) for x in df['Info'].replace("# ","",regex=True).str.split().str[1:]]
output = df.set_index('Index').drop(columns='Info').T
Returning:
Index ID Location: Address: Name: Address:
Values 3.22.33.2 Texas 1321 madeupstreet mike b 6.3.1

IIUC, split the header and values, then pivot_table:
out = (df['Info']
       # raw string, so that \s reaches the regex engine unchanged
       .str.split(r':|(?<=ID)\s', expand=True)
       .set_axis(['col', 'value'], axis=1)
       .assign(index=lambda d: d['col'].str.endswith('ID').cumsum())
       .pivot_table(index='index', columns='col', values='value', aggfunc='first')
       .rename_axis(index=None, columns=None)
      )
output:
# ID Address Location Name
1 3.22.33.2 1321 madeupstreet Texas mike b

The row with the ID entry caused the hassle, but here is my solution.
I first changed the ID entry to contain a : just like all the other rows
ID_index = df.loc[0, 'Info'].index('ID')
df.loc[0, 'Info'] = df.loc[0, 'Info'][:ID_index+2] + ':' + df.loc[0, 'Info'][ID_index+2:]
Then split the rows by : and transpose the series
new_df = df.Info.str.split(':', expand=True).T
new_df.columns = new_df.iloc[0].values
new_df = new_df.iloc[1:]
gives
# ID Location Address Name Address
1 3.22.33.2 Texas 1321 madeupstreet mike b 6.3.1
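
The question also asks what happens when a record has no Name. The sketch below combines the ideas from the answers above (normalise the ID row, split into key and value, then pivot) and fills the gap with "No name found". The second record in the sample data is hypothetical, added only to show the missing-Name case, and the sketch assumes every record starts with a `# ID` row.

```python
import pandas as pd

# Sample data: the first record is from the question; the second is a
# made-up record without a Name row.
info = pd.Series([
    "# ID 3.22.33.2", "Location: Texas", "Address: 1321 madeupstreet",
    "Name: mike b", "Address: 6.3.1",
    "# ID 4.1.1.7", "Location: Ohio", "Address: 9 otherstreet", "Address: 2.2.2",
], name="Info")

# Rewrite "# ID <value>" as "ID: <value>" so every row has the form "key: value".
clean = info.str.replace(r"^#\s*ID\s+", "ID: ", regex=True)

# Split each row into its key and its value.
parts = clean.str.extract(r"^([^:]+):\s*(.*)$")
parts.columns = ["key", "value"]

# Every ID row starts a new record.
parts["record"] = parts["key"].eq("ID").cumsum()

# One row per record; "first" keeps the first Address and ignores the second.
wide = parts.pivot_table(index="record", columns="key", values="value", aggfunc="first")
wide = wide.reindex(columns=["ID", "Location", "Address", "Name"])
wide["Name"] = wide["Name"].fillna("No name found")

print(wide)
```

Keeping only the first Address per record matches the note in the question that the second Address line can be ignored.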

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then, if you want to parse phone numbers, don't reinvent the wheel: use the phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework df2 a bit: sort countryCode in descending order, convert it to string, and set it as the index.
The trick for later is the descending sort. It ensures that a longer country code, such as "1246", is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
                  .str.extract('^(%s)'%'|'.join(df2.index))[0]
                  .map(df2['Carrier'])
                 )
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
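# cast to str first: numbers read from Excel are usually integers, and .str needs strings
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].astype(str).str[:nbr_of_chars]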
Then the matching should be fairly easy with dataframe joins.
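
One thing to watch with a fixed prefix length is that the country codes have different lengths (1 to 4 digits in the sample table), so a 4-character prefix only matches the 4-digit codes directly. A minimal sketch of one way around this is to try the longest prefix first and shorten it until a code is found. The function name `find_carrier` and the plain dict are illustrative assumptions, not part of the answer above.

```python
# Country code -> carrier, taken from the carriers table in the question.
carriers = {"1246": "LIME", "1": "AT&T", "81": "Softbank", "52": "Telmex", "966": "Zain"}


def find_carrier(number, carriers):
    """Return the carrier for the longest country code that prefixes `number`, or None."""
    digits = str(number)
    max_len = max(len(code) for code in carriers)
    # Try the longest possible prefix first so that 1246 wins over 1.
    for length in range(max_len, 0, -1):
        carrier = carriers.get(digits[:length])
        if carrier is not None:
            return carrier
    return None


numbers = [8155555555, 12465555555, 12135555555, 96655555555, 525555555555]
results = [find_carrier(n, carriers) for n in numbers]
print(results)
```

With a DataFrame, the same function could be applied to the Number column, for example with `df['Number'].apply(find_carrier, carriers=carriers)`.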

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
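# car_df is the carriers DataFrame parsed in the question; cast to str for prefix matching
codes = car_df.astype(str).sort_values('countryCode', ascending=False)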
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
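import numpy as np

def cc2carrier(num):
    # Compare as strings so that integer columns read from Excel also work
    matches = codes['countryCode'].apply(lambda x: str(num).startswith(str(x)))
    if not matches.any():  # Not found
        return np.nan
    return codes.loc[matches.idxmax(), 'Carrier']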
Now, apply the function to the numbers dataframe:
num_df['Number'].astype(str).apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

How to split two first names that together in two different words in python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are names rather than dictionary words, and they are Hispanic names.
I would be really grateful if you can help to develop some sort of function to make it happen.

As noted in the comments above, not having a list of possible names will cause a problem. The following is perhaps not perfect, but to offer something, try this.
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
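
The core of the approach above, matching a known surname at the end of the string, can be tried without downloading anything. The sketch below uses a tiny hard-coded surname list, which is a stand-in assumption for the real list, and the helper name `split_joined_name` is hypothetical.

```python
import re

# Hypothetical sample list; a real run would need a much larger list of surnames.
surnames = ["gomez", "delgado", "hernandez", "trujillo"]

# A lazy prefix followed by one known surname anchored at the end of the string.
# re.escape guards against names that contain regex metacharacters.
pattern = re.compile(r"^(.+?)(%s)$" % "|".join(map(re.escape, surnames)))


def split_joined_name(joined):
    """Return (first_part, last_part), or (joined, None) if no known surname ends the string."""
    match = pattern.match(joined)
    if match is None:
        return joined, None
    return match.group(1), match.group(2)


for name in ["trujillohernandez", "sofíagomez", "mariaxyz"]:
    print(name, "->", split_joined_name(name))
```

Names whose ending is not in the list come back unsplit, so they can be collected and reviewed by hand.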

How to use pandas python3 to get just Middle Initial from Middle name column of CSV and write to new CSV

I need help. I have a CSV file that contains names (First, Middle, Last)
I would like to know a way to use pandas to convert Middle Name to just a Middle initial, and save First Name, Middle Init, Last Name to a new csv.
Source CSV
First Name,Middle Name,Last Name
Richard,Dale,Leaphart
Jimmy,Waylon,Autry
Willie,Hank,Paisley
Richard,Jason,Timmons
Larry,Josiah,Williams
What I need new CSV to look like:
First Name,Middle Name,Last Name
Richard,D,Leaphart
Jimmy,W,Autry
Willie,H,Paisley
Richard,J,Timmons
Larry,J,Williams
Here is the Python3 code using pandas that I have so far; it reads the CSV and writes a new CSV file. I just need some help modifying that one column in each row so that only the first character is saved.
'''
Read CSV file with First Name, Middle Name, Last Name
Write CSV file with First Name, Middle Initial, Last Name
Print before and after in the terminal to show work was done
'''
import pandas
from pathlib import Path, PureWindowsPath
winCsvReadPath = PureWindowsPath("D:\\TestDir\\csv\\test\\original-NameList.csv")
originalCsv = Path(winCsvReadPath)
winCsvWritePath = PureWindowsPath("D:\\TestDir\\csv\\test\\modded-NameList2.csv")
moddedCsv = Path(winCsvWritePath)
df = pandas.read_csv(originalCsv, index_col='First Name')
df.to_csv(moddedCsv)
df2 = pandas.read_csv(moddedCsv, index_col='First Name')
print(df)
print(df2)
Thanks in advance..

You can use the str accessor, which allows you to slice strings like you would in normal Python:
df['Middle Name'] = df['Middle Name'].str[0]
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams
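
To cover the "write to a new CSV" part of the question as well, here is a minimal end-to-end sketch. It uses in-memory buffers (`io.StringIO`) in place of the real file paths so that it runs anywhere; with real files, the path objects from the question would be passed to `read_csv` and `to_csv` instead.

```python
import io

import pandas as pd

# In-memory stand-in for the source CSV (first rows from the question).
source = io.StringIO(
    "First Name,Middle Name,Last Name\n"
    "Richard,Dale,Leaphart\n"
    "Jimmy,Waylon,Autry\n"
)

df = pd.read_csv(source)

# Keep only the first character of each middle name.
df["Middle Name"] = df["Middle Name"].str[0]

# index=False stops pandas from writing the row index as an extra column.
out = io.StringIO()
df.to_csv(out, index=False)
result = out.getvalue()

print(result)
```

Rows with a missing middle name stay empty in the output, because `.str[0]` leaves NaN values untouched.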

Or, as another approach, use str.extract.
First, read your CSV file with pandas:
>>> df = pd.read_csv("sample.csv", sep=",")
>>> df
First Name Middle Name Last Name
0 Richard Dale Leaphart
1 Jimmy Waylon Autry
2 Willie Hank Paisley
3 Richard Jason Timmons
4 Larry Josiah Williams
Second, extract the middle initial from the DataFrame,
assuming every name starts with an uppercase letter:
>>> df['Middle Name'] = df['Middle Name'].str.extract(r'([A-Z]\w{0})')
# df['Middle Name'] = df['Middle Name'].str.extract(r'([A-Z]\w{0})', expand=True)
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams

How to remove lines in pandas data frame based on specific character

I got this in my data frame
name : john,
address : Milton Kings,
phone : 43133241
Concern:
customer complaint about the services is so suck
thank you
How can I process the above to remove only the lines of text in the data frame that contain a colon (:)? My objective is to keep only the following line.
customer complaint about the services is so suck
Kindly help.

One thing you can do is to separate the sentence after ':' from your data frame. And you can do this by creating a series from your data frame.
Let's say c is your series.
c=pd.Series(df['column'])
s=[c[i].split(':')[1] for i in range(len(c))]
This separates the text after the colon from the label before it.

Assuming you want to keep the second part of the sentences, you can use the applymap method to solve your problem.
import pandas as pd
#Reproduce the dataframe
l = ["name : john",
"address : Milton Kings",
"phone : 43133241",
"Concern : customer complaint about the services is so suck" ]
df = pd.DataFrame(l)
#split on each element of the dataframe, and keep the second part
df.applymap(lambda x: x.split(":")[1])
input :
0
0 name : john
1 address : Milton Kings
2 phone : 43133241
3 Concern : customer complaint about the services is so suck
output :
0
0 john
1 Milton Kings
2 43133241
3 customer complaint about the services is so suck
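
If the goal is instead what the question literally asks for, dropping every row that contains a colon, a boolean mask is one possible sketch. It assumes each line of the text is its own row, and the column name `text` is made up for the illustration.

```python
import pandas as pd

# One row per line of the text from the question; "text" is a hypothetical column name.
df = pd.DataFrame({
    "text": [
        "name : john,",
        "address : Milton Kings,",
        "phone : 43133241",
        "Concern:",
        "customer complaint about the services is so suck",
        "thank you",
    ]
})

# Keep only the rows that do NOT contain a colon.
# regex=False treats ":" as a plain character rather than a pattern.
kept = df[~df["text"].str.contains(":", regex=False)]

print(kept)
```

Note that "thank you" also survives this filter because it contains no colon, so an extra condition would be needed to keep only the complaint line.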
