Python: Match string and replace in place

I have two dataframes (DF1 & DF2) with multiple columns of text information. I need to match codes from DF1 and update one column in DF2.
DF1:
Code  Name
A     A: Andrew
B     B: Bill
C     C: Chuck

DF2:
Number  Codes
1       A
2       B;C
3       A;C
My required output is to transform DF2 as follows:
Number  Codes
1       A: Andrew
2       B: Bill;C: Chuck
3       A: Andrew;C: Chuck
So far I have tried to use:
df2['Codes'] = df2['Codes'].replace(to_replace="A", value="A: Andrew", regex=True)
But this is not practical for larger datasets.
Do I use the same df.replace function and loop to find and replace every code? Or is there a better way to do it?
One option I'm trying to learn about is re.sub() with regex, but I'm new to regex and still learning the basics.

You can split the Name column to build a dict with zip, then pass it to replace:
di = dict(zip(df1.Name.str.split(':').str[0], df1.Name))
df2['Codes'] = df2['Codes'].replace(di, regex=True)

Related

Python - how to store multiple variables in one column of a .csv file and then read those variables into a list

I think this issue is pretty basic, but I haven't seen it answered online. I'm using Python with pandas installed to make things easier (if there's a way to do it without pandas, that would be awesome too!). I'm coding a node-connected map: I want to take in a .csv file with 'previous' and 'next' node lists, have the program read that data, and store it in lists. For example:
.csv file:
Name     Previous  Next
Alpha    one two   Three
Beta     four      five
Charlie  six       seven eight
What I want in my program:
alpha, [one, two], [three]
beta, [four], [five]
charlie, [six], [seven, eight]
I have heard about two ways to write multiple values in one .csv column.
One way is placing a space between the two values:
alpha,one two,three
Another way I've heard is to use quotation marks and separate with a comma:
alpha,"one,two",three
Although I have heard about these approaches before, I haven't been able to implement them. When my program reads the data, it treats the space or the comma as part of the string.
import pandas as pd

file = pd.read_csv("connections.csv")
previous_alpha = []
previous_alpha.append(file.Previous[0])
So, instead of a list of two strings, [one, two], my program ends up with a list containing one string that looks like ["one,two"] or [one two].
I can change the way the variables are structured in the .csv file or the code reading in the data. Thanks for all the help in advance!
There are multiple ways of doing this, depending on how your CSV data is structured to begin with.
The first layout keeps each record on a single CSV row, with quoted lists of values:
Name,Previous,Next
Alpha,"One,Two",Three
Beta,Four,Five
Charlie,Six,"Seven,Eight"
Note the quotation marks around the lists. We can use apply to change the values; the convert function just splits the string using , as the delimiter.
import pandas as pd

def convert(x):
    return x.split(',')

df = pd.read_csv('file.csv')
df['Previous'] = df['Previous'].apply(convert)
df['Next'] = df['Next'].apply(convert)
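As a side note, pandas' built-in string accessor can do the same split without a helper function; a sketch under the same assumptions:

# str.split with an explicit ',' delimiter mirrors the convert function above
df['Previous'] = df['Previous'].str.split(',')
df['Next'] = df['Next'].str.split(',')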
In the second layout, the row is repeated per Name, one value per row:
Name,Previous,Next
Alpha,One,Three
Alpha,Two,Three
Beta,Four,Five
Charlie,Six,Seven
Charlie,Six,Eight
We can use the agg function to aggregate. The convert function drops the duplicates and returns the values as a list.
import pandas as pd

def convert(x):
    return x.drop_duplicates().to_list()

df = pd.read_csv('file.csv')
df = df.groupby('Name').agg({'Previous': convert, 'Next': convert})
The results should look like this:
           Previous            Next
Name
Alpha    [One, Two]         [Three]
Beta         [Four]          [Five]
Charlie       [Six]  [Seven, Eight]
If you have this DataFrame:
      Name Previous         Next
0    Alpha  one two        Three
1     Beta     four         five
2  Charlie      six  seven eight
Then you can split the strings in the required columns and save the CSV normally:
df["Previous"] = df["Previous"].str.split()
df["Next"] = df["Next"].str.split()
print(df)
df.to_csv("data.csv", index=False)
      Name    Previous            Next
0    Alpha  [one, two]         [Three]
1     Beta      [four]          [five]
2  Charlie       [six]  [seven, eight]
To load the data back, you can use pd.read_csv with the converters= parameter:
from ast import literal_eval

df = pd.read_csv(
    "data.csv", converters={"Previous": literal_eval, "Next": literal_eval}
)
print(df)
Prints:
      Name    Previous            Next
0    Alpha  [one, two]         [Three]
1     Beta      [four]          [five]
2  Charlie       [six]  [seven, eight]

Python: How to rename a set of columns in multiple data frames in a certain way

I have a few data frames and each has multiple columns with names defined in the same way.
Below is an example:
person_name  birth_dt_1
Bob          1991-01-05
Abby         1994-09-09
Elsa         1956-08-15
I'd like to find a way to replace the underscores in column names with spaces and capitalize the first letter of each word. Numbers in column names can stay as they are.
Below is what I want:
Person Name  Birth Dt 1
Bob          1991-01-05
Abby         1994-09-09
Elsa         1956-08-15
I don't want to use pandas' rename function with an explicit mapping, because specifying each column's name would be redundant given that I have multiple data frames, each with multiple columns.
Any suggestions on how to do this efficiently? Maybe define a function that can be applied to multiple datasets?
Thanks in advance!
Try this
df.columns = df.columns.str.title().str.replace('_', ' ')
Out[387]:
  Person Name  Birth Dt 1
0         Bob  1991-01-05
1        Abby  1994-09-09
2        Elsa  1956-08-15
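Since the question asks about multiple data frames, one way to reuse this one-liner is to wrap it in a small helper; a hedged sketch where df1, df2 and df3 are placeholder names:

# Hypothetical helper: applies the same renaming to any dataframe
def clean_columns(df):
    df.columns = df.columns.str.title().str.replace('_', ' ')
    return df

df1, df2, df3 = (clean_columns(d) for d in (df1, df2, df3))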
You can also leverage the re package in Python: re.findall extracts the word and number substrings, which can then be joined with spaces.
Example:
import re

def rename_cols(name):
    return " ".join([i.capitalize() for i in re.findall('[a-z0-9]+', name)])

df_clean = df.rename(columns=rename_cols)
Out[1]:
  Person Name  Birth Dt 1
0         Bob  1991-01-05
1        Abby  1994-09-09
2        Elsa  1956-08-15

Loop over different Dataframes Pandas

Hello World,
I have 52 dataframes (df and data variants), each containing only Names starting with one letter of the alphabet.
Below is an example for letters A and B (for both df and data):
dfA                 dfB
Code  Name          Code  Name
15    Amiks         68    Bernard
157   Alis          14    Barpti

dataA               dataB
Code  Name          Code  Name
      Amiks               Berti
      Alis                Bernard
      Anatole             Barpti
Question:
I'm not an expert in Python. How can I loop over the dataframes pairwise by letter, checking each data frame only against its counterpart with the same letter rather than against all of them?
For example, check whether:
dataA.Name is in dfA.Name ?
dataB.Name is in dfB.Name ?
dataZ.Name is in dfZ.Name ?
Edit
The original DFs are below:
df                  data
Code  Name          Code  Name
15    Amiks               Amiks
157   Alis                Alis
14    Barpti              Bernard
68    Bernard             Barpti
I just created one df per first letter. The aim is to speed up computation: instead of checking within the whole DF, we only check rows that share the same first letter.
Thanks to anyone helping.
You should consider variable names as meaningful only to you, not to the Python interpreter. Of course, there are hacky ways using locals() or globals(), but they should never be used in normal code.
That means that if your dataframes are related, you should use a data structure to express the relation. You could use two lists or a list of pairs, for example:
dataframes = [ (dfA, dataA), (dfB, dataB), ...]
If the letter itself matters, you could use a dictionary:
dataframes = { 'A': (dfA, dataA), 'B': (dfB, dataB), ...}
Then you can easily iterate:
for letter in dataframes:
    commons = dataframes[letter][0].merge(dataframes[letter][1], on='Name',
                                          suffixes=('_x', ''))[['Code', 'Name']]
    # process commons, which contains the codes and names where dataX.Name is in dfX.Name
    ...
But unless the original dataframe is really huge, I would suggest benchmarking the single-dataframe, first-letter approach against the 52-dataframes one. Pandas is rather efficient provided everything fits in memory...
For those who might be interested, below is what I have done:
Step 1: re-merge all df's and data's.
The output is:
Df (containing all letters)
Data (containing all letters)
Step 2: Retrieve the first letters of Name
import numpy as np

# Function to retrieve the unique first letters in a column
def first_letter(data, col):
    a = data[col].map(lambda x: x[0])
    b = np.unique(a)
    c = b.tolist()
    return c
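For what it's worth, the same list can be built in one line with pandas' string accessor (assuming Name has no missing values):

# First character of each Name, deduplicated, as a plain list
first = df['Name'].str[0].unique().tolist()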
Step 3: Loop over DF and Data
Compare only Names for which the first letter is the same.
# Apply the function to retrieve the unique first letters of the Name column
first = first_letter(df, 'Name')

# Loop over each first letter, matching only rows that start with that letter in both DFs
for letter in first:
    df_first = df[df['Name'].str.startswith(letter)]
    data_first = data[data['Name'].str.startswith(letter)]
    # Process
    ....
With this code, I match Names only against those sharing the same first letter, instead of searching the whole DF.
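A similar sketch using groupby avoids re-scanning df once per letter (same assumption that Name has no missing values):

# Group df by the first letter of Name, then slice data to the same letter
for letter, df_first in df.groupby(df['Name'].str[0]):
    data_first = data[data['Name'].str[0] == letter]
    common = df_first.merge(data_first, on='Name')
    # process common ...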

Compare value of rows in Dataframe

I want to know if the values in two different columns of a DataFrame are the same.
My df looks something like this:
df['Name1']: Alex, Peter, Herbert, Seppi, Huaba
df['Name2']: Alexander, peter, herbert, Sepp, huaba
First I want to apply .rstrip() and .lower(), but these methods only work on strings. I tried str(df['Name1']), which worked, but the comparison gave me the wrong result.
I also tried the following:
df["Name1"].isin(df["Name2"]).value_counts())
df["Name1"].eq(df["Name2"]).value_counts())
Problem 1: I think .isin also returns True if a substring is found, e.g. checking whether alex is in alexander would return True, which is not what I'm looking for.
Problem 2: I think .eq would do it for me, but I still have the problem with the .rstrip() and .lower() methods.
What is the best way to count the number of matching entries?
print(df)
     Name1      Name2
0     Alex  Alexander
1    Peter      peter
2  Herbert    herbert
3    Seppi       Sepp
4    Huaba      huaba
If you need to compare row by row:
out1 = df["Name1"].str.lower().eq(df["Name2"].str.lower()).sum()
If you need to compare all values of Name1 against all values of Name2:
out2 = df["Name1"].str.lower().isin(df["Name2"].str.lower()).sum()
Use set to find the common values between the two dataframe columns:
common_values = list(set(df.Name1) & set(df.Name2))
count = len(common_values)
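Note that the set intersection is case-sensitive; on the sample data above it would be empty (Peter vs peter, and so on). To mirror the case-insensitive comparison, lowercase both columns first:

# Lowercase before intersecting so 'Peter' and 'peter' count as a match
common_values = list(set(df.Name1.str.lower()) & set(df.Name2.str.lower()))
count = len(common_values)  # 3 for the sample data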

Pattern Match in List of Strings, Create New Column in pandas

I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains with the list values concatenated by "|" for alternation, but I am stumped on how to create a column holding the exact match. Any tips or tricks appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
                             .findall('|'.join(product_name))
                             .str[0])
Result:
>>> df
   id product_name_extract product_name_mapped
0   1             00012CDN               12CDN
1   2          14311121NDC               21NDC
2   3              NDC37ba                37ba
3   4               47CD27                7CD2
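An equivalent sketch uses str.extract, which returns the first match directly and yields NaN where nothing matches; re.escape is added here in case a code ever contains regex metacharacters:

import re

# Build one alternation pattern, escaping each code to be safe
pattern = '|'.join(map(re.escape, product_name))
df['product_name_mapped'] = df.product_name_extract.str.extract(f'({pattern})', expand=False)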
