I am trying to create a column based on simple logic, but it does not work.
I'd like to create a new column named 'Commodity' with a simple logic:
if df['ID'].str[:3] = 'FWD':
df['Commodity'] = df['ID'].str[3:6]
My DataFrame looks like this:
df = pd.DataFrame({'ID':['FWDUSD921','FWDNZD344','EUR'], 'Volumes': [10,20,33]})
If there is no match, leave the cell blank (or put 0, it does not matter).
I tried lambdas, if statements, and apply, but I keep getting error messages.
Use a regular expression here (expand=False makes str.extract return a Series for the single capture group):
df.assign(Commodity=df.ID.str.extract(r'^FWD(\w{3})', expand=False))
ID Volumes Commodity
0 FWDUSD921 10 USD
1 FWDNZD344 20 NZD
2 EUR 33 NaN
Regex Explanation
^ # asserts position at start of line
FWD # matches FWD exactly
( # matching group 1
\w{3} # match 3 characters that match a-zA-Z0-9_
) # end of matching group
If there are any other requirements for what defines a "currency string" (maybe only letters?), you can replace the \w with that requirement.
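As a minimal sketch of that last point, assuming a currency code means exactly three uppercase letters (that definition is my assumption, not something stated in the question), the pattern can be tightened and the non-matching rows blanked out:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['FWDUSD921', 'FWDNZD344', 'EUR'],
                   'Volumes': [10, 20, 33]})

# Assumption: a currency code is exactly three uppercase letters.
# expand=False returns a Series because there is only one capture group.
codes = df['ID'].str.extract(r'^FWD([A-Z]{3})', expand=False)

# Rows that do not match come back as NaN; replace them with an empty string.
df['Commodity'] = codes.fillna('')
print(df)
```

Use fillna(0) instead if a zero is preferred for the rows without a match.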
Just create a mask with your condition, and use it with .loc[]
mask = df['ID'].str[:3] == 'FWD'
df.loc[mask, 'Commodity'] = df.loc[mask, 'ID'].str[3:6]
Try:
s = df['ID'].str[:3].eq('FWD')
df.loc[s, 'Commodity'] = df['ID'].str[3:6]
Output:
ID Volumes Commodity
0 FWDUSD921 10 USD
1 FWDNZD344 20 NZD
2 EUR 33 NaN
This should do the trick:
import numpy as np
df['Commodity'] = np.where(df['ID'].str[:3]=='FWD', df['ID'].str[3:6], '')
How can I match a data value against its regex when the regex is stored in another dataframe? Here are the sample Data df and Regex df. Note that the two have different shapes, because the Regex df is only a reference table and contains unique values.
Data df:
Country  Type  Data
MY       ABC   MY1234567890
IT       ABC   IT1234567890
PL       PQR   PL123456
MY       ABC   456792abc
IT       ABC   MY45889976
IT       ABC   IT56788897

Regex df:
Country  Type  Regex
MY       ABC   ^MY[0-9]{10}
IT       ABC   ^IT[0-9]{10}
PL       PQR   ^PL
MY       DEF   ^\w{6,10}$
IT       XYZ   ^\w{6,10}$
For data that does not match its own regex, how can I look for a match within the same Country by scanning all the types that country has? For example, 'MY45889976' does not match the regex for its country (IT) and type (ABC), but it does match another type of that country, (XYZ). So a new column should be added that gives the type it matches.
My desired output is something like this,
Country Type Data Data Quality Suggestion
0 MY ABC MY1234567890 1 0
1 IT ABC IT1234567890 1 0
2 IT ABC MY45889976 0 XYZ
3 IT ABC IT567888976 0 XYZ
4 PL PQR PL123456 1 0
5 MY XYZ 456792abc 0 DEF
This is what I have done to match the regex to get the data quality column (after concatenation),
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
But I'm not sure how to move forward. Is there an easy way to do this without concatenation, and how can I find a matching regex by scanning all types while staying within the same country? Thanks
Refer to: Match column with its own regex in another column Python
Just apply a function to create the new Suggestion column; its logic follows your description.
import re

def func(dfRow):
    # Look up the regex registered for the same Country and Type
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if not sameDF.empty and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        return 0
    # Otherwise try every regex of the same Country and return the first matching Type
    sameCountryDF = regexDF.loc[regexDF['Country'] == dfRow['Country']]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]

df["Suggestion"] = df.apply(func, axis=1)
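For larger frames, filtering regexDF on every row can get slow. Here is a rough sketch of the same idea that compiles the patterns once and groups them by country. The names patterns and check are hypothetical helpers of mine, and the frames are rebuilt from the sample data in the question:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Country': ['MY', 'IT', 'PL', 'MY', 'IT', 'IT'],
    'Type':    ['ABC', 'ABC', 'PQR', 'ABC', 'ABC', 'ABC'],
    'Data':    ['MY1234567890', 'IT1234567890', 'PL123456',
                '456792abc', 'MY45889976', 'IT56788897'],
})
regex_df = pd.DataFrame({
    'Country': ['MY', 'IT', 'PL', 'MY', 'IT'],
    'Type':    ['ABC', 'ABC', 'PQR', 'DEF', 'XYZ'],
    'Regex':   [r'^MY[0-9]{10}', r'^IT[0-9]{10}', r'^PL',
                r'^\w{6,10}$', r'^\w{6,10}$'],
})

# Hypothetical lookup table: country -> list of (type, compiled pattern).
patterns = {}
for row in regex_df.itertuples(index=False):
    patterns.setdefault(row.Country, []).append((row.Type, re.compile(row.Regex)))

def check(country, type_, data):
    """Return (data_quality, suggestion) for one row."""
    candidates = patterns.get(country, [])
    # Quality is 1 when the pattern of the row's own Country and Type matches.
    if any(t == type_ and p.match(data) for t, p in candidates):
        return 1, 0
    # Otherwise suggest the first other Type of the same Country that matches.
    for t, p in candidates:
        if t != type_ and p.match(data):
            return 0, t
    return 0, 0

results = [check(c, t, d) for c, t, d in zip(df['Country'], df['Type'], df['Data'])]
df['Data Quality'] = [quality for quality, _ in results]
df['Suggestion'] = [suggestion for _, suggestion in results]
print(df)
```

This avoids both the concatenation and the per-row DataFrame filtering; whether it is actually faster on your data is something to measure rather than assume.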
I suggest the following, merging by Country and doing both operations in the same DataFrame (finding regex that match for the type in data_df and for the type in regex_df) as follows:
# First I merge only on country
new_df = pd.merge(df, df_regex, on="Country")
# Then I define an indicator for types that differ between the two DF
new_df["indicator"] = np.where(new_df["Type_x"] == new_df["Type_y"], "both", "right")
# I see if the regex matches Data for the `Type` in df
new_df['Data Quality'] = new_df.apply(lambda x:
                                      np.where(re.match(x['Regex'], x['Data']) and
                                               (x["indicator"] == "both"),
                                               1, 0), axis=1)
# Then I fill Suggestion by looking if the regex matches data for the type in df_regex
new_df['Suggestion'] = new_df.apply(lambda x:
                                    np.where(re.match(x['Regex'], x['Data']) and
                                             (x["indicator"] == "right"),
                                             x["Type_y"], ""), axis=1)
# I remove lines where there is no suggestion and I just added lines from df_regex
new_df = new_df.loc[~((new_df["indicator"] == "right") & (new_df["Suggestion"] == "")), :]
new_df = new_df.sort_values(["Country", "Type_x", "Data"])
# After sorting I move Suggestion up one line
new_df["Suggestion"] = new_df["Suggestion"].shift(periods=-1)
new_df = new_df.loc[new_df["indicator"] == "both", :]
new_df = new_df.drop(columns=["indicator", "Type_y", "Regex"]).fillna("")
And you get this result:
Country Type_x Data Data Quality Suggestion
4 IT ABC IT1234567890 1
8 IT ABC IT56788897 0 XYZ
6 IT ABC MY45889976 0 XYZ
2 MY ABC 456792abc 0 DEF
0 MY ABC MY1234567890 1
10 PL PQR PL123456 1
The last line of your expected output seems to have the wrong Type, since that Type is not in data_df.
Using your sample data, I get Type ABC for Data == "456792abc", together with your suggestion DEF.
The question has been asked a lot, however I'm still not close to the solution. I have a column which looks something like this
What I want to do is separate the country and language in different columns like
Country Language
Vietnam Vietnamese_display 1
Indonesia Tamil__1
India Tamil_Video_5
I'm using the following code to get it done; however, there are a lot of factors that need to be taken into account and I'm not sure how to handle them.
df[['Country', 'Language']] = df['Line Item'].str.split('_\s+', n=1, expand=True)
How can I skip the first "_" to get my desired results? Thanks
You may use
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
Details
^ - start of string
_* - 0 or more underscores
([^_]+) - Capturing group 1: any one or more chars other than _
_ - a _ char
(.+) - Group 2: any one or more chars other than line break chars.
Pandas test:
df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1','Indonesia_Tamil__1','India_Tamil_Video_5']})
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
df
# Line Item Country Language
# 0 Vietnam_Vietnamese_display 1 Vietnam Vietnamese_display 1
# 1 Indonesia_Tamil__1 Indonesia Tamil__1
# 2 India_Tamil_Video_5 India Tamil_Video_5
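If the values never start with an underscore (an assumption on my part, based only on the three sample rows), a regex may not be needed at all. A small sketch that splits on the first underscore only:

```python
import pandas as pd

df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1',
                                 'Indonesia_Tamil__1',
                                 'India_Tamil_Video_5']})

# n=1 limits the split to the first underscore, so any later
# underscores stay inside the Language part.
df[['Country', 'Language']] = df['Line Item'].str.split('_', n=1, expand=True)
print(df)
```

Unlike the str.extract pattern above, this does not skip leading underscores, so a value such as '_India_Tamil' would produce an empty Country.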
It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column that would comprise - say 5 characters - that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing (all NaN):
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
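The original attempt most likely fails because str.slice expects plain integer positions, not a Series of per-row offsets. As a regex-free sketch of the same result (my alternative, not part of the answer above), split on the first colon and then cut the remainder to 5 characters:

```python
import pandas as pd

df1 = pd.DataFrame({'name_str': ["alp:ha", "bra:vo", "charl:ie"]})

# Split once on ':', take the part after it, then keep at most 5 characters.
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]
print(df1)

# A longer value shows the 5-character limit taking effect.
long_example = pd.Series(['x:abcdefgh']).str.split(':', n=1).str[1].str[:5]
print(long_example)
```

Rows without a colon end up as NaN here, which matches what str.extract does for non-matching rows.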
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five digits (only) with asterisks and keep the hyphens as they are?
Similar to Mr. Me's answer, but this instead drops the first 6 characters and puts the masked prefix in their place.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with replace() method:
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
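Put your asterisks in front, then grab the last 4 digits of each value with the .str accessor (plain [-4:] on the column would slice the last four rows instead):
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]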
You can use a regex:
import re
import pandas as pd

df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})

def func(x):
    # Mask the "ddd-dd" group; the last four digits stay visible
    return re.sub(r'\d{3}-\d{2}','***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
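All of the answers above hard-code the 3-2-4 layout. If the goal is literally "mask every digit except the last four and leave the hyphens alone", a plain Python helper applied per value is one way to sketch it. mask_ssn and its keep parameter are hypothetical names of mine, and the sample values stand in for the CSV column:

```python
import pandas as pd

def mask_ssn(value, keep=4):
    """Replace every digit except the last `keep` digits with '*'."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - keep, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append('*')   # mask this digit
            to_mask -= 1
        else:
            out.append(ch)    # hyphens and the trailing digits pass through
    return ''.join(out)

# Sample values in place of emp_pd = pd.read_csv(...)
emp_pd = pd.DataFrame({'SSN': ['289-31-9165', '111-22-3333']})
emp_pd['SSN'] = emp_pd['SSN'].map(mask_ssn)
print(emp_pd)
```

This is slower than the vectorized regex replace, so for a large file the regex version is probably the better choice when the format is known to be fixed.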
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains a column named 'About', which holds the rows of data I want to manipulate.
I've already used the .str methods to update it, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. Because both operations work on the same column About, they can be chained together, and because the values are already converted to lowercase, the regex only needs to exclude lowercase letters and spaces (regex=True is required on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
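To make the "assign the output" point concrete, here is a small sketch showing that the .str methods return a new Series and leave the DataFrame untouched until the result is written back. The variable names lowered and unchanged are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'About': ['AaSD14%', 'SDD Aa']})

# Calling a .str method returns a new Series; df itself is not modified.
lowered = df['About'].str.lower()
unchanged = df['About'].tolist()

# Only an explicit assignment updates the column.
df['About'] = df['About'].str.lower().str.replace(r'[^a-z ]', '', regex=True)
print(unchanged)
print(df)
```

The same applies before to_csv: whatever is in df at that moment is what gets written, so the assignment has to happen first.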
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution (another way, using a compiled regex):
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ - matches one or more characters that are not in the range a (index 97) to z (index 122), case sensitive
+ - quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ - asserts position at the end of a line
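The snippet above never shows how regex_pat is defined. Going only by the explanation, it was presumably something like re.compile(r'[^a-z]+$'); treat that definition as my assumption. A self-contained sketch under that assumption:

```python
import re
import pandas as pd

df = pd.DataFrame({'About': ['JOHN23', 'PINKO22', 'MERRY jen',
                             'Soojan San', 'Remo55']})

# Assumed definition, reconstructed from the explanation: strip a trailing
# run of characters that are not lowercase letters.
regex_pat = re.compile(r'[^a-z]+$')

# regex=True is passed explicitly because recent pandas versions
# reject a compiled pattern when regex is False.
cleaned = df['About'].str.lower().str.replace(regex_pat, '', regex=True)
print(cleaned)
```

Note that this pattern only trims characters at the end of each value; digits or symbols in the middle of a string are left in place, unlike the '[^a-z ]' pattern used in the other answers.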