It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column containing, say, 5 characters starting right after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing: NaN
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
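As for why the original attempt produced NaN: Series.str.slice is documented as taking plain integer start/stop positions, so passing the Series n does not slice each row at its own offset. Below is a minimal sketch of two regex-free alternatives, assuming the sample data from the question; the column name slize_rowwise is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"name_str": ["alp:ha", "bra:vo", "charl:ie"]})

# Option 1: split once on the first colon, keep the right-hand part,
# then cap it at 5 characters.
df1["slize"] = df1["name_str"].str.split(":", n=1).str[1].str[:5]

# Option 2: slice each string at its own colon position.
pos = df1["name_str"].str.find(":") + 1
df1["slize_rowwise"] = [s[p:p + 5] for s, p in zip(df1["name_str"], pos)]

print(df1)
```

The two options differ on rows without a colon: option 1 yields NaN there, while option 2 falls back to the first 5 characters of the string (because find returns -1), so pick whichever behaviour suits your data.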
I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniform YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The '20' in '{0:20>4}' must be a single fill character, not a multi-character string
Trying to do something like the below just results in df['id'] taking the properties of the last lambda, and trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen so far on SO and in the Pandas documentation only covers a single manipulation at a time.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
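Both answers can also be folded into a single extract-and-rebuild step. This is only a sketch, assuming every ID has the shape YY-digits and that the century is always "20"; the group names year and num are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["07-1401469", "07-89556629", "07-12187595", "07-381962", "07-99999085"]}
)

# Capture the two-digit year and the numeric part in named groups.
parts = df["ID"].str.extract(r"^(?P<year>\d{2})-(?P<num>\d{1,8})$")

# Prefix the assumed century "20" and left-pad the number to 8 digits.
df["ID"] = "20" + parts["year"] + "-" + parts["num"].str.zfill(8)

print(df)
```

Rows that do not fit the pattern come out as NaN rather than being silently passed through, which makes malformed IDs easy to spot.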
I'm trying to modify the Address column data by removing all the characters before the comma.
Sample data:
**ADDRESS**
0 Ksfc Layout,Bangalore
1 Vishweshwara Nagar,Mysore
2 Jigani,Bangalore
3 Sector-1 Vaishali,Ghaziabad
4 New Town,Kolkata
Expected Output:
**ADDRESS**
0 Bangalore
1 Mysore
2 Bangalore
3 Ghaziabad
4 Kolkata
I tried this code, but it's not working. Can someone correct it?
import pandas as pd
import regex as re
data = pd.read_csv("train.csv")
data.ADDRESS.replace(re.sub(r'.*,',"", data.ADDRESS), regex=True, inplace=True)
Try this:
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
You can do it without a regex:
def removeFirst(x):
    return x.split(",")[-1]

df['ADDRESS'] = df['ADDRESS'].apply(removeFirst)
You can try it like this, without a regex:
data['ADDRESS'] = data['ADDRESS'].str.split(',').str[-1]
Use Series.str.replace:
data['ADDRESS'] = data['ADDRESS'].str.replace(r'.*,', '', regex=True)
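To compare the split and regex approaches on slightly messier input, here is a small sketch. The last two rows (one with several commas, one with none) and the column names CITY_SPLIT and CITY_REGEX are made up for illustration; they are not in the question's data.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "ADDRESS": [
            "Ksfc Layout,Bangalore",
            "Sector-1 Vaishali,Ghaziabad",
            "Block A, New Town, Kolkata",  # hypothetical: several commas
            "Mysore",                      # hypothetical: no comma at all
        ]
    }
)

# Split once from the right so only the text after the last comma is kept.
data["CITY_SPLIT"] = data["ADDRESS"].str.rsplit(",", n=1).str[-1].str.strip()

# Regex equivalent: greedily drop everything up to the last comma.
data["CITY_REGEX"] = data["ADDRESS"].str.replace(r"^.*,\s*", "", regex=True)

print(data)
```

Both columns end up identical here; rows without a comma are left unchanged by either approach.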
How can I match each data value against the regex for its type when the regexes live in another dataframe? Here are the sample Data df and Regex df. Note that the two dataframes have different shapes, since the Regex df is only a reference table containing unique values.
**Data df**
**Country Type Data**
MY ABC MY1234567890
IT ABC IT1234567890
PL PQR PL123456
MY ABC 456792abc
IT ABC MY45889976
IT ABC IT56788897

**Regex df**
**Country Type Regex**
MY ABC ^MY[0-9]{10}
IT ABC ^IT[0-9]{10}
PL PQR ^PL
MY DEF ^\w{6,10}$
IT XYZ ^\w{6,10}$
For data that does not match its own regex, how can I look for a match among all the types defined for the same country? For example, 'MY45889976' does not follow the regex for its country (IT) and type (ABC), but it does match another type for that country, (XYZ). In that case I want to add another column that gives the type it matches.
My desired output is something like this,
Country Type Data Data Quality Suggestion
0 MY ABC MY1234567890 1 0
1 IT ABC IT1234567890 1 0
2 IT ABC MY45889976 0 XYZ
3 IT ABC IT567888976 0 XYZ
4 PL PQR PL123456 1 0
5 MY XYZ 456792abc 0 DEF
This is what I have done to match the regex to get the data quality column (after concatenation),
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
But I'm not sure how to move forward. Is there an easy way to do this without concatenation, and how can I find a matching regex by scanning all the types while staying within the same country? Thanks
Refer to: Match column with its own regex in another column Python
Just apply a function to build a new Suggestion column; its logic follows your description.
import re

def func(dfRow):
    # find the regex row with the same Country and Type
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if sameDF.size > 0 and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        return 0
    # otherwise scan every Type of the same Country and return the first match
    sameCountryDF = regexDF.loc[(regexDF['Country'] == dfRow['Country'])]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]

df["Suggestion"] = df.apply(func, axis=1)
I suggest merging on Country and doing both operations in the same DataFrame (finding the regex that matches for the type in data_df and for the type in regex_df):
# First I merge only on country
new_df = pd.merge(df, df_regex, on="Country")
# Then I define an indicator for types that differ between the two DF
new_df["indicator"] = np.where(new_df["Type_x"] == new_df["Type_y"], "both", "right")
# I see if the regex matches Data for the `Type` in df
new_df['Data Quality'] = new_df.apply(lambda x:
                                      np.where(re.match(x['Regex'], x['Data']) and
                                               (x["indicator"] == "both"),
                                               1, 0), axis=1)
# Then I fill Suggestion by looking if the regex matches data for the type in df_regex
new_df['Suggestion'] = new_df.apply(lambda x:
                                    np.where(re.match(x['Regex'], x['Data']) and
                                             (x["indicator"] == "right"),
                                             x["Type_y"], ""), axis=1)
# I remove lines where there is no suggestion and I just added lines from df_regex
new_df = new_df.loc[~((new_df["indicator"] == "right") & (new_df["Suggestion"] == "")), :]
new_df = new_df.sort_values(["Country", "Type_x", "Data"])
# After sorting I move Suggestion up one line
new_df["Suggestion"] = new_df["Suggestion"].shift(periods=-1)
new_df = new_df.loc[new_df["indicator"] == "both", :]
new_df = new_df.drop(columns=["indicator", "Type_y", "Regex"]).fillna("")
And you get this result:
Country Type_x Data Data Quality Suggestion
4 IT ABC IT1234567890 1
8 IT ABC IT56788897 0 XYZ
6 IT ABC MY45889976 0 XYZ
2 MY ABC 456792abc 0 DEF
0 MY ABC MY1234567890 1
10 PL PQR PL123456 1
The last line of your output seems to have the wrong Type since it is not in data_df.
By using your sample data I find ABC for Data == "456792abc" and your suggestion DEF.
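For comparison, here is a merge-free sketch that fills both columns in one pass by precompiling the patterns per country. It assumes the sample frames from the question; the helper name check and the patterns lookup dict are made up for illustration, and an empty string stands in for "no suggestion".

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT"],
    "Type": ["ABC", "ABC", "PQR", "ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT56788897"],
})
regex_df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT"],
    "Type": ["ABC", "ABC", "PQR", "DEF", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL",
              r"^\w{6,10}$", r"^\w{6,10}$"],
})

# Lookup table: country -> {type: compiled pattern}.
patterns = {}
for row in regex_df.itertuples(index=False):
    patterns.setdefault(row.Country, {})[row.Type] = re.compile(row.Regex)

def check(row):
    """Return the quality flag and, on failure, another matching type of the same country."""
    by_type = patterns.get(row["Country"], {})
    own = by_type.get(row["Type"])
    if own is not None and own.match(row["Data"]):
        return pd.Series({"Data Quality": 1, "Suggestion": ""})
    for type_, pat in by_type.items():
        if type_ != row["Type"] and pat.match(row["Data"]):
            return pd.Series({"Data Quality": 0, "Suggestion": type_})
    return pd.Series({"Data Quality": 0, "Suggestion": ""})

result = df.join(df.apply(check, axis=1))
print(result)
```

Compiling each pattern once keeps the per-row work to dictionary lookups and regex matches, and the original row order of df is preserved because nothing is merged or sorted.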
I am trying to create a column based on simple logic, but it does not work.
I'd like to create a new column named 'Commodity' with a simple logic:
if df['ID'].str[:3] = 'FWD':
df['Commodity'] = df['ID'].str[3:6]
My DF looks like this:
df = pd.DataFrame({'ID':['FWDUSD921','FWDNZD344','EUR'], 'Volumes': [10,20,33]})
If no match, leave space blank (or put 0 - does not matter)
I tried lambdas, if, and apply methods but keep getting error messages.
Use a regular expression here
df.assign(Commodity=df.ID.str.extract(r'^FWD(\w{3})'))
ID Volumes Commodity
0 FWDUSD921 10 USD
1 FWDNZD344 20 NZD
2 EUR 33 NaN
Regex Explanation
^ # asserts position at start of line
FWD # matches FWD exactly
( # matching group 1
\w{3} # match 3 characters that match a-zA-Z0-9_
) # end of matching group
If there are any other requirements for what defines a "currency string" (maybe only letters?), you can replace the \w with that requirement.
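For instance, if a currency code should be exactly three uppercase letters, the pattern could be tightened as in this sketch. The fourth row ("FWD12X001") is a made-up example of an ID that \w{3} would accept but the stricter pattern rejects.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["FWDUSD921", "FWDNZD344", "EUR", "FWD12X001"],  # last row is hypothetical
        "Volumes": [10, 20, 33, 5],
    }
)

# expand=False returns a Series, which can be assigned directly to a column.
df["Commodity"] = df["ID"].str.extract(r"^FWD([A-Z]{3})", expand=False)

print(df)
```

Non-matching rows are NaN; chain .fillna("") onto the extract call if a blank is preferred.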
Just create a mask with your condition, and use it with .loc[]
mask = df['ID'].str[:3] == 'FWD'
df.loc[mask, 'Commodity'] = df.loc[mask, 'ID'].str[3:6]
Try:
s = df['ID'].str[:3].eq('FWD')
df.loc[s, 'Commodity'] = df['ID'].str[3:6]
Output:
ID Volumes Commodity
0 FWDUSD921 10 USD
1 FWDNZD344 20 NZD
2 EUR 33 NaN
This should do the trick:
import numpy as np
df['Commodity'] = np.where(df['ID'].str[:3]=='FWD', df['ID'].str[3:6], '')
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str methods to update the column, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same column About; and since the values are converted to lowercase first, the regex no longer needs the uppercase range:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
About
0 alpha
1 omega
2 alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches a single character not present in the list below
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z: a single character in the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line
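The answer never shows how regex_pat was built. Here is a minimal sketch, assuming it was compiled from the pattern described in the explanation ([^a-z]+ anchored with $). The exact pattern and any flags used originally are not recoverable, so treat this definition as a guess.

```python
import re
import pandas as pd

df = pd.DataFrame({"About": ["JOHN23", "PINKO22", "MERRY jen", "Soojan San", "Remo55"]})

# Assumed definition: a trailing run of characters that are not a-z.
regex_pat = re.compile(r"[^a-z]+$")

# Pass regex=True explicitly, since recent pandas versions default to regex=False.
cleaned = df["About"].str.lower().str.replace(regex_pat, "", regex=True)

print(cleaned)
```

Because the pattern is anchored at the end of the string, inner spaces such as the one in "merry jen" survive; only the trailing digits are removed.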