I am trying to split the strings in one column and write the parsed part into a different column.
I was splitting on '_' in the column data, and it worked well as long as the number of '_' characters in each string was fixed at 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code I have been trying:
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace(
'.*_', '', regex=True
)
print(df['Col2'])
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna if you don't want the last row)
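A minimal sketch of why the original attempt breaks and the split does not, using a shortened copy of the question's data. The greedy pattern `.*_` removes everything up to the last underscore, so rows with a trailing `_(2).jpg.pdf` lose the part that was wanted. Taking the third piece of the split is unaffected by extra underscores. The variable names `greedy` and `third` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [
    "U_a65839_Jan87Apr88",
    "U_b98652_Feb88Apr88_(2).jpg.pdf",
    None,
]})

# Greedy '.*_' consumes everything up to the LAST underscore,
# so the second row keeps only the file suffix.
greedy = df["Col1"].str.replace(r".*_", "", regex=True)

# Splitting on '_' and taking the third piece ignores any later underscores.
# Missing values stay missing.
third = df["Col1"].str.split("_").str[2]

print(greedy)
print(third)
```

This assumes the wanted token is always the third underscore-separated field, as in the sample data.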
Use str.extract here:
import re
df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
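A self-contained run of the str.extract approach on the question's data might look like the following. The pattern looks for two consecutive "three letters, two digits" tokens, matched case-insensitively. Passing `expand=False` is a choice made here so that a Series is returned instead of a one-column DataFrame.

```python
import re
import pandas as pd

df = pd.DataFrame({"Col1": [
    "U_a65839_Jan87Apr88",
    "U_b98652_Feb88Apr88_(2).jpg.pdf",
    "V_C56478_mar89Apr89",
    "Q_d15634_Apr90Apr91",
    "Q_d15634_Apr90Apr91_(3).jpeg.pdf",
    "S_e15336_may91Apr93",
    None,
]})

# Two repetitions of <3 letters><2 digits>, e.g. 'Jan87Apr88'.
pattern = r"((?:[a-z]{3}\d{2}){2})"

# Rows without a match (including missing values) become NaN.
df["col2"] = df["Col1"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

print(df)
```

Unlike the split, this does not depend on the position of the token, only on its shape.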
Based on your question, pandas' apply can be a good solution.
First, clean the DataFrame by replacing NaNs with the empty string '':
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
Next, define a function that extracts the required string with a regex:
import re
def fun(s):
    # Two consecutive <3 letters><2 digits> tokens, e.g. 'Jan87Apr88'
    m = re.search(r'[A-Za-z]{3}\d{2}[A-Za-z]{3}\d{2}', s)
    if m:
        return m.group(0)
    return ''
Then apply the function to the column:
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.
Does anyone know how I'd format this string (which is a column in a dataframe) to be a float so I can sort by the column please?
£880,000
£88,500
£850,000
£845,000
i.e. I want this to become
88,500
845,000
850,000
880,000
Thanks in advance!
Assuming 'col' the column name.
If you just want to sort and keep the values as strings, you can use natsort:
from natsort import natsort_key
df.sort_values(by='col', key=natsort_key)
# OR
from natsort import natsort_keygen
df.sort_values(by='col', key=natsort_keygen())
output:
col
1 £88,500
3 £845,000
2 £850,000
0 £880,000
If you want to convert to floats:
df['col'] = pd.to_numeric(df['col'].str.replace(r'[^\d.]', '', regex=True))
df.sort_values(by='col')
output:
col
1 88500
3 845000
2 850000
0 880000
If you want strings, you can use str.lstrip:
df['col'] = df['col'].str.lstrip('£')
output:
col
0 880,000
1 88,500
2 850,000
3 845,000
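If the goal is to sort numerically while keeping the original "£" strings untouched, one option is to pass a conversion function as the `key` of sort_values (available since pandas 1.1). This is a sketch under that assumption, and the helper name `to_number` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": ["£880,000", "£88,500", "£850,000", "£845,000"]})

def to_number(s: pd.Series) -> pd.Series:
    # Drop everything except digits and the decimal point, then parse.
    return pd.to_numeric(s.str.replace(r"[^\d.]", "", regex=True))

# The key is only used for ordering; the column itself is not modified.
sorted_df = df.sort_values(by="col", key=to_number)

print(sorted_df)
```

The column stays a string column, so the currency formatting survives the sort.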
I am trying to remove alphabetic characters and special characters (,) from the column values. When I try to remove the alphabetic characters, I get NaN as output.
Input Data :
col2
2565.0
23899
876.44
1765.7
3,253.0CA
9876.9B
Output Data :
col2
2565.0
23899
876.44
1765.7
3253.0
9876.9
Code I have been using:
df['col2'] = df['col2'].str.replace(r"[a-zA-Z]",'')
df['col2']=df['col2'].fillna('').str.replace(',',"").astype(float)
Please suggest how to resolve this.
Use Series.replace with a regex that matches everything except digits and the dot:
df['col2'] = df.col2.replace(r'[^\d.]', '', regex=True).astype(float)
Output
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90
Use Series.str.replace:
df['col2'] = df['col2'].str.replace(r'[a-zA-Z,]','', regex=True).astype(float)
print (df)
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90
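One likely cause of the NaN values in the question is a column that mixes real numbers with strings: the `.str` accessor returns NaN for every element that is not a string. The sketch below assumes such a mixed column (the sample values are a shortened copy of the question's data) and casts everything to `str` before cleaning.

```python
import pandas as pd

# Assumed situation: numbers and strings mixed in one object column.
df = pd.DataFrame({"col2": [2565.0, 23899, 876.44, "3,253.0CA", "9876.9B"]})

# .str methods skip non-string elements and yield NaN for them.
via_str = df["col2"].str.replace(r"[a-zA-Z,]", "", regex=True)

# Casting to str first makes every element visible to the regex.
cleaned = (
    df["col2"]
    .astype(str)
    .str.replace(r"[^\d.]", "", regex=True)
    .astype(float)
)

print(via_str)
print(cleaned)
```

Note that the pattern `[^\d.]` would also strip a minus sign or an exponent, so it only suits non-negative plain decimals like the ones shown.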
Suppose I have a df:
df = pd.DataFrame({'col': ['ABCXDEF', 'ABCYDEF']})
How can I extract the string that is surrounded by ABC & the first occurrence of DEF? Desired output:
col
0 X
1 Y
Note that I don't want a solution based on exact positions, like:
df.col.str[3:4]
(update: look for the first occurrence of 'DEF')
Use this regex:
df = pd.DataFrame({'col': ['ABCXDEF', 'ABCYDEFDEFDEF']})
print(df.col.str.extract(r"ABC(.*?)DEF"))
The result is:
0
0 X
1 Y
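The `?` after `.*` is what makes the match stop at the first 'DEF'. A short comparison of the lazy and greedy forms on the same data (the variable names `lazy` and `greedy` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col": ["ABCXDEF", "ABCYDEFDEFDEF"]})

# Lazy quantifier: capture as little as possible before the first 'DEF'.
lazy = df["col"].str.extract(r"ABC(.*?)DEF", expand=False)

# Greedy quantifier: capture as much as possible, up to the last 'DEF'.
greedy = df["col"].str.extract(r"ABC(.*)DEF", expand=False)

print(lazy.tolist())
print(greedy.tolist())
```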
I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify the name of the result column?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting DataFrame has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Is there any way to provide a header for the column? Thanks
The simplest way is to remove as_index=False so that a Series is returned, and pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
You can use named aggregation in agg to assign a column name to the result of a function. (Passing a dict such as {'variant': func} to agg on a single column only worked in older pandas; since pandas 1.0 it raises SpecificationError.)
# Sample data (thanks @jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False).agg(
...     variant=('VALUE', ' '.join))
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
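For comparison, here is a runnable sketch that builds the result both ways: with `reset_index(name=...)` on the grouped Series, and with the keyword form of named aggregation (`new_name=(column, function)`), which assumes pandas 0.25 or later. Both are expected to give the same three columns.

```python
import pandas as pd

all_columns = pd.DataFrame({
    "VALUE": ["a", "s", "d", "ss", "t", "y"],
    "URL": [5, 5, 4, 4, 4, 4],
    "INDEX": list("aaabbb"),
})

grouped = all_columns.groupby(["INDEX", "URL"])

# Option 1: aggregate to a Series, then name it while resetting the index.
via_reset = grouped["VALUE"].apply(" ".join).reset_index(name="variant")

# Option 2: named aggregation, output column name given as the keyword.
via_agg = grouped.agg(variant=("VALUE", " ".join)).reset_index()

print(via_reset)
print(via_agg)
```

Either form avoids the separate rename step from the question.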
Is there a way to return the name/header of a column into a string in a pandas dataframe? I want to work with a row of data which has the same prefix. The dataframe header looks like this:
col_00 | col_01 | ... | col_51 | bc_00 | cd_00 | cd_01 | ... | cd_90
I'd like to apply a function to each row, but only from col_00 to col_51 and from cd_00 to cd_90 separately. To do this, I thought I'd collect the column names into a list, e.g. to_work_with would be the list of columns starting with the prefix 'col', and apply the function to df[to_work_with]. Then I'd change to_work_with so that it contains the list of columns starting with the 'cd' prefix, et cetera. But I don't know how to iterate through the column names.
So basically, the thing I'm looking for is this function:
to_work_with = column names in the df that start with "thisstring"
How can I do that? Thank you!
You can use boolean indexing with str.startswith:
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
Sample:
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 6 7 8 9
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
# if you want to apply a function to the filtered columns only
def f(x):
    return x + 1
df[cols] = df[cols].apply(f)
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 7 8 9 10
Another solution with list comprehension:
cols = [col for col in df.columns if col.startswith("cd")]
print (cols)
['cd_00', 'cd_01', 'cd_02', 'cd_90']
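To cover the "one prefix after another" part of the question, the selection can be wrapped in a small helper and looped over the prefixes. This is only a sketch: `columns_with_prefix` is a hypothetical helper name, and the row-wise sum stands in for whatever function is actually needed.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9]],
    columns=["col_00", "col_01", "col_02", "col_51", "bc_00",
             "cd_00", "cd_01", "cd_02", "cd_90"],
)

def columns_with_prefix(frame, prefix):
    # Hypothetical helper: names of all columns starting with `prefix`.
    return [c for c in frame.columns if c.startswith(prefix)]

# Apply a row-wise function (a sum, as a placeholder) to each group of columns.
row_sums = {
    prefix: df[columns_with_prefix(df, prefix)].sum(axis=1)
    for prefix in ("col_", "cd_")
}

print(row_sums["col_"])
print(row_sums["cd_"])
```

Including the underscore in the prefix keeps 'col_' from accidentally matching other columns whose names merely begin with 'col'.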