How to insert a string value into a specific column value in pandas? - python

I have the following dataframe.
import pandas as pd
data=['ABC1','ABC2','ABC3','ABC4']
data = pd.DataFrame(data,columns=["Column A"])
  Column A
0     ABC1
1     ABC2
2     ABC3
3     ABC4
How can I insert a "-" after the "ABC" in Column A of data?
Output:
  Column A
0    ABC-1
1    ABC-2
2    ABC-3
3    ABC-4

The simplest solution is to use the replace method with regex=True, plus inplace=True to make the change permanent in the dataframe.
data['Column A'].replace(['ABC'], 'ABC-', regex=True, inplace=True)
print(data)
  Column A
0    ABC-1
1    ABC-2
2    ABC-3
3    ABC-4
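If the prefix isn't always the literal ABC, a regex that targets the trailing digits works for any prefix (a sketch, assuming the dash always goes immediately before the final run of digits):
# Capture the trailing digits and re-insert them after a dash.
data['Column A'] = data['Column A'].str.replace(r'(\d+)$', r'-\1', regex=True)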

A possible solution is
data['Column A'] = data['Column A'].str[:-1] + '-' + data['Column A'].str[-1]
print(data)
#   Column A
# 0    ABC-1
# 1    ABC-2
# 2    ABC-3
# 3    ABC-4
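Keep in mind this slicing approach assumes exactly one trailing character; with a multi-digit suffix it splits in the wrong place:
s = pd.Series(['ABC12'])
print(s.str[:-1] + '-' + s.str[-1])
# 0    ABC1-2    (not ABC-12)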

Here's a way that only assumes the digits to be preceded by a dash are at the end (using [A-Za-z] rather than [A-z], which would also match the punctuation between Z and a):
df['ColumnA'].str.split(r'([A-Za-z]+)(\d+)').str.join('-').str.strip('-')
0    ABC-1
1    ABC-2
2    ABC-3
3    ABC-4
Another example:
df = pd.DataFrame({'ColumnA':['asf1','Ads2','A34']})
Will give:
df['ColumnA'].str.split(r'([A-Za-z]+)(\d+)').str.join('-').str.strip('-')
0    asf-1
1    Ads-2
2     A-34

Related

Distinguish repeating column names by adding an integer using pandas

I have some columns that share the same names. I would like to add a 1 to the repeating column names.
Data
Date      Type  hi  hello  stat  hi  hello
1/1/2022  a     0   0      1     1   0
Desired
Date      Type  hi  hello  stat  hi1  hello1
1/1/2022  a     0   0      1     1    0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can use a mask, but I'm not sure how to achieve this efficiently without naming the actual columns. I would like to pass the full dataset and let the algorithm update the duplicates.
Any suggestion is appreciated
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
#        Date Type hi hello stat hi.1 hello.1
# 0  1/1/2022    a  0     0    1    1       0
This is what pandas uses to deduplicate column headers from read_csv().
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
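Since _maybe_dedup_names() is a private API that can move between pandas versions, here is a minimal public-API sketch of the same read_csv-style suffixing (the helper name dedup_names is made up):
from collections import Counter

def dedup_names(names):
    # First occurrence keeps its name; repeats get '.1', '.2', ...
    counts = Counter()
    out = []
    for name in names:
        n = counts[name]
        out.append(name if n == 0 else f'{name}.{n}')
        counts[name] += 1
    return out

df.columns = dedup_names(df.columns)
# ['Date', 'Type', 'hi', 'hello', 'stat', 'hi.1', 'hello.1']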
You need to apply the duplicated operation to the column names, then map the duplication information to a string that you can append to the original column names.
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
       Date Type hi hello stat hi.1 hello.1
0  1/1/2022    a  0     0    1    1       0
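For reference, the cumcount step on its own produces the per-name occurrence counter that drives the '.1', '.2' suffixes:
cols = pd.Index(['hi', 'hello', 'hi', 'hello', 'hi'])
print(cols.to_series().groupby(cols).cumcount().tolist())
# [0, 0, 1, 1, 2]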

Add a character in a string inside a column dataframe

I have a dataframe with some numbers (or strings, it doesn't actually matter). The thing is that I need to add a character in the middle of them. The dataframe looks like this (I got it from Google Takeout)
id  A       B
1   512343  -1234
1   213     1231345
1   18379   187623
And I want to add a comma in the second position
id  A        B
1   51,2343  -12,34
1   21,3     12,31345
1   18,379   18,7623
A and B are actually longitude and latitude, so I don't think the comma can be placed in the semantically right spot (there is no way to know whether a coordinate should have one or two integer digits), but putting the comma in the second position would do the trick.
This should do the trick:
df[["A", "B"]]=df[["A", "B"]].astype(str).replace(r"(\d{2})(\d+)", r"\1,\2", regex=True)
Outputs:
   id        A         B
0   1  51,2343    -12,34
1   1     21,3  12,31345
2   1   18,379   18,7623
Here's another approach with str.extract:
for c in ['A', 'B']:
    df[c] = df[c].astype(str).str.extract(r'(-?\d{2})(\d*)').agg(','.join, axis=1)
Output:
   id        A         B
0   1  51,2343    -12,34
1   1     21,3  12,31345
2   1   18,379   18,7623
You could do something like this -
import numpy as np
df['A'] = np.where(df['A']>=0,'', '-') + ( df['A'].abs().astype(str).str[:2] + ',' + df['A'].abs().astype(str).str[2:] )
df['B'] = np.where(df['B']>=0,'', '-') + ( df['B'].abs().astype(str).str[:2] + ',' + df['B'].abs().astype(str).str[2:] )
df
   id        A         B
0   1  51,2343    -12,34
1   1     21,3  12,31345
2   1   18,379   18,7623
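The two near-identical lines can be folded into a small helper (a sketch; insert_comma is a made-up name):
import numpy as np

def insert_comma(s, pos=2):
    # Set the sign aside, then splice a comma in after `pos` digits.
    digits = s.abs().astype(str)
    sign = np.where(s >= 0, '', '-')
    return sign + digits.str[:pos] + ',' + digits.str[pos:]

for col in ['A', 'B']:
    df[col] = insert_comma(df[col])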

Split Dataframe from back to front

Does somebody know how to make a split from back to front? When I make a split like
dfgeo['geo'].str.split(',',expand=True)
I have:
1,2,3,4,nan,nan,nan
but I want
nan,nan,nan,4,3,2,1
thanks peopleee :)
If you're looking to reverse the column order you can do this:
new_df = dfgeo['geo'].str.split(',', expand=True)
new_df[new_df.columns[::-1]]
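Note the column labels themselves stay in reversed order (6, 5, 4, ...); if you want 0..n-1 back, reset them:
new_df = new_df[new_df.columns[::-1]]
new_df.columns = range(new_df.shape[1])  # restore 0..n-1 labels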
Try this:
list(reversed(dfgeo['geo'].str.split(',',expand=True)))
Assuming your code returns a list!
Use iloc with ::-1 to swap the order of the columns:
dfgeo = pd.DataFrame({'geo': ['1,2,3,4', '1,2,3,4,5,6,7']})
print(dfgeo)
             geo
0        1,2,3,4
1  1,2,3,4,5,6,7
import numpy as np

df = dfgeo['geo'].str.split(',', expand=True).iloc[:, ::-1]
# if necessary, set default column names
df.columns = np.arange(len(df.columns))
print(df)
      0     1     2  3  4  5  6
0  None  None  None  4  3  2  1
1     7     6     5  4  3  2  1

How do I specify a column header for pandas groupby result?

I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify what the resulting column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Is there any way to provide a header for the column? Thanks
The simplest fix is to remove as_index=False so that a Series is returned, and pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
print(all_columns)
  INDEX  URL VALUE
0     a    5     a
1     a    5     s
2     a    4     d
3     b    4    ss
4     b    4     t
5     b    4     y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
                                 .apply(' '.join) \
                                 .reset_index(name='variant')
print(all_columns_grouped)
  INDEX  URL variant
0     a    4       d
1     a    5     a s
2     b    4  ss t y
You can use agg when applied to a column (VALUE in this case) to assign column names to the result of a function.
# Sample data (thanks @jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
...     {'variant': lambda x: ' '.join(x)})
  INDEX  URL variant
0     a    4       d
1     a    5     a s
2     b    4  ss t y
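Note that this dict-renaming form of agg was deprecated in pandas 0.20 and removed in 1.0; on pandas 0.25+, named aggregation gives the same result:
all_columns.groupby(['INDEX', 'URL'], as_index=False).agg(
    variant=('VALUE', ' '.join))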

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
1              3           0               0
2              1           1               5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
1              3           0               0       2    0
2              1           1               5       2    1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because the original data set might have many cm variations.)
You can use:
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5
First you can filter the columns containing the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum (True is converted to 1 and False to 0). You need to count by unique column names, so group by the columns and sum the values.
df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True
print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
You need the unique column names, which are to be added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Last, you can add the new columns (here df[['cm1','cm2']]) from the groupby result to the original df:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
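The whole pipeline can also be condensed into three lines with a single regex over the column names (a sketch of the same steps, assuming the tokens always look like cm plus digits):
sub = df.filter(regex='cm').astype(bool)
sub.columns = sub.columns.str.extract(r'(cm\d+)', expand=False)
df[sub.columns.unique()] = sub.groupby(sub.columns, axis=1).sum()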
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
#                                      ^ if you are hard-coding this in you might as well use 2
so that each column maps to cm plus the single character directly following it; in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dict iteration follows insertion order (and was arbitrary before Python 3.7), so it may be preferable to add the new columns sorted by their target names (like the answer here):
import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
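For comparison, the same loop can be vectorized (a sketch, assuming col_map is built as above; it uses the same axis=1 groupby as the earlier answer):
counts = (df[list(col_map)] != 0).astype(int)
counts.columns = [col_map[c] for c in counts.columns]
df = df.join(counts.groupby(counts.columns, axis=1).sum())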
