how to split a column by another column in pandas dataframe - python

I am cleaning data in a pandas dataframe and I want to split one column using another column.
Specifically, I want to split column 'id' by column 'eNBID', but I don't know how.
import pandas as pd
id_list = ['4600375067649','4600375077246','460037495681','460037495694']
eNBID_list = ['750676','750772','749568','749569']
df=pd.DataFrame({'id':id_list,'eNBID':eNBID_list})
df.head()
id eNBID
4600375067649 750676
4600375077246 750772
460037495681 749568
460037495694 749569
What I want:
df.head()
id eNBID
460-03-750676-49 750676
460-03-750772-46 750772
460-03-749568-1 749568
460-03-749569-4 749569
# column 'eNBID' is the third part of column 'id'; the values in column 'eNBID' are 6 or 7 digits long.

Considering that the leading 46003 will remain the same for all ids:
import re
df['id'] = df.apply(lambda x: '-'.join([i[:3] + '-' + i[3:] if '460' in i else i for i in re.findall(r'(\w*)(' + x.eNBID + r')(\w*)', x.id)[0]]), axis=1)
Output
id eNBID
0 460-03-750676-49 750676
1 460-03-750772-46 750772
2 460-03-749568-1 749568
3 460-03-749569-4 749569

Considering a '-' after the 3rd, 5th, and 11th characters:
df['id'] = df['id'].apply(lambda s: s[:3] + '-' + s[3:5] + '-' + s[5:11] + '-' + s[11:])
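Since the values in 'eNBID' can be 6 or 7 digits, a position-based slice breaks for the longer values. A minimal sketch (assuming each eNBID occurs exactly once inside its id; split_id is a name made up here) that splits on the eNBID value itself instead:
def split_id(row):
    # split the id around its own eNBID value (assumes it occurs exactly once)
    prefix, enb, suffix = row['id'].partition(row['eNBID'])
    return f"{prefix[:3]}-{prefix[3:]}-{enb}-{suffix}"

df['id'] = df.apply(split_id, axis=1)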

Related

Pandas reorder raw content

I have an Excel file which I've converted to a DataFrame, dropping 2 columns using the code below:
df = pd.read_excel(self.file)
df.drop(['Name', 'Scopus ID'], axis=1, inplace=True)
Now my target is to switch the order of all names within the df.
For example, the first name is Adedokun, Babatunde Olubayo, which I would like to convert to Babatunde Olubayo Adedokun.
How can I do that for the entire df, whatever the name is?
Split the name and concatenate the parts back together in the new order.
import pandas as pd
data = {'Name': ['Adedokun, Babatunde Olubayo', "Uwizeye, Dieudonné"]}
df = pd.DataFrame(data)
def swap_name(name):
    name = name.split(', ')
    return name[1] + ' ' + name[0]
df['Name'] = df['Name'].apply(swap_name)
df
Output:
> Name
> 0 Babatunde Olubayo Adedokun
> 1 Dieudonné Uwizeye
Let's assume you want to do the operation on "Other Names1" (note the reversal of the split parts, which the swap requires):
df.loc[:, "Other Names1"] = df["Other Names1"].str.split(", ").apply(lambda row: " ".join(row[::-1]))
You can use str accessor:
df['Name'] = df['Name'].str.split(', ').str[::-1].str.join(' ')
print(df)
# Output
Name
0 Babatunde Olubayo Adedokun
1 Dieudonné Uwizeye
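One caveat worth noting (not from the original answers): swap_name assumes every value contains ', ' and raises an IndexError otherwise. A defensive sketch (swap_name_safe is a hypothetical name) that leaves values without a comma untouched:
def swap_name_safe(name):
    # only swap when the "Last, First" pattern is actually present
    parts = name.split(', ', 1)
    return parts[1] + ' ' + parts[0] if len(parts) == 2 else name

df['Name'] = df['Name'].apply(swap_name_safe)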

Merge several rows to a single row with lists in the cells only if elements are different

I have the following table:
source_system geo_id product_subfamily product_deny_list product_allow_list transaction_deny_list operation_allow_list operation_filter
0 CONFIRMING_SCHF FRK CASH_MGMT ' ' 'CNF' ' ' ' ' NaN
1 EQUATION_SCHF FRK CASH_MGMT 'CD','TEST','CB' 'CA' '408','805','385','856','320','420','825','355... ' ' NaN
I would like to convert it to this new table of one single row:
source_system geo_id product_subfamily product_deny_list product_allow_list transaction_deny_list operation_allow_list operation_filter
0 [CONFIRMING_SCHF, EQUATION_SCHF] FRK CASH_MGMT ['CD','TEST','CB'] ['CNF', 'CA'] ' ' ' ' NaN
During the conversion, a list should be created in a cell only if the elements across the multiple rows are different; if they are the same, only the single value should be kept. If one row has a blank string and the other row has a non-blank value, the blank string should be dropped and the other value kept.
How could I do this?
Thanks in advance.
Mini example of your data + solution:
d = {'source_system ': ['CONFIRMING_SCHF ', 'EQUATION_SCHF'], 'geo_id': ['FRK', 'FRK']}
df = pd.DataFrame(data=d)
df_list = df.apply(lambda x: list(set(x)))
df = pd.DataFrame(data=df_list).T
Result: (shown as an image in the original answer)
import pandas as pd
import numpy as np
def apply_func(x):
    x = list(filter(None, set(x)))  # Filter blank values
    if len(x) <= 1:
        return ''.join(x)
    return x

df = pd.DataFrame((['CONFIRMING_SCHF', 'FRK', 'CASH_MGMT', '', 'CNF', '', '', np.nan],
                   ['EQUATION_SCHF', 'FRK', 'CASH_MGMT', "'CD','TEST','CB'", 'CA', '408355', '', np.nan]),
                  columns=['source_system', 'geo_id', 'product_subfamily', 'product_deny_list', 'product_allow_list', 'transaction_deny_list', 'operation_allow_list', 'operation_filter'])
df['UNIQUE'] = 1
df_list = df.groupby('UNIQUE').agg(apply_func) #You can apply reset_index(drop=True) as well
df_list
This might not be the proper solution for you because I have added an additional column named UNIQUE, but it produces the output you expected, and you can implement almost all of your conditions inside the apply_func function.
Output
source_system geo_id product_subfamily product_deny_list product_allow_list transaction_deny_list operation_allow_list operation_filter
UNIQUE
1 [CONFIRMING_SCHF, EQUATION_SCHF] FRK CASH_MGMT 'CD','TEST','CB' [CNF, CA] 408355 [nan, nan]
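If the helper UNIQUE column is a concern, a sketch of the same idea without it, building the single summary row column by column with the apply_func defined above (summary is a name made up here):
summary = pd.DataFrame([{col: apply_func(df[col]) for col in df.columns}])
summary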

divide the row into two rows after several columns

I have a CSV file and I am trying to split each row into multiple rows if it contains more than 4 columns.
Example input and expected output were provided as images in the original question.
Is there a way to do that in pandas or Python?
Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends a numeric suffix to the duplicate column names.
For example, this CSV file (shown as an image in the original answer) becomes the following dataframe:
df = pd.read_csv("Book1.csv")
df
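Since the original screenshots are not reproduced here, a minimal sketch with made-up data (the demo names and columns below are hypothetical) showing how pandas renames duplicate CSV headers:
import io
import pandas as pd

demo_csv = "id,x1,y1,x1,y1\nA,1,2,3,4\n"  # hypothetical CSV text with repeated headers
demo = pd.read_csv(io.StringIO(demo_csv))
print(demo.columns.tolist())  # ['id', 'x1', 'y1', 'x1.1', 'y1.1']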
Now, to solve your question, let's consider the dataframe read from Book1.csv above as the input dataframe.
Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id','x1','y1','x2','y2']
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop = True)
Result: (shown as an image in the original answer)
You can first set the video column as the index, then concat each remaining group of 4 columns into a new dataframe. Finally, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))
df_ = pd.concat(dfs).reset_index()
The same thing can be written as a one-line list comprehension; note the comma after the row slice in .iloc, which is easy to miss and causes a positional indexing error if left out:
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)]).reset_index()
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779

How to split column values in a pandas dataframe

How do I split a single column in a DataFrame that contains a string, without creating more columns, and get rid of the brackets?
For example, the rows look like this:
df = pd.DataFrame({'Ala Carte': ['||LA1: 53565 \nCH2: 54565'],
                   'Blistex': ['|Cust: 65565\nCarrier: 2565|'],
                   'Dermatology': ['||RTR1\n65331\n\nRTR2\n65331']})
And I would like for the output dataframe to look like this, where the information column is a string:
Customer Information
Ala Carte LA1: 53565
CH2: 54565
Blistex Cust: 65565
Carrier: 2565
Dermatology RTR1: 65331
RTR2: 65331
Within the same column for Information
This should do it :
import pandas as pd
### CREATE DATAFRAME
df = pd.DataFrame({'name' : ['Ala Carte', 'Blistex'],
'information': ['||LA1: 53565 \nCH2: 54565',
'|Cust: 65565\nCarrier: 2565|']
})
### SPLIT COLUMNS INTO A LIST
df['information'] = df['information'].apply(lambda x: x.replace("|", "").split("\n"))
### EXPLODE THE COLUMN
df.explode('information')
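As a follow-up (not part of the original answer), explode returns a new dataframe rather than modifying df in place, so assign it back; you may also want to strip stray whitespace around the split values:
df = df.explode('information').reset_index(drop=True)
df['information'] = df['information'].str.strip()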
I decided to just replace the '\n' with '||' as a way to separate the two different values, and combine the columns using this function:
import numpy as np

def combine_with_nan(x, cols):
    combined = ''
    for column in cols:
        try:
            np.isnan(x[column])  # NaN cells contribute an empty string
            temp = ''
        except:
            temp = x[column]
        combined = combined + ' || ' + temp
    return combined

cols = ['Columns you want to merge']  # replace with the actual column names
practicedf = practicedf.apply(combine_with_nan, axis=1, args=(cols,)).to_frame().replace(r"\\n", " || ", regex=True)

Formatting a specific row of integers to the ssn style

I want to format a specific column of integers to ssn format (xxx-xx-xxxx). I saw that openpyxl has builtin styles. I have been using pandas and wasn't sure if it could do this specific format.
I did see this -
df.iloc[:,:].str.replace(',', '')
but I want to replace the ',' with '-'.
import pandas as pd
df = pd.read_excel('C:/Python/Python37/Files/Original.xls')
df.drop(['StartDate', 'EndDate','EmployeeID'], axis = 1, inplace=True)
df.rename(columns={'CheckNumber': 'W/E Date', 'CheckBranch': 'Branch','DeductionAmount':'Amount'},inplace=True)
df = df[['Branch','Deduction','CheckDate','W/E Date','SSN','LastName','FirstName','Amount','Agency','CaseNumber']]
ssn = (df['SSN']       # the integer column
       .astype(str)    # cast integers to string
       .str.zfill(8)   # zero-padding
       .pipe(lambda s: s.str[:2] + '-' + s.str[2:4] + '-' + s.str[4:]))
writer = pd.ExcelWriter('C:/Python/Python37/Files/Deductions Report.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
Your question is a bit confusing; see if this helps.
If you have a column of integers and you want to create a new one made up of strings in SSN (Social Security Number) format, you can try something like:
df['SSN'] = (df['SSN']       # the "integer" column
             .astype(int)    # the integer column
             .astype(str)    # cast integers to string
             .str.zfill(9)   # zero-padding
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))
Setup
Social Security numbers are nine-digit numbers using the form: AAA-GG-SSSS
s = pd.Series([111223333, 222334444])
0 111223333
1 222334444
dtype: int64
Option 1
Using zip and numpy.unravel_index:
pd.Series([
    '{}-{}-{}'.format(*el)
    for el in zip(*np.unravel_index(s, (1000, 100, 10000)))
])
Option 2
Using f-strings:
pd.Series([f'{i[:3]}-{i[3:5]}-{i[5:]}' for i in s.astype(str)])
Both produce:
0 111-22-3333
1 222-33-4444
dtype: object
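One caveat, not in the original answer: Option 1 formats the groups with plain {} placeholders, so an SSN whose middle or last group has leading zeros (e.g. 111220042) would come out as 111-22-42. A padded variant of Option 1 (a sketch, assuming numpy is imported as np):
pd.Series([
    '{:03d}-{:02d}-{:04d}'.format(*map(int, el))  # cast to plain int, then zero-pad each group
    for el in zip(*np.unravel_index(s, (1000, 100, 10000)))
])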
I prefer:
df["ssn"] = df["ssn"].astype(str)
df["ssn"] = df["ssn"].str.strip()
df["ssn"] = (
df.ssn.str.replace("(", "")
.str.replace(")", "")
.str.replace("-", "")
.str.replace(" ", "")
.apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
This takes into account rows that are partially formatted, fully formatted, or not formatted at all, and correctly formats them all.
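A version caveat, not from the original answer: on pandas releases older than 1.4, str.replace defaulted to regex=True, so the bare "(" would raise a regex error. Passing regex=False keeps the replacements literal:
df["ssn"] = (
    df["ssn"].str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("-", "", regex=False)
    .str.replace(" ", "", regex=False)
    .apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)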
For Example:
data = [111111111,123456789,"222-11-3333","433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])
Before and after screenshots were shown in the original answer.
