I have a column of dtype object that contains multiple values separated by a pipe ( | ).
I would like to extract only the customer order number, which starts with 44.
Sometimes the order number is at the beginning, sometimes in the middle, sometimes at the end,
and sometimes it is duplicated.
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 |
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 |
Desired result:
44019541285
44019119315
44019453624
44019406029
44019480564
My code:
import pandas as pd
from io import StringIO
data = '''
Order_Numbers
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498
'''
df = pd.read_csv(StringIO(data.replace(' ','')))
df
'''
Order_Numbers
0 44019541285_P_002|0317209757|87186978110350851...
1 87186978110202440|44019119315|8718697811020244...
2 87186978110326832|44019453624|8718697811032683...
3 44019406029|0317196878|87186978110313085|38718...
4 44019480564|0317202711|87186978110335810|38718...
'''
Final code:
(
    df.Order_Numbers.str.split('|', expand=True)             # one token per column
      .astype(str)
      .where(lambda x: x.applymap(lambda y: y[:2] == '44'))  # keep only tokens starting with '44' (applymap is DataFrame.map in pandas >= 2.1)
      .bfill(axis=1)                                         # pull each row's first match into column 0
      [0]
      .str.split('_').str.get(0)                             # drop suffixes such as '_P_002'
)
0 44019541285
1 44019119315
2 44019453624
3 44019406029
4 44019480564
Name: 0, dtype: object
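A shorter alternative (my own sketch, not part of the answer above) is a single regex that pulls the first token starting with 44 directly, without splitting into columns:
# assumes order numbers are digit runs starting with '44', preceded by the start
# of the string or a '|' separator; '\d+' stops at the underscore, so the
# '_P_002'-style suffix is excluded automatically
df['customer_order'] = df['Order_Numbers'].str.extract(r'(?:^|\|)\s*(44\d+)', expand=False)
Another approach, which keeps every distinct match per row: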
import pandas as pd
df = pd.DataFrame({
'order_number':[
'44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032',
'87186978110202440 | 44019119315 | 87186978110202440 | 44019119315',
'87186978110326832 | 44019453624 | 87186978110326832 | 44019453624',
'44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|',
'44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498'
]
})
def extract_customer_order(order_number):
order_number = order_number.replace(' ','') # remove all space to make it easy to process e.g. '44019541285_P_002 | 0317209757 ' -> '44019541285_P_002|0317209757'
order_number_list = order_number.split('|') # split the string at every | to multiple string in list '44019541285_P_002|0317209757' -> ['44019541285_P_002', '0317209757']
result = []
for order in order_number_list:
if order.startswith('44'): # select only order number starting with '44'
if order not in result: # to prevent duplicate order number
result += [order]
# if you want the result as string separated by '|', uncomment line below
# result = '|'.join(result)
return result
df['customer_order'] = df['order_number'].apply(extract_customer_order)
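For the sample frame above this returns a list per row. Note that, unlike the chained version, it keeps the _P_002 suffix; strip it with order.split('_')[0] inside the loop if you want bare numbers:
print(df['customer_order'].tolist())
# [['44019541285_P_002'], ['44019119315'], ['44019453624'], ['44019406029'], ['44019480564']]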
I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book | bdr | cusip | isin | Deal | Id | join_key |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS | | | | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The field Id is not properly appending to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay | book | bdr | cusip | isin | Deal | Id | join_key |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15 | ITOR | 371494AM7 | US371494AM77 | 161 | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15 | ITOR | | | | 8011898573 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011898742 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011899418 | 20191031|15|ITOR| |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.
The chained fillna for cusip is overly complicated. You can replace it with bfill(axis=1), which takes each row's first non-null value across cusip, isin, Deal, and Id:
final['join_key'] = (final['book'].astype('str') +
final['bdr'] +
final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
Try this:
import pandas as pd
import numpy as np
# setup (ignore)
final = pd.DataFrame({
'book': [17236],
'bdr': ['ETFROS'],
'cusip': [np.nan],
'isin': [np.nan],
'Deal': [np.nan],
'Id': ['8012398421'],
})
# answer
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final['cusip'].fillna(final['isin']).fillna(final['Deal']).fillna(final['Id']).astype('str'))
Output
book bdr cusip isin Deal Id join_key
0 17236 ETFROS NaN NaN NaN 8012398421 17236ETFROS8012398421
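The same bfill trick handles the Update case. A sketch with assumed data (note the frame is selected with 'Id', capital I; the lowercase 'id' in the attempt above is the likely reason the key was not picked up):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': [15, 15],
    'bdr': ['ITOR', 'ITOR'],
    'cusip': ['371494AM7', np.nan],
    'isin': ['US371494AM77', np.nan],
    'Deal': ['161', np.nan],     # kept as strings here for simplicity
    'Id': ['8013210731', '8011898573'],
})

df['join_key'] = ('20191031|' + df['book'].astype(str) + '|' + df['bdr'] + '|' +
                  df[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))

print(df['join_key'].tolist())
# ['20191031|15|ITOR|371494AM7', '20191031|15|ITOR|8011898573']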
I am trying to aggregate data in a PySpark dataframe based on a particular criterion: aligning accounts by their switchOUT and switchIN amounts, so that the account money switches out of becomes the from_acct and the receiving accounts become the to_acct.
The data I start with in the dataframe:
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A | 1 | 125 | 50 | 75 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 2 | 100 | 75 | 25 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 3 | 200 | 300 | 0 | 100 |
+--------+------+-----------+----------+----------+-----------+
The table I want to end up with:
+--------+--------+-----------+----------+----------+
| person | from_acct| to_acct | switchIN | switchOUT|
+--------+----------+--------+----------+-----------+
| A | 3 | 1 | 75 | 100 |
+--------+----------+--------+----------+-----------+
| A | 3 | 2 | 25 | 100 |
+--------+----------+--------+----------+-----------+
Also, how can I make it work for N rows (not just 3 accounts)?
So far I have used this code:
import operator
from pyspark.sql import functions as F

# define udfs
def sorter(l):
res = sorted(l, key=operator.itemgetter(1))
return [item[0] for item in res]
def list_to_string(l):
res = 'from_fund_' +str(l[0]) + '_to_fund_'+str(l[1])
return res
def listfirstAcc(l):
res = str(l[0])
return res
def listSecAcc(l):
res = str(l[1])
return res
sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)
# Add additional columns
df = df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df = df.withColumn("move_string", list_str("move"))
df = df.withColumn("From_Acct", extractFirstFund("move"))
df = df.withColumn("To_Acct", extractSecondFund("move"))
Current outcome I am getting:
+--------+--------+-----------+----------+----------+
| person | from_acct| to_acct | switchIN | switchOUT|
+--------+----------+--------+----------+-----------+
| A | 3 | 1,2 | 75 | 100 |
+--------+----------+--------+----------+-----------+
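One possible approach (a sketch, assuming df is the input dataframe shown above): split the switch-out rows from the switch-in rows, then join them back on person, so each out-account pairs with every in-account. This also works for N rows:
from pyspark.sql import functions as F

out_df = (df.filter(F.col('switchOUT') > 0)
            .select('person', F.col('acct').alias('from_acct'), 'switchOUT'))
in_df = (df.filter(F.col('switchIN') > 0)
           .select('person', F.col('acct').alias('to_acct'), 'switchIN'))

result = (out_df.join(in_df, on='person')
                .select('person', 'from_acct', 'to_acct', 'switchIN', 'switchOUT'))
result.show()
For the sample data this produces exactly the two desired rows; since the join is per person, several out- and in-accounts for one person yield one row per (from, to) pair.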
I have two dataframes.
First dataframe: df_json
+------------+-----------------+-----------+------------+
| chromosome | ensembl_id | gene_end | gene_start |
+------------+-----------------+-----------+------------+
| 7 | ENSG00000122543 | 5886362 | 5879827 |
| 12 | ENSG00000111325 | 122980043 | 122974580 |
| 17 | ENSG00000181396 | 82418637 | 82389223 |
| 6 | ENSG00000119900 | 71308950 | 71288803 |
| 9 | ENSG00000106809 | 92404696 | 92383967 |
+------------+-----------------+-----------+------------+
Second dataframe: df
+------------+-----------------+-----------+------------+
| rs_id | variant | gene_id | chromosome |
+------------+-----------------+-----------+------------+
| rs13184706 | 5:43888254:C:T | 43888254| 5 |
| rs58824264 | 5:43888493:C:T | 43888493| 5 |
+------------+-----------------+-----------+------------+
I want to iterate through df_json and for each row in df_json, select the rows from df whose gene_id is in range (gene_start, gene_end) and df['chromosome'] == df_json['chromosome']. Also, I need to create a new column in the resulting dataframe which has the ensembl_id from df_json.
I can achieve this with the code below, but it is very slow. I need a faster way, as this has to run on millions of rows.
result_df = []
for row in df_json.itertuples():
gene_end, gene_start = row[3], row[4]
gene = df.loc[(df['gene_id'].between(gene_start, gene_end, inclusive=True)) & (df['chromosome'] == row[1])]
gene['ensembl_id'] = row[2]
result_df.append(gene)
print(result_df[0])
You should avoid iterating pandas dataframe rows where possible, as this is inefficient and less readable.
You can implement your logic using pd.DataFrame.merge and pd.Series.between. I have changed the data in your example to make it more interesting.
import pandas as pd
df_json = pd.DataFrame({'chromosome': [7, 12, 17, 6, 9],
'ensembl_id': ['ENSG00000122543', 'ENSG00000111325', 'ENSG00000181396',
'ENSG00000119900', 'ENSG00000106809'],
'gene_end': [5886362, 122980043, 82418637, 71308950, 92404696],
'gene_start': [5879827, 122974580, 82389223, 71288803, 92383967]})
df = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
'variant': ['5:43888254:C:T', '5:43888493:C:T'],
'gene_id': [5880000, 43888493],
'chromosome': [7, 9]})
res = df_json.merge(df, how='left', on='chromosome')
res = res[res['gene_id'].between(res['gene_start'], res['gene_end'])]
print(res)
# chromosome ensembl_id gene_end gene_start gene_id rs_id \
# 0 7 ENSG00000122543 5886362 5879827 5880000.0 rs13184706
# variant
# 0 5:43888254:C:T
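A small variation (my own note): since the between filter drops the unmatched rows anyway, an inner merge avoids materializing them in the first place:
res = df_json.merge(df, on='chromosome')  # inner join by default
res = res[res['gene_id'].between(res['gene_start'], res['gene_end'])]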
Use pyranges for large datasets; it is fast and memory-efficient:
import pyranges as pr
c = """Chromosome ensembl_id End Start
7 ENSG00000122543 5886362 5879827
12 ENSG00000111325 122980043 122974580
17 ENSG00000181396 82418637 82389223
5 MadeUp 43889000 43888253
6 ENSG00000119900 71308950 71288803
9 ENSG00000106809 92404696 92383967"""
c2 = """rs_id variant Start End Chromosome
rs13184706 5:43888254:C:T 43888254 43888256 5
rs58824264 5:43888493:C:T 43888493 43888494 5"""
gr = pr.from_string(c)
gr2 = pr.from_string(c2)
j = gr.join(gr2)
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# | Chromosome | ensembl_id | End | Start | rs_id | variant | Start_b | End_b |
# | (category) | (object) | (int32) | (int32) | (object) | (object) | (int32) | (int32) |
# |--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------|
# | 5 | MadeUp | 43889000 | 43888253 | rs13184706 | 5:43888254:C:T | 43888254 | 43888256 |
# | 5 | MadeUp | 43889000 | 43888253 | rs58824264 | 5:43888493:C:T | 43888493 | 43888494 |
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# Unstranded PyRanges object has 2 rows and 8 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df # as pandas df
I am using the tabulate module to print information nicely at the console, on Python 2.6.
I currently have this:
+---------+----------+----------+
| Task    | Status   | Rating   |
|---------+----------+----------|
| A       | Done     | Good     |
| B       | Done     | Bad      |
| C       | Pending  |          |
| D       | Done     | Good     |
+---------+----------+----------+
I want to go to this:
+---------+----------+----------+
| Task    | Status   | Rating   |
|---------+----------+----------|
| A       | Done     | Good     |
| B       | Done     | Bad      |
| D       | Done     | Good     |
| C       | Pending  |          |
+---------+----------+----------+
So that all of the Dones are grouped together.
Currently the tabulate receives a dictionary and I unpack the values like this:
def generate_table(data):
table = []
headers = ['Task', 'Status', 'Rating']
for key, value in data.iteritems():
print key, value
if 'Rating' in value:
m, l = value['Status'], value['Rating']
m = m.split('/')[-1]
temp = [key,m,l]
table.append(temp)
else:
m, l = value['Status'], None
m = m.split('/')[-1]
temp = [key,m,l]
table.append(temp)
print tabulate(table, headers, tablefmt="psql")
You can sort your resulting table by the Status column after your for loop. Note that sorted returns a new list, so assign it back before passing it to tabulate:
table = sorted(table, key=lambda row: row[1])
This will effectively "group" the rows alphabetically by status.
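If alphabetical order does not give the grouping you want, a sketch with an explicit ranking (status names assumed from the example) works the same way:
# rank the known statuses explicitly; anything unknown sorts last
status_order = {'Done': 0, 'Pending': 1}
table = sorted(table, key=lambda row: status_order.get(row[1], len(status_order)))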
How can I search every row of a pandas dataframe for a phrase and, where it exists, create a new column that says 'Yes' and lists the columns in that row where it was found? I would also like to ignore case.
You could use pandas' apply function, which lets you traverse rows or columns and apply your own function to them.
For example, given a dataframe
+--------------------------------------+------------+---+
| deviceid | devicetype | 1 |
+--------------------------------------+------------+---+
| b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 |
| 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 |
| cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 |
+--------------------------------------+------------+---+
Define a function (using case=False so the match ignores case, as requested):
def pr(array, value):
    # non-string cells produce NaN from str.contains, treated as no match
    condition = array[array.str.contains(value, case=False).fillna(False)].index.tolist()
    # Series.append was removed in pandas 2.0, so concatenate instead
    if condition:
        ret = pd.concat([array, pd.Series({"condition": ['Yes'] + condition})])
    else:
        ret = pd.concat([array, pd.Series({"condition": ['No']})])
    return ret
Use it
df.apply(pr, axis=1, args=('Google',))
+---+--------------------------------------+------------+---+-------------------+
| | deviceid | devicetype | 1 | condition |
+---+--------------------------------------+------------+---+-------------------+
| 0 | b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 | [Yes, devicetype] |
| 1 | 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 | [No] |
| 2 | cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 | [No] |
+---+--------------------------------------+------------+---+-------------------+
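A vectorized variation on the same idea (my own sketch, not the answer above), producing a Yes/No flag plus the list of matching columns without building a per-row Series:
import numpy as np

# boolean mask: True where a cell contains the phrase, case-insensitively
mask = df.apply(lambda col: col.astype(str).str.contains('Google', case=False))

df['found'] = np.where(mask.any(axis=1), 'Yes', 'No')
df['found_in'] = mask.apply(lambda row: row.index[row].tolist(), axis=1)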