Split and Join Series in Pandas - python

I have two series in the dataframe below. The first is a string which will appear somewhere in the second, which is a URL string. What I want to do is change the first series by prepending extra characters, and have that change applied to the matching part of the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)

def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print(type(splitter))
    print(splitter)
    res = 'angry_' + str(source_col).join(splitter)
    return res

df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, split dest_col on that string, and then apply the change back to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the way the function is applied.
Here's how the result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html

applymap applies a function element-wise, so it won't pass multiple columns of the same row to your function. What you can do is change your function so that it takes in a row (a Series) instead, assign source_col and dest_col from the appropriate values in that row, and apply it with axis=1. One way of doing it is as below:
def trial(x):
    source_col = x['OrigWord']
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res

df['Final'] = df.apply(trial, axis=1)

Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html

Instead of using split, you can use the replace method to prepend angry_ to the corresponding source word in both columns:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row

df.apply(trial, axis=1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
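If you want to avoid a row-wise apply altogether, here is a minimal sketch using a plain list comprehension over the two columns (it assumes every OrigWord really does occur in its WordinUrl):

# Sketch: build the new URL column first, then prefix OrigWord.
df['WordinUrl'] = [url.replace(word, 'angry_' + word)
                   for url, word in zip(df['WordinUrl'], df['OrigWord'])]
df['OrigWord'] = 'angry_' + df['OrigWord']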

Related

Python dataframes - grouping series

I'm trying to execute a filter in Python, but I'm stuck at the end, when I need to group the result.
I have a json, which is this one: https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3
And what I'm trying to do with it is filtering "apiFamily" searching for "payments-ted" or "payments-doc". If I find a match, I then must verify that the column "ApiEndpoints" has at least two endpoints in it.
My ultimate goal is to combine both "apiFamily" values into one row and all the "ApiEndpoints" into another row. Something like this:
"ApiFamily": [
"payments-ted",
"payments-doc"
]
"ApiEndpoints": [
"/ted",
"/electronic-ted",
"/phone-ted",
"/banking-ted",
"/shared-automated-teller-machines-ted"
"/doc",
"/electronic-doc",
"/phone-doc",
"/banking-doc",
"/shared-automated-teller-machines-doc"
]
I have managed to achieve partial success, searching for a single condition:
ApiFilter = df[(df['ApiFamily'] == 'payments-pix') & (df['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
This obviously extracts only payments-pix which contains two or more ApiEndpoints.
Now, when I try to check both conditions like this:
ApiFilter = df[((df['ApiFamily'] == 'payments-ted') | (df['ApiFamily'] == 'payments-doc')) & (df['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
I will get the correct rows, but it will obviously list the brand twice.
When I try to groupby the result, all I get is this:
TypeError: unhashable type: 'Series'
My question is: how do I avoid this error? I assume I must do some sort of conversion on the columns that have multiple items inside a row, but what is the best method?
I have tried this solution; it is kind of roundabout but gets the final result you want.
First get the data into a dictionary object
>>> import requests
>>> url = 'https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3'
>>> response = requests.get(url)
>>> d = response.json()
We just need the ApiFamily and ApiEndpoints into a new dictionary
>>> dNew = {}
>>> for item in d['data']:
...     if item['ApiFamily'] in ['payments-ted', 'payments-doc']:
...         dNew[item['ApiFamily']] = item['ApiEndpoints']
Change dNew into a dataframe and transpose it.
>>> df1 = pd.DataFrame(dNew)
>>> df1 = df1.applymap(lambda x: "'" + x + "'")
>>> df2 = df1.transpose()
At this stage df2 looks like this -
>>> print(df2)
0 1 2 3 \
payments-ted '/ted' '/electronic-ted' '/phone-ted' '/banking-ted'
payments-doc '/doc' '/electronic-doc' '/phone-doc' '/banking-doc'
4
payments-ted '/shared-automated-teller-machines-ted'
payments-doc '/shared-automated-teller-machines-doc'
Now join all the columns using the comma symbol
>>> df2['final'] = df2.apply( ','.join , axis=1)
Finally
>>> df2 = df2[['final']]
>>> print(df2)
final
payments-ted '/ted','/electronic-ted','/phone-ted','/bankin...
payments-doc '/doc','/electronic-doc','/phone-doc','/bankin...
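For what it's worth, here is a more direct pandas-only sketch (assuming the payload really has a top-level "data" list with "ApiFamily" and "ApiEndpoints" keys, as used above) that keeps the endpoints as lists and builds exactly the two combined rows from the question:

import pandas as pd
import requests

d = requests.get('https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3').json()
df = pd.DataFrame(d['data'])

# Keep only the two families and the rows with at least two endpoints.
mask = df['ApiFamily'].isin(['payments-ted', 'payments-doc']) & (df['ApiEndpoints'].str.len() >= 2)
result = {
    'ApiFamily': df.loc[mask, 'ApiFamily'].tolist(),
    'ApiEndpoints': df.loc[mask, 'ApiEndpoints'].sum(),  # summing lists concatenates them
}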

apply lambda or define a function to return 1 else 0 in dask dataframe

Probably easy, but I am still learning.
I am creating a new column in a dask dataframe, where the value comes from extracting the last four characters of a date column stored as a string in ddmmyyyy format.
What I did:
I have a list of inv_years.
I extract the last four characters of the string date.
I tried to define a function so that if the extracted year is in the inv_years list, it returns 1 in a new column, else 0.
Issue: How do I write a working function, or better, a lambda function in fewer lines?
def valid_yr(x):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = ddf['string_ddmmyyyy'].str[-4:]  # extract the last four to get the year
    if validity_year.isin(inv_years):
        x = 1
    else:
        x = 0
    return x

# create a new column and apply the function
ddf['validity_year'] = ???  # what to write here?
A very grumpy way I could come up with is
inv_years = ['1921','1969','2026','2030','2041','2060','2062']
ddf['validity_year'] = ddf.apply(lambda row: 1 if row.string_ddmmyyyy[-4:] in inv_years else 0, axis=1)
Or, to get your approach working, we first modify your function a bit so that its argument is a single row.
def valid_yr(row):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = row.string_ddmmyyyy[-4:]
    if validity_year in inv_years:
        x = 1
    else:
        x = 0
    return x
Now we can apply this function to all rows.
ddf['validity_year'] = ddf.apply(valid_yr, axis=1)
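A vectorized sketch that skips the row-wise apply entirely (assuming the dask string accessor behaves like the pandas one here, which str.slice, isin and astype do):

inv_years = ['1921','1969','2026','2030','2041','2060','2062']
# Take the last four characters, test membership, and cast the boolean to 0/1.
ddf['validity_year'] = ddf['string_ddmmyyyy'].str.slice(-4).isin(inv_years).astype(int)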

Rename columns in dataframe using bespoke function python pandas

I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all I want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP','AS_AS']

def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use the string accessor and the split method on '_'.
Then, using the str accessor with index slicing reversal, [::-1], you can reverse the order of the list.
Lastly, using the string accessor and join, we can concatenate the list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
An alternative to the other answers, using your function and DataFrame.rename:
import pandas as pd

def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title

data = {
    'A_B': [1, 2, 3],
    'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2.
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4
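One more hedged variant, equivalent to the answers above for these two-part names: build an explicit mapping dict from the existing column names and pass it to rename.

# Sketch: map each old name to its swapped form, then rename with the dict.
mapping = {c: '_'.join(c.split('_')[::-1]) for c in df.columns}
df = df.rename(columns=mapping)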

How to combine DataFrame column data and a fixed text string

I want to combine 4 columns within a larger DataFrame with a custom (space) delimiter (which I have done with the code below) but then I want to add a fixed string to the start and end of each concatenation.
The columns are pairs of X & Y coordinates, but they can be dealt with as str for this purpose (once I've trimmed to 3 decimal places).
I have found many options on this website for joining the columns, but none for joining columns with a consistent fixed string. The lazy way would be for me to just make two more DataFrame columns, one for the start and one for the end, and cat everything. Is there a more sophisticated way to do it?
import pandas as pd
from pandas import DataFrame
import numpy as np
def str_join(df, sep, *cols):
    from functools import reduce
    return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
                  [df[col] for col in cols])
data= pd.read_csv('/Users/XXXXXX/Desktop/Lines.csv')
df=pd.DataFrame(data, columns=['Name','SOLE','SOLN','EOLE','EOLN','EOLKP','Wind','Wave'])
df['SOLE']=round(df['SOLE'],3)
df['SOLN']=round(df['SOLN'],3)
df['EOLE']=round(df['EOLE'],3)
df['EOLN']=round(df['EOLN'],3)
df['WKT']=str_join(df,' ','SOLE','SOLN','EOLE','EOLN')
df.to_csv('OutLine.csv') #turn on to create output file
which gives me:
WKT
476912.131 6670122.285 470329.949 6676260.271
What I want to do is add '(LINESTRING ' to the start of each concatenation and ')' to the end of each, to give me:
WKT
(LINESTRING 476912.131 6670122.285 470329.949 6676260.271 )
You could also create a collection of the columns you want to export, do a quick data type format, and apply a join.
target_cols = ['SOLE','SOLN','EOLE','EOLN',]
# Make sure to use along axis 1 (columns) because default is 0
# Also, if you're on Python 3.6+, I think you can use f-strings to format your floats.
df['WKT'] = df[target_cols].apply(lambda x: '(LINESTRING ' + ' '.join(f"{i:.3f}" for i in x) + ')', axis=1)
result:
In [0]: df.iloc[:,-3:]
Out [0]:
Wind Wave WKT
0 wind1 wave1 (LINESTRING 476912.131 6670122.285 470329.949 ...
Sorry, I'm using Spyder, which is a terminal output miser. Here's a printout of 'WKT':
In [1]: print(df['WKT'].values)
Out [1]: ['(LINESTRING 476912.131 6670122.285 470329.949 6676260.271)']
EDIT: To add a comma after 'SOLN', we could use an alternative route:
target_cols = ['SOLE','SOLN','EOLE','EOLN',]
# Format strings in advance
# Set comma_col to our desired column name. This could also be a tuple for multiple names, then replace `==` with `in` in the loop below.
comma_col = 'SOLN'
# To find the last column, which doesn't need a space here, we just select the last value from our list. I did it this way in case our list order doesn't match the dataframe order.
last_col = df[target_cols].columns.values.tolist()[-1]
# Traditional if-then method
for col in df[target_cols]:
if col == comma_col:
df[col] = df[col].apply(lambda x: f"{x:.3f}" + ",") # Explicit comma
elif col == last_col:
df[col] = df[col].apply(lambda x: f"{x:.3f}")
else:
df[col] = df[col].apply(lambda x: f"{x:.3f}" + " ") # Explicit whitespace
# Adding our 'WKT' column as before, but the .join() portion doesn't have a space in it now.
df['WKT'] = df[target_cols].apply(lambda x: '(LINESTRING ' + ''.join(i for i in x) + ')', axis=1)
Finally:
In [0]: print(df['WKT'][0])
Out [0]: (LINESTRING 476912.131 6670122.286,470329.950 6676260.271)
Your function already looks good; you just need to add a few things:
def str_join(df, sep, *cols):
    # All cols must be numeric to use df[col].round(3)
    from functools import reduce
    joined = reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
                    [df[col].round(3) for col in cols])
    # Add the prefix and suffix once, after the reduce, so they aren't repeated per step
    return '(LINESTRING ' + joined + ' )'
Or use it this way, building the column with element-wise string concatenation (pd.concat would stack the series vertically instead of joining them row by row):
df['new'] = '(LINESTRING '
df['WKT'] = df['new'] + df['SOLE'].astype(str) + ' ' + df['SOLN'].astype(str) + ' ' + df['EOLE'].astype(str) + ' ' + df['EOLN'].astype(str) + ' )'
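For completeness, a compact hedged sketch with fixed three-decimal formatting and str.cat (it assumes the four coordinate columns are numeric, as in the question):

# Format the coordinates to three decimals, then join them row by row.
coords = df[['SOLE', 'SOLN', 'EOLE', 'EOLN']].applymap('{:.3f}'.format)
df['WKT'] = '(LINESTRING ' + coords['SOLE'].str.cat(
    [coords['SOLN'], coords['EOLE'], coords['EOLN']], sep=' ') + ' )'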

Pandas - Working on multiple columns seems slow

I have some trouble processing a big CSV with Pandas. The CSV consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze column B (it's a sort of "CONTROL" field) and, depending on its value, return a value by processing columns A and C.
Finally I need to return a concatenation of all the resulting columns, from 150 down to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola, colb, colc, preprocessing that list, and applying map to generate the results, and it speeds things up a little:
for i in range(1, 151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]

concats = []
for i in range(1, 151):
    concats.append('Concat' + str(i))

ret = df[concats].values.ravel()
uniq = list(set(ret))

list = {}
for member in ret:
    list[member] = getPath2(member)

for i in range(1, MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(list)

df['Path'] = df.apply(getFullPath2, axis=1)
function getPath and getFullPath2 are defined as example here:
https://pastebin.com/zpFF2wXD
But it seems still a little bit slow (6 min for processing everything)
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I'm using to "concatenate" the columns could be better :). I tried Series.str.cat but I didn't get how to chain only some columns and not the full df.
Thanks very much!
Mic
Amended answer: I see from your criteria that you actually have multiple controls on each column. I think what works is to split these into 3 dataframes and apply your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1', 'col2', 'col3']

# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'

def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'

def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']
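If the applymap calls themselves become a bottleneck, here is a hedged sketch of vectorizing one of them (map_colc) with np.where, which evaluates whole columns at once instead of calling a Python function per cell:

import numpy as np

def map_colc_vec(s):
    # Nested np.where mirrors the if/elif chain in map_colc above.
    return pd.Series(np.where(s.str.startswith('B_'), 'B',
                     np.where(s.str.startswith('C_'), 'C',
                     np.where(s.str.startswith('A_'), None, 'F'))),
                     index=s.index)

cframes = cframes.apply(map_colc_vec)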
