I'm trying to execute a filter in Python, but I'm stuck at the end, when I need to group the result.
I have a json, which is this one: https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3
And what I'm trying to do with it is filter on "ApiFamily", searching for "payments-ted" or "payments-doc". If I find a match, I then must verify that the "ApiEndpoints" column has at least two endpoints in it.
My ultimate goal is to append both "ApiFamily" values in one row and all the "ApiEndpoints" in another row. Something like this:
"ApiFamily": [
"payments-ted",
"payments-doc"
]
"ApiEndpoints": [
"/ted",
"/electronic-ted",
"/phone-ted",
"/banking-ted",
"/shared-automated-teller-machines-ted"
"/doc",
"/electronic-doc",
"/phone-doc",
"/banking-doc",
"/shared-automated-teller-machines-doc"
]
I have managed to achieve partial success, searching for a single condition:
ApiFilter = df[(df['ApiFamily'] == 'payments-pix') & (df['ApiEndpoints'].apply(len) >= 2)]
This correctly extracts only the payments-pix rows that contain two or more ApiEndpoints.
Now, I can manage to check both conditions if I try this:
ApiFilter = df[((df['ApiFamily'] == 'payments-ted') | (df['ApiFamily'] == 'payments-doc')) & (df['ApiEndpoints'].apply(len) >= 2)]
I will get the correct rows, but it will obviously list each family on its own row.
When I try to groupby the result, all I get is this:
TypeError: unhashable type: 'Series'
My question is: how do I avoid this error? I assume I must do some sort of conversion of the columns that have multiple items inside a row, but what is the best method?
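For the unhashable-Series error itself, one option (a minimal sketch, not from the original thread) is to convert the list cells to tuples, which are hashable and therefore safe to group on:
# Lists are unhashable; tuples of the same values are not.
df['ApiEndpoints'] = df['ApiEndpoints'].apply(tuple)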
I have tried this solution; it is kind of roundabout, but it gets the final result you want.
First, get the data into a dictionary object:
>>> import requests
>>> url = 'https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3'
>>> response = requests.get(url)
>>> d = response.json()
We just need to get ApiFamily and ApiEndpoints into a new dictionary:
>>> dNew = {}
>>> for item in d['data']:
...     if item['ApiFamily'] in ['payments-ted', 'payments-doc']:
...         dNew[item['ApiFamily']] = item['ApiEndpoints']
Change dNew into a dataframe and transpose it.
>>> import pandas as pd
>>> df1 = pd.DataFrame(dNew)
>>> df1 = df1.applymap(lambda x: "'" + x + "'")
>>> df2 = df1.transpose()
At this stage df2 looks like this -
>>> print(df2)
0 1 2 3 \
payments-ted '/ted' '/electronic-ted' '/phone-ted' '/banking-ted'
payments-doc '/doc' '/electronic-doc' '/phone-doc' '/banking-doc'
4
payments-ted '/shared-automated-teller-machines-ted'
payments-doc '/shared-automated-teller-machines-doc'
Now join all the columns using the comma symbol
>>> df2['final'] = df2.apply(','.join, axis=1)
Finally
>>> df2 = df2[['final']]
>>> print(df2)
final
payments-ted '/ted','/electronic-ted','/phone-ted','/bankin...
payments-doc '/doc','/electronic-doc','/phone-doc','/bankin...
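A more pandas-native route is also possible (a sketch, not from the original answer, assuming df already holds the JSON rows with ApiFamily as strings and ApiEndpoints as Python lists): filter first, then collect the families and flatten the endpoint lists with explode.
import pandas as pd

# Keep only the two families whose endpoint lists have length >= 2.
mask = (df['ApiFamily'].isin(['payments-ted', 'payments-doc'])
        & (df['ApiEndpoints'].str.len() >= 2))
filtered = df[mask]

grouped = {
    'ApiFamily': filtered['ApiFamily'].tolist(),
    # explode() flattens the per-row lists into one long Series.
    'ApiEndpoints': filtered['ApiEndpoints'].explode().tolist(),
}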
I have a dataframe where the coordinates column comes in this format:
[-7.821, 37.033]
I would like to create two columns, where the first is lon and the second is lat.
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()),columns = ['lon','lat'])
But the dictionary that is created does not split the values on the comma.
Instead it creates a dict in the following format:
0: '[-7.821, 37.033]'
What is the best way to extract the values within [,] and put them into two new columns in the original dataframe df_map?
Thank you in advance!
You can parse the string:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
lon lat
0 -7.821 37.033
Convert the values to lists with ast.literal_eval, then build the DataFrame from lists instead of dicts:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L, columns=['lon', 'lat'])
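If you then want lon and lat on the original frame, as the question asks, a join lines the new columns up (assuming df_map has a default 0..n-1 index; otherwise pass index=df_map.index to the constructor above):
df_map = df_map.join(df_map_new)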
In addition to the answers already provided, you can also try this:
ser_lon = df_map['coordinates'].apply(lambda x: x[0])
ser_lat = df_map['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
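Note that this variant assumes the cells already hold real lists; with string values like '[-7.821, 37.033]' (which is what the dict output in the question suggests), x[0] would just return the '[' character. In that case, parse first:
import ast

# Parse the strings into real [lon, lat] lists before indexing.
parsed = df_map['coordinates'].apply(ast.literal_eval)
df_map['lon'] = parsed.apply(lambda x: x[0])
df_map['lat'] = parsed.apply(lambda x: x[1])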
I have an object for my output. Now I want to split my output and create a df with the values.
This is the output I work with:
Seriennummer
701085.0 ([1525.5804581812297, 255.9005481721001, 0.596...
701086.0 ([1193.0420594479258, 271.17468806239793, 0.65...
701087.0 ([1265.5151604213813, 217.26487934586433, 0.60...
701088.0 ([1535.8282855508626, 200.6196628705149, 0.548...
701089.0 ([1500.4964672930257, 247.8883736673866, 0.583...
701090.0 ([1203.6453723293514, 258.5749562983118, 0.638...
701091.0 ([1607.1851164005993, 209.82194423587782, 0.56...
701092.0 ([1711.7277933836879, 231.1560159770871, 0.567...
dtype: object
This is what I am doing and my attempt to split my output:
import numpy as np
import scipy.optimize as opt

x = df.T.iloc[1]
y = df.T.iloc[2]

def logifunc(x, c, a, b):
    return c / (1 + a * np.exp(-b * x))

result = df.groupby('Seriennummer').apply(lambda grp:
    opt.curve_fit(logifunc, grp.mrwSmpVWi, grp.mrwSmpP, p0=[110, 400, -2]))
print(result)

for element in result:
    parts = element.split(',')
    print(parts)
It doesn't work. I get the Error:
AttributeError: 'tuple' object has no attribute 'split'
@jezrael: It works. Now it shows a lot of data I don't need. Do you have an idea how I can drop the rows with the data I don't need?
Seriennummer 0 1 2
701085.0 1525.5804581812297 255.9005481721001 0.5969011082719918
701085.0 [ 9.41414894e+03 -2.07982124e+03 -2.30130078e+00] [-2.07982124e+03 1.44373786e+03 9.59282709e-01] [-2.30130078e+00 9.59282709e-01 7.75807643e-04]
701086.0 1193.0420594479258 271.17468806239793 0.6592054681687264
701086.0 [ 5.21906135e+03 -2.23855187e+03 -2.11896425e+00] [-2.23855187e+03 2.61036500e+03 1.67396324e+00] [-2.11896425e+00 1.67396324e+00 1.22581746e-03]
701087.0 1265.5151604213813 217.26487934586433 0.607183527397275
Use Series.explode with DataFrame constructor:
s = result.explode()
df1 = pd.DataFrame(s.tolist(), index=s.index)
If the data is small and/or performance is not important:
df1 = result.explode().apply(pd.Series)
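If you only need the fitted parameters and not the covariance matrices, a sketch (assuming each element of result is the (popt, pcov) tuple that curve_fit returns) is to keep just the first element of each tuple before building the frame:
# .str[0] picks the first element (popt) out of each tuple.
popt = result.str[0]
df1 = pd.DataFrame(popt.tolist(), index=result.index)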
I have some trouble processing a big CSV with pandas. The CSV consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze column B (it's a sort of "control" field) and, depending on its value, return a value by processing columns A and C.
Finally, I need to return a concatenation of all the resulting columns, from 150 down to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola, colb, and colc, preprocessing the list, and applying map to generate the results, and it speeds things up a little:
for i in range(1, 151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]

concats = []
for i in range(1, 151):
    concats.append('Concat' + str(i))

ret = df[concats].values.ravel()
uniq = list(set(ret))

# Build the lookup once per unique combination (renamed from `list`,
# which shadowed the builtin), then map it onto every column.
lookup = {}
for member in uniq:
    lookup[member] = getPath2(member)

for i in range(1, MAX_COLS + 1):  # MAX_COLS is presumably 150
    df['Res' + str(i)] = df['Concat' + str(i)].map(lookup)

df['Path'] = df.apply(getFullPath2, axis=1)
Functions getPath2 and getFullPath2 are defined as examples here:
https://pastebin.com/zpFF2wXD
But it still seems a little bit slow (6 minutes to process everything).
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I'm concatenating columns could be better :). I tried Series.str.cat, but I didn't get how to chain only some columns and not the full df (see the sketch after this question).
Thanks very much!
Mic
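On the Series.str.cat point: chaining just a subset of columns is possible by passing the other columns as a list. A minimal sketch for one trio of the columns above:
# Concatenate one trio of columns with '|' separators.
df['Concat1'] = df['cola1'].str.cat([df['colb1'], df['colc1']], sep='|')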
Amended answer: I see from your criteria that you actually have multiple controls on each column. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
    'cola1': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb1': pd.Series(['ret1', 'ret1', 'ret2'], index=[1, 2, 3]),
    'colc1': pd.Series(['B_1', 'C_2', 'B_3'], index=[1, 2, 3]),
    'cola2': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb2': pd.Series(['ret3', 'ret1', 'ret2'], index=[1, 2, 3]),
    'colc2': pd.Series(['B_2', 'A_1', 'A_3'], index=[1, 2, 3]),
    'cola3': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb3': pd.Series(['ret2', 'ret2', 'ret1'], index=[1, 2, 3]),
    'colc3': pd.Series(['A_1', 'B_2', 'C_3'], index=[1, 2, 3]),
}
your_df = pd.DataFrame(series, index=[1, 2, 3],
                       columns=['cola1', 'colb1', 'colc1', 'cola2', 'colb2', 'colc2', 'cola3', 'colb3', 'colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1', 'col2', 'col3']
# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'

def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'

def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In [70]: pathlist
Out[70]: ['A|F|D', 'A|A|B', 'B|E|A']
I have two series in the dataframe below. The first is a string which will appear in the second, which is a url string. What I want to do is change the first series by concatenating extra characters onto it, and have that change applied to the string in the second series.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print(type(splitter))
    print(splitter)
    res = 'angry_' + str(source_col).join(splitter)
    return res

df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, split on that string in dest_col, then apply that change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is change your function so that it takes in a series instead, and then assign source_col and dest_col to the appropriate values in the series. One way of doing it is as below:
def trial(x):
    source_col = x['OrigWord']
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res

df['Final'] = df.apply(trial, axis=1)
Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row

df.apply(trial, axis=1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
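One caveat: apply with axis=1 returns a new frame rather than modifying df in place, so assign the result back if you want to keep the changes:
df = df.apply(trial, axis=1)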
I have a pandas DataFrame df of the form:
user_id time url
4 20140502 'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
7 20140307 'w.lejournal.fr/palmares/palmares-immobilier/'
10 20140604 'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
etc...
I want to use the groupby function to group by user, then compute some statistics on the words appearing in each user's urls: for example, how many times the word 'actualite' appears in a user's urls.
For now, my code is:
import pandas

def my_stat_function(temp_set):
    res = 0
    for (u, t) in temp_set:
        if 'actualite' in u and t > 20140101:
            res += 1
    return res

group_user = df.groupby('user_id')
output_list = []
for (i, group) in group_user:
    dfg = pandas.DataFrame(group)
    temp_set = [tuple(x) for x in dfg[['url', 'time']].values]
    temp_var = my_stat_function(temp_set)
    output_list.append([i] + [temp_var])

outputDf = pandas.DataFrame(data=output_list, columns=['user_id', 'stat'])
My question is: can I avoid iterating group by group to apply my_stat_function, and does something faster exist, maybe using apply? I would really like something more "pandas-ish" and faster.
Thank you for your help.
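For the record, a vectorized sketch of one way to avoid the loop entirely (not from the original thread, and assuming time is stored as an integer like 20140502): build a boolean Series for the condition, then let groupby sum it per user.
# True where the url mentions 'actualite' and the visit is after 20140101.
hit = df['url'].str.contains('actualite') & (df['time'] > 20140101)

outputDf = (hit.groupby(df['user_id'])
               .sum()
               .rename('stat')
               .reset_index())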