How to select multiple columns after applying group by a column - python

I'm trying to group by the "sender" column and extract some related columns. Here is part of my dataset:
row number,type,rcvTime,sender,pos_x,pos_y,pos_z,spd_x,spd_y,spd_z,acl_x,acl_y,acl_z,hed_x,hed_y,hed_z
0,2,25207.0,15,136.07,1118.46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,-1.0,0.0
1,2,25208.0,15,136.19,1117.14,0.0,0.22,-2.31,0.0,0.14,-1.48,0.0,0.09,-1.0,0.0
2,3,25208.81,21,152.66,904.56,0.0,0.06,-0.75,0.0,0.18,-2.43,0.0,0.07,-1.0,0.0
3,2,25209.0,15,136.69,1113.79,0.0,0.39,-4.18,0.0,0.15,-1.64,0.0,0.09,-1.0,0.0
4,3,25209.81,21,152.98,902.59,0.0,0.22,-2.91,0.0,0.12,-1.68,0.0,0.07,-1.0,0.0
5,2,25210.0,15,133.77,1108.01,0.0,0.58,-6.17,0.0,0.16,-1.76,0.0,0.09,-1.0,0.0
6,3,25210.81,21,153.25,898.68,0.0,0.37,-4.65,0.0,0.11,-1.35,0.0,0.08,-1.0,0.0
7,2,25211.0,15,134.37,1100.75,0.0,0.76,-8.14,0.0,0.18,-1.93,0.0,0.09,-1.0,0.0
8,3,25211.81,21,153.82,893.0,0.0,0.65,-6.67,0.0,0.25,-2.54,0.0,0.1,-1.0,0.0
9,3,25211.93,27,122.87,892.12,0.0,5.63,0.32,0.0,-1.57,-0.09,0.0,1.0,0.04,0.0
Here is what I have tried; the result is just all the 'rcvTime' data for each sender, but I need all the other columns like pos_x and spd_x as well:
import numpy as np
import pandas as pd
df = pd.read_csv(r"/Users/h/trace.csv")
df.head()
df1 = df.groupby('sender')['rcvTime'].apply(list).reset_index(name='new')
print(df1)
What I need is the following data (I only wrote out the values in full for sender=15):
rowNumber,sender,rcvTime,pos_x,spd_x,rcvTime,pos_x,spd_x,rcvTime,pos_x,spd_x,...
0,15,25207.0,136.07,0.0,25208.0,136.19,0.22, 25209.0,... 25210.0,..., 25211.0, ...
1,21,25208.81,152.66,0.06, 25209.81,..., 25210.81,..., 25211.81,..., 25212...
2,27,25211.93..., 25212.93..., 25213.93..., 25214.93..., 25215...

IIUC, you are searching for something like this:
df1 = df.groupby('sender', as_index=False).agg(list)
EDIT
I'm sure there is a better way, but here is how I managed to achieve your desired output:
cols = ['rcvTime', 'pos_x', 'spd_x']
grouped = df.groupby('sender')[cols]
# Flatten each sender's rows into one long list of values.
list_of_lists = [tup[1].values.flatten().tolist() for tup in grouped.pipe(list)]
res = pd.DataFrame({'sender': grouped.groups.keys(),
                    f'{cols*len(grouped.groups.keys())}': list_of_lists})
print(res)
sender ['rcvTime', 'pos_x', 'spd_x', 'rcvTime', 'pos_x', 'spd_x', 'rcvTime', 'pos_x', 'spd_x']
0 15 [25207.0, 136.07, 0.0, 25208.0, 136.19, 0.22, ...
1 21 [25208.81, 152.66, 0.06, 25209.81, 152.98, 0.2...
2 27 [25211.93, 122.87, 5.63]
I still think you don't benefit from what pandas offers when you format your data like this.
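To illustrate that last point, here is a minimal sketch (assuming the same df and columns as above) that spreads each sender's observations into labelled columns instead of one anonymous list, via cumcount and pivot:
import pandas as pd

cols = ['rcvTime', 'pos_x', 'spd_x']
long_df = df[['sender'] + cols].copy()
# Number each sender's observations so they can become column labels.
long_df['obs'] = long_df.groupby('sender').cumcount()
wide = long_df.pivot(index='sender', columns='obs', values=cols)
# Flatten the (value, obs) MultiIndex into names like 'rcvTime_0', 'pos_x_0', ...
wide.columns = [f'{c}_{i}' for c, i in wide.columns]
print(wide.reset_index())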

Related

How do I capture the properties I want from a string?

I hope you are well. I have the following string:
"{\"code\":0,\"description\":\"Done\",\"response\":{\"id\":\"8-717-2346\",\"idType\":\"CIP\",\"suscriptionId\":\"92118213\"},....\"childProducts\":[]}}"...
I'm trying to capture the attributes id, idType and suscriptionId and map them into a dataframe, but the entire body of the .csv ends up in a single row, so it is almost impossible for me to work with it without an index.
desired output:
id, idType, suscriptionID
0. '7-84-1811', 'CIP', 21312421412
1. '1-232-42', 'IO' , 21421e324
My code:
import pandas as pd
import json
path = '/example.csv'
df = pd.read_csv(path)
normalize_df = json.load(df)
print(df)
Considering your string is in JSON format, you can do this: read it with pandas, drop the extra columns, transpose, and get the headers right.
import pandas as pd

toEscape = "{\"code\":0,\"description\":\"Done\",\"response\":{\"id\":\"8-717-2346\",\"idType\":\"CIP\",\"suscriptionId\":\"92118213\"}}"
# Undo the backslash escaping so pandas sees plain JSON.
json_string = toEscape.encode('utf-8').decode('unicode_escape')
df = pd.read_json(json_string)  # newer pandas may want StringIO(json_string) here
df = df.drop(["code", "description"], axis=1)
# The nested 'response' keys start out as the index; transposing turns them into columns.
df = df.transpose().reset_index().drop("index", axis=1)
df.to_csv("user_details.csv")
The output looks like this:
id idType suscriptionId
0 8-717-2346 CIP 92118213
Thank you for the question.
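As a hedged alternative, assuming the raw JSON strings live in a CSV column (hypothetically named 'body' here), json.loads plus pd.json_normalize reaches the nested fields directly:
import json
import pandas as pd

df = pd.read_csv('/example.csv')
# 'body' is an assumed column name holding one JSON string per row.
parsed = df['body'].apply(json.loads)
# json_normalize flattens nested keys into dotted column names.
flat = pd.json_normalize(parsed.tolist())
print(flat[['response.id', 'response.idType', 'response.suscriptionId']])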

How to aggregate a dataframe then transpose it with Pandas

I'm trying to achieve this kind of transformation with Pandas.
I wrote this code, but unfortunately it doesn't give the result I'm looking for.
CODE :
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')
df = df.count().reset_index().T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
df
RESULT :
Do you have any suggestions? Any help will be appreciated.
First create boolean columns testing for nonOK, then use named aggregation with size for the row count and sum for the Values column; summing the boolean columns counts the True values, and finally add both test counts together:
df = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
                NumberOfTest2 = df['Test two'].eq('nonOK'))
        .groupby('Category', as_index=False)
        .agg(NumberOfID = ('ID', 'size'),
             Values = ('Values', 'sum'),
             NumberOfTest1 = ('NumberOfTest1', 'sum'),
             NumberOfTest2 = ('NumberOfTest2', 'sum'))
        .assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))
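The input table is only shown as an image in the original post, so here is a minimal toy frame with the column names the snippet assumes ('Category', 'ID', 'Values', 'Test one', 'Test two'):
import pandas as pd

# Toy stand-in for the image-only input table.
df = pd.DataFrame({'Category': ['A', 'A', 'B'],
                   'ID': [1, 2, 3],
                   'Values': [10, 20, 30],
                   'Test one': ['OK', 'nonOK', 'nonOK'],
                   'Test two': ['nonOK', 'OK', 'nonOK']})
Running the snippet on this frame yields one row per Category with the ID count, summed Values, both nonOK counts, and their TotalTest sum.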

How to merge multiple columns with same names in a dataframe

I have the following dataframe as below:
df = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes',
'Data':'Blank',
'Data':'No',
'Logline':'10'})
I need this dataframe:
df = pd.DataFrame({'Field':['FAPERF','FAPERF'],
'Form':['LIVERID','LIVERID'],
'Folder':['ALL','ALL'],
'Logline':['9','10'],
'Data':['Yes','Blank','No']})
I tried using the code below but was not able to achieve the desired output.
res3.set_index(res3.groupby(level=0).cumcount(), append=True)['Data'].unstack(0)
Can anyone please help me?
I believe your best option is to create multiple dataframes with the same column name (for example, three dataframes that each have a "Data" column) and then simply concatenate them:
df1 = pd.DataFrame({'Field': ['FAPERF'],
                    'Form': ['LIVERID'],
                    'Folder': ['ALL'],
                    'Logline': ['9'],
                    'Data': ['Yes']})
df2 = pd.DataFrame({'Data': ['No'],
                    'Logline': ['10']})
df3 = pd.DataFrame({'Data': ['Blank']})
frames = [df1, df2, df3]
result = pd.concat(frames)
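Note that pd.concat aligns on column names, so the rows coming from df2 and df3 get NaN in the columns they don't define; the result should look roughly like this:
    Field     Form Folder Logline   Data
0  FAPERF  LIVERID    ALL       9    Yes
0     NaN      NaN    NaN      10     No
0     NaN      NaN    NaN     NaN  Blank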
You just need to build a list in which you specify the logline and data type for each row.
import pandas as pd

list_df = []
data_type_list = ["yes", "no", "Blank"]
logline_type = ["9", "10", "10"]
for x in range(len(data_type_list)):
    new_dict = {'Field': ['FAPERF'], 'Form': ['LIVERID'], 'Folder': ['ALL'],
                "Data": [data_type_list[x]], "Logline": [logline_type[x]]}
    df = pd.DataFrame(new_dict)
    list_df.append(df)
new_df = pd.concat(list_df)
print(new_df)

Count occurrences of number from specific column in python

I am trying to do the equivalent of Excel's COUNTIF() function. I am stuck on how to tell the .count() function to read from a specific column of the data.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would output that there is one '1', and I could run it again and find out that there are three '4's, etc.
I got it to work! Thank you.
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    return rng.eq(criteria).sum()
Example use
df = pd.DataFrame({'column1': [4, 4, 3, 2, 4, 1],
                   'column2': [1, 2, 3, 4, 5, 6]})
countif(df['column1'], 1)
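With this frame, countif(df['column1'], 1) returns 1, and countif(df['column1'], 4) returns 3, matching the question's example column.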
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt

df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
# Tally every value in col1 by hand.
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
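As a sketch of the vectorized route, value_counts produces the same tallies as the manual counters dict in one call:
import numpy as np
import pandas
import matplotlib.pyplot as plt

df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counts = df["col1"].value_counts()  # same tallies as the loop above
plt.bar(counts.index, counts.values)
plt.show()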

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame that consists of one column ('Vals'), each entry of which is a dictionary. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
try this:
l = []
# .iteritems() was renamed .items() in newer pandas versions.
for idx, row in df['Vals'].iteritems():
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using twice your sample row above to illustrate the iteration):
new_df = pd.DataFrame()
# .ix is removed in newer pandas; data.iloc[0] is the modern equivalent.
for id, data in fff.iterrows():
    d = {'TradeId': data.ix[0]['TradeId']}
    d.update(data.ix[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
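On newer pandas (1.0+), a sketch using pd.json_normalize can do the same flattening declaratively, assuming fff['Vals'] holds dicts shaped like the sample above:
import pandas as pd

flat = pd.json_normalize(fff['Vals'].tolist(),
                         record_path=['Measures', 'MeasureValues'],
                         meta=['TradeId'])
# One row per MeasureValues entry, with 'Amount' and 'TradeId' columns.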
