Groupby, pivot and concatenate in Pandas, breaking by date? - python

I have a dataframe which looks like this:
df = pd.DataFrame([
    [123, 'abc', '121'],
    [123, 'abc', '121'],
    [456, 'def', '121'],
    [123, 'abc', '122'],
    [123, 'abc', '122'],
    [456, 'def', '145'],
    [456, 'def', '145'],
    [456, 'def', '121'],
], columns=['userid', 'name', 'dt'])
From this question, I have managed to transpose it.
So, the desired df would be:
userid1_date1 name_1 name_2 ... name_n
userid1_date2 name_1 name_2 ... name_n
userid2 name_1 name_2 ... name_n
userid3_date1 name_1 name_2 ... name_n
But I want to separate the rows depending on the date. For example, if user 123 has data on two days, then there should be a separate row for each day's api events.
I won't really be needing the userid after the transformation, so you can use it any way you like.
My plan was:

1. Group the df w.r.t. the dt column.
2. Pivot each group so that it looks like this:
   userid1_date1 name_1 name_2 ... name_n
3. Concatenate the pivoted data.
But, I have no clue how to do this in pandas!

Try:
def tweak(df):
    return df.reset_index().name

df.set_index('userid').groupby(level=0).apply(tweak)
Demonstration
df = pd.DataFrame([[1, 'a'], [1, 'c'], [1, 'c'], [1, 'd'], [1, 'e'],
                   [1, 'a'], [1, 'c'], [1, 'c'], [1, 'd'], [1, 'e'],
                   [2, 'a'], [2, 'a'], [2, 'c'], [2, 'd'], [2, 'e'],
                   [2, 'a'], [2, 'a'], [2, 'c'], [2, 'd'], [2, 'e'],
                   ], columns=['userid', 'name'])

def tweak(df):
    return df.reset_index().name

df.set_index('userid').groupby(level=0).apply(tweak)
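To break rows out by date as the question asks, the same groupby/apply idea extends to grouping on both userid and dt. A minimal sketch of my own adaptation (not part of the original answer), which pads shorter groups with NaN:

import pandas as pd

df = pd.DataFrame([
    [123, 'abc', '121'],
    [123, 'abc', '121'],
    [456, 'def', '121'],
    [123, 'abc', '122'],
    [123, 'abc', '122'],
    [456, 'def', '145'],
    [456, 'def', '145'],
    [456, 'def', '121'],
], columns=['userid', 'name', 'dt'])

def tweak(s):
    # Renumber the names within each (userid, dt) group as 0..n-1.
    return s.reset_index(drop=True)

# One row per (userid, dt) pair, with that group's names spread across columns.
out = df.set_index(['userid', 'dt']).groupby(level=[0, 1])['name'].apply(tweak).unstack()
print(out)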

Related

map a data frame to a nested list

I have a data frame with 2 columns, and I also have a nested list containing the elements of the first column. I want to append the remaining column elements and the index to the nested list.
| Column A | Column B |
| -------- | -------- |
| 100.2    | A        |
| 101.5    | B        |
| 103.6    | C        |
| 104.6    | D        |
| 105.7    | E        |
The nested list looks like this:
[[100.2,101.5],[103.6,104.6],[105.7]]
The desired output, going from the dataframe to a nested list:
[[[0,100.2,'A'],[1,101.5,'B']],[[2,103.6,'C'],[3,104.6,'D']],[[4,105.7,'E']]]
You can use pandas to solve the above problem simply:
import pandas as pd

df = pd.DataFrame({'Column A': [100.2, 101.5, 103.6, 103.6, 105.7],
                   'Column B': ['A', 'B', 'C', 'D', 'E']})
grouped = df.groupby('Column A')
result = [[[i, row['Column A'], row['Column B']] for i, row in group.iterrows()]
          for _, group in grouped]
print(result)
Result:
[[[0, 100.2, 'A']], [[1, 101.5, 'B']], [[2, 103.6, 'C'], [3, 103.6, 'D']], [[4, 105.7, 'E']]]
We can use pandas' reset_index() and tolist() functions to solve the above:
import pandas as pd

df = pd.DataFrame({'Column A': [100.2, 101.5, 103.6, 103.6, 105.7],
                   'Column B': ['A', 'B', 'C', 'D', 'E']})
df = df.reset_index(level=0)
print(df.values.tolist())
output:
[[0, 100.2, 'A'],
[1, 101.5, 'B'],
[2, 103.6, 'C'],
[3, 103.6, 'D'],
[4, 105.7, 'E']]
import pandas as pd

df = pd.DataFrame({'A': [100.2, 101.5, 103.6, 104.6, 105.7],
                   'B': ['A', 'B', 'C', 'D', 'E']})
a = [[100.2, 101.5], [103.6, 104.6], [105.7]]
df1 = df.reset_index().set_index('A', drop=False)
[[df1.loc[v].tolist() for v in row] for row in a]
Here df1 keeps 'A' as both the index and a column (via drop=False), so df1.loc[v] looks each value up by 'A' and returns [original index, A, B]. This produces
[[[0, 100.2, 'A'], [1, 101.5, 'B']],
[[2, 103.6, 'C'], [3, 104.6, 'D']],
[[4, 105.7, 'E']]]

Convert column values to a list based on ID in Pandas dataframe

I have a pandas dataframe
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'], 'value': [1, 1, 1, 1, 2, 2, 2], 'group': [1, 2, 3, 2, 2, 2, 3]})
I want to create a new dataframe with unique IDs, with all the (unique) values / groups collected into lists, which would look like this:
df_new = pd.DataFrame({'ID': ['A', 'B'], 'value': [[1], [1,2]], 'group': [[1,2,3], [2,3]]})
You can use GroupBy.agg with set:
df.groupby('ID', as_index=False).agg(set)
ID value group
0 A {1} {1, 2, 3}
1 B {1, 2} {2, 3}
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'], 'value': [1, 1, 1, 1, 2, 2, 2], 'group': [1, 2, 3, 2, 2, 2, 3]})
df_new = pd.DataFrame({'ID': ['A', 'B'], 'value': [[1], [1,2]], 'group': [[1,2,3], [2,3]]})
df.groupby('ID', as_index=False).agg(lambda x: x.unique().tolist())
# ID value group
# 0 A [1] [1, 2, 3]
# 1 B [1, 2] [2, 3]
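Note that x.unique() keeps values in order of first appearance rather than sorting them. If you want the lists sorted, a small variant (an illustrative sketch, not from the answers above):

df.groupby('ID', as_index=False).agg(lambda x: sorted(x.unique()))
#   ID   value      group
# 0  A     [1]  [1, 2, 3]
# 1  B  [1, 2]     [2, 3]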

Calculate duration between events with pandas

I have a dataframe
df = pd.DataFrame([['2018-07-02', 'B'],
                   ['2018-07-03', 'A'],
                   ['2018-07-06', 'B'],
                   ['2018-07-08', 'B'],
                   ['2018-07-09', 'A'],
                   ['2018-07-09', 'A'],
                   ['2018-07-10', 'A'],
                   ['2018-07-12', 'B'],
                   ['2018-07-15', 'A'],
                   ['2018-07-16', 'A'],
                   ['2018-07-18', 'B'],
                   ['2018-07-22', 'A'],
                   ['2018-07-25', 'B'],
                   ['2018-07-25', 'B'],
                   ['2018-07-27', 'A'],
                   ['2018-07-28', 'A']], columns=['DateEvent', 'Event'])
where counting starts with event A and ends with event B. Some events could start on more than one day and end on more than one day.
I already calculated the difference:
df = df.set_index('DateEvent')
begin = df.loc[df['Event'] == 'A'].index
cutoffs = df.loc[df['Event'] == 'B'].index
idx = cutoffs.searchsorted(begin)
mask = idx < len(cutoffs)
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]
pd.DataFrame({'begin':begin, 'end':end})
but I get the difference for multiple starts and ends also:
begin end
0 2018-07-03 2018-07-06
1 2018-07-09 2018-07-12
2 2018-07-09 2018-07-12
3 2018-07-10 2018-07-12
4 2018-07-15 2018-07-18
5 2018-07-16 2018-07-18
6 2018-07-22 2018-07-25
The desired output includes the first occurrence of event A and the last occurrence of event B... looking for maximum duration, just to be sure.
I could loop before or after to delete the unnecessary events, but is there a nicer, more pythonic way?
Thank you,
Aleš
EDIT:
I've been using the code successfully as a function in a groupby. But it's not clean and it does take some time. How can I rewrite the code to include the group in the df?
df = pd.DataFrame([['2.07.2018', 1, 'B'],
                   ['3.07.2018', 1, 'A'],
                   ['3.07.2018', 2, 'A'],
                   ['6.07.2018', 2, 'B'],
                   ['8.07.2018', 2, 'B'],
                   ['9.07.2018', 2, 'A'],
                   ['9.07.2018', 2, 'A'],
                   ['9.07.2018', 2, 'B'],
                   ['9.07.2018', 3, 'A'],
                   ['10.07.2018', 3, 'A'],
                   ['10.07.2018', 3, 'B'],
                   ['12.07.2018', 3, 'B'],
                   ['15.07.2018', 3, 'A'],
                   ['16.07.2018', 4, 'A'],
                   ['16.07.2018', 4, 'B'],
                   ['18.07.2018', 4, 'B'],
                   ['18.07.2018', 4, 'A'],
                   ['22.07.2018', 5, 'A'],
                   ['25.07.2018', 5, 'B'],
                   ['25.07.2018', 7, 'B'],
                   ['25.07.2018', 7, 'A'],
                   ['25.07.2018', 7, 'B'],
                   ['27.07.2018', 9, 'A'],
                   ['28.07.2018', 9, 'A'],
                   ['28.07.2018', 9, 'B']], columns=['DateEvent', 'Group', 'Event'])
I'm trying to somehow do a combination of cumsum on a group, but cannot get the desired results.
Thank you!
Let's try:
df = pd.DataFrame([['2018-07-02', 'B'],
                   ['2018-07-03', 'A'],
                   ['2018-07-06', 'B'],
                   ['2018-07-08', 'B'],
                   ['2018-07-09', 'A'],
                   ['2018-07-09', 'A'],
                   ['2018-07-10', 'A'],
                   ['2018-07-12', 'B'],
                   ['2018-07-15', 'A'],
                   ['2018-07-16', 'A'],
                   ['2018-07-18', 'B'],
                   ['2018-07-22', 'A'],
                   ['2018-07-25', 'B'],
                   ['2018-07-25', 'B'],
                   ['2018-07-27', 'A'],
                   ['2018-07-28', 'A']], columns=['DateEvent', 'Event'])
# Count the 'B' rows, then number the rows within each constant run;
# the first 'A' after a 'B' gets cumcount 1 and starts a new event group.
a = (df['Event'] != 'A').cumsum()
a = a.groupby(a).cumcount()
df['Event Group'] = (a == 1).cumsum()
df_out = df.groupby('Event Group').filter(lambda x: set(x['Event']) == set(['A','B']))\
.groupby('Event Group')['DateEvent'].agg(['first','last'])\
.rename(columns={'first':'start','last':'end'})\
.reset_index()
print(df_out)
Output:
Event Group start end
0 1 2018-07-03 2018-07-08
1 2 2018-07-09 2018-07-12
2 3 2018-07-15 2018-07-18
3 4 2018-07-22 2018-07-25
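Since the question ultimately asks for durations, a possible follow-up (my addition; it assumes the DateEvent strings parse cleanly with pd.to_datetime):

# Convert the string dates and take the difference to get a Timedelta per group.
df_out['duration'] = pd.to_datetime(df_out['end']) - pd.to_datetime(df_out['start'])

For example, the first row then yields a duration of 5 days (2018-07-03 to 2018-07-08).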
Edit
# Number each 'A' row by how many 'B' rows precede it, blank out the 'B'
# rows, then forward-fill so trailing 'B' rows join the preceding 'A' group.
a = (df['Event'] != 'A').cumsum().mask(df['Event'] != 'A')
df['Event Group'] = a.ffill()
df_out = df.groupby('Event Group').filter(lambda x: set(x['Event']) == set(['A','B']))\
.groupby('Event Group')['DateEvent'].agg(['first','last'])\
.rename(columns={'first':'start','last':'end'})\
.reset_index()
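For the edited question with a Group column, one possible sketch (my adaptation, not from the original answer) wraps the same logic in a function and applies it per group; it assumes rows are date-ordered within each group, so 'first' and 'last' are the earliest and latest dates:

def durations(g):
    # Same trick as above, applied within a single Group.
    a = (g['Event'] != 'A').cumsum().mask(g['Event'] != 'A')
    g = g.assign(**{'Event Group': a.ffill()})
    return (g.groupby('Event Group')
             .filter(lambda x: set(x['Event']) == set(['A', 'B']))
             .groupby('Event Group')['DateEvent']
             .agg(['first', 'last'])
             .rename(columns={'first': 'start', 'last': 'end'}))

df['DateEvent'] = pd.to_datetime(df['DateEvent'], format='%d.%m.%Y')
out = df.groupby('Group').apply(durations).reset_index()

This yields one start/end pair per event group within each Group.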

Changing 2-dimensional list to standard matrix form

org = [['A', 'a', 1],
       ['A', 'b', 2],
       ['A', 'c', 3],
       ['B', 'a', 4],
       ['B', 'b', 5],
       ['B', 'c', 6],
       ['C', 'a', 7],
       ['C', 'b', 8],
       ['C', 'c', 9]]
I want to change org into the standard matrix form shown below.
transform = [['\t', 'A', 'B', 'C'],
             ['a', 1, 4, 7],
             ['b', 2, 5, 8],
             ['c', 3, 6, 9]]
I made a small function that converts this.
The code I wrote is below:
import numpy as np

def matrix(li):
    column = ['\t']
    row = []
    result = []
    rest = []
    for i in li:
        if i[0] not in column:
            column.append(i[0])
        if i[1] not in row:
            row.append(i[1])
    result.append(column)
    for i in li:
        for r in row:
            if r == i[1]:
                rest.append([i[2]])
    rest = np.array(rest).reshape((len(row), len(column) - 1)).tolist()
    for i in range(len(rest)):
        rest[i] = [row[i]] + rest[i]
    result += rest
    for i in result:
        print(i)

matrix(org)
The result (from my actual data, not the toy org above) was this:
['\t', 'school', 'kids', 'really']
[72, 0.008962252017017516, 0.04770759762717251, 0.08993156334317577]
[224, 0.004180594204995023, 0.04450803342634945, 0.04195010047081213]
[385, 0.0021807662921382335, 0.023217182598008267, 0.06564858527712682]
I don't think this is efficient since I use so many for loops.
Is there any efficient way to do this?
Since you are already using third-party libraries, this is a task well suited to pandas. There is some messy, but not inefficient, work to incorporate the index and columns as per your requirement.
import pandas as pd

org = [['A', 'a', 1],
       ['A', 'b', 2],
       ['A', 'c', 3],
       ['B', 'a', 4],
       ['B', 'b', 5],
       ['B', 'c', 6],
       ['C', 'a', 7],
       ['C', 'b', 8],
       ['C', 'c', 9]]

df = pd.DataFrame(org)
pvt = df.pivot_table(index=0, columns=1, values=2)
cols = ['\t'] + pvt.columns.tolist()
res = pvt.values.T.tolist()
res.insert(0, pvt.index.tolist())
res = [[i] + j for i, j in zip(cols, res)]
print(res)
[['\t', 'A', 'B', 'C'],
['a', 1, 4, 7],
['b', 2, 5, 8],
['c', 3, 6, 9]]
Here's another "manual" way using only numpy:
import numpy as np

org_arr = np.array(org)
key1 = np.unique(org_arr[:, 0])
key2 = np.unique(org_arr[:, 1])
values = org_arr[:, 2].reshape((len(key1), len(key2))).transpose()

np.block([
    ["\t", key1],
    [key2[:, None], values]
])

# Alternatively, for numpy < 1.13.0:
# np.vstack((
#     np.hstack(("\t", key1)),
#     np.hstack((key2[:, None], values))
# ))
For simplicity, it requires the input to be strictly ordered (the first column is the major sort key, ascending, with the second column cycling in the same order within each major key).
Output:
array([['\t', 'A', 'B', 'C'],
['a', '1', '4', '7'],
['b', '2', '5', '8'],
['c', '3', '6', '9']],
dtype='<U1')
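Note that np.block coerces everything to strings here (hence dtype='<U1'), so the numbers come back as '1', '4', and so on. If you need a plain nested list with ints again, a small follow-up sketch continuing the snippet above (it assumes all values are non-negative integers):

table = np.block([["\t", key1],
                  [key2[:, None], values]])
# Convert purely numeric strings back to int, leaving the labels alone.
transform = [[int(x) if x.isdigit() else x for x in r] for r in table.tolist()]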

extracting a range of elements from a csv / 2d array

I want to extract a range of elements in a specific column from a csv file.
I've simplified the problem to this:
data = [['a',1,'A',100],['b',2,'B',200],['c',3,'C',300],['d',4,'D',400]]
print(data[0:2][:],'\nROWS 0&1')
print(data[:][0:2],'\nCOLS 1&1')
I thought that meant
'show me all columns for just row 0 and 1'
'show me all the rows for just column 0 and 1'
But the output is always just showing me rows 0 and 1, never the columns,
[['a', 1, 'A', 100], ['b', 2, 'B', 200]]
ROWS 0&1
[['a', 1, 'A', 100], ['b', 2, 'B', 200]]
COLS 1&1
when I want to see this:
['a', 1, 'A', 100, 'b', 2, 'B', 200]  # ... i.e. ROWS 0 and 1, flattened
['a', 'b', 'c', 'd', 1, 2, 3, 4]      # ... i.e. COLS 0 and 1, flattened
Is there a nice way to do this?
Your problem here is that data[:] is just a copy of data:
>>> data
[['a', 1, 'A', 100], ['b', 2, 'B', 200], ['c', 3, 'C', 300], ['d', 4, 'D', 400]]
>>> data[:]
[['a', 1, 'A', 100], ['b', 2, 'B', 200], ['c', 3, 'C', 300], ['d', 4, 'D', 400]]
... so both your attempts at slicing are giving you the same result as data[0:2].
You can get just columns 0 and 1 with a list comprehension:
>>> [x[0:2] for x in data]
[['a', 1], ['b', 2], ['c', 3], ['d', 4]]
... which can be rearranged to the order you want with zip():
>>> list(zip(*(x[0:2] for x in data)))
[('a', 'b', 'c', 'd'), (1, 2, 3, 4)]
To get a single list rather than a list of 2 tuples, use itertools.chain.from_iterable():
>>> from itertools import chain
>>> list(chain.from_iterable(zip(*(x[0:2] for x in data))))
['a', 'b', 'c', 'd', 1, 2, 3, 4]
... which can also be used to collapse data[0:2]:
>>> list(chain.from_iterable(data[0:2]))
['a', 1, 'A', 100, 'b', 2, 'B', 200]
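Alternatively, since the data comes from a csv and is rectangular, numpy (if you have it available) supports genuine 2D slicing; a sketch, with dtype=object preserving the mixed types:

import numpy as np

data = [['a', 1, 'A', 100], ['b', 2, 'B', 200], ['c', 3, 'C', 300], ['d', 4, 'D', 400]]
arr = np.array(data, dtype=object)
print(arr[0:2, :].ravel().tolist())    # ['a', 1, 'A', 100, 'b', 2, 'B', 200]
print(arr[:, 0:2].T.ravel().tolist())  # ['a', 'b', 'c', 'd', 1, 2, 3, 4]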
