Group by multiple columns and convert the result to a dataFrame/array - python

Hi, I have a dataFrame like this:
Value day hour min
Time
2015-12-19 10:08:52 1805 2015-12-19 10 8
2015-12-19 10:09:52 1794 2015-12-19 10 9
2015-12-19 10:19:51 1796 2015-12-19 10 19
2015-12-19 10:20:51 1806 2015-12-19 10 20
2015-12-19 10:29:52 1802 2015-12-19 10 29
2015-12-19 10:30:52 1800 2015-12-19 10 30
2015-12-19 10:40:51 1804 2015-12-19 10 40
2015-12-19 10:41:51 1798 2015-12-19 10 41
2015-12-19 10:50:51 1790 2015-12-19 10 50
2015-12-19 10:51:52 1811 2015-12-19 10 51
2015-12-19 11:00:51 1803 2015-12-19 11 0
2015-12-19 11:01:52 1784 2015-12-19 11 1
... ... ... ... ...
2016-07-15 17:30:13 1811 2016-07-15 17 30
2016-07-15 17:31:13 1787 2016-07-15 17 31
2016-07-15 17:41:13 1800 2016-07-15 17 41
2016-07-15 17:42:13 1795 2016-07-15 17 42
I want to group it by day and hour, and then collect the "Value" column into a multi-dimensional array. Based on the grouping of day and hour, I need to get something like this for each hour:
2015-12-19 10 [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179... ]
2015-12-20 11 [1803, 1793, 1795, 1801, 1796, 1796, 1788, 180... ]
...
2016-07-15 17 [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]
In the end, I would like to have a dataframe like this:
Time_index hour value1 value2 value3 ........value20
2015-12-19 10 1805, 1794, 1796, 1806 ... 1804, 1791, 1788, 1812
2015-12-20 11 1803, 1793, 1795, 1801 ... 1796, 1796, 1788, 1800
...
2016-07-15 17 1794, 1792, 1788, 1799 ... 1811, 1803, 1808, 1790
OR an array like this:
[[1805, 1794, 1796, 1806, 1802, 1800, 1804, 179... ],[1803, 1793, 1795, 1801, 1796, 1796, 1788, 180... ]....[1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]]
I was able to get groupby to work with a single column:
grouped_0 = train_df.groupby(['day'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']
The 'grouped' column of the resulting dataFrame grouped looks like this:
2015-12-19 [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...
2015-12-20 [1790, 1809, 1809, 1789, 1807, 1804, 1790, 179...
2015-12-21 [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179...
2015-12-22 [1815, 1812, 1798, 1808, 1802, 1788, 1808, 179...
2015-12-23 [1803, 1800, 1799, 1803, 1802, 1804, 1788, 179...
2015-12-24 [1803, 1795, 1801, 1798, 1799, 1802, 1799, 179...
However, when I tried this:
grouped_0 = train_df.groupby(['day', 'hour'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']
it threw this error:
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 4036, in aggregate
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 3476, in aggregate
return self._python_agg_general(arg, *args, **kwargs)
File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 848, in _python_agg_general
result, counts = self.grouper.agg_series(obj, f)
File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2180, in agg_series
return self._aggregate_series_pure_python(obj, func)
File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2215, in _aggregate_series_pure_python
raise ValueError('Function does not reduce')
ValueError: Function does not reduce
My pandas version:
pd.__version__
'0.20.3'

Yes, using agg for this isn't the best idea, because pandas does not consider the result valid unless the function reduces each group to a single value (or a single-element container).
You can use groupby + apply for this.
g = df.groupby(['day', 'hour']).Value.apply(lambda x: x.values.tolist())
g
day hour
2015-12-19 10 [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...
11 [1803, 1784]
2016-07-15 17 [1811, 1787, 1800, 1795]
Name: Value, dtype: object
If you want each element in its own column, you'd do it like this:
v = pd.DataFrame(g.values.tolist(), index=g.index)\
.rename(columns=lambda x: 'value{}'.format(x + 1)).reset_index()
v is your final result.
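To make the whole approach concrete, here is a minimal, runnable sketch. The day/hour/Value data below is invented for illustration; only the column names come from the question:

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frame.
df = pd.DataFrame({
    'day': ['2015-12-19'] * 4 + ['2016-07-15'] * 2,
    'hour': [10, 10, 11, 11, 17, 17],
    'Value': [1805, 1794, 1803, 1784, 1811, 1787],
})

# Collect each (day, hour) group's values into a list.
g = df.groupby(['day', 'hour'])['Value'].apply(list)

# g.tolist() is already the requested list-of-lists array.
# To spread each list into value1, value2, ... columns instead:
v = (pd.DataFrame(g.tolist(), index=g.index)
       .rename(columns=lambda x: f'value{x + 1}')
       .reset_index())
print(v)
```

Note this wide form only lines up cleanly when every (day, hour) group has the same number of readings; groups with fewer values would get NaN in the trailing columns.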

Related

How to remove unwanted lines in Azure (Python)

/usr/local/lib/python3.8/dist-packages/attr/__init__.py 27 0 100%
/usr/local/lib/python3.8/dist-packages/attr/_cmp.py 55 45 18% 51-100, 108-114, 122-137, 144-147, 154
/usr/local/lib/python3.8/dist-packages/attr/_compat.py 96 48 50% 22-24, 28-107, 123, 132, 153-156, 175, 191-212, 234, 241-242
/usr/local/lib/python3.8/dist-packages/attr/_config.py 9 4 56% 19-22, 33
/usr/local/lib/python3.8/dist-packages/attr/_funcs.py 96 84 12% 54-116, 130-189, 225-289, 301, 323-341, 360-370, 409-422
/usr/local/lib/python3.8/dist-packages/attr/_make.py 977 346 65% 84, 87, 90, 115-116, 121, 274, 280, 285, 293, 296, 299, 351-352, 413, 431, 450, 457-481, 501-507, 529-532, 556, 581, 590-591, 602, 611, 623-634, 642, 649, 734-754, 763, 792-796, 807-810, 838-839, 847, 881, 914-915, 918, 929-939, 954, 962-971, 1011, 1064, 1069-1090, 1098-1099, 1105-1106, 1112-1113, 1130, 1134, 1145, 1156, 1163, 1170-1171, 1186, 1212-1216, 1501, 1509, 1514, 1523, 1552, 1571, 1576, 1583, 1596, 1610, 1620, 1641-1646, 1690-1698, 1722-1732, 1758-1762, 1788-1799, 1829, 1840-1843, 1849-1852, 1858-1861, 1867-1870, 1928, 1954-2015, 2047-2054, 2075-2082, 2093-2099, 2103, 2131, 2138, 2144-2147, 2149, 2200, 2213, 2224, 2235-2287, 2313, 2336, 2344, 2380, 2388-2396, 2407-2418, 2428, 2447, 2454-2469, 2488, 2544-2553, 2558-2560, 2564-2569, 2694, 2702, 2732-2734, 2748-2752, 2759, 2768, 2771-2776, 2925-2929, 2941-2946, 2981, 2987-2988, 3035-3079, 3095-3096, 3109-3117, 3135-3173
/usr/local/lib/python3.8/dist-packages/attr/_next_gen.py 37 24 35% 82-147, 175, 198, 214
/usr/local/lib/python3.8/dist-packages/attr/_version_info.py 37 17 54% 60-69, 72-77, 80-87
/usr/local/lib/python3.8/dist-packages/attr/converters.py 58 47 19% 40-62, 83-114, 143-155
/usr/local/lib/python3.8/dist-packages/attr/exceptions.py 18 4 78% 89-91, 94
/usr/local/lib/python3.8/dist-packages/attr/filters.py 16 9 44% 17, 32-37, 49-54
/usr/local/lib/python3.8/dist-packages/attr/setters.py 28 16 43% 21-26, 37, 46-55, 65-69
/usr/local/lib/python3.8/dist-packages/yaml/resolver.py 135 97 28% 22-23, 30, 33, 51-89, 92-112, 115-118, 122-141, 144-165
/usr/local/lib/python3.8/dist-packages/yaml/scanner.py 753 672 11% 39-44, 60-109, 115-123, 128-133, 137-141, 146-154, 159-258, 272-277, 286-293, 301-310, 314-321, 340-347, 351-355, 364-367, 374-388, 393-400, 403, 406, 411-422, 425, 428, 433-445, 448, 451, 456-468, 473-482, 487-515, 520-543, 548-599, 604-610, 615-621, 626-632, 635, 638, 643-649, 652, 655, 660-666, 671-679, 687-688, 693-696, 701-704, 709, 714-719, 724-729, 745-746, 772-785, 789-804, 808-825, 829-842, 846-855, 859-865, 869-874, 878-883, 887-897, 908-933, 937-974, 979-1049, 1054-1090, 1094-1104, 1108-1119, 1123-1132, 1141-1155, 1187-1226, 1230-1250, 1254-1268, 1276-1309, 1315-1346, 1352-1370, 1375-1395, 1399-1414, 1425-1435
/usr/local/lib/python3.8/dist-packages/yaml/serializer.py 85 70 18% 17-25, 28-34, 37-41, 47-58, 61-72, 75-76, 79-110
/usr/local/lib/python3.8/dist-packages/yaml/tokens.py
These lines report coverage for third-party packages from other repos.
So how do I remove all these unwanted lines in Azure while running the pipeline?
Please provide a solution.

How to plot dates on the x axis

So I have this df with the first column called "Week":
0 2018-01-07
1 2018-01-14
2 2018-01-21
3 2018-01-28
4 2018-02-04
5 2018-02-11
6 2018-02-18
7 2018-02-25
8 2018-03-04
9 2018-03-11
10 2018-03-18
11 2018-03-25
12 2018-04-01
13 2018-04-08
14 2018-04-15
15 2018-04-22
16 2018-04-29
17 2018-05-06
Name: Week, dtype: object
And three other columns with different names and integers as values.
My idea is to plot these dates on the X axis and the other 3 columns of ints on the Y.
I've tried everything I found but nothing has worked yet...
I did:
df.set_index('Week')
df.plot()
plt.show()
That plotted fine, but the X axis was still just floats in range(0, 17)...
I also tried:
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week')
df.plot()
plt.show()
But I got this error:
Traceback (most recent call last):
File "C:\Users\mar\Desktop\Web Dev\PJ E\EZA.py", line 33, in <module>
df.plot()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 2677, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1902, in plot_frame
**kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1729, in _plot
plot_obj.generate()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._post_plot_logic_common(ax, self.data)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 397, in _post_plot_logic_common
self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 470, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1188, in get_majorticklabels
ticks = self.get_major_ticks()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1339, in get_major_ticks
numticks = len(self.get_major_locator()())
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1054, in __call__
self.refresh()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1074, in refresh
dmin, dmax = self.viewlim_to_dt()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 832, in viewlim_to_dt
return num2date(vmin, self.tz), num2date(vmax, self.tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 441, in num2date
return _from_ordinalf(x, tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 256, in _from_ordinalf
dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
ValueError: ordinal must be >= 1
Thanks in advance.
You can do something like this (note the inplace=True; set_index returns a new DataFrame by default, so your second attempt never actually changed df):
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week', inplace=True)
df.plot()
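A self-contained sketch of the fix, using invented weekly data (the 'sales' column is hypothetical, and the Agg backend is only there so the example runs headless; drop that line when plotting interactively):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless runs
import matplotlib.pyplot as plt

# Hypothetical data mirroring the question's 'Week' column.
df = pd.DataFrame({
    'Week': ['2018-01-07', '2018-01-14', '2018-01-21'],
    'sales': [10, 12, 9],
})

df['Week'] = pd.to_datetime(df['Week'])
df = df.set_index('Week')  # reassign (or pass inplace=True) -- set_index returns a new frame
ax = df.plot()
plt.show()
```

Without the reassignment, the datetime 'Week' column stays among the plotted values, which is what triggered the original ValueError.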

Convert pandas dataframe of lists to dict of dataframes

I have a dataframe (with a DateTime index), in which some of the columns contain lists, each with 6 elements.
In: dframe.head()
Out:
A B \
timestamp
2017-05-01 00:32:25 30 [-3512, 375, -1025, -358, -1296, -4019]
2017-05-01 00:32:55 30 [-3519, 372, -1026, -361, -1302, -4020]
2017-05-01 00:33:25 30 [-3514, 371, -1026, -360, -1297, -4018]
2017-05-01 00:33:55 30 [-3517, 377, -1030, -363, -1293, -4027]
2017-05-01 00:34:25 30 [-3515, 372, -1033, -361, -1299, -4025]
C D
timestamp
2017-05-01 00:32:25 [1104, 1643, 625, 1374, 5414, 2066] 49.93
2017-05-01 00:32:55 [1106, 1643, 622, 1385, 5441, 2074] 49.94
2017-05-01 00:33:25 [1105, 1643, 623, 1373, 5445, 2074] 49.91
2017-05-01 00:33:55 [1105, 1646, 620, 1384, 5438, 2076] 49.91
2017-05-01 00:34:25 [1104, 1645, 613, 1374, 5431, 2082] 49.94
I have a dictionary dict_of_dfs which I want to populate with 6 dataframes,
dict_of_dfs = {1: df1, 2:df2, 3:df3, 4:df4, 5:df5, 6:df6}
where the ith dataframe contains the ith items from each list, so the first dataframe in the dict will be:
In:df1
Out:
A B C D
timestamp
2017-05-01 00:32:25 30 -3512 1104 49.93
2017-05-01 00:32:55 30 -3519 1106 49.94
2017-05-01 00:33:25 30 -3514 1105 49.91
2017-05-01 00:33:55 30 -3517 1105 49.91
2017-05-01 00:34:25 30 -3515 1104 49.94
and so-on.
The actual dataframe has more columns than this and thousands of rows.
What's the simplest, most pythonic way to make the conversion?
You can use a dict comprehension with assign, selecting values from the lists with str[0], str[1], ...:
N = 6
dfs = {i:df.assign(B=df['B'].str[i-1], C=df['C'].str[i-1]) for i in range(1,N + 1)}
print(dfs[1])
timestamp A B C D
0 2017-05-01 00:32:25 30 -3512 1104 49.93
1 2017-05-01 00:32:55 30 -3519 1106 49.94
2 2017-05-01 00:33:25 30 -3514 1105 49.91
3 2017-05-01 00:33:55 30 -3517 1105 49.91
4 2017-05-01 00:34:25 30 -3515 1104 49.94
Another solution:
dfs = {i:df.apply(lambda x: x.str[i-1] if type(x.iat[0]) == list else x) for i in range(1,7)}
print(dfs[1])
timestamp A B C D
0 2017-05-01 00:32:25 30 -3512 1104 49.93
1 2017-05-01 00:32:55 30 -3519 1106 49.94
2 2017-05-01 00:33:25 30 -3514 1105 49.91
3 2017-05-01 00:33:55 30 -3517 1105 49.91
4 2017-05-01 00:34:25 30 -3515 1104 49.94
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
In [185]: %timeit {i:df.assign(B=df['B'].str[i-1], C=df['C'].str[i-1]) for i in range(1,N+1)}
1 loop, best of 3: 420 ms per loop
In [186]: %timeit {i:df.apply(lambda x: x.str[i-1] if type(x.iat[0]) == list else x) for i in range(1,7)}
1 loop, best of 3: 447 ms per loop
In [187]: %timeit {(i+1):df.applymap(lambda x: x[i] if type(x) == list else x) for i in range(6)}
1 loop, best of 3: 881 ms per loop
Setup
df = pd.DataFrame({'A': {'2017-05-01 00:32:25': 30,
'2017-05-01 00:32:55': 30,
'2017-05-01 00:33:25': 30,
'2017-05-01 00:33:55': 30,
'2017-05-01 00:34:25': 30},
'B': {'2017-05-01 00:32:25': [-3512, 375, -1025, -358, -1296, -4019],
'2017-05-01 00:32:55': [-3519, 372, -1026, -361, -1302, -4020],
'2017-05-01 00:33:25': [-3514, 371, -1026, -360, -1297, -4018],
'2017-05-01 00:33:55': [-3517, 377, -1030, -363, -1293, -4027],
'2017-05-01 00:34:25': [-3515, 372, -1033, -361, -1299, -4025]},
'C': {'2017-05-01 00:32:25': [1104, 1643, 625, 1374, 5414, 2066],
'2017-05-01 00:32:55': [1106, 1643, 622, 1385, 5441, 2074],
'2017-05-01 00:33:25': [1105, 1643, 623, 1373, 5445, 2074],
'2017-05-01 00:33:55': [1105, 1646, 620, 1384, 5438, 2076],
'2017-05-01 00:34:25': [1104, 1645, 613, 1374, 5431, 2082]},
'D': {'2017-05-01 00:32:25': 49.93,
'2017-05-01 00:32:55': 49.94,
'2017-05-01 00:33:25': 49.1,
'2017-05-01 00:33:55': 49.91,
'2017-05-01 00:34:25': 49.94}})
Solution
Construct the dict of DataFrames using a dict comprehension. Each sub-DataFrame is generated with applymap, which converts every column whose values are 6-element lists:
dict_of_dfs = {(i+1):df.applymap(lambda x: x[i] if type(x) == list else x) for i in range(6)}
print(dict_of_dfs[1])
A B C D
2017-05-01 00:32:25 30 -3512 1104 49.93
2017-05-01 00:32:55 30 -3519 1106 49.94
2017-05-01 00:33:25 30 -3514 1105 49.10
2017-05-01 00:33:55 30 -3517 1105 49.91
2017-05-01 00:34:25 30 -3515 1104 49.94
print(dict_of_dfs[2])
A B C D
2017-05-01 00:32:25 30 375 1643 49.93
2017-05-01 00:32:55 30 372 1643 49.94
2017-05-01 00:33:25 30 371 1643 49.10
2017-05-01 00:33:55 30 377 1646 49.91
2017-05-01 00:34:25 30 372 1645 49.94
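As a further sketch (not from the answers above): each list column can be expanded into its own positional frame once with pd.DataFrame(col.tolist()), which avoids repeated .str indexing per position. Hypothetical two-row data with a single list column:

```python
import pandas as pd

# Hypothetical frame shaped like the question's (one list column shown).
df = pd.DataFrame({
    'A': [30, 30],
    'B': [[-3512, 375], [-3519, 372]],
})

# Expand the list column once; columns 0..5 hold the positional elements.
expanded = pd.DataFrame(df['B'].tolist(), index=df.index)

# Build the dict of per-position frames.
dict_of_dfs = {i + 1: df.assign(B=expanded[i]) for i in range(expanded.shape[1])}
print(dict_of_dfs[1]['B'].tolist())  # [-3512, -3519]
```

With several list columns you would build one expanded frame per column and assign them together inside the comprehension.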

Compute the running (cumulative) maximum for a series in pandas

Given:
d = {
'High': [954,
953,
952,
955,
956,
952,
951,
950,
]
}
df = pandas.DataFrame(d)
I want to add another column which is the max at each index from the beginning. For example the desired column would be:
'Max': [954,
954,
954,
955,
956,
956,
956,
956]
I tried a pandas rolling function, but it seems the window cannot be dynamic.
Use cummax
df.High.cummax()
0 954
1 954
2 954
3 955
4 956
5 956
6 956
7 956
Name: High, dtype: int64
df['Max'] = df.High.cummax()
df
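For completeness, a runnable version. expanding().max() computes the same running maximum as cummax (it returns floats, hence the cast back to int):

```python
import pandas as pd

df = pd.DataFrame({'High': [954, 953, 952, 955, 956, 952, 951, 950]})

# Running maximum from the beginning of the series up to each row.
df['Max'] = df['High'].cummax()

# expanding().max() is the equivalent "window grows with the data" form.
assert df['High'].expanding().max().astype(int).equals(df['Max'])
print(df['Max'].tolist())  # [954, 954, 954, 955, 956, 956, 956, 956]
```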

Plot a data frame

I have a data frame like this:
ReviewDate_month,ProductId,Reviewer
01,185,185
02,155,155
03,130,130
04,111,111
05,110,110
06,98,98
07,101,92
08,71,71
09,73,73
10,76,76
11,105,105
12,189,189
I want to plot it, ideally with ReviewDate_month on X and both ProductId and Reviewer on Y. But I will start with one line, either ProductId or Reviewer.
So I tried:
df_no_monthlycount.plot.line
I got the error message below (my script's print output was interleaved with the traceback):
File "C:/Users/user/PycharmProjects/Assign2/Main.py", line 59, in <module>
df_no_monthlycount.plot.line
AttributeError: 'function' object has no attribute 'line'
Process finished with exit code 1
I also tried this:
df_no_monthlycount.plot(x=df_helful_monthlymean['ReviewDate_month'],y=df_helful_monthlymean['ProductId'],style='o')
The error message was like this:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Assign2/Main.py", line 52, in <module>
df_no_monthlycount.plot(x=df_helful_monthlymean['ReviewDate_month'],y=df_helful_monthlymean['ProductId'],style='o')
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
return self._getitem_column(key)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
return self._get_item_cache(key)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1084, in _get_item_cache
values = self._data.get(item)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2851, in get
loc = self.items.get_loc(item)
File "C:\Python34\lib\site-packages\pandas\core\index.py", line 1572, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3838)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3718)
File "pandas\hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12294)
File "pandas\hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12245)
KeyError: 'ReviewDate_month'
Call the plot as shown below:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
print(df)
df.plot(x ='ReviewDate_month',y=['ProductId', 'Reviewer'] ,kind='line')
plt.show()
This will give you a line plot of both columns against ReviewDate_month.
If you want to plot ReviewDate_month on X and ProductId and Reviewer on Y, you can do it this way:
df_no_monthlycount.plot(x='ReviewDate_month', y=['ProductId', 'Reviewer'])
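A self-contained version with the question's data inlined as a sample frame. The key point is passing column names (strings) to x and y, not the column Series themselves, which is what caused the KeyError (the Agg backend is only there so the sketch runs headless):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless runs
import matplotlib.pyplot as plt

# First rows of the question's CSV, inlined as sample data.
df = pd.DataFrame({
    'ReviewDate_month': ['01', '02', '03', '04'],
    'ProductId': [185, 155, 130, 111],
    'Reviewer': [185, 155, 130, 111],
})

# Pass column *names* to x/y -- not df['col'] Series.
ax = df.plot(x='ReviewDate_month', y=['ProductId', 'Reviewer'], kind='line')
plt.show()
```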
