Pyparsing two-dimensional list - python

I have the following sample data:
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;
It is the equivalent of a two-dimensional array (except each row does not have an equal number of columns). Each line ends with a space and then a \n, except for the final line, which is followed by no space and only a ;.
Would anyone know the pyparsing grammar to parse this? I've been trying something along the following lines, but it will not match.
data = Group(OneOrMore(Group(OneOrMore(Word(nums) + SPACE)) + LINE) + \
Group(OneOrMore(Word(nums) + SPACE)) + Word(nums) + Literal(";")
The desired output would ideally be as follows
[['165', '150', '238', '402', '395', '571', '365', '446', '284', '278',
'322', '282', '236'], ['16', '5', ... ], [...], ['6013', ..., '11393']]
Any assistance would be greatly appreciated.

You can use the stopOn argument to OneOrMore to make it stop matching. Then, since newlines are by default skippable whitespace, the next group can start matching, and it will just skip over the newline and start at the next integer.
import pyparsing as pp
data_line = pp.Group(pp.OneOrMore(pp.pyparsing_common.integer(), stopOn=pp.LineEnd()))
data_lines = pp.OneOrMore(data_line) + pp.Suppress(';')
Applying this to your sample data:
data = """\
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;"""
parsed = data_lines.parseString(data)
from pprint import pprint
pprint(parsed.asList())
Prints:
[[165, 150, 238, 402, 395, 571, 365, 446, 284, 278, 322, 282, 236],
[16, 5, 19, 10, 12, 5, 18, 22, 6, 4, 5],
[259, 224, 249, 193, 170, 151, 95, 86, 101, 58, 49],
[6013, 7413, 8976, 10392, 12678, 9618, 9054, 8842, 9387, 11088, 11393]]
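If you need the tokens kept as strings, as in the desired output shown in the question, a small variation of the same grammar (just swapping the converting integer expression for a plain Word of digits) should do it:
import pyparsing as pp
# Word(nums) keeps each number as a string instead of converting it to int.
data_line = pp.Group(pp.OneOrMore(pp.Word(pp.nums), stopOn=pp.LineEnd()))
data_lines = pp.OneOrMore(data_line) + pp.Suppress(';')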

Related

Can a DataFrame of NBA players be sorted by various conditions, to combine the rows of players with multiple entries because they played on several teams?

I want to remove any players who didn't have over 1000 MP (minutes played).
I could easily write:
import pandas as pd
league_stats = pd.read_csv("1996.csv")
league_stats = league_stats.drop("Player-additional", axis=1)
league_stats_1000 = league_stats[league_stats['MP'] > 1000]
However, because players sometimes play for multiple teams in a year, this code doesn't account for that.
For example, Sam Cassell has four entries and none is above 1000 MP, but in total his MP for the season was over 1000. By running the above code I remove him from the new dataframe.
I am wondering if there is a way to group the DataFrame by matching Rank (the Rk column gives players who played on different teams the same rank number for each team they played on) and then filter by whether the total of their MP is >= 1000.
This is the page I got the data from: 1996-1997 season.
Above the data table and to the left of the blue check box there is a dropdown menu called "Share and Export". From there I clicked on "Get table as CSV (for Excel)". After that I saved the CSV in a text editor and changed the file extension to .csv to upload it to Jupyter Notebook.
This is a solution I came up with:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
tot_df = df.loc[df['Tm'] == 'TOT']
mp_1000 = tot_df.loc[tot_df["MP"] < 1000]
# Create list of indexes with unnecessary entries to be removed. We have TOT and don't need these rows.
# *** For the record, I came up with this list by manually going through the data.
indexes_to_remove = [5,6,24, 25, 66, 67, 248, 249, 447, 448, 449, 275, 276, 277, 19, 20, 21, 377, 378, 477, 478, 479,
54, 55, 451, 452, 337, 338, 156, 157, 73, 74, 546, 547, 435, 436, 437, 142, 143, 421, 42, 43, 232,
233, 571, 572, 363, 364, 531, 532, 201, 202, 111, 112, 139, 140, 307, 308, 557, 558, 93, 94, 512,
513, 206, 207, 208, 250, 259, 286, 287, 367, 368, 271, 272, 102, 103, 34, 35, 457, 458, 190, 191,
372, 373, 165, 166
]
df_drop_tot = df.drop(labels=indexes_to_remove, axis=0)
df_drop_tot
First off, no need to manually download the csv and then read it into pandas. You can load in the table using pandas' .read_html().
And yes, you can simply get the list of ranks, player names, or whatever, that have greater than 1000 MP, then use that list to filter the dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]
df['MP'] = df['MP'].astype(int)
players_1000_rk_list = list(df[df['MP'] >= 1000]['Rk']) #<- converts the "Rk" column into a list; used in the next line to keep only the rows whose "Rk" is in this list of "Rk"s with >= 1000 MP
players_df = df[df['Rk'].isin(players_1000_rk_list)]
Output: filters down from 574 rows to 282 rows
print(players_df)
Rk Player Pos Age Tm G ... AST STL BLK TOV PF PTS
0 1 Mahmoud Abdul-Rauf PG 27 SAC 75 ... 189 56 6 119 174 1031
1 2 Shareef Abdur-Rahim PF 20 VAN 80 ... 175 79 79 225 199 1494
3 4 Cory Alexander PG 23 SAS 80 ... 254 82 16 146 148 577
7 6 Ray Allen* SG 21 MIL 82 ... 210 75 10 149 218 1102
10 9 Greg Anderson C 32 SAS 82 ... 34 63 67 73 225 322
.. ... ... .. .. ... .. ... ... ... .. ... ... ...
581 430 Walt Williams SF 26 TOR 73 ... 197 97 62 174 282 1199
582 431 Corliss Williamson SF 23 SAC 79 ... 124 60 49 157 263 915
583 432 Kevin Willis PF 34 HOU 75 ... 71 42 32 119 216 842
589 438 Lorenzen Wright C 21 LAC 77 ... 49 48 60 79 211 561
590 439 Sharone Wright C 24 TOR 60 ... 28 15 50 93 146 390
[282 rows x 30 columns]
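As an alternative sketch (my own variation, not the approach above): if you prefer to sum the minutes yourself instead of relying on the TOT rows, a groupby/transform over the per-team stints gives the same kind of filter. The column names assumed here (Rk, Tm, MP) are the ones from the table above.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]  # drop the repeated header rows
df['MP'] = df['MP'].astype(int)
# Exclude the TOT rows so season totals are not double-counted,
# then sum MP per player (Rk) across their team stints.
stints = df[df['Tm'] != 'TOT']
season_mp = stints.groupby('Rk')['MP'].transform('sum')
players_df_alt = stints[season_mp >= 1000]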

Cluster objects by geometric coordinates (Y axis)

I've got a pandas DataFrame with records describing rectangles with absolute coordinates of all the 4 points: TL (top-left), TR (top-right), BL (bottom-left) and BR (bottom-right). As it is, the rects seem to follow a row-like pattern, where there are conspicuous clusters forming "rows", like in this picture:
The data look like this:
tl_x tl_y tr_x tr_y br_x br_y bl_x bl_y ht wd
0 1567 136 1707 136 1707 153 1567 153 17 140
1 1360 154 1548 154 1548 175 1360 175 21 188
2 1567 154 1747 154 1747 174 1567 174 20 180
3 1311 175 1548 175 1548 196 1311 196 21 237
4 1565 174 1741 174 1741 199 1565 199 25 176
5 1566 196 1753 196 1753 220 1566 220 24 187
...
I need to cluster these objects along the bl_y or br_y column (bottom Y coordinate) to produce a 2D list of "rows" like:
As you can see, objects in each "row" may have slightly varying Y coordinates (not exactly equal within each cluster). What I basically need is some function to add a separate column, e.g. clustered_y, to the DF and then sort by this column.
What's the simplest way to go?
Given the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"tl_x": {0: 1567, 1: 1360, 2: 1567, 3: 1311, 4: 1565, 5: 1566},
"tl_y": {0: 136, 1: 154, 2: 154, 3: 175, 4: 174, 5: 196},
"tr_x": {0: 1707, 1: 1548, 2: 1747, 3: 1548, 4: 1741, 5: 1753},
"tr_y": {0: 136, 1: 154, 2: 154, 3: 175, 4: 174, 5: 196},
"br_x": {0: 1707, 1: 1548, 2: 1747, 3: 1548, 4: 1741, 5: 1753},
"br_y": {0: 153, 1: 175, 2: 174, 3: 196, 4: 199, 5: 220},
"bl_x": {0: 1567, 1: 1360, 2: 1567, 3: 1311, 4: 1565, 5: 1566},
"bl_y": {0: 153, 1: 175, 2: 174, 3: 196, 4: 199, 5: 220},
"ht": {0: 17, 1: 21, 2: 20, 3: 21, 4: 25, 5: 24},
"wd": {0: 140, 1: 188, 2: 180, 3: 237, 4: 176, 5: 187},
}
)
Here is one way to do it:
# Calculate distance between "br_y" values
df = df.sort_values(by="br_y")
df["previous"] = df["br_y"].shift(1).fillna(method="bfill")
df["distance"] = df["br_y"] - df["previous"]
# Group values if distance > 5% of "br_y" values mean (arbitrarily chosen)
clusters = df.copy().loc[df["distance"] > 0.05 * df["br_y"].mean()]
clusters["clustered_br_y"] = [f"row{i}" for i in range(clusters.shape[0])]
# Add clusters back to dataframe and cleanup
df = (
pd.merge(
how="left",
left=df,
right=clusters["clustered_br_y"],
left_index=True,
right_index=True,
)
.fillna(method="ffill")
.fillna(method="bfill")
.drop(columns=["previous", "distance"])
.reset_index(drop=True)
)
tl_x tl_y tr_x tr_y br_x br_y bl_x bl_y ht wd clustered_br_y
0 1567 136 1707 136 1707 153 1567 153 17 140 row0
1 1567 154 1747 154 1747 174 1567 174 20 180 row0
2 1360 154 1548 154 1548 175 1360 175 21 188 row0
3 1311 175 1548 175 1548 196 1311 196 21 237 row1
4 1565 174 1741 174 1741 199 1565 199 25 176 row1
5 1566 196 1753 196 1753 220 1566 220 24 187 row2
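A more compact sketch of the same gap-based idea (again using an arbitrarily chosen 5% threshold, and starting from the original dataframe defined above) labels a new cluster whenever the jump between consecutive sorted br_y values exceeds the threshold:
df = df.sort_values("br_y")
threshold = 0.05 * df["br_y"].mean()
# Each gap larger than the threshold starts a new "row" cluster.
df["clustered_y"] = (df["br_y"].diff() > threshold).cumsum()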

How to read data that has been split into multiple columns?

I have the following dataframe:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else, and it has been split. However, the values correspond to a single variable 'q', along with its indices. To clarify, even though there are many columns, they all correspond to one column 'q', plus an index column (notice that the starting index of each column is the continuation of the end of the previous column).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of #user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}
It seems most of your data is in the index. Try:
pairs = [list(k) + [v] for k, v in q_param['q'].items()]
df = pd.DataFrame(
    {k: v for lst in pairs for k, v in zip(lst[::2], lst[1::2])},
    index=['q'],
).T.sort_index()
Try this:
data = {
0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
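Another option is to fix things up at read time. This is only a sketch, assuming the file is whitespace-delimited with the literal header line q followed by rectangular rows of alternating index/value columns:
import pandas as pd
raw = pd.read_csv('Initial_solutions/initial_q_20y.dat',
                  delim_whitespace=True, skiprows=1, header=None)
# Even-positioned columns hold the indices, odd-positioned ones the values;
# ravel(order='F') stacks them one column block after another.
idx = raw.iloc[:, 0::2].to_numpy().ravel(order='F').astype(int)
vals = raw.iloc[:, 1::2].to_numpy().ravel(order='F')
q = pd.Series(vals, index=idx, name='q').sort_index()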

Group by continuous indexes in Pandas DataFrame

I'm working on code for sensor data analysis using Python.
I'm taking rows from a DataFrame (of gyro data in this example) according to some condition.
import pandas as pd
gyro = pd.read_csv("gyroOutput.csv")
above = gyro[gyro['gyro_z'] > 0.30]
above
Out[162]:
gyro_x gyro_y gyro_z elapsed
27 0.026632 0.021305 0.305731 4.927
28 0.017044 0.011718 0.344080 5.115
29 0.008522 0.013848 0.380299 5.289
30 0.006392 0.026632 0.412257 5.470
31 0.007457 0.005326 0.448476 5.643
32 -0.004261 0.012783 0.465521 5.822
33 -0.001065 0.000000 0.452737 6.002
34 0.009587 0.006392 0.445281 6.181
35 0.010653 0.001065 0.412257 6.361
36 0.006392 0.003196 0.373908 6.543
37 -0.006392 0.007457 0.320645 6.722
108 -0.036219 0.052198 0.323840 19.470
109 -0.061785 -0.001065 0.389887 19.654
110 -0.049002 0.018109 0.453803 19.835
111 -0.038350 0.078830 0.513458 20.015
112 -0.034088 0.011718 0.555003 20.192
113 -0.005326 -0.001065 0.607201 20.374
114 0.009587 0.058590 0.629571 20.553
115 0.038350 -0.029827 0.598679 20.727
116 0.006392 0.013848 0.546481 20.907
117 0.007457 0.030893 0.478304 21.086
118 0.012783 -0.035154 0.446346 21.266
119 0.005326 -0.026632 0.367516 21.444
352 0.007457 0.028762 0.313188 63.284
353 0.006392 -0.011718 0.332363 63.463
354 0.008522 0.030893 0.378169 63.643
355 -0.015979 0.039415 0.409062 63.822
356 -0.009587 -0.022371 0.423975 64.002
357 -0.008522 0.023436 0.450607 64.181
358 -0.011718 0.047937 0.453803 64.361
That result DataFrame (above) holds groups of rows with consecutive indexes, for example rows 27-37.
I want to get all those groups, but I couldn't find any way to do it using DataFrame.groupby or any other function.
I could iterate over the rows and separate them myself, but maybe there's a simpler way using pandas functions.
IIUC:
In [294]: df.groupby(df.index.to_series().diff().ne(1).cumsum()).groups
Out[294]:
{1: Int64Index([27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], dtype='int64'),
2: Int64Index([108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119], dtype='int64'),
3: Int64Index([352, 353, 354, 355, 356, 357, 358], dtype='int64')}
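If you want the actual sub-frames rather than just their index labels, the same grouping key can feed a list comprehension (a small sketch using the question's above dataframe in place of df):
key = above.index.to_series().diff().ne(1).cumsum()
# One DataFrame per run of consecutive indexes.
runs = [group for _, group in above.groupby(key)]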

How to transform a 3d arrays into a dataframe in python

I have a 3d array as follows:
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
array([[[715, 226, 632],
[305, 97, 534],
[ 88, 592, 902],
[172, 932, 263]],
[[895, 837, 431],
[649, 717, 39],
[363, 121, 274],
[334, 359, 816]],
[[520, 692, 230],
[452, 816, 887],
[688, 509, 770],
[290, 856, 584]],
[[286, 358, 462],
[831, 26, 332],
[424, 178, 642],
[955, 42, 938]],
[[ 44, 119, 757],
[908, 937, 728],
[809, 28, 442],
[832, 220, 348]]])
Now I would like to turn it into a DataFrame like this, adding a Date column as indicated along with the column names A, B and C.
How can I do this transformation? Thanks!
Based on the answer to this question, we can use a MultiIndex. First, create the MultiIndex and a flattened DataFrame.
import numpy as np
import pandas as pd
A = np.random.randint(0, 1000, (5, 4, 3))
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s) for s in A.shape], names=names)
df = pd.DataFrame({'A': A.flatten()}, index=index)['A']
Now we can reshape it however we like:
# Unstack the innermost level (z) into the columns, then swap the remaining
# levels so the second axis of the array becomes the outer index.
df = df.unstack(level='z').swaplevel().sort_index()
df.columns = ['A', 'B', 'C']
df.index.names = ['DATE', 'i']
This is the result:
A B C
DATE i
0 0 715 226 632
1 895 837 431
2 520 692 230
3 286 358 462
4 44 119 757
1 0 305 97 534
1 649 717 39
2 452 816 887
3 831 26 332
4 908 937 728
2 0 88 592 902
1 363 121 274
2 688 509 770
3 424 178 642
4 809 28 442
3 0 172 932 263
1 334 359 816
2 290 856 584
3 955 42 938
4 832 220 348
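If DATE and i are wanted as ordinary columns rather than index levels (the question asks to add a Date column), a reset_index() call on this result gives that; this line is a small addition of mine, not part of the original answer:
df = df.reset_index()  # DATE and i become regular columns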
You could convert your 3D array to a pandas Panel, then flatten it to a 2D DataFrame using .to_frame() (note that pd.Panel was removed in pandas 1.0, so this answer needs an older pandas version):
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
pan = pd.Panel(arr)
df = pan.swapaxes(0, 2).to_frame()
df.index = df.index.droplevel('minor')
df.index.name = 'Date'
df.index = df.index+1
df.columns = list('ABC')
yields
A B C
Date
1 875 702 266
1 940 180 971
1 254 649 353
1 824 677 745
...
4 675 488 939
4 382 238 225
4 923 926 633
4 664 639 616
4 770 274 378
Alternatively, you could swap the array's first two axes, reshape the result to shape (20, 3), form the DataFrame as usual, and then fix the index:
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
# Swap the first two axes so the rows line up with the Panel result above.
df = pd.DataFrame(arr.swapaxes(0, 1).reshape(-1, 3), columns=list('ABC'))
df.index = np.repeat(np.arange(arr.shape[1]), arr.shape[0]) + 1
df.index.name = 'Date'
print(df)
yields the same result.
import numpy as np
import pandas as pd
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame([list(l) for l in ThreeD_Arrays]).stack().apply(pd.Series).reset_index(1, drop=True)
df.index.name = 'Date'
df.columns = list('ABC')
