I have a 3d array as follows:
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
array([[[715, 226, 632],
        [305,  97, 534],
        [ 88, 592, 902],
        [172, 932, 263]],

       [[895, 837, 431],
        [649, 717,  39],
        [363, 121, 274],
        [334, 359, 816]],

       [[520, 692, 230],
        [452, 816, 887],
        [688, 509, 770],
        [290, 856, 584]],

       [[286, 358, 462],
        [831,  26, 332],
        [424, 178, 642],
        [955,  42, 938]],

       [[ 44, 119, 757],
        [908, 937, 728],
        [809,  28, 442],
        [832, 220, 348]]])
Now I would like to turn it into a DataFrame like this:
Add a Date column as indicated, and the column names A, B, C.
How to do this transformation? Thanks!
Based on the answer to this question, we can use a MultiIndex. First, create the MultiIndex and a flattened DataFrame.
A = np.random.randint(0, 1000, (5, 4, 3))
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s) for s in A.shape], names=names)
df = pd.DataFrame({'A': A.flatten()}, index=index)['A']
Now we can reshape it however we like:
df = df.unstack(level='z').swaplevel().sort_index()
df.columns = ['A', 'B', 'C']
df.index.names = ['DATE', 'i']
This is the result:
A B C
DATE i
0 0 715 226 632
1 895 837 431
2 520 692 230
3 286 358 462
4 44 119 757
1 0 305 97 534
1 649 717 39
2 452 816 887
3 831 26 332
4 908 937 728
2 0 88 592 902
1 363 121 274
2 688 509 770
3 424 178 642
4 809 28 442
3 0 172 932 263
1 334 359 816
2 290 856 584
3 955 42 938
4 832 220 348
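If the DATE level should contain actual dates rather than integers (as the question suggests), the outer index level can be remapped after the reshape. A minimal sketch, assuming a hypothetical daily range starting at 2000-01-01:

```python
import numpy as np
import pandas as pd

A = np.random.randint(0, 1000, (5, 4, 3))
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s) for s in A.shape], names=names)
df = pd.DataFrame({'A': A.flatten()}, index=index)['A']
df = df.unstack(level='z').swaplevel().sort_index()
df.columns = ['A', 'B', 'C']
df.index.names = ['DATE', 'i']

# replace the integer DATE codes with real dates (the start date is hypothetical)
dates = pd.date_range('2000-01-01', periods=4, freq='D')
df.index = df.index.set_levels(dates, level='DATE')
```

The frame keeps its (20, 3) shape; only the labels of the outer level change.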
You could convert your 3D array to a Pandas Panel, then flatten it to a 2D DataFrame (using .to_frame()). Note that pd.Panel was deprecated in pandas 0.20 and removed in pandas 1.0, so this approach requires an older pandas version:
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
pan = pd.Panel(arr)
df = pan.swapaxes(0, 2).to_frame()
df.index = df.index.droplevel('minor')
df.index.name = 'Date'
df.index = df.index+1
df.columns = list('ABC')
yields
A B C
Date
1 875 702 266
1 940 180 971
1 254 649 353
1 824 677 745
...
4 675 488 939
4 382 238 225
4 923 926 633
4 664 639 616
4 770 274 378
Alternatively, you could reshape the array to shape (20, 3), form the DataFrame as usual, and then fix the index:
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame(arr.reshape(-1, 3), columns=list('ABC'))
df.index = np.repeat(np.arange(arr.shape[0]), arr.shape[1]) + 1
df.index.name = 'Date'
print(df)
yields the same result.
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame([list(l) for l in ThreeD_Arrays]).stack().apply(pd.Series).reset_index(1, drop=True)
df.index.name = 'Date'
df.columns = list('ABC')
Related
I've got a pandas DataFrame with records describing rectangles with absolute coordinates of all the 4 points: TL (top-left), TR (top-right), BL (bottom-left) and BR (bottom-right). As it is, the rects seem to follow a row-like pattern, where there are conspicuous clusters forming "rows", like in this picture:
The data look like this:
tl_x tl_y tr_x tr_y br_x br_y bl_x bl_y ht wd
0 1567 136 1707 136 1707 153 1567 153 17 140
1 1360 154 1548 154 1548 175 1360 175 21 188
2 1567 154 1747 154 1747 174 1567 174 20 180
3 1311 175 1548 175 1548 196 1311 196 21 237
4 1565 174 1741 174 1741 199 1565 199 25 176
5 1566 196 1753 196 1753 220 1566 220 24 187
...
I need to cluster these objects along the bl_y or br_y column (bottom Y coordinate) to produce a 2D list of "rows" like:
As you can see, objects in each "row" may have slightly varying Y coordinates (not exactly equal within each cluster). What I basically need is a function that adds a separate column (e.g. clustered_y) to the DataFrame, so that I can then sort by this column.
What's the simplest way to go?
Given the dataframe you provided:
import pandas as pd

df = pd.DataFrame(
    {
        "tl_x": {0: 1567, 1: 1360, 2: 1567, 3: 1311, 4: 1565, 5: 1566},
        "tl_y": {0: 136, 1: 154, 2: 154, 3: 175, 4: 174, 5: 196},
        "tr_x": {0: 1707, 1: 1548, 2: 1747, 3: 1548, 4: 1741, 5: 1753},
        "tr_y": {0: 136, 1: 154, 2: 154, 3: 175, 4: 174, 5: 196},
        "br_x": {0: 1707, 1: 1548, 2: 1747, 3: 1548, 4: 1741, 5: 1753},
        "br_y": {0: 153, 1: 175, 2: 174, 3: 196, 4: 199, 5: 220},
        "bl_x": {0: 1567, 1: 1360, 2: 1567, 3: 1311, 4: 1565, 5: 1566},
        "bl_y": {0: 153, 1: 175, 2: 174, 3: 196, 4: 199, 5: 220},
        "ht": {0: 17, 1: 21, 2: 20, 3: 21, 4: 25, 5: 24},
        "wd": {0: 140, 1: 188, 2: 180, 3: 237, 4: 176, 5: 187},
    }
)
Here is one way to do it:
# Calculate distance between "br_y" values
df = df.sort_values(by="br_y")
df["previous"] = df["br_y"].shift(1).bfill()
df["distance"] = df["br_y"] - df["previous"]
# Group values if distance > 5% of "br_y" values mean (arbitrarily chosen)
clusters = df.copy().loc[df["distance"] > 0.05 * df["br_y"].mean()]
clusters["clustered_br_y"] = [f"row{i}" for i in range(clusters.shape[0])]
# Add clusters back to dataframe and cleanup
df = (
    pd.merge(
        how="left",
        left=df,
        right=clusters["clustered_br_y"],
        left_index=True,
        right_index=True,
    )
    .ffill()
    .bfill()
    .drop(columns=["previous", "distance"])
    .reset_index(drop=True)
)
tl_x tl_y tr_x tr_y br_x br_y bl_x bl_y ht wd clustered_br_y
0 1567 136 1707 136 1707 153 1567 153 17 140 row0
1 1567 154 1747 154 1747 174 1567 174 20 180 row0
2 1360 154 1548 154 1548 175 1360 175 21 188 row0
3 1311 175 1548 175 1548 196 1311 196 21 237 row1
4 1565 174 1741 174 1741 199 1565 199 25 176 row1
5 1566 196 1753 196 1753 220 1566 220 24 187 row2
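For reference, the same gap-based idea can be sketched more compactly with diff and cumsum: a new cluster starts wherever the gap between consecutive sorted br_y values exceeds the threshold. The 5% threshold is just as arbitrary here, and boundary labels may come out slightly differently from the merge-based version; shown on just the br_y column:

```python
import pandas as pd

# the bottom-Y values from the example frame
df = pd.DataFrame({"br_y": [153, 175, 174, 196, 199, 220]})

df = df.sort_values("br_y")
threshold = 0.05 * df["br_y"].mean()  # same arbitrary 5% threshold
# a new cluster starts wherever the gap to the previous value exceeds the threshold
cluster_id = df["br_y"].diff().gt(threshold).cumsum()
df["clustered_br_y"] = "row" + cluster_id.astype(str)
```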
The following is the first couple of columns of a data frame, from which I calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc. The variable names in each pair differ only by the last character (either x or y):
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x']-df['V1_y']
new_df['Col2'] = df['V2_x']-df['V2_y']
new_df['Col3'] = df['V3_x']-df['V3_y']
Is there an approach I can use to find the column names that differ only by the last character and compute their differences automatically?
Try creating a MultiIndex header using .str.split, then reshape the DataFrame and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
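An alternative sketch that avoids reshaping: select the _x and _y columns with DataFrame.filter, align them on the stripped prefix, and subtract. (The V1_diff etc. column names are my own choice here.)

```python
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'Address': ['xx', 'yy', 'zz', 'ww'],
        'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543],
        'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)

# split off the _x and _y column groups and rename them to their shared prefix
x = df.filter(regex='_x$').rename(columns=lambda c: c[:-2])
y = df.filter(regex='_y$').rename(columns=lambda c: c[:-2])
# subtraction now aligns V1 with V1, V2 with V2, etc.
diff = x.sub(y).add_suffix('_diff')
out = pd.concat([df[['Name', 'Address']], diff], axis=1)
```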
I was trying to convert this XML table into a data frame using Beautiful Soup.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen("http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml").read()
soup = bs.BeautifulSoup(source,'xml')
GName = soup.find_all('GeneratorName')
Ftype = soup.find_all('FuelType')
Hour = soup.find_all('Hour')
Mwatt = soup.find_all('EnergyMW')
data = []
for i in range(0, len(GName)):
    rows = [GName[i].get_text(), Ftype[i].get_text(),
            Hour[i].get_text(), Mwatt[i].get_text()]
    data.append(rows)
df = pd.DataFrame(data, columns=['Generator Name', 'Fuel Type',
                                 'Hour', 'Energy MW'],
                  dtype=int)
display(df)
Generator Name Fuel Type Hour Energy MW
0 BRUCEA-G1 NUCLEAR 1 777
1 BRUCEA-G2 NUCLEAR 2 777
2 BRUCEA-G3 NUCLEAR 3 777
3 BRUCEA-G4 NUCLEAR 4 778
4 BRUCEB-G5 NUCLEAR 5 780
... ... ... ... ...
175 STONE MILLS SF SOLAR 8 0
176 WINDSOR AIRPORT SF SOLAR 9 0
177 ATIKOKAN-G1 BIOFUEL 10 0
178 CALSTOCKGS BIOFUEL 11 0
179 TBAYBOWATER CTS BIOFUEL 12 0
180 rows × 4 columns
The final data frame gives only the Energy MW of index 0; it should cover all hours for all 180 stations.
I am stuck.
Thanks!
from lxml.html import parse
import pandas as pd
def main(url):
    data = parse(url).find('.//generators')
    allin = []
    for i in data:
        allin.append({
            'GeneratorName': i[0].text,
            'FuelType': i[1].text,
            'Outputs': [x.text for x in i[2].cssselect('EnergyMW')],
            'Capabilities': [x.text for x in i[3].cssselect('EnergyMW')],
            'capacities': [x.text for x in i[4].cssselect('EnergyMW')]
        })
    df = pd.DataFrame(allin)
    print(df)

main('http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml')
Output:
GeneratorName ... capacities
0 BRUCEA-G1 ... [795, 795, 795, 795, 795, 795, 795, 795, 795, ...
1 BRUCEA-G2 ... [779, 779, 779, 779, 779, 779, 779, 779, 779, ...
2 BRUCEA-G3 ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 BRUCEA-G4 ... [760, 760, 760, 760, 760, 760, 760, 760, 760, ...
4 BRUCEB-G5 ... [817, 817, 817, 817, 817, 817, 817, 817, 817, ...
.. ... ... ...
175 STONE MILLS SF ... [54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5...
176 WINDSOR AIRPORT SF ... [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 5...
177 ATIKOKAN-G1 ... [215, 215, 215, 215, 215, 215, 215, 215, 215, ...
178 CALSTOCKGS ... [38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
179 TBAYBOWATER CTS ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
[180 rows x 5 columns]
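If per-hour columns are wanted instead of list cells, the list columns can be expanded afterwards. A sketch on a small stand-in frame (the real feed grows through the day, so the hour count varies):

```python
import pandas as pd

# stand-in for the scraped frame: two generators, three hours of 'Outputs' each
df = pd.DataFrame({
    'GeneratorName': ['BRUCEA-G1', 'BRUCEA-G2'],
    'Outputs': [['776', '775', '775'], ['779', '778', '777']],
})

# one column per hour, converted from text to integers
hours = pd.DataFrame(df['Outputs'].tolist(), index=df['GeneratorName']).astype(int)
hours.columns = range(1, hours.shape[1] + 1)
```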
One way might be to use the indicated transform document and etree.XSLT to apply the transform which generates the table. Select that table, with pandas read_html, then do a little sprucing on the headers if desired.
from lxml import etree
from pandas import read_html as rh
transform = etree.XSLT(etree.parse('http://reports.ieso.ca/docrefs/stylesheet/GenOutputCapability_HTML_t1-4.xsl'))
result_tree = transform(etree.parse('http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml'))
df = rh(str(result_tree), match = 'Hours')[0]
df.columns = df.iloc[1, :]
df = df.iloc[2:, ]
df
Read about the transform steps here: https://lxml.de/xpathxslt.html#xslt
That's a pretty gnarly XML you have there, and it takes some effort to convert it to a table that looks like the one on the page:
#first, some required imports
import itertools
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = 'http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml'
req = requests.get(url)
soup = bs(req.text,'lxml')
gens = soup.select('Generator')
#We have to get the names of the 180 generators and double it to have 2 rows each:
gen_names = [gen.select_one('generatorname').text for gen in gens]
gen_names = list(itertools.chain.from_iterable(itertools.repeat(x, 2) for x in gen_names))
# we also need to create a list of 180 Capability/Output pairs and flatten it to 360 entries:
vars = list(itertools.chain.from_iterable(itertools.repeat(x, 180) for x in [["Capability","Output"]]))
vars = list(itertools.chain(*vars))
#all that in order to create a MultiIndex dataframe:
index = pd.MultiIndex.from_arrays([gen_names,vars], names=["Generator", "Hours"])
#create column names equal to the hours - note that, depending on the time of day the data is downloaded there could be more or less columns
cols = list(range(1,20))
#now collect the data; there may be shorter ways to do this, but I used a longer method for easier readability
data = []
for gen in gens:
    row = []
    row.append([g.text for g in gen.select('Capability energymw')])
    row.append([g.text for g in gen.select('Output energymw')])
    data.extend(row)
pd.DataFrame(data, index=index, columns=cols)
Output (pardon the formatting):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Generator Hours
BRUCEA-G1 Capability 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795
Output 776 775 775 774 775 775 774 774 774 775 776 777 777 775 774 773 773 774 774
BRUCEA-G2 Capability 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779
etc.
I have a dataframe like:
x1 y1 x2 y2
0 149 2653 2152 2656
1 149 2465 2152 2468
2 149 1403 2152 1406
3 149 1215 2152 1218
4 170 2692 2170 2695
5 170 2475 2170 2478
6 170 1413 2170 1416
7 170 1285 2170 1288
I need to pair rows two at a time by index, i.e. [0,1], [2,3], [4,5], [6,7], etc., and extract x1, y1 from the first row of each pair and x2, y2 from the second row.
Sample Output:
[[149,2653,2152,2468],[149,1403,2152,1218],[170,2692,2170,2478],[170,1413,2170,1288]]
Please feel free to ask if it's not clear.
So far I tried grouping by pairs and the shift operation, but I didn't manage to make the pair records.
Python solution:
Select values of columns by positions to lists:
a = df[['x2', 'y2']].iloc[1::2].values.tolist()
b = df[['x1', 'y1']].iloc[0::2].values.tolist()
And then zip and join together in list comprehension:
L = [y + x for x, y in zip(a, b)]
print (L)
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Thank you, #user2285236 for another solution:
L = np.concatenate([df.loc[::2, ['x1', 'y1']], df.loc[1::2, ['x2', 'y2']]], axis=1).tolist()
Pure pandas solution:
First, DataFrameGroupBy.shift within each group of 2 rows:
df[['x2', 'y2']] = df.groupby(np.arange(len(df)) // 2)[['x2', 'y2']].shift(-1)
print (df)
x1 y1 x2 y2
0 149 2653 2152.0 2468.0
1 149 2465 NaN NaN
2 149 1403 2152.0 1218.0
3 149 1215 NaN NaN
4 170 2692 2170.0 2478.0
5 170 2475 NaN NaN
6 170 1413 2170.0 1288.0
7 170 1285 NaN NaN
Then remove NaNs rows, convert to int and then to list:
print (df.dropna().astype(int).values.tolist())
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Here's one solution via numpy.hstack. Note it is natural to feed numpy arrays directly to pd.DataFrame, since this is how Pandas stores data internally.
import numpy as np
arr = np.hstack((df[['x1', 'y1']].values[::2],
                 df[['x2', 'y2']].values[1::2]))
res = pd.DataFrame(arr)
print(res)
0 1 2 3
0 149 2653 2152 2468
1 149 1403 2152 1218
2 170 2692 2170 2478
3 170 1413 2170 1288
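Since the frame has four columns and the pairs are consecutive, the same result also falls out of a single reshape: each pair of rows becomes one 8-value row, from which we keep positions 0 and 1 (x1, y1 of the first row) and 6 and 7 (x2, y2 of the second row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'], data=[
    [149, 2653, 2152, 2656], [149, 2465, 2152, 2468],
    [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
    [170, 2692, 2170, 2695], [170, 2475, 2170, 2478],
    [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]])

# each pair of rows becomes one 8-value row; keep x1, y1, x2, y2 at positions 0, 1, 6, 7
L = df.to_numpy().reshape(-1, 8)[:, [0, 1, 6, 7]].tolist()
# → [[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
#    [170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
```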
Here's a solution using a custom iterator based on iterrows(), but it's a bit clunky:
import pandas as pd

df = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'], data=
    [[149, 2653, 2152, 2656], [149, 2465, 2152, 2468], [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
     [170, 2692, 2170, 2695], [170, 2475, 2170, 2478], [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]])

def iter_oddeven_pairs(df):
    row_it = df.iterrows()
    try:
        while True:
            _, row = next(row_it)
            yield row[0:2]   # x1, y1 from the first row of the pair
            _, row = next(row_it)
            yield row[2:4]   # x2, y2 from the second row of the pair
    except StopIteration:
        pass

print(pd.concat([pair for pair in iter_oddeven_pairs(df)]))
I have the following sample data:
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;
It is the equivalent of a two-dimensional array (except each row does not have an equal number of columns). At the end of each line is a space and then a \n, except for the final entry, which is followed by no space and only a ;.
Would anyone know the pyparsing grammar to parse this? I've been trying something along the following lines, but it will not match.
data = Group(OneOrMore(Group(OneOrMore(Word(nums) + SPACE)) + LINE) + \
Group(OneOrMore(Word(nums) + SPACE)) + Word(nums) + Literal(";")
The desired output would ideally be as follows
[['165', '150', '238', '402', '395', '571', '365', '446', '284', '278',
'322', '282', '236'], ['16', '5', ... ], [...], ['6013', ..., '11393']]
Any assistance would be greatly appreciated.
You can use the stopOn argument to OneOrMore to make it stop matching. Then, since newlines are by default skippable whitespace, the next group can start matching, and it will just skip over the newline and start at the next integer.
import pyparsing as pp
data_line = pp.Group(pp.OneOrMore(pp.pyparsing_common.integer(), stopOn=pp.LineEnd()))
data_lines = pp.OneOrMore(data_line) + pp.Suppress(';')
Applying this to your sample data:
data = """\
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;"""
parsed = data_lines.parseString(data)
from pprint import pprint
pprint(parsed.asList())
Prints:
[[165, 150, 238, 402, 395, 571, 365, 446, 284, 278, 322, 282, 236],
[16, 5, 19, 10, 12, 5, 18, 22, 6, 4, 5],
[259, 224, 249, 193, 170, 151, 95, 86, 101, 58, 49],
[6013, 7413, 8976, 10392, 12678, 9618, 9054, 8842, 9387, 11088, 11393]]
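For data this regular, a plain-Python alternative (not a fix to the grammar above) is to strip the trailing semicolon and split each line; this also yields the string values shown in the question's desired output:

```python
data = """165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;"""

# drop any trailing ';' or spaces, then split on whitespace
rows = [line.rstrip('; ').split() for line in data.splitlines()]
```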