I have a pandas dataframe whose columns I have turned into a list, edited, and rearranged. I'm trying to reassign the columns as follows:
sapie_columns = sapie_df_working.columns.tolist()
sapie_columns = [sapie_columns[-1]] + sapie_columns[3:-1]
sapie_df_working = sapie_df_working[sapie_columns]
but it turns my dataframe (initially with 32 columns) into one with 164 columns. I think this is because a number of the existing columns share the same name (e.g., "90% CI Lower Bound"). I'm curious why this is happening and how I can rearrange and edit my dataframe's columns as I want to.
For reference, here is a snippet of my dataframe:
# sapie_df_working
2 State FIPS Code County FIPS Code Postal Code Name Poverty Estimate, All Ages 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, All Ages 90% CI Lower Bound 90% CI Upper Bound ... 90% CI Upper Bound Median Household Income 90% CI Lower Bound 90% CI Upper Bound Poverty Estimate, Age 0-4 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, Age 0-4 90% CI Lower Bound 90% CI Upper Bound
3 00 000 US United States 38371394 38309115 38433673 11.9 11.9 11.9 ... 14.9 67340 67251 67429 3146325 3133736 3158914 16.8 16.7 16.9
4 01 000 AL Alabama 714568 695249 733887 14.9 14.5 15.3 ... 20.7 53958 53013 54903 66169 61541 70797 23.3 21.7 24.9
5 01 001 AL Autauga County 6242 4930 7554 11.2 8.8 13.6 ... 19.3 67565 59132 75998 . . . . . .
6 01 003 AL Baldwin County 20189 15535 24843 8.9 6.8 11 ... 16.1 71135 66540 75730 . . . . . .
7 01 005 AL Barbour County 5548 4210 6886 25.5 19.3 31.7 ... 47.2 38866 33510 44222 . . . . . .
df = df[specific_column_names] is indeed producing this result because of duplicate column names. Filtering with column names is tricky in this case: each duplicated label matches several columns, so it's ambiguous exactly which column is being referred to.
In the case of duplicate column names I would instead use column indices to filter the DataFrame.
Let's look at an example:
>>> import pandas as pd
>>> mock_data = [[11.29, 33.1283, -1.219, -33.11, 930.1, 33.91, 0.1213, 0.134], [9.0, 99.101, 9381.0, -940.11, 55.41, -941.1, -1.3913, 1933.1], [-192.1, 0.123, 0.1243, 0.213, 751.1, 991.1, -1.333, 9481.1]]
>>> mock_columns = ['a', 'b', 'c', 'a', 'd', 'b', 'g', 'a']
>>> df = pd.DataFrame(columns=mock_columns, data=mock_data)
>>> df
a b c a d b g a
0 11.29 33.1283 -1.2190 -33.110 930.10 33.91 0.1213 0.134
1 9.00 99.1010 9381.0000 -940.110 55.41 -941.10 -1.3913 1933.100
2 -192.10 0.1230 0.1243 0.213 751.10 991.10 -1.3330 9481.100
>>> columns = df.columns.tolist()
>>> filtered_column_indices = [len(columns) - 1] + list(range(3, len(columns) - 1))
>>> df.iloc[:, filtered_column_indices]
a a d b g
0 0.134 -33.110 930.10 33.91 0.1213
1 1933.100 -940.110 55.41 -941.10 -1.3913
2 9481.100 0.213 751.10 991.10 -1.3330
In the example, instead of extracting column names with [sapie_columns[-1]] + sapie_columns[3:-1], I extracted the equivalent indices and used that to filter the DataFrame using iloc.
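To see why the column count balloons in the first place, here is a minimal sketch (mock labels, not the real census columns): selecting a duplicated label returns every matching column, so a name list that repeats duplicated labels multiplies them out.

```python
import pandas as pd

# A small mock of the situation in the question: the label 'a' appears
# three times, loosely mirroring the repeated "90% CI Lower Bound" columns.
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=['a', 'b', 'a', 'a'])

# Selecting a duplicated label returns *every* matching column...
print(df['a'].shape)        # (2, 3): three 'a' columns come back, not one

# ...so a name list that itself repeats duplicated labels multiplies them
# out, which is how a 32-column frame can balloon to 164 columns.
cols = df.columns.tolist()  # ['a', 'b', 'a', 'a']
print(df[cols].shape)       # (2, 10): each 'a' entry pulls all three 'a' columns
```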
I need to combine two dataframes that contain information about train track sections. While "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the line, each defined by a start point and an end point on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I get around basic calculations between rows and columns, I'm at my wit's end with this problem; the approach of merging and sorting the two dataframes and calculating the respective differences between start-/endpoints didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
def find_interval(df, interval):
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
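The same splitting can be written a bit more compactly with pd.IntervalIndex. A sketch using the sample values from the question; like the solution above, it assumes the subsections within each frame don't overlap:

```python
import numpy as np
import pandas as pd

# Sample values taken from the question's df1/df2 snippets.
df1 = pd.DataFrame({'startpoint': [2.506, 2.809, 2.924],
                    'endpoint':   [2.809, 2.924, 4.065],
                    'Attribute_A': ['B-70', 'B-91', 'B-84']})
df2 = pd.DataFrame({'startpoint': [2.5, 2.6, 2.7, 2.8, 2.9],
                    'endpoint':   [2.6, 2.7, 2.8, 2.9, 3.0],
                    'Attribute_B': [140, 158, 131, 124, 178]})

# Split at every boundary point from either frame.
points = np.unique(np.concatenate([df[c].to_numpy()
                                   for df in (df1, df2)
                                   for c in ('startpoint', 'endpoint')]))
out = pd.DataFrame({'startpoint': points[:-1], 'endpoint': points[1:]})

# Each minimal interval's midpoint lies inside at most one source interval,
# so an IntervalIndex lookup finds the matching row (-1 where uncovered).
mid = (out['startpoint'] + out['endpoint']) / 2
for df, col in [(df1, 'Attribute_A'), (df2, 'Attribute_B')]:
    idx = pd.IntervalIndex.from_arrays(df['startpoint'], df['endpoint'],
                                       closed='left')
    pos = idx.get_indexer(mid)
    vals = df[col].to_numpy()[pos]          # pos == -1 wraps; masked next line
    out[col] = np.where(pos >= 0, vals, np.nan)
out['Line'] = 100
```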
I have an Excel file with 31 tabs, each corresponding to a day in the month of May. Each tab or sheet has 3 columns (Height, Spd, Dir).
I want to find the day that has the maximum wind speeds. I tried using Excel's function =MAX(wind1:wind31!C1:C17) to find it, but it only gave a maximum value. Is there a way to determine the day that has the highest wind speeds of the entire month, not just one max value, seeing as the height plays a role? Do I have to do some statistical juggling (pardon the lingo)?
I have the R software as well as Python but I am mostly a novice.
These are data from 3 of the 31 sheets.
Day 1 Day 2 Day 3 and so on
Height Dir Spd Height Dir Spd Height Dir Spd
139 333 6.5 110 254 3.6 157 341 6.9
790 343 5.9 767 264 4.3 814 357 6.2
1492 343 5.7 1471 274 6.6 1522 0 5.6
3079 297 9.4 3061 284 14.9 3127 317 10.3
4311 293 19 4291 289 21.9 4375 309 14.9
5731 291 28.6 5706 292 30.4 5809 306 19.1
7406 288 38.7 7381 294 42.8 7498 299 22.4
9462 286 47.6 9440 294 56 9550 290 22.5
10694 285 47.9 10679 293 61 10777 288 22.4
12129 281 46.9 12130 296 60.6 12207 292 23.8
13940 279 33.8 13936 296 40.4 13994 282 25.4
16473 279 13.8 16464 282 13.7 16517 286 11.7
18673 278 3 18665 324 2.9 18716 323 2.6
20786 63 2.3 20775 61 2.9 20824 59 4.1
24036 100 6 24015 104 4.4 24072 96 6.9
26676 85 5.5 26656 73 4 26719 83 7.9
31287 103 6.9 31253 102 7.9 31335 101 10.2
If you get your data into a contiguous format like this:
Day Height Dir Spd
1 139 333 6.5
1 790 343 5.9
1 1492 343 5.7
. . . .
. . . .
. . . .
2 110 254 3.6
2 767 264 4.3
. . . .
. . . .
31 26719 83 7.9
31 31335 101 10.2
You can simply use the Excel formula =OFFSET(A1,MATCH(MAX(Spd),Spd,0),0), where cell A1 is the top left of the grid and contains the word Day, and Spd refers to the whole Spd column. OFFSET and MATCH are built-in Excel functions.
Another solution would be to name the ranges of the Spd data in each sheet, say Spd_1, Spd_2,..so on, for each day. The Excel function MAX(INDIRECT("Spd_1")), MAX(INDIRECT("Spd_2")), etc, could then be used on the named ranges represented as strings in a single sheet. You could then use a single max function to find the corresponding day.
If you can load the same data up in R as a data frame, then you can do something like this
subset(df,Spd==max(df[,"Spd"]))$Day where df is the name of the data frame you read in via read.csv, or read.table, or something similar.
Both of the above can be repeated with MIN in place of MAX to find the lowest speed.
If you can't get it into that format, or cannot use Excel's INDIRECT, then the best solution would be to use simple VBA in Excel to loop through the sheets.
In all cases you may have to think about how you will deal with ties - as in 2 or more different days with the same (maximum) speed.
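In pandas, for instance, keeping every tied row is a simple boolean filter rather than a single-index lookup (the column names and numbers here are hypothetical, matching the long format above):

```python
import pandas as pd

# Hypothetical long-format frame like the "Day Height Dir Spd" layout above,
# with the 61.0 maximum deliberately appearing on two different days.
data = pd.DataFrame({'Day': [1, 1, 2, 2, 3],
                     'Spd': [6.5, 47.9, 61.0, 3.6, 61.0]})

# idxmax would return only the *first* maximal row; filtering keeps all ties.
ties = data[data['Spd'] == data['Spd'].max()]
print(ties['Day'].tolist())   # [2, 3] -- both days share the 61.0 maximum
```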
If you can live with R generating unique names for repeated column names, you won't need to munge the day number into the individual column names (that munging is a bit much for this post). Just remove the "Day" header row, leaving the month of columns of readings together like you have above, and save that as a CSV that R can read with read.csv().
This is the R data frame structure from reading in the data snippet above:
dat <- structure(list(Height = c(139L, 790L, 1492L, 3079L, 4311L, 5731L,
7406L, 9462L, 10694L, 12129L, 13940L, 16473L, 18673L, 20786L,
24036L, 26676L, 31287L), Dir = c(333L, 343L, 343L, 297L, 293L,
291L, 288L, 286L, 285L, 281L, 279L, 279L, 278L, 63L, 100L, 85L,
103L), Spd = c(6.5, 5.9, 5.7, 9.4, 19, 28.6, 38.7, 47.6, 47.9,
46.9, 33.8, 13.8, 3, 2.3, 6, 5.5, 6.9), Height.1 = c(110L, 767L,
1471L, 3061L, 4291L, 5706L, 7381L, 9440L, 10679L, 12130L, 13936L,
16464L, 18665L, 20775L, 24015L, 26656L, 31253L), Dir.1 = c(254L,
264L, 274L, 284L, 289L, 292L, 294L, 294L, 293L, 296L, 296L, 282L,
324L, 61L, 104L, 73L, 102L), Spd.1 = c(3.6, 4.3, 6.6, 14.9, 21.9,
30.4, 42.8, 56, 61, 60.6, 40.4, 13.7, 2.9, 2.9, 4.4, 4, 7.9),
Height.2 = c(157L, 814L, 1522L, 3127L, 4375L, 5809L, 7498L,
9550L, 10777L, 12207L, 13994L, 16517L, 18716L, 20824L, 24072L,
26719L, 31335L), Dir.2 = c(341L, 357L, 0L, 317L, 309L, 306L,
299L, 290L, 288L, 292L, 282L, 286L, 323L, 59L, 96L, 83L,
101L), Spd.2 = c(6.9, 6.2, 5.6, 10.3, 14.9, 19.1, 22.4, 22.5,
22.4, 23.8, 25.4, 11.7, 2.6, 4.1, 6.9, 7.9, 10.2)), .Names = c("Height",
"Dir", "Spd", "Height.1", "Dir.1", "Spd.1", "Height.2", "Dir.2",
"Spd.2"), class = "data.frame", row.names = c(NA, -17L))
and, here that is in a slightly better descriptive format:
str(dat)
## 'data.frame': 17 obs. of 9 variables:
## $ Height : int 139 790 1492 3079 4311 5731 7406 9462 10694 12129 ...
## $ Dir : int 333 343 343 297 293 291 288 286 285 281 ...
## $ Spd : num 6.5 5.9 5.7 9.4 19 28.6 38.7 47.6 47.9 46.9 ...
## $ Height.1: int 110 767 1471 3061 4291 5706 7381 9440 10679 12130 ...
## $ Dir.1 : int 254 264 274 284 289 292 294 294 293 296 ...
## $ Spd.1 : num 3.6 4.3 6.6 14.9 21.9 30.4 42.8 56 61 60.6 ...
## $ Height.2: int 157 814 1522 3127 4375 5809 7498 9550 10777 12207 ...
## $ Dir.2 : int 341 357 0 317 309 306 299 290 288 292 ...
## $ Spd.2 : num 6.9 6.2 5.6 10.3 14.9 19.1 22.4 22.5 22.4 23.8 ...
To get the column name of the max speed value for the whole data frame, we'll need to first just work on the "Spd" columns:
# only work with "Spd" columns
tmp <- dat[,which(grepl("Spd", names(dat)))]
# showing what we have left
str(tmp)
## 'data.frame': 17 obs. of 3 variables:
## $ Spd : num 6.5 5.9 5.7 9.4 19 28.6 38.7 47.6 47.9 46.9 ...
## $ Spd.1: num 3.6 4.3 6.6 14.9 21.9 30.4 42.8 56 61 60.6 ...
## $ Spd.2: num 6.9 6.2 5.6 10.3 14.9 19.1 22.4 22.5 22.4 23.8 ...
Then get the max value for each column:
# get max value in each "Spd" column
apply(tmp, 2, max)
## Spd Spd.1 Spd.2
## 47.9 61.0 25.4
But we really just want the column with the overall max value, so we'll feed that apply into which.max:
# which one of those has the max value (returns name & position)
which.max(apply(tmp, 2, max))
## Spd.1
## 2
And are left with the column name/# with the max value.
All of that can be done on one horribly, unreadable line:
which.max(apply(dat[, which(grepl("Spd", names(dat)))], 2, max))
which I'm only including to show it's not as complex of an operation as the explanation might make it seem like it could be.
Python and the pandas module is one possible solution:
#! /usr/bin/env python
import pandas as pd
# Export the tabs as csv-files: day1.csv, day2.csv, ..., day31.csv.
# Assume the first line is a header line and that columns are
# separated by ',':
#
# Height , Dir , Spd
# 139 , 333 , 6.5
# 790 , 343 , 5.9
# ...
#
# Use our own column names and skip the header.
column_names = ['height', 'direction', 'speed']
# Read in the data for each day.
alldays = []
for d in range(1, 32):
fname = "day{}.csv".format(d)
frame = pd.read_csv(fname, names=column_names, header=0)
frame['day'] = d
alldays.append(frame)
# Concatenate all days into DataFrame.
data = pd.concat(alldays, ignore_index=True)
# Get index for max and use it to retrieve the day and the speed.
idx_max = data.speed.idxmax()
max_row = data.loc[idx_max]
print("Maximum wind speed {} on day {}".format(max_row.speed, int(max_row.day)))
# Same as above but for the minimum.
idx_min = data.speed.idxmin()
min_row = data.loc[idx_min]
print("Minimum wind speed {} on day {}".format(min_row.speed, int(min_row.day)))
Save this as script highlow.py. Using ipython and the example data provided I get the following:
>>> run highlow
Maximum wind speed 61.0 on day 2
Minimum wind speed 2.3 on day 1
>>> data.speed.describe()
count 51.000000
mean 18.209804
std 16.784853
min 2.300000
25% 5.800000
50% 10.300000
75% 24.600000
max 61.000000
dtype: float64
>>>
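As a variation on the script above, a groupby summary also gives the winning day directly (toy numbers below, not the real sheets):

```python
import pandas as pd

# Hypothetical long-format frame like the one built by the script above.
data = pd.DataFrame({'day': [1, 1, 2, 2, 3, 3],
                     'speed': [6.5, 47.9, 61.0, 3.6, 22.4, 25.4]})

# Maximum speed recorded on each day, then the day holding the overall max.
per_day = data.groupby('day')['speed'].max()
print(per_day.idxmax(), per_day.max())   # 2 61.0
```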