Loop through grouped data - Python/Pandas

I'm trying to perform an action on grouped data in Pandas. For each group, I want to loop through the rows and compare them to the first row in the group. If conditions are met, then I want to print out the row details. My data looks like this:
Orig Dest Route Vol Per VolPct
ORD ICN A 2,251 0.64 0.78
ORD ICN B 366 0.97 0.13
ORD ICN C 142 0.14 0.05
DCA FRA A 9,059 0.71 0.85
DCA FRA B 1,348 0.92 0.13
DCA FRA C 281 0.8 0.03
My groups are Orig, Dest pairs. If a row in the group other than the first row has a Per greater than the first row and a VolPct greater than .1, I want to output the grouped pair and the route. In this example, the output would be:
ORD ICN B
DCA FRA B
My attempted code is as follows:
for lane in otp.groupby(otp['Orig','Dest']):
    X = lane.first(['Per'])
    for row in lane:
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row['Orig','Dest','Route'])
However, this isn't working so I'm obviously not doing something right. I'm also not sure how to tell Python to ignore the first row when in the "row in lane" loop. Any ideas? Thanks!

You are pretty close as it is.
First, you are calling groupby incorrectly. You should just pass a list of the column names instead of a DataFrame object. So, instead of otp.groupby(otp['Orig','Dest']) you should use otp.groupby(['Orig','Dest']).
Once you are looping through the groups you will hit more issues. Each item you get when iterating over a groupby object is actually a tuple: the first element is the grouping key and the second is the grouped data. For example, your first group would be the following tuple:
(('DCA', 'FRA'), Orig Dest Route Vol Per VolPct
3 DCA FRA A 9,059 0.71 0.85
4 DCA FRA B 1,348 0.92 0.13
5 DCA FRA C 281 0.80 0.03)
You will need to change the way you set X to reflect this. For example, if lane is that tuple, X = lane.first(['Per']) would become X = lane[1].iloc[0].Per (or simply lane.iloc[0].Per if you unpack the tuple as in the loop below). After that you only have minor errors in the way you iterate through the rows and access multiple columns in a row. To wrap it all up, your loop should look something like this:
for key, lane in otp.groupby(['Orig','Dest']):
    X = lane.iloc[0].Per
    for idx, row in lane.iterrows():
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row[['Orig','Dest','Route']])
Note that I use iterrows to iterate through the rows, and I use double brackets when accessing multiple columns in a DataFrame.
You don't really need to tell pandas to ignore the first row in each group as it should never trigger your if statement, but if you did want to skip it you could use lane[1:].iterrows().
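As an aside, if you would rather avoid the explicit loops entirely, here is a vectorized sketch using groupby.transform. The sample data is hard-coded for the sketch and the Vol figures are assumed to have been parsed as numbers:
import pandas as pd

# Sample data from the question, hard-coded for the sketch (Vol assumed numeric).
otp = pd.DataFrame({
    'Orig':   ['ORD', 'ORD', 'ORD', 'DCA', 'DCA', 'DCA'],
    'Dest':   ['ICN', 'ICN', 'ICN', 'FRA', 'FRA', 'FRA'],
    'Route':  ['A', 'B', 'C', 'A', 'B', 'C'],
    'Vol':    [2251, 366, 142, 9059, 1348, 281],
    'Per':    [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    'VolPct': [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

# For every row, look up the Per of the first row in its (Orig, Dest) group,
# then keep rows that beat it and also have VolPct > 0.1.
first_per = otp.groupby(['Orig', 'Dest'])['Per'].transform('first')
mask = (otp['Per'] > first_per) & (otp['VolPct'] > .1)
print(otp.loc[mask, ['Orig', 'Dest', 'Route']])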

Related

Making arrays of columns (or rows) of a (space-delimited) textfile in Python

I have seen similar questions, but the answers always give strings of rows. I want to make arrays of the columns of a text file. I have a text file like this (the actual file has 106 rows and 19 columns):
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001
I expect to get arrays of columns (either a 2D array of all columns or a 1D array for each column); the first row only contains names, so that should become a list. I would like to plot them later.
The desired result would be for example for a column:
array([0.04,
       0.31,
       0.05,
       0.28], dtype=float32)
and for the first row:
species = ['O2', 'CO2', 'NOx', 'Ash', 'Other']
I'd recommend not manually looping over values in large data sets (in this case a sort of tab-separated relational model). Just use the methods of a safe and well-known library like NumPy:
import numpy as np
data = np.transpose(np.loadtxt("/path/to/file.txt", skiprows=1, delimiter="\t"))
With the inner loadtxt you read your file, and the skiprows=1 parameter skips the first row (the column names) to avoid incompatible data types and unnecessary conversions. If you need that row in the same structure, just insert a new row at index 0. Then you need to transpose the matrix, for which there is a safe method in NumPy as well. I used the output of loadtxt (a 2D array with one row per line of the file) directly as the input of transpose to give a one-liner, but it's better to use them separately in order to avoid "train wrecks", to be able to see what happens in between, and to correct any unwanted results.
PS: the delimiter parameter must be adjusted to match the one in the original file. Check the loadtxt documentation for more info. I considered it to be a TAB. #KostasCharitidis - thanks for your note
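Putting it together with the species names, a minimal sketch (assuming a whitespace-delimited file at /path/to/file.txt):
import numpy as np

# read the header separately to get the species names
with open("/path/to/file.txt") as f:
    species = f.readline().split()   # ['O2', 'CO2', 'NOx', 'Ash', 'Other']

# the default delimiter (None) splits on any whitespace; adjust if your file uses tabs
data = np.loadtxt("/path/to/file.txt", skiprows=1)

columns = data.T                     # 2D array: one row per original column
nox = columns[2].astype(np.float32)  # e.g. the NOx column as float32, ready for plotting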
UPDATE3
st = open('file.txt', 'r').read()
dct = []
species = []
for row in st.split('\n')[0].split(' '):
    species.append(row)
for no, row in enumerate(st.split('\n')[1:]):
    dct.append([])
    for elem in row.split(' '):
        dct[no].append([float(elem)])
print(species)
print(dct)
RESULT
['O2', 'CO2', 'NOx', 'Ash', 'Other']
[[[20.9], [1.6], [0.04], [0.0002], [0.0]], [[22.0], [2.3], [0.31], [0.0005], [0.0]], [[19.86], [2.1], [0.05], [0.0002], [0.0]], [[17.06], [3.01], [0.28], [0.006], [0.001]]]
file.txt
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001

Using python print max and min values and date associated with the max and min values

I am new to programming and am trying to write a program that finds and prints the max AVE_SPEED value and the date associated with that value from a csv file.
This would be an example of the file data set:
STATION DATE AVE_SPEED
0 US68 2018-03-22 0.00
1 US68 2018-03-23 0.00
2 US68 2018-03-24 0.00
3 US68 2018-03-26 0.24
4 US68 2018-03-27 2.28
5 US68 2018-03-28 0.21
6 US10 2018-03-29 0.04
7 US10 2018-03-30 0.00
8 US10 2018-03-31 0.00
9 US10 2018-04-01 0.00
10 US10 2018-04-02 0.02
This is what I have come up with so far but it just prints the entire set at the end.
import pandas as pd
df = pd.read_csv (r'data_01.csv')
max1 = df['AVE_SPEED'].max()
print ('Max Speed in MPH: ' + str(max1))
groupby_max1 = df.groupby(['DATE']).max()
print ('Maximum Average Speed Value and Date of Occurance: ' + str(groupby_max1))
Your initial average speed max is correct in pandas.
To find the corresponding date, I would do the following:
import pandas as pd
df = pd.read_csv(r'data_01.csv')
max1 = df['AVE_SPEED'].max()
print('Max Speed in MPH: ' + str(max1))
date_of_max = df[df['AVE_SPEED'] == max1]['DATE'].values[0]
Effectively, you're creating another dataframe where every "AVE_SPEED" must equal the max speed (it should be a single row unless there are multiple instances of the same max speed). From there, you return the 'DATE' value of that dataframe/row.
You can then print/return the max velocity and corresponding date as needed.
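An equivalent sketch using idxmax (column names assumed from the sample data) avoids building the extra comparison dataframe:
import pandas as pd

df = pd.read_csv(r'data_01.csv')
row_of_max = df.loc[df['AVE_SPEED'].idxmax()]   # the row holding the maximum average speed
print('Max Speed in MPH: ' + str(row_of_max['AVE_SPEED']))
print('Date of Occurrence: ' + str(row_of_max['DATE']))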
I would like to suggest a non-pandas approach to this as a lot of new programmers focus on learning pandas instead of learning python -- especially here it might be easier to understand what plain python is doing instead of using a dataframe:
with open('data_01.csv') as f:
    data = f.readlines()[1:]  # ditch the header
data = [x.split() for x in data]  # turn each line into a list of its values
data.sort(key=lambda x: -float(x[-1]))  # sort by the last item in each list (the speed), descending
print(data[0][2])  # print the date (index 2) from the first item in your sorted data
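If you only need the single largest value, the built-in max() with a key does the same job without sorting the whole list (same assumptions about the file layout as above):
with open('data_01.csv') as f:
    rows = [line.split() for line in f.readlines()[1:]]  # skip the header

fastest = max(rows, key=lambda x: float(x[-1]))  # row with the largest AVE_SPEED
print(fastest[2], fastest[-1])                   # its date (index 2) and speed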

Create a row in pandas dataframe

I am trying to create a row in my existing pandas dataframe, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric", which is the sum of the "LE_St" values for "Rating" >= 4 and < 6 divided by the "LE_St" value for "All", i.e. Metric = (0.05+1.77)/10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values correlating with the columns in a manner that makes sense, not arbitrary information. The power of pandas and Python lies in holding and manipulating data. You can easily compute a value from a column, or even all columns, and store it in a "summary"-like dataframe or in separate variables. That might help you with this as well.
For computations on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by values in the "Rating" column.
For ad-hoc computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from; here it is the "All" row of LE_St
sliced_df = df[df['Rating'].between(4, 6, inclusive=True)]  # newer pandas versions use inclusive='both'
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere however you like
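If you also want the result appended as the "Metric" row from your desired output, a minimal sketch (assuming "Rating" holds numbers plus the string 'All', and the dataframe is called df) could be:
import pandas as pd

rating = pd.to_numeric(df['Rating'], errors='coerce')        # 'All' becomes NaN
total = df.loc[df['Rating'] == 'All', 'LE_St'].iloc[0]
metric = df.loc[rating.between(4, 6), 'LE_St'].sum() / total

metric_row = pd.DataFrame({'Rating': ['Metric'], 'LE_St': [metric]})
df = pd.concat([df, metric_row], ignore_index=True)          # '% Total' is left empty (NaN)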

Expanding pandas data frame with date range in columns

I have a pandas dataframe with dates and strings similar to this:
Start End Note Item
2016-10-22 2016-11-05 Z A
2017-02-11 2017-02-25 W B
I need to expand/transform it to the below, filling in weeks (W-SAT) in between the Start and End columns and forward filling the data in Note and Items:
Start Note Item
2016-10-22 Z A
2016-10-29 Z A
2016-11-05 Z A
2017-02-11 W B
2017-02-18 W B
2017-02-25 W B
What's the best way to do this with pandas? Some sort of multi-index apply?
You can iterate over each row and create a new dataframe and then concatenate them together
pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
Start Note Item
0 2016-10-22 Z A
1 2016-10-29 Z A
2 2016-11-05 Z A
3 2017-02-11 W B
4 2017-02-18 W B
5 2017-02-25 W B
You don't need iteration at all.
df_start_end = df.melt(id_vars=['Note','Item'],value_name='date')
df = df_start_end.groupby('Note').apply(lambda x: x.set_index('date').resample('W').pad()).drop(columns=['Note','variable']).reset_index()
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
import numpy as np
import pandas as pd

def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)

    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]

    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())

    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])

    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')

    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']

    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)

    # don't modify dataframe in place:
    del dataframe['_dt_diff']

    return data_expanded
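A quick usage sketch on the two-row example from the question (time_unit='W' here means the function steps in 7-day increments from each Start date):
import pandas as pd

df = pd.DataFrame({
    'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
    'End':   pd.to_datetime(['2016-11-05', '2017-02-25']),
    'Note':  ['Z', 'W'],
    'Item':  ['A', 'B'],
})

expanded = date_expander(df, start_dt_colname='Start', end_dt_colname='End',
                         time_unit='W', new_colname='Start', end_inclusive=True)
print(expanded.sort_values('Start'))  # one row per week from Start to End, inclusive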
So I recently spent a bit of time trying to figure out an efficient pandas-based approach to this issue (which is very trivial with data.table in R) and wanted to share the approach I came up with here:
df.set_index("Note").apply(
lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1
).explode()
Note: using .values makes a big difference in performance!
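If you want the result back as a three-column dataframe like the desired output, one possible follow-up sketch (keeping Item in the index so it survives the explode) is:
expanded = (
    df.set_index(["Note", "Item"])
      .apply(lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1)
      .explode()
      .rename("Start")
      .reset_index()[["Start", "Note", "Item"]]
)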
There are quite a few solutions here already and I wanted to compare the speed for different numbers of rows and periods - see results (in seconds) below:
n_rows is the number of initial rows and n_periods is the number of periods per row, i.e. the window size; the combinations below always result in 1 million rows when expanded.
The other columns are named after the posters of the solutions.
Note that I made a slight tweak to Gen's approach whereby, after pd.melt(), I do df.set_index("date").groupby("Note").resample("W-SAT").ffill() - I labelled this Gen2 and it seems to perform slightly better and gives the same result.
Each n_rows, n_periods combination was run 10 times and the results were then averaged.
Anyway, jwdink's solution looks like the winner when there are many rows and few periods, whereas my solution seems to do better on the other end of the spectrum, though only marginally ahead of the others as the number of rows decreases:
n_rows  n_periods  jwdink  TedPetrou    Gen  Gen2  robbie
   250       4000    6.63       0.33   0.64  0.45    0.28
   500       2000    3.21       0.65   1.18  0.81    0.34
  1000       1000    1.57       1.28   2.30  1.60    0.48
  2000        500    0.83       2.57   4.68  3.24    0.71
  5000        200    0.40       6.10  13.26  9.59    1.43
If you want to run your own tests on this, my code is available in my GitHub repo - note I created a DateExpander class object that wraps all the functions to make it easier to scale the simulation.
Also, for reference, I used a 2-core STANDARD_DS11_V2 Azure VM - only for about 10 minutes, so this is literally me giving my 2 cents on the issue!

Working with samples, applying a function to a large subset of columns

I have data that consist of 1,000 samples from a distribution of a rate for several different countries stored in a pandas DataFrame:
s1 s2 ... s1000 pop
region country
NA USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
LA MEX ...
I need to multiply each sample by the population. To accomplish this, I currently have:
for column in data.filter(regex='sample'):
    data[column] = data[column]*data['pop']
While this works, iterating over columns feels like it goes against the spirit of python and numpy. Is there a more natural way I'm not seeing? I would normally use apply, but I don't know how to use apply and still get the unique population value for each row.
More context: The reason I need to do this multiplication is because I want to aggregate the data by region, collapsing USA and CAN into North America, for example. However, because my data are rates, I cannot simply add- I must multiply by population to turn them into counts.
I might do something like
>>> df
s1 s2 s1000 pop
region country
NaN USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
[2 rows x 4 columns]
>>> df.iloc[:,:-1] = df.iloc[:, :-1].mul(df["pop"], axis=0)
>>> df
s1 s2 s1000 pop
region country
NaN USA 75.0 81.0 69.00 300
CAN 5.6 4.9 4.55 35
[2 rows x 4 columns]
where instead of iloc-ing every column except the last you could use any other loc-based filter.
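For example, a regex-based filter does the same multiplication without relying on column positions (a sketch, assuming the sample columns are literally named s1 ... s1000 and the population column is 'pop'):
sample_cols = df.filter(regex=r'^s\d+$').columns
df[sample_cols] = df[sample_cols].mul(df['pop'], axis=0)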
