I have a data frame containing the number of tickets sold in different price buckets for each flight.
For each record/row, I want to use the value in one column as an index in the iloc function, to sum up values across a specific number of columns.
For example, for each row I want to sum the values from column index 5 up to the value in ['iloc_value'].
I tried df.iloc[:, 5:df['iloc_value']].sum(axis=1) but it did not work, apparently because an iloc slice bound must be a single integer, not a Series.
sample data:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1
1  1  3  4  2           2
2  4  6  3  2           1
For each row, I want to sum a number of columns based on the value in ['iloc_value']. For example:
for row 0, I want the total to be 1+2
for row 1, I want the total to be 1+3+4
for row 2, I want the total to be 4+6
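For reference, a minimal sketch that rebuilds the sample frame above (just the data shown, so the snippets below are runnable):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6], 'C': [3, 4, 3],
                   'D': [2, 2, 2], 'iloc_value': [1, 2, 1]})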
EDIT:
I quickly got the results this way:
First define a function that can do it for one row:
def sum_till_iloc_value(row):
    # sum the first iloc_value + 1 entries of the row (columns A onward)
    return sum(row[:row['iloc_value'] + 1])
Then apply it to all rows to generate your output:
df_flights['sum'] = df_flights.apply(sum_till_iloc_value, axis=1)
   A  B  C  D  iloc_value  sum
0  1  2  3  2           1    3
1  1  3  4  2           2    8
2  4  6  3  2           1   10
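If apply turns out to be slow on a large flight table, a cumulative-sum lookup avoids the Python-level loop entirely. A minimal sketch, assuming the summable columns are the contiguous block A through D:
import numpy as np

vals = df[['A', 'B', 'C', 'D']].to_numpy()
csum = vals.cumsum(axis=1)  # running totals along each row
# for each row, pick the running total at position iloc_value
df['sum'] = csum[np.arange(len(df)), df['iloc_value'].to_numpy()]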
PREVIOUSLY:
Assuming you have information that looks like:
df_flights = pd.DataFrame({'flight':['f1', 'f2', 'f3'], 'business':[2,3,4], 'economy':[6,7,8]})
df_flights
flight business economy
0 f1 2 6
1 f2 3 7
2 f3 4 8
you can sum the columns you want as below:
df_flights['seat_count'] = df_flights['business'] + df_flights['economy']
This will create a new column that you can later select:
df_flights[['flight', 'seat_count']]
flight seat_count
0 f1 8
1 f2 10
2 f3 12
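With many price buckets, listing every column by name gets unwieldy; summing over a column slice does the same job (a sketch, assuming the bucket columns sit in a contiguous block):
df_flights['seat_count'] = df_flights.iloc[:, 1:3].sum(axis=1)  # business through economy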
Here's a fully vectorized way: melt the dataframe, sum only the relevant columns, and bring the total back into the dataframe:
# map each summable column name to its positional index (everything but iloc_value)
d = {col: i for i, col in enumerate(df.columns[:-1])}
temp_df = df.rename(columns=d)
temp_df = temp_df.reset_index().melt(id_vars=["index", "iloc_value"])
# keep only the cells whose column position falls within each row's iloc_value
temp_df = temp_df[temp_df.variable <= temp_df.iloc_value]
df["total"] = temp_df.groupby("index").value.sum()
The output is:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1      3
1  1  3  4  2           2      8
2  4  6  3  2           1     10
I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the remaining values are incorporated into the last group, so in this example the final group has n=4.
Thanks in advance!
I do not know a straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you can use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# group rows in consecutive chunks of n and broadcast each chunk's min
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of them being grouped with the 3 previous values in this case.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Temporary dataframe holding each group's min, broadcast to the rows
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group is not of size n, merge it with the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
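A more compact route to the same output (a sketch, assuming len(df) >= n) is to cap the group label so that any remainder rows fold into the last complete group:
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
n_groups = len(df) // n                           # number of complete groups
labels = np.minimum(df.index // n, n_groups - 1)  # cap the last label
df['new_col'] = df.groupby(labels)['col1'].transform('min')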
Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
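An equivalent route without value_counts (a sketch on the same data) counts the (A, B) pairs with GroupBy.size, sorts, and keeps the top row per group; on ties it keeps the smaller B, matching the output above:
df = (df_test.groupby(['A', 'B']).size()
             .rename('freq').reset_index()
             .sort_values(['A', 'freq'], ascending=[True, False])
             .drop_duplicates('A')
             .rename(columns={'B': 'most_freq'}))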
If I have a following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold value for each column. For each column, I want to count how many values are below its threshold, using pandas.
Desired Output is;
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one tied to these specific columns, because I have a large dataset and cannot list every column name in the code.
Could you please help me with this?
Select all rows except the last by indexing, compare with the last row using DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose with DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
I have the following df:
df = pd.DataFrame({'ID1':[1,2,3,4,5,6],'ID2':[2,6,6,2,1,2],'AREA':[1,1,1,1,1,1]})
...
ID1 ID2 AREA
0 1 2 1
1 2 6 1
2 3 6 1
3 4 2 1
4 5 1 1
5 6 2 1
I accumulate the AREA column as so:
for id_ in df.ID1:
    id1_filter = df.ID1 == id_
    id2_filter = (df.ID1 == id_) | (df.ID2 == id_)
    df.loc[id1_filter, 'AREA'] = df.loc[id2_filter].AREA.sum()
print(df)
...
ID1 ID2 AREA
0 1 2 2
1 2 6 5
2 3 6 1
3 4 2 1
4 5 1 1
5 6 2 7
For each id_ in ID1, AREA is summed over the rows where ID1 == id_ or ID2 == id_,
and the loop is always run with df sorted on ID1.
The real dataframe I'm working on though is 150,000 records, each row belonging to a unique ID1.
Running the above on this dataframe takes 2.5 hours. Since this operation will take place repeatedly
for the foreseeable future, I decided to store the indices of the True values in id1_filter and id2_filter
in a DB with the following schema.
Table ID1:
ID_,INDEX_
1 , 0
2 , 1
etc., etc.
Table ID2:
ID_,INDEX_
1 , 0
1 , 4
2 , 0
2 , 1
2 , 3
2 , 5
etc., etc.
The next time I run the accumulation on the AREA column (now filled with different AREA values),
I read in the SQL tables and convert them to dicts. I then use these dicts
to grab the records I need during the summing loop.
id1_dict = pd.read_sql('select * from ID1',db_engine).groupby('ID_').INDEX_.unique().to_dict()
id2_dict = pd.read_sql('select * from ID2',db_engine).groupby('ID_').INDEX_.unique().to_dict()
# print the indices for id1_filter and id2_filter for id 1
print(id1_dict[1])
print(id2_dict[1])
...
[0]
[0, 4]
for id_ in df.ID1:
    df.loc[id1_dict[id_], 'AREA'] = df.loc[id2_dict[id_]].AREA.sum()
When run this way it only takes 6 minutes!
My question: is there a better/standard way to handle this scenario, i.e. storing dataframe selections for
later use? Side note: I have set an index on the SQL tables' ID columns and tried getting the
indices by querying the table for each id; it works well, but still takes a little longer than the above (9 mins).
One way to do it is like this:
df = df.set_index('ID1')
for row in df.join(df.groupby('ID2')['AREA'].apply(lambda x: x.index.tolist()), rsuffix='_').dropna().itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[3], 'AREA'].sum()
df = df.reset_index()
and you get the expected result:
ID1 ID2 AREA
0 1 2 2
1 2 6 5
2 3 6 1
3 4 2 1
4 5 1 1
5 6 2 7
Now on a bigger df like:
df = pd.DataFrame({'ID1': range(1, 1501),
                   'ID2': np.random.randint(1, 1501, (1500,)),
                   'AREA': [1] * 1500},
                  columns=['ID1', 'ID2', 'AREA'])
The method presented here runs in about 0.76 s on my computer, while your first version takes 6.5 s.
Ultimately, you could create a df_list such as:
df_list = (df.set_index('ID1')
             .join(df.set_index('ID1').groupby('ID2')['AREA']
                     .apply(lambda x: x.index.tolist()), rsuffix='_ID2')
             .dropna().drop(columns=['AREA', 'ID2']))
to keep the information linking ID1 and ID2: for example, the row for ID1 = 2 holds [1, 4, 6], the ID1 labels of the rows whose ID2 equals 2
AREA_ID2
ID1
1 [5]
2 [1, 4, 6]
6 [2, 3]
and then, on later runs, you can skip re-creating df_list and use it directly, with a small difference in the code:
df = df.set_index('ID1')
for row in df_list.itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[1], 'AREA'].sum()
df = df.reset_index()
Hope it's faster
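Since the question was about storing selections for later use: df_list is an ordinary dataframe, so instead of SQL tables it could also be persisted with pandas' pickle round-trip (a sketch; 'id_links.pkl' is a hypothetical file name):
df_list.to_pickle('id_links.pkl')         # save once, after building it
df_list = pd.read_pickle('id_links.pkl')  # reload on later runs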
I have multi-indexed survey data in a pandas dataframe, in a list format. One subset of it is given below. I am trying to sort the columns first in ascending order and then take the count of each value for each rank.
rank these items on a scale of 1-5    rank these items on a scale of 1-3
  A    B    C                           X    Y    Z
  1    2    1
  1    2    1                           2    3    2
  2    1    3                           1    2    3
  3    1    1                           1    1    2
  5    3    2                           2    1    3
  1    1    2                           3    1    1
  NA   5    3                           3    2    NA
  4    NA   NA                          1    3    1
  1    4    4                           2    NA   2
  3    3    5
I used the following code to drop NA from the data frame and take the count of values, but I am not able to sort the values in ascending order before doing the count.
def find_column_with_max_length(df):
    n = ''
    count = 0
    for name, values in df.iteritems():
        df[name] = pd.to_numeric(df[name], errors='coerce')
        c = len(df[name].value_counts().values.tolist())
        if c > count:
            count = c
            n = name
    return n

df1 = pd.DataFrame()
df1 = df1.sort_values('n', ascending=True)
n = find_column_with_max_length(df)
df1[n] = pd.DataFrame(df[n].value_counts().values.tolist(), columns=[n])
for name, values in df.iteritems():
    if name != n:
        df1[name] = pd.DataFrame(df[name].value_counts().values.tolist(), columns=[name])
Any help in this regard would be greatly appreciated.
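A hedged sketch for the counting part, assuming df holds the survey columns shown above: applying pd.Series.value_counts column-wise gives one count column per question, and sort_index puts the rank values in ascending order:
counts = (df.apply(pd.to_numeric, errors='coerce')
            .apply(pd.Series.value_counts)
            .sort_index())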