Trying to code a Python equivalent of the SUMIFS feature in Excel

I am trying to rewrite a .xlsx file from scratch using Python. The Excel sheet has 99 rows and 11 columns. I have already generated the first 8 columns for all 99 rows and am currently working on generating the 9th column.
This 9th column is calculated by a SUMIFS formula in Excel that takes into account columns 2, 4 and 7.
Col. 2 has numerical int values.
Col. 4 has three-letter airport codes like NYC for New York City.
Col. 7 also has three-letter airport codes like DEL for Delhi.
The SUMIFS formula for the cells in column 9 is:
SUMIFS(B:B, D:D, D2, G:G, G2)
Hence it sums the numerical values in column 2 for the corresponding pair of cities in col. 4 and col. 7. If a pair of cities in col. 4 and col. 7 occurs only once, there is nothing to sum, and the cell in col. 9 equals the int value of the cell in col. 2.
However, if there are multiple occurrences of the same pair of cities in col. 4 and col. 7, then the corresponding values in col. 2 are summed, and that sum becomes the value of the cell in col. 9.
Example:
In this example, col. 2 is Sales, col. 4 is Origin City, col. 7 is Destination City and col. 9 is the Result, which uses =SUMIFS(B:B,C:C,C2,D:D,D2)
I am trying to calculate column 9 using Python on the large data set that I have. So far, I have been able to create a list of dictionaries, where the key is origin_city-destination_city and the value is the integer value of col. 2. The list of dicts has 99 entries like the Excel file, so each row of the Excel file is represented as a dict. Printing the list gives something like this:
{'YTO-YVR': 570}
{'YVR-YTO': 542}
{'YTO-YYC': 420}
{'YYT-YTO': 32}
{'YWG-YYC': 115}
I have been wondering whether it is possible to loop over the list of dicts and create a SUMIFS version of it, resulting in 99 dicts in the list, with each dict holding the SUMIFS value. After this I have to write all these values to the column in the Excel file.
I hope someone here can help! Thank you very much in advance :)

You can use pandas' groupby with transform:
import pandas as pd

df = pd.DataFrame({'Sales': [100, 110, 200, 300, 150, 200, 100],
                   'Origin': ['YYZ', 'YEA', 'CDG', 'YYZ', 'YEA', 'YVR', 'YEA'],
                   'Dest': ['DEL', 'NYC', 'YUL', 'DEL', 'YTO', 'HKG', 'NYC']})

# sum Sales within each (Origin, Dest) pair and broadcast the total back to every row
df['Result'] = df.groupby(['Origin', 'Dest'])['Sales'].transform('sum')
Result:
Sales Origin Dest Result
0 100 YYZ DEL 400
1 110 YEA NYC 210
2 200 CDG YUL 200
3 300 YYZ DEL 400
4 150 YEA YTO 150
5 200 YVR HKG 200
6 100 YEA NYC 210
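If you prefer to stay with the plain-Python list-of-dicts approach from the question, the same SUMIFS logic is a two-pass aggregation, and the result can then be written into the 9th column of the workbook with openpyxl. This is only a sketch: the list name rows, the 'ORIG-DEST' key format, the file name data.xlsx and the assumption that row 1 holds headers are all placeholders, not taken from the original code.

from openpyxl import load_workbook

# rows is assumed to be the existing list of one-entry dicts,
# e.g. [{'YTO-YVR': 570}, {'YVR-YTO': 542}, ...]
totals = {}
for entry in rows:
    for pair, sales in entry.items():
        totals[pair] = totals.get(pair, 0) + sales      # pass 1: SUM per city pair

sumifs_column = [totals[pair] for entry in rows for pair in entry]  # pass 2: per-row lookup

wb = load_workbook('data.xlsx')                         # assumed file name
ws = wb.active
for offset, value in enumerate(sumifs_column):
    ws.cell(row=offset + 2, column=9, value=value)      # assumes row 1 holds headers
wb.save('data.xlsx')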

Related

Pandas check for next encounter of the same value and save values

I have a DataFrame with the following structure:
row Date Description Amount
1 23/11/2022 KLARNA 5.3
2 23/11/2022 ALDI 10.4
3 23/11/2022 LIDL 11.5
4 24/11/2022 Repayment of amount -5.3
5 24/11/2022 Repayment of amount -11.5
6 25/11/2022 Amazon 105.0
7 25/11/2022 Amazon 210.0
8 27/11/2022 Repayment of amount -315.0
9 28/11/2022 Aldi 55.43
10 29/11/2022 Zalando 5.3
11 29/11/2022 ebay 5.3
12 30/11/2022 Repayment of amount -60.73
I'm looking for a solution to even out rows against their matching repayment. My question is whether this can be done easily with pandas, something like finding the matching value and filtering both rows out into a dict.
I was previously trying something starting with
final_dict = {}
for i, row in df.iterrows():
    # does the negated amount appear in the rows that follow?
    if df[i:20]["Amount"].isin([row["Amount"] * -1]).any():
        print("yes contains it")
        # row_found = row from the if statement
        final_dict.update({str(i): (row, row_found)})
        # df.drop(i)
        # df.drop(row_found[index])
But I can't get the index of the sub-DataFrame to even it out. Also, I've read that it's not best practice to iterate over a pandas DataFrame and that the built-in features should be used instead. However, I can't find a function that fits my needs.
Also, only the next value should be evened out, as future values might not have been paid yet. For example, row 1 matches with row 4, but rows 10 and 11 should stay until a match is found later. This is the case for rows 9 and 10, as their values match with row 12. Then rows 9, 10 and 12 should get dropped from the DataFrame and get a dictionary entry, and so on.
The most efficient way I can think of at the moment would be to iterate over every single row and compare it with the next row(s). But the more unmatched rows accumulate, the more possible combinations of values that could add up to a Repayment appear (there could be tens of value rows without a single Repayment row).
After the function, the DataFrame should look like
row Date Description Amount
2 23/11/2022 ALDI 10.4
11 29/11/2022 ebay 5.3
with the dict like
final_dict = {
    "1": (row 1 DataFrame, row 4 DataFrame),
    "2": (row 3 DataFrame, row 5 DataFrame),
    "3": ([row 6 DataFrame, row 7 DataFrame], row 8 DataFrame),
    "4": ([row 9 DataFrame, row 10 DataFrame], row 12 DataFrame)
}
Normally I would use an ID as the key of the dict entry, but the data set doesn't provide one and I'm not sure whether it would be best practice to generate them manually. If it isn't, a list of tuples (possibly containing lists) would also be a solution.
final_list = [
    (row 1 DataFrame, row 4 DataFrame),
    (row 3 DataFrame, row 5 DataFrame),
    ([row 6 DataFrame, row 7 DataFrame], row 8 DataFrame),
    ([row 9 DataFrame, row 10 DataFrame], row 12 DataFrame)
]
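One possible way to implement the matching described above is a greedy search over the still-open rows: collect the non-repayment rows as they come, and for every repayment try the smallest, earliest combination of open rows whose amounts add up to it. The sketch below assumes df is the DataFrame shown above with a 1-based index, positive amounts for charges and negative amounts for repayments; it brute-forces combinations with itertools, so it only suits small batches.

import itertools

matches = []     # (charge rows, repayment row) pairs, like the final_dict values above
open_rows = []   # indices of charges that have not been matched yet

for idx, row in df.iterrows():
    if row["Amount"] >= 0:
        open_rows.append(idx)
        continue
    target = -row["Amount"]
    found = None
    # smallest combinations first, earliest rows first
    for size in range(1, len(open_rows) + 1):
        for combo in itertools.combinations(open_rows, size):
            if abs(df.loc[list(combo), "Amount"].sum() - target) < 0.005:
                found = list(combo)
                break
        if found:
            break
    if found:
        matches.append((df.loc[found], df.loc[[idx]]))
        open_rows = [i for i in open_rows if i not in found]

matched = [i for charges, repayment in matches
           for i in list(charges.index) + list(repayment.index)]
remaining = df.drop(matched)

For the sample data this reproduces the pairing described above: rows 1 and 4, 3 and 5, (6, 7) and 8, (9, 10) and 12, leaving rows 2 and 11 in remaining.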

Iterate over certain columns with unique values and generate plots python

New to pandas and much help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored through unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was "Iterate through columns to generate separate plots in python", but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10].
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
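A short loop over those columns then produces one plot per categorical column. A minimal sketch, assuming matplotlib is available and that a bar chart of the value counts is what you are after:

import matplotlib.pyplot as plt

low_card_cols = df.columns[df.nunique() < 10]

for col in low_card_cols:
    df[col].value_counts().plot(kind='bar', title=col)   # one figure per column
    plt.tight_layout()
    plt.show()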

Calculate column values in pandas based on previous rows of data in another column

Let's say I have a table with two columns: Date and Amount. The number of rows is not more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that this is not a fixed range but a dynamic one, following this logic: for each row in my table, the trimmed mean is calculated from the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, it should be calculated with however many rows are available.
e.g. TrimMean[1000] = stats.trim_mean(values of Amount from rows 910 to 999)
TrimMean[12] = stats.trim_mean(values of Amount from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
We can calculate the trimmed mean by applying trim_mean over a rolling window of size 90 with min_periods=1, then shift the result down one row so each value only uses the rows above the current one:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
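To attach this as the new column described in the question, the same expression can be assigned back to the table. A sketch, assuming the table is named my_table as in the question:

from scipy.stats import trim_mean

my_table['TrimMean'] = (
    my_table['Amount']
    .rolling(90, min_periods=1)
    .apply(trim_mean, args=(0.1,))
    .shift()   # use only the rows above the current one
)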

How do I subset with .isin (seems like it doesn't work properly)?

I'm a student from Moscow State University and I'm doing a small research project about suburban railroads. I crawled information from Wikipedia about all stations in the Moscow region, and now I need to subset those that are Moscow Central Diameter 1 (a railway line) stations. I have a list of Diameter 1 stations (d1_names), and what I'm trying to do is subset the whole dataframe (suburban_rail) with the pandas isin method. The problem is that it returns only 2 stations (the first one and the last one), though I'm pretty sure there are more, because using str.contains with the supposedly absent stations returns what I was looking for (so they are in the dataframe). I've already checked the spelling and tried applying strip() to each element of both the dataframe and the stations' list. I've attached several screenshots of my code.
(Screenshots: the suburban_rail dataframe, the stations' list used to subset, what isin returns, and manual checks for the Bakovka and Nemchinovka stations.)
Thanks in advance!
Next time provide a minimal reproducible example, such as the one below:
import pandas as pd

suburban_rail = pd.DataFrame({'station_name': ['a', 'b', 'c', 'd'],
                              'latitude': [1, 2, 3, 4],
                              'longitude': [10, 20, 30, 40]})
d1_names = pd.Series(['a', 'c', 'd'])
suburban_rail
station_name latitude longitude
0 a 1 10
1 b 2 20
2 c 3 30
3 d 4 40
Now, to answer your question: using .loc the problem is solved:
suburban_rail.loc[suburban_rail.station_name.isin(d1_names)]
station_name latitude longitude
0 a 1 10
2 c 3 30
3 d 4 40

dataframe count frequency of a string in a column

I have a CSV with 1000 rows that I read in my Python code, returning a new DataFrame with 3 columns:
noOfPeople, Description, and Location
My final df will be like this one:
id companyName noOfPeople Description Location
1 comp1 75 tech USA
2 comp2 22 fashion USA
3 comp3 70 tech USA
I want to write code that stops once I have 200 rows where noOfPeople is greater than or equal to 70, and returns all the remaining rows empty. So the code will count rows where noOfPeople >= 70; once 200 rows meet this condition, the code should stop.
Can someone help?
df[df['noOfPeople'] >= 70].iloc[:200]
Use head or iloc to select the first 200 values and then take the max:
print(df['noOfPeople'].iloc[:200].max())
And add whatever filter you need.
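If the goal is to keep rows only until the 200th row with noOfPeople >= 70 has been seen, a cumulative count of the condition avoids any explicit loop. A sketch, assuming df holds the full CSV:

count_so_far = (df['noOfPeople'] >= 70).cumsum()        # running count of qualifying rows
# keep rows until the 200th qualifying row; everything after it is cut off
result = df[count_so_far.shift(fill_value=0) < 200]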
