Pandas groupby(...).mean() lost keys - python

I have a dataframe rounds (the result of deleting a column from another dataframe) with the following structure (can't post pics, sorry):
----------------------------
|type|N|D|NATC|K|iters|time|
----------------------------
rows of data
----------------------------
I use groupby so I can then get the mean of the groups, like so:
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean()
I get the means that I wanted but I get a problem with the keys. The results_mean dataframe has the following structure:
----------------------------
| | | | | | |time|
|type|N|D|NATC|K|iters| |
----------------------------
rows of data
----------------------------
The only key recognized is time (I executed results_mean.keys()).
What did I do wrong? How can I fix it?

In your aggregated data, time is the only column. The other ones are indices.
groupby has a parameter as_index. From the documentation:
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
So you can get the desired output by calling
rounds = results.groupby(['type','N','D','NATC','K','iters'], as_index = False)
results_mean = rounds.mean()
Or, if you want, you can always convert indices to keys by using reset_index. Using
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean().reset_index()
should have the desired effect as well.
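For example, here is a quick sketch on a made-up two-key frame (the toy column names below are not from your data) showing that both variants give you plain columns back:
import pandas as pd
results = pd.DataFrame({'type': ['a', 'a', 'b'],
                        'N': [1, 1, 2],
                        'time': [0.5, 0.7, 1.2]})
# Variant 1: keep the group labels as ordinary columns from the start
mean_flat = results.groupby(['type', 'N'], as_index=False).mean()
# Variant 2: group with the default MultiIndex, then move it back into columns
mean_reset = results.groupby(['type', 'N']).mean().reset_index()
print(mean_flat.columns.tolist())   # ['type', 'N', 'time']
print(mean_reset.columns.tolist())  # ['type', 'N', 'time']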

I had the same problem of losing the dataframe's keys after using groupby(); the workaround I found was to write the DataFrame out to a CSV file and then read that file back in.

Related

Which function could perform a further calculation with INDEX+MATCH (Excel) in Python?

I'm very new to using Python in my work. Here's the Excel worksheet I have:
A is the product name in July,
B is the product Qty in July,
C is the product name in Aug,
D is the product Qty in Aug.
I need to get the difference between them:
find the Qty actually sold in the next month,
then calculate the subtraction.
|A | B|C | D|
|SRHAVWRQT | 1|SRHAVWRQT | 4|
|SCMARAED3MQT| 0|SCMARAED3MQT| 2|
|WVVMUMCRQT | 7|WVVMUMCBR | 7|
...
...
I know how to solve this in Excel, with INDEX + MATCH and then the difference:
=G3-INDEX(B:B,MATCH(F3,A:A,0))
which gives the result I need.
[screenshot: the original data]
[screenshot: the desired result]
But how would I do this in Python, and which tool should I use
(e.g. pandas? numpy?)?
The other answers I've read seem to perform only the INDEX/MATCH part,
and/or they solve calculations across multiple sheets,
but I just need the result from 2 columns.
How to perform an Excel INDEX MATCH equivalent in Python
Index match with python
Calculate Match percentage of values from One sheet to another Using Python
https://focaalvarez.medium.com/vlookup-and-index-match-equivalences-in-pandas-160ac2910399
Or is there just a completely different way of processing this in Python?
A classic use case there. For anything involving Excel and Python, you'll want to familiarise yourself with the Pandas library; it can handle a lot of what you're asking for.
Now, to how to solve this problem in particular. I'm going to assume the data in the relevant worksheet is as you showed it above: no column headings, with the data starting at row 1 in columns A, B, C and D. You can use the code below to load this into Python. It loads the sheet without column or row names, so the dataframe starts at [0, 0] rather than "A1", since rows and columns in pandas are zero-based.
import pandas as pd
excelData = pd.read_excel("<DOCUMENT NAME>", sheet_name="<SHEET NAME>", header=None)
After you have loaded the data, you then need to match the month 1 data to its month 2 indices. This is a little complicated, and the way I recommend doing it involves defining your own python function using the "def" keyword. A version of this I quickly whipped up is below:
#Extract columns "A & B" and "C & D" to separate month 1 and 2 dataframes respectively.
month1_data: pd.DataFrame = excelData.iloc[:, 0:2]
month2_data: pd.DataFrame = excelData.iloc[:, 2:4]
#Define a matching function to match a single row (series) with its corresponding row in a passed dataframe
def matchDataToIndex(dataLine: pd.Series, comparison: pd.DataFrame):
    # Look the product name up in the first *column* of the comparison frame (not its first row)
    matchIndex = comparison.iloc[:, 0].tolist().index(dataLine.tolist()[0])
    # Series.append was removed in recent pandas, so concatenate; a labelled extra element keeps apply() happy
    return pd.concat([dataLine, pd.Series([matchIndex], index=['match_index'])])
#Apply the defined matching function to each row of the month 1 data
month1_data_with_match = month1_data.apply(matchDataToIndex, axis=1, args=(month2_data,))
There is a lot of stuff there that you are probably not familiar with if you are only just getting started with Python, which is why I recommend getting acquainted with Pandas. That said, after this runs, the variable month1_data_with_match will be a three-column table containing the following columns:
Your Column A product name data.
Your Column B product quantity data.
An index expressing which row in month2_data contains the matching Column C and D data.
With those three pieces of information together, you should then be able to calculate your other statistics.
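For the final subtraction itself, here is a minimal sketch of an alternative route (not the method above, just one option): merge the two halves on the product name and subtract. The column labels name/qty_jul/qty_aug/diff below are made up for illustration, and the placeholders in read_excel are the same as above.
import pandas as pd

# Same assumed layout as above: no headers, columns 0-1 are month 1, columns 2-3 are month 2
excelData = pd.read_excel("<DOCUMENT NAME>", sheet_name="<SHEET NAME>", header=None)
month1 = excelData.iloc[:, 0:2].set_axis(['name', 'qty_jul'], axis=1)
month2 = excelData.iloc[:, 2:4].set_axis(['name', 'qty_aug'], axis=1)

# Left-join the month 1 quantities onto month 2 by product name (the pandas counterpart
# of MATCH + INDEX), then subtract, mirroring =G3-INDEX(B:B,MATCH(F3,A:A,0)).
# Products with no month-1 match come out as NaN rather than raising an error.
merged = month2.merge(month1, on='name', how='left')
merged['diff'] = merged['qty_aug'] - merged['qty_jul']
print(merged)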

Aggregate function in pandas dataframe not working appropriately

I'm trying to sum a certain column based on a groupby of another column. I have the code right, but the output is wildly different. So I tried a simple min() function on that groupby, and its output is also completely different from the expected output. Did I do something wrong by chance?
Below is the df displayed. I grouped it by lga_desc, and when I tested for the minimum value within those groups, I got the wrong output:
|Taxable Income |lga_desc|
|300,000,450 |Alpine |
|240,000 |Alpine |
|700,000 |Alpine |
|260,000,450 |Ararat |
|469,000 |Ararat |
|5,200,000 |Ararat |
df = df.groupby('lga_desc')
df = df['Taxable Income'].min()
output when applying min function:
lga_desc
Alpine 700,000
Ararat 469,000
These are the wrong outputs for the given dataframe.
Thank you for the help!
Update: after carefully checking my code again, it turns out that when I imported this file, all the numbers became strings. So, a lesson: don't forget to make sure your numbers are actual numbers, not strings :)
You need to convert the data type to int first:
df['Taxable Income'] = df['Taxable Income'].str.replace(',', '').astype(int)
result = df.groupby('lga_desc')['Taxable Income'].min().reset_index()
OUTPUT:
lga_desc Taxable Income
0 Alpine 240000
1 Ararat 469000
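The underlying reason, for what it's worth, is that min() on strings compares lexicographically, character by character, so the magnitude of the number is irrelevant. A quick illustration in plain Python:
values = ["999", "1,000,000", "240,000"]
# String comparison looks at characters, so "1,000,000" sorts first ('1' < '2' < '9')
print(min(values))                                   # 1,000,000
# After stripping commas and converting, the numeric minimum is what you expect
print(min(int(v.replace(',', '')) for v in values))  # 999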

Python Pandas - Find rows where element is in row's array

I want to find all rows where a certain value is present inside the column's list value.
So imagine I have a dataframe set up like this:
| placeID | users |
------------------------------------------------
| 134986| [U1030, U1017, U1123, U1044...] |
| 133986| [U1034, U1011, U1133, U1044...] |
| 134886| [U1031, U1015, U1133, U1044...] |
| 134976| [U1130, U1016, U1133, U1044...] |
How can I get all rows where 'U1030' exists in the users column?
Or... is the real problem that I should not have my data arranged like this, and I should instead explode that column to have a row for each user?
What's the right way to approach this?
The way you have stored data looks fine to me. You do not need to change the format of storing data.
Try this :
df1 = df[df['users'].str.contains("U1030")]
print(df1)
This will give you all the rows containing the specified user, in df format. (Note that .str.contains applies to string values; if each cell is an actual Python list rather than a string, use df[df['users'].apply(lambda u: "U1030" in u)] instead.)
When you want to check whether a value exists inside a column whose values are lists, it's helpful to use the map function.
Implemented as below with an inline lambda, each list stored in the 'users' column is bound to the name u, and userID is checked against it.
The answer is pretty straightforward when you look at the code below:
# user_filter filters the dataframe to all the rows where
# 'userID' is NOT in the 'users' column (the value of which
# is a list type)
user_filter = df['users'].map(lambda u: userID not in u)
# cuisine_filter filters the dataframe to only the rows
# where 'cuisine' exists in the 'cuisines' column (the value
# of which is a list type)
cuisine_filter = df['cuisines'].map(lambda c: cuisine in c)
# Display the rows that satisfy both filters
df[user_filter & cuisine_filter]
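As for the explode idea mentioned in the question: you don't have to restructure your data, but if you do want one row per user, a small sketch of that route (assuming the cells hold real Python lists) would be:
# Explode the list column so each user gets its own row, then filter normally
exploded = df.explode('users')
matches = exploded[exploded['users'] == 'U1030']
print(matches)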

Count elements satisfying an extra condition on another column when group-bying in pyspark

The following pyspark command
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
created the following result.
+---------+---------+
|URL_short|NumOfReqs|
+---------+---------+
|    http1|      500|
|    http4|      500|
|    http2|      500|
|    http3|      500|
+---------+---------+
In the original DataFrame dataFrame I have a column named success whose type is text. The value can be "true" or "false".
In the result I would like to have an additional column named for example NumOfSuccess which counts the elements having entry "true" in the original column success per category URL_short.
How can I modify
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
to also output a column counting the rows that satisfy the condition success == "true" per URL_short category?
One way to do it is to add another aggregation expression (also turn the count into an agg expression):
import pyspark.sql.functions as f
dataFrame.groupBy("URL_short").agg(
f.count('*').alias('NumOfReqs'),
f.sum(f.when(f.col('success'), 1).otherwise(0)).alias('CountOfSuccess')
).show()
Note that this assumes your success column is of boolean type; if it's a string, change the expression to f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('CountOfSuccess').
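An equivalent formulation, if you prefer a conditional count over a sum (count() skips the nulls that when() produces without an otherwise()), might look like this for the string-typed column:
import pyspark.sql.functions as f

dataFrame.groupBy("URL_short").agg(
    f.count('*').alias('NumOfReqs'),
    f.count(f.when(f.col('success') == 'true', True)).alias('NumOfSuccess')
).show()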

Appending to a Pandas dataframe and specifying the row index

I'm a little confused about the workings of the Pandas dataframe.
I have a pd.DataFrame that looks like this:
index | val1 | val2
-----------------------------------
20-11-2017 22:33:20 | 0.33 | 05.43
23-11-2017 23:34:14 | 4.23 | 09.43
I'd like to append a row to it, and be able to specify the index, which in my case is a date and time.
I have tried the following methods:
dataframe = pd.DataFrame(columns=['val1', 'val2'])
dataframe.loc[someDate] = [someVal, someVal]
This seems to overwrite if the index already exists, but I want to be able to have duplicate indices.
dataframe = pd.DataFrame(columns=['val1', 'val2'])
record = pd.Series(
    index=[someDate],
    data=[someVal, someVal]
)
dataframe.append(record)
This causes the application to hang without returning an exception or error.
Am I missing something? Is this the correct way of doing the thing I want to achieve?
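For reference, a minimal sketch of one way to append while keeping duplicate timestamps: build a one-row DataFrame keyed by the date and use pd.concat (DataFrame.append has since been removed from pandas). The someDate/someVal values below are placeholders standing in for the names in the question.
import pandas as pd

dataframe = pd.DataFrame(columns=['val1', 'val2'])
someDate = pd.Timestamp('2017-11-20 22:33:20')  # placeholder timestamp
someVal = 0.33                                  # placeholder value

# pd.concat keeps duplicate index labels, so the same timestamp can appear twice
row = pd.DataFrame({'val1': [someVal], 'val2': [someVal]}, index=[someDate])
dataframe = pd.concat([dataframe, row])
dataframe = pd.concat([dataframe, row])
print(dataframe)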
