I have a data frame df and I would like to keep a running total of the names that occur in one of its columns. I am trying to calculate the running total column:
name  running total
a     1
a     2
b     1
a     3
c     1
b     2
There are two ways I thought of to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Update the count field for each value in the dataframe. In Excel I would use COUNTIF combined with a drag-down formula, writing the range as A$1:A1 to fix the first reference but leave the second one relative, so that the range I am looking in grows with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
@bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
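For example, here is a minimal sketch on the question's own data (the frame is reconstructed from the example above):

import pandas as pd

# Toy frame matching the question's example
df = pd.DataFrame({'name': ['a', 'a', 'b', 'a', 'c', 'b']})
df['running total'] = df.groupby(['name']).cumcount() + 1
print(df)
#   name  running total
# 0    a              1
# 1    a              2
# 2    b              1
# 3    a              3
# 4    c              1
# 5    b              2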
I'm rather new to Python; so far I've only done the very basics in an intro to Python class. I was handed this data set and thought it could easily be handled in Python, but I have no idea where to begin.
I have a three-column table in Excel. The first column is a code, the second is a row number, and the third is a numeric value. For every unique combination of the first two columns (for example, FLD04 in the first column and 1 in the second), I want to find the difference between the max and min of the third column, plus 1, and print a line such as FLD04 1 30 (30 being the max-min difference plus 1). I need to iterate this for every instance where the first and second columns together are unique.
I can't figure out how to paste the Excel info as anything but an image, sorry. I just wanted to post it to help illustrate what I am dealing with.
When I first learned Python, I liked to print out intermediate variables to see what they look like. You may try the following code, but change the column names ("CODE", "ROW", "NUMBER") to the right names for your file.
import pandas as pd

data = pd.read_excel(...)  # Read your data: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
# Group by the first two columns; for each group, find the max and min of column 3
max_vals = data.groupby(["CODE", "ROW"])["NUMBER"].max()
min_vals = data.groupby(["CODE", "ROW"])["NUMBER"].min()
result = max_vals - min_vals + 1
print(result)
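If you prefer a single pass, here is a hedged sketch using agg; the sample values are made up for illustration, since the original data was only posted as an image:

import pandas as pd

# Hypothetical data mirroring the question's layout
data = pd.DataFrame({
    "CODE": ["FLD04", "FLD04", "FLD04", "FLD05"],
    "ROW": [1, 1, 1, 2],
    "NUMBER": [10, 25, 39, 7],
})

# max - min + 1 per unique (CODE, ROW) pair, computed in one groupby
result = data.groupby(["CODE", "ROW"])["NUMBER"].agg(lambda s: s.max() - s.min() + 1)
print(result)
# CODE   ROW
# FLD04  1      30
# FLD05  2       1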
From the dataframe below:
I would like to group the column 'datum' by date (01-01-2019 and so on) and at the same time get the average of the column 'PM10_gemiddelde'.
Right now each date such as 01-01-2019 appears 24 times (once per hour), and I need it combined into one row with the average of 'PM10_gemiddelde'. See the picture for the data.
Besides that, 'PM10_gemiddelde' also contains negative data. How can I easily remove that data in Python?
Thank you!
PS: I'm new to Python.
What you are trying to do can be achieved with:
data[['datum', 'PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start first by removing the negative data:
new_df = df[df['PM10_gemiddelde'] > 0].copy()  # .copy() avoids a SettingWithCopyWarning when adding a column below
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
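Putting both steps together, here is a small self-contained sketch; the values are made up, since the original data is only shown as a picture:

import pandas as pd

# Hypothetical hourly readings; each 'datum' repeats once per hour
df = pd.DataFrame({
    'datum': ['01-01-2019', '01-01-2019', '01-01-2019', '02-01-2019'],
    'PM10_gemiddelde': [10.0, -1.0, 20.0, 30.0],
})

new_df = df[df['PM10_gemiddelde'] > 0].copy()  # drop negative readings
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
print(new_df)
#         datum  PM10_gemiddelde  avg_col
# 0  01-01-2019             10.0     15.0
# 2  01-01-2019             20.0     15.0
# 3  02-01-2019             30.0     30.0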
I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example: 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration. My current implementation is something like the following; as you can see, using two loops is not optimal (I guess I could get rid of one by using apply(row)):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because the loop version is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I don't have a small, finite set of column names but rather a wide range (all the examples I see use a handful of named columns...).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of each column) with the previous non-zero value per column.
If you want to do it rowwise unfortunately the .replace() function does not accept an axis argument. But you can transpose your dataframe, replace the zeros and transpose it again: df.T.replace(0, method='ffill').T
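As a sketch of the same idea: replace(method='ffill') is deprecated in recent pandas releases, so an equivalent row-wise version can be built from mask plus ffill (the year columns here are assumptions for illustration):

import pandas as pd

# One toy row matching the question's example; columns are year strings
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  columns=[str(y) for y in range(1991, 2001)])

# Turn zeros into NaN, forward-fill along each row, then restore leading zeros
spread = df.mask(df == 0).ffill(axis=1).fillna(0).astype(int)
print(spread.values.tolist())  # [[0, 0, 1, 1, 1, 3, 3, 3, 2, 1]]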
I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
   ID     CODE   COST
0  60086  V2401  105.38
1  60142  V2500  221.58
2  60086  V2500  105.38
3  60134  V2750  35
4  60134  V2020  0
I am trying to create a dataframe that has ID as the rows, CODE as the columns, and COST as the values, since the cost for the same code differs per ID. How can I do this in pandas?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot_table, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df.)
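A quick sketch on the question's data to show the result:

import pandas as pd

df = pd.DataFrame({
    'ID': [60086, 60142, 60086, 60134, 60134],
    'CODE': ['V2401', 'V2500', 'V2500', 'V2750', 'V2020'],
    'COST': [105.38, 221.58, 105.38, 35.0, 0.0],
})

wide = df.pivot_table(index='ID', columns='CODE', values='COST')
print(wide)
# CODE   V2020   V2401   V2500  V2750
# ID
# 60086    NaN  105.38  105.38    NaN
# 60134    0.0     NaN     NaN   35.0
# 60142    NaN     NaN  221.58    NaN

Note that pivot_table aggregates (by default with the mean) if an (ID, CODE) pair appears more than once, whereas plain pivot would raise an error on duplicates.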
I have a dataframe named Tasks, containing a column named UserName. I want to count every occurrence of the same UserName, thereby finding out how many tasks each user has been assigned. For a better understanding, here's what my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a Series like this:
Index  Username
John   4
Paul   1
Radu   1
Which is not exactly what I am looking for. How should I re-write the code in order to achieve this:
Most_Involved
Index  UserName  Tasks
0      John      4
1      Paul      1
2      Radu      1
You can use transform to add a new column to existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally select the columns needed
df = df[['Index','UserName','Tasks']]
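Alternatively, you can build the desired frame straight from value_counts. A sketch, with sample data made up to match the question's counts:

import pandas as pd

# Hypothetical Tasks frame matching the question
Tasks = pd.DataFrame({'UserName': ['John', 'John', 'Paul', 'John', 'Radu', 'John']})

# Turn the value_counts Series into a two-column frame
Most_Involved = (Tasks['UserName']
                 .value_counts()
                 .rename_axis('UserName')
                 .reset_index(name='Tasks'))
print(Most_Involved)
#   UserName  Tasks
# 0     John      4
# 1     Paul      1
# 2     Radu      1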
You can find duplicate rows based on columns by using pandas:
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]