I have created a blank dataframe, and I want to populate it with sums of columns from other dataframes.
I have 4 other dataframes (one for each quarter, Q1, Q2 etc) that have columns for weight and value. I want to create a sum of each of those columns for the first row in my blank dataframe.
I have included a picture of one of the quarterly dataframes; it's the last 2 columns whose sums I want inserted into the blank dataframe.
Edit: I think I've solved it, leaving it here in case anyone finds it helpful (or can improve what I have done).
My solution was:
totalYield2017.loc['Q1'] = [fishLandingsQ1_df['Landed Weight (tonnes)'].sum(), fishLandingsQ1_df['Value(£)'].sum()]
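In case it helps anyone else, below is a minimal sketch of the same approach extended to all four quarters. Only the Q1 dataframe name appears in the original code, so the Q2-Q4 names here are assumptions:

import pandas as pd

# Empty summary frame with one column per total
totalYield2017 = pd.DataFrame(columns=['Landed Weight (tonnes)', 'Value(£)'])

# The Q2-Q4 dataframe names are guesses based on the Q1 name in the question
quarterly = {
    'Q1': fishLandingsQ1_df,
    'Q2': fishLandingsQ2_df,
    'Q3': fishLandingsQ3_df,
    'Q4': fishLandingsQ4_df,
}

# One row per quarter, each holding the two column sums for that quarter
for quarter, q_df in quarterly.items():
    totalYield2017.loc[quarter] = [
        q_df['Landed Weight (tonnes)'].sum(),
        q_df['Value(£)'].sum(),
    ]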
Although there are several related pandas questions already answered, I cannot solve this issue. I have a large dataframe (~49,000 rows) and want to drop the rows (~120 of them) that meet two conditions at the same time:
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string2']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can create a mask for the rows you want to keep, then take only those rows. Also, there seems to be a logic error: you are checking two different conditions combined with AND.
df = df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())]
Also, if you need to check both conditions against the same column, you probably want to combine them with OR, i.e. |.
If needed, you can call reset_index at the end.
Also, as a side note, your to_remove list has the same string value twice; I'm assuming that's a typo in the question.
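To make the suggestion concrete, here is a small self-contained sketch of the mask-and-keep approach; the data and the column names are made up for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column 1': ['string1', 'string2', 'other', 'string1'],
    'Column 2': [np.nan, 1.0, np.nan, 2.0],
})
to_remove = ['string1', 'string2']

# A row is dropped only when BOTH conditions hold: a listed string AND a NaN
mask = df['Column 1'].isin(to_remove) & df['Column 2'].isna()
df = df[~mask].reset_index(drop=True)

print(df)  # only the first row is removed; every other row is kept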
I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large data set with many columns. Originally, I needed to sum all columns for each row by multiple groups based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
Running this code gives me a modified dataframe grouped by my two factor variables (group & id), with all the columns plus the final sum column I need. However, I now want to return the final sum column back into the original dataframe. The code above instead assigns the entire modified dataframe to my sum column. I know this is achievable in R by simply adding .$sum at the end of a piped chain. Any ideas on how to do this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code to my original dataframe.
Edit: To clarify, the code above returns this entire dataframe:
All I want returned is the column in yellow
Is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)
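If it helps to see it on toy data, here is a small made-up example (the column names are placeholders) showing why transform('sum') works here: it returns a result aligned to the original index, so the per-group totals can be written straight back as a new column.

import pandas as pd

data = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'group':  ['a', 'a', 'b', 'b'],
    'x_name': [1, 2, 3, 4],
    'y_name': [10, 20, 30, 40],
})

cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
print(data)
# Every row of 'sum' holds the total of all *_name columns for its (id, group) pair,
# e.g. 33 for both rows of (1, 'a') and 77 for both rows of (2, 'b').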
I have this data set that I have been able to organise to the best of my abilities. I'm stuck on the next step. Here is a picture of the df:
My goal is to organise it in a way so that I have the columns month, genres, and time_watched_hours.
If I do the following:
df = df.groupby(['month']).sum().reset_index()
It only sums down the 1's in the genre columns, whereas I need to add up the time_watched_hours for each instance of that genre occurring. For example, in the first row it would add 4.84 hours for genre comedies. In the third row, 0.84 hours for genre_Crime, and so on.
Once that's organised, I will use the following to get it in the format I need:
df_cleaned = df.melt(id_vars='month',value_name='time_watched_hours',var_name='Genres').rename(columns=str.title)
Any advice on how to tackle this problem would be greatly appreciated! Thanks!
EDIT: Looking at this further, it would also work to replace the "1" in each row with the time_watched_hours value, then I can groupby().sum() down. Note there may be more than one value of "1" per row.
I ended up finding and using mask for each column, which worked perfectly. The downside was that I had to write it out for each column:
df['genre_Action & Adventure'].mask(df['genre_Action & Adventure'] == 1, df['time_watched_hours'], inplace=True)
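In case it saves someone the per-column repetition, a possible alternative sketch, assuming the genre columns are 0/1 indicators and all share the genre_ prefix as in the question:

genre_cols = [c for c in df.columns if c.startswith('genre_')]

# Multiplying by time_watched_hours turns each 1 into that row's hours
# and leaves the 0s unchanged, the same effect as the mask above
df[genre_cols] = df[genre_cols].mul(df['time_watched_hours'], axis=0)

# The planned groupby and melt should then work as intended
df_grouped = df.groupby('month', as_index=False)[genre_cols].sum()
df_cleaned = df_grouped.melt(id_vars='month', value_name='time_watched_hours',
                             var_name='Genres').rename(columns=str.title)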
I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
# 1. Create an empty dataframe with the same columns as Claims
emptydf = Claims[0:0]

# 2. Parse the dataframe by one Part ID number and append it to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)

As you can see, I can only hard-code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID and the second is the subset of rows for that Part ID.
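To show how that can replace the hard-coded query, a minimal sketch assuming the Claims dataframe has a 'Part_ID' column as in the question:

groups = list(Claims.groupby('Part_ID'))

for part_id, subset in groups:
    # 'subset' is a dataframe containing only the rows for this Part ID
    print(part_id, len(subset))

# Or, if a dict keyed by Part ID is more convenient:
claims_by_part = {part_id: subset for part_id, subset in Claims.groupby('Part_ID')}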
I have some data with 4 features of interest: account_id, location_id, date_from and date_to. Each entry corresponds to a period where a customer account was associated with a particular location.
There are some pairs of account_id and location_id which have multiple entries, with different dates. This means that the customer is associated with the location for a longer period, covered by multiple consecutive entries.
So I want to create an extra column with the total length of time that a customer was associated with a given location. I am able to use groupby and apply to calculate this for each pair (see the code below). This works fine, but I don't understand how to then add this back into the original dataframe as a new column.
lengths = non_zero_df.groupby(['account_id','location_id'], group_keys=False).apply(lambda x: x.date_to.max() - x.date_from.min())
Thanks
I think Mephy is right that this should probably go to StackOverflow.
You're going to have a shape incompatibility because there will be fewer entries in the grouped result than in the original table. You'll need to do the equivalent of an SQL left outer join with the original table and the results, and you'll have the total length show up multiple times in the new column -- every time you have an equal (account_id, location_id) pair, you'll have the same value in the new column. (There's nothing necessarily wrong with this, but it could cause an issue if people are trying to sum up the new column, for example)
Check out pandas.DataFrame.join (you can also use merge). You'll want to join the old table with the results, on (account_id, location_id), as a left (or outer) join.
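For what it's worth, a sketch of that join/merge approach, reusing the names from the question and assuming date_from and date_to are already datetime columns:

lengths = (non_zero_df
           .groupby(['account_id', 'location_id'])
           .apply(lambda x: x.date_to.max() - x.date_from.min())
           .rename('total_length')
           .reset_index())

# Left-join the per-pair totals back onto the original rows; every row of a
# given (account_id, location_id) pair gets the same total_length value
non_zero_df = non_zero_df.merge(lengths, on=['account_id', 'location_id'], how='left')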