Pandas Group By and Sum, Header being removed - python

After I run the following code, I seem to lose the headers of my dataframe. If I remove the line below, my headers exist.
unifiedview = unifiedview.groupby(['key','MTM'])['MTM'].sum()
When I use to_csv, my Excel file has no headers.
I've tried:
unifiedview = unifiedview.groupby(['key','MTM'], as_index = False)['MTM'].sum()
unifiedview = unifiedview.reset_index()
Any help would be appreciated.

Calling
unifiedview.groupby(['key','MTM'])['MTM']
will return a Pandas Series of only the 'MTM' column...
Therefore, the expression
unifiedview.groupby(['key','MTM'])['MTM'].sum() will return the sum of the GroupBy'd 'MTM' column...
unifiedview.groupby(['key','MTM']).sum().reset_index() should return the sum of all columns in unifiedview of the int or float dtype.
Are you looking to preserve all columns from the original dataframe?
Also, you must place an aggregation method after the groupby call: unifiedview.groupby(['key','MTM']) must be followed by a .count(), .sum(), .mean(), ... method to aggregate your columns, for example:
unifiedview.groupby(['key','MTM']).sum()
unifiedview.groupby(['key','MTM']).count()
unifiedview.groupby(['key','MTM']).mean()
Does this help point you in the right direction?
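For example, a minimal sketch with made-up data (the extra 'qty' column is hypothetical) showing how reset_index() keeps the headers through to_csv:
import pandas as pd
unifiedview = pd.DataFrame({'key': ['a', 'a', 'b'],
                            'MTM': [1.0, 2.0, 3.0],
                            'qty': [10, 20, 30]})
# .sum() aggregates every numeric column; reset_index() turns the group
# keys back into regular columns, so the headers survive to_csv.
out = unifiedview.groupby(['key', 'MTM']).sum().reset_index()
out.to_csv('unifiedview.csv', index=False)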

What version of pandas are you using? If you check the documentation it states:
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series
Since your code transforms the dataframe into a Series object, this might be the cause of your issue.
The documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
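For example, if you keep the Series result, you can ask for the header explicitly (a minimal sketch, assuming unifiedview has 'key' and 'MTM' columns):
summed = unifiedview.groupby(['key', 'MTM'])['MTM'].sum()
# header=True writes the 'MTM' column name even for a Series.
summed.to_csv('unifiedview.csv', header=True)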

Related

How to drop rows in python that don't have date values?

I need help cleaning a very large dataframe. One of the columns, "PostingTimeUtc", should contain only dates, but several rows were inserted incorrectly and contain strings of text instead. How can I select all the rows where "PostingTimeUtc" has strings instead of dates and drop them?
I'm new to this site and to coding, so please let me know if I'm being vague.
Please remember to add examples, even short ones.
This may work in your case:
from pandas.api.types import is_datetime64_any_dtype as is_datetime
# Keep only the rows whose value is recognized as a datetime.
df[df['column name'].map(is_datetime)]
Here map applies the is_datetime function (returning True or False) to each value, and the resulting Boolean mask filters the dataframe.
Don't forget to assign df to this result to retain the values, as it is not done in place.
df = df[df['column name'].map(is_datetime)]
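An alternative sketch, using pd.to_datetime with errors='coerce' so that anything which can't be parsed as a date becomes NaT and is dropped (the column name and data are made up):
import pandas as pd
df = pd.DataFrame({'PostingTimeUtc': ['2021-01-01', 'not a date', '2021-02-03']})
# Unparseable values become NaT; notna() keeps only the real dates.
parsed = pd.to_datetime(df['PostingTimeUtc'], errors='coerce')
df = df[parsed.notna()]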
I am assuming it's a pandas dataframe. You can filter rows on the basis of a regex like this:
df.column_name.str.contains('your regex here')
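For example, a hedged sketch that drops rows whose values contain letters, assuming the genuine dates stringify without any letters:
import pandas as pd
df = pd.DataFrame({'PostingTimeUtc': ['2021-01-01', 'not a date']})
# astype(str) makes the .str accessor safe on a mixed-type column.
mask = df['PostingTimeUtc'].astype(str).str.contains('[A-Za-z]')
df = df[~mask]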

Returning column from dataframe by name

I have a dataframe with given column names and I want to return the column with a specified name:
name_of_column = 'name1' # string variable
I tried to use this:
dataframe.iloc[:, name_of_column]
But it did not work. What should I do?
Use loc instead of iloc and your syntax will work. iloc is for indexing by integer position (this is what the i stands for), while loc is for indexing by label. So you can use:
dataframe.loc[:, name_of_column]
Having said this, the more usual way to retrieve a series is to use __getitem__ directly:
dataframe[name_of_column]
You can just do:
dataframe[column_name]
This will select the column.
The iloc method selects items in pandas by integer position, not by label.
You can find more examples of selecting data in the pandas Indexing and Selecting Data documentation.
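A quick toy example (made-up data) contrasting the three spellings:
import pandas as pd
dataframe = pd.DataFrame({'name1': [1, 2], 'name2': [3, 4]})
name_of_column = 'name1'
print(dataframe.loc[:, name_of_column])  # select by label
print(dataframe[name_of_column])         # __getitem__, same result
print(dataframe.iloc[:, 0])              # by integer position of 'name1'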

How can I set the index of a generated pandas Series to a column from a DataFrame?

In pandas this operation creates a Series:
q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1)
I would like to be able to set the index to a list of values from a df column, i.e.
list(df['Colname'])
I've tried creating the series and then updating it with the series generated from the first code snippet. I've also searched the docs and don't see a method that will allow me to do this. I would prefer not to iterate over it manually.
Help is appreciated.
You can simply store that series in a variable, say S, and set its index accordingly, as shown below:
S = (q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1))
S.index = df['Colname']
This code assumes the lengths of the series and the column from the dataframe are equal. Hope this helps!
If you want to reset the index of a series s, you can do:
s.index = new_index_list
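A self-contained sketch with toy data standing in for q7 and df (the row counts must match):
import pandas as pd
q7 = pd.DataFrame({'a': [1, 5], 'b': [4, 2]})
df = pd.DataFrame({'Colname': ['x', 'y']})
# Per-row range, as in the question, then re-labelled by df['Colname'].
S = q7.max(axis=1) - q7.min(axis=1)
S.index = df['Colname']
print(S)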

Accessing groups in Pandas lambda function

I have a Pandas dataframe with a multiindex. Level 0 is 'Strain' and level 1 is 'JGI library.' Each 'Strain' has several 'JGI library' columns associated with it. I would like to use a lambda function to apply a t-test to compare two different strains. To troubleshoot, I have been taking one row of my dataframe using the .iloc[0] command.
from scipy.stats import ttest_ind  # import needed for the t-test below
row = pvalDf.iloc[0]
parent = 'LL1004'
child = 'LL345'
ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]
This works as expected. Now I try to apply it to my whole dataframe
parent = 'LL1004'
child = 'LL345'
pvalDf = countsDf4.apply(lambda row: ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1])
Now I get an error message saying, "ValueError: ('level name Strain is not the name of the index', 'occurred at index (LL1004, BCHAC)')"
'LL1004' is a 'Strain,' but Pandas doesn't seem to be aware of this. It looks like maybe the multiindex was not passed to the lambda function correctly? Is there a better way to troubleshoot lambda functions than using .iloc[0]?
I put a copy of my Jupyter notebook and an Excel file with the countsDf4 dataframe on GitHub: https://github.com/danolson1/pandas_ttest
Thanks,
Dan
How about, more simply:
pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child]), axis=1)
I've tested it on your notebook and it works.
Your problem is that DataFrame.apply() by default applies the function to each column, not to each row. So, you need to specify the axis=1 parameter to override the default behavior and apply the function row by row.
Also, there's no reason to use row.groupby(level='Strain').get_group(x) when you could simply index the group of columns by row[x]. :)
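A toy reproduction (strain and library names as in the question, data invented) of why axis=1 plus plain row[x] indexing works:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
# MultiIndex columns: level 0 is 'Strain', level 1 is 'JGI library'.
cols = pd.MultiIndex.from_tuples(
    [('LL1004', 'libA'), ('LL1004', 'libB'),
     ('LL345', 'libC'), ('LL345', 'libD')],
    names=['Strain', 'JGI library'])
countsDf4 = pd.DataFrame(np.random.rand(3, 4), columns=cols)
# row['LL1004'] partial-indexes level 0 and returns that strain's libraries.
pvalDf = countsDf4.apply(
    lambda row: ttest_ind(row['LL1004'], row['LL345'])[1], axis=1)
print(pvalDf)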

pandas dataframe: add new column based on calculation on other columns and avoid chained indexing

I have a pandas dataframe and I need to add a new column based on a calculation over specific columns, indicated by a column 'site'. I have found a way to do this by resorting to numpy, but it always gives a warning about chained indexing. I am sure there is a better solution; please help if you know one.
df_num_bin1['Chip_id_3']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_89_S1]*0x100+df_num_bin1[WB_78_S1],df_num_bin1[WB_89_S2]*0x100+df_num_bin1[WB_78_S2])
df_num_bin1['Chip_id_2']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_67_S1]*0x100+df_num_bin1[WB_56_S1],df_num_bin1[WB_67_S2]*0x100+df_num_bin1[WB_56_S2])
df_num_bin1['Chip_id_1']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_45_S1]*0x100+df_num_bin1[WB_34_S1],df_num_bin1[WB_45_S2]*0x100+df_num_bin1[WB_34_S2])
df_num_bin1['Chip_id_0']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_23_S1]*0x100+df_num_bin1[WB_12_S1],df_num_bin1[WB_23_S2]*0x100+df_num_bin1[WB_12_S2])
df_num_bin1['mac_low']=(df_num_bin1['Chip_id_1'].map(int) % 0x10000) *0x100+df_num_bin1['Chip_id_0'].map(int) // 0x1000000
The code above has two issues:
1: The value of the column [key_site_num] determines which columns I should extract the chip id data from. In this example it is only site 0 or 1, but it could actually be 2 or 3 as well. I need a general solution.
2: It generates a chained indexing warning:
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:35: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Well, I'm not too sure about your first question, but I think this will help you.
import pandas as pd
reader = pd.read_csv(path, engine='python')
# Operate on whole columns directly; no intermediate value extraction needed.
reader['new'] = reader['treasury.maturity.rate'] + reader['bond.yield']
reader.to_csv('test.csv', index=False)
As you can see, you don't need to get the values out before operating on them; just reference the columns where they live. To do the same for only specific rows, you can filter the dataframe before creating the new column.
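For the 'general solution' part of the question, a sketch using np.select, which generalizes np.where to any number of sites (column names follow the question's pattern, data invented):
import numpy as np
import pandas as pd
n_sites = 2
df = pd.DataFrame({
    'site': [1, 2, 2, 1],
    'WB_89_S1': [1, 2, 3, 4], 'WB_78_S1': [5, 6, 7, 8],
    'WB_89_S2': [9, 10, 11, 12], 'WB_78_S2': [13, 14, 15, 16],
})
# One condition/choice pair per site; np.select picks the match per row.
conditions = [df['site'] == s for s in range(1, n_sites + 1)]
choices = [df[f'WB_89_S{s}'] * 0x100 + df[f'WB_78_S{s}']
           for s in range(1, n_sites + 1)]
df['Chip_id_3'] = np.select(conditions, choices)
If df was itself sliced from another dataframe, take df = df.copy() first so the assignment lands on a real copy and the SettingWithCopyWarning goes away.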
