Add 3 new columns to DataFrame - python

Using Python I have the following:
indicators = service.getIndicators(data["temperature"])
The variables data and indicators are of type DataFrame.
In indicators I get 3 columns, each with the values of one indicator.
I am adding the 3 columns to the data DataFrame, whose first column holds the temperature values:
data["InA"] = indicators[indicators.columns[0]]
data["InAB"] = indicators[indicators.columns[1]]
data["OutC"] = indicators[indicators.columns[2]]
Is there a shorter way to call getIndicators and place the result in the data DataFrame? I feel I am using too much code just for this.
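One shorter option, as a sketch (assuming the three indicator columns should get exactly those names, in the same order, and that data and indicators share the same index):
indicators.columns = ["InA", "InAB", "OutC"]
data = data.join(indicators)
Or, on recent pandas versions, assign all three columns in one statement:
data[["InA", "InAB", "OutC"]] = indicators.to_numpy()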

Related

Pulling columns of dataframe into separate dataframe, then replacing duplicates with mean values

I'm new to the world of python so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with a single column containing the mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this dataframe corresponds to a particular exon. Every column in this dataframe corresponds to a time-point (AGE).
The dataframe looks like this:
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = pd.DataFrame()  ## pull columns of df as separate df that have this string
    if len(age_df.columns) > 1:  ## check if df has >1 SAME column, if so, take avg across SAME columns
        mean = age_df.mean(axis=1)
        mean_df[age] = mean
    else:
        pass  ## just pull out the values and put them into your temp_df
#4) Now, with my new averaged array (or same array if multiple ages NOT present), I want to place this array into my 'temp_df' under the appropriate columns. I understand that I should use the 'age' variable provided by the for loop to get the proper location/name of the column in my temp df. However, I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
import pandas as pd

data = [[1, 2, 3], [4, 5, 6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
#    col1  col2  col2
# 0     1     2     3
# 1     4     5     6
df = df.groupby(lambda x: x, axis=1).mean()
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5
The groupby function takes another function (the lambda): it is called with each column name and returns the group that column belongs to. In our case, we just want the column name itself to be the group. So, for the third column named col2, it says 'this column belongs to the group named col2', which already exists (because the second column was seen earlier). You then provide the aggregation you want, in this case mean().
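On newer pandas versions, where grouping along axis=1 is deprecated, a sketch of an equivalent approach is to transpose, group on the index, and transpose back:
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['col1', 'col2', 'col2'])
df = df.T.groupby(level=0).mean().T
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5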

Does Pandas `concat` method ignore the `axis` parameter, when concatenating a DataFrame with a Series?

I created a Series object.
for index, entry in a_data_frame.iterrows():
...
Then I would like to concatenate this series to a new/another data frame. My goal is to build up the new data frame based on some unique recombination of the rows in the previous one.
a_new_frame = pandas.concat((a_new_frame, a_series))
The series is appended at the end as a new column, regardless of the value of the axis parameter.
Why?
My experiments suggest that Pandas "thinks of" a series as a column. When I convert a series to a data frame, the result is a frame with a single column.
a_series.to_frame()
It makes sense to me that I was unable to concatenate this series ("column") with a data frame as a "row". The simplest solution is to transpose the single-column frame (built from the series) before concatenation.
a_series.to_frame().transpose()
a_new_frame = pandas.concat((a_new_frame, a_series.to_frame().transpose()))
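As a minimal, self-contained sketch of that workaround (the frame and series here are made up):
a_new_frame = pandas.DataFrame({"a": [1], "b": [2]})
a_series = pandas.Series({"a": 3, "b": 4})
a_new_frame = pandas.concat((a_new_frame, a_series.to_frame().transpose()), ignore_index=True)
#    a  b
# 0  1  2
# 1  3  4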
It's hard to tell what your code is actually doing, but if you want to shuffle (resample) a DataFrame or a Series you can use the .sample() method.
series = pd.Series([1, 2, 3, 4, 5])
series.sample(len(series))
Output (one possible random ordering):
2    3
0    1
4    5
1    2
3    4
dtype: int64

how to create a dataframe using groupby such that the grouping criteria is contained in the data

I wanted to create a 2D dataframe about coronavirus with one column containing countries and another containing the number of deaths. The csv file that I am using is date oriented, so for some days the number of deaths is 0, so I decided to group the rows by Country and sum them up. Yet it returned a dataframe with only 1 column, although when I write it to a csv file it creates 2 columns.
here is my code:
#import matplotlib.pyplot as plt
import pandas as pd

covid_data = pd.read_csv('countries-aggregated.csv')
bar_data = pd.DataFrame(covid_data.groupby('Country')['Deaths'].sum())
Difficult to give you a perfect answer without the dataset; however, groupby will set your key as the index, so selecting the single 'Deaths' column returns a Series. You can pass as_index=False:
bar_data = covid_data.groupby('Country', as_index=False)['Deaths'].sum()
Or, if you have only one column in the DataFrame to aggregate:
bar_data = covid_data.groupby('Country', as_index=False).sum()
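A minimal sketch of the difference, using a made-up miniature of the aggregated data:
covid_data = pd.DataFrame({'Country': ['A', 'A', 'B'], 'Deaths': [0, 3, 5]})
covid_data.groupby('Country')['Deaths'].sum()  # Series, Country becomes the index
covid_data.groupby('Country', as_index=False)['Deaths'].sum()
#   Country  Deaths
# 0       A       3
# 1       B       5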

adding a first difference column to a pandas dataframe

I have a dataframe df with two columns date and data. I want to take the first difference of the data column and add it as a new column.
It seems that df.set_index('date').shift() or df.set_index('date').diff() give me the desired result. However, when I try to add it as a new column, I get NaN for all the rows.
How can I fix this command:
df['firstdiff'] = df.set_index('date').shift()
to make it work?
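The all-NaN column comes from index alignment: df.set_index('date').shift() is indexed by date, which no longer matches df's original integer index, so the assignment aligns on index and fills NaN. One way around it, as a sketch (assuming the value column is literally named data), is to diff the column directly so the result keeps df's own index:
df['firstdiff'] = df['data'].diff()
Alternatively, drop the date index from the result before assigning:
df['firstdiff'] = df.set_index('date')['data'].diff().to_numpy()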

Select columns in a pandas DataFrame

I have a pandas dataframe with hundreds of columns of antibiotic names. Each specific antibiotic is coded in the dataframe as ending in E, T, or P to indicate empirical, treatment, or prophylactic regimens.
An example excerpt from the column list is:
['MeropenemP', 'MeropenemE', 'MeropenemT', 'DoripenemP', 'DoripenemE',
 'DoripenemT', 'ImipenemP', 'ImipenemE', 'ImipenemT', 'BiapenemP',
 'BiapenemE', 'BiapenemT', 'PanipenemP', 'PanipenemE',
 'PanipenemT', 'PipTazP', 'PipTazE', 'PipTazT', 'PiperacillinP',
 'PiperacillinE', 'PiperacillinT']
A small sample of data is located here:
Sample antibiotic data
It is simple enough for me to separate out the columns of any one type into a separate dataframe with a regex, e.g. to select all the empirically prescribed antibiotics columns I use:
E_cols = master.filter(axis=1, regex=('[a-z]+E$'))
Each column has a binary value (0,1) for prescription of each antibiotic regimen type per person (row).
Question:
How would I go about summing across the columns (the 1's) for each regimen type and generating a new column in the dataframe for each result, e.g. total_emperical, total_prophylactic, total_treatment?
The reason I want to add to the existing dataframe is that I wish to filter on other values for each regimen type.
Once you've generated the list of columns that match your regex, you can just create the new total columns like so:
df['total_emperical'] = df[E_cols].sum(axis=1)
and repeat for the other totals.
Passing axis=1 to sum will sum row-wise
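Since master.filter(...) returns a DataFrame of the matching columns (rather than a list of names), the filtered result can also be summed directly; a sketch covering all three suffixes, assuming the master dataframe from the question:
for suffix, total in [('E', 'total_emperical'), ('T', 'total_treatment'), ('P', 'total_prophylactic')]:
    master[total] = master.filter(axis=1, regex='[a-z]+' + suffix + '$').sum(axis=1)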
