Creating a new column in Pandas - python

Thank you in advance for taking the time to help me! (Code provided below) (Data Here)
I am trying to average the first 3 columns and insert it as a new column labeled 'Topsoil'. What is the best way to go about doing that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
# the station-selection step is missing from the post; assuming the whole frame here
df_selected_station = df_all_stations
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D = df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean = df_selected_station_D.groupby(by='Day').mean()
mean['Day'] = mean.index
# mean.head()

Try this:
mean['avg3col']=mean[['5 cm', '10 cm','15 cm']].mean(axis=1)

df['new column'] = (df['col1'] + df['col2'] + df['col3'])/3
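As a quick sanity check, here is a minimal sketch with toy data (the depth column names are taken from the question); both one-liners above compute the same row-wise average:
import pandas as pd
mean = pd.DataFrame({'5 cm': [1.0, 4.0],
                     '10 cm': [2.0, 5.0],
                     '15 cm': [3.0, 6.0]})
# row-wise mean of the three depth columns
mean['Topsoil'] = mean[['5 cm', '10 cm', '15 cm']].mean(axis=1)
print(mean['Topsoil'].tolist())  # [2.0, 5.0]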

You could use the apply method in the following way:
mean['Topsoil'] = mean.apply(lambda row: np.mean(row[0:3]), axis=1)
You can read about the apply method in the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The logic is that you perform the same task along a specific axis multiple times.
Note: it is unwise to give data structures the same name as a function; in your case it would be better to call it mean_df rather than mean.
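As a small illustration of how axis works in apply (toy data, not the question's file):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
# axis=1 calls the function once per row; axis=0 would call it once per column
row_means = df.apply(lambda row: np.mean(row[0:3]), axis=1)
print(row_means.tolist())  # [3.0, 4.0]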

Use DataFrame.iloc to select the first 3 columns by position, then take the mean:
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)

Related

Is there any function similar to idxmax() in py-polars groupby?

import polars as pl
import pandas as pd
A = ['a','a','a','a','a','a','a','b','b','b','b','b','b','b']
B = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
df = pl.DataFrame({'cola': A,
                   'colb': B})
df_pd = df.to_pandas()
index = df_pd.groupby('cola')['colb'].idxmax()
df_pd.loc[index,'top'] = 1
In pandas I can get the 'top' column using idxmax().
However, in Polars I used arg_max():
index = df[pl.col('colb').arg_max().over('cola').flatten()]
but this does not produce what I want. Is there any way to generate a 'top' column in Polars? Thanks a lot!
In Polars, window functions (the .over()) will do an aggregation + self-join (see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.over.html?highlight=over#polars.Expr.over), which means you cannot return a unique value per row, which is what you are after.
A way to compute the top column is to use apply:
df.groupby("cola").apply(lambda x: x.with_columns([pl.col("colb"), (pl.col("colb")==pl.col("colb").max()).alias("top")]))

Compare every 2 rows using pandas and show the difference

I have a problem with our code comparing rows across two sheets of an Excel file.
We have code to compare every row:
import pandas as pd
import numpy as np
old_df = pd.read_excel('Test.xlsx', sheet_name="Best Practice Config", names=["A"], header=None)
new_df = pd.read_excel('Test.xlsx', sheet_name="Existing Config", names=["B"], header=None)
compare = old_df[~old_df["A"].isin(new_df["B"])]
But I need to compare 2 rows at a time. Please advise the best way to do that in pandas.
Try the pandas.DataFrame.compare method. The documentation is available at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.
old_df.compare(new_df)
I hope it will be useful for you.
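A minimal sketch of DataFrame.compare with toy single-column frames (the config-style values are made up; note that compare requires two identically labeled DataFrames, so the column names must match first):
import pandas as pd
old_df = pd.DataFrame({'A': ['hostname R1', 'ip 10.0.0.1', 'mtu 1500']})
new_df = pd.DataFrame({'B': ['hostname R1', 'ip 10.0.0.2', 'mtu 1500']})
# align the labels, then compare; only differing rows are shown ('self' vs 'other')
diff = old_df.compare(new_df.rename(columns={'B': 'A'}))
print(diff)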

Python, pandas: print values with frequency 1-1000 from a CSV

I have the following code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import csv
data1=pd.read_csv('11-01 412-605.csv', low_memory=False)
d412=pd.DataFrame(data1, columns=['size', 'price', 'date'])
new_df = pd.value_counts(d412['size']).reset_index()
new_df.columns = ['size', 'frequency']
print (new_df)
export_csv = new_df.to_csv('empty.csv', index=None, header=True)
Which outputs the full table of sizes and frequencies (screenshot omitted).
However, I only want to print the values that have a count of 1-1000. How do I do that? Right now it prints out all the values.
I tried:
new_df = pd.value_counts(d412['size']<1000).reset_index()
But that does not work, as it counts the True/False results of the condition instead of filtering the rows.
Try:
print(new_df.loc[new_df['frequency'] < 1000, :])
If I misunderstood which column holds the count, substitute 'frequency' with 'size'.
Welcome to Stack Overflow!
As per the reference for Series.value_counts, value_counts() does not support filtering the values. You can filter the data with DataFrame.loc in a later step, as others have mentioned, so the following code will work:
new_df = pd.value_counts(d412['size']).reset_index()
new_df.columns = ['size', 'frequency']
print(new_df.loc[new_df['frequency'] <= 1000])
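As a side note, the top-level pd.value_counts function is deprecated in recent pandas versions (an assumption about which version you run); the Series method form does the same job:
import pandas as pd
d412 = pd.DataFrame({'size': [10, 10, 20, 30, 30, 30]})  # toy stand-in for your CSV
new_df = d412['size'].value_counts().reset_index()
new_df.columns = ['size', 'frequency']
print(new_df.loc[new_df['frequency'] <= 1000])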

Pandas dropping columns and rows from a dataframe that came from Excel

I am trying to drop some useless columns in a dataframe but I am getting the error: "too many indices for array"
Here is my code :
import pandas as pd

def answer_one():
    energy = pd.read_excel("Energy Indicators.xls")
    energy.drop(energy.index[0,1], axis=1)  # this line raises "too many indices for array"

answer_one()
Option 1
Your syntax is wrong: the slicing should be on the columns, not the index, and it needs a list of positions. Note also that drop returns a new DataFrame unless you assign the result back (or pass inplace=True).
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(energy.columns[[0,1]], axis=1)
Option 2
I'd do it like this:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.iloc[:, 2:]
I think it's better to skip the unneeded columns when reading the Excel file:
energy = pd.read_excel("Energy Indicators.xls", usecols='C:ZZ')
If you're trying to drop the columns, you need to change the syntax. You can refer to them by header or by index. Here is how you would refer to them by name:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(['first_column', 'second_column'], axis=1, inplace=True)
Another solution would be to exclude them at read time. Note that usecols does not accept slice syntax like [2:], but it does accept a callable:
energy = pd.read_excel("Energy Indicators.xls", usecols=lambda name: name not in ('first_column', 'second_column'))
This will help speed up the import as well.
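To see the positional and name-based variants side by side, a small sketch with a toy frame (column names are made up):
import pandas as pd
energy = pd.DataFrame({'junk1': [0], 'junk2': [0],
                       'country': ['Chile'], 'supply': [123]})
by_position = energy.iloc[:, 2:]                    # keep everything from the 3rd column on
by_name = energy.drop(['junk1', 'junk2'], axis=1)   # drop returns a new frame
assert by_position.equals(by_name)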

Data Frame Indexing

Using Python 3 I wrote code for calculating data. The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]  # normalize by the first row (the original used the deprecated .ix indexer)
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value=(np.zeros((2,2),dtype="float"))
p_value[0,0]=0.5
p_value[1,1]=0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that you are using a NumPy method; np.dot returns a plain NumPy array, so any existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain NumPy array, there are no column names, so you can either create a DataFrame using the existing column names:
p_value=pd.DataFrame(np.zeros((2,2),dtype="float"), columns = df.columns)
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
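A self-contained sketch of the difference, using made-up numbers in place of the real CSVs:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                  index=pd.date_range('2016-01-01', periods=2),
                  columns=['FABL', 'HINOON'])
p_value = np.zeros((2, 2))
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5
result = df.dot(p_value)              # DataFrame.dot keeps the DatetimeIndex
result.columns = ['FABL', 'HINOON']   # column names must be restored by hand
print(result)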
