I am trying to assign values to some rows of a pandas DataFrame. Is there a function to do this?
For a whole column:
df = df.assign(column=value)
... where column is the name of the column.
For a specific column of a specific row:
df.at[row, column] = value
... where row is the index of the row, and column is the name of the column.
The latter changes the DataFrame in place.
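For example, a minimal sketch of both (the column name 'A' and the row index 2 here are just hypothetical):

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

df = df.assign(A=0.0)  # whole column; assign returns a new DataFrame
df.at[2, 'A'] = 9.99   # single cell; .at modifies df in place
print(df)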
There is a good tutorial here.
Basically, try this:
import pandas as pd
import numpy as np
# Creating a dataframe
# Set the seed so the result is reproducible.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
# np.random.rand(10, 3) has generated a
# random 2-Dimensional array of shape 10 * 3
# which is then converted to a dataframe
df
You will get a 10 x 3 DataFrame of random values (the fixed seed makes the numbers reproducible).
I have a numeric NumPy array that I want to use as a condition/filter on column number 4 of a DataFrame (df) to extract a subset (sale_data_sub). However, I am getting an empty sale_data_sub (just the column names and no rows) as the result of this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a NumPy array.
df is the original DataFrame with 100 columns.
sale_data_sub is the desired sub-DataFrame.
Sorry that I didn't include a working sample.
The issue is that your df DataFrame doesn't have headers assigned.
Try:
#give your dataframe a header:
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
#then proceed to your usual work with df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values #be careful, it's df["4"] not df[4]
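Note that the trailing .values converts the result to a NumPy array; drop it if you want a sub-DataFrame back. As a small self-contained sketch of the fix (a toy 4 x 5 frame standing in for your 100-column df, with made-up filter values):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(20).reshape(4, 5))  # headers default to integers 0..4
sale_condition_arr = np.array([2, 7, 17])

df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
sale_data_sub = df.loc[df["2"].isin(sale_condition_arr)]
print(sale_data_sub)  # rows whose column "2" value is in the array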
Python Pandas rolling aggregate a column of lists
I have a df that has a column of lists.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
I am wondering if there is a way to create a rolling aggregate of the 'single_input_vector' column for a given window. I looked at a related SO answer, but it does not provide a way to include a window. In my case, the desired output column for a window of 3 would be:
Row1: [[24.68, 164.93]]
Row2: [[24.68, 164.93], [24.18, 164.89]]
Row3: [[24.68, 164.93], [24.18, 164.89], [23.99, 164.63]]
Row4: [[24.18, 164.89], [23.99, 164.63], [24.14, 163.92]]
and so on.
I can't think of a more efficient way to do this; it works, but there may be performance constraints on massive data sets.
We are basically using rolling count to create a start:stop set of slicing indices.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
window = 3
# Rolling count gives the number of observations available in each window
df['len'] = df['A'].rolling(window=window).count()
# For each row, slice out up to `window` of the most recent list entries
df['vector_list'] = df.apply(
    lambda x: df['single_input_vector'][max(0, x.name - (window - 1)):int(x.name) + 1].values,
    axis=1)
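To check that the slicing reproduces the desired output above, here is a small self-contained sketch using the four rows from the question (toy is a hypothetical stand-in for the CSV data):

import pandas as pd

toy = pd.DataFrame({'A': [24.68, 24.18, 23.99, 24.14],
                    'B': [164.93, 164.89, 164.63, 163.92]})
toy['single_input_vector'] = toy[['A', 'B']].apply(tuple, axis=1).apply(list)
window = 3
toy['vector_list'] = toy.apply(
    lambda x: toy['single_input_vector'][max(0, x.name - (window - 1)):int(x.name) + 1].values,
    axis=1)
print(toy['vector_list'].tolist())  # matches Row1 through Row4 above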
How do I convert a NumPy array into a DataFrame column? Let's say I have created an empty DataFrame, df, and I loop through code to create 5 NumPy arrays. On each iteration of my for loop, I want to convert the NumPy array I have created in that iteration into a column in my DataFrame. To clarify, I do not want to create a new DataFrame on every iteration of my loop; I only want to add a column to the existing one. The code I have below is sketchy and not syntactically correct, but illustrates my point.
df = pd.DataFrame()
for i in range(5):
    arr = create_numpy_arr(blah)  # creates a numpy array
    df[i] = # convert arr to df column
This is the simplest way:
df['column_name'] = pd.Series(arr)
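A quick sketch of this in context (column_name is a placeholder); note that assigning a Series to an empty DataFrame also establishes its index:

import pandas as pd
import numpy as np

df = pd.DataFrame()
arr = np.random.rand(10)
df['column_name'] = pd.Series(arr)  # the first assignment also sets the 0..9 index
print(df.shape)  # (10, 1)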
Since you want to create a column and not an entire DataFrame from your array, you could do
import pandas as pd
import numpy as np
column_series = pd.Series(np.array([0, 1, 2, 3]))
To assign that column to an existing DataFrame:
df = df.assign(column_name=column_series)
The above will add a column named column_name into df.
If, instead, you don't have any DataFrame to assign those values to, you can pass a dict to the constructor to create a named column from your numpy array:
df = pd.DataFrame({'column_name': np.array([0, 1, 2, 3])})
This will work:
import pandas as pd
import numpy as np
df = pd.DataFrame()
for i in range(5):
    arr = np.random.rand(10)
    df[i] = arr
A simpler way may be to skip the loop entirely and build the whole DataFrame from a 2-D array in one vectorized step:
arr = np.random.rand(10, 5)
df = pd.DataFrame(arr)
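If you want named columns rather than the default integer labels, you can pass names to the constructor (the names here are just placeholders):

arr = np.random.rand(10, 5)
df = pd.DataFrame(arr, columns=['col{}'.format(i) for i in range(5)])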
I have a dataframe that currently looks like this:
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
print(df)
I would like to transpose it so that all the value fields go into a Value column and the date is repeated for each value row. The column name of each value field becomes an entry in a Desc column. That is, the resulting DataFrame should look like this:
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-10', '2017-03-10',
                            '2017-03-13', '2017-03-13', '2017-03-13',
                            '2017-03-14', '2017-03-14', '2017-03-14',
                            '2017-03-15', '2017-03-15', '2017-03-15'],
            'Value': [35.6, -7.8, 24, 56.7, 56, -31, 41, 56, 53, 41, -3.4, 5],
            'Desc': ['SP', '1M', '3M', 'SP', '1M', '3M', 'SP', '1M', '3M', 'SP', '1M', '3M']}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'Value', 'Desc'])
print(df)
Could someone please help how I can flip and transpose my DataFrame this way?
Use pd.melt to transform the DataFrame from wide format to long format:
idx = "Series_Date" # identifier variable
pd.melt(df, id_vars=idx, var_name="Desc").sort_values(idx).reset_index(drop=True)
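By default melt names the value column 'value'; passing value_name="Value" keeps it consistent with the desired output. To also match the desired column order, a short follow-up sketch:

long_df = pd.melt(df, id_vars=idx, var_name="Desc", value_name="Value")
long_df = long_df.sort_values(idx).reset_index(drop=True)
long_df = long_df[['Series_Date', 'Value', 'Desc']]  # match the desired layout
print(long_df)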
Using Python 3, I wrote the following code for calculating data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]  # .ix is deprecated; .iloc does the positional lookup
    return df
symbols = ['FABL', 'HINOON']
df = data(symbols)
print(df)

p_value = np.zeros((2, 2), dtype="float")
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5

print(df.shape[1])
print(p_value.shape[0])

df = np.dot(df, p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index and column headings have vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that you are using a NumPy method, which returns a plain NumPy array; that is why the existing column and index labels have been lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain NumPy array, it carries no column names, so you can either build it as a DataFrame aligned with your existing columns:
p_value = pd.DataFrame(np.zeros((2, 2), dtype="float"), index=df.columns, columns=df.columns)  # df.dot aligns df's columns with p_value's index
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
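A minimal sketch of the difference (toy values; the column names come from the question):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.ones((3, 2)), columns=['FABL', 'HINOON'])
p_value = np.zeros((2, 2))
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5

result = df.dot(p_value)             # DataFrame in, DataFrame out: the index survives
result.columns = ['FABL', 'HINOON']  # restore names lost to the plain numpy operand
print(result)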