Using Python 3, I wrote some code for calculating data. The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016-01-01', '2016-12-23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.ffill()
    df = df.bfill()
    df = df / df.iloc[0, :]
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value = np.zeros((2, 2), dtype="float")
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that you are calling a numpy function, which returns a plain numpy array; that is why the existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain numpy array, it carries no column names, so you can either build it as a DataFrame using the existing column names:
p_value=pd.DataFrame(np.zeros((2,2),dtype="float"), columns = df.columns)
or just overwrite the column names directly after calculating the dot product, like so:
df.columns = ['FABL', 'HINOON']
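A minimal sketch of the difference, using made-up normalised prices in place of the CSV data: `DataFrame.dot` keeps the index, and a labelled p_value keeps the column names too.

```python
import numpy as np
import pandas as pd

# Hypothetical normalised prices standing in for the CSV data
dates = pd.date_range('2016-01-01', periods=3)
df = pd.DataFrame([[1.0, 1.0], [1.1, 0.9], [1.2, 1.1]],
                  index=dates, columns=['FABL', 'HINOON'])

# Diagonal weight matrix, labelled so the result keeps column names;
# its index must match df's columns for the dot product to align
p_value = pd.DataFrame(np.diag([0.5, 0.5]),
                       index=df.columns, columns=df.columns)

result = df.dot(p_value)
print(result.index.equals(df.index))  # the date index survives
print(list(result.columns))           # ['FABL', 'HINOON']
```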
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Combined_Key'], axis=1)
for name, values in df.items():
    if '/' in name:
        df.drop([name], axis=1, inplace=True)
df2 = df.set_index(['Lat','Long_'])
print(df2.head())
lat = df2[df2["Lat"]]
print(lat)
long = df2[df2['Long_']]
The code is above. I got the data set from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset, using the US deaths file.
I have attached an image of the output. I do not know what this error means.
Apologies if this is worded ambiguously or incorrectly, or if there is a preexisting answer somewhere.
When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to index and by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False) but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, aka label. Read about label-based indexing here.
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
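A small, self-contained sketch of the idea, with made-up coordinates standing in for the Kaggle file:

```python
import pandas as pd

# Toy data in place of the US-deaths CSV
df = pd.DataFrame({'Lat': [34.0, 41.5],
                   'Long_': [-118.2, -72.7],
                   'Province_State': ['California', 'Connecticut']})

df2 = df.set_index(['Lat', 'Long_'])

# The index levels are no longer columns, but their values
# are still reachable through the MultiIndex:
lat = df2.index.get_level_values('Lat').to_numpy()
long = df2.index.get_level_values('Long_').to_numpy()
print(lat)
print(long)
```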
I am trying to assign values to some rows using pandas dataframe. Is there any function to do this?
For a whole column:
df = df.assign(column=value)
... where column is the name of the column.
For a specific column of a specific row:
df.at[row, column] = value
... where row is the index of the row, and column is the name of the column.
The latter changes the dataframe in place.
There is a good tutorial here.
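A small sketch of both patterns, assuming a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

# Whole column: assign() returns a new DataFrame with the column added
df = df.assign(B=0)

# Single cell: at[] modifies the DataFrame in place
df.at['y', 'B'] = 99

print(df)
#    A   B
# x  1   0
# y  2  99
# z  3   0
```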
Basically, try this:
import pandas as pd
import numpy as np
# Creating a dataframe
# Setting the seed value to re-generate the result.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
# np.random.rand(10, 3) has generated a
# random 2-Dimensional array of shape 10 * 3
# which is then converted to a dataframe
df
You will get something like this:
I have a dataframe with temperature values over the years. What I want to do is put all the rows from the year 2015 into a new dataframe. Currently, the Date column is an object type, with the str format looking like this: YYYY-MM-DD
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("C:\\whatever\\weather.csv")
weather_2015 = df.loc[df.Date == df.Date.str.startswith("2015"), :]
weather_2015.head()
this is what the data looks like in the main data frame
NOTE: if I do something like
weather_2015 = df.loc[df.Date == "2015-02-03", :]
weather_2015.head()
I get what I'd expect, dates only that match 2015-02-03
pd.Series.str.startswith returns a boolean mask; you don't need to compare it to df.Date again. You can index with it directly:
weather_2015 = df[df.Date.str.startswith("2015")]
You don't even need .loc here.
Note that if you want to make changes to this slice, you might prefer a copy, in which case you should call .copy():
weather_2015 = df[df.Date.str.startswith("2015")].copy()
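For example, with a toy Date column standing in for the weather file:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2014-12-31', '2015-02-03', '2015-07-15'],
                   'Temp': [-3.2, 1.5, 24.8]})

mask = df.Date.str.startswith('2015')  # boolean Series, one flag per row
weather_2015 = df[mask].copy()         # copy so later edits don't warn

print(weather_2015['Date'].tolist())   # ['2015-02-03', '2015-07-15']
```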
How do I convert a numpy array into a dataframe column. Let's say I have created an empty dataframe, df, and I loop through code to create 5 numpy arrays. Each iteration of my for loop, I want to convert the numpy array I have created in that iteration into a column in my dataframe. Just to clarify, I do not want to create a new dataframe every iteration of my loop, I only want to add a column to the existing one. The code I have below is sketchy and not syntactically correct, but illustrates my point.
df = pd.DataFrame()
for i in range(5):
    arr = create_numpy_arr(blah)  # creates a numpy array
    df[i] = # convert arr to df column
This is the simplest way:
df['column_name'] = pd.Series(arr)
Since you want to create a column and not an entire DataFrame from your array, you could do
import pandas as pd
import numpy as np
column_series = pd.Series(np.array([0, 1, 2, 3]))
To assign that column to an existing DataFrame:
df = df.assign(column_name=column_series)
The above will add a column named column_name into df.
If, instead, you don't have any DataFrame to assign those values to, you can pass a dict to the constructor to create a named column from your numpy array:
df = pd.DataFrame({ 'column_name': np.array([0, 1, 2, 3]) })
This will work:
import pandas as pd
import numpy as np
df = pd.DataFrame()
for i in range(5):
    arr = np.random.rand(10)
    df[i] = arr
Maybe a simpler way is to use vectorization:
arr = np.random.rand(10, 5)
df = pd.DataFrame(arr)
I have a dataframe that currently looks like this:
import numpy as np
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
print(df)
I would like to transpose it so that all the value fields go into a single Value column, with the date repeated for each row and the original column name recorded in a Desc column. That is, the resulting DataFrame should look like this:
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-10', '2017-03-10',
                            '2017-03-13', '2017-03-13', '2017-03-13',
                            '2017-03-14', '2017-03-14', '2017-03-14',
                            '2017-03-15', '2017-03-15', '2017-03-15'],
            'Value': [35.6, -7.8, 24, 56.7, 56, -31, 41, 56, 53, 41, -3.4, 5],
            'Desc': ['SP', '1M', '3M', 'SP', '1M', '3M',
                     'SP', '1M', '3M', 'SP', '1M', '3M']}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'Value', 'Desc'])
print(df)
Could someone please help how I can flip and transpose my DataFrame this way?
Use pd.melt to transform the DataFrame from wide format to long:
idx = "Series_Date" # identifier variable
pd.melt(df, id_vars=idx, var_name="Desc", value_name="Value").sort_values(idx).reset_index(drop=True)
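Putting it together on the question's sample data (note value_name="Value": melt's default value column is lowercase "value", so we set it explicitly to match the desired output, and reorder the columns at the end):

```python
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])

idx = 'Series_Date'  # identifier variable
long_df = pd.melt(df, id_vars=idx, var_name='Desc', value_name='Value')
long_df = long_df.sort_values(idx).reset_index(drop=True)
long_df = long_df[['Series_Date', 'Value', 'Desc']]  # match the desired column order

print(long_df)  # 12 rows: one per (date, description) pair
```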