How do I merge columns of 5 dataframes into one dataframe? - Python

This is my code :
stock_A = pd.DataFrame(data[:5])
stock_B = pd.DataFrame(data[5:11])
stock_C = pd.DataFrame(data[11:16])
stock_D = pd.DataFrame(data[16:21])
stock_E = pd.DataFrame(data[21:26])
Close_price = pd.DataFrame()
Close_price['Stock A'] = stock_A['Close Price']
Close_price['Stock B'] = stock_B['Close Price']
Close_price['Stock C'] = stock_C['Close Price']
Close_price['Stock D'] = stock_D['Close Price']
Close_price['Stock E'] = stock_E['Close Price']
and the output I'm getting is
Stock A Stock B Stock C Stock D Stock E
Date
2017-05-16 955.00 NaN NaN NaN NaN
2017-05-17 952.80 NaN NaN NaN NaN
2017-05-18 961.75 NaN NaN NaN NaN
2017-05-19 957.95 NaN NaN NaN NaN
2017-05-22 961.45 NaN NaN NaN NaN
I don't understand why I am getting 'NaN' values for the rest of the columns.
How do I get the actual values?

You get NaN because each stock DataFrame keeps its own date index, and assigning a Series to a new column aligns on that index, so any dates that don't match come back as NaN. You can use the concat function instead:
close_price = pd.concat([stock_A, stock_B, stock_C, stock_D, stock_E], axis=1)
The axis argument is important, as it determines whether the data is stacked vertically or horizontally: axis=0 appends rows (vertical), while axis=1 places the DataFrames side by side (horizontal).
If the indexes don't line up (or repeat), call reset_index() on each DataFrame first so the rows align by position.
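For example, a minimal sketch (assuming the five stock DataFrames from your question, and that you want the rows matched up by position rather than by date):
frames = [s.reset_index(drop=True)['Close Price']
          for s in (stock_A, stock_B, stock_C, stock_D, stock_E)]
close_price = pd.concat(frames, axis=1)
close_price.columns = ['Stock A', 'Stock B', 'Stock C', 'Stock D', 'Stock E']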

Related

Pandas dataframe only reading first value, NaN for everything else

I am attempting to read a csv with pandas and then insert it into a SQL table. I am reading the data from the csv correctly when I print(data), but once I add it into the dataframe it only reads the very first column and inserts NaN for every other value in the csv. Code and output below:
data = pd.read_csv(localFilePath)
print(data)
df = pd.DataFrame(data, columns=['Date', 'EECode', 'LastName', 'FirstName',
                                 'HomeDepartmentCode', 'HomeDepartmentDesc', 'PayClass', 'InPunchTime',
                                 'OutPunchTime', 'DepartmentCode', 'DepartmentDesc', 'JobCodesCode',
                                 'JobCodesDesc', 'TeamCode', 'TeamDesc', 'EarnCode'])
print(df)
for row in df.itertuples():
    SQLInsert = ('''
        INSERT INTO [Reporting].[dbo].[Paycom_Missing_Punch]
        (Date, EECode, LastName, FirstName, HomeDepartmentCode,
         HomeDepartmentDesc, PayClass, InPunchTime, OutPunchTime,
         DepartmentCode, DepartmentDesc, JobCodesCode, JobCodesDesc,
         TeamCode, TeamDesc, EarnCode)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    ''')
    args = (row.Date, row.EECode, row.LastName, row.FirstName,
            row.HomeDepartmentCode, row.HomeDepartmentDesc, row.PayClass, row.InPunchTime,
            row.OutPunchTime, row.DepartmentCode, row.DepartmentDesc, row.JobCodesCode,
            row.JobCodesDesc, row.TeamCode, row.TeamDesc, row.EarnCode)
    # print(SQLInsert)
    # print(args)
    cursor.execute(SQLInsert, args)
conn.commit()
Output when I print(data):
Date EE Code ... Team Desc Earn Code
0 01/21/2021 1435 ... Indiana DWD NaN
1 01/21/2021 1435 ... Indiana DWD NaN
2 01/22/2021 1180 ... Supervisors NaN
3 01/21/2021 1664 ... Technical Support Desk NaN
4 01/21/2021 1078 ... Supervisors NaN
Output once I add it to the dataframe:
Date EECode LastName ... TeamCode TeamDesc EarnCode
0 01/21/2021 NaN NaN ... NaN NaN NaN
1 01/21/2021 NaN NaN ... NaN NaN NaN
2 01/22/2021 NaN NaN ... NaN NaN NaN
3 01/21/2021 NaN NaN ... NaN NaN NaN
4 01/21/2021 NaN NaN ... NaN NaN NaN
I assume the problem is how I am passing the values to the dataframe, but from everything I have read or seen, the way I am doing it looks correct.
The problem is how you're building df. pd.read_csv already returns a DataFrame, and when you pass that DataFrame to pd.DataFrame(..., columns=...), pandas selects columns by name. The names you list (e.g. 'EECode') don't exist in the CSV headers (which contain spaces, e.g. 'EE Code'), so every unmatched column is filled with NaN. To fix it, keep the DataFrame from read_csv and just rename its columns:
>>> col_names = ['Date', 'EECode', 'LastName', 'FirstName',
...              'HomeDepartmentCode', 'HomeDepartmentDesc', 'PayClass', 'InPunchTime',
...              'OutPunchTime', 'DepartmentCode', 'DepartmentDesc', 'JobCodesCode',
...              'JobCodesDesc', 'TeamCode', 'TeamDesc', 'EarnCode']
>>> df = pd.read_csv(localFilePath)
>>> df.columns = col_names
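Alternatively (a sketch, assuming the first row of your CSV is the header row you want to replace), you can rename the columns at read time by passing both header=0 and names=:
>>> df = pd.read_csv(localFilePath, header=0, names=col_names)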

Bind one row's cell with multiple rows' cells in an Excel sheet with pandas in a Jupyter notebook

I have an excel sheet like this.
If I search using the method below, I get only 1 row.
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
But I want to get all rows connected with this name (and likewise for birthdate and place).
Expected output:
How can I achieve this? How can I bind these cells together?
You need to forward fill the data with ffill():
import numpy as np

df = df.replace('', np.nan)  # in case you don't have null values, but you have empty strings
df['NAME '] = df['NAME '].ffill()
df4 = df.loc[df['NAME '] == 'HIR']
df4
That will then bring up all of the rows when you use loc. You can do this on other columns as well.
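For example, a minimal sketch covering the other merged columns you mentioned (assuming they are named BIRTHDATE and PLACE, as in the sample output in the other answer):
df = df.replace('', np.nan)
df[['NAME ', 'BIRTHDATE', 'PLACE']] = df[['NAME ', 'BIRTHDATE', 'PLACE']].ffill()
df4 = df.loc[df['NAME '] == 'HIR']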
First you need to remove the blank rows in your Excel file, then fill the missing values from the previous row:
import pandas as pd
df = pd.read_excel('so.xlsx')
df = df[~df['HOBBY'].isna()]
df[['SNO','NAME']] = df[['SNO','NAME']].ffill()
df
SNO NAME HOBBY COURSE BIRTHDATE PLACE
0 1.0 HIR DANCING BTECH 1990.0 USA
1 1.0 HIR MUSIC MTECH NaN NaN
2 1.0 HIR TRAVELLING AI NaN NaN
4 2.0 BH GAMES BTECH 1992.0 INDIA
5 2.0 BH BOOKS AI NaN NaN
6 2.0 BH SWIMMING NaN NaN NaN
7 2.0 BH MUSIC NaN NaN NaN
8 2.0 BH DANCING NaN NaN NaN

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression over a 250-day window for each column across the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows using two for-loops isn't very efficient, so I tried this, but I'm getting the following error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y, X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t - regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for, so don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32],
                           '1': [60, 60, 60, 60, 60],
                           '2': [np.nan, np.nan, np.nan, np.nan, np.nan]})

def regress(X, Z):
    # Only fit when the window has no missing values
    if np.isnan(X).any() == False:
        model = sm.OLS(X, Z).fit()
        return model.params[1]  # slope coefficient
    else:
        return np.nan

regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")

df_coef = pd.DataFrame()
for col in df_returns.columns:
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda x: regress(x, Z))
df_coef
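As a side note (a sketch, not tested against your full dataset), rolling().apply() also accepts raw=True, which passes a NumPy array instead of a Series to the function and is usually faster on large frames:
df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda x: regress(x, Z), raw=True)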

Skip `# ` character when reading header with pandas read_csv

I have a file that looks like this:
# Time Cm Cd Cl Cl(f) Cl(r) Cm Cd Cl Cl(f) Cl(r)
1.000000000000e+01 -5.743573465913e-01 -5.860160539688e-01 -1.339511756657e+00 -1.244113224920e+00 -9.539853173733e-02
2.000000000000e+01 6.491397073110e-02 1.320098727949e-02 6.147195262817e-01 3.722737338720e-01 2.424457924098e-01
3.000000000000e+01 3.554043329234e-02 4.296597501519e-01 7.901295853361e-01 4.306052259604e-01 3.595243593757e-01
Is there any way I can tell pandas that Time is the first column name?
I read it this way
dat = pd.read_csv('%sdt.dat'%s, delim_whitespace=True)
Which is somehow telling pandas that the first column is named #:
dat.columns
Index(['#', 'Time', 'Cm', 'Cd', 'Cl', 'Cl(f)', 'Cl(r)', 'Cm.1', 'Cd.1', 'Cl.1', 'Cl(f).1', 'Cl(r).1'],
dtype='object')
How can I tell pandas' read_csv to ignore the first two characters in the header or otherwise get the column names I want from read_csv?
Here is one potential work-around:
headers = pd.read_csv('%sdt.dat'%s, delim_whitespace=True, nrows=0).columns[1:]
dat = pd.read_csv('%sdt.dat'%s, delim_whitespace=True, header=None, skiprows=1, names=headers)
Alternatively, you could fix the columns with some post-processing:
col_mapper = {old:new for old, new in zip(dat.columns, dat.columns[1:])}
dat = dat.iloc[:, :-1].rename(col_mapper, axis=1)
Instead of using any whitespace as a separator, you can specify that there must be at least 2 whitespace characters, since your data appears to be separated by multiple spaces. This will name the first column '# Time', and you can rename it afterwards to remove the '# ' prefix:
df = pd.read_csv('%sdt.dat'%s, sep=r'\s{2,}', engine='python')
print(df)
# Time Cm Cd Cl Cl(f) Cl(r) Cm.1 Cd.1 Cl.1 Cl(f).1 Cl(r).1
0 10.0 -0.574357 -0.586016 -1.339512 -1.244113 -0.095399 NaN NaN NaN NaN NaN
1 20.0 0.064914 0.013201 0.614720 0.372274 0.242446 NaN NaN NaN NaN NaN
2 30.0 0.035540 0.429660 0.790130 0.430605 0.359524 NaN NaN NaN NaN NaN
df.columns = ['Time'] + list(df.columns[1:])
print(df)
Time Cm Cd Cl Cl(f) Cl(r) Cm.1 Cd.1 Cl.1 Cl(f).1 Cl(r).1
0 10.0 -0.574357 -0.586016 -1.339512 -1.244113 -0.095399 NaN NaN NaN NaN NaN
1 20.0 0.064914 0.013201 0.614720 0.372274 0.242446 NaN NaN NaN NaN NaN
2 30.0 0.035540 0.429660 0.790130 0.430605 0.359524 NaN NaN NaN NaN NaN

I'm trying to modify a pandas data frame so that I will have 2 columns: a frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the number of tickets issued on that day). Right now I can achieve that; however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
df1 = df1.drop(unnecessary_cols, 1)
df1 = df1.groupby('date_of_infraction').agg({'date_of_infraction': 'count'})
df1['frequency'] = df1.groupby('date_of_infraction').agg({'date_of_infraction': 'count'})
print(df1)
df1 = df1.iloc[121:274]
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
df1 = df1.drop(unnecessary_cols, 1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
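If you do need those missing dates to appear with a count of 0, one option (a minimal sketch, assuming date_of_infraction is in the YYYYMMDD format shown in your output) is to reindex against the full range of dates:
counts['date_of_infraction'] = pd.to_datetime(counts['date_of_infraction'].astype(str), format='%Y%m%d')
counts = (counts.set_index('date_of_infraction')
                .reindex(pd.date_range('2012-01-01', '2012-12-31'), fill_value=0))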
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary rows.
Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
"foo":np.random.random(4)}
df = pd.DataFrame(data)
df
date_of_infraction foo
0 20120101 0.681286
1 20120101 0.826723
2 20120202 0.669367
3 20120202 0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
.foo.count()
.rename(columns={"foo":"frequency"})
)
date_of_infraction frequency
0 20120101 2
1 20120202 2
