Reindex a dataframe using tickers from another dataframe - python

I read CSV files into a dataframe using:
from glob import glob
import pandas as pd
def read_file(f):
    df = pd.read_csv(f)
    df['ticker'] = f.split('.')[0]
    return df

df = pd.concat([read_file(f) for f in glob('*.csv')])
df = df.set_index(['Date', 'ticker'])[['Close']].unstack()
And got the following dataframe:
Close
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
.
.
.
Now I would like to use the 'ticker' labels to reindex another random dataframe created by:
import numpy as np
data = np.random.random((df.shape[1], 100))
df1 = pd.DataFrame(data)
which looks like:
0 1 2 3 4 5 6 \...
0 0.493036 0.114539 0.862388 0.156381 0.030477 0.094902 0.132268
1 0.486184 0.483585 0.090874 0.751288 0.042761 0.150361 0.781567
2 0.318586 0.078662 0.238091 0.963334 0.815566 0.274273 0.320380
3 0.708489 0.354177 0.285239 0.565553 0.212956 0.275228 0.597578
4 0.150210 0.423037 0.785664 0.956781 0.894701 0.707344 0.883821
5 0.005920 0.115123 0.334728 0.874415 0.537229 0.557406 0.338663
6 0.066458 0.189493 0.887536 0.915425 0.513706 0.628737 0.132074
7 0.729326 0.241142 0.574517 0.784602 0.287874 0.402234 0.926567
8 0.284867 0.996575 0.002095 0.325658 0.525330 0.493434 0.701801
9 0.355176 0.365045 0.270155 0.681947 0.153718 0.644909 0.952764
10 0.352828 0.557434 0.919820 0.952302 0.941161 0.246068 0.538714
11 0.465394 0.101752 0.746205 0.897994 0.528437 0.001023 0.979411
I tried
df1 = df1.set_index(df.columns.values)
but it seems my df only has one level of index since the error says
IndexError: Too many levels: Index has only 1 level, not 2
But if I check the index with df.index, it only gives me the Date. Can someone help me solve this problem?

You can get the column labels of a particular level of the MultiIndex in df with MultiIndex.get_level_values, as follows:
df_ticker = df.columns.get_level_values('ticker')
Then, if df1 has the same number of columns, you can copy the extracted labels to df1 with:
df1.columns = df_ticker
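A minimal end-to-end sketch, using a toy df with the same ('Close', ticker) column structure (the names and shapes here are illustrative, not from the question). Since the random frame in the question has one row per ticker, the extracted labels also fit df1.index; the column assignment above works whenever the column count matches:
import numpy as np
import pandas as pd

# Toy stand-in for df: MultiIndex columns like ('Close', 'AAPL')
cols = pd.MultiIndex.from_product([['Close'], ['AAPL', 'AMD', 'BIDU']],
                                  names=[None, 'ticker'])
df = pd.DataFrame(np.random.random((4, 3)), columns=cols)

df_ticker = df.columns.get_level_values('ticker')

# One row per ticker and 100 columns, as in the question,
# so the extracted labels go on the index rather than the columns
df1 = pd.DataFrame(np.random.random((df.shape[1], 100)))
df1.index = df_ticker
print(df1.index)
# Index(['AAPL', 'AMD', 'BIDU'], dtype='object', name='ticker')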

Related

Concatenate arrays into a single table using pandas

I have a .csv file. I group its data by year so that it gives me the maximum, minimum, and mean values:
import pandas as pd
DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by its corresponding year.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01', '2019-02-01', '2020-01-01', '2020-02-01', '2021-01-01'],
                   'PJME_MW': [3, 5, 30, 50, 100]})
#      Datetime  PJME_MW
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
#       PJME_MW
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
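If you also want the single-column shape described in the question, a small follow-up sketch (same toy df) reshapes the aggregates with stack() so each value is identified by (year, statistic):
out = (df.groupby(df.Datetime.dt.year)['PJME_MW']
         .agg(['min', 'max', 'mean'])
         .stack()
         .rename('PJME_MW'))
print(out)
# Datetime
# 2019  min       3.0
#       max       5.0
#       mean      4.0
# 2020  min      30.0
# ...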
The code could be optimized, but here is how it works with minimal changes. Change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame each iteration; reusing one frame would leave
    # every element of `out` pointing at the same (last) data
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: the concatenation form has been changed.
Final edit: I didn't understand the whole problem at first, sorry. You don't need the code after the groupby function.

Pandas: how to adapt one dataframe to another based on the date?

I have two dataframes that collect historical price series for two different stocks. Applying describe(), I noticed that the first stock has 1291 rows while the second has 1275. This difference is due to the fact that the two securities are listed on different stock exchanges and therefore differ on some dates. What I would like to do is keep the two dataframes separate, but make sure that in the first dataframe, all rows whose dates are not present in the second dataframe are deleted, so that the two dataframes match perfectly for the analyses. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are the correct functions). I thank those who will spend some of their time answering my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns

pd.options.display.min_rows = None
pd.options.display.max_rows = None

tickers = ['DISW.MI', 'IXJ', 'NRJ.PA', 'SGOL', 'VDC', 'VGT']
wts = [0.19, 0.18, 0.2, 0.08, 0.09, 0.26]

price_data = web.get_data_yahoo(tickers,
                                start='2016-01-01',
                                end='2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis=1)

benchmark_price = web.get_data_yahoo('ACWE.PA',
                                     start='2016-01-01',
                                     end='2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()

# From here on I get the error
sns.regplot(benchmark_ret.values,
            port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()

(beta, alpha) = stats.linregress(benchmark_ret.values,
                                 port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
From what I understand, you want df1 to also have 5 days of data, with dates matching df2.
df1
df1 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=6),
    'px': np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=5),
    'px': np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
To keep only the rows of df1 whose dates also appear in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
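Since the question mentions merge(), here is an equivalent sketch using an inner join (the suffixes values are my choice, not from the question). Note that this combines the two frames into one rather than keeping them separate, so the isin() approach above is the closer fit for the stated goal:
# Inner join keeps only the dates present in both frames;
# df1's px becomes px_1 and df2's becomes px_2
matched = df1.merge(df2, on='date', how='inner', suffixes=('_1', '_2'))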

Integer YYYYMMDD to DateTime (e.g. 01JAN2021)

I'm having some difficulties converting an integer-type column of a Pandas DataFrame representing dates (in YYYYMMDD format) to a DateTime type column and parsing the result in a specific format (e.g., 01JAN2021). Here's a sample DataFrame to get started:
import pandas as pd
df = pd.DataFrame(data={"CUS_DATE": [19550703, 19631212, 19720319, 19890205, 19900726]})
print(df)
CUS_DATE
0 19550703
1 19631212
2 19720319
3 19890205
4 19900726
Of all the things I have tried so far, the only one that worked was the following:
df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
print(df)
CUS_DATE
0 1955-07-03
1 1963-12-12
2 1972-03-19
3 1989-02-05
4 1990-07-26
But the above is not the result I'm looking for. My desired output should be the following:
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
Any additional help would be appreciated.
Do this:
df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
df["CUS_DATE"] = df["CUS_DATE"].apply(lambda x: x.strftime('%d%b%Y').upper())
print(df)
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
You can use, in addition to pandas.to_datetime, the methods pandas.Series.dt.strftime and pandas.Series.str.upper:
df["CUS_DATE"] = (pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
.dt.strftime('%d%b%Y').str.upper())
# CUS_DATE
#0 03JUL1955
#1 12DEC1963
#2 19MAR1972
#3 05FEB1989
#4 26JUL1990
Also, check this documentation where you can find the datetime format codes.
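One caveat worth adding (my note, not from the original answers): both approaches produce formatted strings, so the column ends up as object dtype rather than datetime64. Keep a separate datetime column if you still need date arithmetic.
# After formatting, the values are strings, not datetimes
print(df["CUS_DATE"].dtype)  # object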

Pandas: concatenate dataframes

I have two dataframes:
category count_sec_target
3D-шутеры 0.09375
Cериалы 201.90625
GPS и ГЛОНАСС 0.015625
Hi-Tech 187.1484375
Абитуриентам 0.8125
Авиакомпании 8.40625
and
category count_sec_random
3D-шутеры 0.369565217
Hi-Tech 70.42391304
АСУ ТП, промэлектроника 0.934782609
Абитуриентам 1.413043478
Авиакомпании 14.93478261
Авто 480.3369565
I need to combine them and get:
category count_sec_target count_sec_random
3D-шутеры 0.09375 0.369565217
Cериалы 201.90625 0
GPS и ГЛОНАСС 0.015625 0
Hi-Tech 187.1484375 70.42391304
Абитуриентам 0.8125 1.413043478
Авиакомпании 8.40625 14.93478261
АСУ ТП, промэлектроника 0 0.934782609
Авто 0 480.3369565
Next I want to divide the values in the columns: (count_sec_target / count_sec_random) * 100%.
But when I try to concatenate the dataframes with
frames = [df1, df2]
df = pd.concat(frames)
I get
category count_sec_random count_sec_target
0 3D-шутеры 0.369565 NaN
1 Hi-Tech 70.423913 NaN
2 АСУ ТП, промэлектроника 0.934783 NaN
3 Абитуриентам 1.413043 NaN
4 Авиакомпании 14.934783 NaN
I also tried df = df1.append(df2), but I get the wrong result.
How can I fix that?
pd.concat(..., axis=1) aligns on the index, so set category as the index first:
df3 = pd.concat([d.set_index('category') for d in frames], axis=1).fillna(0)
df3['ratio'] = df3.count_sec_target / df3.count_sec_random * 100
df3
Setup Reference
import pandas as pd
from io import StringIO  # Python 3; the original answer used Python 2's `from StringIO import StringIO`
t1 = """category;count_sec_target
3D-шутеры;0.09375
Cериалы;201.90625
GPS и ГЛОНАСС;0.015625
Hi-Tech;187.1484375
Абитуриентам;0.8125
Авиакомпании;8.40625"""
t2 = """category;count_sec_random
3D-шутеры;0.369565217
Hi-Tech;70.42391304
АСУ ТП, промэлектроника;0.934782609
Абитуриентам;1.413043478
Авиакомпании;14.93478261
Авто;480.3369565"""
df1 = pd.read_csv(StringIO(t1), sep=';')
df2 = pd.read_csv(StringIO(t2), sep=';')
frames = [df1, df2]
Merge should be appropriate here:
df = df1.merge(df2, on='category', how='outer').fillna(0)
To get the division output, simply do:
df['division'] = df['count_sec_target'].div(df['count_sec_random']) * 100
where df is the merged DataFrame.
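A caveat worth adding (my note): fillna(0) leaves zero denominators for categories present in only one frame, so the division yields inf there; you may want to map those back to NaN:
import numpy as np

# Zero denominators from fillna(0) produce inf; convert them to NaN
df['division'] = df['division'].replace([np.inf, -np.inf], np.nan)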

How to correctly read csv in Pandas while changing the names of the columns

An absolute basic read_csv question.
I have data that looks like the following in a csv file -
Date,Open Price,High Price,Low Price,Close Price,WAP,No.of Shares,No. of Trades,Total Turnover (Rs.),Deliverable Quantity,% Deli. Qty to Traded Qty,Spread High-Low,Spread Close-Open
28-February-2015,2270.00,2310.00,2258.00,2294.85,2279.192067772602217319,73422,8043,167342840.00,11556,15.74,52.00,24.85
27-February-2015,2267.25,2280.85,2258.00,2266.35,2269.239841485775122730,50721,4938,115098114.00,12297,24.24,22.85,-0.90
26-February-2015,2314.90,2314.90,2250.00,2259.50,2277.198324862194860047,69845,8403,159050917.00,22046,31.56,64.90,-55.40
25-February-2015,2290.00,2332.00,2278.35,2318.05,2315.100614216488163214,161995,10174,375034724.00,102972,63.56,53.65,28.05
24-February-2015,2276.05,2295.00,2258.00,2278.15,2281.058946240263344242,52251,7726,119187611.00,13292,25.44,37.00,2.10
23-February-2015,2303.95,2311.00,2253.25,2270.70,2281.912259219760108491,75951,7344,173313518.00,24969,32.88,57.75,-33.25
20-February-2015,2324.00,2335.20,2277.00,2284.30,2301.631421152326354478,79717,10233,183479152.00,23045,28.91,58.20,-39.70
19-February-2015,2304.00,2333.90,2292.00,2326.60,2321.485466301625211160,85835,8847,199264705.00,29728,34.63,41.90,22.60
18-February-2015,2284.00,2305.00,2261.10,2295.75,2282.060986778089405300,69884,6639,159479550.00,26665,38.16,43.90,11.75
16-February-2015,2281.00,2305.85,2266.00,2278.50,2284.961866239581019628,85541,10149,195457923.00,22164,25.91,39.85,-2.50
13-February-2015,2311.00,2324.90,2286.95,2296.40,2311.371235111317676864,109731,5570,253629077.00,69039,62.92,37.95,-14.60
12-February-2015,2280.00,2322.85,2275.00,2315.45,2301.372038211769425569,79766,9095,183571242.00,33981,42.60,47.85,35.45
11-February-2015,2275.00,2295.00,2258.25,2287.20,2279.587966250020639664,60563,7467,138058686.00,20058,33.12,36.75,12.20
10-February-2015,2244.90,2297.40,2225.00,2280.30,2269.562228214830293104,141656,13026,321497107.00,55577,39.23,72.40,35.40
I am trying to read this data in a pandas dataframe using the following variations of read_csv. I am only interested in two columns.
z = pd.read_csv('file.csv', parse_dates=True, index_col="Date",
                usecols=["Date", "Open Price", "Close Price"],
                names=["Date", "O", "C"], header=0)
What I get is
O C
Date
2015-02-28 NaN NaN
2015-02-27 NaN NaN
2015-02-26 NaN NaN
2015-02-25 NaN NaN
2015-02-24 NaN NaN
Or
z = pd.read_csv('file.csv', parse_dates=True, index_col="Date",
                usecols=["Date", "Open", "Close"],
                names=["Date", "Open Price", "Close Price"], header=0)
The result is -
Open Price Close Price
Date
2015-02-28 NaN NaN
2015-02-27 NaN NaN
2015-02-26 NaN NaN
2015-02-25 NaN NaN
Am I missing something fundamental, or is there an issue with read_csv in pandas 0.13.1, my version on Debian Wheezy?
You are right, something is odd with the names attribute. It seems to me that you cannot use both at the same time: either you set names for every column of the CSV file, or you don't set names at all. So it appears you can't set names when you are not taking all the columns (usecols).
names : array-like
List of column names to use. If file contains no header row, then you should explicitly pass header=None
You might already know this, but you can also rename the columns afterwards.
import pandas as pd
from io import StringIO  # Python 3; the original answer used Python 2's StringIO module
csv = r"""Date,Open Price,High Price,Low Price,Close Price,WAP,No.of Shares,No. of Trades,Total Turnover (Rs.),Deliverable Quantity,% Deli. Qty to Traded Qty,Spread High-Low,Spread Close-Open
28-February-2015,2270.00,2310.00,2258.00,2294.85,2279.192067772602217319,73422,8043,167342840.00,11556,15.74,52.00,24.85
27-February-2015,2267.25,2280.85,2258.00,2266.35,2269.239841485775122730,50721,4938,115098114.00,12297,24.24,22.85,-0.90
26-February-2015,2314.90,2314.90,2250.00,2259.50,2277.198324862194860047,69845,8403,159050917.00,22046,31.56,64.90,-55.40
25-February-2015,2290.00,2332.00,2278.35,2318.05,2315.100614216488163214,161995,10174,375034724.00,102972,63.56,53.65,28.05
24-February-2015,2276.05,2295.00,2258.00,2278.15,2281.058946240263344242,52251,7726,119187611.00,13292,25.44,37.00,2.10
23-February-2015,2303.95,2311.00,2253.25,2270.70,2281.912259219760108491,75951,7344,173313518.00,24969,32.88,57.75,-33.25
20-February-2015,2324.00,2335.20,2277.00,2284.30,2301.631421152326354478,79717,10233,183479152.00,23045,28.91,58.20,-39.70
19-February-2015,2304.00,2333.90,2292.00,2326.60,2321.485466301625211160,85835,8847,199264705.00,29728,34.63,41.90,22.60
18-February-2015,2284.00,2305.00,2261.10,2295.75,2282.060986778089405300,69884,6639,159479550.00,26665,38.16,43.90,11.75
16-February-2015,2281.00,2305.85,2266.00,2278.50,2284.961866239581019628,85541,10149,195457923.00,22164,25.91,39.85,-2.50
13-February-2015,2311.00,2324.90,2286.95,2296.40,2311.371235111317676864,109731,5570,253629077.00,69039,62.92,37.95,-14.60
12-February-2015,2280.00,2322.85,2275.00,2315.45,2301.372038211769425569,79766,9095,183571242.00,33981,42.60,47.85,35.45
11-February-2015,2275.00,2295.00,2258.25,2287.20,2279.587966250020639664,60563,7467,138058686.00,20058,33.12,36.75,12.20
10-February-2015,2244.90,2297.40,2225.00,2280.30,2269.562228214830293104,141656,13026,321497107.00,55577,39.23,72.40,35.40"""
df = pd.read_csv(StringIO(csv),
                 usecols=["Date", "Open Price", "Close Price"],
                 header=0)
df.columns = ['Date', 'O', 'C']
df
output:
Date O C
0 28-February-2015 2270.00 2294.85
1 27-February-2015 2267.25 2266.35
2 26-February-2015 2314.90 2259.50
3 25-February-2015 2290.00 2318.05
4 24-February-2015 2276.05 2278.15
5 23-February-2015 2303.95 2270.70
6 20-February-2015 2324.00 2284.30
7 19-February-2015 2304.00 2326.60
8 18-February-2015 2284.00 2295.75
9 16-February-2015 2281.00 2278.50
10 13-February-2015 2311.00 2296.40
11 12-February-2015 2280.00 2315.45
12 11-February-2015 2275.00 2287.20
13 10-February-2015 2244.90 2280.30
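To tie this back to the original goal (a Date index plus columns renamed to O and C), a sketch under the same setup; renaming after read_csv sidesteps the names/usecols clash entirely:
z = (pd.read_csv(StringIO(csv),
                 usecols=["Date", "Open Price", "Close Price"],
                 parse_dates=["Date"], index_col="Date")
       .rename(columns={"Open Price": "O", "Close Price": "C"}))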
According to the documentation, your usecols list should be a subset of the new names list:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s).
Example of csv
"OLD1", "OLD2", "OLD3"
1,2,3
4,5,6
Code for renaming OLDX -> NEWX and using only NEW2 and NEW3:
import pandas as pd
d = pd.read_csv('test.csv', header=0, names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
Output
NEW2 NEW3
0 2 3
1 5 6
NOTE: Even though the above works as expected, there is an issue when switching to engine='python':
d = pd.read_csv('test.csv', header=0, engine='python',
                names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
ValueError: Number of passed names did not match number of header fields in the file
The workaround is to set header=None and skiprows=[0]:
d = pd.read_csv('test.csv', header=None, skiprows=[0], engine='python',
                names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
Output
NEW2 NEW3
0 2 3
1 5 6
Pandas version: 0.23.4
