Pandas: concatenate dataframes - python

I have 2 dataframe
category count_sec_target
3D-шутеры 0.09375
Cериалы 201.90625
GPS и ГЛОНАСС 0.015625
Hi-Tech 187.1484375
Абитуриентам 0.8125
Авиакомпании 8.40625
and
category count_sec_random
3D-шутеры 0.369565217
Hi-Tech 70.42391304
АСУ ТП, промэлектроника 0.934782609
Абитуриентам 1.413043478
Авиакомпании 14.93478261
Авто 480.3369565
I need to concatenate that And get
category count_sec_target count_sec_random
3D-шутеры 0.09375 0.369565217
Cериалы 201.90625 0
GPS и ГЛОНАСС 0.015625 0
Hi-Tech 187.1484375 70.42391304
Абитуриентам 0.8125 1.413043478
Авиакомпании 8.40625 14.93478261
АСУ ТП, промэлектроника 0 0.934782609
Авто 0 480.3369565
And next I want to divide values in col (count_sec_target / count_sec_random) * 100%
But when I try to concatenate df
frames = [df1, df1]
df = pd.concat(frames)
I get
category count_sec_random count_sec_target
0 3D-шутеры 0.369565 NaN
1 Hi-Tech 70.423913 NaN
2 АСУ ТП, промэлектроника 0.934783 NaN
3 Абитуриентам 1.413043 NaN
4 Авиакомпании 14.934783 NaN
Also I try df = df1.append(df2)
BUt I get wrong result.
How can I fix that?

df3 = pd.concat([d.set_index('category') for d in frames], axis=1).fillna(0)
df3['ratio'] = df3.count_sec_random / df3.count_sec_target
df3
Setup Reference
import pandas as pd
from StringIO import StringIO
t1 = """category;count_sec_target
3D-шутеры;0.09375
Cериалы;201.90625
GPS и ГЛОНАСС;0.015625
Hi-Tech;187.1484375
Абитуриентам;0.8125
Авиакомпании;8.40625"""
t2 = """category;count_sec_random
3D-шутеры;0.369565217
Hi-Tech;70.42391304
АСУ ТП, промэлектроника;0.934782609
Абитуриентам;1.413043478
Авиакомпании;14.93478261
Авто;480.3369565"""
df1 = pd.read_csv(StringIO(t1), sep=';')
df2 = pd.read_csv(StringIO(t2), sep=';')
frames = [df1, df2]

Merge should be appropriate here:
df_1.merge(df_2, on='category', how='outer').fillna(0)
To get the division output, simply do:
df['division'] = df['count_sec_target'].div(df['count_sec_random']) * 100
where: df is the merged DF

Related

Grouper() and agg() functions produce multiple copies when squashed

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID':['A', 'A', 'A', 'B','B','B'],
'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26', '2021-09-01
00:12:29','2021-09-01 11:20:58','2021-09-02 09:20:58'],
'Name':['xx','xx',NaN,'yy',NaN,NaN],
'Height':[174,174,NaN,160,NaN,NaN],
'Weight':[74,NaN,NaN,58,NaN,NaN],
'Gender':[NaN,'Male',NaN,NaN,'Female',NaN],
'Interests':[NaN,NaN,'Hiking,Sports',NaN,NaN,'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. I have written a code for it. Here is my TRY code:
TRY:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where if there are multiple entries of same value, the final result has multiple entries in the same row as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first column should not have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.
In your case replace agg join to first
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing

Reindex a dataframe using tickers from another dataframe

I read csv files into dataframe using
from glob import glob
import pandas as pd
def read_file(f):
df = pd.read_csv(f)
df['ticker'] = f.split('.')[0]
return df
df = pd.concat([read_file(f) for f in glob('*.csv')])
df = df.set_index(['Date','ticker'])[['Close']].unstack()
And got the following dataframe:
Close
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
.
.
.
now I would like to use the 'ticker' to reindex another random dataframe created by
data = np.random.random((df.shape[1], 100))
df1 = pd.DataFrame(data)
which looks like:
0 1 2 3 4 5 6 \...
0 0.493036 0.114539 0.862388 0.156381 0.030477 0.094902 0.132268
1 0.486184 0.483585 0.090874 0.751288 0.042761 0.150361 0.781567
2 0.318586 0.078662 0.238091 0.963334 0.815566 0.274273 0.320380
3 0.708489 0.354177 0.285239 0.565553 0.212956 0.275228 0.597578
4 0.150210 0.423037 0.785664 0.956781 0.894701 0.707344 0.883821
5 0.005920 0.115123 0.334728 0.874415 0.537229 0.557406 0.338663
6 0.066458 0.189493 0.887536 0.915425 0.513706 0.628737 0.132074
7 0.729326 0.241142 0.574517 0.784602 0.287874 0.402234 0.926567
8 0.284867 0.996575 0.002095 0.325658 0.525330 0.493434 0.701801
9 0.355176 0.365045 0.270155 0.681947 0.153718 0.644909 0.952764
10 0.352828 0.557434 0.919820 0.952302 0.941161 0.246068 0.538714
11 0.465394 0.101752 0.746205 0.897994 0.528437 0.001023 0.979411
I tried
df1 = df1.set_index(df.columns.values)
but it seems my df only has one level of index since the error says
IndexError: Too many levels: Index has only 1 level, not 2
But if I check the index by df.index it gives me the Date, can someone help me solve this problem?
You can get the column labels of a particular level of the MultiIndex in df by MultiIndex.get_level_values, as follows:
df_ticker = df.columns.get_level_values('ticker')
Then, if df1 has the same number of columns, you can copy the labels extracted to df1 by:
df1.columns = df_ticker

compare dfs with nearest Lon,Lat (Python, Pandas)

I have a large df1 with columns(Lon,Lat,V1,V2,V3) and a large df2(V4,V5,Lat,Lon,V6). dfs coordinates are not exact match. df2 have different row numbers. I want to:
1)Find the nearest df2(Lon,Lat) to df1(Lon,Lat) based on (abs(df1.Lon-df2.Lon<=0.11))&(abs(df1.Lat-df2.Lat<=0.11))
2)Create new df3 with columns (df1.Lon,df1.Lat, df1.V1,df2.V6).
df1:
Lon,Lat,V1,V2,V3
-94.9324,34.9099,5.0,66.9,46.6
-103.524,34.457,6.0,186.7,3.8
-92.5145,38.7823,4.0,188.7,273.5
-92.5143,37.3182,2.0,78.8,218.4
-92.5142,36.6965,5.0,98.5,27.7
-89.2187,36.4448,7.3,79.8,35.8
df2:
V4,V5,Lat,Lon,V6
20190329,10,35.0,-94.9,105.9
20180329,11,34.5,-103.5,305.9
20170329,15,38.7,-92.5,206.0
20160329,14,36.5,-89.22,402.1
20150329,13,36.7,-92.6,316.1
20140329,05,37.4,-92.5,290.0
20130329,05,33.8,-89.2,250.0
df3:
Lon,Lat,V1,V6
-94.9324,34.9099,5.0,105.9
-103.524,34.457,6.0,305.9
-92.5145,38.7823,4.0,206.0
-92.5143,37.3182,2.0,290.0
-92.5142,36.6965,5.0,316.1
-89.2187,36.4448,7.3,402.1
Different codes not working:
df3 = df1.loc[~((abs(df2.Lat - df1.Lat) <= 0.11) & (abs(df2.Lon - df1.Lon) <= 0.11))]
df3 = df1.where((abs(df1[df1.Lon] - df2[df2.Lon]) <=0.11) & (abs(df1[df1.Lat] -df2[df2.Lat]) <=0.11))
df3 = pd.merge(df1, df2, on=[(abs(df1.Lon-df2.Lon)<=0.11), (abs(df1.Lat-df2.Lat)<=0.11)], how='inner')
It is possible, but with cross join, so if large DataFrames, need much memory:
df = pd.merge(df1.assign(A=1), df2.assign(A=1), on='A', how='outer', suffixes=('','_'))
cols = ['Lon','Lat','V1','V6']
df3 = df[(((df.Lat_ - df.Lat) <= 0.11).abs() & ((df.Lon_ - df.Lon).abs() <= 0.11))]
df3 = df3.drop_duplicates(subset=df1.columns)[cols]
print (df3)
Lon Lat V1 V6
0 -94.9324 34.9099 5.0 105.9
8 -103.5240 34.4570 6.0 305.9
16 -92.5145 38.7823 4.0 206.0
25 -92.5143 37.3182 2.0 316.1
32 -92.5142 36.6965 5.0 316.1
38 -89.2187 36.4448 7.3 402.1

How to add a DataFrame in one column of a DataFrame

I am creating a dataframe to store informations on samples. Some of my columns label have the format index:subindex. Is there a better way of doing that? I was looking at pd.MultiIndex but my subindices are specific to the index.
import pandas as pd
df = pd.DataFrame(
np.random.random(size=(1234, 6)),
columns=['ID',
'Charge:pH2', 'Charge:pH4', 'Charge:pH6',
'Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
I would like to be able to call df.loc[:, 'ID'] or df.loc[:, 'Charge'] or df.loc[:, ('Charge', 'pH6')]
You could use MultiIndex.from_tuple:
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.random.random(size=(1234, 6)),
columns=['ID','Charge:pH2', 'Charge:pH4', 'Charge:pH6','Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
df.columns = pd.MultiIndex.from_tuples(map(tuple, df.columns.str.split(':')))
print(df.head(10))
Output
ID Charge ... Extinction
NaN pH2 ... Wavelength200nm Wavelength500nm
0 0.301592 0.137384 ... 0.074137 0.339948
1 0.737711 0.557524 ... 0.813727 0.586845
2 0.615398 0.529687 ... 0.148700 0.466916
3 0.411509 0.725513 ... 0.380019 0.876992
4 0.031172 0.623944 ... 0.311610 0.488207
5 0.022140 0.450630 ... 0.422927 0.479094
6 0.119681 0.221624 ... 0.710848 0.719201
7 0.252039 0.632321 ... 0.453235 0.952687
8 0.379501 0.356493 ... 0.141977 0.028836
9 0.249950 0.316020 ... 0.307337 0.881437
[10 rows x 6 columns]
All the required indexing schemes work:
print(df.loc[:, 'ID'].shape)
print(df.loc[:, 'Charge'].shape)
print(df.loc[:, ('Charge', 'pH6')].shape)
Output
(1234, 1)
(1234, 3)
(1234,)
I think the best is create index or Multiindex with not columns possible split (with no splitter) and then create MultiIndex by split with expand=True:
np.random.seed(2019)
df = pd.DataFrame(
np.random.random(size=(3, 6)),
columns=['ID',
'Charge:pH2', 'Charge:pH4', 'Charge:pH6',
'Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
df = df.set_index('ID')
df.columns = df.columns.str.split(':', expand=True)
print (df)
Charge Extinction
pH2 pH4 pH6 Wavelength200nm Wavelength500nm
ID
0.903482 0.393081 0.623970 0.637877 0.880499 0.299172
0.702198 0.903206 0.881382 0.405750 0.452447 0.267070
0.162865 0.889215 0.148476 0.984723 0.032361 0.515351
Solution with not set ID in index is possible, but get NaN for second level for not splitted columns names:
df.columns = df.columns.str.split(':', expand=True)
print (df)
ID Charge Extinction
NaN pH2 pH4 pH6 Wavelength200nm Wavelength500nm
0 0.903482 0.393081 0.623970 0.637877 0.880499 0.299172
1 0.702198 0.903206 0.881382 0.405750 0.452447 0.267070
2 0.162865 0.889215 0.148476 0.984723 0.032361 0.515351
Last select by columns names, also is possible use DataFrame.xs if want select by second level:
print (df['Charge'])
pH2 pH4 pH6
ID
0.903482 0.393081 0.623970 0.637877
0.702198 0.903206 0.881382 0.405750
0.162865 0.889215 0.148476 0.984723
print (df.xs('Charge', axis=1, level=0))
pH2 pH4 pH6
ID
0.903482 0.393081 0.623970 0.637877
0.702198 0.903206 0.881382 0.405750
0.162865 0.889215 0.148476 0.984723
print (df.xs('pH4', axis=1, level=1))
Charge
ID
0.903482 0.623970
0.702198 0.881382
0.162865 0.148476

How to correctly read csv in Pandas while changing the names of the columns

An absolute basic read_csv question.
I have data that looks like the following in a csv file -
Date,Open Price,High Price,Low Price,Close Price,WAP,No.of Shares,No. of Trades,Total Turnover (Rs.),Deliverable Quantity,% Deli. Qty to Traded Qty,Spread High-Low,Spread Close-Open
28-February-2015,2270.00,2310.00,2258.00,2294.85,2279.192067772602217319,73422,8043,167342840.00,11556,15.74,52.00,24.85
27-February-2015,2267.25,2280.85,2258.00,2266.35,2269.239841485775122730,50721,4938,115098114.00,12297,24.24,22.85,-0.90
26-February-2015,2314.90,2314.90,2250.00,2259.50,2277.198324862194860047,69845,8403,159050917.00,22046,31.56,64.90,-55.40
25-February-2015,2290.00,2332.00,2278.35,2318.05,2315.100614216488163214,161995,10174,375034724.00,102972,63.56,53.65,28.05
24-February-2015,2276.05,2295.00,2258.00,2278.15,2281.058946240263344242,52251,7726,119187611.00,13292,25.44,37.00,2.10
23-February-2015,2303.95,2311.00,2253.25,2270.70,2281.912259219760108491,75951,7344,173313518.00,24969,32.88,57.75,-33.25
20-February-2015,2324.00,2335.20,2277.00,2284.30,2301.631421152326354478,79717,10233,183479152.00,23045,28.91,58.20,-39.70
19-February-2015,2304.00,2333.90,2292.00,2326.60,2321.485466301625211160,85835,8847,199264705.00,29728,34.63,41.90,22.60
18-February-2015,2284.00,2305.00,2261.10,2295.75,2282.060986778089405300,69884,6639,159479550.00,26665,38.16,43.90,11.75
16-February-2015,2281.00,2305.85,2266.00,2278.50,2284.961866239581019628,85541,10149,195457923.00,22164,25.91,39.85,-2.50
13-February-2015,2311.00,2324.90,2286.95,2296.40,2311.371235111317676864,109731,5570,253629077.00,69039,62.92,37.95,-14.60
12-February-2015,2280.00,2322.85,2275.00,2315.45,2301.372038211769425569,79766,9095,183571242.00,33981,42.60,47.85,35.45
11-February-2015,2275.00,2295.00,2258.25,2287.20,2279.587966250020639664,60563,7467,138058686.00,20058,33.12,36.75,12.20
10-February-2015,2244.90,2297.40,2225.00,2280.30,2269.562228214830293104,141656,13026,321497107.00,55577,39.23,72.40,35.40
--
I am trying to read this data in a pandas dataframe using the following variations of read_csv. I am only interested in two columns.
z = pd.read_csv('file.csv', parse_dates=True, index_col="Date", usecols=["Date", "Open Price", "Close Price"], names=["Date", "O", "C"], header=0)
What I get is
O C
Date
2015-02-28 NaN NaN
2015-02-27 NaN NaN
2015-02-26 NaN NaN
2015-02-25 NaN NaN
2015-02-24 NaN NaN
Or
z = pd.read_csv('file.csv', parse_dates=True, index_col="Date", usecols=["Date", "Open", "Close"], names=["Date", "Open Price", "Close Price"], header=0)
The result is -
Open Price Close Price
Date
2015-02-28 NaN NaN
2015-02-27 NaN NaN
2015-02-26 NaN NaN
2015-02-25 NaN NaN
Am I missing something fundamental or is there an issue with read_csv of pandas 0.13.1 - my version on Debian Wheezy?
You are right, something is odd with the name attributes. Seems to me that you can not use both in the same time. Either you set the name for every columns of the CSV file or you don't set the name at all. So it seems that you can't set the name when you are not taking all the colums (usecols)
names : array-like
List of column names to use. If file contains no header row, then you should explicitly pass header=None
You might already know it but you can rename the colums after also.
import pandas as pd
from StringIO import StringIO
csv = r"""Date,Open Price,High Price,Low Price,Close Price,WAP,No.of Shares,No. of Trades,Total Turnover (Rs.),Deliverable Quantity,% Deli. Qty to Traded Qty,Spread High-Low,Spread Close-Open
28-February-2015,2270.00,2310.00,2258.00,2294.85,2279.192067772602217319,73422,8043,167342840.00,11556,15.74,52.00,24.85
27-February-2015,2267.25,2280.85,2258.00,2266.35,2269.239841485775122730,50721,4938,115098114.00,12297,24.24,22.85,-0.90
26-February-2015,2314.90,2314.90,2250.00,2259.50,2277.198324862194860047,69845,8403,159050917.00,22046,31.56,64.90,-55.40
25-February-2015,2290.00,2332.00,2278.35,2318.05,2315.100614216488163214,161995,10174,375034724.00,102972,63.56,53.65,28.05
24-February-2015,2276.05,2295.00,2258.00,2278.15,2281.058946240263344242,52251,7726,119187611.00,13292,25.44,37.00,2.10
23-February-2015,2303.95,2311.00,2253.25,2270.70,2281.912259219760108491,75951,7344,173313518.00,24969,32.88,57.75,-33.25
20-February-2015,2324.00,2335.20,2277.00,2284.30,2301.631421152326354478,79717,10233,183479152.00,23045,28.91,58.20,-39.70
19-February-2015,2304.00,2333.90,2292.00,2326.60,2321.485466301625211160,85835,8847,199264705.00,29728,34.63,41.90,22.60
18-February-2015,2284.00,2305.00,2261.10,2295.75,2282.060986778089405300,69884,6639,159479550.00,26665,38.16,43.90,11.75
16-February-2015,2281.00,2305.85,2266.00,2278.50,2284.961866239581019628,85541,10149,195457923.00,22164,25.91,39.85,-2.50
13-February-2015,2311.00,2324.90,2286.95,2296.40,2311.371235111317676864,109731,5570,253629077.00,69039,62.92,37.95,-14.60
12-February-2015,2280.00,2322.85,2275.00,2315.45,2301.372038211769425569,79766,9095,183571242.00,33981,42.60,47.85,35.45
11-February-2015,2275.00,2295.00,2258.25,2287.20,2279.587966250020639664,60563,7467,138058686.00,20058,33.12,36.75,12.20
10-February-2015,2244.90,2297.40,2225.00,2280.30,2269.562228214830293104,141656,13026,321497107.00,55577,39.23,72.40,35.40"""
df = pd.read_csv(StringIO(csv),
usecols=["Date", "Open Price", "Close Price"],
header=0)
df.columns = ['Date', 'O', 'C']
df
output:
Date O C
0 28-February-2015 2270.00 2294.85
1 27-February-2015 2267.25 2266.35
2 26-February-2015 2314.90 2259.50
3 25-February-2015 2290.00 2318.05
4 24-February-2015 2276.05 2278.15
5 23-February-2015 2303.95 2270.70
6 20-February-2015 2324.00 2284.30
7 19-February-2015 2304.00 2326.60
8 18-February-2015 2284.00 2295.75
9 16-February-2015 2281.00 2278.50
10 13-February-2015 2311.00 2296.40
11 12-February-2015 2280.00 2315.45
12 11-February-2015 2275.00 2287.20
13 10-February-2015 2244.90 2280.30
According to documentation your usecols list should be subset of new names list
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s).
Example of csv
"OLD1", "OLD2", "OLD3"
1,2,3
4,5,6
Code for renaming OLDX -> NEWX and using only NEW2 + NEW3
import pandas as pd
d = pd.read_csv('test.csv', header=0, names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
Output
NEW2 NEW3
0 2 3
1 5 6
NOTE: Even if above is working as expected there is an issue while changing engine='python'
d = pd.read_csv('test.csv', header=0, engine='python',
names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
ValueError: Number of passed names did not match number of header fields in the file
Workaround is set header=None and skiprows=[0,]:
d = pd.read_csv('test.csv', header=None, skiprows=[0,], engine='python', names=['NEW1', 'NEW2', 'NEW3'], usecols=['NEW2', 'NEW3'])
Output
NEW2 NEW3
0 2 3
1 5 6
Pandas version: 0.23.4

Categories