Can anyone explain, with a good example, the difference between header and skiprows in the syntax of
pd.read_excel("name", header=number, skiprows=number)
You can follow this article, which explains the difference between the parameters header and skiprows with examples from the olympic dataset, which can be downloaded here.
To summarize: by default, pd.read_csv() reads in all of the rows, which in the case of this dataset includes an unnecessary first row of column numbers.
import pandas as pd
df = pd.read_csv('olympics.csv')
df.head()
0 1 2 3 4 ... 11 12 13 14 15
0 NaN № Summer 01 ! 02 ! 03 ! ... № Games 01 ! 02 ! 03 ! Combined total
1 Afghanistan (AFG) 13 0 0 2 ... 13 0 0 2 2
2 Algeria (ALG) 12 5 2 8 ... 15 5 2 8 15
3 Argentina (ARG) 23 18 24 28 ... 41 18 24 28 70
4 Armenia (ARM) 5 1 2 9 ... 11 1 2 9 12
However, the skiprows parameter lets you skip one or more rows when you read in the .csv file:
df1 = pd.read_csv('olympics.csv', skiprows = 1)
df1.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
And if you want to skip a bunch of different rows, you can do the following (notice the missing countries):
df2 = pd.read_csv('olympics.csv', skiprows = [0, 2, 3])
df2.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Argentina (ARG) 23 18 24 ... 18 24 28 70
1 Armenia (ARM) 5 1 2 ... 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 ... 144 155 181 480
4 Austria (AUT) 26 18 33 ... 77 111 116 304
The header parameter tells pandas which row holds the column names; the data is read from the row after it, and everything above it is ignored. In the following case, this does the same thing as skiprows = 1:
# this gives the same result as df1 = pd.read_csv('olympics.csv', skiprows = 1)
df4 = pd.read_csv('olympics.csv', header = 1)
df4.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
However, you cannot use the header parameter to skip several non-adjacent rows, so you would not be able to replicate df2 using the header parameter. Hopefully this clears things up.
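The comparison above can be reproduced without downloading anything. This is a minimal sketch that assumes a made-up six-line CSV held in memory (the column names and numbers are invented for illustration and are not the olympics data):

```python
import io
import pandas as pd

# Hypothetical miniature of the layout described above: a junk first line,
# then the real header line, then data rows.
raw = (
    "0,1,2\n"
    "country,gold,silver\n"
    "Afghanistan,0,0\n"
    "Algeria,5,2\n"
    "Argentina,18,24\n"
    "Armenia,1,2\n"
)

# skiprows=1 discards line 0; the next line becomes the header.
by_skip = pd.read_csv(io.StringIO(raw), skiprows=1)

# header=1 names line 1 as the header; the line above it is ignored.
by_header = pd.read_csv(io.StringIO(raw), header=1)

# skiprows also accepts a list of 0-based line numbers,
# so non-adjacent lines can be dropped.
sparse = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

print(by_skip.equals(by_header))
print(sparse["country"].tolist())
```

Here the first two frames come out identical, while the third keeps the header but loses the first two countries, which is the part header alone cannot do.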
I have one dataframe like this,
tabla_aciertos= {'Numeros_acertados' : [5,5,5,4,4,3,4,2,3,3,1,2,2],'Estrellas_acertadas': [2,1,0,2,1,2,0,2,1,0,2,1,0]}
categorias = [1,2,3,4,5,6,7,8,9,10,11,12,13]
categoria_de_premios = pd.DataFrame(tabla_aciertos, index=categorias)
categoria_de_premios
Numeros_acertados Estrellas_acertadas
1 5 2
2 5 1
3 5 0
4 4 2
5 4 1
6 3 2
7 4 0
8 2 2
9 3 1
10 3 0
11 1 2
12 2 1
13 2 0
and another df :
sorteos_anteriores.iloc[:,:]
uno dos tres cuatro cinco Estrella1 Estrella2 bolas_Acertadas estrellas_Acertadas
Fecha
2020-10-13 5 14 38 41 46 1 10 0 1
2020-09-10 11 15 35 41 50 5 8 1 0
2020-06-10 4 21 36 41 47 9 11 0 0
2020-02-10 6 12 15 40 45 3 9 0 0
2020-09-29 4 14 16 41 44 11 12 0 1
... ... ... ... ... ... ... ... ... ...
2004-12-03 15 24 28 44 47 4 5 0 0
2004-05-03 4 7 33 37 39 1 5 0 1
2004-02-27 14 18 19 31 37 4 5 0 0
2004-02-20 7 13 39 47 50 2 5 1 0
2004-02-13 16 29 32 36 41 7 9 0 0
1363 rows × 9 columns
Now I need to check, for each row of the df "sorteos_anteriores", whether it matches one of the rows of the first df, "tabla_aciertos".
Let me give you an example.
Imagine that in "sorteos_anteriores" you have:
2019-11-2 with "bolas_Acertadas" = 5 and "estrellas_Acertadas" = 1. Now you go to the first table, "tabla_aciertos", and you find that index 2 has "Numeros_acertados" = 5 and "Estrellas_acertadas" = 1. You have won a second-class (index = 2) prize. I want to create a new column "Prize" in "sorteos_anteriores" and, in each row, write a number from 1 to 13 if there is a prize, or 0 or NaN if there is not.
I have tried:
sorteos_anteriores ['categorias'] = sorteos_anteriores(sorteos_anteriores.loc[:,'bolas_Acertadas':'estrellas_Acertadas'] == tabla_premios.iloc[ : ,0:2])
Also with where and merge, but nothing works.
Thanks for your help.
Thanks to Cuina Max I could do it.
answer here
# assuming that the index values, starting from one, correspond to the prize categories
categoria_de_premios['Categoria'] = categoria_de_premios.index
# Merge using pd.merge and the appropriate arguments
sorteos_anteriores = (sorteos_anteriores.merge(
    categoria_de_premios,
    how='outer',
    left_on=['bolas_Acertadas','estrellas_Acertadas'],
    right_on=['Numeros_acertados', 'Estrellas_acertadas']
)).drop(columns=['Numeros_acertados', 'Estrellas_acertadas'])
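For anyone who wants to try the lookup on a small scale, here is a self-contained sketch. The four draws are invented for illustration, and it differs from the snippet above in two ways that are my own choices: it uses how='left' so that no extra rows appear for categories that were never hit, and it resets and restores the Fecha index, which a column-based merge would otherwise discard.

```python
import pandas as pd

# Prize table from the question; the index (1..13) is the prize category.
categoria_de_premios = pd.DataFrame(
    {
        "Numeros_acertados": [5, 5, 5, 4, 4, 3, 4, 2, 3, 3, 1, 2, 2],
        "Estrellas_acertadas": [2, 1, 0, 2, 1, 2, 0, 2, 1, 0, 2, 1, 0],
    },
    index=range(1, 14),
)

# Hypothetical draws, invented for illustration only.
sorteos = pd.DataFrame(
    {"bolas_Acertadas": [5, 0, 2, 1], "estrellas_Acertadas": [1, 1, 0, 0]},
    index=pd.to_datetime(["2019-11-02", "2020-10-13", "2020-09-10", "2020-06-10"]),
)
sorteos.index.name = "Fecha"

# Turn the category index into an ordinary column so merge can carry it over.
lookup = categoria_de_premios.rename_axis("Prize").reset_index()

result = (
    sorteos.reset_index()  # keep Fecha as a column during the merge
    .merge(
        lookup,
        how="left",  # keep every draw, matched or not
        left_on=["bolas_Acertadas", "estrellas_Acertadas"],
        right_on=["Numeros_acertados", "Estrellas_acertadas"],
    )
    .drop(columns=["Numeros_acertados", "Estrellas_acertadas"])
    .set_index("Fecha")
)
print(result)
```

Draws without a prize end up with NaN in the Prize column; result["Prize"].fillna(0).astype(int) would turn those into 0 if that is preferred.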
I have a one_sec_flt DataFrame that has 300,000+ points and a flask DataFrame that has 230 points. Both DataFrames have hour, minute and second columns. I want to attach each row of the flask DataFrame to the one_sec_flt row that has the same timestamp.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
I have this code I started with but I don't know how to append one DataFrame to another at that exact timestamp.
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Second.iloc[j]):
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you concatenate the two DataFrames, Flasks and one_sec_flt, and then sort by the time columns, that might achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
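Note that concat followed by sort_values interleaves the rows but does not put the gas columns on the matching one_sec_flt rows. If the goal is the table shown in the question, a left merge on the time columns is another option. The sketch below assumes tiny invented frames and assumes the column names are as shown in the question (lowercase in flasks, capitalized with Min in one_sec_flt):

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
one_sec_flt = pd.DataFrame(
    {
        "Year": [2018, 2018, 2018, 2018],
        "Month": [4, 4, 4, 4],
        "Day": [8, 8, 8, 8],
        "Hour": [14, 14, 16, 16],
        "Min": [30, 45, 27, 40],
        "Second": [20, 15, 48, 20],
        "temp": [300, 310, 305, 307],
    }
)
flasks = pd.DataFrame(
    {
        "year": [2018, 2018],
        "month": [4, 4],
        "day": [8, 8],
        "hour": [16, 16],
        "minute": [27, 40],
        "second": [48, 20],
        "gas1": [10, 45],
    }
)

# Align the flask column names with the one-second data before merging.
flasks_renamed = flasks.rename(
    columns={"year": "Year", "month": "Month", "day": "Day",
             "hour": "Hour", "minute": "Min", "second": "Second"}
)

keys = ["Year", "Month", "Day", "Hour", "Min", "Second"]
# how="left" keeps every one-second row; unmatched rows get NaN gas values.
merged = one_sec_flt.merge(flasks_renamed, on=keys, how="left")
print(merged)
```

This avoids the nested Python loops entirely, which matters with 300,000+ rows.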
I have a pandas DataFrame such as the one below:
id price hour minute date
1 10 03 07 01/11
2 4 03 59 01/11
3 5 02 21 01/11
4 6 03 47 02/09
5 1 04 28 02/04
6 7 05 50 01/11
7 3 02 01 01/11
8 2 01 23 01/11
...
and I want an output like:
id price hour minute date cumprice
1 10 03 07 01/11 19
2 4 03 59 01/11 14
3 5 02 21 01/11 20
4 6 03 47 02/09 6
5 1 04 28 02/04 1
6 7 05 50 01/11 7
7 3 02 01 01/11 10
8 2 01 23 01/11 10
...
I don't have any idea how to do this quickly.
Could anybody help me do this fast?
You could groupby the date and use transform to add a column with the sum of the prices per group:
df['cumsprice'] = df.groupby('date').price.transform('sum')
id price hour minute date cumsprice
0 1 10 3 7 01/11 19
1 2 4 3 59 01/11 19
2 3 5 2 21 01/11 19
3 4 6 3 47 02/09 6
4 5 1 4 28 02/04 1
Update
Update after the expected output was changed. In order to group by runs of consecutive equal dates, you can create a custom grouper by checking on which rows the date changes, and taking the cumsum of those flags:
g = df.date.ne(df.date.shift(1))
df['cumprice'] = df.groupby(g.cumsum()).price.transform('sum')
print(df)
id price hour minute date cumsprice cumprice
0 1 10 3 7 01/11 31 19.0
1 2 4 3 59 01/11 31 19.0
2 3 5 2 21 01/11 31 19.0
3 4 6 3 47 02/09 6 6.0
4 5 1 4 28 02/04 1 1.0
5 6 12 5 50 01/11 31 12.0
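To see the difference between the two groupings side by side, here is a small sketch built from the price and date columns in the question (the per_date and per_run column names are my own, chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [10, 4, 5, 6, 1, 7, 3, 2],
        "date": ["01/11", "01/11", "01/11", "02/09",
                 "02/04", "01/11", "01/11", "01/11"],
    }
)

# True wherever the date differs from the previous row, i.e. a new run starts.
starts_new_run = df["date"].ne(df["date"].shift())
# The cumulative sum of the flags labels each run: 1, 1, 1, 2, 3, 4, 4, 4.
run_id = starts_new_run.cumsum()

# Sum over every row sharing a date, wherever it appears.
df["per_date"] = df.groupby("date")["price"].transform("sum")
# Sum only within each run of consecutive equal dates.
df["per_run"] = df.groupby(run_id)["price"].transform("sum")
print(df)
```

The 01/11 rows get 31 in per_date because all six of them are pooled, whereas per_run gives 19 to the first run and 12 to the second.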
I have a pandas DataFrame like this
year id1 id2 jan jan1 jan2 feb feb1 feb2 mar mar1 mar2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
and I need this format
year month id1 id2 val val1 val2
2018 01 01 10 3 30 31
2018 02 01 10 2 23 25
2018 03 01 10 7 52 53
..........
As you can see, I have 3 values for each month, and I want to add a single month column plus 3 columns for the values. If it were only one value per month, I think I could use stack.
I wouldn't have any problem renaming the month columns to 01 01-1 01-2 (for January) or something like that to make it easier.
I'm also thinking about splitting the data into 3 different DataFrames, stacking them separately, and then merging the results. Or should I melt it?
Any ideas for achieving this easily?
Using reshape and stack:
pd.DataFrame(df.set_index(['year','id1','id2']).values.reshape(4,3,3).tolist(),
             index=df.set_index(['year','id1','id2']).index,
             columns=[1,2,3])\
    .stack().apply(pd.Series).reset_index().rename(columns={'level_3':'month'})
Out[261]:
year id1 id2 month 0 1 2
0 2018 1 10 1 3 30 31
1 2018 1 10 2 2 23 25
2 2018 1 10 3 7 52 53
3 2018 1 20 1 3 30 31
4 2018 1 20 2 2 23 25
5 2018 1 20 3 7 52 53
6 2018 2 10 1 3 30 31
7 2018 2 10 2 2 23 25
8 2018 2 10 3 7 52 53
9 2018 2 20 1 3 30 31
10 2018 2 20 2 2 23 25
11 2018 2 20 3 7 52 53
So I renamed the header columns this way
01 01 01 02 02 02 03 03 03 ...
year id1 id2 val val1 val2 val val1 val2 val val1 val2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
on a file, and opened it this way
# tupleize_cols was removed in pandas 1.0; header=[0, 1] already yields MultiIndex columns
df = pd.read_csv('my_file.csv', header=[0, 1], index_col=[0,1,2], skipinitialspace=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
then, I actually only needed to stack it on level 0
df = df.stack(level=0)
and add the titles
df.index.names = ['year','id1','id2','month']
df = df.reset_index()
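If editing the file header is not convenient, the same long format can also be built directly from the original column names. The sketch below is an alternative to the stack approach, not the method used above: it slices one block of three columns per month and concatenates the blocks. The two-month sample data and the months mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-month slice of the wide table from the question.
wide = pd.DataFrame(
    {
        "year": [2018, 2018],
        "id1": [1, 1],
        "id2": [10, 20],
        "jan": [3, 4], "jan1": [30, 40], "jan2": [31, 41],
        "feb": [2, 5], "feb1": [23, 50], "feb2": [25, 51],
    }
)

months = {"jan": 1, "feb": 2}  # assumed mapping from column prefix to month number
indexed = wide.set_index(["year", "id1", "id2"])

pieces = []
for prefix, month in months.items():
    # Pick the three columns for this month and give them month-neutral names.
    block = indexed[[prefix, prefix + "1", prefix + "2"]].copy()
    block.columns = ["val", "val1", "val2"]
    block.insert(0, "month", month)
    pieces.append(block)

tidy = (
    pd.concat(pieces)
    .reset_index()
    .sort_values(["year", "id1", "id2", "month"])
    .reset_index(drop=True)
)
print(tidy)
```

Extending the months dictionary to all twelve prefixes would cover the full table, provided every month follows the same name, name1, name2 pattern.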
This is my pandas data frame:
ID Position Time(in Hours) Date
01 18 2 01/01/2016
01 21 4 01/10/2016
01 19 2 01/10/2016
05 19 5 01/10/2016
05 21 1 01/10/2016
05 19 8 01/10/2016
02 19 18 02/10/2016
02 35 11 02/10/2016
I need to assign '1' for the maximum Time for each Id and Date else assign '0'.
My code is
def find_max(db7):
    max_row = db7['Time'].max()
    labels = np.where((db7['Time_in_Second'] == max_row),'1','0')
    return max_row
db7['Max'] = db7['Time'].map(find_max)
But I'm getting the error below. How do I do this, please?
TypeError: 'float' object is not subscriptable
My expected output should be:
ID Position Time(in Hours) Date Max
01 18 2 01/01/2016 0
01 21 4 01/10/2016 1
01 19 2 01/10/2016 0
05 19 5 01/10/2016 0
05 21 1 01/10/2016 0
05 19 8 01/10/2016 1
02 19 18 02/10/2016 1
02 35 11 02/10/2016 0
Use groupby with transform('max') and numpy.where to assign the new values:
max1 = db7.groupby(['ID','Date'])['Time(in Hours)'].transform('max')
db7['Max'] = np.where(db7['Time(in Hours)'].eq(max1), '1', '0')
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 1
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
Or convert the True and False values to '1' and '0' with a double astype:
max1 = db7.groupby(['ID','Date'])['Time(in Hours)'].transform('max')
db7['Max'] = db7['Time(in Hours)'].eq(max1).astype(int).astype(str)
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 1
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
Detail:
print (max1)
0 2
1 4
2 4
3 8
4 8
5 8
6 18
7 18
Name: Time(in Hours), dtype: int64
#eq is same as ==
print (db7['Time(in Hours)'].eq(max1))
0 True
1 True
2 False
3 False
4 False
5 True
6 True
7 False
Name: Time(in Hours), dtype: bool
EDIT:
If you need to group by the ID column only:
max1 = db7.groupby('ID')['Time(in Hours)'].transform('max')
db7['Max'] = np.where(db7['Time(in Hours)'].eq(max1), '1', '0')
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 0
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
print (max1)
0 4
1 4
2 4
3 8
4 8
5 8
6 18
7 18
Name: Time(in Hours), dtype: int64
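As a side note, the TypeError in the question most likely arises because Series.map calls find_max once per element, so inside the function db7 is a single float rather than a DataFrame. The following self-contained sketch rebuilds the sample data from the question and puts the two groupings next to each other; the Max_id_date and Max_id column names are mine, for illustration only.

```python
import numpy as np
import pandas as pd

db7 = pd.DataFrame(
    {
        "ID": ["01", "01", "01", "05", "05", "05", "02", "02"],
        "Time(in Hours)": [2, 4, 2, 5, 1, 8, 18, 11],
        "Date": ["01/01/2016", "01/10/2016", "01/10/2016", "01/10/2016",
                 "01/10/2016", "01/10/2016", "02/10/2016", "02/10/2016"],
    }
)

hours = db7["Time(in Hours)"]
# transform('max') returns the group maximum repeated on every row of the group.
by_id_date = db7.groupby(["ID", "Date"])["Time(in Hours)"].transform("max")
by_id = db7.groupby("ID")["Time(in Hours)"].transform("max")

# Flag rows whose value equals the maximum of their group.
db7["Max_id_date"] = np.where(hours.eq(by_id_date), "1", "0")
db7["Max_id"] = np.where(hours.eq(by_id), "1", "0")
print(db7)
```

The only row where the two flags differ is the first one: it is alone in its (ID, Date) group, so it is trivially the maximum there, but it is not the maximum for ID 01 overall.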