How do you reshape a csv file in python using pandas? - python

I'm trying to reshape this csv file in python to have 1,000 rows and all 12 columns but it shows up as only one column
I've tried using df.iloc[:][:1000] it shortens it to 1000 rows but it still only gives me 1 column
Dataframe for Wine
df_w= pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv')
df_w= df_w.iloc[:12][:1000]
df_w
The dataframe that shows up is 1000 rows with only 1 column and I'm wondering how you set the data into its respective column title

Use sep=';', for semicolon delimiter:
df_w= pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
df_w= df_w.iloc[:12][:1000]
df_w
Output:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
5 7.4 0.66 0.00 1.8 0.075 13.0 40.0 0.9978 3.51 0.56 9.4 5
6 7.9 0.60 0.06 1.6 0.069 15.0 59.0 0.9964 3.30 0.46 9.4 5
7 7.3 0.65 0.00 1.2 0.065 15.0 21.0 0.9946 3.39 0.47 10.0 7
8 7.8 0.58 0.02 2.0 0.073 9.0 18.0 0.9968 3.36 0.57 9.5 7
9 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 5
10 6.7 0.58 0.08 1.8 0.097 15.0 65.0 0.9959 3.28 0.54 9.2 5
11 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 5

Related

pd.concat ends up with ValueError: no types given

I'm trying to merge two dfs (basically th same df at different time) using pd.concat.
here is my code:
Aujourdhui = datetime.datetime.now()
Aujourdhui = (Aujourdhui.strftime("%X"))
PerfsL1 = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
PerfsL1.columns = ['Équipe', 'Used_players', 'age', 'Possesion', "nb_matchs", "Starts", "Min",
'90s','Buts','Assists', 'No_penaltis', 'Penaltis', 'Penaltis_tentes',
'Cartons_jaunes', 'Cartons_rouges', 'Buts/90mn','Assists/90mn', 'B+A /90mn',
'NoPenaltis/90mn', 'B+A+P/90mn','Exp_buts','Exp_NoPenaltis', 'Exp_Assists', 'Exp_NP+A',
'Exp_buts/90mn', 'Exp_Assists/90mn','Exp_B+A/90mn','Exp_NoPenaltis/90mn', 'Exp_NP+A/90mn']
PerfsL1.insert(0, "Date", Aujourdhui)
print(PerfsL1)
PerfsL12 = pd.read_csv('Ligue_1_Perfs.csv', index_col=0)
print(PerfsL12)
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index = True)
print (PerfsL1)
I successfully managed to get both df individually which are sharing the same columns, but I can't merge them, getting
ValueError: no types given.
Do you have an idea where it could be coming from ?
EDIT
Here are both dataframes:
'Ligue_1.csv'
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 00:37:48 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 00:37:48 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 00:37:48 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 00:37:48 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 00:37:48 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 00:37:48 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 00:37:48 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 00:37:48 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 00:37:48 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 00:37:48 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 00:37:48 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 00:37:48 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 00:37:48 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 00:37:48 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
14 00:37:48 Paris S-G 18 27.6 60.0 2 ... 8.1 3.05 1.76 4.81 2.27 4.03
print(PerfsL1 = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0])
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 09:56:18 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 09:56:18 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 09:56:18 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 09:56:18 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 09:56:18 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 09:56:18 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 09:56:18 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 09:56:18 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 09:56:18 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 09:56:18 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 09:56:18 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 09:56:18 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 09:56:18 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 09:56:18 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
Thanks you for your support and have a great day !
Your code should work.
Nevertheless, try this before the concat:
PerfsL1["Date"] = pd.to_datetime(PerfsL1["Date"], format="%X", errors=‘coerce’)
I finally managed to concat both tables.
The solution was to put but both as csv before:
table1 = pd.read_html ('http://.......1........com)
table1.to_csv ('C://.....1........')
table1 = pd.read_csv('C://.....1........')
table2 = pd.read_html ('http://.......2........com)
table2.to_csv ('C://.....2........')
table2 = pd.read_csv('C://.....2........')
x = pd.concat([table2, table1])
And now it works perfectly !
Thanks for your help !

Calculating momentum signal in python using 1 month and 12 month lag

I am wanting to calculate a simple momentum signal. The method I am following is 1 month lagged cumret divided by 12 month lagged cumret minus 1.
date starts at 1/5/14 and ends at 1/5/16. As a 12 month lag is required, the first mom signal has to start 12 months after the start of date. Hence, why the first mom signal starts at 1/5/15.
Here is the data utilized:
import pandas as pd
data = {'date':['1/5/14','1/6/14','1/7/14','1/8/14','1/9/14','1/10/14','1/11/14','1/12/14' .,'1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15','1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15','1/1/16','1/2/16','1/3/16','1/4/16','1/5/16'],
'id': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a' ],
'ret':[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
'cumret':[1.01,1.03, 1.06,1.1 ,1.15,1.21,1.28, 1.36,1.45,1.55,1.66, 1.78,1.91,2.05,2.2,2.36, 2.53,2.71,2.9,3.1,3.31,3.53, 3.76,4,4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])
Desired output
ret cumret mom
date id
1/5/14 a .01 1.01
1/6/14 a .02 1.03
1/7/14 a .03 1.06
1/8/14 a .04 1.1
1/9/14 a .05 1.15
1/10/14 a .06 1.21
1/11/14 a .07 1.28
1/12/14 a .08 1.36
1/1/15 a .09 1.45
1/2/15 a .1 1.55
1/3/15 a .11 1.66
1/4/15 a .12 1.78
1/5/15 a .13 1.91 .8
1/6/15 a .14 2.05 .9
1/7/15 a .15 2.2 .9
1/8/15 a .16 2.36 1
1/9/15 a .17 2.53 1.1
1/10/15 a .18 2.71 1.1
1/11/15 a .19 2.9 1.1
1/12/15 a .2 3.1 1.1
1/1/16 a .21 3.31 1.1
1/2/16 a .22 3.53 1.1
1/3/16 a .23 3.76 1.1
1/4/16 a .24 4 1.1
1/5/16 a .25 4.25 1.1
This is the code tried to calculate mom
df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level = ['id'])
Entire dataset has more id e.g. a, b, c. Just included 1 variable for this example.
Any help would be awesome! :)
As far as I know, momentum is simply rate of change. Pandas has a built-in method for this:
df['mom'] = df['ret'].pct_change(12) # 12 month change
Also, I am not sure why you are using cumret instead of ret to calculate momentum.
Update: If you have multiple IDs that you need to go through, I'd recommend:
for i in df.index.levels[1]:
temp = df.loc[(slice(None), i), "ret"].pct_change(11)
df.loc[(slice(None), i), "mom"] = temp
# or df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11) for short
Output:
ret cumret mom
date id
1/5/14 a 0.01 1.01 NaN
1/6/14 a 0.02 1.03 NaN
1/7/14 a 0.03 1.06 NaN
1/8/14 a 0.04 1.10 NaN
1/9/14 a 0.05 1.15 NaN
1/10/14 a 0.06 1.21 NaN
1/11/14 a 0.07 1.28 NaN
1/12/14 a 0.08 1.36 NaN
1/1/15 a 0.09 1.45 NaN
1/2/15 a 0.10 1.55 NaN
1/3/15 a 0.11 1.66 NaN
1/4/15 a 0.12 1.78 11.000000
1/5/15 a 0.13 1.91 5.500000
1/6/15 a 0.14 2.05 3.666667
1/7/15 a 0.15 2.20 2.750000
1/8/15 a 0.16 2.36 2.200000
1/9/15 a 0.17 2.53 1.833333
1/10/15 a 0.18 2.71 1.571429
1/11/15 a 0.19 2.90 1.375000
1/12/15 a 0.20 3.10 1.222222
1/1/16 a 0.21 3.31 1.100000
1/2/16 a 0.22 3.53 1.000000
1/3/16 a 0.23 3.76 0.916667
1/4/16 a 0.24 4.00 0.846154
1/5/16 a 0.25 4.25 0.785714
1/5/14 b 0.01 1.01 NaN
1/6/14 b 0.02 1.03 NaN
1/7/14 b 0.03 1.06 NaN
1/8/14 b 0.04 1.10 NaN
1/9/14 b 0.05 1.15 NaN
1/10/14 b 0.06 1.21 NaN
1/11/14 b 0.07 1.28 NaN
1/12/14 b 0.08 1.36 NaN
1/1/15 b 0.09 1.45 NaN
1/2/15 b 0.10 1.55 NaN
1/3/15 b 0.11 1.66 NaN
1/4/15 b 0.12 1.78 11.000000
1/5/15 b 0.13 1.91 5.500000
1/6/15 b 0.14 2.05 3.666667
1/7/15 b 0.15 2.20 2.750000
1/8/15 b 0.16 2.36 2.200000
1/9/15 b 0.17 2.53 1.833333
1/10/15 b 0.18 2.71 1.571429
1/11/15 b 0.19 2.90 1.375000
1/12/15 b 0.20 3.10 1.222222
1/1/16 b 0.21 3.31 1.100000
1/2/16 b 0.22 3.53 1.000000
1/3/16 b 0.23 3.76 0.916667
1/4/16 b 0.24 4.00 0.846154
1/5/16 b 0.25 4.25 0.785714

Split data frame based on specific value in Python

I have a dataframe as below:
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
10000 .90 1.10 1.30 1.50 2.10 3.10 5.60 8.40 15.80
15000 1.35 1.65 1.95 2.25 3.15 4.65 8.40 12.60 23.70
20000 1.80 2.20 2.60 3.00 4.20 6.20 11.20 16.80 31.60
25000 2.25 2.75 3.25 3.75 5.25 7.75 14.00 21.00 39.50
30000 2.70 3.30 3.90 4.50 6.30 9.30 16.80 25.20 47.40
35000 3.15 3.85 4.55 5.25 7.35 10.85 19.60 29.40 55.30
40000 3.60 4.40 5.20 6.00 8.40 12.40 22.40 33.60 63.20
45000 4.05 4.95 5.85 6.75 9.45 13.95 25.20 37.80 71.10
50000 4.50 5.50 6.50 7.50 10.50 15.50 28.00 42.00 79.00
10000 .60 .80 1.00 1.20 1.80 2.80 5.30 8.10 15.50
15000 .90 1.20 1.50 1.80 2.70 4.20 7.95 12.15 23.25
20000 1.20 1.60 2.00 2.40 3.60 5.60 10.60 16.20 31.00
25000 1.50 2.00 2.50 3.00 4.50 7.00 13.25 20.25 38.75
30000 1.80 2.40 3.00 3.60 5.40 8.40 15.90 24.30 46.50
35000 2.10 2.80 3.50 4.20 6.30 9.80 18.55 28.35 54.25
40000 2.40 3.20 4.00 4.80 7.20 11.20 21.20 32.40 62.00
45000 2.70 3.60 4.50 5.40 8.10 12.60 23.85 36.45 69.75
50000 3.00 4.00 5.00 6.00 9.00 14.00 26.50 40.50 77.50
1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
Now I would like to split them into 3 dataframes based on the 'Size'
df1: From 10000 - before next occurrence of 10000
df2: Second 10000 - before 1000
df3: From 1000 to end
Otherwise,it is fine to have a temporary variable (temp column) in the same dataframe specifying categories like S1,S2 and S3 respectively for above ranges.
Could anyone guide me how to go about this?
Regards
Assumng that you want to break on the decreases, you could use the compare-cumsum-groupby pattern:
parts = list(df.groupby((df["Size"].diff() < 0).cumsum()))
which gives me (suppressing boring rows in the middle)
>>> for key, group in parts:
... print(key)
... print(group)
... print("----")
...
0
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
[...]
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0
----
1
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
[...]
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50
----
2
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
[...]
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.90
----
Not so elegant but this works:
In [259]:
ranges=[]
first = df.index[0]
criteria = df.index[df['Size'].diff() < 0]
for idx in criteria:
ranges.append((first, idx))
first += idx
ranges
Out[259]:
[(0, 9), (9, 18)]
In [261]:
splits = []
for r in ranges:
splits.append(df.iloc[r[0]:r[1]])
splits.append(df.iloc[ranges[-1][0]:])
splits
Out[261]:
[ Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
3 25000 2.25 2.75 3.25 3.75 5.25 7.75 14.0 21.0 39.5
4 30000 2.70 3.30 3.90 4.50 6.30 9.30 16.8 25.2 47.4
5 35000 3.15 3.85 4.55 5.25 7.35 10.85 19.6 29.4 55.3
6 40000 3.60 4.40 5.20 6.00 8.40 12.40 22.4 33.6 63.2
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
12 25000 1.5 2.0 2.5 3.0 4.5 7.0 13.25 20.25 38.75
13 30000 1.8 2.4 3.0 3.6 5.4 8.4 15.90 24.30 46.50
14 35000 2.1 2.8 3.5 4.2 6.3 9.8 18.55 28.35 54.25
15 40000 2.4 3.2 4.0 4.8 7.2 11.2 21.20 32.40 62.00
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.60 0.80 1.00 1.20 1.80 2.80 5.30 8.10 15.50
10 15000 0.90 1.20 1.50 1.80 2.70 4.20 7.95 12.15 23.25
11 20000 1.20 1.60 2.00 2.40 3.60 5.60 10.60 16.20 31.00
12 25000 1.50 2.00 2.50 3.00 4.50 7.00 13.25 20.25 38.75
13 30000 1.80 2.40 3.00 3.60 5.40 8.40 15.90 24.30 46.50
14 35000 2.10 2.80 3.50 4.20 6.30 9.80 18.55 28.35 54.25
15 40000 2.40 3.20 4.00 4.80 7.20 11.20 21.20 32.40 62.00
16 45000 2.70 3.60 4.50 5.40 8.10 12.60 23.85 36.45 69.75
17 50000 3.00 4.00 5.00 6.00 9.00 14.00 26.50 40.50 77.50
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
21 4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
22 5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
23 6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
24 7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
25 8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95]
So firstly this looks to see when the size stops increasing:
df['Size'].diff() < 0
and we use to mask the index, we then iterate over these ranges to create a list of tuple ranges.
We iterate over these ranges to slice the df in the last step.

Reading csv file using pandas where columns are separated by varying amounts of whitespace and commas

I want to read the csv file as a pandas dataframe. CSV file is here: https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0
In particular,
I want to skip the first row
The column headers are in row 2. In this case, they are: 1, 1, 2 and TOT. I do not want to hardcode them though. It is ok if the only column that gets extracted is TOT
I do not want to use a non-pandas approach if possible.
Here is what I am doing:
df = pandas.read_csv('https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0', skiprows=1, skipinitialspace=True, sep=' ')
But this gives the error:
*** CParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6
The output should look something like this:
1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52
1 BD 33kpa(t/m3) 1.6 1.6 1.6
2 SAND(%) 42.1 42.1 65.1
3 SILT(%) 37.9 37.9 16.9
4 CLAY(%) 20 20 18
5 ROCK(%) 12 12 12
6 WLS(kg/ha) 0 5 0.1 5.1
7 WLM(kg/ha) 0 5 0.1 5.1
8 WLSL(kg/ha) 0 4 0.1 4.1
9 WLSC(kg/ha) 0 2.1 0 2.1
10 WLMC(kg/ha) 0 2.1 0 2.1
11 WLSLC(kg/ha) 0 1.7 0 1.7
12 WLSLNC(kg/ha) 0 0.4 0 0.4
13 WBMC(kg/ha) 9 1102.1 250.9 1361.9
14 WHSC(kg/ha) 69 8432 1920 10420
15 WHPC(kg/ha) 146 18018 4102 22266
16 WOC(kg/ha) 224 27556 6272 34
17 WLSN(kg/ha) 0 0 0 0
18 WLMN(kg/ha) 0 0.2 0 0.2
19 WBMN(kg/ha) 0.9 110.2 25.1 136.2
20 WHSN(kg/ha) 7 843 192 1042
21 WHPN(kg/ha) 15 1802 410 2227
22 WON(kg/ha) 22 2755 627 3405
23 CFEM(kg/ha) 0
You can specify a regular expression to be used as your delimiter, in your case it will work with [\s,]{2,20}, i.e. 2 or more spaces or commas:
In [180]: pd.read_csv('aaaa.csv',
skiprows = 1,
sep='[\s,]{2,20}',
index_col=0)
Out[180]:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 0.00 5.00 0.10 5.1
8 WLM(kg/ha) 0.00 5.00 0.10 5.1
9 WLSL(kg/ha) 0.00 4.00 0.10 4.1
10 WLSC(kg/ha) 0.00 2.10 0.00 2.1
11 WLMC(kg/ha) 0.00 2.10 0.00 2.1
12 WLSLC(kg/ha) 0.00 1.70 0.00 1.7
13 WLSLNC(kg/ha) 0.00 0.40 0.00 0.4
14 WBMC(kg/ha) 9.00 1102.10 250.90 1361.9
15 WHSC(kg/ha) 69.00 8432.00 1920.00 10420.0
16 WHPC(kg/ha) 146.00 18018.00 4102.00 22266.0
17 WOC(kg/ha) 224.00 27556.00 6272.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.20 0.00 0.2
20 WBMN(kg/ha) 0.90 110.20 25.10 136.2
21 WHSN(kg/ha) 7.00 843.00 192.00 1042.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2755.00 627.00 3405.0
24 CFEM(kg/ha) 0.00 NaN NaN NaN
25, None NaN NaN NaN NaN
26, None NaN NaN NaN NaN
You need to specify the names of the columns. Notice the trick I used to get two columns called 1 (one is an integer name and the other is text).
Given how badly the data is structured, this is not perfect (note row 2 where BD and 33kpa got split because of the space between them).
pd.read_csv('/Downloads/aaaa.csv',
skiprows=2,
skipinitialspace=True,
sep=' ',
names=['Index', 'Description',1,"1",2,'TOT'],
index_col=0)
Description 1 1 2 TOT
Index
1, DEPTH(m) 0.01 1.24 1.52 NaN
2, BD 33kpa(t/m3) 1.60 1.60 1.6
3, SAND(%) 42.1 42.10 65.10 NaN
4, SILT(%) 37.9 37.90 16.90 NaN
5, CLAY(%) 20.0 20.00 18.00 NaN
6, ROCK(%) 12.0 12.00 12.00 NaN
7, WLS(kg/ha) 0.0 5.00 0.10 5.1
8, WLM(kg/ha) 0.0 5.00 0.10 5.1
9, WLSL(kg/ha) 0.0 4.00 0.10 4.1
10, WLSC(kg/ha) 0.0 2.10 0.00 2.1
11, WLMC(kg/ha) 0.0 2.10 0.00 2.1
12, WLSLC(kg/ha) 0.0 1.70 0.00 1.7
13, WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
14, WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
15, WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
16, WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
17, WOC(kg/ha) 224. 27556.00 6272.00 34.0
18, WLSN(kg/ha) 0.0 0.00 0.00 0.0
19, WLMN(kg/ha) 0.0 0.20 0.00 0.2
20, WBMN(kg/ha) 0.9 110.20 25.10 136.2
21, WHSN(kg/ha) 7. 843.00 192.00 1042.0
22, WHPN(kg/ha) 15. 1802.00 410.00 2227.0
23, WON(kg/ha) 22. 2755.00 627.00 3405.0
24, CFEM(kg/ha) 0. NaN NaN NaN
25, NaN NaN NaN NaN NaN
26, NaN NaN NaN NaN NaN
Or you can reset the index.
>>> (pd.read_csv('/Downloads/aaaa.csv',
skiprows=2,
skipinitialspace=True,
sep=' ',
names=['Index', 'Description',1,"1",2,'TOT'],
index_col=0)
.reset_index(drop=True)
.dropna(axis=0, how='all'))
Description 1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52 NaN
1 BD 33kpa(t/m3) 1.60 1.60 1.6
2 SAND(%) 42.1 42.10 65.10 NaN
3 SILT(%) 37.9 37.90 16.90 NaN
4 CLAY(%) 20.0 20.00 18.00 NaN
5 ROCK(%) 12.0 12.00 12.00 NaN
6 WLS(kg/ha) 0.0 5.00 0.10 5.1
7 WLM(kg/ha) 0.0 5.00 0.10 5.1
8 WLSL(kg/ha) 0.0 4.00 0.10 4.1
9 WLSC(kg/ha) 0.0 2.10 0.00 2.1
10 WLMC(kg/ha) 0.0 2.10 0.00 2.1
11 WLSLC(kg/ha) 0.0 1.70 0.00 1.7
12 WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
13 WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
14 WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
15 WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
16 WOC(kg/ha) 224. 27556.00 6272.00 34.0
17 WLSN(kg/ha) 0.0 0.00 0.00 0.0
18 WLMN(kg/ha) 0.0 0.20 0.00 0.2
19 WBMN(kg/ha) 0.9 110.20 25.10 136.2
20 WHSN(kg/ha) 7. 843.00 192.00 1042.0
21 WHPN(kg/ha) 15. 1802.00 410.00 2227.0
22 WON(kg/ha) 22. 2755.00 627.00 3405.0
23 CFEM(kg/ha) 0. NaN NaN NaN

Error in getting specific row from python pandas

I want to extract a row by name from the foll. dataframe:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 2.60 8.20 0.10 10.9
8 WLM(kg/ha) 5.00 8.30 0.00 13.4
9 WLSL(kg/ha) 0.00 3.80 0.10 3.9
10 WLSC(kg/ha) 1.10 3.50 0.00 4.6
11 WLMC(kg/ha) 2.10 3.50 0.00 5.6
12 WLSLC(kg/ha) 0.00 1.60 0.00 1.6
13 WLSLNC(kg/ha) 1.10 1.80 0.00 2.9
14 WBMC(kg/ha) 3.40 835.10 195.20 1033.7
15 WHSC(kg/ha) 66.00 8462.00 1924.00 10451.0
16 WHPC(kg/ha) 146.00 18020.00 4102.00 22269.0
17 WOC(kg/ha) 219.00 27324.00 6221.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.10 0.00 0.1
20 WBMN(kg/ha) 0.50 92.60 19.30 112.5
21 WHSN(kg/ha) 7.00 843.00 191.00 1041.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2738.00 621.00 3381.0
I want to extract the row containing info on WOC(kg/ha). here is what I am doing:
df.loc['WOC(kg/ha)']
but I get the error:
*** KeyError: 'the label [WOC(kg/ha)] is not in the [index]'
You don't have that label in your index, it's in your first column the following should work:
df.loc[df['Unnamed: 1'] == 'WOC(kg/ha)']
otherwise set the index to that column and your code would work fine:
df.set_index('Unnamed: 1', inplace=True)
Also, this can be used to set index without explicitly specifying column name: df.set_index(df.columns[0], inplace=True)

Categories