import pandas as pd

kf = 10
sets = 90
chunk = {}
for i in range(kf):
    chunk[i] = pd.read_csv("Dataset.csv", skiprows=(i * sets), nrows=sets)
When printing, I always get the first 90 rows instead of rows 0 to 89, then 90 to 179, and so on. How can I correct the initialization so that the lines are first skipped and only then the file is read?
Output with kf = 100 and sets = 9.
X1 X2 X3 ... X29 X30 Target
0 -2.335543 -2.325887 -2.367347 ... 2.001746 3.102024 1
1 -0.132771 0.463992 -0.282286 ... 3.003794 2.473191 1
2 -1.000121 -1.512276 -3.326958 ... 0.155254 5.855211 1
3 -1.170981 -3.493062 -2.241450 ... 3.228326 3.301115 1
4 -1.449553 -1.428624 -1.401973 ... 1.547833 2.008935 1
5 -1.657024 -1.567071 -1.784387 ... 0.606907 -2.135309 1
6 -0.323730 -1.237250 -2.679961 ... -1.365039 3.101155 1
7 -1.011255 -0.706056 -1.583983 ... -0.678562 -1.950106 1
8 0.388855 0.359412 0.037113 ... -3.413041 -4.051897 1
[9 rows x 31 columns]
X1 X2 X3 ... X29 X30 Target
0 -2.335543 -2.325887 -2.367347 ... 2.001746 3.102024 1
1 -0.132771 0.463992 -0.282286 ... 3.003794 2.473191 1
2 -1.000121 -1.512276 -3.326958 ... 0.155254 5.855211 1
3 -1.170981 -3.493062 -2.241450 ... 3.228326 3.301115 1
4 -1.449553 -1.428624 -1.401973 ... 1.547833 2.008935 1
5 -1.657024 -1.567071 -1.784387 ... 0.606907 -2.135309 1
6 -0.323730 -1.237250 -2.679961 ... -1.365039 3.101155 1
7 -1.011255 -0.706056 -1.583983 ... -0.678562 -1.950106 1
8 0.388855 0.359412 0.037113 ... -3.413041 -4.051897 1
[9 rows x 31 columns]
I believe you need the parameter chunksize in read_csv:
for df in pd.read_csv("Dataset.csv", chunksize=sets):
    print(df)
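If you also want the chunks numbered as in the original loop, here is a minimal sketch (assuming the file really is named Dataset.csv):
import pandas as pd

sets = 90
# enumerate the reader: each df is the next block of `sets` rows,
# and the header is parsed only once
chunk = {}
for i, df in enumerate(pd.read_csv("Dataset.csv", chunksize=sets)):
    chunk[i] = df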
EDIT:
I created sample data with your code; the problem is that the column values are parsed incorrectly because skiprows also skips the header row, so the names parameter is necessary, with an if-else that passes None for the first group:
import pandas as pd
from io import StringIO

# original data
temp=u"""colA,colB
A,1
B,2
A,3
C,4
B,5
A,6
C,7
B,8
A,9
C,10
B,11
A,12
C,13
D,14
B,15
C,16"""
kf = 3
sets = 6
# after testing, replace StringIO(temp) with "Dataset.csv"
cols = pd.read_csv(StringIO(temp), nrows=0).columns
print(cols)
Index(['colA', 'colB'], dtype='object')
for i in range(0, kf):
    if i == 0:
        val = 0
        names = None
    else:
        val = 1
        names = cols
    df = pd.read_csv(StringIO(temp),
                     skiprows=(i * sets) + val,
                     nrows=sets,
                     names=names)
    print(df)
colA colB
0 A 1
1 B 2
2 A 3
3 C 4
4 B 5
5 A 6
colA colB
0 C 7
1 B 8
2 A 9
3 C 10
4 B 11
5 A 12
colA colB
0 C 13
1 D 14
2 B 15
3 C 16
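For comparison, chunksize alone produces the same three groups from the sample data; a sketch reusing the temp string and sets defined above:
from io import StringIO
import pandas as pd

# yields DataFrames of `sets` rows each; the last chunk may be shorter
for df in pd.read_csv(StringIO(temp), chunksize=sets):
    print(df)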
For example, create 3 DataFrames and concatenate them:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'1': np.random.rand(10),
                    '2': np.random.rand(10),
                    '3': np.random.rand(10),
                    '4': np.random.rand(10)}).assign(Location='A')
df2 = pd.DataFrame({'1': np.random.rand(10),
                    '2': np.random.rand(10),
                    '3': np.random.rand(10),
                    '4': np.random.rand(10)}).assign(Location='B')
df3 = pd.DataFrame({'1': np.random.rand(10),
                    '2': np.random.rand(10),
                    '3': np.random.rand(10),
                    '4': np.random.rand(10)}).assign(Location='C')
cdf = pd.concat([df1, df2, df3])
It gives us:
1 2 3 4 Location
0 0.243073 0.946684 0.197635 0.941140 A
1 0.564369 0.747247 0.784011 0.990973 A
2 0.211258 0.976495 0.039176 0.892454 A
3 0.112643 0.674075 0.158149 0.764700 A
4 0.603871 0.793998 0.460644 0.191124 A
5 0.050415 0.460084 0.791560 0.991015 A
6 0.215605 0.103512 0.362365 0.877847 A
7 0.508424 0.429722 0.148857 0.733401 A
8 0.073434 0.575820 0.382577 0.441545 A
9 0.242787 0.366286 0.283623 0.213178 A
0 0.839207 0.154315 0.083025 0.085282 B
1 0.387318 0.693141 0.211551 0.714996 B
2 0.257663 0.413717 0.548559 0.283256 B
3 0.523662 0.861542 0.765760 0.983974 B
4 0.303765 0.774284 0.875710 0.473910 B
5 0.021572 0.937372 0.652112 0.924507 B
6 0.261787 0.004706 0.120351 0.863288 B
7 0.713536 0.741964 0.367541 0.097494 B
8 0.195457 0.845070 0.202422 0.393003 B
9 0.985934 0.167777 0.799820 0.578928 B
0 0.204089 0.108707 0.022954 0.233555 C
1 0.620620 0.900607 0.882538 0.560776 C
2 0.077108 0.397158 0.592832 0.740506 C
3 0.422404 0.040622 0.180017 0.168166 C
4 0.893277 0.740311 0.128431 0.982510 C
5 0.242766 0.030185 0.074385 0.543565 C
6 0.740050 0.161741 0.182423 0.917792 C
7 0.360404 0.374892 0.773396 0.631226 C
8 0.495498 0.589751 0.619968 0.331093 C
9 0.552977 0.040000 0.657023 0.093788 C
I am trying to exchange the columns [1, 2, 3, 4] for the elements of Location, which are [A, B, C].
The result is expected to be:
A B C original_axis
0 0.088333 0.009939 0.117679 1
1 0.340463 0.537707 0.261598 1
2 0.243001 0.497268 0.098827 1
3 0.514245 0.133166 0.435097 1
4 0.715491 0.597498 0.635434 1
5 0.578538 0.017931 0.255325 1
6 0.854928 0.485360 0.954285 1
7 0.460750 0.959085 0.657368 1
8 0.573592 0.990451 0.605646 1
9 0.086094 0.206977 0.801752 1
0 0.549036 0.370826 0.387348 2
1 0.323326 0.891108 0.546695 2
2 0.695849 0.329717 0.089323 2
3 0.815171 0.125956 0.973977 2
4 0.123839 0.093963 0.094081 2
5 0.679839 0.093035 0.378317 2
6 0.144703 0.467749 0.146022 2
7 0.011454 0.226450 0.833976 2
8 0.527904 0.848700 0.457423 2
9 0.492924 0.027877 0.608478 2
0 0.723353 0.225212 0.411559 3
1 0.271650 0.042965 0.102511 3
2 0.089035 0.443239 0.222270 3
3 0.848120 0.325830 0.025566 3
4 0.991819 0.820941 0.824764 3
5 0.458587 0.998833 0.519268 3
6 0.615102 0.518176 0.878235 3
7 0.459854 0.920891 0.295850 3
8 0.913309 0.023715 0.871933 3
9 0.847045 0.635648 0.976897 3
0 0.017517 0.744492 0.626891 4
1 0.217091 0.480721 0.488773 4
2 0.261332 0.727520 0.851186 4
3 0.636652 0.198315 0.274114 4
4 0.674848 0.121301 0.979112 4
5 0.179912 0.923860 0.626627 4
6 0.838932 0.708422 0.007303 4
7 0.960088 0.221822 0.717819 4
8 0.388269 0.451810 0.379369 4
9 0.639050 0.323490 0.078356 4
To do it, I melt it using mdf = pd.melt(cdf, id_vars='Location', var_name='time'):
Location time value
0 A 1 0.046463
1 A 1 0.282879
2 A 1 0.570865
3 A 1 0.719256
4 A 1 0.771334
.. ... ... ...
115 C 4 0.189733
116 C 4 0.244067
117 C 4 0.763828
118 C 4 0.268741
119 C 4 0.586709
Then I use mdf.pivot(index=None, columns='Location', values='value'), but get:
Location A B C
0 0.046463 NaN NaN
1 0.282879 NaN NaN
2 0.570865 NaN NaN
3 0.719256 NaN NaN
4 0.771334 NaN NaN
.. ... .. ...
115 NaN NaN 0.189733
116 NaN NaN 0.244067
117 NaN NaN 0.763828
118 NaN NaN 0.268741
119 NaN NaN 0.586709
Then, I use mdf.pivot_table(columns = 'Location', values = 'value', index = None), but get:
Location A B C
value 0.497467 0.421472 0.474544
I don't know how to correct it.
How can I perform an above-mentioned exchange?
Since the data is randomised, the outputs will differ, but the solution should still apply. The idea is a combination of long-to-wide and back to long; if you encounter an index error due to duplicates, you can add a unique counter with a groupby (a sketch follows the output below):
(cdf
 .set_index('Location', append=True)
 .unstack('Location')
 .stack(level=0)
 .rename_axis(index=[None, 'original_axis'], columns=None)
 # depending on your end goal you may skip the code below
 .reset_index('original_axis')
 .loc(axis=1)[['A', 'B', 'C', 'original_axis']]
 .reset_index(drop=True)
)
A B C original_axis
0 0.243073 0.839207 0.204089 1
1 0.946684 0.154315 0.108707 2
2 0.197635 0.083025 0.022954 3
3 0.941140 0.085282 0.233555 4
4 0.564369 0.387318 0.620620 1
5 0.747247 0.693141 0.900607 2
6 0.784011 0.211551 0.882538 3
7 0.990973 0.714996 0.560776 4
8 0.211258 0.257663 0.077108 1
9 0.976495 0.413717 0.397158 2
10 0.039176 0.548559 0.592832 3
11 0.892454 0.283256 0.740506 4
12 0.112643 0.523662 0.422404 1
13 0.674075 0.861542 0.040622 2
14 0.158149 0.765760 0.180017 3
15 0.764700 0.983974 0.168166 4
16 0.603871 0.303765 0.893277 1
17 0.793998 0.774284 0.740311 2
18 0.460644 0.875710 0.128431 3
19 0.191124 0.473910 0.982510 4
20 0.050415 0.021572 0.242766 1
21 0.460084 0.937372 0.030185 2
22 0.791560 0.652112 0.074385 3
23 0.991015 0.924507 0.543565 4
24 0.215605 0.261787 0.740050 1
25 0.103512 0.004706 0.161741 2
26 0.362365 0.120351 0.182423 3
27 0.877847 0.863288 0.917792 4
28 0.508424 0.713536 0.360404 1
29 0.429722 0.741964 0.374892 2
30 0.148857 0.367541 0.773396 3
31 0.733401 0.097494 0.631226 4
32 0.073434 0.195457 0.495498 1
33 0.575820 0.845070 0.589751 2
34 0.382577 0.202422 0.619968 3
35 0.441545 0.393003 0.331093 4
36 0.242787 0.985934 0.552977 1
37 0.366286 0.167777 0.040000 2
38 0.283623 0.799820 0.657023 3
39 0.213178 0.578928 0.093788 4
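As mentioned, if unstack raises an error because the (index, Location) pairs are duplicated, here is a minimal sketch of the counter idea; the helper column g and the variable out are my additions, not part of the chain above:
# number the rows within each Location so that (g, Location) is always unique
out = (cdf
       .assign(g=cdf.groupby('Location').cumcount())
       .set_index(['g', 'Location'])
       .unstack('Location')
       .stack(level=0)
       .rename_axis(index=[None, 'original_axis'], columns=None)
       .reset_index('original_axis')
       .loc(axis=1)[['A', 'B', 'C', 'original_axis']]
       .reset_index(drop=True)
       )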
If possible, use a counter based on the index values of the original DataFrame cdf: add ignore_index=False to melt, convert the index to a column, pass it together with time to the index parameter of pivot_table, and finally convert the first level of the MultiIndex to columns with some data cleaning:
cdf = pd.concat([df1, df2, df3])
mdf = (pd.melt(cdf, id_vars='Location', var_name='time', ignore_index=False)
       .reset_index()
       .pivot_table(index=['index', 'time'], columns='Location', values='value')
       .reset_index(level=1)
       .sort_values('time')
       .reset_index(drop=True)
       .rename_axis(None, axis=1)
       .assign(original_axis=lambda x: x.pop('time'))
       )
print(mdf.head(20))
A B C original_axis
0 0.321988 0.261495 0.834801 1
1 0.727275 0.721664 0.189777 1
2 0.890422 0.405724 0.549623 1
3 0.141341 0.078760 0.550035 1
4 0.524387 0.663287 0.742467 1
5 0.544935 0.218225 0.629431 1
6 0.588052 0.553420 0.021791 1
7 0.467896 0.972283 0.013206 1
8 0.022090 0.411311 0.715893 1
9 0.126596 0.625526 0.347060 1
10 0.390991 0.183550 0.115243 2
11 0.360875 0.110048 0.518665 2
12 0.180323 0.412962 0.709591 2
13 0.162077 0.912792 0.607409 2
14 0.337959 0.357006 0.474816 2
15 0.394469 0.863313 0.514885 2
16 0.501382 0.729779 0.627029 2
17 0.035648 0.585990 0.732794 2
18 0.456373 0.187173 0.362184 2
19 0.151172 0.391720 0.894907 2
Or:
cdf = pd.concat([df1, df2, df3])
mdf = (pd.melt(cdf.assign(g=cdf.groupby('Location').cumcount()),
               id_vars=['Location', 'g'], var_name='time', ignore_index=False)
       .pivot_table(index=['g', 'time'], columns='Location', values='value')
       .reset_index(level=1)
       .sort_values('time')
       .reset_index(drop=True)
       .rename_axis(None, axis=1)
       .assign(original_axis=lambda x: x.pop('time'))
       )
print(mdf.head(20))
A B C original_axis
0 0.321988 0.261495 0.834801 1
1 0.727275 0.721664 0.189777 1
2 0.890422 0.405724 0.549623 1
3 0.141341 0.078760 0.550035 1
4 0.524387 0.663287 0.742467 1
5 0.544935 0.218225 0.629431 1
6 0.588052 0.553420 0.021791 1
7 0.467896 0.972283 0.013206 1
8 0.022090 0.411311 0.715893 1
9 0.126596 0.625526 0.347060 1
10 0.390991 0.183550 0.115243 2
11 0.360875 0.110048 0.518665 2
12 0.180323 0.412962 0.709591 2
13 0.162077 0.912792 0.607409 2
14 0.337959 0.357006 0.474816 2
15 0.394469 0.863313 0.514885 2
16 0.501382 0.729779 0.627029 2
17 0.035648 0.585990 0.732794 2
18 0.456373 0.187173 0.362184 2
19 0.151172 0.391720 0.894907 2
I have this data:
df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
I want to set X3 as the index, so I would do:
df = df.set_index('X3')
And the result is:
    X1  X2
X3
A    0   4
A    5   1
B    4   3
B    8   5
B    9   6
C    0   2
C    7   3
C    6   3
However, I'm looking to set the same index but in a multi-index display format (even if it's not an actual MultiIndex). This is the expected result:
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
Is that possible?
EDIT
Answering the comments: the reason I want to achieve this is that I want to order df by the X1 values without losing the grouping by X3, so that I can see the order within each X3 group. I can't do it with sort_values(['X3', 'X1'], ascending=[False, False]) because the primary sort key must be the maximum of each group, while keeping all rows of the same group together: the first group should be the one containing the overall maximum of X1, with its other values visible right below it, the second group the one containing the next-highest group maximum, and so on.
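For what it's worth, a minimal sketch of that ordering, assuming X3 is still a regular column; the helper column _max is mine, used only for sorting and dropped afterwards:
import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# sort groups by their maximum X1 (descending), then by X1 within each group
df_sorted = (df.assign(_max=df.groupby('X3')['X1'].transform('max'))
               .sort_values(['_max', 'X3', 'X1'], ascending=[False, True, False])
               .drop(columns='_max'))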
Not sure if this helps, but what if you first blank out the repeated values of the 'X3' column and then set that as the index:
import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

lst_index = []
for x in df["X3"]:
    if x in lst_index:
        lst_index.append("")   # label already seen: show a blank
    else:
        lst_index.append(x)    # first occurrence: keep the label

del df["X3"]  # delete this column if it is not required anymore
df.index = lst_index
# Output
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
Try:
df
X1 X2 X3
0 0 4 A
1 5 1 A
2 4 3 B
3 8 5 B
4 9 6 B
5 0 2 C
6 7 3 C
7 6 3 C
df = df.set_index('X3')
new_index = pd.Series([i if not j else '' for i, j in zip(df.index, df.index.duplicated())])
df = df.set_index(new_index)
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
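If a real MultiIndex is acceptable, here is a sketch using a within-group counter as the inner level; pandas then hides the repeated outer labels on display by itself. The counter level is my assumption, not part of the answers above:
import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# X3 as the outer level, a per-group counter as the inner level
df = df.set_index(['X3', df.groupby('X3').cumcount()])
print(df)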
I have to create a result DataFrame from a source DataFrame that has two columns. The result DataFrame should have two kinds of headers: one kind is derived from the values of one source column, by appending those values to that column's header; the other is the remaining source column header, taken as-is, with its duplicate values reduced to unique ones. Each cell of the result should be 1 if the corresponding derived header occurred for that row value, and 0 otherwise.
The dataframes are as below
dfm = pd.DataFrame({'v' : [44,39,39,8,40,8,15,15],'x':[1,1,1,1,2,2,2,2]})
dfm
    v  x
0  44  1
1  39  1
2  39  1
3   8  1
4  40  2
5   8  2
6  15  2
7  15  2
result
x v_8 v_15 v_39 v_40 v_44
1 1 0 1 0 1
2 1 1 0 1 0
You can use the function crosstab, then find values greater than 0 and convert them to 1 and 0 with astype:
dfm = pd.DataFrame({'v' : [44,39,39,8,40,8,15,15],'x':[1,1,1,1,2,2,2,2]})
print(dfm)
v x
0 44 1
1 39 1
2 39 1
3 8 1
4 40 2
5 8 2
6 15 2
7 15 2
df = pd.crosstab(dfm.x, dfm.v)
# rename columns
df.rename(columns=lambda x: 'v_' + str(x), inplace=True)
# reset column name v to ''
df.columns.name = ''
print(df)
v_8 v_15 v_39 v_40 v_44
x
1 1 0 2 0 1
2 1 2 0 1 0
print(df > 0)
v_8 v_15 v_39 v_40 v_44
x
1 True False True False True
2 True True False True False
print((df > 0).astype(int))
v_8 v_15 v_39 v_40 v_44
x
1 1 0 1 0 1
2 1 1 0 1 0
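On recent pandas versions the same result can be written more compactly by clipping the counts; a sketch assuming the same dfm:
df = (pd.crosstab(dfm.x, dfm.v)
        .clip(upper=1)            # any count greater than 1 becomes 1
        .add_prefix('v_')         # 8 -> v_8, 15 -> v_15, ...
        .rename_axis(None, axis=1))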
Could you use something like this? I'm not sure it's the best solution, so I'm interested to see what others post.
import pandas as pd

dfm = pd.DataFrame({'v': [44, 39, 39, 8, 40, 8, 15, 15],
                    'x': [1, 1, 1, 1, 2, 2, 2, 2]}).sort_values(by="v")

col_heads = ["v_{}".format(x) for x in dfm.v.unique()]
row_index = dfm.x.unique()

nf = pd.DataFrame(index=row_index, columns=col_heads).fillna(0)
nf.index.name = 'x'

for i, name in dfm.groupby("v"):
    for val in name.x:
        nf.loc[val, "v_{}".format(i)] = 1
print(nf)
This outputs the following:
v_8 v_15 v_39 v_40 v_44
x
1 1 0 1 0 1
2 1 1 0 1 0
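Another option, a sketch with get_dummies plus a groupby max; this is my variant, not from either answer above:
# one dummy column per v value, then 1 if that value occurred for the given x
out = (pd.get_dummies(dfm.set_index('x')['v'], prefix='v')
         .groupby(level=0)
         .max()
         .astype(int))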
I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new DataFrame. Here are sample input and output.
import pandas as pd
import numpy as np

df = pd.DataFrame([np.array(['a', 'a', 'a', 'b', 'c', 'b', 'b', 'a', 'a', 'a']),
                   np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']

df_result = pd.DataFrame([np.array(['a', 'b', 'c', 'b', 'a']),
                          np.array([1, 4, 5, 6, 8]), np.array([3, 1, 1, 2, 3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
    increment_sum = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, increment_sum

# idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = list(zip(*df.groupby([df.key]).apply(get_result)))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution.
Not the most optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't use id alone for the groupby, so we add another new column to group within id, based on whether the stamps are continuous or not:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort is added at the end, as suggested in the comments, to get the exact expected order:
> df.groupby(['id','group_diff']).agg({'increment': sum, 'stamp': 'first'}).reset_index()[['id', 'stamp', 'increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
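For reference, the shift/cumsum key from the question combined with named aggregation (pandas 0.25+) does the same in fewer steps; a sketch assuming the same df:
# a new group starts wherever id differs from the previous row
key = (df['id'] != df['id'].shift()).cumsum()
out = (df.groupby(key)
         .agg(id=('id', 'first'),
              stamp=('stamp', 'first'),
              increment_sum=('increment', 'sum'))
         .reset_index(drop=True))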
I have got a pd.DataFrame:
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value by the difference Time - Time(t-1) and write the result to a column Product, separately for each top-level index, starting with the second row of each group.
For example, Product('a','2') should be (Time('a','2') - Time('a','1')) * Value('a','2'). To do this, I would need a "shifted" version of the column Time within each group, so that I could do df["Product"] = (df["Time"] - df["Time"].shifted) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
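The same thing fits in one line with groupby().diff(), which computes Time - Time(t-1) within each top-level group; a sketch under the same data:
# the NaN at each group's first row becomes 0 via fillna
df['Product'] = (df.groupby(level=0)['Time'].diff() * df['Value']).fillna(0)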
Hey, this should do what you need. Comment if I missed anything.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

def product(x):
    # difference to the previous row within the group, times Value
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x

df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)