I have a dataframe with a number of columns and rows. In all of the columns except the leftmost two, the data is of the form "integer-integer". I would like to split each of these columns into two columns, with each integer in its own cell and the dash removed.
I have tried to follow the answers in Pandas Dataframe: Split multiple columns each into two columns, but they seem to split after one character, whereas I would like to split on the "-".
By way of example, suppose I have a dataframe of the form:
I would like to split the columns labelled 2 through 22 into columns called 2F, 2A, 3F, 3A, ..., 6A, with the data in the first row being R1, Hawthorn, 229, 225, 91, 81, ..., 12.
Thank you for any help.
You can use DataFrame.set_index with DataFrame.stack to get a Series, split it into two new columns with Series.str.split, convert to integers, set the new column names with DataFrame.set_axis, reshape back with DataFrame.unstack, sort the columns with DataFrame.sort_index, and finally flatten the MultiIndex and convert the index back to columns with DataFrame.reset_index:
# first replace the column names with default integer values
df.columns = range(len(df.columns))

df = (df.set_index([0, 1])
        .stack()
        .str.split('-', expand=True)
        .astype(int)
        .set_axis(['F', 'A'], axis=1)
        .unstack()
        .sort_index(axis=1, level=[1, 0], ascending=[True, False]))
df.columns = df.columns.map(lambda x: f'{x[1]}{x[0]}')
df = df.reset_index()
print(df)
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 216 142 439 367 7 12
1 R2 Sydney 226 214 93 92 151 167 377 381 12 8
2 R3 Geelong 216 228 91 166 159 121 369 349 16 14
3 R4 North Melbourne 213 239 169 126 142 155 355 394 8 9
4 R5 Gold Coast 248 226 166 94 267 169 455 389 18 6
5 R6 St Kilda 242 197 118 161 158 156 466 353 15 16
6 R7 Fremantle 225 219 72 84 224 185 449 464 7 5
For Input:
df = pd.DataFrame({0: ['R1'], 1: ['Hawthorn'], 2: ['229-225'], 3: ['91-81'], 4:['210-142'], 5:['439-367'], 6:['7-12']})
0 1 2 3 4 5 6
0 R1 Hawthorn 229-225 91-81 210-142 439-367 7-12
Trying the code:
for i in df.columns[2:]:
    df[[str(i) + 'F', str(i) + 'A']] = pd.DataFrame(df[i].str.split('-').tolist(), index=df.index)
    del df[i]
Prints (1st row):
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 210 142 439 367 7 12
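Note that str.split returns strings, so the new F/A columns keep object dtype. If integer columns are wanted, a small follow-up cast (a sketch, assuming every cell matched the "integer-integer" pattern) could be:
num_cols = df.columns[2:]   # the newly created 2F, 2A, ... columns
df[num_cols] = df[num_cols].astype(int)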
You can use a lambda function to split a Series:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
>>> data
0 12-24
1 13-26
2 14-28
3 15-30
df["d1"] = df["data"].apply(lambda x: x.split("-")[0])
df["d2"] = df["data"].apply(lambda x: x.split("-")[1])
df.head()
>>>
data d1 d2
0 12-24 12 24
1 13-26 13 26
2 14-28 14 28
3 15-30 15 30
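As a vectorized alternative (a sketch assuming the same "data" column), Series.str.split with expand=True produces both columns in one call:
df[["d1", "d2"]] = df["data"].str.split("-", expand=True)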
I have a dataframe clothes_acc with column shoe_size containing values like:
index shoe_size
134 37-38
963 43-45
968 39-42
969 43-45
970 37-39
What I want to do is to write the whole range of values separately in each line. So I would get:
index shoe_size
134 37
134 38
963 43
963 44
963 45
968 39
968 40
968 41
968 42
...
Currently, I have the following code, which works fine except that it is very slow for a dataframe with 500k rows. (clothes_acc actually contains other values in the column that are not important here, which is why I take a subset of the dataframe with the mentioned values and save it in the tmp variable.)
for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
    clothes_acc = clothes_acc.drop([i])
    spl = [int(s) for s in row['shoe_size'].split('-')]
    for j in range(spl[0], spl[1] + 1):
        replicate = row.copy()
        replicate['shoe_size'] = str(j)
        clothes_acc = clothes_acc.append(replicate)
clothes_acc.reset_index(drop=True, inplace=True)
Could anyone please suggest an improvement?
Convert the string range to a list of integer sizes and call explode():
df['shoe_size'] = df.apply(
    lambda x: [i for i in range(int(x['shoe_size'].split('-')[0]),
                                int(x['shoe_size'].split('-')[1]) + 1)],
    axis=1)
df = df.explode(column='shoe_size')
For example, if df is:
df = pd.DataFrame({
'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})
... this will give the following result:
shoe_size
0 37
0 38
1 43
1 44
1 45
2 39
2 40
2 41
2 42
3 43
3 44
3 45
4 37
4 38
4 39
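Note that after explode the shoe_size column has object dtype (each cell came from a Python list); if a numeric column is needed, a cast can follow (a small sketch):
df['shoe_size'] = df['shoe_size'].astype(int)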
One option (more memory intensive) is to extract the bounds of the ranges, merge on all possible values and then filter to where the merged value is between the range. This will work okay when the shoe_sizes overlap for many of the products so that the cross join isn't insanely huge.
import numpy as np
import pandas as pd

# Bring the range bounds over to the dataframe
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
                                  .apply(pd.to_numeric)
                                  .rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
# index shoe_size lower upper
#0 134 37-38 37 38
#1 963 43-45 43 45
#2 968 39-42 39 42
#3 969 43-45 43 45
#4 970 37-39 37 39
vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(),
                                            clothes_acc.upper.max() + 1)})

res = (clothes_acc.drop(columns='shoe_size')
                  .merge(vals, how='cross')
                  .query('lower <= shoe_size <= upper')
                  .drop(columns=['lower', 'upper']))
print(res)
index shoe_size
0 134 37
1 134 38
15 963 43
16 963 44
17 963 45
20 968 39
21 968 40
22 968 41
23 968 42
33 969 43
34 969 44
35 969 45
36 970 37
37 970 38
38 970 39
How can I join/merge two Pandas DataFrames with partially overlapping indexes, where I wish the resulting joined DataFrame to retain the column values of the first DataFrame, i.e. dropping the duplicates from df2?
import pandas as pd
import io
df1 = """
date; count
'2020-01-01'; 210
'2020-01-02'; 189
'2020-01-03'; 612
'2020-01-04'; 492
'2020-01-05'; 185
'2020-01-06'; 492
'2020-01-07'; 155
'2020-01-08'; 62
'2020-01-09'; 15
"""
df2 = """
date; count
'2020-01-04'; 21
'2020-01-05'; 516
'2020-01-06'; 121
'2020-01-07'; 116
'2020-01-08'; 82
'2020-01-09'; 121
'2020-01-10'; 116
'2020-01-11'; 82
'2020-01-12'; 116
'2020-01-13'; 82
"""
df1 = pd.read_csv(io.StringIO(df1), sep=";")
df2 = pd.read_csv(io.StringIO(df2), sep=";")
print(df1)
print(df2)
I have tried using
df1.reset_index().merge(df2, how='outer').set_index('date')
however, this drops the joined df2 values. Is there a method to keep the duplicated rows of the first dataframe?
Desired outcome:
print(df3)
date count
'2020-01-01' 210
'2020-01-02' 189
'2020-01-03' 612
'2020-01-04' 492
'2020-01-05' 185
'2020-01-06' 492
'2020-01-07' 155
'2020-01-08' 62
'2020-01-09' 15
'2020-01-10' 116
'2020-01-11' 82
'2020-01-12' 116
'2020-01-13' 82
Any help greatly appreciated, thank you.
Use combine_first:
df3 = (df1.set_index('date')
          .combine_first(df2.set_index('date'))
          .reset_index())
Output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82
Here is another way, using concat and drop_duplicates:
df3 = pd.concat([df1, df2]).drop_duplicates(["date"], keep="first", ignore_index=True)
output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82
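One difference from combine_first: concat keeps the original row order rather than sorting by the index, so df2's extra dates simply follow df1's rows (which here already happens to be date order). If an explicit ordering is wanted, a follow-up sort (a small sketch) could be:
df3 = df3.sort_values("date", ignore_index=True)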
Python Pandas DataFrame can have hierarchical index (MultiIndex) or hierarchical columns.
I'm looking for a way to know number of levels (depth) of index and columns.
len(df.index.levels)
seems to only work with a MultiIndex; it doesn't work with a normal index.
Is there an attribute for this (one that works for a MultiIndex but also for a simple Index)?
df.index.depth
or
df.columns.depth
would be great.
One example of MultiIndex columns and index:
import pandas as pd
import numpy as np
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

def mi_sample():
    miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                          mklbl('B', 2),
                                          mklbl('C', 4),
                                          mklbl('D', 2)])
    micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                           ('b', 'foo'), ('b', 'bah')],
                                          names=['lvl0', 'lvl1'])
    dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
                        index=miindex,
                        columns=micolumns).sort_index().sort_index(axis=1)
    return dfmi
df = mi_sample()
So df looks like:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
D1 21 20 23 22
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C0 D0 33 32 35 34
D1 37 36 39 38
C1 D0 41 40 43 42
D1 45 44 47 46
C2 D0 49 48 51 50
D1 53 52 55 54
C3 D0 57 56 59 58
D1 61 60 63 62
A1 B0 C0 D0 65 64 67 66
D1 69 68 71 70
C1 D0 73 72 75 74
D1 77 76 79 78
C2 D0 81 80 83 82
D1 85 84 87 86
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C0 D0 97 96 99 98
D1 101 100 103 102
C1 D0 105 104 107 106
D1 109 108 111 110
C2 D0 113 112 115 114
D1 117 116 119 118
... ... ... ... ...
A2 B0 C1 D0 137 136 139 138
D1 141 140 143 142
C2 D0 145 144 147 146
D1 149 148 151 150
C3 D0 153 152 155 154
D1 157 156 159 158
B1 C0 D0 161 160 163 162
D1 165 164 167 166
C1 D0 169 168 171 170
D1 173 172 175 174
C2 D0 177 176 179 178
D1 181 180 183 182
C3 D0 185 184 187 186
D1 189 188 191 190
A3 B0 C0 D0 193 192 195 194
D1 197 196 199 198
C1 D0 201 200 203 202
D1 205 204 207 206
C2 D0 209 208 211 210
D1 213 212 215 214
C3 D0 217 216 219 218
D1 221 220 223 222
B1 C0 D0 225 224 227 226
D1 229 228 231 230
C1 D0 233 232 235 234
D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
To summarize the comments above:
You can use the .nlevels attribute, which gives the number of levels for both an index and columns:
df = pd.DataFrame(np.random.rand(2,2), index=[['A','A'],['B','C']], columns=['a','b'])
df
a b
A B 0.558 0.336
C 0.148 0.436
df.index.nlevels
2
df.columns.nlevels
1
As @joris mentioned above, len(df.columns.levels) will not work in the example above because the columns are not a MultiIndex, giving:
AttributeError: 'Index' object has no attribute 'levels'
But it will work fine for index in the example above:
len(df.index.levels)
2
You probably need something like len(set(df.a)), which works on both an index and a normal column.
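As a side note, pandas also has a built-in for counting distinct values, which avoids building a Python set (a small sketch using the example column above):
df['a'].nunique()   # number of distinct values in column 'a'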
I have a big DataFrame df and I want to count the occurrences of each value. I can't do:
df = pandas.read_csv('my_big_data.csv')
values_df = df.apply(pandas.Series.value_counts)
because it is a very big dataset.
I think it must be possible to do it chunk by chunk with chunksize, but I can't see how.
In [9]: pd.set_option('display.max_rows', 10)
Construct a sample frame
In [10]: df = pd.DataFrame(np.random.randint(0, 100, size=100000).reshape(-1, 1))
In [11]: df
Out[11]:
0
0 50
1 35
2 20
3 66
4 8
... ..
99995 51
99996 33
99997 43
99998 41
99999 56
[100000 rows x 1 columns]
In [12]: df.to_csv('test.csv')
Read it in chunks and construct the .value_counts for each chunk.
Concatenate all of these results (so you have a frame that is indexed by the value being counted, and the values are the counts).
In [13]: result = pd.concat([chunk.apply(pd.Series.value_counts) for chunk in pd.read_csv('test.csv', index_col=0, chunksize=10000)])
In [14]: result
Out[14]:
0
18 121
75 116
39 116
55 115
60 114
.. ...
88 83
8 83
56 82
76 76
18 73
[1000 rows x 1 columns]
Then groupby the index, which puts all of the duplicate indexes into groups. Summing gives the sum of the individual value_counts.
In [15]: result.groupby(result.index).sum()
Out[15]:
0
0 1017
1 1015
2 992
3 1051
4 973
.. ...
95 1014
96 949
97 1011
98 999
99 981
[100 rows x 1 columns]
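Putting the steps together, a minimal sketch of the whole approach (the file name, chunk size, and function name are assumptions, not part of the original answer):
import pandas as pd

def chunked_value_counts(path, chunksize=10000):
    # value_counts per chunk, concatenated, then summed per value
    partial = pd.concat(
        chunk.apply(pd.Series.value_counts)
        for chunk in pd.read_csv(path, index_col=0, chunksize=chunksize)
    )
    return partial.groupby(partial.index).sum()

counts = chunked_value_counts('test.csv')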
I have the table below, contained in the DataFrame pivoted:
cost cost cost val1 val1 val1
user_id 1 2 3 1 2 3
timestamp
01/01/2011 1 100 3 5
01/02/2011 20 8
01/07/2012 19 57
01/11/2012 3100 49
21/12/2012 240 30
14/09/2013 21 63
01/12/2013 3200 51
I would like to know how to obtain another dataframe containing only the fields associated with a specific user_id, i.e. (based on my example) to be able to obtain something like df_by_user_id = pivoted['user_id'=1], df_by_user_id = pivoted['user_id'=2] or df_by_user_id = pivoted['user_id'=3] (knowing that the table above is grouped by 'timestamp' and 'user_id'). (My final purpose is to be able to make a plot for each user_id.)
The code used to obtain the above table is:
import pandas as pd

newnames = ['timestamp', 'user_id', 'cost', 'val1', 'val2', 'val3', 'code']
df = pd.read_csv('mytest.csv', names=newnames, header=None, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
pivoted = df.pivot(index='timestamp', columns='user_id')
Thanks in advance for your help.
So let's start out with this reproducible dataframe:
import numpy as np
import pandas

np.random.seed(0)
N = 6

# randint's high end is exclusive, so 201 keeps the original 0-200 range
data = np.random.randint(low=0, high=201, size=(N, N))
cols = pandas.MultiIndex.from_product([('cost', 'value'), (1, 2, 3)], names=['quantity', 'user_id'])
dates = pandas.date_range(start='2010-01-01', freq='M', periods=N, name='date')
df = pandas.DataFrame(data, columns=cols, index=dates)
which is:
quantity cost value
user_id 1 2 3 1 2 3
date
2010-01-31 172 47 117 192 67 195
2010-02-28 103 9 21 36 87 70
2010-03-31 88 140 58 193 39 87
2010-04-30 174 88 81 165 25 77
2010-05-31 72 9 148 115 197 79
2010-06-30 175 192 82 99 177 29
Take a cross-section (xs) along axis 1 of the dataframe
df.xs(1, level='user_id', axis=1)
Which gives:
quantity cost value
date
2010-01-31 172 192
2010-02-28 103 36
2010-03-31 88 193
2010-04-30 174 165
2010-05-31 72 115
2010-06-30 175 99
Alternatively, you could pick out all of the costs with:
df.xs('cost', level='quantity', axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
Since that level of the columns isn't named in your dataframe, you can access it with its index:
df.xs('cost', level=0, axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
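Since the question's final purpose is a plot per user_id, one possible sketch (assuming matplotlib is installed and the column level is named 'user_id' as in this reproducible example) is to loop over the level's values and plot each cross-section:
import matplotlib.pyplot as plt

for uid in df.columns.get_level_values('user_id').unique():
    df.xs(uid, level='user_id', axis=1).plot(title=f'user_id = {uid}')
plt.show()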
If you had a multi-level index on the rows, you could use axis=0 to select items based on row labels. But since you're concerned with columns right now, use axis=1.
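For completeness, a small sketch of the axis=0 case (using a hypothetical two-level row index, not the example dataframe above):
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['group', 'item'])
df_rows = pd.DataFrame({'val': [10, 20, 30, 40]}, index=mi)

# select every row whose 'item' level equals 1
print(df_rows.xs(1, level='item', axis=0))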