Value counts of a DataFrame chunk by chunk using pandas - python

I have a big DataFrame df and I want to count the values in each column. I can't simply do:
df = pandas.read_csv('my_big_data.csv')
values_df = df.apply(pandas.Series.value_counts)
because the dataset is too big to load into memory at once.
I think it must be possible to do it chunk by chunk with chunksize, but I can't see how.

In [9]: pd.set_option('display.max_rows', 10)
Construct a sample frame:
In [10]: df = pd.DataFrame(np.random.randint(0,100,size=100000).reshape(-1,1))
In [11]: df
Out[11]:
0
0 50
1 35
2 20
3 66
4 8
... ..
99995 51
99996 33
99997 43
99998 41
99999 56
[100000 rows x 1 columns]
In [12]: df.to_csv('test.csv')
Read it in chunks and construct the .value_counts for each chunk.
Concatenate all of these results (so you have a frame that is indexed by the values being counted, and the values are the counts):
In [13]: result = pd.concat([chunk.apply(pd.Series.value_counts) for chunk in pd.read_csv('test.csv', index_col=0, chunksize=10000)])
In [14]: result
Out[14]:
0
18 121
75 116
39 116
55 115
60 114
.. ...
88 83
8 83
56 82
76 76
18 73
[1000 rows x 1 columns]
Then groupby the index, which puts all of the duplicate index labels into groups. Summing gives the total of the individual value_counts:
In [15]: result.groupby(result.index).sum()
Out[15]:
0
0 1017
1 1015
2 992
3 1051
4 973
.. ...
95 1014
96 949
97 1011
98 999
99 981
[100 rows x 1 columns]
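Putting the whole recipe together, here is a minimal self-contained sketch (the file name, chunk size, and variable names are placeholders):
import pandas as pd

# stream the CSV in chunks, count values per chunk, then combine:
# concatenate the per-chunk counts and sum counts that share an index label
chunks = pd.read_csv('my_big_data.csv', index_col=0, chunksize=10000)
partial = pd.concat(chunk.apply(pd.Series.value_counts) for chunk in chunks)
total_counts = partial.groupby(partial.index).sum()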

Related

Optimizing the rewriting of the range values in a column into separate rows

I have a dataframe clothes_acc with column shoe_size containing values like:
index shoe_size
134 37-38
963 43-45
968 39-42
969 43-45
970 37-39
What I want to do is write each value of the range on a separate line, so I would get:
index shoe_size
134 37
134 38
963 43
963 44
963 45
968 39
968 40
968 41
968 42
...
Currently, I have the following code, which works fine except that it is very slow for a dataframe with 500k rows. (clothes_acc actually contains other values in the column that are not important here, which is why I take a subset of the dataframe with the mentioned values and save it in the tmp variable.)
for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
    clothes_acc = clothes_acc.drop([i])
    spl = [int(s) for s in row['shoe_size'].split('-')]
    for j in range(spl[0], spl[1]+1):
        replicate = row.copy()
        replicate['shoe_size'] = str(j)
        clothes_acc = clothes_acc.append(replicate)
clothes_acc.reset_index(drop=True, inplace=True)
Could anyone please suggest an improvement?
Convert the string range to a list of integer sizes and call explode():
df['shoe_size'] = df.apply(
    lambda x: [i for i in range(int(x['shoe_size'].split('-')[0]),
                                int(x['shoe_size'].split('-')[1]) + 1)],
    axis=1)
df = df.explode(column='shoe_size')
For example, if df is:
df = pd.DataFrame({
    'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})
... this will give the following result:
shoe_size
0 37
0 38
1 43
1 44
1 45
2 39
2 40
2 41
2 42
3 43
3 44
3 45
4 37
4 38
4 39
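If the row-wise apply is still slow at 500k rows, here is a sketch of a fully vectorized variant (assuming every value matches the low-high pattern and tmp has a unique index): repeat each row once per size in its range, then fill in the sizes with a per-row counter. The result can then be concatenated back onto clothes_acc.
import pandas as pd

bounds = tmp['shoe_size'].str.split('-', expand=True).astype(int)
repeats = (bounds[1] - bounds[0] + 1).to_numpy()
# duplicate each row as many times as there are sizes in its range
out = tmp.loc[tmp.index.repeat(repeats)].copy()
# lower bound of each range plus a 0,1,2,... counter per original row
out['shoe_size'] = (bounds[0].reindex(out.index)
                    + out.groupby(level=0).cumcount()).astype(str)
out = out.reset_index(drop=True)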
One option (more memory intensive) is to extract the bounds of the ranges, merge on all possible values, and then filter to rows where the merged value falls within the range. This will work okay when the shoe_sizes overlap for many of the products, so that the cross join isn't insanely huge.
import numpy as np
import pandas as pd

# Bring the range bounds over to clothes_acc
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
          .apply(pd.to_numeric)
          .rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
# index shoe_size lower upper
#0 134 37-38 37 38
#1 963 43-45 43 45
#2 968 39-42 39 42
#3 969 43-45 43 45
#4 970 37-39 37 39
vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(),
                                            clothes_acc.upper.max() + 1)})
res = (clothes_acc.drop(columns='shoe_size')
       .merge(vals, how='cross')
       .query('lower <= shoe_size <= upper')
       .drop(columns=['lower', 'upper']))
print(res)
index shoe_size
0 134 37
1 134 38
15 963 43
16 963 44
17 963 45
20 968 39
21 968 40
22 968 41
23 968 42
33 969 43
34 969 44
35 969 45
36 970 37
37 970 38
38 970 39

Pandas Dataframe

I have a dataframe containing a number of columns and rows; in all of the columns except for the leftmost two, there is data of the form "integer-integer". I would like to split all of these columns into two columns, with each integer in its own cell and the dash removed.
I have tried to follow the answers in Pandas Dataframe: Split multiple columns each into two columns, but it seems that they are splitting after one element, while I would like to split on the "-".
By way of example, suppose I have a dataframe of the form:
I would like to split the columns labelled 2 through to 22, to have them called 2F, 2A, 3F, 3A, ..., 6A with the data in the first row being R1, Hawthorn, 229, 225, 91, 81, ..., 12.
Thank you for any help.
You can use DataFrame.set_index with DataFrame.stack to get a Series, then split it into two new columns with Series.str.split, convert to integers, create the new column names with DataFrame.set_axis, reshape with DataFrame.unstack, sort the columns with DataFrame.sort_index, and finally flatten the MultiIndex and convert the index back to columns with DataFrame.reset_index:
# first replace the column names with default values
df.columns = range(len(df.columns))
df = (df.set_index([0, 1])
        .stack()
        .str.split('-', expand=True)
        .astype(int)
        .set_axis(['F', 'A'], axis=1)
        .unstack()
        .sort_index(axis=1, level=[1, 0], ascending=[True, False]))
df.columns = df.columns.map(lambda x: f'{x[1]}{x[0]}')
df = df.reset_index()
print(df)
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 216 142 439 367 7 12
1 R2 Sydney 226 214 93 92 151 167 377 381 12 8
2 R3 Geelong 216 228 91 166 159 121 369 349 16 14
3 R4 North Melbourne 213 239 169 126 142 155 355 394 8 9
4 R5 Gold Coast 248 226 166 94 267 169 455 389 18 6
5 R6 St Kilda 242 197 118 161 158 156 466 353 15 16
6 R7 Fremantle 225 219 72 84 224 185 449 464 7 5
For Input:
df = pd.DataFrame({0: ['R1'], 1: ['Hawthorn'], 2: ['229-225'], 3: ['91-81'], 4:['210-142'], 5:['439-367'], 6:['7-12']})
0 1 2 3 4 5 6
0 R1 Hawthorn 229-225 91-81 210-142 439-367 7-12
Trying the code:
for i in df.columns[2:]:
    df[[str(i)+'F', str(i)+'A']] = pd.DataFrame(df[i].str.split('-').tolist(), index=df.index)
    del df[i]
Prints (1st row):
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 210 142 439 367 7 12
You can use a lambda function to split the series:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
>>> data
0 12-24
1 13-26
2 14-28
3 15-30
df["d1"] = df["data"].apply(lambda x: x.split("-")[0])
df["d2"] = df["data"].apply(lambda x: x.split("-")[1])
df.head()
>>>
data d1 d2
0 12-24 12 24
1 13-26 13 26
2 14-28 14 28
3 15-30 15 30
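The same split can also be done without apply; a sketch using Series.str.split with expand=True, which produces the same string columns in one pass:
df[['d1', 'd2']] = df['data'].str.split('-', expand=True)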

Set columns of DataFrame to sum of columns of another in pandas

I have a DataFrame that looks like the below; call this "values":
I would like to create another, call it "sums", in which each cell contains the sum of the corresponding row of "values" from that column through to the last column. It would look like the below:
I would like to create this without looping through the entire DataFrame data point by data point. I have been trying with .apply(), as seen below, but I keep getting the error: unsupported operand type(s) for +: 'int' and 'datetime.date'
In [26]: values = pandas.DataFrame({0:[96,54,27,28],
1:[55,75,32,37],2:[54,99,36,46],3:[35,77,0,10],4:[62,25,0,25],
5:[0,66,0,89],6:[0,66,0,89],7:[0,0,0,0],8:[0,0,0,0]})
In [28]: sums = values.copy()
In [29]: sums.iloc[:,:] = ''
In [31]: for column in sums:
    ...:     sums[column].apply(sum(values.loc[:,column:]))
    ...:
Traceback (most recent call last):
File "<ipython-input-31-030442e5005e>", line 2, in <module>
sums[column].apply(sum(values.loc[:,column:]))
File "C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\pandas\core\series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1088, in pandas.lib.map_infer (pandas\lib.c:63043)
TypeError: 'numpy.int64' object is not callable
In [32]: for column in sums:
    ...:     sums[column] = sum(values.loc[:,column:])
In [33]: sums
Out[33]:
0 1 2 3 4 5 6 7 8
0 36 36 35 33 30 26 21 15 8
1 36 36 35 33 30 26 21 15 8
2 36 36 35 33 30 26 21 15 8
3 36 36 35 33 30 26 21 15 8
Is there a way to do this without looping each point individually?
Without looping, you can reverse your dataframe, take the cumulative sum along each row, and then reverse it back:
>>> values.iloc[:,::-1].cumsum(axis=1).iloc[:,::-1]
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
You can use the .cumsum() method to get the cumulative sum. The problem is that it operates from left to right, where you need it from right to left.
So we will reverse your data frame, use cumsum(), then put the axes back into the proper order.
import pandas as pd

values = pd.DataFrame({0:[96,54,27,28],
                       1:[55,75,32,37],2:[54,99,36,46],3:[35,77,0,10],4:[62,25,0,25],
                       5:[0,66,0,89],6:[0,66,0,89],7:[0,0,0,0],8:[0,0,0,0]})
values[values.columns[::-1]].cumsum(axis=1).reindex(values.columns, axis=1)
# returns:
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
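For what it's worth, the same right-to-left cumulative sum can be done at the NumPy level, which avoids relabeling the columns entirely; a sketch:
import numpy as np
import pandas as pd

# flip left-right, cumsum along each row, flip back
sums = pd.DataFrame(np.fliplr(np.fliplr(values.to_numpy()).cumsum(axis=1)),
                    index=values.index, columns=values.columns)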

Make separate pandas data frames per header

So I have a csv that contains data on a daily basis, separated by a header. Is there any way I can make separate pandas dfs each time the program hits a header?
The data basically looks like this:
#dateinformation
data1, data2, data3
data4, data5, data6
#dateinformation
An example of the real csv is this:
#7240320140101002301 131
21101400B 86 12B 110 325 25
10100000 200B 6B 110 325 77
20 95300 -9999 -27B 100-9999-9999
10 92500 820B -39B 90 290
.....
#7240320140102002301
21101400B 86 14B 110 325 25
10100000 200B 2B 110 325 77
20 95300 -9999 -85B 100-9999-9999
10 92500 820B -25B 90 290
I've already got the formatting of the actual data down fine. I just need some help with how to separate out the different sets within the csv.
(code below based on the header row starting with '#')
I suppose in theory you'd do this with read_table and chunksize, but in practice I had trouble getting that to work very well on account of the different number of fields per row. The following is fairly simple, but I did have to resort to iterrows.
In [1435]: df_list = []
      ...: rows = []
      ...: foo = pd.read_csv('foo.txt', sep=r'\s+', names=list('abcdef'))
      ...: for i, row in foo.iloc[1:].iterrows():
      ...:     if str(row.iloc[0]).startswith('#'):
      ...:         df_list.append(pd.DataFrame(rows))
      ...:         rows = []
      ...:     else:
      ...:         rows.append(row)
      ...: df_list.append(pd.DataFrame(rows))
In [1436]: df_list[0]
Out[1436]:
a b c d e f
1 21101400B 86 12B 110 325 25
2 10100000 200B 6B 110 325 77
3 20 95300 -9999 -27B 100-9999-9999 NaN
4 10 92500 820B -39B 90 290
In [1437]: df_list[1]
Out[1437]:
a b c d e f
6 21101400B 86 14B 110 325 25
7 10100000 200B 2B 110 325 77
8 20 95300 -9999 -85B 100-9999-9999 NaN
9 10 92500 820B -25B 90 290
This answer is based on the assumption that each 'frame' contains the same number of rows.
First we read the file with pandas read_csv(), leveraging the comment parameter to skip each of your headers and read in just the data:
df = pd.read_csv('data.txt', comment='#', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 21101400B 86 12B 110 325 25
1 10100000 200B 6B 110 325 77
2 20 95300 -9999 -27B 100-9999-9999 NaN
3 10 92500 820B -39B 90 290
4 21101400B 86 14B 110 325 25
5 10100000 200B 2B 110 325 77
6 20 95300 -9999 -85B 100-9999-9999 NaN
7 10 92500 820B -25B 90 290
Then a for loop parses and stores each frame in a list (assuming each frame has 4 rows):
frames = []
for begin in range(0, len(df), 4):
    frames.append(df[begin:begin+4])
frames[0]
0 1 2 3 4 5
0 21101400B 86 12B 110 325 25
1 10100000 200B 6B 110 325 77
2 20 95300 -9999 -27B 100-9999-9999 NaN
3 10 92500 820B -39B 90 290
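For the general case where the blocks can have different lengths, here is a sketch that splits the file on the header lines itself (assuming the data lives in 'foo.txt' and reusing the made-up column names from the first answer):
import pandas as pd
from io import StringIO

frames = []
block = []
with open('foo.txt') as fh:
    for line in fh:
        if line.startswith('#'):
            # a header closes the previous block; parse and store it
            if block:
                frames.append(pd.read_csv(StringIO(''.join(block)),
                                          sep=r'\s+', names=list('abcdef')))
            block = []
        else:
            block.append(line)
if block:  # don't forget the final block
    frames.append(pd.read_csv(StringIO(''.join(block)),
                              sep=r'\s+', names=list('abcdef')))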

How to get all fields for only a specific user_id from a pivot dataframe indexed by the two fields 'timestamp' and 'user_id'?

I have the table below, contained in the DataFrame pivoted:
cost cost cost val1 val1 val1
user_id 1 2 3 1 2 3
timestamp
01/01/2011 1 100 3 5
01/02/2011 20 8
01/07/2012 19 57
01/11/2012 3100 49
21/12/2012 240 30
14/09/2013 21 63
01/12/2013 3200 51
I would like to know how to obtain another dataframe containing only the fields associated with a specific user_id, i.e. (based on my example) to be able to obtain something like df_by_user_id for user_id 1, user_id 2, or user_id 3 (knowing that the table above is grouped by 'timestamp' and 'user_id'). My final purpose is to be able to make a plot for each user_id.
The code used to obtain the above table is:
import pandas as pd

newnames = ['timestamp', 'user_id', 'cost', 'val1', 'val2', 'val3', 'code']
df = pd.read_csv('mytest.csv', names=newnames, header=None, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
pivoted = df.pivot(index='timestamp', columns='user_id')
Thanks in advance for your help.
So let's start out with this reproducible dataframe:
import numpy as np
import pandas
np.random.seed(0)
N = 6
data = np.random.randint(low=0, high=201, size=(N, N))
cols = pandas.MultiIndex.from_product([('cost', 'value'), (1, 2, 3)], names=['quantity', 'user_id'])
dates = pandas.date_range(start='2010-01-01', periods=N, freq='M', name='date')
df = pandas.DataFrame(data, columns=cols, index=dates)
which is:
quantity cost value
user_id 1 2 3 1 2 3
date
2010-01-31 172 47 117 192 67 195
2010-02-28 103 9 21 36 87 70
2010-03-31 88 140 58 193 39 87
2010-04-30 174 88 81 165 25 77
2010-05-31 72 9 148 115 197 79
2010-06-30 175 192 82 99 177 29
Take a cross-section (xs) along axis 1 of the dataframe
df.xs(1, level='user_id', axis=1)
Which gives:
quantity cost value
date
2010-01-31 172 192
2010-02-28 103 36
2010-03-31 88 193
2010-04-30 174 165
2010-05-31 72 115
2010-06-30 175 99
Alternatively, you could pick out all of the costs with:
df.xs('cost', level='quantity', axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
Since that level of the columns isn't named in your dataframe, you can access it by its position:
df.xs('cost', level=0, axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
If you had a multi-level index on the rows, you could use axis=0 to select items based on row labels. But since you're concerned with columns right now, use axis=1.
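Since the stated end goal is a plot per user_id, here is a minimal sketch of that last step (assuming matplotlib is available and the column levels are named as above):
import matplotlib.pyplot as plt

for uid in df.columns.get_level_values('user_id').unique():
    # one figure per user, with 'cost' and 'value' as separate lines
    df.xs(uid, level='user_id', axis=1).plot(title=f'user_id = {uid}')
plt.show()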
