So I have data similar to this:
import pandas as pd

df = pd.DataFrame({'Order ID': [555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566],
                   'State': ["MA", "MA", "MA", "MA", "MA", "MA", "CT", "CT", "CT", "CT", "CT", "CT"],
                   'County': ["Essex", "Essex", "Essex", "Worcester", "Worcester", "Worcester",
                              "Bristol", "Bristol", "Bristol", "Hartford", "Hartford", "Hartford"],
                   'AP': [50, 50, 75, 100, 100, 125, 150, 150, 175, 200, 200, 225]})
but I need to add a column that shows the mode of AP grouped by State and County. I can get the mode this way:
(df.groupby(['State', 'County']).AP.agg(Mode = (lambda x: x.value_counts().index[0])).reset_index().round(0))
I'm just not sure how I can get that data added to the original data so that it looks like this:
Order ID  State  County     AP   Mode
     555  MA     Essex       50    50
     556  MA     Essex       50    50
     557  MA     Essex       75    50
     558  MA     Worcester  100   100
     559  MA     Worcester  100   100
     560  MA     Worcester  125   100
     561  CT     Bristol    150   150
     562  CT     Bristol    150   150
     563  CT     Bristol    175   150
     564  CT     Hartford   200   200
     565  CT     Hartford   200   200
     566  CT     Hartford   225   200
Use GroupBy.transform to create the new column:
df['Mode'] = (df.groupby(['State', 'County']).AP
.transform(lambda x: x.value_counts().index[0]))
Or Series.mode:
df['Mode'] = df.groupby(['State', 'County']).AP.transform(lambda x: x.mode().iat[0])
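An alternative, if you want to keep the aggregation step you already have, is to compute the mode once per group and merge it back onto the original rows. This is a sketch using the question's data; `how='left'` preserves the original row order:

```python
import pandas as pd

df = pd.DataFrame({'Order ID': [555, 556, 557, 558, 559, 560,
                                561, 562, 563, 564, 565, 566],
                   'State': ['MA'] * 6 + ['CT'] * 6,
                   'County': ['Essex'] * 3 + ['Worcester'] * 3
                             + ['Bristol'] * 3 + ['Hartford'] * 3,
                   'AP': [50, 50, 75, 100, 100, 125,
                          150, 150, 175, 200, 200, 225]})

# Aggregate the mode once per (State, County) pair ...
modes = (df.groupby(['State', 'County'])['AP']
           .agg(Mode=lambda x: x.mode().iat[0])
           .reset_index())

# ... then merge it back onto the original rows.
out = df.merge(modes, on=['State', 'County'], how='left')
print(out)
```

This does the same broadcasting that `transform` does, but keeps an explicit per-group table around in case you also need it on its own.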
I am trying to convert a csv file to pandas df.
The data is of the following type (SROIE dataset) (this is just a small part of total file):
76,50,323,50,323,84,76,84,TAN WOON YANN
110,165,315,165,315,188,110,188,INDAH GIFT & HOME DECO
126,191,297,191,297,214,126,214,27,JALAN DEDAP 13,
129,218,287,218,287,236,129,236,TAMAN JOHOR JAYA,
100,243,324,243,324,261,100,261,81100 JOHOR BAHRU,JOHOR.
70,268,201,268,201,285,70,285,TEL:07-3507405
The issue lies only in the last column, which doesn't display the entire text information I need.
Based on an answer I found on pandas dataframe read csv with rows that have/not have comma at the end, I used the following code:
import numpy as np
import pandas as pd

pd.read_csv(r'D:\E_Drive\everything else\C2\SROIE2019\0325updated.task1train(626p)\X00016469619.txt',
            usecols=np.arange(0, 9), header=None)
This gave a dataframe in which the last column was truncated. The problem is that, for example, in line 3 (the row labelled 2 in the dataframe), i.e.
126,191,297,191,297,214,126,214,27,JALAN DEDAP 13,
I need
27,JALAN DEDAP 13,
but I am getting
27
only. Same is the issue in line 5 (row labelled 4 in pd dataframe):
100,243,324,243,324,261,100,261,81100 JOHOR BAHRU,JOHOR.
I need
81100 JOHOR BAHRU,JOHOR.
but I am getting
81100 JOHOR BAHRU
The following approach might be sufficient: it first reads the rows using the standard csv reader and rejoins the trailing columns before loading the result into pandas.
import pandas as pd
import csv

with open('X00016469619.txt', newline='') as f_input:
    csv_input = csv.reader(f_input)
    # Keep the first 8 fields as-is and rejoin everything after them.
    data = [row[:8] + [', '.join(row[8:])] for row in csv_input]

df = pd.DataFrame(data)
print(df)
Giving you:
     0    1    2    3    4    5    6    7                         8
0   76   50  323   50  323   84   76   84             TAN WOON YANN
1  110  165  315  165  315  188  110  188    INDAH GIFT & HOME DECO
2  126  191  297  191  297  214  126  214       27, JALAN DEDAP 13,
3  129  218  287  218  287  236  129  236         TAMAN JOHOR JAYA,
4  100  243  324  243  324  261  100  261  81100 JOHOR BAHRU, JOHOR.
5   70  268  201  268  201  285   70  285            TEL:07-3507405
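A variant without the csv module is also possible (a sketch, assuming only the first 8 fields are guaranteed comma-free): `str.split` with a maxsplit of 8 keeps everything after the eighth comma, commas included, as the ninth field.

```python
import pandas as pd

# A hypothetical subset of the file's lines, inlined for illustration.
raw = """76,50,323,50,323,84,76,84,TAN WOON YANN
126,191,297,191,297,214,126,214,27,JALAN DEDAP 13,
100,243,324,243,324,261,100,261,81100 JOHOR BAHRU,JOHOR."""

# split(',', 8) splits on at most 8 commas, so the remainder of each
# line survives intact as the last element.
rows = [line.split(',', 8) for line in raw.splitlines()]
df = pd.DataFrame(rows)
print(df[8].tolist())
# → ['TAN WOON YANN', '27,JALAN DEDAP 13,', '81100 JOHOR BAHRU,JOHOR.']
```

Unlike the csv-reader version, this keeps the original commas without inserting a space after them.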
I have a dataframe, from where I extracted some sample data:
Time Val
0 70000 -322
1 70500 -439
2 71000 -528
3 71500 -606
4 72000 -642
5 72500 -663
6 73000 -620
7 73500 -561
8 74000 -592
9 74500 -614
10 75000 -630
11 75500 -719
12 80000 -613
13 80500 -127
14 81000 -235
15 81500 -186
16 82000 -82
17 82500 836
18 83000 1137
183 70000 -106
184 70500 -117
185 71000 -626
186 71500 -810
187 72000 -822
188 72500 -676
189 73000 -639
190 73500 -664
191 74000 -708
192 74500 -515
193 75000 -61
194 75500 -121
195 80000 -145
196 80500 -57
197 81000 -133
198 81500 101
199 82000 235
200 82500 585
201 83000 550
366 70000 18
367 70500 138
368 71000 22
369 71500 -68
370 72000 -146
371 72500 -163
372 73000 -251
373 73500 -230
374 74000 -218
375 74500 -137
376 75000 -126
Now I would like to compare the value of 'Val' at Time 73000 with the value three rows earlier ([i-3]). If it is smaller, append the following rows to a list until Time reaches 80000.
I wrote this loop, but the problem is that it compares 'Val' against [i-3] for ALL rows between 73000 and 80000. I want the comparison to happen ONLY at 73000 and, if the condition is true, to write the data to the list (until Time 80000):
box = []
for i in df.index:
    if df.Time[i] >= 73000 and df.Time[i] <= 80000 and df.Val[i] < df.Val[i-3]:
        box.append({
            'Time': df.Time[i],
            'newVAL': df.Val[i],
        })

box = pd.DataFrame(box, columns=['Time', 'newVAL'])
How could I change the code in order to achieve this?
You need to remember the result of the comparison in another variable, and reset it whenever you encounter a time value outside your desired interval. The code would look like this.
box = []
writeToList = False

for i in df.index:
    if df.Time[i] < 73000 or df.Time[i] > 80000:
        writeToList = False
    if df.Time[i] == 73000 and df.Val[i] < df.Val[i-3]:
        writeToList = True
    if writeToList and df.Time[i] >= 73000 and df.Time[i] <= 80000:
        box.append({
            'Time': df.Time[i],
            'newVAL': df.Val[i],
        })

box = pd.DataFrame(box, columns=['Time', 'newVAL'])
Hope this helps.
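If you prefer to avoid an explicit loop, the same logic can be vectorized (a sketch with illustrative values, column names as in the question): evaluate the condition once at the Time == 73000 row, then forward-fill that decision across the window.

```python
import pandas as pd

# Illustrative sample in the same shape as the question's data.
df = pd.DataFrame({'Time': [71500, 72000, 72500, 73000, 73500, 80000, 80500],
                   'Val':  [-606, -642, -663, -620, -561, -613, -127]})

in_window = df['Time'].between(73000, 80000)

# Evaluate the condition only at the window start (Time == 73000) ...
flag = pd.Series(pd.NA, index=df.index, dtype='object')
flag[df['Time'].eq(73000)] = df['Val'] < df['Val'].shift(3)
flag[~in_window] = False

# ... then carry that decision forward through the rest of the window.
keep = flag.ffill().fillna(False).astype(bool) & in_window
box = df.loc[keep, ['Time', 'Val']].rename(columns={'Val': 'newVAL'})
print(box)
```

The forward-fill plays the role of the `writeToList` variable: once the start row decides True, every following in-window row inherits that decision.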
Unnamed: 4 GDP in billions of chained 2009 dollars.1
214 2000q1 12359.1
215 2000q2 12592.5
216 2000q3 12607.7
217 2000q4 12679.3
218 2001q1 12643.3
219 2001q2 12710.3
220 2001q3 12670.1
221 2001q4 12705.3
222 2002q1 12822.3
223 2002q2 12893.0
224 2002q3 12955.8
225 2002q4 12964.0
226 2003q1 13031.2
227 2003q2 13152.1
228 2003q3 13372.4
229 2003q4 13528.7
230 2004q1 13606.5
231 2004q2 13706.2
232 2004q3 13830.8
233 2004q4 13950.4
234 2005q1 14099.1
235 2005q2 14172.7
236 2005q3 14291.8
237 2005q4 14373.4
238 2006q1 14546.1
239 2006q2 14589.6
240 2006q3 14602.6
241 2006q4 14716.9
242 2007q1 14726.0
243 2007q2 14838.7
... ... ...
250 2009q1 14375.0
251 2009q2 14355.6
252 2009q3 14402.5
253 2009q4 14541.9
254 2010q1 14604.8
255 2010q2 14745.9
256 2010q3 14845.5
257 2010q4 14939.0
258 2011q1 14881.3
259 2011q2 14989.6
260 2011q3 15021.1
261 2011q4 15190.3
262 2012q1 15291.0
263 2012q2 15362.4
264 2012q3 15380.8
265 2012q4 15384.3
266 2013q1 15491.9
267 2013q2 15521.6
268 2013q3 15641.3
269 2013q4 15793.9
270 2014q1 15747.0
271 2014q2 15900.8
272 2014q3 16094.5
273 2014q4 16186.7
274 2015q1 16269.0
275 2015q2 16374.2
276 2015q3 16454.9
277 2015q4 16490.7
278 2016q1 16525.0
279 2016q2 16583.1
I have the above dataframe. I want to compare the values in the column GDP in billions of chained 2009 dollars.1 and report the index and value of any row whose value is consecutively less than the two values above it. I am using the following code, but I am not getting the result:
datan = pd.read_excel('gdplev.xls', skiprows=5)
datan.drop(datan.iloc[0:230, 0:4], inplace=True, axis=1)
datan = datan[214:]
datan = datan.drop(['GDP in billions of current dollars.1', 'Unnamed: 7'], axis=1)
datan

for item in datan['GDP in billions of chained 2009 dollars.1']:
    if item > item+1 and item+1 > item+2:
        print(item+2)
Please help
I suggest the following:
# First I reproduce a similar DataFrame than yours
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"quarter" : pd.date_range("2000q1", freq="Q", periods = 10),
"gdp": np.random.rand(10)*10000})
df["quarter"] = pd.Series(df["quarter"].dt.year).astype("str") + "q" + pd.Series(df["quarter"].dt.quarter).astype("str")
# Then I create two columns that are the lags of gdp
df["gdpN_1"] = df["gdp"].shift()
df["gdpN_2"] = df["gdpN_1"].shift()
# I create a top when gdp is below gdp at past quarter and the quarter before that
df["top"] = (df["gdp"] < df["gdpN_1"]) & (df["gdp"] < df["gdpN_2"])
# I only select rows for which top is True
new_df = df.loc[df["top"], ["quarter", "gdp"]]
And the result for new_df is :
  quarter          gdp
2  2000q3  2268.514536
5  2001q2  4231.064601
8  2002q1  4809.319015
9  2002q2  3921.175182
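Applied to the question's own column (a sketch with illustrative numbers; only the column name is taken from the original frame), the same shift-based comparison becomes:

```python
import pandas as pd

# A minimal stand-in for the GDP column from the question.
col = 'GDP in billions of chained 2009 dollars.1'
datan = pd.DataFrame({col: [12359.1, 12592.5, 12607.7,
                            12500.0, 12450.0, 12600.0]})

# A row qualifies when its value is below both of the two values
# directly above it.
mask = (datan[col] < datan[col].shift(1)) & (datan[col] < datan[col].shift(2))
hits = datan.loc[mask, col]
print(hits)
```

`shift(1)` and `shift(2)` are the two lag columns from the answer, created inline instead of being stored in the frame.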
Is there a quick way to achieve the output below, please?
Input:
Code Items
123 eq-hk
456 ca-eu; tp-lbe
789 ca-us
321 go-ch
654 ca-au; go-au
987 go-jp
147 co-ml; go-ml
258 ca-us
369 ca-us; ca-my
741 ca-us
852 ca-eu
963 ca-ml; co-ml; go-ml
Output:
Code  eq  ca     go  co  tp
123   hk
456       eu             lbe
789       us
321              ch
654       au     au
987              jp
147              ml  ml
258       us
369       us,my
741       us
852       eu
963       ml     ml  ml
I am again resorting to loops and very ugly code to make this work. Is there an elegant way to achieve it, please?
Thank you!
This is a little bit complicated:
(df.set_index('Code')
.Items.str.split(';',expand=True)
.stack()
.str.split('-',expand=True)
.set_index(0,append=True)[1]
.unstack()
.fillna('')
 .groupby(level=0).sum())  # .sum(level=0) in older pandas; removed in 2.0
0       ca  co  eq  go   tp
Code
123             hk
147         ml      ml
258     us
321                 ch
369   usmy
456     eu              lbe
654     au          au
741     us
789     us
852     eu
963     ml  ml      ml
987                 jp
# We use str.split to un-nest the Items column, then stack and split again
# on '-'; setting the first split column into the index and unstacking
# yields the result.
List comprehensions work better (read: much faster) for string problems like this which require multiple levels of splitting.
df2 = pd.DataFrame([
dict(y.split('-') for y in x.split('; '))
for x in df.Items]).fillna('')
df2.insert(0, 'Code', df.Code)
print(df2)
    Code  ca  co  eq  go   tp
0    123          hk
1    456  eu             lbe
2    789  us
3    321              ch
4    654  au          au
5    987              jp
6    147      ml      ml
7    258  us
8    369  my                  # Should be "us,my"... see below.
9    741  us
10   852  eu
11   963  ml  ml      ml
This does not handle the situation where multiple items with the same key can be present in a row. For that, a slightly more involved solution is needed.
from itertools import chain
v = [x.split('; ') for x in df.Items]
X = pd.Series(df.Code.values.repeat([len(x) for x in v]))
Y = pd.DataFrame([x.split('-') for x in chain.from_iterable(v)])
df2 = pd.concat([X, Y], axis=1, ignore_index=True)
(df2.set_index([0, 1, 3])[2]
.unstack(1)
.fillna('')
.groupby(level=0)
 .agg(lambda x: ','.join(x).strip(',')))
1      ca  co  eq  go   tp
0
123            hk
147        ml      ml
258    us
321                ch
369  us,my
456    eu              lbe
654    au          au
741    us
789    us
852    eu
963    ml  ml      ml
987                jp
import pandas as pd
df = pd.DataFrame([
('123', 'eq-hk'),
('456', 'ca-eu; tp-lbe'),
('789', 'ca-us'),
('321', 'go-ch'),
('654', 'ca-au; go-au'),
('987', 'go-jp'),
('147', 'co-ml; go-ml'),
('258', 'ca-us'),
('369', 'ca-us; ca-my'),
('741', 'ca-us'),
('852', 'ca-eu'),
('963', 'ca-ml; co-ml; go-ml')],
columns=['Code', 'Items'])
# Get item type list from each row, sum (concatenate) the lists and convert
# to a set to remove duplicates
item_types = set(df['Items'].str.findall(r'(\w+)-').sum())
print(item_types)
# {'ca', 'co', 'eq', 'go', 'tp'}
# Generate a column for each item type
df1 = pd.DataFrame(df['Code'])
for t in item_types:
    df1[t] = df['Items'].str.findall(r'%s-(\w+)' % t).apply(lambda x: ''.join(x))
print(df1)
#   Code  ca    tp   eq  co  go
#0   123             hk
#1   456  eu    lbe
#2   789  us
#3   321                     ch
#4   654  au                 au
#5   987                     jp
#6   147                 ml  ml
#7   258  us
#8   369  usmy
#9   741  us
#10  852  eu
#11  963  ml             ml  ml
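On pandas 0.25+, `explode` plus `pivot_table` gives another compact route. This is a sketch on a subset of the data; `aggfunc=','.join` handles the duplicate-key case (Code 369 has two `ca` items):

```python
import pandas as pd

df = pd.DataFrame({'Code': [123, 456, 369],
                   'Items': ['eq-hk', 'ca-eu; tp-lbe', 'ca-us; ca-my']})

# One row per item, then split each item into key/value.
s = (df.assign(Items=df['Items'].str.split('; '))
       .explode('Items'))
s[['key', 'val']] = s['Items'].str.split('-', expand=True)

# Pivot keys into columns, joining duplicates with a comma.
out = (s.pivot_table(index='Code', columns='key', values='val',
                     aggfunc=','.join, fill_value='')
        .reset_index())
print(out)
```

Compared with the `findall` loop, this derives the set of keys implicitly from the data instead of scanning for it up front.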
Problem
I have the following data frame (note values are just to show format):
>>> print df
Country Public Private
Date
2013-01-17 BE 3389
2013-01-17 DK 4532 681
2013-02-21 DE 2453 1752
2013-02-21 IE 5143
2013-02-21 ES 8633 353
2013-03-21 FR 262
2013-03-21 LT 358
I would like to pivot it to show the following format:
Country Country1 Country2
Private Public Private Public
Date
2013-01-17 681 353 262 5143
2013-02-21 149 176 124 1757
2013-03-21 149 176 124 1757
Generate Problem
This will generate the problem
import pandas as pd
data =[['2013-01-17', 'BE',1000,3389],
['2013-01-17', 'IE',5823, 681],
['2013-01-17', 'FR',1000,1752],
['2013-02-17', 'IE',1000,5143],
['2013-02-17', 'FR',1000, 353],
['2013-03-17', 'FR',1000, 262],
['2013-03-17', 'BE',1000, 358]]
df = pd.DataFrame(data,columns=['Date','Country','Public','Private']).set_index('Date')
Attempts
The best I can manage is getting Country and the Data Description the wrong way round:
>>> print df.pivot(index=df.index,columns='Country').fillna('')
Public Private
Country AT BE DE DK
Date
2013-01-17 1000 1000 1000 1000
2013-02-21 1000 1000 1000 1000
2013-03-21 1000 1000 1000 1000
You can use swaplevel to swap them: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.swaplevel.html
df.pivot(index=df.index, columns='Country').fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1)
(In older pandas the last call was sortlevel(axis=1), which has since been removed in favour of sort_index.)
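Putting it together on the sample data from the question (a sketch; `fillna('')` is omitted here so the values stay numeric):

```python
import pandas as pd

data = [['2013-01-17', 'BE', 1000, 3389],
        ['2013-01-17', 'IE', 5823, 681],
        ['2013-02-17', 'IE', 1000, 5143],
        ['2013-02-17', 'FR', 1000, 353],
        ['2013-03-17', 'FR', 1000, 262],
        ['2013-03-17', 'BE', 1000, 358]]
df = (pd.DataFrame(data, columns=['Date', 'Country', 'Public', 'Private'])
        .set_index('Date'))

# Pivot on Country, swap the column levels so Country is on top,
# then sort so each country's Public/Private pair sits together.
out = (df.pivot(columns='Country')
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1))
print(out)
```

After the swap and sort, the columns are a MultiIndex of (Country, measure), which is the layout the question asked for.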