appending values to lists in python

I have the following code.
rushingyards = 0
passingyards = 0
templist = []
combineddf = play.groupby(['GameCode','PlayType']).sum()
combineddf.to_csv('data/combined.csv', sep=',')
combineddff = pd.DataFrame.from_csv('data/combined.csv')
temp = {}
for row in combineddff.itertuples():
    if row[1] in ('RUSH', 'PASS'):
        temp['GameCode'] = row[0]
        if row[1] == 'RUSH':
            temp['Rushingyards'] = row[10]
        else:
            temp['PassingYards'] = row[10]
    else:
        continue
    templist.append(temp)
The head of my combined csv is below.
PlayType PlayNumber PeriodNumber Clock OffenseTeamCode \
GameCode
2047220131026 ATTEMPT 779 19 2220 1896
2047220131026 FIELD_GOAL 351 9 1057 946
2047220131026 KICKOFF 1244 32 4388 3316
2047220131026 PASS 8200 204 6549 14730
2047220131026 PENALTY 1148 29 1481 2372
DefenseTeamCode OffensePoints DefensePoints Down Distance \
GameCode
2047220131026 1896 142 123 NaN NaN
2047220131026 476 52 51 12 17
2047220131026 2846 231 195 NaN NaN
2047220131026 23190 1131 1405 147 720
2047220131026 2842 188 198 19 84
Spot DriveNumber DrivePlay
GameCode
2047220131026 24 NaN NaN
2047220131026 19 49 3
2047220131026 850 NaN NaN
2047220131026 3719 1161 80
2047220131026 514 164 1
I have to check whether the play type is RUSH or PASS and accordingly build a list like the following.
Gamecode rushing_yards passingyards
299004720130829 893 401
299004720130824 450 657
299004720130821 430 357
I am not able to append the values correctly. Every time it runs, every entry ends up with the same values for gamecode, rushing_yards and passingyards. Kindly help.

This is because you are appending a reference to the same object, temp, on every iteration: templist ends up holding many references to one dict, which is why the values are the same for all of them. Put your temp = {} inside the for loop and you should see this issue resolved, as it instantiates a new dict object upon each iteration of the loop.
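A minimal sketch of that fix, keeping the column positions from your code as given (note it still produces one dict per RUSH/PASS row; combining rushing and passing yards into a single row per GameCode would need an extra grouping step):
templist = []
for row in combineddff.itertuples():
    if row[1] not in ('RUSH', 'PASS'):
        continue
    temp = {}  # a fresh dict each iteration, so each append is independent
    temp['GameCode'] = row[0]
    if row[1] == 'RUSH':
        temp['Rushingyards'] = row[10]
    else:
        temp['PassingYards'] = row[10]
    templist.append(temp)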


python pandas how to read csv file by block

I'm trying to read a CSV file, block by block.
CSV looks like:
No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,
Blocks start with No., and data rows follow.
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])  # nrows=1435? (but for the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                #print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Many thanks for any help.
Load your file with pd.read_csv and start a new block each time the value in the first column is No.: taking a cumulative sum over that condition gives every block its own group key. Then use groupby to iterate over the blocks and create a new dataframe from each one.
data = pd.read_csv('data.csv', header=None)
dfs = []
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))
Output:
# First block
>>> dfs[0]
No. time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A ...
0 1 2021/09/12 02:16 235 610 345 997 446 130 129 94 555 274 4 NaN
1 2 2021/09/12 02:17 364 210 371 341 294 87 179 106 425 262 3 NaN
2 1434 2021/09/12 02:28 269 135 372 262 307 73 86 93 512 283 4 NaN
3 1435 2021/09/12 02:29 281 207 688 322 233 75 69 85 663 276 2 NaN
# Second block
>>> dfs[1]
No. time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A ...
0 1 2021/09/12 02:16 255 619 200 100 453 456 4 19 56 23 4 NaN
1 2 2021/09/12 02:17 368 21 37 31 24 8 19 1006 4205 2062 30 NaN
2 1434 2021/09/12 02:28 2689 1835 3782 2682 307 743 256 741 52 23 6 NaN
3 1435 2021/09/12 02:29 2281 2047 6848 3522 2353 755 659 885 6863 26 36 NaN
and so on.
Sorry, I can't find a correct way to fit this into my code:
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        str = f
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', str)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', str)
            metric = metric.group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', str)
            metric = metric.group(1)
        else:
            print("None")
        print("using csv : ", f)
        #df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])
        data = pd.read_csv(zf.open(f), header=None, skiprows=[0,1,2,3,4,5])
        dfs = []
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
        print("here")
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        #formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Thanks for your help.
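One way to combine the two pieces, as a sketch: it assumes every block shares the layout of your sample, that it runs inside your for f in zf.namelist() loop, and that Objects and metric are already set by your regex code above. The key change is running the epoch/tuples processing once per block, inside the loop, instead of only on the last block after it:
date_pattern = '%Y/%m/%d %H:%M'
data = pd.read_csv(zf.open(f), header=None, skiprows=[0,1,2,3,4,5])
tuples = []
for _, block in data.groupby(data[0].eq('No.').cumsum()):
    # first row of each block is its header; the rest are data rows
    df = pd.DataFrame(block.iloc[1:].values, columns=block.iloc[0]).rename_axis(columns=None)
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
    for each_column in list(df.columns)[2:-1]:
        name = each_column.replace("X", '')
        for e in zip(list(df['epoch']), list(df[each_column])):
            tuples.append((f"perf.type.serial.{Objects}.{name}.{metric}", e))
# tuples now covers every block; pickle and sendall as before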

How to iterate through dataframe and pass columns to glm function in Python?

I have a dataframe with 7 variables:
RACA pca pp pcx psc lp csc
0 BARBUDA 1915 470 150 140 87.65 91.41
1 BARBUDA 1345 305 100 110 79.32 98.28
2 BARBUDA 1185 295 80 85 62.19 83.12
3 BARBUDA 1755 385 120 130 80.65 90.01
4 BARBUDA 1570 325 120 120 77.96 87.99
5 CANELUDA 1640 365 110 115 81.38 87.26
6 CANELUDA 1960 525 135 145 89.21 99.37
7 CANELUDA 1715 410 100 120 79.35 99.84
8 CANELUDA 1615 380 100 110 76.32 99.27
9 CANELUDA 2230 500 165 160 90.22 99.56
10 CANELUDA 1570 400 105 95 85.24 83.95
11 COMERCIAL 1815 380 145 90 73.32 92.81
12 COMERCIAL 2475 345 180 140 71.77 105.64
13 COMERCIAL 1870 295 125 125 72.36 97.89
14 COMERCIAL 2435 565 185 160 73.24 107.39
15 COMERCIAL 1705 315 115 125 72.03 96.11
16 COMERCIAL 2220 495 165 150 87.63 96.89
17 PELOCO 1145 250 75 85 50.57 77.90
18 PELOCO 705 85 55 50 38.26 78.09
19 PELOCO 1140 195 80 75 66.15 96.35
20 PELOCO 1355 250 90 95 50.60 91.39
21 PELOCO 1095 220 80 80 53.03 84.57
22 PELOCO 1580 255 125 120 59.30 95.57
I want to fit a GLM for every dependent variable, pca through csc. In R this is quite simple to do, but I don't know how to get it working in Python. I tried to write a for loop that passes the column name to the formula, but so far it hasn't worked:
for column in df:
    col = str(column)
    model = sm.formula.glm(paste(col, "~ RACA"), data=df).fit()
    print(model.summary())
I am using pandas and statsmodels:
import pandas as pd
import statsmodels.api as sm
I imagine it must be so simple, but I sincerely couldn't figure it out yet.
I was able to figure out a solution. I don't know if it's the most efficient or most elegant one, but it gives the results I wanted:
for column in df.loc[:,'pca':'csc']:
    col = str(column)
    formula = col + "~RACA"
    model = sm.formula.glm(formula=formula, data=df).fit()
    print(model.summary())
I am open to suggestions on how I could improve this. Thank you!
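One possible tidy-up, as a sketch against the same df (glm from statsmodels.formula.api is the same callable as sm.formula.glm): store the fitted models in a dict keyed by column, so each summary can be revisited without refitting:
import statsmodels.formula.api as smf

models = {}
for col in df.loc[:, 'pca':'csc']:
    # build the formula string per dependent variable, e.g. "pca ~ RACA"
    models[col] = smf.glm(formula=f"{col} ~ RACA", data=df).fit()
    print(models[col].summary())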

Editing values in a dataframe based on the information in another dataframe

I have one dataframe called _df1 which looks like this. Please note that this is not the entire dataframe, only parts of it.
_df1:
frame id x1 y1 x2 y2
1 1 1363 569 103 241
2 1 1362 568 103 241
3 1 1362 568 103 241
4 1 1362 568 103 241
964 5 925 932 80 255
965 5 925 932 79 255
966 5 925 932 79 255
967 5 924 932 80 255
968 5 924 932 79 255
16 6 631 761 100 251
17 6 631 761 100 251
18 6 631 761 100 251
19 6 631 761 100 251
20 6 631 761 100 251
21 6 631 761 100 251
88 7 623 901 144 123
89 7 623 901 144 123
90 7 623 901 144 123
91 7 623 901 144 123
92 7 623 901 144 123
93 7 623 901 144 123
94 7 623 901 144 123
In the full dataframe there are 108003 rows and 141 unique IDs. An ID represents a specific object, and the ID is repeated in every frame that contains that object. In other words, my data has 141 different objects and 108003 frames. I wrote code to identify frames that contain the same object but are labelled with a different ID. This is saved in another dataframe called _df2, which looks like this. This is also only part of the dataframe, not the entire thing.
_df2:
indexID matchID
4 5
6 7
8 9
12 13
18 19
20 21
.
.
.
The second dataframe shows which IDs have been wrongly classified as a different object: the ID in 'matchID' is actually the same object as 'indexID'. The 'indexID' in _df2 corresponds to 'id' in _df1.
Taking the first line in _df2 as an example, it says that IDs 4 and 5 are the same object. Therefore, I need to change the 'id' value in _df1 of all the frames with 'id' 5 to 4. This is an example of what the final table should look like, since 5 has to be reclassified as 4 and 7 as 6.
Output:
frame id x1 y1 x2 y2
1 1 1363 569 103 241
2 1 1362 568 103 241
3 1 1362 568 103 241
4 1 1362 568 103 241
964 4 925 932 80 255
965 4 925 932 79 255
966 4 925 932 79 255
967 4 924 932 80 255
968 4 924 932 79 255
16 6 631 761 100 251
17 6 631 761 100 251
18 6 631 761 100 251
19 6 631 761 100 251
20 6 631 761 100 251
21 6 631 761 100 251
88 6 623 901 144 123
89 6 623 901 144 123
90 6 623 901 144 123
91 6 623 901 144 123
92 6 623 901 144 123
93 6 623 901 144 123
94 6 623 901 144 123
Use replace with a mapping from matchID to indexID. Note the direction: replace maps keys to values, and you want 5 to become 4, so the matchID values are the keys.
_df1.id = _df1.id.replace(dict(zip(_df2.matchID, _df2.indexID)))
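A miniature of the two frames, with hypothetical rows taken from your sample, to check the direction of the mapping:
import pandas as pd

_df1 = pd.DataFrame({'frame': [964, 88], 'id': [5, 7]})
_df2 = pd.DataFrame({'indexID': [4, 6], 'matchID': [5, 7]})

mapping = dict(zip(_df2.matchID, _df2.indexID))  # {5: 4, 7: 6}
_df1['id'] = _df1['id'].replace(mapping)
print(_df1)  # ids 5 and 7 become 4 and 6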

Renaming a subset of index from a dataframe

I have a dataframe which looks like this
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the above dataframe I need to add an extra column with information about the index. So I made a list, healthy, with all the indices to be labelled h; everything else should be d.
I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it throws the following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT by comment:
There can be integers in the column Geneid, so you can cast it to string with astype.
import numpy as np

healthy = ['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')

# get last column as a one-element list
print(df.columns[-1].split())
['type']

# create new list from the last column plus all columns except the last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']

# reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples whose Geneid is in healthy and set them to 'h'. Supposing your main dataframe is called df, you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'  # .loc avoids chained-assignment pitfalls
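For completeness, the conditional expression that the comprehension in the question was reaching for would look something like this (a sketch; it assumes Geneid is the index, as df.loc[healthy] suggests, and casts it to string to match the healthy list):
df['sampletype'] = ['h' if i in healthy else 'd' for i in df.index.astype(str)]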

Is there any way to keep a PeriodIndex as a series of Periods with a reset_index()?

I've noticed that for a DataFrame with a PeriodIndex, the month reverts to its native Int64 type upon a reset_index(), losing its freq attribute in the process. Is there any way to keep it as a Series of Periods?
For example:
In [42]: monthly
Out[42]:
qunits expend
month store upc
1992-12 1 21 83 248.17
72 3 13.95
78 2 6.28
79 1 5.82
85 5 28.10
87 1 1.87
88 6 11.76
...
1994-12 151 857 12 81.48
858 23 116.15
880 7 44.73
881 13 25.05
883 21 67.25
884 44 190.56
885 13 83.57
887 1 4.55
becomes:
In [43]: monthly.reset_index()
Out[43]:
month store upc qunits expend
0 275 1 21 83 248.17
1 275 1 72 3 13.95
2 275 1 78 2 6.28
3 275 1 79 1 5.82
4 275 1 85 5 28.10
5 275 1 87 1 1.87
6 275 1 88 6 11.76
7 275 1 89 21 41.16
...
500099 299 151 857 12 81.48
500100 299 151 858 23 116.15
500101 299 151 880 7 44.73
500102 299 151 881 13 25.05
500103 299 151 883 21 67.25
500104 299 151 884 44 190.56
500105 299 151 885 13 83.57
500106 299 151 887 1 4.55
Update 6/13/2014
It worked beautifully, but the end result I need is the PeriodIndex values passed on to a grouped DataFrame. I got it to work, but it seems to me that it could be done more compactly. I.e., my code is:
periods_index = monthly.index.get_level_values('month')
monthly.reset_index(inplace=True)
monthly.month = periods_index
grouped = monthly.groupby('month')
moments = pd.DataFrame(monthly.month.unique(), columns=['month'])
for month, group in grouped:
    moments.loc[moments.month==month, 'meanNo0'] = wmean(group[group.relative!=1].avExpend,
                                                         np.log(group[group.relative!=1].relative))
Any further suggestions?
How about this:
periods_index = monthly.index.get_level_values('month')
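Then, as in your update, the round-trip stays compact. A sketch: get_level_values returns the Periods themselves, so assigning them back after reset_index keeps month as a Period series rather than the integer codes:
periods_index = monthly.index.get_level_values('month')
monthly = monthly.reset_index()
monthly['month'] = periods_index  # Periods, not int codes
grouped = monthly.groupby('month')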
