python pandas how to read csv file by block

I'm trying to read a CSV file, block by block.
CSV looks like:
No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,
Blocks start with No., and data rows follow.
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5])  # nrows=1435? (but what about the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                # print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Many thanks for your help,

Load your file with pd.read_csv and start a new block each time the value in the first column is No.. Use groupby to iterate over each block and build a new dataframe.
data = pd.read_csv('data.csv', header=None)
dfs = []
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))
Output:
# First block
>>> dfs[0]
No. time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A ...
0 1 2021/09/12 02:16 235 610 345 997 446 130 129 94 555 274 4 NaN
1 2 2021/09/12 02:17 364 210 371 341 294 87 179 106 425 262 3 NaN
2 1434 2021/09/12 02:28 269 135 372 262 307 73 86 93 512 283 4 NaN
3 1435 2021/09/12 02:29 281 207 688 322 233 75 69 85 663 276 2 NaN
# Second block
>>> dfs[1]
No. time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A ...
0 1 2021/09/12 02:16 255 619 200 100 453 456 4 19 56 23 4 NaN
1 2 2021/09/12 02:17 368 21 37 31 24 8 19 1006 4205 2062 30 NaN
2 1434 2021/09/12 02:28 2689 1835 3782 2682 307 743 256 741 52 23 6 NaN
3 1435 2021/09/12 02:29 2281 2047 6848 3522 2353 755 659 885 6863 26 36 NaN
and so on.

Sorry, I can't find the correct way to adapt this to my code:
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        name = f  # renamed from str to avoid shadowing the built-in
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', name)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', name)
            metric = metric.group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', name)
            metric = metric.group(1)
        else:
            print("None")
        print("using csv : ", f)
        # df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])
        data = pd.read_csv(zf.open(f), header=None, skiprows=[0, 1, 2, 3, 4, 5])
        dfs = []
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
        print("here")
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        # formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
thanks for your help,
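One way to wire the two pieces together (a sketch, untested, reusing the imports and variables already defined above; the each_column.replace("X", '') step is omitted for brevity) is to build tuples across all blocks inside the block loop, so every block is processed rather than only the last dataframe:
# inside the for f in zf.namelist() loop, after data = pd.read_csv(...):
date_pattern = '%Y/%m/%d %H:%M'
tuples = []
for _, block in data.groupby(data[0].eq('No.').cumsum()):
    # rebuild each block with its own header row
    df = pd.DataFrame(block.iloc[1:].values, columns=block.iloc[0]).rename_axis(columns=None)
    df['epoch'] = df['time'].apply(lambda t: int(time.mktime(time.strptime(t, date_pattern))))
    for each_column in list(df.columns)[2:-1]:
        for e in zip(df['epoch'], df[each_column]):
            tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
package = pickle.dumps(tuples, 1)
sock.sendall(struct.pack('!L', len(package)))
sock.sendall(package)
time.sleep(delay)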

Related

Intersection of a DataFrame slice and a list

I'm trying to select a subsection of a dataframe using .loc as such:
for date in months.index:
    labels = list(df.index.values)
    X = df.loc[(date - relativedelta(months=+3)):date.intersection(labels), ['A', 'B']]
    Y = df.loc[(date - relativedelta(months=+3)):date.intersection(labels), 'C']
    months.at[date, 'Prediction'] = forest.fit(X, Y)
I am following the method suggested at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike but am running into the error
AttributeError: 'Timestamp' object has no attribute 'intersection'
Is this an issue because I am using a time-indexed dataframe, because I am intersecting with a slice of the dataframe and not the whole index, or another issue? I tried to convert the timestamps to datetime objects to no avail.
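For what it's worth, the AttributeError simply means that .intersection was called on a Timestamp: in the linked docs pattern, .intersection is a method of an Index, not of a single label. A minimal sketch of the pattern applied to the loop above (assuming df has a DatetimeIndex):
start = date - relativedelta(months=+3)
rows = df.index.intersection(df.loc[start:date].index)  # index labels that fall inside the slice
X = df.loc[rows, ['A', 'B']]
Y = df.loc[rows, 'C']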
# import libraries
import pandas as pd
import numpy as np
import datetime

# get today's date
today = datetime.datetime.now()
print(today)
output:
2020-08-26 20:25:40.480870
Then work out the date you want to start from; 90 days is easier to compute.
# get the date 90 days back
date_from = today - pd.to_timedelta(90, "D")

# create a mock DataFrame
dates_this_year = pd.date_range("2020-01-01", datetime.datetime.now().strftime("%Y-%m-%d"))
mock_values = np.arange(0, len(dates_this_year))
df = pd.DataFrame({"date": dates_this_year, "A": mock_values, "B": mock_values, "C": mock_values})
date_df = df.set_index("date")
date_df
output:
date A B C
2020-01-01 0 0 0
2020-01-02 1 1 1
2020-01-03 2 2 2
2020-01-04 3 3 3
2020-01-05 4 4 4
... ... ... ...
2020-08-22 234 234 234
2020-08-23 235 235 235
2020-08-24 236 236 236
2020-08-25 237 237 237
2020-08-26 238 238 238
Then to slice it, just use the from and to dates.
date_df.loc[date_from:today,["A","B"]]
output:
date A B
2020-05-29 149 149
2020-05-30 150 150
2020-05-31 151 151
2020-06-01 152 152
2020-06-02 153 153
... ... ...
2020-08-22 234 234
2020-08-23 235 235
2020-08-24 236 236
2020-08-25 237 237
2020-08-26 238 238
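Note that 90 days is only an approximation of 3 months; if the exact offset matters, a small variation using dateutil's relativedelta (as in the question) should also work:
from dateutil.relativedelta import relativedelta

date_from = today - relativedelta(months=3)  # exactly 3 calendar months back
date_df.loc[date_from:today, ["A", "B"]]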

Find specific string and output the whole line Python

I am having the following input txt file:
17,21.01.2019,0,0,0,0,E,75,meter tamper alarm
132,22.01.2019,64,296,225,996,A,,
150,23.01.2019,63,353,351,805,A,,
213,24.01.2019,64,245,244,970,A,,
201,25.01.2019,86,297,364,943,A,,
56,26.01.2019,73,678,678,1437,A,,
201,27.01.2019,83,654,517,1212,A,,
117,28.01.2019,58,390,202,816,A,,
69,29.01.2019,89,354,282,961,C,,
123,30.01.2019,53,267,206,852,A,,
I need to make a Python program that parses the file, finds all the lines containing neither A nor C, and writes those lines to a new file.
I'm completely stuck after trying several regexes :( can you help me?
Try
with open('filename') as f:
    for line in f.readlines():
        if 'A' not in line and 'C' not in line:  # neither A nor C anywhere in the line
            print(line)
Or better, as your file content resembles the CSV (Comma-Separated Values) format, use pandas for easier manipulation.
Read the file
import pandas as pd
df = pd.read_csv('filename', header=None, sep=',')
0 1 2 3 4 5 6 7 8
0 17 21.01.2019 0 0 0 0 E 75.0 meter tamper alarm
1 132 22.01.2019 64 296 225 996 A NaN NaN
2 150 23.01.2019 63 353 351 805 A NaN NaN
3 213 24.01.2019 64 245 244 970 A NaN NaN
4 201 25.01.2019 86 297 364 943 A NaN NaN
5 56 26.01.2019 73 678 678 1437 A NaN NaN
6 201 27.01.2019 83 654 517 1212 A NaN NaN
7 117 28.01.2019 58 390 202 816 A NaN NaN
8 69 29.01.2019 89 354 282 961 C NaN NaN
9 123 30.01.2019 53 267 206 852 A NaN NaN
Output
print(df[~df[6].str.contains('A|C', regex=True)])
0 1 2 3 4 5 6 7 8
0 17 21.01.2019 0 0 0 0 E 75.0 meter tamper alarm
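Since the goal is to write those lines to a new file rather than print them, one more step (a sketch, assuming the same df as above) round-trips the filtered frame back to CSV:
# write the non-A/non-C rows to a new file, without header or index
df[~df[6].str.contains('A|C', regex=True)].to_csv('output.txt', header=False, index=False)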
Try:
with open(r'file.txt', 'r') as f:
    for line in f:
        if 'A' not in line and 'C' not in line:
            print(line)

Python - Aggregate on week of the month basis and compare

I am working on a small CSV data set where the columns are indexed by week-of-the-month occurrence. What I want is to aggregate all of the weeks in sequence, excluding the current week (the last column), to compute a weekly average of the remaining data (e.g. averaging the ...10/1 + 11/1 + 12/1... columns to get week 1 data).
The data is available in this format:
char 2019/11/1 2019/11/2 2019/11/3 2019/11/4 2019/11/5 2019/12/1 2019/12/2 2019/12/3 2019/12/4 2019/12/5 2020/1/1
A 1477 1577 1401 773 310 1401 1464 1417 909 712 289
B 1684 1485 1220 894 297 1618 1453 1335 920 772 275
C 37 10 1 3 6 17 6 6 3 2 1
D 2041 1883 1302 1136 376 2175 1729 1167 960 745 278
E 6142 5991 5499 3883 1036 4949 6187 5760 3974 2339 826
F 842 846 684 462 140 789 802 134 386 251 94
This column (2020/1/1) shall later be used to compare with the mean of all aggregate values from week one. The desired output is something like this:
char W1 W2 W3 W4 W5 2020/1/1
A 1439 1520.5 1409 841 511 289
B 1651 1469 1277.5 907 534.5 275
C 27 8 3.5 3 4 1
D 2108 1806 1234.5 1048 560.5 278
E 5545.5 6089 5629.5 3928.5 1687.5 826
F 815.5 824 409 424 195.5 94
Is it possible to use rolling or resample in such a case? Any ideas on how to do it?
I believe you need DataFrame.resample by weeks:
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d')
df = df.resample('W', axis=1).mean()
print (df)
2019-11-03 2019-11-10 2019-11-17 2019-11-24 2019-12-01 \
char 2020/1/1
A 289 1485.000000 541.5 NaN NaN 1401.0
B 275 1463.000000 595.5 NaN NaN 1618.0
C 1 16.000000 4.5 NaN NaN 17.0
D 278 1742.000000 756.0 NaN NaN 2175.0
E 826 5877.333333 2459.5 NaN NaN 4949.0
F 94 790.666667 301.0 NaN NaN 789.0
2019-12-08
char 2020/1/1
A 289 1125.50
B 275 1120.00
C 1 4.25
D 278 1150.25
E 826 4565.00
F 94 393.25
EDIT: If you want to group the first 7 days of each month (and so on) into separate per-month groups, use:
df = df.set_index(['char', '2020/1/1'])
c = pd.to_datetime(df.columns, format='%Y/%m/%d')
df.columns = [f'{y}/{m}/W{w}' for w,m,y in zip((c.day - 1) // 7 + 1,c.month, c.year)]
df = df.groupby(df.columns, axis=1).mean()
print (df)
2019/11/W1 2019/12/W1
char 2020/1/1
A 289 1107.6 1180.6
B 275 1116.0 1219.6
C 1 11.4 6.8
D 278 1347.6 1355.2
E 826 4510.2 4641.8
F 94 594.8 472.4
EDIT1: For grouping by day and year use DatetimeIndex.strftime:
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d').strftime('%d-%Y')
df = df.groupby(df.columns, axis=1).mean()
print (df)
01-2019 02-2019 03-2019 04-2019 05-2019
char 2020/1/1
A 289 1439.0 1520.5 1409.0 841.0 511.0
B 275 1651.0 1469.0 1277.5 907.0 534.5
C 1 27.0 8.0 3.5 3.0 4.0
D 278 2108.0 1806.0 1234.5 1048.0 560.5
E 826 5545.5 6089.0 5629.5 3928.5 1687.5
F 94 815.5 824.0 409.0 424.0 195.5
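If you want the exact W1 ... W5 headers from the question, a small tweak to EDIT1 (a sketch under the same assumptions) groups by day-of-month alone and relabels:
df = df.set_index(['char', '2020/1/1'])
c = pd.to_datetime(df.columns, format='%Y/%m/%d')
df.columns = 'W' + c.day.astype(str)  # 2019/11/1 and 2019/12/1 both become W1
df = df.groupby(df.columns, axis=1).mean()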
Here is a way using groupby:
m= df.set_index(['char', '2020/1/1']).rename(columns=lambda x: pd.to_datetime(x))
m.groupby(m.columns.week,axis=1).mean().add_prefix('W_').reset_index()
char 2020/1/1 W_44 W_45 W_48 W_49
0 A 289 1485.000000 541.5 1401.0 1125.50
1 B 275 1463.000000 595.5 1618.0 1120.00
2 C 1 16.000000 4.5 17.0 4.25
3 D 278 1742.000000 756.0 2175.0 1150.25
4 E 826 5877.333333 2459.5 4949.0 4565.00
5 F 94 790.666667 301.0 789.0 393.25

appending values to lists in python

I have the following code.
rushingyards = 0
passingyards = 0
templist = []
combineddf = play.groupby(['GameCode', 'PlayType']).sum()
combineddf.to_csv('data/combined.csv', sep=',')
combineddff = pd.read_csv('data/combined.csv', index_col=0)  # pd.DataFrame.from_csv is deprecated
temp = {}
for row in combineddff.itertuples():
    if row[1] in ('RUSH', 'PASS'):
        temp['GameCode'] = row[0]
        if row[1] == 'RUSH':
            temp['Rushingyards'] = row[10]
        else:
            temp['PassingYards'] = row[10]
    else:
        continue
    templist.append(temp)
The head of my combined csv is below.
PlayType PlayNumber PeriodNumber Clock OffenseTeamCode \
GameCode
2047220131026 ATTEMPT 779 19 2220 1896
2047220131026 FIELD_GOAL 351 9 1057 946
2047220131026 KICKOFF 1244 32 4388 3316
2047220131026 PASS 8200 204 6549 14730
2047220131026 PENALTY 1148 29 1481 2372
DefenseTeamCode OffensePoints DefensePoints Down Distance \
GameCode
2047220131026 1896 142 123 NaN NaN
2047220131026 476 52 51 12 17
2047220131026 2846 231 195 NaN NaN
2047220131026 23190 1131 1405 147 720
2047220131026 2842 188 198 19 84
Spot DriveNumber DrivePlay
GameCode
2047220131026 24 NaN NaN
2047220131026 19 49 3
2047220131026 850 NaN NaN
2047220131026 3719 1161 80
2047220131026 514 164 1
I have to check whether the playtype is RUSH or PASS and accordingly create a list like the following.
Gamecode rushing_yards passingyards
299004720130829 893 401
299004720130824 450 657
299004720130821 430 357
I am not able to append the values correctly. Every time it runs, it gives identical values of gamecode, rushing_yards and passingyards for every entry. Kindly help.
This is because you are appending a reference to the object temp. You are basically storing many references to the same dict, which is why the values are the same for all of them. Put your temp = {} inside the for loop and the issue resolves, as a new dict object is instantiated on each iteration of the loop.
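A minimal sketch of that fix applied to the loop above:
for row in combineddff.itertuples():
    if row[1] in ('RUSH', 'PASS'):
        temp = {}  # fresh dict each iteration, so earlier entries are not overwritten
        temp['GameCode'] = row[0]
        if row[1] == 'RUSH':
            temp['Rushingyards'] = row[10]
        else:
            temp['PassingYards'] = row[10]
        templist.append(temp)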

Is there any way to keep a PeriodIndex as a series of Periods with a reset_index()?

I've noticed that for a DataFrame with a PeriodIndex, the month reverts to its native Int64 type upon a reset_index(), losing its freq attribute in the process. Is there any way to keep it as a Series of Periods?
For example:
In [42]: monthly
Out[42]:
qunits expend
month store upc
1992-12 1 21 83 248.17
72 3 13.95
78 2 6.28
79 1 5.82
85 5 28.10
87 1 1.87
88 6 11.76
...
1994-12 151 857 12 81.48
858 23 116.15
880 7 44.73
881 13 25.05
883 21 67.25
884 44 190.56
885 13 83.57
887 1 4.55
becomes:
In [43]: monthly.reset_index()
Out[43]:
month store upc qunits expend
0 275 1 21 83 248.17
1 275 1 72 3 13.95
2 275 1 78 2 6.28
3 275 1 79 1 5.82
4 275 1 85 5 28.10
5 275 1 87 1 1.87
6 275 1 88 6 11.76
7 275 1 89 21 41.16
...
500099 299 151 857 12 81.48
500100 299 151 858 23 116.15
500101 299 151 880 7 44.73
500102 299 151 881 13 25.05
500103 299 151 883 21 67.25
500104 299 151 884 44 190.56
500105 299 151 885 13 83.57
500106 299 151 887 1 4.55
Update 6/13/2014
It worked beautifully, but the end result I need is for the PeriodIndex values to be passed on to a grouped DataFrame. I got it to work, but it seems to me that it could be done more compactly. I.e., my code is:
periods_index = monthly.index.get_level_values('month')
monthly.reset_index(inplace=True)
monthly.month = periods_index
grouped = monthly.groupby('month')
moments = pd.DataFrame(monthly.month.unique(), columns=['month'])
for month, group in grouped:
    moments.loc[moments.month == month, 'meanNo0'] = wmean(group[group.relative != 1].avExpend, np.log(group[group.relative != 1].relative))
Any further suggestions?
How about this:
periods_index = monthly.index.get_level_values('month')
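If the goal is just the grouped DataFrame, a possibly more compact route (a sketch, assuming the MultiIndex level is named 'month') is to group on the index level directly, so the Periods never leave the index:
grouped = monthly.groupby(level='month')  # group keys remain Period objects; no reset_index round-trip needed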
