How to update HDF5 table with partial data? - python

I'm wondering how one might update an HDF5 table when one only has partial data? For example, suppose the following df is stored in an HDF5 table.
import pandas as pd
df = pd.DataFrame([
[100,90,80,70,36,45],
[101,78,65,88,55,78],
[92,77,42,79,43,32],
[103,98,76,54,45,65]],
index = pd.date_range(start='2022-01-01', periods=4)
)
df.columns = pd.MultiIndex.from_tuples(
(("mkf", "Open"),
("mkf", "Close"),
("tdf", "Open"),
("tdf","Close"),
("ghi","Open"),
("ghi", "Close"))
)
df
             mkf        tdf        ghi
            Open Close Open Close Open Close
2022-01-01   100    90   80    70   36    45
2022-01-02   101    78   65    88   55    78
2022-01-03    92    77   42    79   43    32
2022-01-04   103    98   76    54   45    65
store = pd.HDFStore('store.h5')
store.append('data', df)
Next, suppose I obtain partial data (e.g. data for mkf and tdf but not ghi).
df1 = pd.DataFrame([
[70,80,90,70],
[91,68,45,88],
[92,47,32,79],
[43,38,77,74]],
index = pd.date_range(start='2022-01-05', periods=4)
)
df1.columns = pd.MultiIndex.from_tuples((("mkf", "Open"),
("mkf", "Close"),
("tdf", "Open"),
("tdf","Close"),
)
)
df1
             mkf        tdf
            Open Close Open Close
2022-01-05    70    80   90    70
2022-01-06    91    68   45    88
2022-01-07    92    47   32    79
2022-01-08    43    38   77    74
How can I update store? I tried the following but got a ValueError:
store.append('data',df1)
ValueError: cannot match existing table structure for [(mkf, Open),(mkf, Close),(tdf, Open),(tdf, Close),(ghi, Open),(ghi, Close)] on appending data
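One possible approach, offered as a sketch rather than a confirmed fix: reindex the partial frame to the stored column layout so the missing ghi columns show up as NaN before appending. This assumes the table was originally written with float columns, since NaN cannot be held in the int64 columns the example creates. The frames below are cut-down stand-ins for df and df1, and the final store.append call is left as a comment because it needs PyTables and an existing store.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ("mkf", "Open"), ("mkf", "Close"),
    ("tdf", "Open"), ("tdf", "Close"),
    ("ghi", "Open"), ("ghi", "Close"),
])

# Stand-in for the frame already in the store (float dtype is an assumption).
stored = pd.DataFrame(
    [[100, 90, 80, 70, 36, 45]],
    index=pd.date_range(start="2022-01-01", periods=1),
    columns=cols,
    dtype="float64",
)

# Stand-in for the partial data: mkf and tdf only, no ghi.
partial = pd.DataFrame(
    [[70, 80, 90, 70]],
    index=pd.date_range(start="2022-01-05", periods=1),
    columns=cols[:4],
)

# Add the missing columns as NaN and match the stored dtype.
aligned = partial.reindex(columns=stored.columns).astype("float64")
print(aligned)

# With a real store, the append would then be:
# store.append('data', aligned)
```

If the existing table really holds int64 columns, it would probably have to be rewritten as float first; that part is not shown here.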

Related

How to iterate pandas Dataframe month-wise to satisfy demand over time

Suppose I have a dataframe df
pd demand mon1 mon2 mon3
abc1 137 46 37 31
abc2 138 33 37 50
abc3 120 38 47 46
abc4 149 39 30 30
abc5 129 33 42 42
abc6 112 30 45 43
abc7 129 43 33 45
I want to satisfy the demand of each pd month-wise. I am generating some random numbers which indicate satisfied demand. For example, for pd abc1, demand is 137, say I have produced 42 units for mon1, but mon1 demand is 46. Hence revised dataframe would be
pd    demand         mon2                                              mon3
abc1  137 - 42 = 95  37 + 4 (unsatisfied demand from previous month)   31
Then it will run for mon2 and so on. In this way, I would like to capture how much demand is satisfied for each pd, including any excess or unsatisfied amount.
My try:
import pandas as pd
import random
mon = ['mon1', 'mon2', 'mon3']
for i in df['pd'].values.tolist():
    t = df.loc[df['pd'] == i, :]
    for m in t.columns[2:]:
        y = t[m].iloc[0]
        n = random.randint(20, 70)
        t['demand'] = t['demand'].iloc[0] - n
I can't work out the exact logic.
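A rough sketch of the carry-forward logic described above, in plain Python so the arithmetic is easy to follow. The function name carry_forward and the production figures 40 and 30 are made up for illustration; only the 42 for mon1 comes from the question. It also assumes that excess production rolls forward the same way a shortfall does, which the question does not state explicitly.

```python
def carry_forward(total_demand, monthly_demand, produced):
    """Track demand month by month, rolling any shortfall or excess forward."""
    remaining = total_demand
    carry = 0  # positive: unsatisfied demand, negative: excess production
    rows = []
    for month_demand, made in zip(monthly_demand, produced):
        required = month_demand + carry  # this month's demand plus earlier carry
        carry = required - made          # what is left over for the next month
        remaining -= made                # overall demand still open
        rows.append((required, made, carry, remaining))
    return rows


# abc1 from the question: demand 137, monthly demand 46 / 37 / 31.
# 42 is the mon1 production from the question; 40 and 30 are invented.
rows = carry_forward(137, [46, 37, 31], [42, 40, 30])
for row in rows:
    print(row)
```

To mimic the original attempt, the produced list could be filled with random.randint(20, 70) per month, and the function applied to each row of the DataFrame.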

How to add list data as the first column of a CSV file that has 256 columns, via Python?

I have a CSV file which has 255 columns and 16,000 rows of data, and I want to add a list containing 16,000 entries as the first column of my CSV file.
The code I tried to use is
import os
import pandas as pd

# Append the name of each file to List
path = 'C:/Users/User/Desktop/Guanlin_CNN1D/CNN1D/0.3 15 and 105 circle cropped'
list = os.listdir(path)
List = []
for a in list:
    List.append(str(a))
## Load the to-be-added CSV file
data = pd.read_csv('C:/Users/User/Desktop/Guanlin_CNN1D/CNN1D/0.3 15 and 105 for toolpath recreatation.csv',sep=',', engine='python' ,header=None)
tempdata = pd.DataFrame(data)
features = tempdata.values[:, 1:]
file_num = tempdata.values[:, 0]
# add the List to first columns of CSV file
Temp = {List,file_num,features}  # braces build a set, and a list cannot be a set member
temp = pd.DataFrame(Temp)
temp
The result shows
TypeError: unhashable type: 'list'
How to rewrite the code?
Thanks in advance!
I think you simply need the DataFrame insert method. It looks like you were trying to build a new DataFrame from a dict (the braces in your code actually create a set, which is why you get the TypeError), but a new DataFrame is not necessary here. The example below inserts a new column at position zero. The number of rows and columns should not be a concern in this case.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=list('ABCDE'))
print(df)
df.insert(0,column='newcol', value=np.random.randint(0, 100, size=(5)))
print()
print(df)
df.to_csv( r'data.csv', index=False, header=True)
will produce this output
    A   B   C   D   E
0  44  47  64  67  67
1   9  83  21  36  87
2  70  88  88  12  58
3  65  39  87  46  88
4  81  37  25  77  72

   newcol   A   B   C   D   E
0       9  44  47  64  67  67
1      20   9  83  21  36  87
2      80  70  88  88  12  58
3      69  65  39  87  46  88
4      79  81  37  25  77  72
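Applied to the situation in the question, a minimal sketch might look like the following. The three-row CSV text and the file names are invented stand-ins for the real 16,000-row file and the os.listdir result, and the column label file_name is just a placeholder; the list must have exactly one entry per row.

```python
import io

import pandas as pd

# Tiny stand-in for the header-less CSV from the question.
csv_text = "1,0.5,0.7\n2,0.1,0.9\n3,0.4,0.2\n"
data = pd.read_csv(io.StringIO(csv_text), header=None)

# Stand-in for the list of file names (one entry per CSV row).
file_names = ["a.png", "b.png", "c.png"]

# Insert the list as the new first column; the existing columns shift right.
data.insert(0, "file_name", file_names)

# Write back without index or header, matching the header-less input.
out = data.to_csv(index=False, header=False)
print(out)
```

Passing a path instead of leaving to_csv without a target would write the result to disk.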

How to get max value from second column & min value from third column in CSV file with no row header in Python

How can I define a function that returns the max value of the second column and the min value of the third column of a CSV file with no header row? The data is shown below.
My code is:
import pandas as pd
def minmaxvalue(filename):
    # some code
    pass
minmaxvalue("my_data.csv")
What should go inside the function to compute the max and min values?
i a b
1 33 99
2 35 100
3 37 101
4 39 102
5 41 103
6 43 104
7 45 105
8 47 106
9 49 107
10 51 108
11 53 109
12 55 110
13 57 111
14 59 112
15 61 113
import pandas as pd
def minmaxvalue(filename):
    # reading from file
    df = pd.read_csv(filename, names=['a', 'b'])
    # returning max and min
    return df['a'].max(), df['b'].min()
minmaxvalue("my_data.csv")
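One way is to loop over the rows yourself (here the argument is an already-loaded DataFrame with columns 'a' and 'b', not a file name):
def minmaxvalue(df):
    maxim = df['a'][0]
    minim = df['b'][0]
    for i in range(0, len(df)):
        # track the largest value seen in column a
        if maxim < df['a'][i]:
            maxim = df['a'][i]
        # track the smallest value seen in column b
        if minim > df['b'][i]:
            minim = df['b'][i]
    return maxim, minim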
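If the file really has no header row, selecting columns by position avoids having to invent names. This is only a sketch: the three sample rows are copied from the data above, and io.StringIO stands in for the real file path.

```python
import io

import pandas as pd


def minmaxvalue(source):
    """Return (max of second column, min of third column) of a header-less CSV."""
    df = pd.read_csv(source, header=None)
    return df.iloc[:, 1].max(), df.iloc[:, 2].min()


# In-memory stand-in for my_data.csv.
sample = io.StringIO("1,33,99\n2,35,100\n3,37,101\n")
result = minmaxvalue(sample)
print(result)
```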

Subtract a constant from a column in a pandas dataframe

I have a dataframe as follows:
year,value
1970,2.0729729191557147
1971,1.0184197388632872
1972,2.574009084167593
1973,1.4986879160266255
1974,3.0246498975934464
1975,1.7876222478238608
1976,2.5631745148930913
1977,2.444014336917563
1978,2.619502688172043
1979,2.268273809523809
1980,2.6086169818316645
1981,0.8452720174091145
1982,1.3158922171018947
1983,-0.12695212493599603
1984,1.4374230626622169
1985,2.389290834613415
1986,2.3489311315924217
1987,2.6002265745007676
1988,1.2623717711036955
1989,1.1793426779313878
I would like to subtract a constant from each of the values in the second column. This is the code I have tried:
df = pd.read_csv(f1, sep=",", header=0)
df2 = df["value"].subtract(1)
However when I do this, df2 becomes this:
70 1.072973
71 0.018420
72 1.574009
73 0.498688
74 2.024650
75 0.787622
76 1.563175
77 1.444014
78 1.619503
79 1.268274
80 1.608617
81 -0.154728
82 0.315892
83 -1.126952
84 0.437423
85 1.389291
86 1.348931
87 1.600227
88 0.262372
89 0.179343
The year becomes only the last two digits. How can I retain all of the digits of the year?
I think the year column is not modified; you only need to assign the subtracted values back:
df["value"] = df["value"].subtract(1)

Python parsing data from a website using regular expression

I'm trying to parse some data from this website:
http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written some code:
import urllib.request
import re
name = re.compile('<td>(.+?)')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
First question : player_id returns the whole url "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that?
(my attempt is described in #player_id_num)
Second question: I am not able to get stat_c when span_class is empty as in "".
Is there a way I can get these resolved? I am not very familiar with RE (regular expressions), I looked up tutorials online but it's still unclear what I am doing wrong.
Very simple using the pandas library.
Code:
import pandas as pd
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print(dfs[3])
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print(dfs[3].head())
Result:
     0                 1    2  3      4     5     6     7     8      9    10 \
0  Pos              Name  Age  T     PO    FI    CO    SY    HR     RA    GL
1    P    George Pacheco   38  R   4858  7484  8090  7888  6777   4353  6979
2    P     David Montoya   34  R   3944  5976  6673  8699  6267   6685  5459
3    P       Robert Cole   34  R   5769  7189  7285  5863  6267   5868  5462
4    P  Juanold McDonald   32  R  69100  5772  4953  4866  5976  67100  5362

     11    12  13       14      15          16
0    AR    EN  RL  Fatigue  Salary         NaN
1  3747  6171  -3     100%     ---  $3,672,000
2  5257  5975  -4      96%      2%  $2,736,000
3  4953  5061  -4      96%      3%  $2,401,000
4  5982  5263  -4     100%     ---  $1,890,000
You can apply whatever cleaning methods you want from here onwards. Code is rudimentary so it's up to you to improve it.
More Code:
import pandas as pd
import itertools
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.
# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)
# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)
# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]
# http://stackoverflow.com/a/3678930/2548721
# Interleave orig and clone: POr, POp, FIr, FIp, ...
# (the it.next() form only works on Python 2)
comb = list(itertools.chain.from_iterable(zip(orig, clone)))
# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]
# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])
df = df[new_header] # Drop the other columns.
print(df.head())
More result:
   Pos              Name  Age  T  POr  POp  FIr  FIp  COr  COp  ...  RAp  GLr \
0    P    George Pacheco   38  R   48   58   74   84   80   90  ...   53   69
1    P     David Montoya   34  R   39   44   59   76   66   73  ...   85   54
2    P       Robert Cole   34  R   57   69   71   89   72   85  ...   68   54
3    P  Juanold McDonald   32  R   69  100   57   72   49   53  ...  100   53
4    P      Trevor White   37  R   61   66   62   64   67   67  ...   38   48

   GLp  ARr  ARp  ENr  ENp  RL  Fatigue      Salary
0   79   37   47   61   71  -3     100%  $3,672,000
1   59   52   57   59   75  -4      96%  $2,736,000
2   62   49   53   50   61  -4      96%  $2,401,000
3   62   59   82   52   63  -4     100%  $1,890,000
4   50   70  100   62   69  -4     100%  $1,887,000
Obviously, what I did instead was separate the Real values from Potential values. Some tricks were used but it gets the job done at least for the first table of players. The next few ones require a degree of manipulation.
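For anyone who still wants the regex route from the question, here is a hedged sketch. The two HTML fragments are invented to match the shape of the question's patterns, since the real page markup is not shown, so the patterns may need adjusting. The ideas: escape the literal `.` and `?` in the URL and capture only the digits; use `[^"]*` so a class attribute may be empty or any length; and match the first span's text with `[^<]+` instead of expecting literal quotes around it.

```python
import re

# Hypothetical fragments shaped like the patterns in the question.
link_html = ('<td><a href="player.asp?playerid=4209661" '
             'onclick="return go()">George Pacheco</a></td>')
stat_html = ('<td class="stat" align="center"><span class="">48</span>'
             '<br><span class="potential">58</span></td>')

# Capture only the digits after playerid=; "." and "?" are escaped.
player_id_num = re.compile(r'<td><a href="player\.asp\?playerid=(\d+)" onclick=')

# [^"]* accepts an empty class as well as a multi-character one.
stat_c = re.compile(
    r'<td class="[^"]+" align="[^"]+"><span class="[^"]*">(.+?)</span>'
    r'<br><span class="[^"]*">'
)
stat_p = re.compile(
    r'<td class="[^"]+" align="[^"]+"><span class="[^"]*">[^<]+</span>'
    r'<br><span class="[^"]*">(.+?)</span></td>'
)

ids = player_id_num.findall(link_html)
current = stat_c.findall(stat_html)
potential = stat_p.findall(stat_html)
print(ids, current, potential)
```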
