I need to iterate over row data in a pandas DataFrame. However, the loop is very slow because I'm working with millions of rows, and I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
Use 400 ft as the step: the first row should change by 400 ft, the second by 800 ft, and so on in multiples of 400. It's like:
row[1] = 27800 + 400
row[2] = 27775 + 800
etc.
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the Pandas .add method:
df['alt'].add((df.index+1)*400)
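For example, a quick sketch on made-up altitude values (the numbers are illustrative, and this assumes the default 0-based RangeIndex):
import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775, 27600]})

# (index + 1) * 400 gives 400, 800, 1200, ... per row.
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
print(df)
#      alt  alt_anomaly
# 0  27800        28200
# 1  27775        28575
# 2  27600        28800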
I'd like to insert empty rows into my dataframe at regular intervals.
I have the following code, which does what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so I add two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5 -- Do I have to use .loc then in arange?)
I also tried this code, but I was struggling to set the starting row and the values to blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
import numpy as np
import pandas as pd

v = np.ndarray(shape=(numberOfRowsYouWant, df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
I think you can use NumPy, as shown above. But if you want to keep your approach, simply convert the NaN values to "":
df.fillna("")
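To address the two follow-up questions (two or more blank rows, and a configurable starting row), here is a minimal sketch of one way to do it with pd.concat; the function name insert_blank_rows and the every/n_blank/start parameters are illustrative, not from the original code:
import pandas as pd

def insert_blank_rows(df, every=4, n_blank=2, start=5):
    """Insert n_blank blank-string rows after every `every` rows, beginning at row `start`."""
    blank = pd.DataFrame([[""] * df.shape[1]] * n_blank, columns=df.columns)
    pieces = [df.iloc[:start]]          # rows before the starting point stay untouched
    for i in range(start, len(df), every):
        pieces.append(df.iloc[i:i + every])
        pieces.append(blank)
    return pd.concat(pieces, ignore_index=True)

# Hypothetical usage with the question's DataFrame:
# result = insert_blank_rows(data_only_trades, every=4, n_blank=2, start=5)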
I am looking to simplify my code a bit when I fill in a Pandas dataframe; for example, this works great to auto-increment a string:
results['Cluster'] = ['Cluster %s' %i for i in range(1, len(results) + 1)]
Result:
Cluster 1
Cluster 2
Cluster 3
etc
How do I write the same kind of elegant code when I refer to values in other dataframes? All I need to do is increment cluster_X and total_cX, rather than refer to each one individually:
results['Item per Population'] = [cluster_1.shape[0]/total_c1,cluster_2.shape[0]/total_c2,cluster_3.shape[0]/total_c3]
Thank you.
ok, I got it myself.
First define
clusters = [cluster_1, cluster_2, ...etc]
totals_pop = [total_c1, total_c2, ...etc]
then:
appended_data = []
for i, x in zip(clusters, totals_pop):
    data = i.shape[0] / x
    appended_data.append(data)
# appended_data now holds one ratio per cluster
results['Car Wash per Population'] = np.array(appended_data)
There might be a more elegant way of doing this with results.loc[...] instead.
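For instance, a slightly more compact version of the same idea as a sketch; the cluster and total objects below are dummy stand-ins for the question's variables, just to make it runnable:
import pandas as pd

# Dummy stand-ins for the question's objects.
cluster_1, cluster_2 = pd.DataFrame({'x': range(10)}), pd.DataFrame({'x': range(4)})
total_c1, total_c2 = 1000, 500
results = pd.DataFrame(index=range(2))

clusters = [cluster_1, cluster_2]
totals_pop = [total_c1, total_c2]

# One ratio per cluster, in the same order as the rows of `results`.
results['Item per Population'] = [c.shape[0] / t for c, t in zip(clusters, totals_pop)]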
I am busy analyzing CSV datalog files and I am struggling with how to speed up some pandas calculations. The following for loop works, but it is not very fast:
for i in range(len(df)):
    if i == 0:
        df.loc[i, "delta_t"] = 3
        df.loc[i, "E_sun"] = 0
    else:
        df.loc[i, "delta_t"] = (df.loc[i, "date_time"] - df.loc[i-1, "date_time"]).total_seconds()
        df.loc[i, "E_sun"] = df.loc[i-1, "E_sun"] + df.loc[i, "delta_t"] * df.loc[i, "E_flow_sun"]
Is there a fast way to do this calculation? The problem is that I am referencing information on different rows. If all the data were on the same row, things would be very easy, for example:
df["column1"] = df["column2"] + df["column3"]
You are looking for diff and cumsum:
df['delta_t'] = df['date_time'].diff().dt.total_seconds().fillna(3)
df['E_sun'] = (df['delta_t'] * df['E_flow_sun']).cumsum()
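Note that the loop forces E_sun on the first row to 0, while the cumsum above also counts the first row's delta_t * E_flow_sun. A sketch that reproduces the loop exactly (same column names as the question):
df['delta_t'] = df['date_time'].diff().dt.total_seconds().fillna(3)

# Zero the first row's contribution so E_sun starts at 0, as in the loop.
contrib = df['delta_t'] * df['E_flow_sun']
contrib.iloc[0] = 0
df['E_sun'] = contrib.cumsum()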
I have a pretty big dataframe of about 1.5 million rows and I am trying to execute the code below in batches of 10,000, then append the results to the "dataset" dataframe. One of the columns, 'subjects', is structured really oddly, so I had to clean it up, but it takes a long time to process. That's why I want to use the k=10,000 batches. Thoughts on the best way to accomplish this?
reuters_set = reuters_set.loc[reuters_set['subjects'].str.contains('P:')]
reuters_set.shape[0]
1590478
reuters_set.subjects.iloc[33] #Example of data in column that needs to be processed
['B:1092', 'B:12', 'B:19', 'B:20', 'B:22', 'B:227', 'B:228', 'B:229', 'B:24', 'G:1', 'G:6', 'G:B1', 'G:K', 'G:S', 'M:1QD', 'M:AV', 'M:B6', 'M:Z', 'R:600058.SS', 'N2:ASIA', 'N2:ASXPAC', 'N2:BMAT', 'N2:BMAT08', 'N2:CMPNY', 'N2:CN', 'N2:EASIA', 'N2:EMRG', 'N2:EQTY', 'N2:IRNST', 'N2:LEN', 'N2:MEMI', 'N2:METWHL', 'N2:MIN', 'N2:MINE', 'N2:MINE08', 'N2:MTAL', 'N2:MTAL08', 'N2:STEE', 'P:4295865030']
import ast
import numpy as np
import pandas as pd

dataset = []
k = 10000
ct = 0

# Testing the first 10,000. It takes really long after this value...
bk = reuters_set.iloc[0:k]
bk.reset_index(inplace=True)
bk['id'] = np.arange(bk.shape[0])
bk['N2'] = ''
bk['P'] = ''
bk['R'] = ''

for index, row in bk.iterrows():
    a = [i.split(':') for i in ast.literal_eval(row['subjects'])]
    b = pd.DataFrame(a)
    b = b.groupby(0, as_index=False).agg({1: 'unique'})
    dict_code = dict(zip(b[0], b[1]))
    if 'N2' in dict_code.keys():
        bk.loc[bk['id'] == index, 'N2'] = str(dict_code['N2'].tolist())
    if 'R' in dict_code.keys():
        bk.loc[bk['id'] == index, 'R'] = str(dict_code['R'].tolist())
    if 'P' in dict_code.keys():
        bk.loc[bk['id'] == index, 'P'] = str(dict_code['P'].tolist())
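For the batching itself, a minimal sketch of one way to walk the full frame in chunks of 10,000 and then stitch the results together (the chunk size and the processed_chunks name are illustrative; the per-chunk cleanup is the loop above):
import pandas as pd

k = 10000
processed_chunks = []

for start in range(0, len(reuters_set), k):
    chunk = reuters_set.iloc[start:start + k].copy()
    # ... run the subject-splitting loop above on `chunk` instead of `bk` ...
    processed_chunks.append(chunk)

dataset = pd.concat(processed_chunks, ignore_index=True)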
I have this data in a CSV:
first, middle, last, id, fte
Alexander,Frank,Johnson,460700,1
Ashley,Jane,Smith,470000,.5
Ashley,Jane,Smith,470000,.25
Ashley,Jane,Smith,470000,.25
Steve,Robert,Brown,460001,1
I need to find the rows of people with the same ID numbers, and then combine the FTEs of those rows into the same line. I'll also need to add 0s for the rows that don't have duplicates. For example (using data above):
first, middle, last, id, fte1, fte2, fte3, fte4
Alexander,Frank,Johnson,460700,1,0,0,0
Ashley,Jane,Smith,470000,.5,.25,.25,0
Steve,Robert,Brown,460001,1,0,0,0
Basically, we're looking at the jobs people hold. Some people work one 40-hour per week job (1.0 FTE), some work two 20-hour per week jobs (.5 and .5 FTEs), some might work 4 10-hour per week jobs (.25, .25, .25, and .25 FTEs), and some may have other combinations. We only get one row of data per employee, so we need the FTEs on the same row.
This is what we have so far. Right now, our current code only works if they have two FTEs. If they have three or four, it just overwrites them with the last two (so if they have 3, it gives us 2 and 3. If they have 4, it gives us 3 and 4).
import csv

f = open('data.csv')
csv_f = csv.reader(f)

dataset = []
for row in csv_f:
    dictionary = {}
    dictionary["first"] = row[2]
    dictionary["middle"] = row[3]
    dictionary["last"] = row[4]
    dictionary["id"] = row[10]
    dictionary["fte"] = row[12]
    dataset.append(dictionary)

def is_match(dict1, dict2):
    return dict1["id"] == dict2["id"]

def find_match(dictionary, dict_list):
    for index in range(0, len(dict_list)):
        if is_match(dictionary, dict_list[index]):
            return index
    return -1

def process_data(dataset):
    result = []
    for index in range(1, len(dataset)):
        data_dict = dataset[index]
        match_index = find_match(data_dict, result)
        id = str(data_dict["id"])
        if match_index == -1:
            result.append(data_dict)
        else:
            (result[match_index])["fte2"] = data_dict["fte"]
    return result

f.close()

for row in process_data(dataset):
    print(row)
Any help would be extremely appreciated! Thanks!
I would say use the pandas library to make this simple. You can use groupby along with aggregation. Below is an example following the aggregation example provided here: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
import pandas as pd
import numpy as np

df = pd.read_csv('filename.csv')
grouped = df.groupby('id')
print(grouped['fte'].agg(np.sum))
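The groupby sum gives one total per id. If the wide fte1..fte4 layout from the question is what's needed, here is a sketch of one way to get it with cumcount and pivot_table (this assumes the sample header above; skipinitialspace handles the spaces after the commas):
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)

# Number each person's FTE entries 1, 2, 3, ... within their id.
df['n'] = df.groupby('id').cumcount() + 1

wide = (df.pivot_table(index=['first', 'middle', 'last', 'id'],
                       columns='n', values='fte')
          .reindex(columns=range(1, 5))   # always produce four slots
          .fillna(0)
          .add_prefix('fte')
          .reset_index())
print(wide)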