I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
The data looks like this:
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair I want to iterate over two series of text values and do three things in the z direction:
1. Calculate the average of one numeric value for all the values with a third specific text value.
2. Sum another numeric value for all the values with the same text value.
3. Write a resultant table of 'x, y, average, sum' to a CSV.
My code does part 3 (albeit very slowly) but doesn't do 1 or 2, or at least the average and sum calculations don't appear in my output.
What have I done wrong and how can I speed it up?
for text1 in text_list1:
    for text2 in text_list2:
        # Get the data into a smaller dataframe
        df = data.loc[(data["textfield1"] == text1) & (data["textfield2"] == text2)]
        # Get the minimum and maximum x and y
        minXw = df['x'].min()
        maxXw = df['x'].max()
        minYw = df['y'].min()
        maxYw = df['y'].max()
        # dictionary for quicker printing
        dict_out = {}
        rows_list = []
        # Make output filename
        filenameOut = text1 + "_" + text2 + "_Values.csv"
        # Start looping through x values
        for x in np.arange(minXw, maxXw, x_inc):
            xcount += 1
            # Start looping through y values
            for y in np.arange(minYw, maxYw, y_inc):
                ycount += 1
                # calculate average and sum
                ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
                # Make dictionary of output values
                dict_out = dict([('text1', text1),
                                 ('text2', text2),
                                 ('text3', df['text3']),
                                 ('x', x - x_inc),
                                 ('y', y - y_inc),
                                 ('ave', ave_val),
                                 ('sum', sum_val)])
                rows_list_c.append(dict_out)
        # Write csv
        columns = ['text1', 'text2', 'text3', 'x', 'y', 'ave', 'sum']
        with open(filenameOut, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=columns)
            writer.writeheader()
            for data in dict_out:
                writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0
It's not really clear what you're trying to do, but here is a starting point.
If you only need to process rows with a specific text3 value, start by filtering out the other rows:
df = df[df.text3=="my_value"]
If at this point you do not need text3 anymore, you can also drop it:
df = df.drop(columns="text3")
Then you process several sub-dataframes and write each of them to its own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1 + "_" + text2 + "_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function. (Note that you don't really need to make it a separate function; you could just as well put the function body in the for loop.)
At this point, if I understand correctly, you want to compute the sum and the average of all rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of each group.
def process(sub_df):
    # drop the text1 and text2 columns since they are in the filename anyway
    out = sub_df.drop(columns=["text1", "text2"])
    # Compute mean and sum
    return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's pretty much it.
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3 == "my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1", "text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1 + "_" + text2 + "_Values.csv")
To do this efficiently in pandas you will need to use groupby, agg and the built-in to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])

for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": np.mean, "float2": sum}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
.groupby(["text1", "text2"])
for (text1, text2), group in groups:
group.groupby(["text3", "x_bin", "y_bin"]) \
.agg({"x": "first", "y": "first", "float1": np.mean, "float2": sum}) \
.to_csv(f"{text1}_{text2}_Values.csv")
Related
I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
# Initial call
init_df = init_df.assign(HighestHigh=lambda x:
                         highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))

# HighestHigh function in lambda call
def highestHigh(date1, ticker, allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns=['DateIdentified', 'Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x, y, allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'], temp_df['Ticker'])]
    return high_list

# Getting high for each ticker
def getHigh(dateidentified, ticker, allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker'] == ticker) & (allp_df['date'] > dateidentified) & (allp_df['date'] <= currDate)), ['high', 'date']]
    hh = allpm_df.iloc[:, 0].max()
    hd = allpm_df.loc[(allpm_df['high'] == hh), 'date']
    hh = round(hh, 2)
    h_list = [hh, hd]
    return h_list

# Split the list into 2 columns, one with the price and the other with the corresponding date
init_df = split_columns(init_df, "HighestHigh")

# The function to split the list elements into different columns
def split_columns(orig_df, col):
    split_df = pd.DataFrame(orig_df[col].tolist(), columns=[col + "Mod", col + "Date"])
    split_df[col + "Date"] = split_df[col + "Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df, split_df], axis=1)
    orig_df = orig_df.drop(col, axis=1)
    orig_df = orig_df.rename(columns={col + "Mod": col})
    return orig_df
There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax()]
This will replace two O(N) operations (finding the maximum in a list, and doing a list lookup using a comparison) with one O(N) operation and one O(1) operation.
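For example, here is a small sketch with made-up numbers (not the real price data) showing the idxmax pattern:
import pandas as pd

# Toy stand-in for allpm_df with the two selected columns
allpm_df = pd.DataFrame({
    "high": [10.5, 12.3, 11.7],
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
})

# idxmax finds the row label of the maximum once; .loc then fetches that row
hh, hd = allpm_df.loc[allpm_df["high"].idxmax()]
print(hh, hd)  # 12.3 2021-01-05 00:00:00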
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would set up your data frame to include a multi-index when you first create it:
index = pd.MultiIndex.from_arrays(arrays=[ticker_symbols, dates], names=['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index=index)
allp_df = allp_df.sort_index(level=0, sort_remaining=True)
This should create a dataframe with a sorted, multi-level index associated with your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date-range by doing this:
allp_df.loc[(ticker, dateidentified):(ticker, currDate)]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.
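Putting the pieces together, here is a minimal self-contained sketch (with made-up tickers and prices, not the asker's real data) of building the sorted multi-index and slicing it:
import pandas as pd

# Hypothetical price data standing in for allprice_df
data = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06",
                            "2021-01-04", "2021-01-05"]),
    "high": [10.5, 12.3, 11.7, 20.1, 19.8],
})

# Build the MultiIndex once and keep it sorted so slicing is fast
allp_df = data.set_index(["Ticker", "date"]).sort_index()

# Slice one ticker over a date range in a single .loc call
ticker = "AAA"
dateidentified = pd.Timestamp("2021-01-04")
currDate = pd.Timestamp("2021-01-06")
window = allp_df.loc[(ticker, dateidentified):(ticker, currDate)]

# The highest high in that window and its date (second level of the index)
hh = window["high"].max()
hd = window["high"].idxmax()[1]
print(hh, hd)  # 12.3 2021-01-05 00:00:00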
Hi all, I have a dataframe of approx. 400k rows with a column of interest. I would like to map each element in the column to a category (LU, HU, etc.). The mapping is obtained from a smaller dataframe whose column names are the categories. The function below, however, runs very slowly for only 400k rows, and I'm not sure why. In the example below it is of course fast for just a few values.
cwp_sector_mapping = {
    'LU': ['C2P34', 'C2P35', 'C2P36'],
    'HU': ['C2P37', 'C2P38', 'C2P39'],
    'EH': ['C2P40', 'C2P41', 'C2P42'],
    'EL': ['C2P43', 'C2P44', 'C2P45'],
    'WL': ['C2P12', 'C2P13', 'C2P14'],
    'WH': ['C2P15', 'C2P16', 'C2P17'],
    'NL': ['C2P18', 'C2P19', 'C2P20'],
}

df_cwp = pd.DataFrame.from_dict(cwp_sector_mapping)
columns = df_cwp.columns

ls = pd.Series(['C2P44', 'C2P43', 'C2P12', 'C2P1'])

temp = list(map(lambda pos: columns[df_cwp.eq(pos).any()][0] if
                columns[df_cwp.eq(pos).any()].size != 0 else 'UN', ls))
Use the next with iter trick to get the first matched column name, falling back to the default value 'UN' if there is no match:
temp = [next(iter(columns[df_cwp.eq(pos).any()]), 'UN') for pos in ls]
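For the sample ls above this returns ['EL', 'EL', 'WL', 'UN'], with 'C2P1' falling back to the 'UN' default.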
I have 2 really big dataframes (let's say A and B). My objective is to make a new dataframe (C) from A and B with an additional Boolean column: True if the row is in B and False if it is not.
I get all the unique identifiers from the smaller one (B) and store them in a list (its size is 73,739,559).
Then I do element matching with the pandas apply function, but it crashes frequently:
df['responsive'] = df.apply(lambda row: row.FullPhoneNumber in FullPhoneNumber, axis = 1)
Now I'm trying the following code:
f_res = open(main_path + "responsive.csv", 'a')
f_irr = open(main_path + "irresponsive.csv", 'a')
res_writer = csv.writer(f_res)
irr_writer = csv.writer(f_irr)

df['responsive'] = False
for row in df.iterrows():
    if row[1]['FullPhoneNumber'] in FullPhoneNumber:
        row[1]['responsive'] = True
        res_writer.writerow(row[1])
    else:
        irr_writer.writerow(row[1])
But it's too slow. I'm looking for something faster, as I have 10+ GB of data.
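One vectorized sketch worth trying (assuming FullPhoneNumber is the list of identifiers taken from B, as in the question) replaces the row loop with isin and writes each partition in a single call:
import pandas as pd

# Set membership is much cheaper than scanning a long list per row
phone_set = set(FullPhoneNumber)

# Vectorized flag instead of iterrows / apply
df["responsive"] = df["FullPhoneNumber"].isin(phone_set)

# Write each partition once instead of row by row
df[df["responsive"]].to_csv(main_path + "responsive.csv", mode="a", index=False, header=False)
df[~df["responsive"]].to_csv(main_path + "irresponsive.csv", mode="a", index=False, header=False)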
I have a CSV file that looks like this:
Build,Avg,Min,Max
BuildA,56.190,39.123,60.1039
BuildX,57.11,40.102,60.200
BuildZER,55.1134,35.129404123,60.20121
I want to get the average, min, and max of each column and have each of these stats be a new row. I exclude the non-numeric column (the Build column) and then run the statistics. I accomplish this by doing:
df = pd.read_csv('fakedata.csv')

columns = []
builds = []
for column in df.columns:
    if df[column].dtype == 'float64':
        columns.append(column)
    else:
        builds.append(column)

save = df[builds]
df = df[columns]
print(df)

df.loc['Min'] = df.min()
df.loc['Average'] = df.mean()
df.loc['Max'] = df.max()
If I were then to write this data to a CSV it would look like:
,Avg,Min,Max
0,56.19,39.123,60.1039
1,57.11,40.102,60.2
2,55.1134,35.129404123,60.20121
Min,55.1134,35.129404123,60.1039
Average,55.8817,37.3709520615,60.1522525
Max,57.11,40.102,60.20121
Which is close to what I want but I want the Build column to be column one again and have the build names exist on top of the Min, Average, Max. Basically this:
Builds,Avg,Min,Max
BuildA,56.19,39.123,60.1039
BuildX,57.11,40.102,60.2
BuildZER,55.1134,35.129404123,60.20121
Min,55.1134,35.129404123,60.1039
Average,55.8817,37.3709520615,60.1522525
Max,57.11,40.102,60.20121
I have attempted to accomplish this by doing:
df.insert(0, 'builds', save)

with open('fakedata.csv', 'w') as f:
    df.to_csv(f)
But this gives me this CSV:
,builds,Avg,Min,Max
0,Build1,56.19,39.123,60.1039
1,Build2,57.11,40.102,60.2
2,Build3,55.1134,35.129404123,60.20121
Min,,55.1134,35.129404123,60.1039
Average,,55.8817,37.3709520615,60.1522525
Max,,57.11,40.102,60.20121
How can I fix this?
IIUC:
df_out = (pd.concat([df.set_index('Build'),
                     df.set_index('Build').agg(['max', 'min', 'mean'])])
            .rename(index={'max': 'Max', 'min': 'Min', 'mean': 'Average'})
            .reset_index())
Output:
index Avg Min Max
0 BuildA 56.1900 39.123000 60.10390
1 BuildX 57.1100 40.102000 60.20000
2 BuildZER 55.1134 35.129404 60.20121
3 Max 57.1100 40.102000 60.20121
4 Min 55.1134 35.129404 60.10390
5 Average 56.1378 38.118135 60.16837
If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc.; the same code can appear with different values (0, 1, or ? for not known).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary? I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns the pattern to me?
(I expect a percentage pattern.) I mean something like: 90% of 1s in column 1, 80% in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you will get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple algorithm that returns it!
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # get the dtype of a data column so we only grab the numeric columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and the number of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary with column names and percentages
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now, if you want to get the percentage of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now, if you want it to do every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name]) + "% of 1 in column " + name)
let me know if you need anything else!