Someone has probably asked this already, but for the life of me I cannot find what I need after some searching; possibly my level of Python is too low.
I saw several questions with answers using globals() and exec(), with comments saying that is a bad idea, while other answers suggest using dictionaries or lists. At this point I am no longer sure what to use here, and any help would be very welcome.
What I need is roughly this:
I have a pandas DataFrame, say called dftest
I'd like to split dftest into, say, 6 parts of similar size
Then I'd like to iterate over them (or possibly parallelise?) and run some steps that call spatial functions with parameters (param0, param1, ... param5) over each row of each df to add more columns, preferably exporting each result to a csv (one part takes a long time to complete, so I wouldn't want to lose the result of each iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat), and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs', min(dfs[j].index), 'to', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one by one something like this:
for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]  # the first element of the tuple is the index label
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    #... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work... but I'm still stuck on the previous step.
PS. I'm using a Jupyter notebook on localhost with Python3.
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the dataframe into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
# n_cpu - count of cores and split chunks
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
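If you would rather stay with plain pandas and keep the per-part CSV files from the question, the split, process, save and concat steps could be sketched roughly as below. This is only a sketch under assumptions: process_chunk is a hypothetical stand-in for the foo.bar / baz calls, the params list and the temporary output directory are made up for illustration, and the loop runs sequentially, not in parallel.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def process_chunk(chunk, param):
    """Hypothetical stand-in for the real per-row spatial functions."""
    out = chunk.copy()  # work on a copy, not on a slice of the original frame
    out["anotherColumn"] = out["type"].str.upper() + "_" + param
    return out


def split_process_concat(df, params, out_dir):
    """Split df into len(params) parts, process and save each, then recombine."""
    out_dir = Path(out_dir)
    # Split the row positions rather than the frame itself, then select with iloc
    position_groups = np.array_split(np.arange(len(df)), len(params))
    results = []
    for j, (positions, param) in enumerate(zip(position_groups, params)):
        result = process_chunk(df.iloc[positions], param)
        result.to_csv(out_dir / f"dfs{j}.csv")  # keep each finished part on disk
        results.append(result)
    # pd.concat accepts the whole list, so nothing has to be spelled out by hand
    return pd.concat(results)


# Toy frame shaped like dftest in the question (index 1..12)
types = ["Tomt", "Lägenhet", "Lägenhet", "Villa", "Radhus", "Villa",
         "Fritidshus", "Villa", "Villa", "Villa", "Villa", "Radhus"]
dftest = pd.DataFrame({"rowid": range(1, 13), "type": types}, index=range(1, 13))

params = [f"param{j}" for j in range(6)]  # assumed: one parameter per part
out_dir = Path(tempfile.mkdtemp())        # assumed location; use your project path
dfresult = split_process_concat(dftest, params, out_dir)
print(dfresult)
```

Because the parts live in a list, pd.concat(results) works the same for 6 or 100 parts. If one part crashes, the CSV files written so far are still on disk.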
Below is my data set.
I want to calculate the average temperature of each station. Ideally I would also remove the zero readings, which are noise.
How can I do it?
[image of the data set]
I have no idea where to start.
I think you can use the standard csv reader to iterate over the rows, collect all temperatures into a list t, and then compute something like sum(t) / len(t).
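A minimal sketch of that csv-reader idea, extended to one list per station, might look like this. The column names ID and temp and the inline sample data are assumptions borrowed from the sample file shown further down in this thread; the real file may be laid out differently.

```python
import csv
import io
from collections import defaultdict


def average_by_station(lines, id_col="ID", temp_col="temp"):
    """Average the non-zero temperatures per station from CSV lines."""
    temps = defaultdict(list)
    for row in csv.DictReader(lines):
        value = float(row[temp_col])
        if value != 0:  # treat zero readings as noise and skip them
            temps[row[id_col]].append(value)
    return {station: sum(t) / len(t) for station, t in temps.items()}


# Assumed sample data; with a real file use: open("your_file.csv", newline="")
sample = io.StringIO(
    "ID,stuff,temp\n"
    "1,3,20\n1,6,20.1\n1,7,21.4\n"
    "2,1,30.2\n2,3,0\n2,2,34\n"
    "3,7,0\n3,6,0\n3,2,14.4\n"
)
averages = average_by_station(sample)
print(averages)
```

Note that the station IDs come back as strings, because the csv module does not convert types.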
Just turn the data into a pandas.DataFrame and use the pandas.DataFrame.groupby method:
import pandas as pd
file = 'your_file.csv'  # replace with the path to your csv file
df = pd.read_csv(file)
df = df.drop(df[df['temp'] == 0].index)
print(df.groupby('ID')[['temp']].mean())
Gives:
temp
ID
1 20.5
2 32.1
3 14.4
Note: the file I used looks like...
ID,stuff,temp
1,3,20
1,6,20.1
1,7,21.4
2,1,30.2
2,3,0
2,2,34
3,7,0
3,6,0
3,2,14.4
If you want to turn those averages into a column, you can create a dictionary that maps each ID to its mean and use it to build a new column in the DataFrame (replace is applied to a copy of the ID column, so the IDs themselves stay unchanged):
mean = df.groupby('ID')[['temp']].mean()  # Store this into a variable
groups = {}
for i in mean.itertuples():  # Iterate over the (ID, mean) pairs
    groups[i[0]] = i[1]
df['avg_temp'] = df['ID'].replace(groups)  # Create a new column
print(df)
Gives:
ID stuff temp avg_temp
0 1 3 20.0 20.5
1 1 6 20.1 20.5
2 1 7 21.4 20.5
3 2 1 30.2 32.1
5 2 2 34.0 32.1
8 3 2 14.4 14.4
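As a possible shortcut for the same result, groupby(...).transform('mean') returns one value per original row, so the dictionary step is not needed. The frame below is rebuilt inline from the sample file above so that the sketch is self-contained.

```python
import pandas as pd

# Same values as the sample file shown above
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "stuff": [3, 6, 7, 1, 3, 2, 7, 6, 2],
    "temp": [20, 20.1, 21.4, 30.2, 0, 34, 0, 0, 14.4],
})

df = df[df["temp"] != 0].copy()  # drop the zero (noise) readings
# transform broadcasts each group's mean back onto that group's rows
df["avg_temp"] = df.groupby("ID")["temp"].transform("mean")
print(df)
```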
I am working on the df below but cannot apply a filter on the percentage field, although it works normally in Excel.
I need to apply the filter condition > 100.00% on that particular field using pandas.
I tried reading the data from HTML, CSV and Excel in pandas, but I cannot apply the condition.
It requires a float conversion, but that does not work with the given data.
I am assuming that the values you have are read as strings in Pandas:
import pandas as pd
data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%', '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
I also used .str.rstrip('%') and .str.replace(',','').astype('float'), and it works fine.
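If the real column also contains entries that cannot be converted (the question mentions that the float conversion fails on the actual data), one hedged variant is to parse into a separate column with pd.to_numeric(..., errors='coerce'), which turns unparseable values into NaN instead of raising. The helper name percent_to_float, the value column and the 'n/a' entry are assumptions made up for this sketch.

```python
import pandas as pd


def percent_to_float(series):
    """Convert strings such as '4,700.00%' to floats; bad values become NaN."""
    cleaned = (
        series.astype(str)
        .str.strip()
        .str.rstrip("%")
        .str.replace(",", "", regex=False)
    )
    return pd.to_numeric(cleaned, errors="coerce")


df = pd.DataFrame({"data": ["4,700.00%", "0.15%", "n/a", "1,200.00%"]})
df["value"] = percent_to_float(df["data"])  # keep the original strings as well
df_filtered = df[df["value"] > 100]         # NaN compares as False and is dropped
print(df_filtered)
```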
I would like to perform the following task. Given two columns (good and bad), I would like to replace rows in those two columns with a running total. Below is an example of the current DataFrame along with the desired DataFrame.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available, but the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or the denominator make those calculations impossible.
import pandas as pd
d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, anytime I encounter a zero for either column, I need to use a running total for the column which is not zero to the next row which has a non-zero value for the column that contained zeros.
Here is the desired output:
dd={'AAA':range(0,16),
    'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
    'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to build a grouping key from a cumsum over the non-zero values, so that each run of zero rows lands in the same group as the next non-zero row. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output: it simply cuts it off. You'd have to add some extra code to catch that case with different logic.
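To see how the cumsum-and-shift grouping key in this answer behaves, here is a small sketch on the first six rows of the question's data. It only illustrates the key construction and does not handle the trailing-zero case.

```python
import pandas as pd

# First six rows of the question's data
good = pd.Series([3, 3, 13, 20, 28, 32])
bad = pd.Series([0, 0, 1, 1, 1, 0])

# bad != 0      -> F F T T T F
# cumsum        -> 0 0 1 2 3 3
# shift(1)      -> NaN 0 0 1 2 3
# fillna(0)     -> 0 0 0 1 2 3
key = (bad != 0).cumsum().shift(1).fillna(0).astype(int)

# Rows with bad == 0 share a key with the next row where bad is non-zero
grouped_good = good.groupby(key).sum()
print(key.tolist())
print(grouped_good.tolist())
```

The first three rows share key 0, so their good values (3 + 3 + 13) collapse into 19. The last row has bad == 0 and no following non-zero row, so it ends up alone in its own group.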
P. Tillmann, I appreciate your assistance with this. I assume more advanced readers will find this code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
import pandas as pd
d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
# DataFrame.append was removed in pandas 2.0, so collect the output rows
# in a list and build the DataFrame once after the loop
crappy_rows=[]
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        # a zero in either column: keep accumulating
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        # the next row has a zero: start a new running total and hold it back
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        # first non-zero row after several zero rows: close the running total
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        # ordinary row: output it unchanged
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
        and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
        and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_rows.append(temp_dict)
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
crappy_fix=pd.DataFrame(crappy_rows)
print(crappy_fix)
I am trying to solve one of Coursera's homework assignments for beginners.
I have read the data and tried to convert it as shown in the code below. I am looking for the frequency distribution of the variables under consideration, which is why I am trying to round the values. I tried several methods, but none gives me what I expect (see below).
import pandas as pd
import numpy as np
# loading the database file
data = pd.read_csv('gapminder-2.csv', low_memory=False)
# number of observations (rows)
print(len(data))
# number of variables (columns)
print(len(data.columns))
# convert_objects(convert_numeric=True) is no longer available in current pandas;
# pd.to_numeric(..., errors='coerce') does the same conversion
sub1 = pd.DataFrame({'income': pd.to_numeric(data['incomeperperson'], errors='coerce'),
                     'alcohol': pd.to_numeric(data['alcconsumption'], errors='coerce'),
                     'suicide': pd.to_numeric(data['suicideper100th'], errors='coerce')})
sub1.apply(pd.Series.round)
income = sub1['income'].value_counts(sort=False)
print(income)
However, I got
285.224449 1
2712.517199 1
21943.339898 1
1036.830725 1
557.947513 1
What I expect:
285 1
2712 1
21943 1
1036 1
557 1
You can use Series.round():
import pandas as pd
ser = pd.Series([1.1,2.1,3.1,5.1])
print(ser)
0 1.1
1 2.1
2 3.1
3 5.1
dtype: float64
From here you can use .round(); the default number of decimals is 0, per the docs.
print(ser.round())
0 1
1 2
2 3
3 5
dtype: float64
To keep the change you need to reassign the result: ser = ser.round().
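Applied to the question, the reassignment could look like the sketch below. The five income values are copied from the output in the question; the frame raw is built inline because the gapminder file is not available here. Note that the expected output in the question (2712, 1036, 557) looks truncated, not rounded, so both variants are shown.

```python
import pandas as pd

# Income values taken from the output shown in the question
raw = pd.DataFrame(
    {"income": [285.224449, 2712.517199, 21943.339898, 1036.830725, 557.947513]}
)

# Rounding: round() returns a new object, so the result must be assigned
rounded = raw.round()
rounded_counts = rounded["income"].value_counts(sort=False)

# Truncating: astype(int) drops the fractional part (NaN must be removed first)
truncated = raw["income"].dropna().astype(int)
truncated_counts = truncated.value_counts(sort=False)

print(rounded_counts)
print(truncated_counts)
```

With rounding, 2712.517199 becomes 2713; with truncation it becomes 2712, which matches what the question lists as expected.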