I have a data frame of features and corresponding years; each feature's value is listed for several different years. I need to compare the value of a feature in a specific year with its value 7 years earlier. So basically I need to define a function that will generate two columns: one with the feature's value for a specific year, and the other with the same feature's value 7 years earlier. How can I do that?
feature  year
value1   2001
value1   2008
value2   1996
...
e.g. I want to compare value1(2008) with value1(2008 - 7), etc.
There should also be some conditional statements, as the year 2000 can't be compared with 2000 - 7 = 1993 when there is no value for the feature in 1993, for example.
Here's a quick solution based on what I understand from your question:
import numpy as np
import pandas as pd

data = {'feature': ['A', 'B', 'C', 'A'],
        'value': [1, 10, 3, 50],
        'year': [2001, 2002, 2003, 2008]}
df = pd.DataFrame(data)

def compFeature(df, f, y):
    old_rows = df[(df.feature == f) & (df.year == (y - 7))]
    # A filtered DataFrame is never None; check whether it is empty instead.
    if not old_rows.empty:
        now = df[(df.feature == f) & (df.year == y)]['value'].iloc[0]
        old = old_rows['value'].iloc[0]
        result = now - old
    else:
        result = np.nan
    return result
This is just to get you started.
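For larger tables, a vectorized self-merge avoids calling a function per row. This is a sketch on the same small sample data (not the asker's original schema): shift a copy of the frame forward by 7 years and left-merge it back.

```python
import pandas as pd

df = pd.DataFrame({'feature': ['A', 'B', 'C', 'A'],
                   'value': [1, 10, 3, 50],
                   'year': [2001, 2002, 2003, 2008]})

# Self-merge: match each row with the same feature seven years earlier.
shifted = df.rename(columns={'value': 'value_7y_ago'})
shifted['year'] = shifted['year'] + 7  # the 2001 row now "lands" on 2008

merged = df.merge(shifted, on=['feature', 'year'], how='left')
merged['change'] = merged['value'] - merged['value_7y_ago']
print(merged)
```

Rows with no data 7 years back simply get NaN, which covers the conditional case mentioned in the question without any explicit if-statements.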
Given the minimal information in the question, this can serve as a solution:
Let's create a function to get data for both years, if available.
def compare(x):
    f1 = df.loc[df['year'] == x, 'feature'].values[0]
    y2 = x - 7
    if y2 in df['year'].unique():
        f2 = df.loc[df['year'] == y2, 'feature'].values[0]
        return (x, f1, y2, f2)
    else:
        return None  # no data 7 years earlier
Apply the function to the year column and assign the result to a new variable:
foo = df['year'].apply(compare)
Create a dataframe of non-null values in foo:
bar = pd.DataFrame(data=list(foo.loc[~foo.isnull()]), columns=['y1', 'f1', 'y2', 'f2'])
This results in four columns for easy comparison. I understand you were looking for a two-column solution, but a four-column layout with the comparative data side by side will also be useful later on.
I want to optimize my code that groups by a +- margin in Python. I want to group my DataFrame of 2 columns ['1', '2'] within a margin of +-1 on column '1' and +-10 on column '2'.
For example, a really simplified overview:
[[273, 10],[274, 14],[275, 15]]
Expected output:
[[273, 10],[274, 14]],[[274, 14],[275, 15]]
My real data is much more complex, with nearly 1 million data points looking like this: 652.125454455.
This kind of code, for example, takes forever and gives no results:
import time
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
print("Random numbers were created")
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index

MARGIN_1 = 1   # identifiers can't start with a digit, so 1_MARGIN is invalid
MARGIN_2 = 10

tic = time.time()
group = []
for index, row in df.iterrows():
    filtered_df = df[(row['1'] - MARGIN_1 < df['1']) & (df['1'] < row['1'] + MARGIN_1) &
                     (row['2'] - MARGIN_2 < df['2']) & (df['2'] < row['2'] + MARGIN_2)]
    group.append(filtered_df[['id', '1']].values.tolist())
toc = time.time()
print(f"for loop: {str(1000 * (toc - tic))} ms")
I also tried
data = df.groupby('1')['2'].apply(list).reset_index(name='irt')
but in this case there is no margin
I tried my best to understand what you wanted, and I arrived at a very slow solution, but at least it's a solution.
import time
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})

dfbl1 = np.sort(df['1'].apply(int).unique())
dfbl2 = np.sort(df['2'].apply(int).unique())

MARGIN1 = 1
MARGIN2 = 10

marg1array = np.array(range(dfbl1[0], dfbl1[-1], MARGIN1))
marg2array = np.array(range(dfbl2[0], dfbl2[-1], MARGIN2))

a = time.perf_counter()
groupmarg1 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)
                                 & (df['2'] > low2) & (df['2'] < upper2)].values.tolist())
print(time.perf_counter() - a)
I also tried to do each loop separately and intersect them, which should be faster, but since we're storing .values.tolist() I couldn't figure out a faster way than the following.
a = time.perf_counter()
groupmarg1 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)])

newgroup = []
for subgroup in groupmarg1:
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        # keep the DataFrame slice here; convert with .values.tolist()
        # after filtering out the empty ones
        newgroup.append(subgroup.loc[(subgroup['2'] > low2) & (subgroup['2'] < upper2)])
print(time.perf_counter() - a)
which runs in ~9mins on my machine.
Oh, and you need to filter out the empty DataFrames; if you want them as values.tolist() you can do that while filtering, like this:
gr2=[grp.values.tolist() for grp in newgroup if not grp.empty]
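The two nested loops above effectively bin both columns on a fixed grid, and pandas can do that binning in a single pass with `groupby` on floored values. This is a sketch (with a smaller sample size than the question's, and fixed non-overlapping bins rather than a true +-margin window around every point, the same simplification the loops above make):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({'1': rng.uniform(300, 1800, n),
                   '2': rng.uniform(0, 7200, n)})

MARGIN1, MARGIN2 = 1, 10

# Assign every row to a (bin1, bin2) grid cell by integer division.
df['bin1'] = (df['1'] // MARGIN1).astype(int)
df['bin2'] = (df['2'] // MARGIN2).astype(int)

# One grouping pass replaces the nested loops over every bin pair;
# empty cells never appear, so no filtering step is needed.
groups = [g[['1', '2']].values.tolist()
          for _, g in df.groupby(['bin1', 'bin2'])]
print(len(groups))
```

Because each row is assigned to exactly one cell, the work is O(n) plus the grouping, instead of scanning the whole frame once per bin pair.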
I have a dataset:
Year Name Value
1 A 10
2 B 20
3 A 25
3 B 10
I want to be able to find how each name has changed over the years. Ideally the result should look like
Name Growth/Year
A (25-10)/(3-1)
B (10-20)/(3-2)
I can build a list of unique names first and then loop through the dataset to find the value change, but is there an easier way to do it?
Try using pandas groupby:
df = pd.DataFrame({'year':[1,2,3,3], 'name':['A','B','A','B'], 'value':[10,20,25,10]})
df.groupby('name').apply(lambda x: (x['value'].iloc[1]-x['value'].iloc[0])/(x['year'].iloc[1]-x['year'].iloc[0]))
>>>
name
A 7.5
B -10.0
dtype: float64
Or to have more flexibility, you could define an aggregation function:
def value_change(x):
    yr1 = min(x['year'])
    yr2 = max(x['year'])
    # Get values corresponding to min and max years in case
    # min and max year rows aren't contiguous
    value1 = x[x['year'] == yr1]['value'].iloc[0]
    value2 = x[x['year'] == yr2]['value'].iloc[0]
    return (value2 - value1) / (yr2 - yr1)

df.groupby('name').apply(value_change)
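If the frame is sorted by year first, the same change-per-year can also be computed without `apply`, using `groupby.agg` with the built-in `first`/`last` aggregations (a sketch on the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'year': [1, 2, 3, 3],
                   'name': ['A', 'B', 'A', 'B'],
                   'value': [10, 20, 25, 10]})

# first/last per group are well-defined once rows are sorted by year
g = df.sort_values('year').groupby('name').agg(['first', 'last'])

growth = ((g[('value', 'last')] - g[('value', 'first')])
          / (g[('year', 'last')] - g[('year', 'first')]))
print(growth)
```

Built-in aggregations like `first` and `last` are vectorized, so this scales better than a Python-level function applied per group.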
So first you want the rows which correspond to the first and last year for each name:
df = pd.DataFrame([[1,"A", 10], [2, "B", 20], [3, "A", 25], [3, "B", 10]], columns=["Year", "Name", "Value"])
first_year = df.groupby("Name").Year.idxmin().sort_index()
last_year = df.groupby("Name").Year.idxmax().sort_index()
I added the sort_index just to be sure that the order lines up, but it might be a good idea to test this to make sure nothing went wrong. Here's how I would test it:
assert (first_year.index == last_year.index).all()
Next, you can use these to extract the values you want and compute the change:
change_value = (df.loc[last_year.values, "Value"].values - df.loc[first_year.values, "Value"].values)
change_time = (df.loc[last_year.values, "Year"].values - df.loc[first_year.values, "Year"].values)
change_over_time = change_value / change_time
Then, you can convert this into a pandas Series if you'd like:
pd.Series(change_over_time, index=first_year.index)
There's probably a tidier way to do this all within the groupby, but I think this way is easier to understand.
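One such tidier variant (a sketch, keeping the answer's idxmin/idxmax idea but moving it inside a single groupby) looks like this:

```python
import pandas as pd

df = pd.DataFrame([[1, "A", 10], [2, "B", 20], [3, "A", 25], [3, "B", 10]],
                  columns=["Year", "Name", "Value"])

def growth(g):
    # Rows for the earliest and latest year within one name.
    first = g.loc[g.Year.idxmin()]
    last = g.loc[g.Year.idxmax()]
    return (last.Value - first.Value) / (last.Year - first.Year)

result = df.groupby("Name").apply(growth)
print(result)
```

This avoids the separate index-alignment step entirely, since each group is handled in isolation.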
I've several hundred pandas DataFrames, and the number of rows is not exactly the same in all of them: some have 600 rows while others have only 540.
What I want to do is this: I have two samples with exactly the same number of DataFrames, and I want to read all the DataFrames (around 2000) from both samples. This is what the data looks like, and I can read the files like this:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
import os
import pandas as pd

# first sample
path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
    df = pd.read_table(path_to_files + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                       names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                       delimiter=r'\s+')
    df = df.sort_values('stlines', ascending=False)
    df = df.drop_duplicates('wave')
    df = df.reset_index(drop=True)
    lst.append(df)

# second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
    df1 = pd.read_table(path_to_files1 + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                        names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                        delimiter=r'\s+')
    df1 = df1.sort_values('stlines', ascending=False)
    df1 = df1.drop_duplicates('wave')
    df1 = df1.reset_index(drop=True)
    lst1.append(df1)
Now the data is stored in lists, and since the number of rows is not the same in all the DataFrames, I can't subtract them directly.
So how can I subtract them correctly? And after that, I want to take the average (mean) of the residuals to make a DataFrame.
You shouldn't use apply. Just use Boolean masking:
mask = df['waves'].between(lower_outlier, upper_outlier)
df[mask].plot(x='waves', y='stlines')
One solution that comes to mind is writing a function that finds outliers based on upper and lower bounds and then slicing the DataFrames based on the outlier index, e.g.
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

def outlier(value, upper, lower):
    """
    Return True if value is within bounds, i.e. not an outlier
    """
    in_bounds = (value <= upper) and (value >= lower)
    return in_bounds

# Boolean mask marking the non-outlier rows of df1's wave column
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return df2 without the rows at the outlier index
df2[outlier_index]
# Return df1 without the rows at the outlier index
df1[outlier_index]
For a project, I want to create a script that allows the user to enter values (like a value in centimetres) multiple times. I had a While-loop in mind for this.
The values need to be stored in a dataframe, which will be used to generate a graph of the values.
Also, there is no maximum number of entries that the user can enter, so the names of the variables that hold the values have to be generated with each entry (such as M1, M2, M3…Mn). However, the dataframe will only consist of one row (only for the specific case that the user is entering values for).
So, my question boils down to this:
How do I create a dataframe (with pandas) where the script generates its own column name for a measurement, like M1, M2, M3, …Mn, so that all the values are stored.
I can't access my code right now, but I have created a While-loop that allows the user to enter values; I'm just stuck on the dataframe and columns part.
Any help would be greatly appreciated!
I agree with @mischi; without additional context, pandas seems like overkill, but here is an alternative method to create what you describe...
This code proposes a method to collect the values using a while loop and input() (your while loop is probably similar).
colnames = []
inputs = []
counter = 0

while True:
    value = input('Add a value: ')
    if value == 'q':  # provides a way to leave the loop
        break
    else:
        key = 'M' + str(counter)
        counter += 1
        colnames.append(key)
        inputs.append(value)

from pandas import DataFrame

df = DataFrame(inputs, colnames)  # creates a DataFrame with a single
                                  # column, indexed by colnames
df = df.T                         # transpose so the index entries
                                  # become the column names
df.index = ['values']             # sets the name of your row
print(df)
The output of this script looks like this...
Add a value: 1
Add a value: 2
Add a value: 3
Add a value: 4
Add a value: q
M0 M1 M2 M3
values 1 2 3 4
pandas seems a bit of overkill, but to answer your question:
Assuming you collect numerical values from your users and store them in a list:
import numpy as np
import pandas as pd

# np.random.random_integers is deprecated; randint's upper bound is exclusive
values = np.random.randint(0, 11, 10)
print(values)
array([1, 5, 0, 1, 1, 1, 4, 1, 9, 6])
columns = {}
column_base_name = 'Column'

for i, value in enumerate(values):
    columns['{:s}{:d}'.format(column_base_name, i)] = value

print(columns)
{'Column0': 1,
'Column1': 5,
'Column2': 0,
'Column3': 1,
'Column4': 1,
'Column5': 1,
'Column6': 4,
'Column7': 1,
'Column8': 9,
'Column9': 6}
df = pd.DataFrame(data=columns, index=[0])
print(df)
Column0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 \
0 1 5 0 1 1 1 4 1
Column8 Column9
0 9 6
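The loop that builds the column dict can also be collapsed into a single DataFrame constructor call, naming the columns M1…Mn as the question asks (a sketch with a hypothetical list of collected values):

```python
import pandas as pd

values = [1, 5, 0, 1]  # whatever the user entered, collected in a list

# One row, with generated column names M1..Mn
df = pd.DataFrame([values],
                  columns=[f'M{i + 1}' for i in range(len(values))],
                  index=['values'])
print(df)
```

Wrapping `values` in an outer list makes it a single row, so the frame grows columns (not rows) as more measurements arrive.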
I have a large dataset stored as a pandas panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18','2014-10-20',freq='H')
df2 = pd.DataFrame(index = dates2)
n2 = len(dates2)
df2.loc[:,'a'] = np.random.uniform(3,10,n2)
df2.loc[:,'b'] = np.random.uniform(0.9,1.2,n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
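Note that `pd.Panel` was deprecated in pandas 0.20 and removed in 0.25; the same per-dataset count can be done today by concatenating a dict of DataFrames into a MultiIndex frame. A sketch, reusing the question's sample data setup:

```python
import numpy as np
import pandas as pd

dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='h')
df1 = pd.DataFrame({'a': np.random.uniform(3, 10, len(dates1)),
                    'b': np.random.uniform(0.9, 1.2, len(dates1))},
                   index=dates1)

dates2 = pd.date_range('2014-10-18', '2014-10-20', freq='h')
df2 = pd.DataFrame({'a': np.random.uniform(3, 10, len(dates2)),
                    'b': np.random.uniform(0.9, 1.2, len(dates2))},
                   index=dates2)

# Stack the datasets under an outer index level instead of a Panel.
combined = pd.concat({'First_dataset': df1, 'Second_dataset': df2},
                     names=['dataset'])

# Count b-values below 1.0, separately per dataset.
counts = (combined['b'] < 1.0).groupby(level='dataset').sum()
print(counts)
```

The boolean-Series-then-sum idiom replaces the per-item loop and keeps each dataset's count labeled by its key.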
Thanks for thinking along with me, guys, but I managed to figure out a surprisingly easy solution after many hours of attempts. I thought I should share it in case someone else is looking for something similar.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)