I have a dataframe like this:
userName _2643698_1 _2643699_1 _2643700_1 _2643701_1 _2643702_1
_test2 5.0 4.8 3.75 3.6 2.2
_test3 4.0 5.0 4.40 5.0 5.0
_test4 5.0 4.4 5.00 5.0 4.0
Three unique users, 5 unique columns that correspond to the users, and a unique score per column/per user.
I need to feed this data into a patch request with this logic:
Per username, update each key (column title) with the score for that user.
Example:
patch = change_data(userName, colId, score)
The goal is to update the data for all three users, each of whom has a score in the same 5 columns (column headers like _2643698_1), with the score that user has in that column.
The real dataset I'm wrestling with has 78 users and 14 unique columns with scores for each user.
I have been playing around with a lot of options suggested on the web to get the logic I need as efficiently as possible, and any suggestions would be greatly appreciated.
Thank you.
Use melt():
new_df = pd.melt(df,
                 id_vars='userName',
                 var_name='colId',
                 value_vars=[c for c in df.columns if c != 'userName'])
So new_df looks like this
userName colId value
0 _test2 _2643698_1 5.00
1 _test3 _2643698_1 4.00
2 _test4 _2643698_1 5.00
3 _test2 _2643699_1 4.80
4 _test3 _2643699_1 5.00
5 _test4 _2643699_1 4.40
6 _test2 _2643700_1 3.75
7 _test3 _2643700_1 4.40
8 _test4 _2643700_1 5.00
9 _test2 _2643701_1 3.60
10 _test3 _2643701_1 5.00
11 _test4 _2643701_1 5.00
12 _test2 _2643702_1 2.20
13 _test3 _2643702_1 5.00
14 _test4 _2643702_1 4.00
Then you can iterate over new_df and call change_data on each row:
for row in new_df.itertuples(index=False):
    patch = change_data(row.userName, row.colId, row.value)
    # do something with patch
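If you prefer to build all of the payloads up front (for example, to log or batch them before sending), to_dict('records') gives the melted rows as plain dicts; this is just a sketch on top of the new_df built above:
records = new_df.to_dict('records')
# [{'userName': '_test2', 'colId': '_2643698_1', 'value': 5.0}, ...]
patches = [change_data(r['userName'], r['colId'], r['value']) for r in records]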
Thanks for taking the time to read this question.
I am working with time series data that is reported weekly. I am trying to calculate the minimum value of each row over the previous 3 years, which I have done using the code below. Since the data is reported weekly, for each row that is the minimum over the preceding 156 rows (3 years). The column Spec_Min holds that 3-year minimum for each row.
However, halfway through my data it begins to be reported twice a month, but I still need the minimum over 3 years, so it is no longer 156 rows back. Is there a simpler way of doing this?
Perhaps doing it by date rather than by row count, but I am not sure how to do that.
df1['Spec_Min'] = df1['Spec_NET'].rolling(156).min()
df1
Date Spec_NET Hed_NET Spec_Min
1995-10-31 9.0 -13.5 -49.7
1995-11-07 11.9 -23.5 -49.7
1995-11-14 9.8 -19.4 -49.7
1995-11-21 9.7 -25.4 -49.7
1995-11-28 10.4 -20.3 -49.7
1995-12-05 1.6 -15.3 -49.7
1995-12-12 -17.0 14.2 -49.7
1995-12-19 -16.6 15.2 -49.7
1995-12-26 4.7 -15.2 -49.7
1996-01-02 5.3 -22.7 -49.7
1996-01-16 7.3 -21.0 -49.7
1996-01-23 1.3 -20.4 -49.7
Pandas lets you use a datetime-aware rolling window. You just need to express the window length in days (365 * 3 for 3 years). Using your provided sample DataFrame:
df['Spec_Min'] = df.rolling(f'{365 * 3}D', on='Date')['Spec_NET'].min()
print(df)
Date Spec_NET Hed_NET Spec_Min
0 1995-10-31 9.0 -13.5 9.0
1 1995-11-07 11.9 -23.5 9.0
2 1995-11-14 9.8 -19.4 9.0
3 1995-11-21 9.7 -25.4 9.0
4 1995-11-28 10.4 -20.3 9.0
5 1995-12-05 1.6 -15.3 1.6
6 1995-12-12 -17.0 14.2 -17.0
7 1995-12-19 -16.6 15.2 -17.0
8 1995-12-26 4.7 -15.2 -17.0
9 1996-01-02 5.3 -22.7 -17.0
10 1996-01-16 7.3 -21.0 -17.0
11 1996-01-23 1.3 -20.4 -17.0
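One assumption in the snippet above is that Date is already a datetime column; if it was read in as strings, convert it first (shown here for completeness):
df['Date'] = pd.to_datetime(df['Date'])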
Try something like this:
(if your index is already a DatetimeIndex, skip the first two lines)
df.set_index('Date', inplace=True, drop=True)
df.index = pd.to_datetime(df.index)
# resample your dataframe to weekly frequency, and interpolate missing values
conformed = df.resample('W-MON').mean().interpolate(method='nearest')
n_weeks = 156  # the length of the rolling window (3 years of weekly data)
result = conformed.rolling(n_weeks).min()
Note: you mention that you want the minimum of each row, but it looks like you are actually calculating a rolling minimum down each column...
Suppose I have a dataframe (approx. 10 columns) and I'm focused on three of its columns, say 'A', 'B' & 'C'.
Column 'A' holds a continuous value within some range, e.g. the price of an item between 5-20 bucks.
Column 'B' is categorical, e.g. it has two categories, 'Old' and 'New'.
Column 'C' is a unique ID for the item.
My goal is to find the top 10 items ranked by price, with the ranking done separately for each category in column 'B'.
The result should be a plot (Seaborn/Matplotlib): a barplot showing the top IDs from column C, each bar showing that item's price from column A, sorted from higher price to lower price, with a separate set of bars FOR EACH CATEGORY FROM COLUMN B.
Could someone please help with the code in Python using the Seaborn/Matplotlib libraries? Example table below:
A B C
0 5.0 Old A001
1 6.2 New A002
2 10.0 Old A003
3 19.6 Old A004
4 12.0 Old A005
5 11.0 New A006
6 7.0 New A007
7 8.0 Old A008
8 7.0 New A009
9 5.0 New A010
10 17.0 Old A011
11 8.0 Old A012
12 12.0 Old A013
13 13.0 New A014
14 15.0 New A015
15 9.0 Old A016
16 9.0 New A017
17 10.0 Old A018
New answer: top 5 per group
df2 = df.loc[df.groupby('B', group_keys=False)['A'].nlargest(5).index]
df2.set_index('C')['A'].plot.bar(color=df2['B'].map({'New': 'r', 'Old': 'b'}).values)
output: (bar chart not shown)
Old answer
IIUC, you want the top ten items in A and sort the output by B and A:
df2 = df.loc[df['A'].nlargest(10).index].sort_values(by=['B', 'A'])
output:
A B C
5 11.0 New A006
13 13.0 New A014
14 15.0 New A015
15 9.0 Old A016
2 10.0 Old A003
17 10.0 Old A018
4 12.0 Old A005
12 12.0 Old A013
10 17.0 Old A011
3 19.6 Old A004
Then plot using:
df2.set_index('C')['A'].plot.bar(color=df2['B'].map({'New': 'r', 'Old': 'b'}).values)
output: (bar chart not shown)
Below is the basic code which fulfilled my requirement; later, however, I used Seaborn's barplot instead (a sketch of that is shown after this code).
import matplotlib.pyplot as plt
old = df[df['B']=='Old'].sort_values(by=['A'], ascending=False).head(5)
new = df[df['B']=='New'].sort_values(by=['A'], ascending=False).head(5)
fig, a = plt.subplots(1, 2, figsize=(4,4))
old.plot.bar('C', ax=a[0])
new.plot.bar('C', ax=a[1])
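For reference, a minimal sketch of what the Seaborn version might look like (assuming the same A/B/C column names; selecting with nlargest first should put the bars from highest to lowest price, since barplot keeps the order of appearance for string categories):
import matplotlib.pyplot as plt
import seaborn as sns

# top 5 items per category, highest price first
old = df[df['B'] == 'Old'].nlargest(5, 'A')
new = df[df['B'] == 'New'].nlargest(5, 'A')

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.barplot(data=old, x='C', y='A', ax=axes[0])
sns.barplot(data=new, x='C', y='A', ax=axes[1])
axes[0].set_title('Old')
axes[1].set_title('New')
plt.tight_layout()
plt.show()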
I have two dataframes with the same headers.
df1:
       Date  prix moyen  mini  maxi  H-Value  C-Value
0  17/09/20           8     6     9      122  2110122
1  15/09/20           8     6     9      122  2110122
2  10/09/20           8     6     9      122  2110122
and df2:
       Date  prix moyen  mini   maxi  H-Value   C-Value
1  07/09/17        1.80  1.50   2.00      170   3360170
1  17/09/20        8.00  6.00   9.00      122   2110122
2  17/09/20        9.00  8.00  12.00      122   2150122
3  17/09/20       10.00  8.00  12.00      122  14210122
I want to compare the two dataframes along 3 parameters (Date, H-Value and C-Value), identify the new values present in df2 (values which do not occur in df1), and then append them to df1.
I am using
df_unique = df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']) )].dropna().reset_index(drop=True)
and it is not correctly identifying the new values in df2; the resulting table picks up some of them but not others.
Where am I going wrong?
What is your question?
In [4]: df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']))].dropna().reset_index(drop=True)
Out[4]:
Date prix moyen mini maxi H-Value C-Value
0 1 07/09/17 1.8 1.5 2.0 170 3360170
1 2 17/09/20 9.0 8.0 12.0 122 2150122
2 3 17/09/20 10.0 8.0 12.0 122 14210122
These are all rows in df2 that are not present in df1. Looks good to me...
I was actually able to solve the problem. The issue was not the command used to compare the two datasets, but rather that one of the columns in df2 had a data format different from the same column in df1, making a direct comparison impossible.
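For example (a sketch, assuming the mismatch was in the Date column, parsed as strings in one frame and as datetimes in the other), normalising the dtype on both sides before comparing avoids that:
# hypothetical fix for a dtype mismatch on Date; dayfirst because the dates look like DD/MM/YY
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
df_unique = df2[~(df2['Date'].isin(df1['Date'])
                  & df2['H-Value'].isin(df1['H-Value'])
                  & df2['C-Value'].isin(df1['C-Value']))].reset_index(drop=True)
df1 = pd.concat([df1, df_unique], ignore_index=True)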
Here's what I'd try:
df1 = pd.concat([
    df1,
    df2[~df2.set_index(['Date', 'H-Value', 'C-Value']).index
            .isin(df1.set_index(['Date', 'H-Value', 'C-Value']).index)]
])
I have a dataframe which looks like this:
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
What I need to do is calculate new rows from the existing ones, based on each row's values:
DESIRED OUTPUT
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
Quartz_v 5.06 8.69 8.84 10.69 2.648
Plagioclase_v ...
So basically the process has the following steps:
1) Define the new row, for example Quartz_v.
2) Perform the following calculation: Quartz_v = each column value of Quartz divided by the Density value of Quartz.
I have already loaded the data as two dataframes, the density and mineral ones, and merged them, so that each mineral has the correct density in front of it.
Use
DataFrame.div with axis=0 to perform division,
rename to rename the index, and
append to concatenate the result to the original (you can also use pd.concat instead).
d = df['Density']
result = df.append(df.div(d, axis=0).assign(Density=d).rename(lambda x: x+'_v'))
result
1 2 3 4 Density
Mineral
Quartz 13.400000 23.000000 23.400000 28.300000 2.648
Plagioclase 5.200000 8.200000 8.500000 11.700000 2.620
K-feldspar 2.300000 2.400000 2.600000 3.100000 2.750
Quartz_v 5.060423 8.685801 8.836858 10.687311 2.648
Plagioclase_v 1.984733 3.129771 3.244275 4.465649 2.620
K-feldspar_v 0.836364 0.872727 0.945455 1.127273 2.750
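Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the same result can be built with pd.concat:
d = df['Density']
result = pd.concat([df, df.div(d, axis=0).assign(Density=d).rename(lambda x: x + '_v')])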
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
    pd.concat([row]*3, ignore_index=True)
Thank you!
It's a wide-to-long problem:
pd.wide_to_long(df, 'Meas', i=['Location', 'Nominal'], j='drop').reset_index().drop(columns='drop')
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1
Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
['Meas1', 'Meas2', 'Meas3'],
value_name = 'Meas')
.drop('variable', axis=1)
.sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
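If you also want Meas as the first column, as in the requested output, a set_index/stack sketch (column names as above) gets there directly; the unnamed stacked level comes back as 'level_2', which we drop:
new_df = (df.set_index(['Location', 'Nominal'])
            .stack()
            .reset_index(name='Meas')
            .drop(columns='level_2')
            [['Meas', 'Location', 'Nominal']])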