I have a question about how to select a different column (creating a new series) based on another column's value. The raw data is as follows:
DEST_ZIP5 EXP_EDD_FRC_DAY GND_EDD_FRC_DAY \
0 00501 5 6
1 00544 5 6
2 01001 4 8
3 01001 4 8
4 01001 4 8
EXP_DAY_2 EXP_DAY_3 EXP_DAY_4 EXP_DAY_5 ... \
0 0.0 1.00 1.00 1.0 ...
1 0.0 1.00 1.00 1.0 ...
2 0.0 0.85 1.00 1.0 ...
3 0.0 1.00 1.00 1.0 ...
4 0.0 0.85 0.85 1.0 ...
GND_DAY_3 GND_DAY_4 GND_DAY_5 GND_DAY_6 GND_DAY_7 GND_DAY_8 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 0.0 0.0 0.16 0.33 0.83 1.00
3 0.0 0.0 0.00 0.14 0.71 0.85
4 0.1 0.1 0.20 0.40 0.40 0.60
I want two new series that pick up the numeric value from the corresponding column. (For the first row, EXP_EDD_FRC_DAY = 5, so return the value of df['EXP_DAY_5']; GND_EDD_FRC_DAY = 6, so return the value of df['GND_DAY_6'].) The expected result:
DEST_ZIP5 EXP_percentage GND_percentage
0 00501 1.0 NaN
1 00544 1.0 NaN
2 01001 1.0 1.00
3 01001 1.0 0.85
4 01001 0.85 0.60
I found the lookup function, but I'm not sure how to use it.
Thank you very much
IIUC:
c_exp = df['EXP_EDD_FRC_DAY'].astype(str).radd('EXP_DAY_')
c_gnd = df['GND_EDD_FRC_DAY'].astype(str).radd('GND_DAY_')
df['EXP_percentage'] = pd.Series(df.lookup(df.index, c_exp), df.index)
df['GND_percentage'] = pd.Series(df.lookup(df.index, c_gnd), df.index)
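Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, one equivalent (a sketch using positional indexing on the underlying 2-D array, reusing c_exp and c_gnd from above) is:
import numpy as np

rows = np.arange(len(df))
vals = df.to_numpy()
# Pair each row position with the positional index of that row's target column.
exp_pct = pd.Series(vals[rows, df.columns.get_indexer(c_exp)], index=df.index)
gnd_pct = pd.Series(vals[rows, df.columns.get_indexer(c_gnd)], index=df.index)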
I made a mock-up example to illustrate my problem; naturally, what I am actually working with is far more complex. Reading the example will make everything easier to understand, but in short, my goal is to use the row reference values of one dataframe to set the values of a new column in another dataframe. Taking my example: I want to create a new column in df1 named z1, formed by looking up the values of x1 against the y2 values of df2.
import numpy as np
import pandas as pd

x1 = np.array([])
for i in np.arange(0, 15, 3):
    x1i = np.repeat(i, 3)
    x1 = np.append(x1, x1i)
y1 = np.linspace(0, 1, len(x1))

x2 = np.arange(0, 15, 3)
y2 = np.linspace(0, 1, len(x2))

df1 = pd.DataFrame([x1, y1]).T
df2 = pd.DataFrame([x2, y2]).T
df1.columns = ['x1', 'y1']
df2.columns = ['x2', 'y2']
So, we have that df1 is:
x1 y1
0 0.0 0.000000
1 0.0 0.071429
2 0.0 0.142857
3 3.0 0.214286
4 3.0 0.285714
5 3.0 0.357143
6 6.0 0.428571
7 6.0 0.500000
8 6.0 0.571429
9 9.0 0.642857
10 9.0 0.714286
11 9.0 0.785714
12 12.0 0.857143
13 12.0 0.928571
14 12.0 1.000000
and df2 is:
x2 y2
0 0.0 0.00
1 3.0 0.25
2 6.0 0.50
3 9.0 0.75
4 12.0 1.00
What I would like to achieve is:
x1 y1 z1
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
You can use map for this.
df1['z'] = df1['x1'].map(df2.set_index('x2')['y2'])
x1 y1 z
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
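For reference, map with a Series argument treats that Series as an index-to-value mapping, which is why df2.set_index('x2')['y2'] works as a lookup table here. If you prefer an explicit join, a left merge should give the same column (a sketch, using the frames defined above):
df1['z'] = df1.merge(df2, left_on='x1', right_on='x2', how='left')['y2']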
I have 6000 rows and 8 columns, where 'Date' acts like an index (or I can reset the index so it becomes the first column, with string type). I need to extract the list of 'Lake_Level' values where the date of a record is the second or seventh day of a month (and provide the top 3 and bottom 3 values of the 'Lake_Level' feature). Please show me how to do it. Thank you in advance.
Date Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Temp Lake_Level Flow_Rate
03/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
04/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
05/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
06/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
07/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
26/06/2021 0.0 0.0 0.0 0.0 0.0 22.50 250.85 0.60
27/06/2021 0.0 0.0 0.0 0.0 0.0 23.40 250.84 0.60
28/06/2021 0.0 0.0 0.0 0.0 0.0 21.50 250.83 0.60
29/06/2021 0.0 0.0 0.0 0.0 0.0 23.20 250.82 0.60
30/06/2021 0.0 0.0 0.0 0.0 0.0 22.75 250.80 0.60
Why don't you just filter the rows with the condition you want? You can run queries on your dataset using a pandas DataFrame like below.
If the dates are in a column:
df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2, 7])]
If the dates are in the index:
df[pd.to_datetime(df.index, dayfirst=True).day.isin([2, 7])]
Here is an example:
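(The transcript uses a small helper, random_date(), that isn't shown; a minimal sketch of what it might look like, returning day-first dd/mm/yyyy strings to match the data above:)
import random
import datetime

def random_date():
    # Pick a uniformly random day between 2000 and 2021.
    start = datetime.date(2000, 1, 1)
    span = (datetime.date(2021, 12, 31) - start).days
    day = start + datetime.timedelta(days=random.randint(0, span))
    return day.strftime('%d/%m/%Y')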
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'Date': [random_date() for _ in range(100)],
...: 'Lake_Level': [random.randint(240, 260) for _ in range(100)]
...: })
In [3]: df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
Out[3]:
Date Lake_Level
2 07/08/2004 245
27 02/12/2017 249
30 02/06/2012 252
51 07/10/2013 257
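The question also asks for the top 3 and bottom 3 'Lake_Level' values; assuming the filter above, one way is nlargest/nsmallest:
levels = df.loc[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2, 7]), 'Lake_Level']
print(levels.nlargest(3))   # top 3
print(levels.nsmallest(3))  # bottom 3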
I am grouping a pandas dataframe by the value in column 0, which happens to be a year, month column (formatted as a float64 like yy,mm).
Before using the groupby function, my dataframe is as follows:
0 1 2 3 4 5 6 7 8 9
0 13,09 0.00 NaN 26.0 5740.0 NaN NaN NaN NaN 26
1 13,09 0.02 NaN 26.0 5738.0 NaN NaN NaN NaN 26
2 13,09 0.00 NaN 26.0 5738.0 NaN NaN NaN NaN 26
3 13,09 0.00 NaN 29.0 NaN NaN NaN NaN NaN 29
4 13,09 0.00 NaN 25.0 NaN NaN NaN NaN NaN 25
After running my groupby code (seen here)
month_year_total = month_year.groupby(0).sum()
I am given the following dataframe
1 2 3 4 5 6 7 8 9
0
13,09 1.55 0.0 383.0 51583.0 0.0 0.0 0.0 0.0 383
13,10 12.56 0.0 2039.0 142426.0 0.0 0.0 0.0 0.0 2039
13,11 0.65 1890.0 1663.0 170038.0 0.0 0.0 0.0 0.0 3553
13,12 1.43 7014.0 1055.0 176217.0 0.0 0.0 0.0 0.0 8069
14,01 1.53 7284.0 856.0 101971.0 0.0 0.0 0.0 0.0 8140
I wish to keep column 0 when converting to numpy, as I intend it to be the x axis of my graph; however, the column is dropped when I convert data types. In fact, I cannot manipulate the column at all, even within the pandas dataframe.
How do I keep this column or add an identical column?
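(A likely explanation, for what it's worth: groupby(0).sum() moves the grouping key into the index, which is why it disappears on conversion to numpy. Either of these sketches keeps it as a regular column:)
# Keep the key as a column during the groupby...
month_year_total = month_year.groupby(0, as_index=False).sum()
# ...or restore it afterwards:
month_year_total = month_year_total.reset_index()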
Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is not a value in the "20's". I want to insert a value to fill that space, which would essentially just be the posF value, and NA, so my resulting df would look like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 20 NaN
2 30 0.03 31.0 0.90
3 40 0.72 41.0 0.70
4 50 0.09 50 NaN
5 60 0.09 60 NaN
6 70 0.01 70 NaN
7 80 0.09 80 NaN
8 90 0.08 81.0 0.78
9 100 0.02 100 NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried is to make a dummy list and add values to it based on whether they were less than a (I see the flaw here but I don't know how to fix it):
insert_rows = []
for x in df['posF']:
    for a, b in zip(df['posR'], df['rfreq']):
        if x < a:
            insert_rows.append([x, 'NA'])
print(len(insert_rows))  # 21, should be 5
I realize that it keeps appending x for every value of a it is smaller than, rather than just once.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating against posF and then put the values at their new positions. But since you want 81 to appear one row later than that would place it, I'm afraid this is not exactly what you're searching for, and I still don't fully get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create the new columns: copy posF as the starting point, and initialize newrfreq to nan:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
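Put together, the whole approach reads as follows (a self-contained sketch reconstructing the question's df):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'posF': range(10, 101, 10),
    'ffreq': [.50, .20, .03, .72, .09, .09, .01, .09, .08, .02],
    'posR': [11, 31, 41, 51, 81] + [np.nan] * 5,
    'rfreq': [.08, .90, .70, .08, .78] + [np.nan] * 5,
})

# Interpolate each posR value to its would-be row position along posF.
idx = np.interp(df.posR, df.posF, df.index).round()
idx = idx[np.isfinite(idx)].astype(int)  # drop the NaN entries

df['newposR'] = df.posF   # start from posF everywhere
df['newrfreq'] = np.nan
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values  # overwrite at new spots
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values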
Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records; split it by 200 and create df1, df2, …, df5.
Any guidance would be much appreciated. I've looked on other boards and found no guidance on a function that can automatically create new dataframes.
Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
For your 1000-row dfmaster, np.split(dfmaster, 5) gives five chunks of 200 rows each (the second argument is the number of equal sections).
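As for the "automatically name each chunk" part: creating variables df1…df5 dynamically is usually avoided; a dict keyed by those names does the same job. A sketch, using a made-up 1000-row dfmaster:
import numpy as np
import pandas as pd

dfmaster = pd.DataFrame({'a': range(1000)})  # stand-in for your data

# np.split on a DataFrame returns a list of DataFrames.
chunks = {f'df{i + 1}': part for i, part in enumerate(np.split(dfmaster, 5))}
chunks['df3'].head()  # rows 400-404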
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
just use df = df.values first to convert the dataframe to a numpy array.