I have a dataframe df2 containing four columns: A, B, C, D. I want to fill this dataframe with the values from another dataframe, temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4).
What I want to obtain is those values arranged row by row into the four columns.
Any idea on how to do this?
If the length of the values modulo 4 equals 0, select the first row of the source dataframe as a Series with DataFrame.iloc, convert it to a NumPy array and reshape it with -1 (which infers the number of rows) and 4 for the number of columns:
print (len(df.iloc[0]) % 4)
0
df2 = pd.DataFrame(df.iloc[0].to_numpy().reshape(-1, 4), columns=list('ABCD'))
print (df2)
A B C D
0 1.0 2.0 6.5 8.0
1 3.0 4.0 6.6 7.8
2 5.0 6.0 5.0 4.0
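If the values are still in a flat sequence rather than a one-row dataframe, the same reshape idea applies directly; a minimal sketch, assuming temp holds the twelve values in row order as in the question:

import numpy as np
import pandas as pd

temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4)
# reshape the flat tuple into rows of 4 values and label the columns
df2 = pd.DataFrame(np.array(temp).reshape(-1, 4), columns=list('ABCD'))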
I have 2 dataframes:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]},
index=['X', 'Y', 'Z'])
and
df1 = pd.DataFrame({'M': [10, 20, 30],
'N': [40, 50, 60]},
index=['S', 'T', 'U'])
I want to append a row from the df dataframe to df1.
I use the following code to extract the row:
row = df.loc['Y']
When I print this I get:
A 2
B 5
Name: Y, dtype: int64
A and B are key values, or column heading names, so I transpose this with
row_25 = row.transpose()
I print row_25 and get:
A 2
B 5
Name: Y, dtype: int64
This is the same as row, so it seems the transpose didn't happen.
I then add this code to append the row to df1:
result = pd.concat([df1, row_25], axis=0, ignore_index=False)
print(result)
When I print result I get:
M N 0
S 10.0 40.0 NaN
T 20.0 50.0 NaN
U 30.0 60.0 NaN
A NaN NaN 2.0
B NaN NaN 5.0
I want A and B to be column headings (key values) and the name of the row (Y) to be the row index.
What am I doing wrong?
Try
pd.concat([df1, df.loc[['Y']]])
It generates:
M N A B
S 10.0 40.0 NaN NaN
T 20.0 50.0 NaN NaN
U 30.0 60.0 NaN NaN
Y NaN NaN 2.0 5.0
Not sure if this is what you want.
To exclude column names 'M' and 'N' from the result you can rename the columns beforehand:
>>> df1.columns = ['A', 'B']
>>> pd.concat([df1, df.loc[['Y']]])
A B
S 10 40
T 20 50
U 30 60
Y 2 5
The reason you need double square brackets is that single brackets return a 1D Series, which cannot be meaningfully transposed, while double brackets return a 2D DataFrame (in general, double brackets are used to reference several labels at once, like df1.loc[['X', 'Y']]; this is called 'fancy indexing' in NumPy).
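A quick way to see the difference is to check the types (a small sketch using the df defined above):

>>> type(df.loc['Y'])
<class 'pandas.core.series.Series'>
>>> type(df.loc[['Y']])
<class 'pandas.core.frame.DataFrame'>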
If you are allergic to double brackets, use
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
df.filter('Y', axis=0)])
Finally, if you really want to transpose something, you can convert the series to a frame and transpose it:
>>> df.loc['Y'].to_frame().T
A B
Y 2 5
I have a df with mixed datatypes, for example:
df:
  name  value
   1st      1
   2nd      5
   3rd    3.5
   4th      8
When the running df['value'].sum() would reach >= 10, split the remaining items into dfNew (the sum of the first 3 rows is 9.5, so the leftover items need to go to dfNew; in my example that means splitting off the last row).
dfNew:
  name  value
   4th      8
P.S. I assume I could do it by iterating (itertuples/iteritems), summing the items, then getting the indexes and splitting, but what is a more 'pandas' way to do this?
Use cumsum, then bin the cumulative sums, then group by the bins, for example:
import math

df['cumsum_val'] = df['value'].cumsum()
# bin edges every 10, extending past the last cumulative sum
binning = list(range(0, math.ceil(df['cumsum_val'].iloc[-1]) + 10, 10))
df['binning'] = pd.cut(df['cumsum_val'], bins=binning)
After this, you can group by the bins:
grouped_df = df.groupby('binning')
Finally, get the split dataframes as a list:
df_new_lists = [grouped_df.get_group(x) for x in grouped_df.groups]
df_new_lists is the list of split dataframes.
You can use a combination of:
.cumsum to get the cumulative sums;
cut to compare the cumulative sums with your threshold;
.groupby to group rows that received the same result.
import pandas as pd
df = pd.DataFrame({'name': ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th'], 'value': [1,5,3.5,8,2,9,2]})
threshold = 10
df['cumsums'] = df['value'].cumsum()
bins = range(0, int(df['cumsums'].iloc[-1]+threshold+1), threshold)
df['groups'] = pd.cut(df['cumsums'], bins=bins)
df_list = [pd.DataFrame(g) for _,g in df.groupby('groups')]
print(df)
# name value cumsums groups
# 0 1st 1.0 1.0 (0, 10]
# 1 2nd 5.0 6.0 (0, 10]
# 2 3rd 3.5 9.5 (0, 10]
# 3 4th 8.0 17.5 (10, 20]
# 4 5th 2.0 19.5 (10, 20]
# 5 6th 9.0 28.5 (20, 30]
# 6 7th 2.0 30.5 (30, 40]
print(df_list)
# [
# name value cumsums groups
# 0 1st 1.0 1.0 (0, 10]
# 1 2nd 5.0 6.0 (0, 10]
# 2 3rd 3.5 9.5 (0, 10],
# name value cumsums groups
# 3 4th 8.0 17.5 (10, 20]
# 4 5th 2.0 19.5 (10, 20],
# name value cumsums groups
# 5 6th 9.0 28.5 (20, 30]
# name value cumsums groups
# 6 7th 2.0 30.5 (30, 40]
# ]
Note that this code assumes that values are nonnegative, so that the cumsums are positive and increasing. If that is not the case, then you must use a more robust definition for bins, such as:
bins = range(int(min(df['cumsums'])), int(max(df['cumsums']))+threshold+1, threshold)
Use cumsum and idxmax, then slice the dataframe ([idx:]):
dfNew = df.iloc[df['value'].cumsum().ge(10).idxmax():]
Output
>>> dfNew
name value
3 4th 8.0
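Note that df.iloc with the label returned by idxmax only lines up because df has the default RangeIndex here, so labels and positions coincide. If the index is something else, a hedged variant of the same idea using the positional argmax instead:

dfNew = df.iloc[df['value'].cumsum().ge(10).to_numpy().argmax():]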
I have a dataframe
df = pd.DataFrame([1,5,8, np.nan,np.nan], columns = ["UserID"])
I want to fill the np.nan values with the next sequence numbers, starting from the highest value + 1.
Expected result for df.UserID:
[1, 5, 8, 9, 10]
Use Series.isna with Series.cumsum to build a counter, and add it to the original data with forward-filled missing values:
df['UserID'] = df['UserID'].isna().cumsum().add(df['UserID'].ffill(), fill_value=0)
print (df)
UserID
0 1.0
1 5.0
2 8.0
3 9.0
4 10.0
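To see how the chain works, here is a sketch of the intermediate pieces, assuming the same df:

counter = df['UserID'].isna().cumsum()            # 0, 0, 0, 1, 2 -> running count of NaNs
filled = df['UserID'].ffill()                     # 1, 5, 8, 8, 8 -> last known value carried forward
df['UserID'] = counter.add(filled, fill_value=0)  # 1, 5, 8, 9, 10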
I have two Dataframes, one contains values and is the working dataset (postsolutionDF), while the other is simply for reference as a lookup table (factorimportpcntDF). The goal is to add a column to postsolutionDF that contains the product of the lookup values from each row of postsolutionDF (new column name = num_predict). That product is then multiplied by 2700. For example, on first row, the working values are 0.5, 2, -6. The equivalent lookup values for these are 0.1182, 0.2098, and 0.8455. The product of those is 0.0209, which when multiplied by 2700 is 56.61 as shown in output.
The code below works for this simplified example, but it is very slow in the real solution (1.6MM rows x 15 numbered columns). I'm sure there is a better way to do this by removing the 'for k in range' loop but am struggling with how since already using apply on rows. I've found many tangential solutions but nothing that has worked for my situation yet. Thanks for any help.
import pandas as pd
import numpy as np
postsolutionDF = pd.DataFrame({'SCRN': ['2019-01-22-0000001', '2019-01-22-0000002', '2019-01-22-0000003'],
                               '1': 0.5, '2': 2, '3': [-6, 1.0, 8.0]})
postsolutionDF = postsolutionDF[['SCRN', '1', '2', '3']]
print('printing initial postsolutionDF..')
print(postsolutionDF)
factorimportpcntDF = pd.DataFrame({'F1_Val': [0.5, 1, 1.5, 2], 'F1_Pcnt': [0.1182, 0.2938, 0.4371, 0.5433],
                                   'F2_Val': [2, 3, np.nan, np.nan], 'F2_Pcnt': [0.2098, 0.7585, np.nan, np.nan],
                                   'F3_Val': [-6, 1, 8, np.nan], 'F3_Pcnt': [0.8455, 0.1753, 0.072, np.nan]})
print('printing factorimportpcntDF..')
print(factorimportpcntDF)
def zero_filter(row):  # row is a Series
    inner_value = 1
    for k in range(1, 4):  # number of columns in postsolutionDF with numeric headers, dynamic in actual code
        inner_value *= factorimportpcntDF.loc[factorimportpcntDF['F'+str(k)+'_Val'] == row[0+k], 'F'+str(k)+'_Pcnt'].values[0]
    inner_value *= 2700
    return inner_value
postsolutionDF['num_predict'] = postsolutionDF.apply(zero_filter, axis=1)
print('printing new postsolutionDF..')
print(postsolutionDF)
Print Output:
printing initial postsolutionDF..
SCRN 1 2 3
0 2019-01-22-0000001 0.5 2 -6.0
1 2019-01-22-0000002 0.5 2 1.0
2 2019-01-22-0000003 0.5 2 8.0
printing factorimportpcntDF..
F1_Pcnt F1_Val F2_Pcnt F2_Val F3_Pcnt F3_Val
0 0.1182 0.5 0.2098 2.0 0.8455 -6.0
1 0.2938 1.0 0.7585 3.0 0.1753 1.0
2 0.4371 1.5 NaN NaN 0.0720 8.0
3 0.5433 2.0 NaN NaN NaN NaN
printing new postsolutionDF..
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I'm not sure how to do this in native pandas, but if you go back to numpy, it is pretty easy.
The numpy.interp function is designed to interpolate between values in the lookup table, but if the input values exactly match the values in the lookup table (like yours do), it becomes just a simple lookup instead of an interpolation.
postsolutionDF['1new'] = np.interp(postsolutionDF['1'].values, factorimportpcntDF['F1_Val'], factorimportpcntDF['F1_Pcnt'])
postsolutionDF['2new'] = np.interp(postsolutionDF['2'].values, factorimportpcntDF['F2_Val'], factorimportpcntDF['F2_Pcnt'])
postsolutionDF['3new'] = np.interp(postsolutionDF['3'].values, factorimportpcntDF['F3_Val'], factorimportpcntDF['F3_Pcnt'])
postsolutionDF['num_predict'] = postsolutionDF['1new'] * postsolutionDF['2new'] * postsolutionDF['3new'] * 2700
postsolutionDF.drop(columns=['1new', '2new', '3new'], inplace=True)
Gives the output:
In [167]: postsolutionDF
Out[167]:
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I had to pad out the factorimportpcntDF so all the columns had 4 values, otherwise looking up the highest value for a column wouldn't work. You can just use the same value multiple times, or split it into 3 lookup tables if you prefer, then the columns could be different lengths.
factorimportpcntDF = pd.DataFrame({'F1_Val' : [0.5, 1, 1.5, 2], 'F1_Pcnt' : [0.1182, 0.2938, 0.4371, 0.5433],
'F2_Val' : [2, 3, 3, 3], 'F2_Pcnt' : [0.2098, 0.7585, 0.7585, 0.7585],
'F3_Val' : [-6, 1, 8, 8], 'F3_Pcnt' : [0.8455, 0.1753, 0.072, 0.072]})
Note that the documentation specifies that your F1_Val etc. columns need to be in increasing order (yours are here, just an FYI). Otherwise interp will run, but won't necessarily give good results.
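If you would rather avoid the increasing-order requirement altogether, an alternative sketch (not using interp) that builds plain dictionaries from the lookup table and uses Series.map, assuming the working values always match a lookup value exactly:

f1 = dict(zip(factorimportpcntDF['F1_Val'], factorimportpcntDF['F1_Pcnt']))
f2 = dict(zip(factorimportpcntDF['F2_Val'], factorimportpcntDF['F2_Pcnt']))
f3 = dict(zip(factorimportpcntDF['F3_Val'], factorimportpcntDF['F3_Pcnt']))
postsolutionDF['num_predict'] = (postsolutionDF['1'].map(f1)
                                 * postsolutionDF['2'].map(f2)
                                 * postsolutionDF['3'].map(f3) * 2700)

An unmatched value would simply produce NaN in num_predict rather than raising, which is worth checking for on the full data.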
Given the following dataframe:
df = pd.DataFrame(data={'item': [1, 2, 3, 4], 'start':[0.0, 2.0, 8.0, 6.0],
'end': [2.0, 6.0, 8.0, 14.0]})
How do I quickly expand the above dataframe row-wise by segmenting the interval 'start' - 'end' into multiples of 2?
For the above example, the resulting dataframe should be
Out=
item start end
1 0.0 2.0
2 2.0 4.0
2 4.0 6.0
3 8.0 8.0
4 6.0 8.0
4 8.0 10.0
4 10.0 12.0
4 12.0 14.0
Performance is of utmost importance for me, as I have millions of lines to check.
I had already filtered the entire dataframe using boolean indexing for those rows that do not need segmenting, which is a great speed-up. However, on the remainder of the rows I applied a for loop and kept appending dataframes of the correct length. Unfortunately, the performance is not sufficient for millions of rows.
Looking forward to expert solutions!
You can write a function that returns a DataFrame of the expanded start and end times. In this example, I group by item, as I'm not sure you can return a DataFrame from apply without grouping first.
def convert(row):
    start = row.start.values[0]
    end = row.end.values[0]
    if start == end:
        return pd.DataFrame([[start, end]], columns=['start', 'end'])
    else:
        return pd.DataFrame({'start': np.arange(start, end, 2),
                             'end': np.arange(start + 2, end + 2, 2)},
                            columns=['start', 'end'])
df1=df.groupby('item').apply(convert)
df1.index = df1.index.droplevel(1)
df1.reset_index()
item start end
0 1 0.0 2.0
1 2 2.0 4.0
2 2 4.0 6.0
3 3 8.0 8.0
4 4 6.0 8.0
5 4 8.0 10.0
6 4 10.0 12.0
7 4 12.0 14.0
Start from the original dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'item': [1, 2, 3, 4], 'start':[0.0, 2.0, 8.0, 6.0],
'end': [2.0, 6.0, 10.0, 14.0]})
Then, run the following code:
lengths = pd.Series([1, 2, 1, 4]) # For the example, I just created this array,
# but obviously I would use the mod function to
# determine the number of segments to create
# Row below elongates the dataframe according to the array 'lengths'
df = df.reindex(np.repeat(df.index.values, lengths), method='ffill')
df['start'] += pd.Series(df.groupby(level=0).cumcount()*2.0)
df['end'] = df['start'] + 2.0
print(df)
Note that the initial dataframe contained an error. Item '3' required a 'start=8.0' and 'end=10.0'.
I believe this method is extremely quick due to the use of pandas Cython functions. Of course, still open to other possibilities.
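For completeness, the lengths array above was written out by hand; a hedged sketch of how it could be derived from the intervals themselves (the "mod function" mentioned in the comment), computed from the original dataframe before the reindex, assuming every interval length is a multiple of 2 and zero-length intervals still get one row:

# number of 2-unit segments per row, with a minimum of one row per item
lengths = ((df['end'] - df['start']) // 2).clip(lower=1).astype(int)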