Manipulating data from two different CSV files using pandas - Python

I have two data frames, df1 and df2.
df1 has the following data (N rows):
Time(s) sv-01 sv-02 sv-03 Val1 val2 val3
1339.4 1 4 12 1.6 0.6 1.3
1340.4 1 12 4 -0.5 0.5 1.4
1341.4 1 6 8 0.4 5 1.6
1342.4 2 5 14 1.2 3.9 11
...... ..... .... ... ..
df2 has the following data, with more rows than df1:
Time(msec) channel svid value-1 value-2 valu-03
1000 1 2 0 5 1
1000 2 5 1 4 2
1000 3 2 3 4 7
..... .....................................
1339400 1 1 1.6 0.4 5.3
1339400 2 12 0.5 1.8 -4.4
1339400 3 4 -0.20 1.6 -7.9
1340400 1 1 0.3 0.3 1.5
1340400 2 6 2.3 -4.3 1.0
1340400 3 4 2.0 1.1 -0.45
1341400 1 1 2 2.1 0
1341400 2 8 3.4 -0.3 1
1341400 3 6 0 4.1 2.3
.... .... .. ... ... ...
What I am trying to achieve is:
1. First, multiply the Time(s) column by 1000 so that it matches the df2
millisecond column.
2. In df1, sv-01, sv-02 and sv-03 are independent columns, but in df2 those
sv values all sit in one column under svid.
So the goal is: when a time in df1 (after the conversion) matches a time
in df2, copy the three consecutive lines for it, i.e. copy all matched
lines of that time instant.
Basically I want to look up each df1 time in the df2 time column
and, on a match, copy the next three rows to a new df.
I have seen examples using the pandas merge function, but in my case the two
frames have different headers.
Thanks.

I think you need double boolean indexing - first filter df2 with isin (mul handles the multiplication by 1000),
and then count values per group with cumcount and keep the first 3 rows of each:
df = df2[df2['Time(msec)'].isin(df1['Time(s)'].mul(1000))]
df = df[df.groupby('Time(msec)').cumcount() < 3]
print (df)
Time(msec) channel svid value-1 value-2 valu-03
3 1339400 1 1 1.6 0.4 5.30
4 1339400 2 12 0.5 1.8 -4.40
5 1339400 3 4 -0.2 1.6 -7.90
6 1340400 1 1 0.3 0.3 1.50
7 1340400 2 6 2.3 -4.3 1.00
8 1340400 3 4 2.0 1.1 -0.45
9 1341400 1 1 2.0 2.1 0.00
10 1341400 2 8 3.4 -0.3 1.00
11 1341400 3 6 0.0 4.1 2.30
Detail:
print (df.groupby('Time(msec)').cumcount())
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
dtype: int64
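
If you would rather use merge (mentioned in the question), the different headers are not a blocker - you can scale and rename the time column so both frames share a key. A minimal sketch under the same assumptions as above:
# round to avoid float noise before matching integer milliseconds
# (an assumption about the dtypes of the two time columns)
key = df1['Time(s)'].mul(1000).round().astype('int64')
key = key.drop_duplicates().rename('Time(msec)')
df = df2.merge(key.to_frame(), on='Time(msec)', how='inner')
# keep only the first 3 rows per matched time instant, as above
df = df[df.groupby('Time(msec)').cumcount() < 3]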

Related

How to get the average of a row excluding values less than or greater than a threshold, and add a new column at the end - Python, Pandas

Following is my input data frame, after computing the avg column:
a b c d avg
0 1 4 7 8 5
1 3 4 5 6 4.5
2 6 8 2 9 6.25
3 2 9 5 6 5.5
Output required after adding the criteria:
a b c d avg avg_criteria
0 1 4 7 8 5 7.5 (<=5)
1 3 4 5 6 4.5 5.5 (<=4.5)
2 6 8 2 9 6.25 8.5 (<=6.25)
3 2 9 5 6 5.5 7.5 (<=5.5)
This is the code I have tried:
# read file
df_input_data = pd.DataFrame(pd.read_excel(file_path, header=2).dropna(axis=1, how='all'))
# adding column after calculating average
df_avg = df_input_data.assign(Avg=df_input_data.mean(axis=1, skipna=True))
# criteria (this line fails with a syntax error)
criteria = df_input_data.iloc[, :] >= df_avg.iloc[1][-1]
# creating output data frame
df_output = df_input_data.assign(Avg_criteria=criteria)
I am unable to solve this issue. I have tried and googled it many times.
From what I understand, you can try df.mask/df.where after comparing with the mean, and then calculate the mean of what remains:
m = df.drop("avg", axis=1)
m.where(m.ge(df['avg'], axis=0)).mean(axis=1)
0 7.5
1 5.5
2 8.5
3 7.5
dtype: float64
print(df.assign(Avg_criteria=m.where(m.ge(df['avg'], axis=0)).mean(axis=1)))
a b c d avg Avg_criteria
0 1 4 7 8 5.00 7.5
1 3 4 5 6 4.50 5.5
2 6 8 2 9 6.25 8.5
3 2 9 5 6 5.50 7.5
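
The df.mask variant mentioned above is the mirror image - mask hides values where the condition is True, so you test lt instead of ge. A small sketch on the same df:
m = df.drop("avg", axis=1)
# mask (set to NaN) every value below the row's average, then average the rest
print(df.assign(Avg_criteria=m.mask(m.lt(df['avg'], axis=0)).mean(axis=1)))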

NumPy random sampling in Python

I have two pd data tables. I want to create a new column in df2 by assigning a random Rate using the Weight from df1.
df1
Income_Group Rate Weight
0 1 3.5 0.5
1 1 2.5 0.25
2 1 3.75 0.15
3 1 5.0 0.15
4 2 4.5 0.35
5 2 2.5 0.25
6 2 4.75 0.20
7 2 5.0 0.20
....
30 8 2.25 0.75
31 8 4.15 0.05
32 8 6.35 0.20
df2
ID Income_Group State Rate
0 12 1 9 3.5
1 13 2 6 4.5
2 15 8 1 6.35
3 8 1 5 2.5
4 9 8 4 6.35
5 17 2 3 4.75
......
100 50 1 4 3.75
I tried the following code:
df2['Rate'] = df1.groupby('Income_Group').apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Weight,
                                                                            replace=True))
Of course, the code didn't work. Can someone help me with this? Thank you in advance.
Your data is pretty small, so we can do:
rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)
df2['Rate'] = df2.Income_Group.apply(lambda x: np.random.choice(rate_dict.loc[x, 'Rate'],
                                                                p=rate_dict.loc[x, 'Weight']))
Or you can do a groupby on df2 as well:
df2['Rate'] = (df2.groupby('Income_Group')
                  .Income_Group
                  .transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
                                                        size=len(x),
                                                        p=rate_dict.loc[x.iloc[0], 'Weight'])))
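
For reference, a minimal self-contained run of the first approach on toy data (the frames and seed below are made up for reproducibility, not taken from the question):
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sampling reproduces
df1 = pd.DataFrame({'Income_Group': [1, 1, 2, 2],
                    'Rate': [3.5, 2.5, 4.5, 5.0],
                    'Weight': [0.6, 0.4, 0.3, 0.7]})
df2 = pd.DataFrame({'ID': [12, 13, 8], 'Income_Group': [1, 2, 1]})

rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)
df2['Rate'] = df2.Income_Group.apply(
    lambda x: np.random.choice(rate_dict.loc[x, 'Rate'], p=rate_dict.loc[x, 'Weight']))
print(df2)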
You can try:
df1 = pd.DataFrame([[1, 3.5, .5], [1, 2.5, .25], [1, 3.75, .15]],
                   columns=['Income_Group', 'Rate', 'Weight'])
df2 = pd.DataFrame()
weights = np.random.rand(df1.shape[0])
df2['Rate'] = df1.Rate.values * weights

Assign values from pandas.quantile

I am just trying to get the quantiles of a dataframe column assigned onto another dataframe column, like:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
the result is
0 NaN
...
5758 NaN
Name: pc, Length: 5759, dtype: float64
Any idea why, when dataframe['row'] has plenty of values?
It is expected, because of the different indices - the Series created by quantile does not align with the original DataFrame, so you get NaNs:
#indices 0,1,2...6
dataframe = pd.DataFrame({'row':[2,0,8,1,7,4,5]})
print (dataframe)
row
0 2
1 0
2 8
3 1
4 7
5 4
6 5
#indices 0.1, 0.5, 0.7
print (dataframe['row'].quantile([.1,.5,.7]))
0.1 0.6
0.5 4.0
0.7 5.4
Name: row, dtype: float64
#no alignment
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
print (dataframe)
row pc
0 2 NaN
1 0 NaN
2 8 NaN
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
If you want to create a DataFrame from the quantiles, add rename_axis + reset_index:
df = dataframe['row'].quantile([.1,.5,.7]).rename_axis('a').reset_index(name='b')
print (df)
a b
0 0.1 0.6
1 0.5 4.0
2 0.7 5.4
But what if some indices are the same? (I think this is not what you want; it is shown only for better explanation.)
Add reset_index for default indices 0,1,2:
print (dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True))
0 0.6
1 4.0
2 5.4
Name: row, dtype: float64
The first 3 rows are aligned, because the Series and the DataFrame share the indices 0,1,2:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True)
print (dataframe)
row pc
0 2 0.6
1 0 4.0
2 8 5.4
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
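
Equivalently, converting the quantile result to a plain array sidesteps alignment entirely (a sketch; .loc[:2] is label-based and inclusive, so it targets exactly the first three rows of the default index):
import numpy as np

dataframe['pc'] = np.nan  # start with an empty column
# .to_numpy() drops the 0.1/0.5/0.7 index, so no alignment happens
dataframe.loc[:2, 'pc'] = dataframe['row'].quantile([.1, .5, .7]).to_numpy()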
EDIT:
For multiple columns you need DataFrame.quantile; it also excludes non-numeric columns:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df1 = df.quantile([.1,.2,.3,.4])
print (df1)
B C D E
0.1 4.0 2.5 0.5 2.5
0.2 4.0 3.0 1.0 3.0
0.3 4.0 3.5 1.0 3.5
0.4 4.0 4.0 1.0 4.0
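
One version caveat (an assumption about your environment, not from the question): recent pandas releases no longer silently drop non-numeric columns in DataFrame.quantile, so on a frame like the one above you may need to ask for numeric columns explicitly:
df1 = df.quantile([.1, .2, .3, .4], numeric_only=True)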

Pandas - Merging Different Sized DataFrames

I am having an issue merging two frames with different numbers of rows. The first dataframe has 5K rows, and the second dataframe has 20K rows. There is a column "id" in both frames, and all 5K "id" values occur in the frame with 20K rows.
first frame "df"
A B id A_1 B_1
0 1 1 1 0.5 0.5
1 3 2 2 0.2 0.4
2 3 4 3 0.8 0.9
second frame "df_2"
A B id
0 1 1 1
1 3 2 2
2 3 4 3
3 1 2 4
4 3 1 5
Hopeful output frame "df_out"
A B id A_1 B_1
0 1 1 1 0.5 0.5
1 3 2 2 0.2 0.4
2 3 4 3 0.8 0.9
3 1 2 4 na na
4 3 1 5 na na
My attempts to merge on 'id' have left me with only the 5K rows. The operation I am seeking preserves all the rows of the larger dataframe and inserts NaN values where the data does not exist in the smaller frame.
Thanks
Just specify how='outer' to df.merge so that you use the union of keys from both DataFrames.
>>> df.merge(df_2, how='outer')
A A_1 B B_1 id
0 1.0 0.5 1.0 0.5 1.0
1 3.0 0.2 2.0 0.4 2.0
2 3.0 0.8 4.0 0.9 3.0
3 1.0 NaN 2.0 NaN 4.0
4 3.0 NaN 1.0 NaN 5.0
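
Since every id in df is guaranteed to occur in df_2, a left merge from the larger frame is an equivalent sketch that also keeps df_2's row order (merge joins on all shared columns A, B and id by default):
>>> df_2.merge(df, how='left')
A B id A_1 B_1
0 1 1 1 0.5 0.5
1 3 2 2 0.2 0.4
2 3 4 3 0.8 0.9
3 1 2 4 NaN NaN
4 3 1 5 NaN NaN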

Applying function to each row of pandas data frame - with speed

I have a dataframe that has the following basic structure:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({'condition': [0,0,0,0,0,1,1,1,1,1],
                       'x1': [1.2,-2.3,-2.1,2.4,-4.3,2.1,-3.4,-4.1,3.2,-3.3],
                       'y1': [6.5,-7.6,-3.4,-5.3,7.6,5.2,-4.1,-3.3,-5.7,5.3],
                       'decision': [np.nan]*10})
print(tempDF)
condition decision x1 y1
0 0 NaN 1.2 6.5
1 0 NaN -2.3 -7.6
2 0 NaN -2.1 -3.4
3 0 NaN 2.4 -5.3
4 0 NaN -4.3 7.6
5 1 NaN 2.1 5.2
6 1 NaN -3.4 -4.1
7 1 NaN -4.1 -3.3
8 1 NaN 3.2 -5.7
9 1 NaN -3.3 5.3
Within each row, I want to change the value of the 'decision' column to zero if the 'condition' column equals zero and 'x1' and 'y1' have the same sign (both positive or both negative) - for the purposes of this script, zero is considered positive. If the signs of 'x1' and 'y1' differ, or if the 'condition' column equals 1 (regardless of the signs of 'x1' and 'y1'), then the 'decision' column should equal 1. I hope I've explained that clearly.
I can iterate over each row of the dataframe as follows:
for i in range(len(tempDF)):
    if (tempDF.loc[i, 'condition'] == 0 and
            (((tempDF.loc[i, 'x1'] >= 0) and (tempDF.loc[i, 'y1'] >= 0)) or
             ((tempDF.loc[i, 'x1'] < 0) and (tempDF.loc[i, 'y1'] < 0)))):
        tempDF.loc[i, 'decision'] = 0
    else:
        tempDF.loc[i, 'decision'] = 1
print(tempDF)
condition decision x1 y1
0 0 0 1.2 6.5
1 0 0 -2.3 -7.6
2 0 0 -2.1 -3.4
3 0 1 2.4 -5.3
4 0 1 -4.3 7.6
5 1 1 2.1 5.2
6 1 1 -3.4 -4.1
7 1 1 -4.1 -3.3
8 1 1 3.2 -5.7
9 1 1 -3.3 5.3
This produces the right output but it's a bit slow. The real dataframe I have is very large and these comparisons will need to be made many times. Is there a more efficient way to achieve the desired result?
First, use np.sign and the comparison operators to create a boolean array which is True where the decision should be 1:
decision = df["condition"] | (np.sign(df["x1"]) != np.sign(df["y1"]))
Here I've used De Morgan's laws.
Then cast to int and put it in the dataframe:
df["decision"] = decision.astype(int)
Giving:
>>> df
condition decision x1 y1
0 0 0 1.2 6.5
1 0 0 -2.3 -7.6
2 0 0 -2.1 -3.4
3 0 1 2.4 -5.3
4 0 1 -4.3 7.6
5 1 1 2.1 5.2
6 1 1 -3.4 -4.1
7 1 1 -4.1 -3.3
8 1 1 3.2 -5.7
9 1 1 -3.3 5.3
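
One edge case worth noting: np.sign(0) is 0, so a zero in x1 paired with a positive y1 would count as a sign mismatch above, while the question treats zero as positive. A sketch that encodes that convention with >= comparisons instead (identical output on this data, where no x1 or y1 is exactly zero):
import numpy as np

# zero counts as positive, per the question's convention
same_sign = (df['x1'] >= 0) == (df['y1'] >= 0)
df['decision'] = np.where((df['condition'] == 0) & same_sign, 0, 1)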
