I have a sample data set:
import pandas as pd
import re
df = {'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01', 'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
      'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1]}
df = pd.DataFrame(df)
It looks like this:
df
Out[12]:
Pro READID VG
0 1.00 1 LV5-F*01
1 1.00 2 LV5-F*01
2 1.00 3 LV5-A*02
3 0.33 4 LV5-D*01
4 0.59 5 LV5-E*01
5 1.00 6 LV5-C*01
6 0.96 7 LV5-D*01
7 1.00 8 LV5-E*01
8 1.00 9 LV5-F*01
I want to group by column 'VG', but using only the part before the '*' in each value, and then write each resulting group to a separate file.
My concept is:
1. group the dataset df by column 'VG'
2. for each row of column 'VG', look at only the part before the '*', e.g. 'LV5-F', 'LV5-A', 'LV5-D', etc.
3. group the dataset once again, but this time on the values from step 2
4. output each grouped set to a separate file.
Desired output, as individual separate files:
'LV5-F.txt':
Pro READID VG
0 1.00 1 LV5-F*01
1 1.00 2 LV5-F*01
8 1.00 9 LV5-F*01
'LV5-A.txt':
Pro READID VG
2 1.00 3 LV5-A*02
'LV5-D.txt':
Pro READID VG
3 0.33 4 LV5-D*01
6 0.96 7 LV5-D*01
'LV5-E.txt':
Pro READID VG
4 0.59 5 LV5-E*01
7 1.00 8 LV5-E*01
'LV5-C.txt':
Pro READID VG
5 1.00 6 LV5-C*01
My attempt:
(df.groupby('VG')
.apply(lambda x: re.findall('([0-9A-Z-]+)\*',x) )
.groupby('VG')
.apply(lambda gp: gp.to_csv('{}.txt'.format(gp.name), sep='\t', index=False))
)
but it failed at the .apply(lambda x: re.findall('([0-9A-Z-]+)\*', x)) step, and I'm not sure why it doesn't work, because when I ran that regex by itself, outside the lambda, it worked fine. (The problem is that inside groupby().apply() the lambda receives a DataFrame group, not a string, so re.findall raises a TypeError.)
You'll have to adjust the to_csv function below to suit your needs. In particular, instead of printing, just provide a file name somehow.
But I'd structure it this way:
def to_csv(df):
    print(df.to_csv())
# extract
# within
# parens
# /------\
# r'^([^\*]+)'
# ^ \----/
# | \__________________________
# match | | |
# beginning [^this] \* '+'
# of string matches have to match
# not this escape * one or more
#
df.groupby(df.VG.str.extract(r'^([^\*]+)', expand=False)).apply(to_csv)
(Note that the LV5-A group appears twice below: groupby.apply may call the function on the first group twice to decide how to combine the results, so side effects like printing can run an extra time for that group.)
,Pro,READID,VG
2,1.0,3,LV5-A*02
,Pro,READID,VG
2,1.0,3,LV5-A*02
,Pro,READID,VG
5,1.0,6,LV5-C*01
,Pro,READID,VG
3,0.33,4,LV5-D*01
6,0.96,7,LV5-D*01
,Pro,READID,VG
4,0.59,5,LV5-E*01
7,1.0,8,LV5-E*01
,Pro,READID,VG
0,1.0,1,LV5-F*01
1,1.0,2,LV5-F*01
8,1.0,9,LV5-F*01
I modified my code with help from @piRSquared and it worked:
df.groupby(df.VG.str.extract(r'^([^\*]+)', expand=False)).apply(lambda gp: gp.to_csv('{}.txt'.format(gp.name), sep='\t', index=False))
(expand=False keeps the extracted key a Series; in newer pandas, str.extract returns a DataFrame by default, which groupby won't accept as a single key.)
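For what it's worth, iterating over the groups directly sidesteps the apply quirks entirely; a minimal sketch with the same data:

import pandas as pd

df = pd.DataFrame({'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
                          'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
                   'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1]})

# Group on the prefix before '*' and write one tab-separated file per group.
for prefix, gp in df.groupby(df['VG'].str.extract(r'^([^*]+)', expand=False)):
    gp.to_csv('{}.txt'.format(prefix), sep='\t', index=False)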
Related
df1
Place Location
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot
Delhi,Punjab,Kerala Delhi,Jaipur,Madras
df2
Target1 Target2 Strength
Jaipur Rajkot 0.94
Jaipur Punjab 0.84
Jaipur Noida 0.62
Jaipur Jodhpur 0.59
Punjab Amritsar 0.97
Punjab Delhi 0.85
Punjab Bhopal 0.91
Punjab Jodhpur 0.75
Kerala Varkala 0.85
Kerala Kochi 0.88
The task is to match each 'Place' value against the 'Location' values: assign a score of 1 for a direct match, and for an indirect match refer to df2 and use its strength score. For example, in row 1, Delhi and Punjab are direct matches, since both appear in 'Place' and 'Location', whereas Jaipur appears in 'Place' but not in 'Location'. So Jaipur is looked up in df2's Target1, and we check which of row 1's 'Location' values appear in Target2. In df2, Jaipur is related to Punjab and Noida, both of which are in row 1's Location values; so for Jaipur, Punjab's strength is allotted, since 0.84 is higher than Noida's 0.62. The final score is calculated as (1+1+0.84)/3, i.e. the sum of the direct and indirect matches divided by the number of 'Place' items.
Expected output is :
Place Location Avg. Score
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow (1+1+0.84)/3 = 0.95
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot (1+0.91+1)/3 = 0.97
Delhi,Punjab,Kerala Delhi,Jaipur,Madras (1+0.85+0)/3 = 0.62
My attempt:
data1 = df1['Place'].to_list()
data2 = df1['Location'].to_list()
dict3 = {}
exac_match = []
for el in data1:
    #print(el)
    el = [x.strip() for x in el.split(',')]
    for ell in data2:
        ell = [x.strip() for x in ell.split(',')]
        dict1 = {}
        dict2 = {}
        for elll in el:
            if elll in ell:
                #print("Exact match:::", elll)
                dict1[elll] = 1
                dict2[elll] = elll
Use:
import numpy as np
import pandas as pd

#convert the split values of df1['Place'] to rows
df = df1.assign(Place = df1['Place'].str.split(',')).explode('Place').reset_index()
#test whether each Place value appears in its Location (split values)
mask = [a in b for a, b in zip(df.Place, df['Location'].str.split(','))]
#filter matched and remove duplicates, assign 1 to final column
df11 = df[mask].drop_duplicates(['index','Place','Location']).assign(final=1)
#filter not matched rows (indirect match) and join df2
df12 = df[~np.array(mask)].merge(df2, left_on='Place', right_on='Target1')
#test if Target2 in Location
mask = [a in b for a, b in zip(df12.Target2, df12['Location'].str.split(','))]
#get maximal Strength per Place
df12 = df12[mask].copy()
df12 = (df12.loc[df12.groupby(['index','Place'])['Strength'].idxmax()]
.assign(final = lambda x: x['Strength']))
#join together
df3 = pd.concat([df11, df12[['index','Place','final','Location']]])
#join to exploded DataFrame with replace NaN to 0 in final column
df = df.merge(df3, how='left', on=['index','Place']).fillna({'final':0})
print (df)
index Place Location_x Location_y \
0 0 Delhi Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
1 0 Punjab Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
2 0 Jaipur Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
3 1 Delhi Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
4 1 Punjab Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
5 1 Jaipur Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
6 2 Delhi Delhi,Jaipur,Madras Delhi,Jaipur,Madras
7 2 Punjab Delhi,Jaipur,Madras Delhi,Jaipur,Madras
8 2 Kerala Delhi,Jaipur,Madras NaN
final
0 1.00
1 1.00
2 0.84
3 1.00
4 0.91
5 1.00
6 1.00
7 0.85
8 0.00
#last aggregate mean and assign to df1['Score']
df1['Score'] = df.groupby('index')['final'].mean()
print (df1)
Place Location Score
0 Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow 0.946667
1 Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot 0.970000
2 Delhi,Punjab,Kerala Delhi,Jaipur,Madras 0.616667
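An alternative, more readable (if slower) take on the same scoring is a row-wise function; this is a sketch assuming the df1/df2 layout from the question:

def row_score(row, df2):
    places = [p.strip() for p in row['Place'].split(',')]
    locations = {l.strip() for l in row['Location'].split(',')}
    total = 0.0
    for p in places:
        if p in locations:
            # direct match
            total += 1.0
        else:
            # indirect match: best Strength from p to any of this row's locations
            m = df2[(df2['Target1'] == p) & (df2['Target2'].isin(locations))]
            total += m['Strength'].max() if not m.empty else 0.0
    return total / len(places)

df1['Score'] = df1.apply(row_score, axis=1, args=(df2,))

For large frames the merge-based version above should be noticeably faster.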
I am trying to do something that I think should be rather simple but I am stuck.
I would like to be able to get the standard deviation of each column in my dataframe and remove that column if the standard deviation is below a set number. This is as far as I have gotten.
import numpy as np
import pandas as pd

stdev_min = 0.6
df = pd.DataFrame(np.random.randn(20, 5), columns=list('ABCDE'))
namelist = df.columns.tolist()
stdev = pd.DataFrame(df.std())
I've tried a few things but nothing worth mentioning, any help would be greatly appreciated.
You don't need any loops.
You rarely do with pandas.
In this case, you need boolean indexing:
import pandas
import numpy
numpy.random.seed(37)
stdev_min = 0.95
df = pandas.DataFrame(numpy.random.randn(20, 5), columns=list('ABCDE'))
So now df.std() gives me:
A 0.928547
B 0.859394
C 0.998692
D 1.187380
E 1.092970
dtype: float64
so I can do
df.loc[:, df.std() > stdev_min]
And get:
C D E
0 0.35 -1.30 1.52
1 -0.45 0.96 -0.83
2 0.52 -0.06 -0.03
3 1.89 0.40 0.19
4 -0.27 -2.07 -0.71
5 -1.72 -0.40 1.27
6 0.44 -2.05 -0.23
7 1.76 0.06 0.36
8 -0.30 -2.05 1.68
9 0.34 1.26 -1.08
10 0.10 -0.48 -1.74
11 1.95 -0.08 1.51
12 0.43 -0.06 -0.63
13 -0.30 -1.06 0.57
14 -0.95 -1.45 0.93
15 -1.13 2.23 -0.88
16 -0.77 0.86 0.58
17 0.93 -0.11 -1.29
18 -0.82 0.03 -0.44
19 0.40 1.13 -1.89
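Equivalently, if you prefer to drop rather than select, the same mask inverted:

# drop every column whose standard deviation does not clear the threshold
low_std_cols = df.std()[df.std() <= stdev_min].index
df = df.drop(columns=low_std_cols)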
Here's a way to do this.
Iterate through each column, get the standard deviation of the column, and check whether it is less than the minimum standard deviation value. If it is, drop the column using inplace=True.
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
for col in df.columns:
    print(col, df[col].std())
    if df[col].std() < stdev_min:
        df.drop(col, axis='columns', inplace=True)
print(df)
Output:
A 0.5046725928657507
B 1.1382221163449697
C 1.0318169576864502
D 0.7129102193331575
E 1.3805207184389312
The standard deviation of A is less than 0.6, so that column got dropped.
B C D E
0 -0.923822 1.155547 -0.601033 -0.066207
1 0.068844 0.426304 -0.376052 0.368574
2 0.585187 -0.367270 0.530934 0.086811
3 0.021466 1.381579 0.483134 -0.300033
4 0.351492 -0.648734 -0.736213 0.827953
5 0.155731 -0.004504 0.315432 0.310515
6 -1.092933 1.341933 -0.672240 -3.482960
7 -0.587766 0.227846 0.246781 1.978528
8 1.565055 0.527668 -0.371854 -0.030196
9 -2.634862 -1.973874 1.508080 -0.362073
Did a few more runs. Here's an example with before and after.
DF before
A B C D E
0 0.496740 0.799021 1.655287 0.091138 0.309186
1 -0.580667 -0.749337 -0.521909 -0.529410 1.010981
2 0.212731 0.126389 -2.244500 0.400540 -0.148761
3 -0.424375 -0.832478 -0.030865 -0.561107 0.196268
4 0.229766 0.688040 0.580294 0.941885 1.554929
5 0.676926 -0.062092 -1.452619 0.952388 -0.963857
6 0.683216 0.747429 -1.834337 -0.402467 -0.383881
7 0.834815 -0.770804 1.299346 1.694612 1.171190
8 0.500445 -1.517488 0.610287 -0.601442 0.343389
9 -0.182286 -0.713332 0.526507 1.042717 1.229628
Standard Deviations for each column of DF:
A 0.49088743174291477
B 0.8047513692231202
C 1.333382184686379
D 0.8248456756163864
E 0.8033725216710547
The standard deviation of df['A'] is less than 0.6, so it got dropped.
DF after dropping the column.
B C D E
0 0.799021 1.655287 0.091138 0.309186
1 -0.749337 -0.521909 -0.529410 1.010981
2 0.126389 -2.244500 0.400540 -0.148761
3 -0.832478 -0.030865 -0.561107 0.196268
4 0.688040 0.580294 0.941885 1.554929
5 -0.062092 -1.452619 0.952388 -0.963857
6 0.747429 -1.834337 -0.402467 -0.383881
7 -0.770804 1.299346 1.694612 1.171190
8 -1.517488 0.610287 -0.601442 0.343389
9 -0.713332 0.526507 1.042717 1.229628
I am new to this field and stuck on this problem. I have two datasets:
all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, which has 31 columns:
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now, I want to know: how can I merge them using the difflib library even though the names differ?
Any help will be appreciated.
Thanks in advance.
I have used the code below, which I found in another question on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach; kindly suggest if I can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb,'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb,'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge.
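One caveat: get_close_matches returns an empty list when nothing clears the cutoff, so the [0] above can raise an IndexError. A slightly more defensive sketch (the 0.6 cutoff and the 'Year'/'Name' vs 'years'/'name' column names are assumptions carried over from the question):

import difflib

def closest_name(name, candidates, cutoff=0.6):
    # best fuzzy match, falling back to the original name if nothing clears the cutoff
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df_b['name'] = df_b['name'].apply(lambda x: closest_name(x, df_a['Name'].tolist()))
merged = df_a.merge(df_b, left_on=['Year', 'Name'], right_on=['years', 'name'], how='inner')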
Let me approach your problem by assuming that you have to make a data set with 2 columns, the 2 columns being 1. 'year' and 2. 'name'.
Okay:
1. We will first rename all the names which are wrong. Assuming you know the wrong names in all_batting_statistics_df, use something like this (the regex has to match the full wrong name, and replace returns a new frame):
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work with the smaller data set that has the names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only the year and the name. Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns, so you will have to add the rest of them to the drop list (keep only 'Year' and 'Name').
3. We need to change the column names so they match the first dataset, i.e. 'years' and 'name', using DataFrame.rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'years', 'Name': 'name'})
4. Next, to merge them, we will use this:
all_batsman_df_1.merge(df_new_1, on=['years', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools. But if you like pandas, it's not that difficult; you will find a way. All the best!
I have a dataframe with the values:
3.05
35.97
49.11
48.80
48.02
10.61
25.69
6.02
55.36
0.42
47.87
2.26
54.43
8.85
8.75
14.29
41.29
35.69
44.27
1.08
I want to transform each value into a range group, giving each value a new number.
From the df we know the min value is 0.42 and the max value is 55.36.
I want to divide the range from min to max into 4 groups, which is:
0.42 - 14.15 transform to 1
14.16 - 27.88 transform to 2
27.89 - 41.61 transform to 3
41.62 - 55.36 transform to 4
so the result I expected is
1
3
4
4
4
1
2
1
4
1
4
1
4
1
1
2
3
3
4
1
This is normally called binning, but pandas calls it cut. Sample code is below:
import pandas as pd
# Create a list of numbers, with a header called "nums"
data_list = [('nums', [3.05, 35.97, 49.11, 48.80, 48.02, 10.61, 25.69, 6.02, 55.36, 0.42, 47.87, 2.26, 54.43, 8.85, 8.75, 14.29, 41.29, 35.69, 44.27, 1.08])]
# Create the labels for the bins
bin_labels = [1, 2, 3, 4]
# Create the dataframe object from the data_list
# (DataFrame.from_items was removed in pandas 1.0, so build from a dict instead)
df = pd.DataFrame(dict(data_list))
# Define the edges of the bins
bins = [0.41, 14.16, 27.89, 41.62, 55.37]
# Create the "bins" column using the cut function with the bins and labels
df['bins'] = pd.cut(df['nums'], bins=bins, labels=bin_labels)
This creates a dataframe which has the following structure:
print(df)
nums bins
0 3.05 1
1 35.97 3
2 49.11 4
3 48.80 4
4 48.02 4
5 10.61 1
6 25.69 2
7 6.02 1
8 55.36 4
9 0.42 1
10 47.87 4
11 2.26 1
12 54.43 4
13 8.85 1
14 8.75 1
15 14.29 2
16 41.29 3
17 35.69 3
18 44.27 4
19 1.08 1
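Since the four target ranges are just the min-to-max span split into four equal widths, you can also let cut compute the edges itself from an integer bin count; a sketch that should reproduce the same groups up to tiny differences in edge placement:

df['bins'] = pd.cut(df['nums'], bins=4, labels=[1, 2, 3, 4])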
You could construct a function like the following to have full control over the process (note the second boundary is 27.88, and values outside all four ranges are silently skipped, which would shorten the list):

def transform(l):
    l2 = []
    for i in l:
        if 0.42 <= i <= 14.15:
            l2.append(1)
        elif i <= 27.88:
            l2.append(2)
        elif i <= 41.61:
            l2.append(3)
        elif i <= 55.36:
            l2.append(4)
    return l2

df['nums'] = transform(df['nums'])
I have a dataframe with the columns engaged_count, user_count and user_state (a sample appears in the demo further down).
I want to create another column called "engaged_percent" for each state, which is the number of unique engaged_count values divided by the user_count of that particular state.
I tried doing the following:
def f(x):
    engaged_percent = x['engaged_count'].nunique()/x['user_count']
    return pd.Series({'engaged_percent': engaged_percent})
by = df3.groupby(['user_state']).apply(f)
by
But it gave me a Series of values for each state instead of a single number.
What I want is something like this:
user_state engaged_percent
---------------------------------
California 2/21 = 0.09
Florida 2/7 = 0.28
I think my approach is correct; however, I am not sure why my result comes out like that.
Any help would be much appreciated! Thanks in advance!
How about:
user_count=df3.groupby('user_state')['user_count'].mean()
#(or however you think a value for each state should be calculated)
engaged_unique=df3.groupby('user_state')['engaged_count'].nunique()
engaged_pct=engaged_unique/user_count
(you could also do this in one line in a bunch of different ways)
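For instance, one such one-liner might look like this (a sketch, same logic as the two groupbys above):

engaged_pct = (df3.groupby('user_state')
                  .apply(lambda g: g['engaged_count'].nunique() / g['user_count'].mean()))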
Your original solution was almost fine except that you were dividing a value by the entire user count series. So you were getting a Series instead of a value. You could try this slight variation:
def f(x):
    engaged_percent = x['engaged_count'].nunique()/x['user_count'].mean()
    return engaged_percent
by = df3.groupby(['user_state']).apply(f)
by
I would just use groupby and apply directly
df3['engaged_percent'] = (df3.groupby('user_state')
                          .apply(lambda s: s.engaged_count.nunique()/s.user_count).values)
Demo
>>> df3
engaged_count user_count user_state
0 3 21 California
1 3 21 California
2 3 21 California
...
19 4 7 Florida
20 4 7 Florida
21 4 7 Florida
>>> df3['engaged_percent'] = df3.groupby('user_state').apply(lambda s: s.engaged_count.nunique()/s.user_count).values
>>> df3
engaged_count user_count user_state engaged_percent
0 3 21 California 0.095238
1 3 21 California 0.095238
2 3 21 California 0.095238
...
19 4 7 Florida 0.285714
20 4 7 Florida 0.285714
21 4 7 Florida 0.285714
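A transform-based variant of the same idea aligns by index instead of relying on .values order (a sketch):

df3['engaged_percent'] = (df3.groupby('user_state')['engaged_count'].transform('nunique')
                          / df3['user_count'])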
titanic.groupby('Sex')['Fare'].mean()
You can try this example; just substitute your own columns and data into the same pattern.