I have a 200000 row dataframe that looks like this
df =
index
name
d2b(m)
0
Jon
199.9
1
Amy
29
2
Fyn
19
3
Luc
30
4
And
76
5
Pia
90
I am writing a function to classify the "distance to bus stop (d2b)" column into a new column for every 10 meters, expecting:
index
name
d2b (m)
class (<= x meters)
0
Jon
199.9
200m
1
Amy
29
30m
2
Fyn
19
20m
3
Luc
33
40m
4
And
76
80m
5
Pia
90
90m
Code that works (updated):
numpy.ceil(data["d2b (m)"]/10)*10
This is one way of achieving this:
import math
df['class (<= x meters)'] = math.ceil(df[d2b(m)]/10)*10
This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to download different excel file for each subject in one loop so basically, I should get 4 excel files for each subject
i tried this but didn't work:
n=0
for subjects in df.stream:
df.to_excel("sub"+ str(n)+".xlsx")
n+=1
I think groupby is helpful here. and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('stream')):
group_df.to_excel('sub{}.xlsx'.format(i))
# Alternatively, to name the file based on the stream...
group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
I have a .txt file that has the data regarding the total number of queries with valid names. The text inside of the file came out of a SQL Server 19 query output. The database used consists of the results of an algorithm that retrieves the most similar brands related to the query inserted. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the Query identifier, the 2nd one has the brand classes where the Query can be located and the 3rd one is the Query itself (the spaces came from the SQL Server output formatting).
I then made a Pandas DataFrame in Google Colaboratory where I wanted the columns to be like the ones in the text file. However, when I ran the code, it gave me this:
The code that I wrote is here:
# Dataframe with the total number of queries with valid names:
df = pd.DataFrame(pd.read_table("/content/drive/MyDrive/data/classes/100/queries100.txt", header=None, names=["Query ID", "Query Name", "Classes Where Query is Present"]))
df
I think that this happens because of the commas in the 2nd column but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths:
import pandas as pd
df = pd.read_fwf(
"/content/drive/MyDrive/data/classes/100/queries100.txt",
colspecs=[(0,20),(21,40),(40,1000)],
header=None,
names=["Query ID", "Query Name", "Classes Where Query is Present"]
)
df.head()
# Query ID Query Name Classes Where Query is Present
# 0 2 16, 42, 44 A MINHA SAÚDE
# 1 3 34 !D D DUNHILL
# 2 4 33 #MEGA
# 3 5 09 (michelin man)
# 4 5 12 (michelin man)
I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year', 'name'. But the problem is, these both data frames has different names like in the first dataset, it has name 'Glenn Davis' but in second dataset it has 'Glen Davis'.
Now, I want to know that How can I merge both of them using difflib library even it has different names?
Any help will be appreciated ...
Thanks in advance.
I have used this code which I got in a question asked at this platform but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest, If i can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
Let me get to your problem by assuming that you have to make a data set with 2 columns and the 2 columns being 1. 'year' and 2. 'name'
okay
1. we will 1st rename all the names which are wrong
I hope you know all the wrong names from all_batting_statistics_df using this
all_batting_statistics_df.replace(regex=r'^Glen.$', value='Glenn Davis')
once you have corrected all the spellings, choose the smaller one which has the names you know, so it doesn't take long
2. we need both data sets to have the same columns i.e. only 'year' and 'name'
use this to drop the columns we don't need
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'])
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns so I left them, you have to add to the above code
3. we need to change the column names to look the same i.e. 'year' and 'name' using python dataframe rename
df_new_1 = all_batting_statistics_df(colums={'Year': 'year', 'Name':'name'})
4. next, to merge them
we will use this
all_batsman_df.merge(df_new_1, left_on='year', right_on='name')
FINAL THOUGHTS:
If you don't want to do all this find a way to export the data set to google sheets or microsoft excel and use edit them with those advanced software, if you like pandas then its not that difficult you will find a way, all the best!
I have too many messy csv files and I am trying to extract information from them. There are random number of unnecessary columns at the beginning of each file. However, the columns that I am interested in are always have the same index. Let me explain in it through an example:
RandomInfo XX
Random2 ZZ
Random3 VV
Random4 KK
Companyname: Apple
VisitsMay ImpressionsMay VisitsApril ImpressionsApril...
Information
International 100 250 90 260
Local 10 22 12 26
With Proxy 5 12 8 16
I want to convert this to:
Companyname Month International Local With Proxy
Apple VistsMay 100 10 5
Apple ImpressionsMay 250 22 12
Apple VisitsApril 90 12 8
Apple ImpressionsApril 260 26 16
Example files here