I have many messy CSV files and I am trying to extract information from them. There is a random number of unnecessary rows at the beginning of each file, but the columns that I am interested in always have the same index. Let me explain it through an example:
RandomInfo XX
Random2 ZZ
Random3 VV
Random4 KK
Companyname: Apple
VisitsMay ImpressionsMay VisitsApril ImpressionsApril...
Information
International 100 250 90 260
Local 10 22 12 26
With Proxy 5 12 8 16
I want to convert this to:
Companyname Month International Local With Proxy
Apple VisitsMay 100 10 5
Apple ImpressionsMay 250 22 12
Apple VisitsApril 90 12 8
Apple ImpressionsApril 260 26 16
Example files here
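One way to approach this, as a minimal sketch: scan each file for the "Companyname:" line and the header line starting with "Visits", parse the block below them, and transpose it so months become rows. This assumes whitespace-separated fields laid out exactly as in the example above; adjust the parsing to the real files.
import pandas as pd

def tidy_report(path):
    # The number of junk rows varies per file, so locate the anchor lines by content
    with open(path) as fh:
        lines = fh.read().splitlines()
    company = next(l for l in lines if l.startswith("Companyname:")).split(":", 1)[1].strip()
    hdr = next(i for i, l in enumerate(lines) if l.startswith("Visits"))

    months = lines[hdr].split()  # VisitsMay, ImpressionsMay, ...
    # Data rows look like "International 100 250 90 260"; split the numbers off the right
    rows = [l.rsplit(None, len(months)) for l in lines[hdr + 2:] if l.strip()]
    wide = pd.DataFrame([r[1:] for r in rows],
                        index=[r[0] for r in rows],
                        columns=months, dtype=float)

    # Transpose so each month column becomes a row, then add the company name
    out = wide.T.reset_index().rename(columns={"index": "Month"})
    out.insert(0, "Companyname", company)
    return out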
I have a 200,000-row dataframe that looks like this:
df =
index  name  d2b (m)
0      Jon   199.9
1      Amy   29
2      Fyn   19
3      Luc   30
4      And   76
5      Pia   90
I am writing a function to classify the "distance to bus stop (d2b)" column into a new column in 10-meter bins, expecting:
index  name  d2b (m)  class (<= x meters)
0      Jon   199.9    200m
1      Amy   29       30m
2      Fyn   19       20m
3      Luc   33       40m
4      And   76       80m
5      Pia   90       90m
Code that works (updated):
numpy.ceil(data["d2b (m)"]/10)*10
This is one way of achieving it:
import numpy as np

# math.ceil only works on scalars; numpy's ceil is vectorized over the whole Series
df['class (<= x meters)'] = np.ceil(df['d2b (m)'] / 10) * 10
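If you also want the string labels shown in the expected output (e.g. '200m') rather than plain numbers, a small extension of the same idea:
import numpy as np

# Round up to the next multiple of 10 m, then format the number as a label like "200m"
df['class (<= x meters)'] = (
    (np.ceil(df['d2b (m)'] / 10) * 10).astype(int).astype(str) + 'm'
)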
I have a .txt file that has the data regarding the total number of queries with valid names. The text inside of the file came out of a SQL Server 19 query output. The database used consists of the results of an algorithm that retrieves the most similar brands related to the query inserted. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the Query identifier, the 2nd one has the brand classes where the Query can be located and the 3rd one is the Query itself (the spaces came from the SQL Server output formatting).
I then made a Pandas DataFrame in Google Colaboratory where I wanted the columns to match the ones in the text file. However, when I ran the code, the resulting columns did not line up with the ones in the file.
The code that I wrote is here:
# Dataframe with the total number of queries with valid names:
df = pd.read_table(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"],
)
df
I think that this happens because of the commas in the 2nd column but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths:
import pandas as pd
df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    colspecs=[(0, 20), (21, 40), (40, 1000)],
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"],
)
df.head()
# Query ID Query Name Classes Where Query is Present
# 0 2 16, 42, 44 A MINHA SAÚDE
# 1 3 34 !D D DUNHILL
# 2 4 33 #MEGA
# 3 5 09 (michelin man)
# 4 5 12 (michelin man)
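Note that each pair in colspecs is a half-open (from, to) interval of character positions, so you may need to adjust the numbers above to the actual widths in your file. read_fwf can also try to detect the widths itself, though that may or may not split this particular layout correctly:
df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    colspecs="infer",  # the default: widths are inferred from the first rows
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"],
)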
I am new to this field and stuck on this problem. I have two datasets:
all_batsman_df: this df has 5 columns ('years', 'team', 'pos', 'name', 'salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df: this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. The problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' but the second has 'Glen Davis'.
How can I merge the two using the difflib library even though the names differ?
Any help will be appreciated.
Thanks in advance.
I have used the code below, which I found in another question on this platform, but it is not working for me. It adds new merge-key columns after matching names across the two datasets. I know this is not a good approach; kindly suggest a better way if there is one.
import cdifflib  # pip-installable C port of difflib; provides CSequenceMatcher

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']

for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b

merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0]
)
to replace the names in df_b with their closest match from df_a, then do your merge. See also this post.
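Note that get_close_matches returns a list, so the [0] above raises an IndexError whenever a name has no match above the default 0.6 cutoff. A minimal guard (the helper closest is ours, not part of difflib):
# Fall back to the original name when difflib finds no close match
def closest(name, candidates, cutoff=0.6):
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df_b['name'] = df_b['name'].apply(lambda x: closest(x, df_a['name']))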
Let me approach your problem by assuming that you need a data set with two columns: 'year' and 'name'.
1. First, we rename all the names that are wrong.
Assuming you know which names in all_batting_statistics_df are wrong, you can fix them like this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, start from the smaller dataset whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all 31 columns, so you will have to add the rest of the unneeded ones to the drop list above.
3. We need to make the column names match, i.e. 'year' and 'name', using DataFrame.rename:
all_batsman_df_1 = all_batsman_df_1.rename(columns={'years': 'year'})
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Finally, merge them on both keys:
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools. If you like pandas, it's not that difficult; you will find a way. All the best!
I am attempting to load a given CSV file with the following structure:
Then, I'd like to join all the words with the same "Sent_ID" into one row, with the following code:
import pandas as pd

train = pd.read_csv("train.csv")

# Create a dataframe of sentences.
sentence_df = pd.DataFrame(train["Sent_ID"].drop_duplicates(), columns=["Sent_ID", "Sentence", "Target"])

for _, row in train.iterrows():
    print(str(row["Word"]))
    sentence_df.loc[sentence_df["Sent_ID"] == row["Sent_ID"], ["Sentence"]] = str(row["Word"])
However, the result of the print(str(row["Word"])) is:
0                Obesity
1                     in
2                   Low-
3                    and
4          Middle-Income
5              Countries
...
Name: Word, Length: 4543833, dtype: object
i.e. every single word in the column, for any given row. This occurs for all rows.
Printing the entire row gives:
id 89
Doc_ID 1
Sent_ID 4
Word 0 Obesity\n1 ...
tag O
Name: 88, dtype: object
This again suggests that every element of the "Word" column is present in each cell. (The 88th entry is not "Obesity\n1" in the .csv file.)
I have tried changing the quoting argument in the read_csv function, as well as manually inserting the headers in the names argument, to no avail.
How do I ensure each Dataframe entry only contains its own word?
I've added a pastebin with some of the samples here (the pastebin will expire a week after this edit).
Building on @Aravind's answer, the OP wanted a working example:
import pandas as pd
from io import StringIO

csv = StringIO('''
<paste csv snippet here>
''')
df = pd.read_csv(csv)

# Print first 5 rows
print(df.head())
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
Now we have the data loaded as a pandas.DataFrame. We can use the groupby() method to combine the words into sentences.
df = df.groupby('Sent_ID').Word.apply(' '.join).reset_index()
print(df)
Sent_ID Word
0 1 Obesity in Low- and Middle-Income Countries : ...
1 2 We have reviewed the distinctive features of e...
2 3 Obesity is rising in every region of the world...
3 4 In LMICs , overweight is higher in women compa...
4 5 Overweight occurs alongside persistent burdens...
5 6 Changes in the global diet and physical activi...
6 7 Emerging risk factors include environmental co...
7 8 Data on effective strategies to prevent the on...
8 9 Expanding the research in this area is a key p...
9 10 MICROCEPHALIA VERA
10 11 Excellent reproducibility of laser speckle con...
11 12 We compared the inter-day reproducibility of p...
12 13 We also tested whether skin blood flow assessm...
13 14 Skin blood flow was evaluated during PORH and ...
14 15 Data are expressed as cutaneous vascular condu...
15 16 Reproducibility is expressed as within subject...
16 17 Twenty-eight healthy participants were enrolle...
17 18 The reproducibility of the PORH peak CVC was b...
18 19 Inter-day reproducibility of the LTH plateau w...
19 20 Finally , we observed significant correlation ...
20 21 The recently developed LSCI technique showed v...
21 22 Moreover , we showed significant correlation b...
22 23 However , more data are needed to evaluate the...
23 24 Positive inotropic action of cholinesterase on...
24 25 The putative chloride channel hCLCA2 has a sin...
25 26 Calcium-activated chloride channel ( CLCA ) pr...
26 27 Genetic and electrophysiological studies have ...
27 28 The human CLCA2 protein is expressed as a 943-...
28 29 Earlier investigations of transmembrane geomet...
29 30 However , analysis by the more recently derive...
Use groupby()
df = df.groupby('Sent_ID')['Word'].apply(' '.join).reset_index()
You can group by multiple columns as a list, like so:
df.groupby(['Doc_ID','Sent_ID','tag'])
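For instance, a minimal sketch (column names as in the sample data above) that keeps one row per document and sentence:
# One row per (Doc_ID, Sent_ID), with the words joined back into a sentence
sentences = (
    df.groupby(['Doc_ID', 'Sent_ID'])['Word']
      .apply(' '.join)
      .reset_index()
)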
Forgive me if this is a duplicate. It seems like it should be one, but I've searched all the suggested questions and more.
I have this table:
    Look_Back_Months  TotalSpreadReturn     sector
10                11           0.038961  Apartment
20                21           0.078029  Apartment
30                31           0.079272  Apartment
40                 5           0.013499     Office
50                15           0.018679     Office
60                25          -0.003378     Office
I'd like to return
    Look_Back_Months  TotalSpreadReturn     sector
30                31           0.079272  Apartment
50                15           0.018679     Office
I have tried groupby and agg, but I keep returning either the max of both Look_Back_Months and TotalSpreadReturn, or just one or the other.
Thanks.
By using
df.sort_values('TotalSpreadReturn').drop_duplicates('sector', keep='last')
Out[270]:
    Look_Back_Months  TotalSpreadReturn     sector
50                15           0.018679     Office
30                31           0.079272  Apartment
You can use groupby.max with transform.
g = df.groupby('sector')['TotalSpreadReturn'].transform('max')
res = df[df['TotalSpreadReturn'] == g]
print(res)
    Look_Back_Months  TotalSpreadReturn     sector
30                31           0.079272  Apartment
50                15           0.018679     Office
If it matters, this includes duplicate maximums and maintains index order.
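If instead you want exactly one row per sector even when the maximum is tied, a variant using idxmax (a sketch; it keeps the first occurrence of each group's maximum):
# Select the single row holding each sector's maximum TotalSpreadReturn
res = df.loc[df.groupby('sector')['TotalSpreadReturn'].idxmax()]
print(res)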